“A nearly impenetrable thicket of geekitude…”

Ant fixcrlf and UTF-8 on Windows

I’ve been working on a large XML processing system in which a sequence of steps, implemented in Java and other technologies, is orchestrated using Apache Ant. It has to run on Mac OS, Linux and Windows. It has been pretty stable for some time, but I recently set up a new Windows system and started seeing errors like this:

Exception in thread "main" org.xml.sax.SAXParseException:
    Invalid byte 3 of 3-byte UTF-8 sequence.

Peeking at the content at various points in the process, it became clear that the problem was that something was corrupting a particular UTF-8 encoded character. Early in the sequence the encoding looked like this:

E2 80 9D

This corresponds to U+201D RIGHT DOUBLE QUOTATION MARK. Later in the sequence, this had become:

E2 80 3F

This isn’t valid: every byte in a multi-byte UTF-8 sequence must have the top bit set, and 3F does not.
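If you want to confirm this without reading hex by eye, a strict decoder will do it for you. This is a minimal sketch using the standard java.nio.charset API; the class and method names (Utf8Check, isValidUtf8) are my own and not part of the system described here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true only if the bytes are well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] before = {(byte) 0xE2, (byte) 0x80, (byte) 0x9D}; // U+201D
        byte[] after  = {(byte) 0xE2, (byte) 0x80, (byte) 0x3F}; // corrupted
        System.out.println(isValidUtf8(before)); // prints "true"
        System.out.println(isValidUtf8(after));  // prints "false"
    }
}
```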

So what happened? Well, this is Windows, so experience tells us it’s probably something to do with Code Page 1252. That suspicion is strengthened when you observe that byte 9D is undefined in CP1252, and the byte it has been replaced with is 3F, ‘?’. In the end, the corruption turned out to be coming from this Ant task:
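The corruption is easy to reproduce outside Ant. The sketch below assumes that the offending step read the file and wrote it back using windows-1252 as the platform default; I haven’t checked how fixcrlf does this internally, and the class name Cp1252RoundTrip is just for illustration. Java decodes the undefined byte 9D to U+FFFD, which then has no CP1252 encoding and is written out as ‘?’.

```java
import java.nio.charset.Charset;

public class Cp1252RoundTrip {
    // Decode the bytes as windows-1252, then encode them again, mimicking a
    // tool that reads and writes a file with that charset as its default.
    public static byte[] roundTrip(byte[] input) {
        Charset cp1252 = Charset.forName("windows-1252");
        String decoded = new String(input, cp1252); // 9D becomes U+FFFD
        return decoded.getBytes(cp1252);            // U+FFFD becomes '?'
    }

    // Format bytes as space-separated upper-case hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8Quote = {(byte) 0xE2, (byte) 0x80, (byte) 0x9D}; // U+201D in UTF-8
        System.out.println(hex(roundTrip(utf8Quote))); // prints "E2 80 3F"
    }
}
```

The first two bytes survive because E2 and 80 are both defined in CP1252 (‘â’ and ‘€’), so they round-trip unchanged. Only the third byte is lost.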

<!-- Force the output file to use Unix line endings -->
<fixcrlf file="${xml.dir}/@{o}" eol="lf"/>

The fixcrlf task’s definition includes an optional encoding attribute which, if not set, “defaults to default JVM encoding”. Fixing the issue is therefore as simple as adding an appropriate encoding:

<fixcrlf file="${xml.dir}/@{o}" eol="lf" encoding="UTF-8"/>
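To see which default a given machine’s JVM would have used, you can ask it directly. This is a small diagnostic sketch, not part of the build; the class name DefaultEncoding is made up. Bear in mind that the value depends on the JDK version and the system locale, and that JDK 18 and later default to UTF-8 unless told otherwise, so results can legitimately differ between two Windows boxes.

```java
import java.nio.charset.Charset;

public class DefaultEncoding {
    // Report the charset the JVM falls back to when no encoding is given.
    public static String describe() {
        return "default charset: " + Charset.defaultCharset().name()
                + ", file.encoding: " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Run it with the same Java installation that Ant uses, otherwise you may be inspecting a different JVM.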

Why is this necessary on some Windows systems but not others? Life is full of mysteries.