Wednesday, June 3, 2009

One thing that raises red flags for me is Java code that converts between Strings and byte arrays without explicitly specifying the character encoding, which means the system default character encoding is used. That often becomes a source of bugs once non-ASCII characters show up.
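To make the failure mode concrete, here is a minimal sketch of what happens when bytes are written with one encoding and read back with another, which is effectively what occurs when two machines have different system defaults. The class name EncodingMismatchDemo and the helper roundTrip are made up for illustration, and the sketch assumes Java 6 or later for the Charset overloads of String.getBytes() and the String constructor.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the names here are illustrative only.
public class EncodingMismatchDemo {
    // Encode text with one charset and decode the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        Charset latin1 = Charset.forName("ISO-8859-1");
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e

        // The same string has a different byte length in each encoding.
        System.out.println(text.getBytes(utf8).length);   // 5
        System.out.println(text.getBytes(latin1).length); // 4

        // Matching encodings preserve the text; mismatched ones corrupt it.
        System.out.println(text.equals(roundTrip(text, utf8, utf8)));   // true
        System.out.println(text.equals(roundTrip(text, utf8, latin1))); // false
    }
}
```

Pure ASCII text survives the mismatched round trip unchanged, which is why this kind of bug tends to stay hidden until real-world data arrives.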

In one case, I tracked such a bug down to code in the old version of Apache Axis that we were using. Ironically, a version of Apache Axis released just weeks after ours had fixed that particular bug by explicitly specifying UTF-8 encoding. The bug also couldn't be reproduced on development boxes, because their system default character encoding was UTF-8. I suggested fixing it in production by setting the environment variable LANG=en_US.UTF-8 to match development, but I think that suggestion was ignored. The issue came up again months later, and I again suggested setting LANG. I don't know what happened after that.
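When a bug like this shows up on one machine but not another, a quick first check is to print what the JVM on each machine considers its default. The following is only a diagnostic sketch; the class name DefaultCharsetCheck and the method describeDefault are hypothetical, and how the default is derived from LANG or other settings depends on the platform and Java version.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class; run it on each machine and compare the output.
public class DefaultCharsetCheck {
    // Report the charset the JVM uses when no encoding is specified.
    static String describeDefault() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describeDefault());
    }
}
```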

There are various places that default to the system default character encoding, not just String.getBytes() or the String constructor from a byte array. There are also java.io.InputStreamReader, java.io.PrintStream, java.net.URLEncoder, java.net.URLDecoder, and so on.
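Each of the classes listed above also has an overload that takes the encoding name, so the default can be avoided. Here is a small sketch using those overloads; the class name ExplicitEncodingDemo and its helper methods are invented for illustration, and in-memory streams stand in for real files or sockets.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.Reader;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; the names here are illustrative only.
public class ExplicitEncodingDemo {
    // Decode bytes through an InputStreamReader that is told to use UTF-8.
    static String decodeUtf8(byte[] bytes) throws IOException {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), "UTF-8");
        try {
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
            return text.toString();
        } finally {
            reader.close();
        }
    }

    // Encode text through a PrintStream that is told to use UTF-8.
    static byte[] printUtf8(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer, true, "UTF-8");
        out.print(text);
        out.close();
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "caf\u00e9";

        byte[] bytes = printUtf8(text);
        System.out.println(bytes.length);                   // 5
        System.out.println(text.equals(decodeUtf8(bytes))); // true

        // URLEncoder and URLDecoder also accept the encoding name.
        String encoded = URLEncoder.encode(text, "UTF-8");
        System.out.println(encoded);                        // caf%C3%A9
        System.out.println(text.equals(URLDecoder.decode(encoded, "UTF-8"))); // true
    }
}
```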

Of course, for some code, the character encoding doesn't really matter, such as writing to log files, so the system default can be used. But, for example, when implementing SOAP, where UTF-8 is the specified character encoding, one can't use the system default, which may or may not be UTF-8.

One thing I find annoying about always specifying the character encoding, though, is having to deal with UnsupportedEncodingException every time. Java guarantees that UTF-8 and a few other encodings are always supported, yet I still have to write a catch block for an exception that should never be thrown. I think String.getBytes() and all those other String/byte conversion methods should take a java.nio.charset.Charset instead of a String argument that names the encoding, so that they do not have to throw UnsupportedEncodingException. Furthermore, it would be nice if static fields of type Charset were added to java.nio.charset.Charset for each of the required charsets: Charset.US_ASCII, Charset.UTF_8, etc. Even without them, Charset.availableCharsets().get("UTF-8") and the like would be good enough to do away with the UnsupportedEncodingExceptions.
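As a rough sketch of that wish, the constants can be kept in a small holder class of one's own. The class name RequiredCharsets and the helpers toUtf8 and fromUtf8 are hypothetical. The sketch assumes Java 6 or later, where String.getBytes(Charset) and the String(byte[], Charset) constructor exist; on Java 7 and later, java.nio.charset.StandardCharsets already provides constants of this kind.

```java
import java.nio.charset.Charset;

// Hypothetical holder for charsets that every Java platform must support.
public class RequiredCharsets {
    // Charset.forName throws only unchecked exceptions, so no catch block is needed.
    public static final Charset US_ASCII = Charset.forName("US-ASCII");
    public static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
    public static final Charset UTF_8 = Charset.forName("UTF-8");

    // Conversions that take a Charset declare no checked exception.
    static byte[] toUtf8(String text) {
        return text.getBytes(UTF_8);
    }

    static String fromUtf8(byte[] bytes) {
        return new String(bytes, UTF_8);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";
        byte[] bytes = toUtf8(text);
        System.out.println(bytes.length);                 // 5
        System.out.println(text.equals(fromUtf8(bytes))); // true
    }
}
```

The lookup by name happens once, when the class is initialized, and every later conversion is free of try/catch noise.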
