Are Default Character Sets Harmful?
A friend and adviser of mine once pointed out that certain classes of bugs can be eliminated by changing the way you write code. When you recognize a programming practice or pattern that repeatedly causes bugs, find a way of accomplishing the same thing without the bugs, and make the better way your habit. Easier said than done, but the first step is recognizing you have a problem. To that end, there's a Java "feature" that keeps biting me over and over again: the default character set. The Javadoc for java.nio.charset.Charset describes it this way:
Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets. The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
The default character set is the quick and dirty portal between byte-land and character-land. This portal is further obscured by convenience methods that hide it entirely from view. Some examples:
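As a rough illustration of the kind of convenience method meant here, the sketch below collects a few JDK calls that silently fall back to the default charset. The class and method names (ImplicitCharsetExamples, encode, decode, open) are made up for this example, and the selection of JDK calls is my own pick, not necessarily the list the post originally showed.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical holder class; only the JDK calls inside the methods are real API.
final class ImplicitCharsetExamples {

    // String.getBytes() encodes with the JVM default charset.
    static byte[] encode(String text) {
        return text.getBytes();
    }

    // new String(byte[]) decodes with the JVM default charset.
    static String decode(byte[] bytes) {
        return new String(bytes);
    }

    // InputStreamReader(InputStream) decodes with the JVM default charset.
    static Reader open(InputStream in) {
        return new InputStreamReader(in);
    }

    public static void main(String[] args) {
        // The byte count depends on the machine this runs on, which is the problem.
        byte[] bytes = encode("caf\u00e9");
        System.out.println(bytes.length + " bytes, decoded as: " + decode(bytes));
    }
}
```

None of these call sites mentions a charset, so a reader cannot tell from the code alone which encoding will be used at run time.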
These methods are great for hacking out some prototype code or when working on a small project that's never going to leave a very confined environment (where you can be certain what the default character set will be). One of the long-held knocks against Java is its verbosity, and while I do not buy into those knocks personally, I can see that having these convenience methods alleviates some of the naysayers' pain.
However, there is a downside to the convenience methods. They obscure your intent and often introduce latent bugs. When you omit the character set from your code, you fail to convey to its maintainers and reviewers whether you:
- are ignorant of character sets.
- understand character sets, but are opting to ignore them in this case.
- considered the impact of character sets and correctly or incorrectly decided that the default character set is appropriate for the case at hand.
When your code makes it out of the safety of your controlled environment, you can assume that it will end up on a machine whose default character set differs from the ones you tested. Maybe those are explicitly or implicitly unsupported configurations, in which case the bug is just technical debt. If not, then some day Murphy will come calling and you will be fixing the bug. Or maybe someone else will be fixing it and cursing you the entire time. Either way, not a good day for you.
The alternative is to eschew the convenience methods and use the explicit methods, using Charset.defaultCharset() when you really do want to use the default character set. You might even want to throw down a comment each time to let others (and your future self) know why you think the default is sufficient.
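Here is a minimal sketch of what that habit might look like. The class and method names (ExplicitCharsetExamples, encodeUtf8, decodeUtf8, encodeForLocalTool) are hypothetical, and UTF-8 is only an assumed choice for the non-default case. StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical holder class showing explicit-charset variants of common calls.
final class ExplicitCharsetExamples {

    // The charset is named at the call site, so the result is the same on every machine.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Default charset on purpose: in this made-up scenario the bytes go to a
    // local tool that expects the platform encoding.
    static byte[] encodeForLocalTool(String text) {
        return text.getBytes(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        byte[] utf8 = encodeUtf8("caf\u00e9");
        System.out.println(utf8.length + " bytes, decoded as: " + decodeUtf8(utf8));
    }
}
```

The third method behaves exactly like the convenience call, but the explicit argument and the comment tell a reviewer that the default was a decision rather than an oversight.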
Finally, even if you think I'm completely off my rocker, here's a tip you can use. If you regularly get logs from your customers and the logs contain any user-provided information, configure the log encoding to be UTF-8. That way, you'll have a chance at reading the logs without going back to the customer to ask: What's your default system encoding? Without that setting, if the default system encoding cannot represent the user-provided strings and has already corrupted them, you're out of luck.
In Java logging:
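The original snippet is not reproduced here, so what follows is a sketch assuming java.util.logging with a FileHandler. The class name Utf8LogSetup, the method utf8FileHandler, the logger name and the file name app.log are all placeholders.

```java
import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

// Hypothetical helper; FileHandler, setEncoding and SimpleFormatter are standard JDK API.
final class Utf8LogSetup {

    // Creates a file handler that writes its log records as UTF-8.
    static FileHandler utf8FileHandler(String pattern) throws IOException {
        FileHandler handler = new FileHandler(pattern);
        handler.setEncoding("UTF-8"); // instead of the platform default
        handler.setFormatter(new SimpleFormatter());
        return handler;
    }

    public static void main(String[] args) throws IOException {
        Logger logger = Logger.getLogger("example"); // placeholder logger name
        FileHandler handler = utf8FileHandler("app.log"); // placeholder file name
        logger.addHandler(handler);
        logger.info("user input: caf\u00e9");
        handler.close();
    }
}
```

If you configure logging through a properties file instead, the equivalent setting is the line java.util.logging.FileHandler.encoding = UTF-8.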
This simple step takes almost no time, in most cases has no measurable effect on log size (assuming most of the log consists of characters in the ASCII range, which UTF-8 encodes as one byte each), and can save you time and frustration later. Do it before you check in your code today.