Should you always explicitly provide encoding in Java when converting between bytes and Strings?

I’m refactoring an old Java application. It communicates with an external service over HTTP, so it converts between bytes and Strings in many places. The assumption is that UTF-8 should be used throughout. What I’m wondering is: should I always explicitly provide the encoding when converting from Strings to bytes and vice versa? Or can I just rely on the file.encoding property, which is “UTF-8” on my system (so the examples below work fine on my computer)?

I come across lines of code that state the encoding explicitly, like:

new String(bodyMessageBytes, "UTF-8");

But in other places no encoding is stated, so I assume the default one (from the file.encoding property) is used, as with this InputStreamReader constructor:

BufferedReader lBufferedReader = new BufferedReader(new InputStreamReader(lPostMethod.getResponseBodyAsStream()));

or (here the String constructor uses explicit encoding, but String.getBytes() does not):

new String(lResponseAsString.getBytes(), Config.ENCODING_UTF8);

As I understand it, I should add an explicit encoding parameter in the last two examples and, for consistency, throughout the whole application. I just want to make sure that’s the right approach and that it’s not redundant.

Answer

TL;DR

Yes, you should always make sure the character encoding is defined the way your application needs it, and never rely on an assumption like “I know that file.encoding is always UTF-8”. So, go ahead and specify the encoding wherever that’s not yet done.
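To make that concrete, here is a minimal sketch of the explicit variants. It assumes Java 7 or later for StandardCharsets (on older versions you would pass the name "UTF-8" as in the question), and the class name, method names and the ByteArrayInputStream standing in for the HTTP response stream are mine, not taken from the application.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ExplicitCharsetExample {

    // bytes -> String with a stated charset, independent of file.encoding
    static String decode(byte[] body) {
        return new String(body, StandardCharsets.UTF_8);
    }

    // String -> bytes with a stated charset
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Reads the first line of a stream, decoding it as UTF-8.
    static String readFirstLine(InputStream in) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep the source file itself encoding-neutral.
        String text = "Gr\u00fc\u00dfe";
        byte[] bytes = encode(text);

        System.out.println(bytes.length);                 // 7: two of the five chars need 2 bytes each
        System.out.println(decode(bytes).equals(text));   // true
        System.out.println(
                readFirstLine(new ByteArrayInputStream(bytes)).equals(text)); // true
    }
}
```

None of these calls depend on how the JVM was started, so they behave the same on your machine and on a server with a different default.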

As already pointed out in comments, something like

new String(lResponseAsString.getBytes(), Config.ENCODING_UTF8);

should never be written.

The flawed idea behind such a piece of code is that lResponseAsString came from parsing some byte sequence into a String, but using the wrong encoding. So it tries to convert the String back to the original bytes and then parses the bytes again, this time with the correct encoding.

First of all, how can the author be sure which encoding was used in creating lResponseAsString? By choosing getBytes() as the inverse conversion, they assume it was the platform default encoding.
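If you want to see what the no-argument getBytes() is actually doing on a given machine, a small check like the following may help. It is only a sketch; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical diagnostic class, for illustration only.
class DefaultCharsetCheck {

    // True if getBytes() without arguments yields the same bytes as UTF-8
    // for this text on the JVM that runs the check.
    static boolean defaultMatchesUtf8(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // The result depends on the JVM and its settings, which is exactly the problem.
        System.out.println("Platform default charset: " + Charset.defaultCharset());
        System.out.println("getBytes() matches UTF-8: "
                + defaultMatchesUtf8("Gr\u00fc\u00dfe"));
    }
}
```

The answer can differ between your computer and the production server, so code that calls getBytes() without a charset is making an assumption nobody wrote down.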

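Then there are encodings where getBytes() is not guaranteed to reproduce the original byte sequence, e.g. because some byte values are illegal in that encoding. Such bytes are turned into a replacement character during decoding, so their original values are lost.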

So we end up with a byte array that might vaguely resemble the original byte sequence, and then we hope that parsing it as UTF-8 gives a valid result.
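Here is a small sketch of why that hope is fragile. To keep it reproducible it does not use the real platform default; instead it passes the "wrong" charset explicitly, once as ISO-8859-1 and once as US-ASCII. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class RoundTripDemo {

    // Mimics the pattern from the question: decode bytes with the wrong
    // charset, then call getBytes with that same charset to "undo" it.
    static byte[] roundTrip(byte[] original, Charset wrongCharset) {
        String misdecoded = new String(original, wrongCharset);
        return misdecoded.getBytes(wrongCharset);
    }

    // True only if the round trip reproduced the original bytes exactly.
    static boolean survivesRoundTrip(byte[] original, Charset wrongCharset) {
        return Arrays.equals(original, roundTrip(original, wrongCharset));
    }

    public static void main(String[] args) {
        // "ü" encoded as UTF-8 is the two bytes 0xC3 0xBC.
        byte[] original = "\u00fc".getBytes(StandardCharsets.UTF_8);

        Charset[] wrongCharsets = {
            StandardCharsets.ISO_8859_1, // every byte value is a legal character
            StandardCharsets.US_ASCII    // bytes >= 0x80 are illegal
        };
        for (Charset wrong : wrongCharsets) {
            String repaired = new String(roundTrip(original, wrong), StandardCharsets.UTF_8);
            System.out.println(wrong
                    + ": bytes preserved = " + survivesRoundTrip(original, wrong)
                    + ", text recovered = " + repaired.equals("\u00fc"));
        }
    }
}
```

With ISO-8859-1 as the wrong charset the trick happens to work, because every byte maps to a character and back. With US-ASCII both bytes are replaced during decoding, getBytes returns two question marks, and the original text is gone for good. Whether the "repair" works therefore depends entirely on which encoding the platform default happens to be.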