
Java UTF-8 strange behaviour

I am trying to decode some UTF-8 strings in Java. These strings contain some combining Unicode characters, such as CC 88 (U+0308, combining diaeresis). The byte sequence seems OK, according to http://www.fileformat.info/info/unicode/char/0308/index.htm

But the output after conversion to a String looks invalid. Any idea?

byte[] utf8 = { 105, -52, -120 };
System.out.print("{{");
for(int i = 0; i < utf8.length; ++i)
{
    int value = utf8[i] & 0xFF;
    System.out.print(Integer.toHexString(value));
}
System.out.println("}}");
System.out.println(">" + new String(utf8, "UTF-8"));

Output:

    {{69cc88}}
    >i?


Answer

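The console you are printing to (for example, the Windows console) may not support Unicode and may mangle the characters. When the console's encoding cannot represent a character such as U+0308, Java substitutes a question mark on output, so the String itself can be correct even though the printed text is not. Console output is not a reliable representation of the data.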
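To illustrate the substitution, here is a minimal sketch that encodes the decoded string into a charset that has no mapping for U+0308. ISO-8859-1 is only a stand-in chosen for the demonstration; the charset your console actually uses is not known from the question. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Encodes the text with ISO-8859-1 (an assumed stand-in for a limited
    // console charset) and returns the resulting bytes as lowercase hex.
    static String encodeLossy(String text) {
        byte[] encoded = text.getBytes(StandardCharsets.ISO_8859_1);
        StringBuilder hex = new StringBuilder();
        for (byte b : encoded) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = { 105, -52, -120 };
        String text = new String(utf8, StandardCharsets.UTF_8);
        // U+0308 cannot be encoded, so it becomes 0x3f, which is '?'
        System.out.println(encodeLossy(text)); // prints 693f
    }
}
```

The bytes 69 3f are exactly the "i?" seen in the question, which suggests the decoding step was fine and the loss happened while printing.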

Try writing the output to a file instead, making sure the writer uses an explicit encoding such as UTF-8 (a FileWriter created without a charset uses the platform default encoding), then open the file in a Unicode-friendly editor.
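A minimal sketch of the file approach, assuming Java 7 or later. The file name utf8-check.txt and the class and method names are made up for the example; Files.newBufferedWriter is used here so the charset is stated explicitly.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8FileCheck {
    // Decodes the bytes as UTF-8 and writes the text to the target file,
    // again as UTF-8, so no platform default encoding is involved.
    static void writeDecoded(byte[] utf8, Path target) throws IOException {
        String text = new String(utf8, StandardCharsets.UTF_8);
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(">" + text);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = { 105, -52, -120 };
        // Hypothetical output file; open it in an editor that reads UTF-8.
        writeDecoded(utf8, Paths.get("utf8-check.txt"));
    }
}
```

If the editor shows an i with a diaeresis after the ">", the String was decoded correctly.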

Alternatively, use a debugger to make sure the characters are what you expect. Just don’t trust the console.
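If a debugger is not at hand, a similar check can be done by printing the code points of the decoded String using only ASCII characters, which any console can display. This is a sketch assuming Java 8 or later; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class CodePointDump {
    // Decodes the bytes as UTF-8 and lists each code point as U+XXXX.
    static String describe(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = { 105, -52, -120 };
        System.out.println(describe(utf8)); // prints U+0069 U+0308
    }
}
```

Seeing U+0069 followed by U+0308 confirms that the String holds "i" plus the combining diaeresis, regardless of what the console renders.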

User contributions licensed under: CC BY-SA