Java UTF-8 strange behaviour

Question

I am trying to decode some UTF-8 strings in Java. These strings contain some combining Unicode characters, such as CC 88 (combining diaeresis). The character sequence seems OK, according to http://www.fileformat.info/info/unicode/char/0308/index.htm, but the output after conversion to String is invalid. Any ide…

Accepted Answer

The console you're outputting to (e.g. the Windows console) may not support Unicode and may mangle the characters. The console output is not a good representation of the data.

Try writing the output to a file instead, making sure the encoding is set correctly on the FileWriter, then open the file in a Unicode-friendly editor.

Alternatively, use a debugger to make sure the characters are what you expect. Just don't trust the console.
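As a rough illustration of the advice above, the sketch below decodes the bytes with an explicit charset, lists the resulting code points as ASCII text (which survives even a console that cannot render the glyphs), and writes the string to a file as UTF-8. The class name `Utf8Check`, its helper methods, and the sample bytes ('u' followed by CC 88) are hypothetical and not from the question. It uses `Files.newBufferedWriter` in place of `FileWriter` because `FileWriter` only accepts a charset argument from Java 11 onward; before that it falls back to the platform default encoding.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Check {

    // Decode raw bytes as UTF-8, naming the charset explicitly
    // instead of relying on the platform default.
    public static String decode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // Render every code point as U+XXXX so the result can be
    // inspected without depending on how the console draws glyphs.
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Write the string to a file with an explicit UTF-8 encoding.
    public static void writeUtf8(Path file, String s) throws IOException {
        try (Writer w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            w.write(s);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample: 'u' (0x75) followed by CC 88,
        // the UTF-8 encoding of U+0308 (combining diaeresis).
        byte[] raw = { 0x75, (byte) 0xCC, (byte) 0x88 };
        String decoded = decode(raw);

        System.out.println(codePoints(decoded)); // prints "U+0075 U+0308"

        Path out = Files.createTempFile("utf8-check", ".txt");
        writeUtf8(out, decoded);
        System.out.println("Wrote " + out);
    }
}
```

If the printed code points match what you expect, the decoding is correct and any garbling on screen comes from the console, not from the String.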

Answer