Different results reading file with Files.newBufferedReader() and constructing readers directly

Question

It seems that Files.newBufferedReader() is more strict about UTF-8 than the naive alternative. If I create a file with a single byte 128 (so, not a valid UTF-8 character), it will happily be read if I construct a BufferedReader on an InputStreamReader on the result of Files.newInputStream(), but wit…

Accepted Answer

The difference is in how the CharsetDecoder used to decode the UTF-8 is constructed in the two cases.

For new InputStreamReader(in, "UTF-8") the decoder is constructed using:

    Charset cs = Charset.forName("UTF-8");
    CharsetDecoder decoder = cs.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);

This explicitly specifies that invalid sequences are just replaced with the standard replacement character.

Files.newBufferedReader(path) uses:

    Charset cs = StandardCharsets.UTF_8;
    CharsetDecoder decoder = cs.newDecoder();

In this case onMalformedInput and onUnmappableCharacter are not called, so you get the default action, which is to throw the exception you are seeing.

There does not seem to be a way to change what Files.newBufferedReader does. I didn't see anything documenting this while looking through the code.
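If the lenient behaviour is what you want, one workaround is to build the reader by hand and pass it a decoder configured with REPLACE. The following is only a minimal sketch and not part of the original answer: the class name LenientReaderDemo and the helpers newLenientReader, readLenient and strictReadFails are hypothetical names chosen for illustration. It writes the single byte 128 from the question to a temporary file and reads it both ways.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LenientReaderDemo {

    // Hypothetical helper: a BufferedReader whose UTF-8 decoder replaces
    // malformed or unmappable input instead of throwing.
    static BufferedReader newLenientReader(Path path) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new BufferedReader(
                new InputStreamReader(Files.newInputStream(path), decoder));
    }

    // Reads the first line through the lenient reader.
    static String readLenient(Path path) throws IOException {
        try (BufferedReader reader = newLenientReader(path)) {
            return reader.readLine();
        }
    }

    // Returns true if Files.newBufferedReader rejects the file's contents
    // with a MalformedInputException, false if the read succeeds.
    static boolean strictReadFails(Path path) throws IOException {
        try (BufferedReader reader =
                Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            reader.readLine();
            return false;
        } catch (MalformedInputException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("byte128", ".txt");
        try {
            // A lone byte 128 (0x80) is not a valid UTF-8 sequence.
            Files.write(path, new byte[] {(byte) 128});
            System.out.println("strict reader fails: " + strictReadFails(path));
            System.out.println("lenient reader yields U+FFFD: "
                    + "\uFFFD".equals(readLenient(path)));
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

The trade-off is that REPLACE silently turns bad bytes into U+FFFD, so corrupt input goes unnoticed. If you would rather detect it, the strict default of Files.newBufferedReader is arguably the safer choice.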

Answer