It seems that Files.newBufferedReader() is stricter about UTF-8 than the naive alternative.
If I create a file containing the single byte 128 (which is not a valid UTF-8 sequence on its own), it is read without complaint when I construct a BufferedReader on an InputStreamReader on the result of Files.newInputStream(), but with Files.newBufferedReader() an exception is thrown.
This code
    try (
        InputStream in = Files.newInputStream(path);
        Reader isReader = new InputStreamReader(in, "UTF-8");
        Reader reader = new BufferedReader(isReader);
    ) {
        System.out.println((char) reader.read());
    }

    try (
        Reader reader = Files.newBufferedReader(path);
    ) {
        System.out.println((char) reader.read());
    }
has this result:
    �
    Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.read(BufferedReader.java:182)
        at TestUtf8.main(TestUtf8.java:28)
Is this documented? And is it possible to get the lenient behavior with Files.newBufferedReader()?
Answer
The difference is in how the CharsetDecoder used to decode the UTF-8 is constructed in the two cases.
For new InputStreamReader(in, "UTF-8") the decoder is constructed using:
    Charset cs = Charset.forName("UTF-8");
    CharsetDecoder decoder = cs.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
This explicitly specifies that invalid sequences are replaced with the standard replacement character (U+FFFD) rather than reported as errors.
Files.newBufferedReader(path) uses:
    Charset cs = StandardCharsets.UTF_8;
    CharsetDecoder decoder = cs.newDecoder();
In this case onMalformedInput and onUnmappableCharacter are never called, so you get the default action, CodingErrorAction.REPORT, which throws the exception you are seeing.
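To make the default visible, here is a minimal sketch that inspects a freshly created UTF-8 decoder and then decodes the single byte 128 with it. The class name DecoderDefaultsDemo and its helper methods are made up for illustration; only the java.nio.charset calls are real JDK API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class DecoderDefaultsDemo {

    // Hypothetical helper: reports whether an untouched UTF-8 decoder
    // uses REPORT for both malformed and unmappable input.
    static boolean defaultsAreReport() {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        return decoder.malformedInputAction() == CodingErrorAction.REPORT
                && decoder.unmappableCharacterAction() == CodingErrorAction.REPORT;
    }

    // Hypothetical helper: returns true if decoding the lone byte 128
    // with an unconfigured decoder fails with MalformedInputException.
    static boolean strictDecodeThrows() {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        try {
            decoder.decode(ByteBuffer.wrap(new byte[] { (byte) 128 }));
            return false;
        } catch (MalformedInputException e) {
            return true;
        } catch (CharacterCodingException e) {
            // Some other decoding failure; not what this demo looks for.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("defaults are REPORT: " + defaultsAreReport());
        System.out.println("strict decode throws: " + strictDecodeThrows());
    }
}
```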
There does not seem to be a way to change what Files.newBufferedReader does. I didn’t see anything documenting this while looking through the code.
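Since Files.newBufferedReader itself cannot be configured, one possible workaround is to build the reader by hand and pass InputStreamReader a decoder set up with REPLACE. The sketch below assumes that approach; the class LenientReaderDemo and the helper newLenientReader are hypothetical names, not part of the JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LenientReaderDemo {

    // Hypothetical helper: like Files.newBufferedReader(path), but invalid
    // UTF-8 is replaced with U+FFFD instead of raising an exception.
    static BufferedReader newLenientReader(Path path) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new BufferedReader(
                new InputStreamReader(Files.newInputStream(path), decoder));
    }

    public static void main(String[] args) throws IOException {
        // Recreate the file from the question: a single byte 128.
        Path path = Files.createTempFile("invalid-utf8", ".bin");
        try {
            Files.write(path, new byte[] { (byte) 128 });
            try (BufferedReader reader = newLenientReader(path)) {
                int c = reader.read();
                System.out.println("got replacement char: " + (c == 0xFFFD));
            }
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

Closing the returned BufferedReader also closes the underlying InputStreamReader and input stream, so a single try-with-resources is enough at the call site.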