I wrote the following method to check whether a particular file contains only ASCII text characters, or control characters as well. Could you glance at this code, suggest improvements, and point out oversights?
The logic is as follows: “If the first 500 bytes of a file contain 5 or more control characters, report it as a binary file.”
Thank you.
public boolean isAsciiText(String fileName) throws IOException {
    InputStream in = new FileInputStream(fileName);
    byte[] bytes = new byte[500];
    in.read(bytes, 0, bytes.length);
    int x = 0;
    short bin = 0;
    for (byte thisByte : bytes) {
        char it = (char) thisByte;
        if (!Character.isWhitespace(it) && Character.isISOControl(it)) {
            bin++;
        }
        if (bin >= 5) {
            return false;
        }
        x++;
    }
    in.close();
    return true;
}
Answer
Since you call this method “isAsciiText”, you know exactly what you’re looking for. In other words, it’s not “isTextInCurrentLocaleEncoding”. Thus you can be more accurate with:
int b = thisByte & 0xFF; // Java bytes are signed, so mask to 0..255 before comparing
if (b < 32 || b > 127) bin++;
Edit, a long time later: it was pointed out in a comment that this simple check would be tripped up by a text file that started with a lot of newlines. It would probably be better to build a table of “ok” bytes that includes the printable characters plus carriage return, newline, and tab (and possibly form feed, though I don’t think many modern documents use those), and then check each byte against the table.
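The table idea from the edit above could be sketched roughly as follows. This is only an illustration under some assumptions, not a definitive implementation: the class name `AsciiTextCheck`, the helper `looksLikeAsciiText`, and the constants are made up for the example, DEL (127) is treated as a control byte, and the 500-byte sample and threshold of 5 are taken from the question. It also counts only the bytes that were actually read, so a short file is not padded with zero bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class for illustration; names are not from the original post.
final class AsciiTextCheck {
    private static final int SAMPLE_SIZE = 500;   // bytes to inspect, as in the question
    private static final int MAX_BAD_BYTES = 5;   // this many "bad" bytes means binary

    // Lookup table: OK[b] is true when byte value b (0..255) is acceptable in ASCII text.
    private static final boolean[] OK = new boolean[256];
    static {
        for (int b = 32; b < 127; b++) {
            OK[b] = true;                          // printable ASCII, space through '~'
        }
        OK['\t'] = true;                           // tab
        OK['\n'] = true;                           // newline
        OK['\r'] = true;                           // carriage return
        OK['\f'] = true;                           // form feed (assumption: allowed)
    }

    // Checks only the first 'length' bytes, so unread buffer space is ignored.
    static boolean looksLikeAsciiText(byte[] bytes, int length) {
        int bad = 0;
        for (int i = 0; i < length; i++) {
            // Mask with 0xFF because Java bytes are signed.
            if (!OK[bytes[i] & 0xFF] && ++bad >= MAX_BAD_BYTES) {
                return false;
            }
        }
        return true;
    }

    static boolean isAsciiText(String fileName) throws IOException {
        byte[] buffer = new byte[SAMPLE_SIZE];
        int total = 0;
        // try-with-resources closes the stream even on an early return or exception.
        try (InputStream in = Files.newInputStream(Paths.get(fileName))) {
            int n;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (total < buffer.length
                    && (n = in.read(buffer, total, buffer.length - total)) != -1) {
                total += n;
            }
        }
        return looksLikeAsciiText(buffer, total);
    }
}
```

With a table like this, a file that starts with many blank lines still counts as text, while bytes outside the ASCII range (which show up as negative values in a Java `byte`) count against it.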