How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8, and ASCII

Question

At work it seems like no week ever passes without some encoding-related conniption, calamity, or catastrophe. The problem usually derives from programmers who think they can reliably process a “text” file without specifying the encoding. But you can’t. So it’s been decided to henceforth forbid fil…

Accepted Answer

First, the easy cases:

ASCII

If your data contains no bytes above 0x7F, then it’s ASCII. (Or a 7-bit ISO646 encoding, but those are very obsolete.)

UTF-8

If your data validates as UTF-8, then you can safely assume it is UTF-8. Due to UTF-8’s strict validation rules, false positives are extremely rare.

ISO-8859-1 vs. windows-1252

The only difference between these two encodings is that ISO-8859-1 has the C1 control characters where windows-1252 has the printable characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ. I’ve seen plenty of files that use curly quotes or dashes, but none that use C1 control characters. So don’t even bother with them, or with ISO-8859-1; just detect windows-1252 instead.

That now leaves you with only one question.

How do you distinguish MacRoman from cp1252?

This is a lot trickier.

Undefined characters

The bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are not used in windows-1252. If they occur, then assume the data is MacRoman.

Identical characters

The bytes 0xA2 (¢), 0xA3 (£), 0xA9 (©), 0xB1 (±), 0xB5 (µ) happen to be the same in both encodings. If these are the only non-ASCII bytes, then it doesn’t matter whether you choose MacRoman or cp1252.

Statistical approach

Count character (NOT byte!) frequencies in the data you know to be UTF-8. Determine the most frequent characters. Then use this data to determine whether the cp1252 or MacRoman characters are more common.

For example, in a search I just performed on 100 random English Wikipedia articles, the most common non-ASCII characters are ·•–é°®’èö—. Based on this fact:

The bytes 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, or 0xF6 suggest windows-1252.

The bytes 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, or 0xE1 suggest MacRoman.

Count up the cp1252-suggesting bytes and the MacRoman-suggesting bytes, and go with whichever count is greater.
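The checks described in this answer can be strung together into one function. What follows is only a minimal Python sketch, not a definitive detector. The name `guess_encoding` and the constant names are made up for illustration, and the tie-break toward cp1252 when the two counts are equal is an assumption, since the answer does not say what to do in that case. The two hint sets are derived by encoding the ten Wikipedia-sample characters with Python's `cp1252` and `mac_roman` codecs, which yields the same byte lists given above.

```python
# The ten most common non-ASCII characters from the answer's sample:
# · • – é ° ® ’ è ö —  (written as escapes so the source stays ASCII)
COMMON_CHARS = "\u00b7\u2022\u2013\u00e9\u00b0\u00ae\u2019\u00e8\u00f6\u2014"

# Byte values of those characters in each candidate encoding.
CP1252_HINTS = frozenset(COMMON_CHARS.encode("cp1252"))
MACROMAN_HINTS = frozenset(COMMON_CHARS.encode("mac_roman"))

# Bytes that windows-1252 leaves undefined.
CP1252_UNDEFINED = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def guess_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', 'cp1252', or 'mac_roman' for raw bytes."""
    # Easy case 1: no byte above 0x7F means plain ASCII.
    if all(b < 0x80 for b in data):
        return "ascii"

    # Easy case 2: strict UTF-8 validation rarely gives false positives.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # A byte that windows-1252 does not define points to MacRoman.
    if any(b in CP1252_UNDEFINED for b in data):
        return "mac_roman"

    # Otherwise count the bytes that suggest each encoding.
    cp1252_score = sum(b in CP1252_HINTS for b in data)
    macroman_score = sum(b in MACROMAN_HINTS for b in data)

    # ASSUMPTION: ties (including 0 vs 0) fall back to cp1252.
    return "mac_roman" if macroman_score > cp1252_score else "cp1252"
```

For example, `guess_encoding("café".encode("mac_roman"))` sees the single byte 0x8E, which is in the MacRoman hint set only, so it reports `mac_roman`. ISO-8859-1 is never reported, following the answer's advice to detect windows-1252 instead. On short inputs with few non-ASCII bytes the counts carry little evidence, so the result is best treated as a guess.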

