Java – How to handle special characters when compressing bytes (Huffman encoding)?

Question

I am writing a Huffman Compression/Decompression program. I have started writing my compression method and I am stuck. I am trying to read all bytes in the file and then put all of the bytes into a byte array. After putting all bytes into the byte array I create an int[] array that will store all the frequenc…

Accepted Answer

Encodings matterIf I take the character &#8220;ö&#8221; and read it in my file it will now berepresented by 2 different values (191 and 182 or something like that)when its actual ASCII table value is 148.That really depends, which kind of encoding was used to create your text file. Encodings determine how text messages are stored.In UTF-8 the ö is stored as hex [0xc3, 0xb6] or [195, 182]In ISO/IEC 8859-1 (= &#8220;Latin-1&#8221;) it would be stored as hex [0xf6], or [246]In Mac OS Central European, it would be hex [0x9a] or [154]Please note, that the basic ASCII table itself doesn&#8217;t really describe anything for that kind of character. ASCII only uses 7 bits, and by doing so only maps 128 codes.Part of the problem is that in layman&#8217;s terms, &#8220;ASCII&#8221; is sometimes used to describe extensions of ASCII as well, (e.g. like Latin-1)HistoryThere&#8217;s actually a bit of history behind that. Originally ASCII was a very limited set of characters. When those weren&#8217;t enough, each region started using the 8th bit to add their language-specific characters. Leading to all kind of compatibility issues.Then there was some kind of consortium that made an inventory of all characters in all possible languages (and beyond). That set is called &#8220;unicode&#8221;. It contains not just 128 or 256 characters, but thousands of them.From that point on you would need more advanced encodings to cover them. UTF-8 is one of those encodings that covers that entire unicode set, and it does so while being kind-of backwards compatible with ASCII.Each ASCII character is still mapped in the same way, but when 1-byte isn&#8217;t enough, it will use the 8th bit to indicate that a 2nd byte will follow, which is the case for the ö character.ToolsIf you&#8217;re using a more advanced text editor like Notepad++, then you can select your encoding from the drop-down menu.In programmingHaving said that, your current java source reads bytes, it&#8217;s not reading characters. And I would think that it&#8217;s a plus when it works on byte-level, because then it can support all encodings. Maybe you don&#8217;t need to work on character level at all.However, if it does matter for your specific algorithm. Let&#8217;s say you&#8217;ve written an algorithm that is only supposed to handle Latin-1 encoding. So, then it&#8217;s really going to work on &#8220;character-level&#8221; and not on &#8220;byte-level&#8221;. In that case, consider reading directly to String or char[].Java can do the heavy-lifting for you in that case. There are readers in java that will let you read a text-file directly to Strings/char[]. However, in those cases you should of course specify an encoding when you use them. Internally a single java character can contain up to 2 bytes of data.Trying to convert bytes to characters manually is a tricky business. Unless you&#8217;re working with plain old ASCII of course. The moment you see a value above 0x7F (127), (which are presented by negative values in byte) you&#8217;re no longer working with simple ASCII. Then consider using something like: new String(bytes, StandardCharsets.UTF_8). There&#8217;s no need to write a decoding algorithm from scratch.

Advertisement

Answer

Encodings matter

History

Tools

In programming