What does “bits 6-0” or “bits 10-6” mean in the Javadoc of DataInput?



While reading the Javadoc of DataInput, specifically the “Modified UTF-8” section, I came across three tables that say “0 bits 6-0”, “1 1 0 bits 10-6”, …, “1 0 bits 5-0”.

I’m a Java newbie, so to me these look like subtractions. I’m not sure, but if that’s the case and we add the result to the ones and zeros, it would make 7 bits. As far as I know, a byte is made up of 8 bits.

What do these “0 bits 6-0…” entries mean?

Answer

The javadoc is telling you how each byte is divided.

Consider each byte as a vector of 8 individual elements (bits).

The first block describes a character encoded in a single byte and shows what each of its bits can hold:

byte 1

bit number 7 6 5 4 3 2 1 0
bit value  0 ? ? ? ? ? ? ? <-- bits 6 - 0

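This means that for characters encoded in one byte, the leading bit is always 0 and the remaining seven positions hold bits 6 down to 0 of the character's value. So “bits 6-0” is a range of bit positions, not a subtraction: 1 fixed bit plus 7 value bits makes the full 8. These are the characters from \u0001 to \u007F.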
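If it helps to see the one-byte layout as code, here is a minimal Java sketch. The class and method names are made up for illustration and are not part of the JDK; it only masks the bits described above.

```java
final class OneByteSketch {
    // Hypothetical helper: true when bit 7 is 0, i.e. the byte is a
    // complete one-byte character in the layout shown above.
    static boolean isSingleByteChar(byte b) {
        return (b & 0x80) == 0;
    }

    // Hypothetical helper: keeps bits 6-0, which hold the character's value.
    static char decodeOneByte(byte b) {
        return (char) (b & 0x7F);
    }

    public static void main(String[] args) {
        byte b = 0x41; // 0100 0001
        System.out.println(isSingleByteChar(b) + " " + decodeOneByte(b)); // prints "true A"
    }
}
```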

The second block has two bytes and gets a bit more complicated:

                  byte 1               byte 2

bit number 15 14 13 12 11 10 9 8 | 7 6 5 4 3 2 1 0
bit value   1  1  0  ?  ?  ? ? ? | 1 0 ? ? ? ? ? ?
                     ^  ^  ^ ^ ^       ^ ^ ^ ^ ^ ^
                          |                 |
                   bits 10 to 6 of   bits 5 to 0 of
                   the code point    the code point

These are the characters in the range from \u0080 to \u07FF. (Modified UTF-8 also encodes the null character \u0000 this way.)
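The two-byte layout can be sketched in Java as a pair of shifts and masks. This is only an illustration under my own naming (`TwoByteSketch`, `encodeTwoByte` and `toBits` are hypothetical, not JDK methods), and the sample character é (\u00E9) is just an arbitrary pick from the range.

```java
final class TwoByteSketch {
    // Hypothetical helper: builds the two bytes for a char in \u0080..\u07FF.
    static byte[] encodeTwoByte(char c) {
        if (c < 0x0080 || c > 0x07FF) {
            throw new IllegalArgumentException("not in the two-byte range");
        }
        int high = (c >> 6) & 0x1F; // bits 10-6 of the char value
        int low = c & 0x3F;         // bits 5-0 of the char value
        return new byte[] {
            (byte) (0xC0 | high),   // 1 1 0 followed by bits 10-6
            (byte) (0x80 | low)     // 1 0 followed by bits 5-0
        };
    }

    // Formats a byte as eight binary digits, padding with leading zeros.
    static String toBits(byte b) {
        return String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
    }

    public static void main(String[] args) {
        byte[] bytes = encodeTwoByte('\u00E9');
        System.out.println(toBits(bytes[0]) + " " + toBits(bytes[1])); // prints "11000011 10101001"
    }
}
```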

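For example, one symbol in this range is µ (micro sign, \u00B5).

Its code point in binary is 000 1011 0101. Encoded in UTF-8, the bytes are 11000010 10110101 (the modified UTF-8 form is identical for this character).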

Take a look at this character and see how it lines up with the bits for two-byte chars. You have:

15 14 13 12 11 10 9 8    7 6 5 4 3 2 1 0
 1  1  0  0  0  0 1 0    1 0 1 1 0 1 0 1
          ^  ^  ^ ^ ^        ^ ^ ^ ^ ^ ^
          bits 10-6          bits 5-0

You end up with:

                  byte 1               byte 2

bit number 15 14 13 12 11 10 9 8 | 7 6 5 4 3 2 1 0
bit value   1  1  0  -  -  0 1 0 | 1 0 1 1 0 1 0 1

Bits 11 and 12 are 0 here; I put a - in their place only to show that, for this character, they are leading zeros and contribute nothing to the value.
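You can check the µ example against the JDK itself. The sketch below writes the string with DataOutputStream.writeUTF, which emits a two-byte length prefix followed by the modified UTF-8 bytes, and then reassembles the character from the two bit fields. The class name and the `writeUtf` and `decodeTwoByte` helpers are my own, for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class MicroSignCheck {
    // Writes s with DataOutputStream.writeUTF and returns the raw bytes,
    // including the two-byte length prefix that writeUTF emits first.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Hypothetical helper: rebuilds the char from a two-byte group by
    // joining bits 10-6 (from byte 1) with bits 5-0 (from byte 2).
    static char decodeTwoByte(byte b1, byte b2) {
        return (char) (((b1 & 0x1F) << 6) | (b2 & 0x3F));
    }

    public static void main(String[] args) {
        byte[] raw = writeUtf("\u00B5");
        // raw[0] and raw[1] are the length prefix; raw[2] and raw[3] encode the char.
        System.out.printf("%02X %02X%n", raw[2] & 0xFF, raw[3] & 0xFF); // prints "C2 B5"
        System.out.printf("U+%04X%n", (int) decodeTwoByte(raw[2], raw[3])); // prints "U+00B5"
    }
}
```

C2 B5 is 11000010 10110101 in binary, which matches the table above.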

Sorry for the ASCII art; I hope it helps.



Source: stackoverflow