While reading the Javadoc of DataInput, specifically the “Modified UTF-8” section, I came across three tables that say “0 bits 6-0”, “1 1 0 bits 10-6”, …, “1 0 bits 5-0”.
I’m a Java newbie, so to me these look like subtractions. I’m not sure, but if that’s the case and we add the result to the ones and zeros, it would make 7 bits. As far as I know, a byte is made up of 8 bits.
What do these “0 bits 6-0…” mean?
Answer
The javadoc is telling you how each byte is divided.
Consider each byte as a vector of 8 individual elements (bits).
The first block has only one byte, and the corresponding possible bit values.
                byte 1
    bit number  7 6 5 4 3 2 1 0
    bit value   0 ? ? ? ? ? ? ?   <-- bits 6 - 0
This means that for characters encoded in one byte, the leading bit will always be 0. These are the characters from \u0001 to \u007F.
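As a rough sketch of the one-byte case (the class and method names here are made up for illustration, not part of any Java API), the byte is simply bits 6-0 of the character with bit 7 left at 0:

```java
public class OneByteForm {
    /** True if c falls in the one-byte range U+0001..U+007F described above. */
    public static boolean fitsInOneByte(char c) {
        return c >= 0x0001 && c <= 0x007F;
    }

    /** Renders the low 8 bits of a value as a zero-padded binary string. */
    public static String toBits(int value) {
        String bits = Integer.toBinaryString(value & 0xFF);
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        char c = 'A'; // U+0041
        // The encoded byte is just bits 6-0 of the character; bit 7 stays 0.
        int encoded = c & 0x7F;
        System.out.println(fitsInOneByte(c)); // true
        System.out.println(toBits(encoded));  // 01000001
    }
}
```

The leading 0 in the printed value is the fixed "0" from the table, and the remaining seven digits are the "bits 6-0" part.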
The second block has two bytes and gets a bit more complicated
                        byte 1           |        byte 2
    bit number  15 14 13 12 11 10  9  8  |  7  6  5  4  3  2  1  0
    bit value    1  1  0  ?  ?  ?  ?  ?  |  1  0  ?  ?  ?  ?  ?  ?
                             ^                        ^
                             |                        |
                    bits 10 to 6 of           bits 5 to 0 of
                    the utf-8 codepoint       the utf-8 codepoint
These are the characters in the range from \u0080 to \u07FF.
So for example, a symbol in this range is µ (micro sign).
In standard UTF-8, which is identical to modified UTF-8 for this range, the bytes are 11000010 10110101 (hex C2 B5).
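If you want to check those bytes yourself, something like the following should do it. This is only a sketch: `MicroSignBytes`, `toBits` and `writeUtf` are names invented for this example, while `String.getBytes` and `DataOutputStream.writeUTF` are the real library calls. Note that `writeUTF` puts a two-byte length in front of the encoded characters.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class MicroSignBytes {
    /** Renders each byte as 8 binary digits, separated by spaces. */
    public static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            sb.append(String.format("%8s", bits).replace(' ', '0'));
        }
        return sb.toString();
    }

    /** Bytes produced by writeUTF, including its 2-byte length prefix. */
    public static byte[] writeUtf(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String micro = "\u00B5"; // the micro sign
        // Standard UTF-8: 11000010 10110101
        System.out.println(toBits(micro.getBytes(StandardCharsets.UTF_8)));
        // Modified UTF-8 via writeUTF: 00000000 00000010 11000010 10110101
        System.out.println(toBits(writeUtf(micro)));
    }
}
```

The last two bytes of the second line are the same pair as in the first line; the first two bytes are the length (2).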
Take a look at this character and see how it lines up with the bits for two-byte chars. You have
    bit number  15 14 13 12 11 10  9  8  |  7  6  5  4  3  2  1  0
    bit value    1  1  0  0  0  0  1  0  |  1  0  1  1  0  1  0  1
                          *  *  *  *  *           ^  ^  ^  ^  ^  ^
                          bits 10-6               bits 5-0
You end up with
                        byte 1           |        byte 2
    bit number  15 14 13 12 11 10  9  8  |  7  6  5  4  3  2  1  0
    bit value    1  1  0  -  -  0  1  0  |  1  0  1  1  0  1  0  1
Where bits 12 and 11 would be 0, but I put a - in there just to show their (in)significance.
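The same split can be written out with shifts and masks. This is a minimal sketch of the two-byte layout above, not the JDK's actual implementation; `TwoByteForm`, `encode` and `decode` are hypothetical names used only here.

```java
public class TwoByteForm {
    /** Encodes a char in U+0080..U+07FF into the two-byte layout shown above. */
    public static int[] encode(char c) {
        if (c < 0x0080 || c > 0x07FF) {
            throw new IllegalArgumentException("not in the two-byte range");
        }
        int byte1 = 0xC0 | ((c >> 6) & 0x1F); // 1 1 0 + bits 10-6
        int byte2 = 0x80 | (c & 0x3F);        // 1 0   + bits 5-0
        return new int[] { byte1, byte2 };
    }

    /** Reverses encode(): strips the marker bits and rejoins the payload. */
    public static char decode(int byte1, int byte2) {
        return (char) (((byte1 & 0x1F) << 6) | (byte2 & 0x3F));
    }

    public static void main(String[] args) {
        int[] bytes = encode('\u00B5'); // the micro sign
        System.out.println(Integer.toBinaryString(bytes[0])); // 11000010
        System.out.println(Integer.toBinaryString(bytes[1])); // 10110101
        System.out.println(decode(bytes[0], bytes[1]) == '\u00B5'); // true
    }
}
```

In modified UTF-8 the null character \u0000 also uses this two-byte form; the sketch leaves that case out to stay close to the range discussed here.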
Sorry for the ascii art, I hope it helps.