What is the difference between chars() and codePoints() method in CharSequence interface?

I read the Javadoc but don’t understand the difference; both methods seem to return the same result. Also, can anyone explain what ‘zero-extending’ means?

Javadoc of chars() method

Returns a stream of int zero-extending the char values from this sequence. Any char which maps to a surrogate code point is passed through uninterpreted. The stream binds to this sequence when the terminal stream operation commences (specifically, for mutable sequences the spliterator for the stream is late-binding). If the sequence is modified during that operation then the result is undefined.

Javadoc of codePoints() method

Returns a stream of code point values from this sequence. Any surrogate pairs encountered in the sequence are combined as if by Character.toCodePoint and the result is passed to the stream. Any other code units, including ordinary BMP characters, unpaired surrogates, and undefined code units, are zero-extended to int values which are then passed to the stream. The stream binds to this sequence when the terminal stream operation commences (specifically, for mutable sequences the spliterator for the stream is late-binding). If the sequence is modified during that operation then the result is undefined.

Answer

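A ‘char’ is a 16-bit unsigned value in Java, so there are 65536 possible chars. ‘Zero-extending’ means widening that 16-bit value to a 32-bit int by filling the upper 16 bits with zeros, so the numeric value (0 to 65535) is unchanged.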
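As a rough illustration of zero-extension, the sketch below widens the same 16 bits to an int twice: once as a char (unsigned, zero-extended) and once as a short (signed, sign-extended). The class name ZeroExtendDemo and the helper widen are made up for this example.

```java
import java.util.Arrays;

class ZeroExtendDemo {
    // Hypothetical helper: returns the int values that chars() produces for s.
    static int[] widen(CharSequence s) {
        return s.chars().toArray();
    }

    public static void main(String[] args) {
        char c = '\uFFFF';            // all 16 bits set
        int zeroExtended = c;         // char is unsigned: upper 16 bits are filled with 0
        short sameBits = (short) c;   // same 16 bits, but short is signed
        int signExtended = sameBits;  // upper 16 bits copy the sign bit

        System.out.println(zeroExtended);                       // 65535
        System.out.println(signExtended);                       // -1
        System.out.println(Arrays.toString(widen("A\uFFFF")));  // [65, 65535]
    }
}
```

Because char is unsigned, the values coming out of chars() are never negative.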

Unicode unfortunately now has more than 65536 characters, each of which is identified by a ‘codepoint’, which is a number from 0 to 0x10FFFF (1,114,111).

It is therefore obviously not possible to represent every character as a single Java ‘char’. There are two choices available to the Java programmer for codepoints larger than 65535: a pair of chars (known as a surrogate pair) or else a single 32-bit integer codepoint.

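The difference between char and codepoint shows up only for codepoints larger than 65535. For a sequence that contains no such characters, chars() and codePoints() produce exactly the same values, which is why you saw the same result from both.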
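A small sketch of where the two methods agree and where they diverge, using U+1F600 (written as the surrogate pair \uD83D\uDE00) as the sample character above 65535. The class and helper names are only illustrative.

```java
import java.util.Arrays;

class CharsVsCodePoints {
    // One int per 16-bit char, surrogates passed through as-is.
    static int[] charValues(CharSequence s) {
        return s.chars().toArray();
    }

    // One int per code point, surrogate pairs combined.
    static int[] codePointValues(CharSequence s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String bmpOnly = "hi";
        String withPair = "a\uD83D\uDE00"; // 'a' followed by U+1F600

        // No codepoint above 65535: both streams are identical.
        System.out.println(Arrays.toString(charValues(bmpOnly)));       // [104, 105]
        System.out.println(Arrays.toString(codePointValues(bmpOnly)));  // [104, 105]

        // One codepoint above 65535: chars() yields the two surrogates,
        // codePoints() yields the single decoded value.
        System.out.println(Arrays.toString(charValues(withPair)));      // [97, 55357, 56832]
        System.out.println(Arrays.toString(codePointValues(withPair))); // [97, 128512]
    }
}
```

Note that the stream lengths differ too: withPair.length() is 3, while its codePoints() stream has 2 elements.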

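Note that the 32-bit ‘codepoint’ value is not simply the concatenation of the two 16-bit ‘char’ values. The surrogate pair is decoded, as if by Character.toCodePoint: the high surrogate supplies the upper 10 bits and the low surrogate the lower 10 bits of an offset, and 0x10000 is added to that offset.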
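To make the decoding concrete, here is a minimal sketch that compares naive concatenation with the UTF-16 decoding arithmetic and with Character.toCodePoint. The class SurrogateDecode and its decode method are hypothetical names written out by hand for illustration; in real code you would just call Character.toCodePoint.

```java
class SurrogateDecode {
    // Hand-written UTF-16 decoding: 10 bits from each surrogate, plus 0x10000.
    // Assumes high is in 0xD800..0xDBFF and low is in 0xDC00..0xDFFF.
    static int decode(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        char high = '\uD83D';
        char low = '\uDE00';

        int concatenated = (high << 16) | low;  // just glues the bits together
        int decoded = decode(high, low);

        System.out.println(Integer.toHexString(concatenated)); // d83dde00
        System.out.println(Integer.toHexString(decoded));      // 1f600
        System.out.println(decoded == Character.toCodePoint(high, low)); // true
    }
}
```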
