Unicode normalization forms Explanation (Java)

Question

I'm using Normalizer.normalize(url, Normalizer.Form.NFD) to avoid having characters like é in my URL, and I do not understand the meaning of the Normalizer.Form constants (NFC, NFD, NFKC, and NFKD) or when to use each one. I consulted the documentation but this did not help at all. Does anyone have any ide…

Accepted Answer

D = Decomposed: e ´
C = Composed: é

The K is for ligatures: one letter ﬃ (ffi) or three: f f i.

This is mentioned in the javadoc:

  Characters with accents or other adornments can be encoded in several different ways in Unicode. For example, take the character A-acute. In Unicode, this can be encoded as a single character (the "composed" form):

    U+00C1    LATIN CAPITAL LETTER A WITH ACUTE

  or as two separate characters (the "decomposed" form):

    U+0041    LATIN CAPITAL LETTER A
    U+0301    COMBINING ACUTE ACCENT

  To a user of your program, however, both of these sequences should be treated as the same "user-level" character "A with acute accent". When you are searching or comparing text, you must ensure that these two sequences are treated as equivalent. In addition, you must handle characters with more than one accent. Sometimes the order of a character's combining accents is significant, while in other cases accent sequences in different orders are really equivalent.

  Similarly, the string "ffi" can be encoded as three separate letters:

    U+0066    LATIN SMALL LETTER F
    U+0066    LATIN SMALL LETTER F
    U+0069    LATIN SMALL LETTER I

  or as the single character

    U+FB03    LATIN SMALL LIGATURE FFI

So in your case you want NFKD, full decomposition.

    s = Normalizer.normalize(s, Normalizer.Form.NFKD).replaceAll("\\p{M}", "");

The trailing replaceAll just removes the combining diacritical marks, the zero-width accents ´.
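To see the difference between the four forms concretely, here is a minimal sketch that prints the code points each form produces for é followed by the ﬃ ligature, and then strips the accents after an NFKD decomposition as suggested above. The class name NormalizationForms and the helpers codePoints and stripAccents are made up for this illustration; only java.text.Normalizer and the \p{M} regex class come from the JDK.

```java
import java.text.Normalizer;
import java.util.Locale;

class NormalizationForms {

    /** Renders each code point of s as U+XXXX so composed and decomposed forms can be told apart. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format(Locale.ROOT, "U+%04X ", cp)));
        return sb.toString().trim();
    }

    /** Hypothetical helper: decomposes with NFKD, then removes the combining marks (\p{M}). */
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD).replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String input = "\u00E9\uFB03"; // é followed by the single-character ffi ligature

        // Prints:
        // NFD: U+0065 U+0301 U+FB03
        // NFC: U+00E9 U+FB03
        // NFKD: U+0065 U+0301 U+0066 U+0066 U+0069
        // NFKC: U+00E9 U+0066 U+0066 U+0069
        for (Normalizer.Form form : Normalizer.Form.values()) {
            System.out.println(form + ": " + codePoints(Normalizer.normalize(input, form)));
        }

        System.out.println(stripAccents("caf\u00E9")); // cafe
        System.out.println(stripAccents("\uFB03"));    // ffi
    }
}
```

Note that only the K forms split the ligature into f f i; NFD and NFC leave U+FB03 alone, which is why the compatibility variant matters when the goal is plain ASCII-looking text.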
There are still problematic Latin letters such as:

ł  Polish small L with stroke
ı  Turkish small I without dot
İ  Turkish capital I with dot

But you might already be doing a non-ASCII replacement anyway.

Of course, nowadays one might have Unicode URLs to some degree, on sites with special characters. With some care those characters would not get mangled.

Another use of normalisation in decomposed form is for sorting country names alphabetically: Österreich (Austria in German) before P.

Some Details

The K stands for "compatibility" and hence is important.

One can have more than one accent (zero-width combining diacritical mark) on a letter.

One can have a String with both composed and decomposed letters.

So NFC actually does: canonical decomposition, followed by canonical composition. In order to do a good composition it is best to decompose first, which the Normalizer does for you.

Composition also has its use; for instance it is guaranteed canonical (a single normal form), and is compact for String.codePointAt.
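As a rough illustration of the points about mixed composed/decomposed strings and NFC being a single normal form, the sketch below compares two spellings of Österreich. The class CanonicalCompare and the method canonicallyEqual are names invented for this example, not part of the JDK.

```java
import java.text.Normalizer;

class CanonicalCompare {

    /** Hypothetical helper: true when a and b are identical after NFC normalization. */
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00D6sterreich";    // Ö as one code point
        String decomposed = "O\u0308sterreich"; // O followed by COMBINING DIAERESIS

        System.out.println(composed.equals(decomposed));            // false
        System.out.println(canonicallyEqual(composed, decomposed)); // true

        // NFC gives the more compact spelling: one char for Ö instead of two.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.length() + " vs " + decomposed.length()); // 10 vs 11

        // The dotless i has no decomposition, so even NFKD leaves it as it is.
        String dotlessI = "\u0131";
        System.out.println(Normalizer.normalize(dotlessI, Normalizer.Form.NFKD).equals(dotlessI)); // true
    }
}
```

The last line is why letters like ı still need a separate replacement step if the result has to be pure ASCII: there is no combining mark to strip.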


Answer