Regex Pattern with Unicode doesn’t do case folding

In C# it appears that Grüsse and Grüße are considered equal in most circumstances as is explained by this nice webpage. I’m trying to find a similar behavior in Java – obviously not in java.lang.String.

I thought I was in luck with java.util.regex.Pattern in combination with Pattern.UNICODE_CASE. The Javadoc says:

UNICODE_CASE enables Unicode-aware case folding. When this flag is specified then case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard.

Yet the following code:

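import java.util.regex.Pattern;

// Match the literal "Grüsse" case-insensitively, with Unicode-aware case handling
Pattern p = Pattern.compile(Pattern.quote("Grüsse"),
                     Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE);
System.out.println(p.matcher("Grüße").matches());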

yields false. Why? And is there an alternative way of reproducing the C# case folding behavior?

--- edit ---

As @VGR pointed out, String.toUpperCase will convert ß to SS, which may or may not be case folding (maybe I’m confusing concepts here). However, other characters in the German locale are not “folded”; for instance, ü does not become UE. So to make my initial example more complete: is there a way to make Grüße and Gruesse compare equal in Java?

I was thinking the java.text.Normalizer class could be used to do just that, but it converts ü to u followed by a combining diaeresis (which shows up as u?) rather than to ue. It also has no option to provide a Locale, which confuses me even more.
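
To make the Normalizer behavior visible, here is a minimal sketch that prints the code points of the decomposed string. The class name NormalizerDemo and the helper codePoints are made up for illustration. Normalizer performs Unicode normalization, which is defined independently of any language, which is presumably why it takes no Locale.

```java
import java.text.Normalizer;

final class NormalizerDemo {
    // Hypothetical helper: lists the code points of s as U+XXXX tokens,
    // so that combining marks (invisible on many consoles) can be seen.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "\u00FC" is the precomposed ü; NFD splits it into base letter + combining mark
        String decomposed = Normalizer.normalize("\u00FC", Normalizer.Form.NFD);
        System.out.println(codePoints(decomposed)); // U+0075 U+0308
    }
}
```

U+0075 is the plain letter u and U+0308 is the combining diaeresis, so the result is still a canonical equivalent of ü and never the German transliteration ue.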

Answer

For reference, the following facts:

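  • Character.toUpperCase() cannot do case folding, as it must map one character to exactly one character, so ß stays ß.

  • String.toUpperCase() will do case folding in the sense used here: it applies the full case mapping, so ß becomes SS.

  • String.equalsIgnoreCase() compares character by character, using Character.toUpperCase() internally, so doesn’t do case folding.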
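
The three facts above can be checked with a short sketch. The class name CaseMappingDemo and the helper sameIgnoringCase are invented for this example, and Locale.ROOT is passed explicitly as an assumption on my part, to keep the result independent of the JVM's default locale.

```java
import java.util.Locale;

final class CaseMappingDemo {
    // Hypothetical helper: compares the uppercased forms of both strings.
    static boolean sameIgnoringCase(String a, String b) {
        return a.toUpperCase(Locale.ROOT).equals(b.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String withDoubleS = "Gr\u00FCsse";      // Grüsse
        String withSharpS = "Gr\u00FC\u00DFe";   // Grüße

        // char-to-char mapping: ß has no single-char uppercase form, so it comes back unchanged
        System.out.println(Character.toUpperCase('\u00DF') == '\u00DF'); // true

        // String-level mapping may change the length: ß becomes SS
        System.out.println("\u00DF".toUpperCase(Locale.ROOT)); // SS

        // equalsIgnoreCase works char by char, so strings of different length never match
        System.out.println(withDoubleS.equalsIgnoreCase(withSharpS)); // false

        // comparing the uppercased strings succeeds
        System.out.println(sameIgnoringCase(withDoubleS, withSharpS)); // true
    }
}
```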

Conclusion (as @VGR pointed out): if you need case insensitive matching with case folding, you need to do:

foo.toUpperCase().equals(bar.toUpperCase())

and not:

foo.equalsIgnoreCase(bar)

As for the ü and ue equality, I’ve managed to do it with a RuleBasedCollator and my own rules (one would expect Locale.GERMAN to have that built in, but alas). It looked really silly/over-engineered, and since I needed only the equality, not the sorting/collating, in the end I’ve settled for a simple series of String.replace calls prior to comparison. It sucks, but it works and is transparent/readable.
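
A rough sketch of what such a replace-based comparison could look like follows. It is not the exact code I used: the names GermanEquality, key and germanEquals are invented here, and the order of the steps (NFC normalization, then uppercasing, then replacing umlauts) is just one reasonable choice.

```java
import java.text.Normalizer;
import java.util.Locale;

final class GermanEquality {
    // Hypothetical helper: reduces a string to a key used only for equality checks.
    static String key(String s) {
        // NFC first, so that "u + combining diaeresis" becomes the precomposed ü
        String composed = Normalizer.normalize(s, Normalizer.Form.NFC);
        return composed.toUpperCase(Locale.ROOT)  // also turns ß into SS
                .replace("\u00C4", "AE")          // Ä
                .replace("\u00D6", "OE")          // Ö
                .replace("\u00DC", "UE");         // Ü
    }

    static boolean germanEquals(String a, String b) {
        return key(a).equals(key(b));
    }

    public static void main(String[] args) {
        // Grüße vs Gruesse
        System.out.println(germanEquals("Gr\u00FC\u00DFe", "Gruesse")); // true
    }
}
```

Note that this treats every ue as equivalent to ü, including in words where the two letters are not a transliterated umlaut, so it is only suitable when that leniency is acceptable.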

User contributions licensed under: CC BY-SA