
How to grab all words that start with capital letters?

I want to create a Java regular expression that matches every word starting with a capital letter, followed by any number of uppercase or lowercase letters. The letters may carry accents.

Examples:

Where

Àdónde

Rápido

Àste

Can you please help me with that?


Answer

Regex:

\b\p{Lu}\p{L}*\b

Java string (each backslash is doubled inside the string literal):

"(?U)\\b\\p{Lu}\\p{L}*\\b"

Explanation:

\b      # Match at a word boundary (start of word)
\p{Lu}  # Match an uppercase letter
\p{L}*  # Match any number of letters (any case)
\b      # Match at a word boundary (end of word)
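
As a rough illustration of how the pattern above could be used, here is a minimal sketch with java.util.regex. The class name CapitalizedWords, the method findCapitalizedWords and the sample sentence are made up for this example; the input is assumed to contain precomposed (NFC) accented letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class CapitalizedWords {
    // (?U) enables UNICODE_CHARACTER_CLASS (Java 7+), so \b treats
    // accented letters as word characters.
    private static final Pattern CAPITALIZED =
            Pattern.compile("(?U)\\b\\p{Lu}\\p{L}*\\b");

    // Collects every word that starts with an uppercase letter.
    static List<String> findCapitalizedWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = CAPITALIZED.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only:
        // \u00C0 = A with grave, \u00F3 = o with acute, \u00E1 = a with acute.
        String text = "Where is \u00C0d\u00F3nde, el R\u00E1pido y \u00C0ste?";
        System.out.println(findCapitalizedWords(text));
    }
}
```

For the sample sentence this should find the four example words from the question and skip "is", "el" and "y". If the input might contain decomposed characters (a base letter followed by a combining accent), normalizing it first with java.text.Normalizer is probably needed, since \p{L} does not match combining marks.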

Caveat: This only works correctly in Java 7 (JDK7) or later, where the (?U) flag is available; for older versions you may need to substitute a longer sub-regex for \b. As you can see here, you may need to use (kudos to @tchrist)

(?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))

for \b, so the Java string would look like this:

"(?:(?<=[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]])(?![\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]])|(?<![\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]])(?=[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]))\\p{Lu}\\p{L}*(?:(?<=[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]])(?![\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]])|(?<![\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]])(?=[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]))"
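
Since the substitute boundary appears twice and repeats the same character class four times, it may be easier to assemble the long string from named pieces. The following is only a sketch of that idea; the names PortableBoundary, WORD_CHAR, BOUNDARY and findCapitalizedWords are invented for this example and do not come from the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class that builds the long pattern from constants.
class PortableBoundary {
    // Characters the substitute boundary treats as "word" characters.
    private static final String WORD_CHAR =
            "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // Boundary: word char behind and none ahead, or none behind and one ahead.
    private static final String BOUNDARY =
            "(?:(?<=" + WORD_CHAR + ")(?!" + WORD_CHAR + ")"
            + "|(?<!" + WORD_CHAR + ")(?=" + WORD_CHAR + "))";

    // Same shape as \b\p{Lu}\p{L}*\b, with BOUNDARY standing in for \b.
    private static final Pattern CAPITALIZED =
            Pattern.compile(BOUNDARY + "\\p{Lu}\\p{L}*" + BOUNDARY);

    // Collects every word that starts with an uppercase letter.
    static List<String> findCapitalizedWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = CAPITALIZED.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        String text = "Where is \u00C0d\u00F3nde, el R\u00E1pido y \u00C0ste?";
        System.out.println(findCapitalizedWords(text));
    }
}
```

The concatenated pattern is character-for-character the same regex as the long string above, so it should behave identically while being easier to read and change.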