Splitting strings through regular expressions by punctuation, whitespace, etc. in Java

Question

I have a text file that I read into a Java application and then count the words in it line by line. Right now I am splitting the lines into words with a regular expression, but I know I am missing out on some words from the text file. For example, the word “can’t” should be divided into two words “can…

Accepted Answer

You have one small mistake in your regex. Try this (note that the backslashes must be doubled inside a Java string literal):

    String[] Res = Text.split("[\\p{Punct}\\s]+");

In [\p{Punct}\s]+ the + has moved from inside the character class to the outside. Otherwise you also split on a literal + and consecutive separator characters are not combined into a single split.

So for this code:

    String Text = "But I know. For example, the word \"can't\" should";
    String[] Res = Text.split("[\\p{Punct}\\s]+");
    System.out.println(Res.length);
    for (String s : Res) {
        System.out.println(s);
    }

I get this result:

    10
    But
    I
    know
    For
    example
    the
    word
    can
    t
    should

which should meet your requirement.

As an alternative you can use:

    String[] Res = Text.split("\\P{L}+");

\P{L} matches any Unicode code point that does not have the property "Letter".
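To make the difference between the patterns in the answer concrete, here is a minimal sketch that runs them side by side. The class name `SplitDemo`, the helper `split`, and the constant names are illustrative and not from the original post. The pattern with the `+` inside the class is only an assumption about what the question used, since the question's own regex did not survive. The last two lines rely on the fact that, by default, Java's `\p{Punct}` covers ASCII punctuation only, so a typographic apostrophe (U+2019) is not a separator for the first pattern but is one for `\P{L}+`.

```java
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Assumed "before" pattern: '+' sits inside the class, so every
    // delimiter character is matched on its own.
    public static final String PLUS_INSIDE = "[\\p{Punct}\\s+]";
    // Pattern from the answer: one or more delimiters form a single split.
    public static final String PLUS_OUTSIDE = "[\\p{Punct}\\s]+";
    // Alternative from the answer: split on runs of non-letters.
    public static final String NON_LETTERS = "\\P{L}+";

    // Hypothetical helper: split and wrap the result in a List for printing.
    public static List<String> split(String text, String regex) {
        return Arrays.asList(text.split(regex));
    }

    public static void main(String[] args) {
        String text = "But I know. For example, the word \"can't\" should";

        // 14 elements: adjacent delimiters such as ". " leave empty strings.
        System.out.println(split(text, PLUS_INSIDE));
        // 10 elements, no empty strings.
        System.out.println(split(text, PLUS_OUTSIDE));
        // Same 10 elements for this ASCII-only input.
        System.out.println(split(text, NON_LETTERS));

        // Typographic apostrophe (U+2019) is outside default \p{Punct}.
        String curly = "can\u2019t";
        System.out.println(split(curly, PLUS_OUTSIDE).size()); // 1
        System.out.println(split(curly, NON_LETTERS));         // [can, t]
    }
}
```

If the text file contains typographic quotes or other non-ASCII punctuation, `\P{L}+` may be the safer choice of the two. Note that it also treats digits as separators, which may or may not be what a word count wants.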

Answer