
Java Regular Expression split keeping contractions

When using split(), what regular expression would let me keep all word characters and also preserve contractions like don’t and won’t? That is, it should keep any apostrophe that has word characters on both sides, but remove leading or trailing apostrophes such as those in ’tis or dogs’.

I have:

String[] words = line.split("[^\\w'+]+[\\w+('*?)\\w+]");

but it keeps the leading and trailing punctuation.

An input of: 'Tis the season, for the children's happiness'

should produce an output of: Tis the season for the children's happiness

Any advice?


Answer

I would split on:

  • either an apostrophe followed by at least one non-word character: ['-]\W+,
  • or any non-word character other than an apostrophe, followed by any further non-word characters: [^\w'-]\W*.

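    // requires: import java.util.Arrays;
    String line = "'Tis the season, for the children's happiness'";
    // Backslashes must be doubled inside a Java string literal.
    String[] words = line.split("(['-]\\W+|[^\\w'-]\\W*)");
    System.out.println(Arrays.toString(words));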
    

Here I also treated the hyphen (-) the same way as the apostrophe, so hyphenated words are kept whole too.

Result:

['Tis, the, season, for, the, children's, happiness']

Adding alternatives for an apostrophe or hyphen at the very beginning and end of the line:

    String[] words = line.split("(^['-]|['-]$|['-]\\W+|[^\\w'-]\\W*)");

Result:

[, Tis, the, season, for, the, children's, happiness]

The match at the beginning yields a leading empty string; the match at the end does not show up because split() discards trailing empty strings.
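One way to avoid the empty first element might be to strip non-word characters from both ends of the line before splitting, so that split() never finds a delimiter at index 0. The following is only a sketch under that assumption; the class name ContractionSplit, the method splitWords, and the constant DELIMITER are made up for illustration and are not part of the original answer.

```java
import java.util.Arrays;

class ContractionSplit {
    // Same two alternatives as above; the ^ and $ branches are no longer needed
    // because the ends of the line are cleaned up before splitting.
    private static final String DELIMITER = "['-]\\W+|[^\\w'-]\\W*";

    // Hypothetical helper: returns the words of a line, keeping inner apostrophes/hyphens.
    static String[] splitWords(String line) {
        // Remove any run of non-word characters at the start or end of the line.
        String trimmed = line.replaceAll("^\\W+|\\W+$", "");
        if (trimmed.isEmpty()) {
            // split() on an empty string would return one empty element, so short-circuit.
            return new String[0];
        }
        return trimmed.split(DELIMITER);
    }

    public static void main(String[] args) {
        String line = "'Tis the season, for the children's happiness'";
        System.out.println(Arrays.toString(splitWords(line)));
        // prints [Tis, the, season, for, the, children's, happiness]
    }
}
```

Note that this also strips other leading or trailing punctuation (quotes, commas, periods), which seems to be what the question wants, but it is a slightly broader rule than the ^['-] and ['-]$ alternatives.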

User contributions licensed under: CC BY-SA