Java regex. Keep special characters only when they’re surrounded by numbers



I’m making a word frequency program and I’m trying to preprocess the text to make it manageable. I want to remove all special characters, except that $%^*+-=,./<> should be kept when they are part of a number. I have virtually no experience with regular expressions, and after reading a bunch about them, I tried using negative lookahead and negative lookbehind to get something like

   String replace =  "[^a-z0-9\\s] | (?<!\\d)[$%^*+\\-=,./<>_] | [$%^*+\\-=,./<>_](?!\\d)";
   text.replaceAll(replace, "");

In short I want “they’re.” to become “theyre” but I want “1223.444” to remain unchanged.

Answer

You can use

text = text.replaceAll("[\\p{P}\\p{S}&&[^$%^*+=,./<>_-]]|[$%^*+=,./<>_-](?!(?<=\\d.)\\d)", "");

Details:

  • [\p{P}\p{S}&&[^$%^*+=,./<>_-]] – a character class intersection construct that matches any punctuation (\p{P}) or symbol (\p{S}) except $, %, ^, *, +, =, ,, ., /, <, >, _ and -
  • | – or
  • [$%^*+=,./<>_-](?!(?<=\d.)\d) – a $, %, ^, *, +, =, ,, ., /, <, >, _ or - char, unless it is immediately followed by a digit that is in turn immediately preceded by a digit and any one char (the . in the lookbehind matches the symbol/punctuation char consumed by [$%^*+=,./<>_-]). In other words, the char is removed unless it has a digit on both sides.
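A minimal sketch of how the pattern behaves on a few inputs. The class name NumberAwareCleaner and the method name clean are hypothetical, chosen only for this illustration, and the sample strings other than the two from the question are assumptions. Note the doubled backslashes required inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class NumberAwareCleaner {
    // Same regex as in the answer, compiled once for reuse.
    private static final Pattern SPECIALS = Pattern.compile(
        "[\\p{P}\\p{S}&&[^$%^*+=,./<>_-]]"        // any other punctuation or symbol
        + "|[$%^*+=,./<>_-](?!(?<=\\d.)\\d)");    // a listed char without a digit on both sides

    // Removes every match; listed chars between two digits are kept.
    static String clean(String text) {
        return SPECIALS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("they\u2019re."));  // theyre
        System.out.println(clean("1223.444"));       // 1223.444
        System.out.println(clean("a-b 10-20"));      // ab 10-20
        System.out.println(clean("50%"));            // 50
    }
}
```

Because a digit is required on both sides, a leading or trailing symbol such as the $ in "$5" or the % in "50%" is removed as well.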


Source: stackoverflow