I’m making a word frequency program and I’m trying to process the text to make it manageable. I want to remove all special characters, except that any of $%^*+-=,./<> should be kept when it is part of a number. I have virtually no experience with regular expressions, and after reading a bunch about them, I tried using negative lookahead and negative lookbehind to get something like
String replace = "[^a-z0-9\\s] | (?<!\d)[$%^*+\-=,./<>_] | [$%^*+\-=,./<>_](?!\d)"; text.replaceAll(replace, "");
In short I want “they’re.” to become “theyre” but I want “1223.444” to remain unchanged.
Answer
You can use
text = text.replaceAll("[\\p{P}\\p{S}&&[^$%^*+=,./<>_-]]|[$%^*+=,./<>_-](?!(?<=\\d.)\\d)", "");
Details:
[\p{P}\p{S}&&[^$%^*+=,./<>_-]] - a character class intersection construct that matches any punctuation (\p{P}) or symbol (\p{S}) except $, %, ^, *, +, =, ,, ., /, <, >, _ and -
| - or
[$%^*+=,./<>_-](?!(?<=\d.)\d) - a $, %, ^, *, +, =, ,, ., /, <, >, _ or - char that is not immediately followed with a digit which is, in its turn, immediately preceded with a digit and any char (the . is used to match the symbol/punctuation consumed with [$%^*+=,./<>_-]). In other words, such a char is removed unless it sits directly between two digits.
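To check the pattern against the examples from the question, here is a minimal sketch. The class name TextCleaner and the method name clean are made up for illustration, and the pattern is precompiled only because a word frequency program would presumably apply it many times. The right single quote is written as \u2019 so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the answer.
class TextCleaner {
    // Same regex as in the answer, with backslashes doubled for a Java string literal.
    private static final Pattern UNWANTED = Pattern.compile(
            "[\\p{P}\\p{S}&&[^$%^*+=,./<>_-]]|[$%^*+=,./<>_-](?!(?<=\\d.)\\d)");

    // Removes every match of the pattern from the given text.
    static String clean(String text) {
        return UNWANTED.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("they\u2019re.")); // prints "theyre"
        System.out.println(clean("1223.444"));      // prints "1223.444"
        System.out.println(clean("a+b"));           // prints "ab"
        System.out.println(clean("3+4"));           // prints "3+4"
    }
}
```

Note that a char from the kept set survives only when there is a digit on both sides of it, so a leading sign such as the $ in "$5" is still removed.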