I’m making a word frequency program and I’m trying to process text to make it manageable. I’m trying to remove all special characters except $%^*+-=,./<> when they are part of a number. I have virtually no experience with regular expressions, and after reading a bunch about them I tried using a negative lookahead and a negative lookbehind to get something like
String replace = "[^a-z0-9\\s] | (?<!\\d)[$%^*+\\-=,./<>_] | [$%^*+\\-=,./<>_](?!\\d)"; text.replaceAll(replace, "");
In short I want “they’re.” to become “theyre” but I want “1223.444” to remain unchanged.
Answer
You can use
text = text.replaceAll("[\\p{P}\\p{S}&&[^$%^*+=,./<>_-]]|[$%^*+=,./<>_-](?!(?<=\\d.)\\d)", "");
Details:
[\p{P}\p{S}&&[^$%^*+=,./<>_-]] - a character class intersection construct that matches any punctuation (\p{P}) or symbol (\p{S}) char except $, %, ^, *, +, =, ,, ., /, <, >, _ and -
| - or
[$%^*+=,./<>_-](?!(?<=\d.)\d) - a $, %, ^, *, +, =, ,, ., /, <, >, _ or - char that is not immediately followed by a digit which is, in its turn, immediately preceded by a digit and any one char (the . in the lookbehind matches the symbol/punctuation char just consumed by [$%^*+=,./<>_-]). In other words, such a char is removed unless it sits directly between two digits.
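To check the pattern against the examples from the question, a minimal sketch could look like the following. The class name NumberAwareCleaner and the method clean are made up for illustration, and precompiling the pattern is just a convenience when many strings are cleaned; it should behave the same as the replaceAll call above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption, not part of the answer.
final class NumberAwareCleaner {

    // Same regex as in the answer, compiled once for reuse.
    private static final Pattern SPECIAL = Pattern.compile(
            // any punctuation/symbol char that is NOT one of the "numeric" chars
            "[\\p{P}\\p{S}&&[^$%^*+=,./<>_-]]"
            // or a "numeric" char that does not sit directly between two digits
            + "|[$%^*+=,./<>_-](?!(?<=\\d.)\\d)");

    // Removes every char matched by SPECIAL from the input.
    static String clean(String text) {
        return SPECIAL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("they\u2019re."));             // theyre
        System.out.println(clean("1223.444"));                  // 1223.444
        System.out.println(clean("cost: $5, ratio 3/4!"));      // cost 5 ratio 3/4
    }
}
```

Note that a char such as $ in "$5" is still removed, because it is not preceded by a digit; only chars with a digit on both sides are kept.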