I’m trying to detect profanity using a regex, and I want to detect a word even if the user has spaced it out, like “Profa nity”. However, even with the “(?x)” option the pattern still doesn’t match.
I currently have:
(?ix).*Bad Word.*
I’ve tried using http://www.rubular.com to debug the expression with no luck.
If it helps in any way, it’s for a TeamSpeak bot where I want to kick users who have banned words in their name. The config refers to http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html where I can’t find anything relating to the (?) options.
The bot itself can be found here: https://forum.teamspeak.com/threads/51286-JTS3ServerMod-Multifunction-TS3-Server-Bot-(Idle-Record-Away-Mute-Welcome-)
Answer
when using the “(?x)” option it still doesn’t want to detect
The (?x) is an embedded flag option (also known as an inline modifier or inline option) that enables the Pattern.COMMENTS option, also known as free-spacing mode. This mode allows comments inside regular expressions and makes the regex engine ignore all regular whitespace inside the pattern. As per Free-Spacing in Character Classes:
In free-spacing mode, whitespace between regular expression tokens is ignored. Whitespace includes spaces, tabs, and line breaks. Note that only whitespace between tokens is ignored.
“a b c” is the same as “abc” in free-spacing mode. But “\ d” and “\d” are not the same. The former matches “ d”, while the latter matches a digit. “\d” is a single regex token composed of a backslash and a “d”. Breaking up the token with a space gives you an escaped space (which matches a space), and a literal “d”.
Likewise, grouping modifiers cannot be broken up. “(?>atomic)” is the same as “(?> ato mic )” and as “( ?>ato mic)”. They all match the same atomic group. They’re not the same as “(? >atomic)”. The latter is a syntax error. The “?>” grouping modifier is a single element in the regex syntax, and must stay together. This is true for all such constructs, including lookaround, named groups, etc.
So, to match a single space in a pattern with the (?x) modifier, you need to escape it:
String reg = "(?ix).*Bad\\ Word.*";   // Escaped space matches a space in free-spacing mode
String reg = "(?ix).* Bad\\ Word .*"; // More formatting spaces, same pattern
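As a quick check of the behaviour described above, here is a minimal sketch that compares the unescaped and escaped patterns. The class name FreeSpacingDemo, the helper fullMatch and the sample names are made up for illustration. Note that the doubled backslash is only needed inside a Java string literal; a pattern read from a config file would presumably need just one.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample inputs are illustrative only.
class FreeSpacingDemo {

    // True if the whole input matches the regex (same semantics as String.matches).
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String unescaped = "(?ix).*Bad Word.*";   // space is ignored: behaves like .*BadWord.*
        String escaped   = "(?ix).*Bad\\ Word.*"; // "\\ " in Java source is the regex "\ "

        System.out.println(fullMatch(unescaped, "xX Bad Word Xx")); // false
        System.out.println(fullMatch(unescaped, "xX BadWord Xx"));  // true
        System.out.println(fullMatch(escaped, "xX bad word Xx"));   // true
    }
}
```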
NOTE that you CAN’T put the space into a character class to make it meaningful in a Java regex. See below:
Java, however, does not treat a character class as a single token in free-spacing mode. Java does ignore spaces, line breaks, and comments inside character classes. So in Java’s free-spacing mode, “[abc]” is identical to “[ a b c ]”.
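The character-class behaviour quoted above can be checked with a small sketch. CharClassSpacingDemo and fullMatch are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing that Java ignores whitespace inside
// a character class when free-spacing mode (?x) is active.
class CharClassSpacingDemo {

    // True if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String spacedClass = "(?x)[ a b c ]"; // in Java this is the same as [abc]

        System.out.println(fullMatch(spacedClass, "b")); // true
        System.out.println(fullMatch(spacedClass, " ")); // false: the spaces are not part of the class
    }
}
```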
Besides, I think you actually want your pattern to match full strings that may contain line breaks. For that, you need the (?s) modifier, i.e. Pattern.DOTALL:
String reg = "(?is).*Bad Word.*";
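To illustrate why the (?s) flag matters for a full-string match, here is a minimal sketch. The class DotAllDemo, its helpers and the multi-line sample name are assumptions for the example. It also shows Matcher.find() as an alternative that avoids the surrounding .* entirely, in case the calling code allows a partial match.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the sample name is illustrative only.
class DotAllDemo {

    // True if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the regex matches anywhere inside the input.
    static boolean partialMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String name = "first line\nbad word\nlast line";

        System.out.println(fullMatch("(?i).*Bad Word.*", name));  // false: "." stops at line breaks
        System.out.println(fullMatch("(?is).*Bad Word.*", name)); // true: DOTALL lets "." match them
        System.out.println(partialMatch("(?i)Bad Word", name));   // true: find() needs no ".*" at all
    }
}
```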
Also, to match any whitespace, you may rely on \s:
String reg = "(?ix).*Bad\\sWord.*";  // To only match 1 whitespace
String reg = "(?ix).*Bad\\s+Word.*"; // To account for 1 or more whitespaces
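The difference between \s and \s+ can be seen in the sketch below. It also goes one step beyond the answer and tries the “Profa nity” case from the question by allowing optional whitespace between every character of a banned word. The class SpacedWordDemo and the helpers spacedOut and containsMatch are hypothetical, and the approach is only one possible way to handle spaced-out words (it assumes words made of ordinary, non-surrogate characters).

```java
import java.util.regex.Pattern;

// Hypothetical demo class; helper names are made up for this illustration.
class SpacedWordDemo {

    // Builds a regex that allows optional whitespace between every
    // character of the word, e.g. "ab" becomes \Qa\E\s*\Qb\E.
    static String spacedOut(String word) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            if (i > 0) {
                sb.append("\\s*");
            }
            // Quote each character so regex metacharacters are taken literally.
            sb.append(Pattern.quote(String.valueOf(word.charAt(i))));
        }
        return sb.toString();
    }

    // True if the regex matches anywhere in the input, ignoring case.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("Bad\\sWord", "xX Bad   Word"));  // false: exactly one whitespace
        System.out.println(containsMatch("Bad\\s+Word", "xX Bad   Word")); // true: one or more

        String banned = spacedOut("Profanity");
        System.out.println(containsMatch(banned, "xX Profa nity Xx"));  // true
        System.out.println(containsMatch(banned, "p r o f a n i t y")); // true
        System.out.println(containsMatch(banned, "profit"));            // false
    }
}
```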