
Free space regex option (Pattern.COMMENTS) not working as expected

I’m trying to detect profanity using regex. But I want to detect the word even if they’ve spaced out the word like “Profa nity”. However when using the “(?x)” option it still doesn’t want to detect.

I currently got:

(?ix).*Bad Word.*

I’ve tried using http://www.rubular.com to debug the expression, with no luck.

If it helps in any way, it’s for a TeamSpeak bot where I want to kick users for having banned words in their name. The config refers to http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html, where I can’t find anything relating to the (?) options.

The bot itself can be found here: https://forum.teamspeak.com/threads/51286-JTS3ServerMod-Multifunction-TS3-Server-Bot-(Idle-Record-Away-Mute-Welcome-)

Answer

when using the “(?x)” option it still doesn’t want to detect

(?x) is an embedded flag option (also known as an inline modifier) that enables Pattern.COMMENTS, also known as free-spacing mode. This mode allows comments inside the regular expression and makes the regex engine ignore all unescaped whitespace in the pattern. As per Free-Spacing in Character Classes:

In free-spacing mode, whitespace between regular expression tokens is ignored. Whitespace includes spaces, tabs, and line breaks. Note that only whitespace between tokens is ignored. a b c is the same as abc in free-spacing mode. But \ d and \d are not the same. The former matches a space followed by d, while the latter matches a digit. \d is a single regex token composed of a backslash and a "d". Breaking up the token with a space gives you an escaped space (which matches a space), and a literal “d”.

Likewise, grouping modifiers cannot be broken up. (?>atomic) is the same as (?> ato mic ) and as ( ?>ato mic). They all match the same atomic group. They’re not the same as (? >atomic). The latter is a syntax error. The ?> grouping modifier is a single element in the regex syntax, and must stay together. This is true for all such constructs, including lookaround, named groups, etc.
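
The token rules quoted above can be checked directly in Java. The following is only a minimal sketch: the class name FreeSpacingTokens and the helper full() are made up for illustration, and the backslashes are doubled because the patterns are written as Java string literals.

```java
import java.util.regex.Pattern;

class FreeSpacingTokens {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean full(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Whitespace between tokens is ignored: "a b c" behaves like "abc".
        System.out.println(full("(?x) a b c ", "abc"));
        // \d is a single token that matches a digit.
        System.out.println(full("(?x)\\d", "7"));
        // "\ d" is an escaped space followed by a literal d.
        System.out.println(full("(?x)\\ d", " d"));
        // ...so it no longer matches a digit.
        System.out.println(full("(?x)\\ d", "7"));
        // Spaces inside the atomic group body are ignored as well.
        System.out.println(full("(?x)(?> ato mic )", "atomic"));
    }
}
```

Every call except the fourth should print true; the fourth prints false because the escaped space turns \d into two separate tokens.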

So, to match a single space in a pattern with the (?x) modifier, you need to escape it:

String reg = "(?ix).*Bad\\ Word.*";   // Escaped space matches a space in free-spacing mode
String reg = "(?ix).* Bad\\ Word .*"; // More formatting spaces, same pattern
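
To see the difference between the original pattern and the escaped version, here is a small sketch. The class name EscapedSpaceDemo and the sample inputs are assumptions for illustration; whether the bot calls matches() or something else is not known, so matches() is used here because the pattern already carries .* on both sides.

```java
import java.util.regex.Pattern;

class EscapedSpaceDemo {
    // Unescaped space is dropped under (?x), so this behaves like ".*BadWord.*".
    static final Pattern UNESCAPED = Pattern.compile("(?ix).*Bad Word.*");
    // "\\ " in Java source is the regex "\ ", which matches a literal space.
    static final Pattern ESCAPED = Pattern.compile("(?ix).*Bad\\ Word.*");

    static boolean hits(Pattern p, String name) {
        return p.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(hits(UNESCAPED, "xx bad word xx")); // false
        System.out.println(hits(UNESCAPED, "xxbadwordxx"));    // true
        System.out.println(hits(ESCAPED, "xx bad word xx"));   // true
        System.out.println(hits(ESCAPED, "xxbadwordxx"));      // false
    }
}
```

This is why the pattern from the question seemed not to work: with the space ignored, it only matches names that contain the two words with nothing between them.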

NOTE that you CAN’T put the space into a character class to make it meaningful in a Java regex. See below:

Java, however, does not treat a character class as a single token in free-spacing mode. Java does ignore spaces, line breaks, and comments inside character classes. So in Java’s free-spacing mode, [abc] is identical to [ a b c ].
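
A quick way to confirm this on your own JVM is sketched below. The class name CharClassSpacing and the helper full() are hypothetical; the expectation that the space inside the brackets is ignored comes from the quoted text and from the Pattern.COMMENTS description, so treat it as something to verify rather than a guarantee.

```java
import java.util.regex.Pattern;

class CharClassSpacing {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean full(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Under (?x), [ a b c ] should behave like [abc] in Java...
        System.out.println(full("(?x)[ a b c ]", "b"));
        // ...so the spaces inside the class do not make it match a space.
        System.out.println(full("(?x)[ a b c ]", " "));
        // Escaping the space outside a class is the reliable alternative.
        System.out.println(full("(?x)Bad\\ Word", "Bad Word"));
    }
}
```

The first and third calls are expected to print true and the second false.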

Besides, I think you actually wanted to make sure your pattern can match full strings that may contain line breaks. For that, you need the (?s) modifier (Pattern.DOTALL), which lets . match line break characters too:

String reg = "(?is).*Bad Word.*";
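
The effect of (?s) only shows up when the input contains a line break. A minimal sketch, assuming a made-up multi-line input and a hypothetical class name DotAllDemo:

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Sample input (an assumption for illustration) with the phrase on its second line.
    static final String INPUT = "first line\nsome bad word here\nlast line";

    // matches() requires the whole string to match the pattern.
    static boolean full(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // find() looks for the pattern anywhere, so no surrounding .* is needed.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(full("(?i).*Bad Word.*", INPUT));  // false: . stops at \n
        System.out.println(full("(?is).*Bad Word.*", INPUT)); // true: . crosses \n
        System.out.println(anywhere("(?i)Bad Word", INPUT));  // true
    }
}
```

If the calling code uses find() rather than matches(), the leading and trailing .* and the (?s) flag are unnecessary; which method the bot uses is not stated in the question.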

Also, to match any whitespace, you may rely on \s:

String reg = "(?ix).*Bad\\sWord.*"; // To only match 1 whitespace
String reg = "(?ix).*Bad\\s+Word.*"; // To account for 1 or more whitespaces
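
A short sketch comparing the two \s variants on a few inputs. The class name WhitespaceDemo and the test strings are assumptions chosen for illustration:

```java
import java.util.regex.Pattern;

class WhitespaceDemo {
    // Exactly one whitespace character between the two words.
    static final Pattern SINGLE = Pattern.compile("(?ix).*Bad\\sWord.*");
    // One or more whitespace characters between the two words.
    static final Pattern MULTI = Pattern.compile("(?ix).*Bad\\s+Word.*");

    static boolean hits(Pattern p, String name) {
        return p.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = { "a bad word", "bad\tword", "bad   word", "badword" };
        for (String name : names) {
            System.out.println(hits(SINGLE, name) + " " + hits(MULTI, name));
        }
    }
}
```

Both patterns accept the single space and the tab, only the \s+ version accepts the run of three spaces, and neither accepts the words written together.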
User contributions licensed under: CC BY-SA