Regex to identify consecutive and non-consecutive duplicate words in multiline text

I’m writing a syntax checker (in Java) for a file that contains keywords and values, with commas separating the values and a semicolon ending each line. The amount of whitespace between two complete constructs is unspecified.

What is required:

Find any duplicate words (consecutive and non-consecutive) in the multiline file.

// Example_1 (duplicate 'test'):
item1  , test, item3   ;
test , item6;

// Example_2 (duplicate 'test'):
item1  , test, test   ;

I’ve tried to apply the (\w+)(\s*\W\s*\w*)*\1 pattern, which doesn’t catch duplicates properly.


You may use this regex with mode DOTALL (single line):

(?s)(\b\w+\b)(?=.*\b\1\b)

RegEx Details:

  • (?s): Enable DOTALL mode
  • (\b\w+\b): Match a complete word and capture it in group #1
  • (?=.*\b\1\b): Lookahead to assert that back-reference \1 is present somewhere ahead. \b is used to make sure we match the exact same word again.
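Putting the three pieces listed above together in Java might look like the following. This is a minimal sketch, not a definitive implementation: the class name DuplicateWords and the method findDuplicates are made up for illustration, and the pattern string is assembled from the components described above (note the doubled backslashes required in a Java string literal).

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class DuplicateWords {
    // (?s) enables DOTALL, group 1 captures a whole word,
    // and the lookahead requires the same whole word to occur again later.
    static final Pattern DUPLICATE =
            Pattern.compile("(?s)(\\b\\w+\\b)(?=.*\\b\\1\\b)");

    // Returns each word that appears again later in the text, in order of first match.
    static Set<String> findDuplicates(String text) {
        Set<String> duplicates = new LinkedHashSet<>();
        Matcher m = DUPLICATE.matcher(text);
        while (m.find()) {
            duplicates.add(m.group(1));
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String example1 = "item1  , test, item3   ;\ntest , item6;";
        String example2 = "item1  , test, test   ;";
        System.out.println(findDuplicates(example1)); // [test]
        System.out.println(findDuplicates(example2)); // [test]
    }
}
```

Collecting into a set means a word repeated three or more times is reported once; use a list instead if every repeated occurrence matters.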


Based on the comments below, if the intent is not to match consecutive word repeats such as item1 item1, then the following regex may be used:

(?s)(\b\w+\b)(?!\W+\1\b)(?=.*\b\1\b)

There is one extra negative lookahead assertion here to make sure we don’t match consecutive repeats.

  • (?!\W+\1\b): Negative lookahead to fail the match for consecutive repeats.
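The effect of the extra negative lookahead can be checked with a short Java sketch. It assumes the assertion is placed directly after the captured word; because both lookaheads are zero-width and are tested at the same position, their relative order should not change the result. The class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class NonConsecutiveDuplicates {
    // Same pattern as before, plus (?!\W+\1\b): reject a word whose
    // immediate successor (after non-word characters) is the same word.
    static final Pattern DUPLICATE =
            Pattern.compile("(?s)(\\b\\w+\\b)(?!\\W+\\1\\b)(?=.*\\b\\1\\b)");

    // Returns words that occur again later, unless the next word is the same word.
    static Set<String> findDuplicates(String text) {
        Set<String> duplicates = new LinkedHashSet<>();
        Matcher m = DUPLICATE.matcher(text);
        while (m.find()) {
            duplicates.add(m.group(1));
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Non-consecutive repeat across lines is still reported.
        System.out.println(findDuplicates("item1  , test, item3   ;\ntest , item6;")); // [test]
        // Consecutive repeats are skipped.
        System.out.println(findDuplicates("item1  , test, test   ;")); // []
        System.out.println(findDuplicates("item1 item1")); // []
    }
}
```

With this variant, Example_2 from the question is no longer flagged, so it only fits when consecutive repeats are meant to be allowed.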