
Regex to identify consecutive and non-consecutive duplicate words in multiline text

I’m writing a syntax checker (in Java) for a file that contains keywords and values, where commas separate the values and a semicolon ends each line. The amount of whitespace between any two items is unspecified.

What is required:

Find any duplicate words (consecutive and non-consecutive) in the multiline file.

// Example_1 (duplicate 'test'):
item1  , test, item3   ;
item4,item5;
test , item6;

// Example_2 (duplicate 'test'):
item1  , test, test   ;
item2,item3;

I’ve tried to apply the (\w+)(\s*\W\s*\w*)*\1 pattern, which doesn’t catch the duplicates properly.

Answer

You may use this regex with mode DOTALL (single line):

(?s)(\b\w+\b)(?=.*\b\1\b)


RegEx Details:

  • (?s): Enable DOTALL mode
  • (\b\w+\b): Match a complete word and capture it in group #1
  • (?=.*\b\1\b): Lookahead to assert that back-reference \1 is present somewhere ahead. \b is used to make sure we match the exact same word again.
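As a minimal sketch of how this pattern might be used from Java, the snippet below collects every word that occurs again later in the text. The class name DuplicateWordFinder and the method findDuplicates are hypothetical names chosen for illustration, and the sample strings are the two examples from the question. Note that each backslash has to be doubled inside a Java string literal.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class DuplicateWordFinder {
    // (?s) lets '.' cross line breaks, so the lookahead can scan the rest of the file.
    private static final Pattern DUPLICATE =
            Pattern.compile("(?s)(\\b\\w+\\b)(?=.*\\b\\1\\b)");

    // Returns each word that appears again later in the text, in order of first match.
    static Set<String> findDuplicates(String text) {
        Set<String> duplicates = new LinkedHashSet<>();
        Matcher m = DUPLICATE.matcher(text);
        while (m.find()) {
            duplicates.add(m.group(1));
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String example1 = "item1  , test, item3   ;\nitem4,item5;\ntest , item6;\n";
        String example2 = "item1  , test, test   ;\nitem2,item3;\n";
        System.out.println(findDuplicates(example1)); // prints [test]
        System.out.println(findDuplicates(example2)); // prints [test]
    }
}
```

A set is used because a word that occurs three or more times is matched at every occurrence except the last, and a syntax checker probably only needs to report it once.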

Additionally:

Based on the comments below, if the intent is to not match consecutive word repeats such as item1 item1, then the following regex may be used:

(?s)(\b\w+\b)(?!\W+\1\b)(?=.*\b\1\b)


There is one extra negative lookahead assertion here to make sure we don’t match consecutive repeats.

  • (?!\W+\1\b): Negative lookahead to fail the match for consecutive repeats.
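To illustrate the difference, here is a small sketch that applies the stricter pattern to the two examples from the question. The class name NonConsecutiveDuplicateFinder and its method are hypothetical names used only for this illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class NonConsecutiveDuplicateFinder {
    // Same pattern as before, plus (?!\W+\1\b) to reject a word
    // that is immediately followed by a repeat of itself.
    private static final Pattern NON_CONSECUTIVE =
            Pattern.compile("(?s)(\\b\\w+\\b)(?!\\W+\\1\\b)(?=.*\\b\\1\\b)");

    // Returns words that repeat later in the text, skipping directly adjacent repeats.
    static Set<String> findNonConsecutiveDuplicates(String text) {
        Set<String> duplicates = new LinkedHashSet<>();
        Matcher m = NON_CONSECUTIVE.matcher(text);
        while (m.find()) {
            duplicates.add(m.group(1));
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String example1 = "item1  , test, item3   ;\nitem4,item5;\ntest , item6;\n";
        String example2 = "item1  , test, test   ;\nitem2,item3;\n";
        System.out.println(findNonConsecutiveDuplicates(example1)); // prints [test]
        System.out.println(findNonConsecutiveDuplicates(example2)); // prints []
    }
}
```

In Example_2 the first test is rejected by the negative lookahead and the second test has no later occurrence, so nothing is reported.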