Regex pattern matching is getting timed out



I want to split an input string on a regex pattern using the Pattern.split(String) API. The regex uses both lookaheads and lookbehinds. It is supposed to split on a delimiter (,) but ignore the delimiter when it is enclosed in double quotes ("x,y").

The regex is:

(?<!(?<!\Q\\\E)\Q\\\E)\Q,\E(?=(?:[^\Q"\E]*(?<=\Q,\E)\Q"\E[[^\Q,\E|\Q"\E] | [\Q"\E]]+[^\Q"\E]*[^\Q\\\E]*[\Q"\E]*)*[^\Q"\E]*$)

The input string for which this split call times out is:

"","1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]","QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"

I read that lookaround techniques are expensive and can cause timeouts if the string is too long. If I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex is able to match and split.

With the string above, the pattern also fails to detect the first delimiter at [STIFFENER]","QH20426AD3. If I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex does detect it.

I am not very experienced with lookarounds in regex. Can someone please give hints on how I can optimize this regex and avoid timeouts? Any pointers or article links are appreciated!

Answer

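If you want to split on a comma only when it is directly followed by a complete double-quoted string (from the opening double quote to its closing double quote, allowing backslash-escaped characters in between), you can use: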

,(?="[^"\\]*(?:\\.[^"\\]*)*")

The pattern matches:

  • , Match a comma
  • (?= Positive lookahead
    • "[^"\\]* Match " and then 0+ times any char except " or \
    • (?:\\.[^"\\]*)*" Optionally repeat matching an escaped char using \\. followed by 0+ chars other than " and \, then match the closing "
  • ) Close lookahead
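
To make the breakdown above easier to try out, here is a minimal sketch that precompiles the pattern once and applies it to a short input. The class name QuotedFieldSplitter, the method splitQuotedFields and the sample line are made up for illustration; the sketch also assumes every field is wrapped in double quotes, as in the question's data.

```java
import java.util.regex.Pattern;

public class QuotedFieldSplitter {
    // A comma that is directly followed by a complete double-quoted field.
    // Inside the field, \\. consumes a backslash plus the character it escapes (e.g. \").
    private static final Pattern FIELD_DELIMITER =
            Pattern.compile(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");

    // Hypothetical helper: splits a line of quoted fields on the delimiter above.
    public static String[] splitQuotedFields(String line) {
        return FIELD_DELIMITER.split(line);
    }

    public static void main(String[] args) {
        // Sample input (an assumption): "a,b","c [\"d,e\"]"
        String line = "\"a,b\",\"c [\\\"d,e\\\"]\"";
        for (String field : splitQuotedFields(line)) {
            System.out.println(field);
        }
    }
}
```

The commas inside a field are skipped because they are not followed by an opening double quote, and the lookahead only scans as far as the closing quote of the next field rather than to the end of the input.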


// Each backslash of the regex is doubled, and each double quote escaped, inside the Java string literal
String string = "\"\",\"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]\",\"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\\\"BOLT,HI-JOK\\\"]\"\n";
String[] parts = string.split(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");
for (String part : parts)
    System.out.println(part);

Output

""
"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]"
"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
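
As a possible alternative to splitting, the quoted fields can be matched directly with the same sub-pattern, so no lookahead is needed at all. This is only a sketch: the class name QuotedFieldExtractor and the method extractFields are hypothetical, it assumes every field is double-quoted, and any performance benefit over the split version is not measured here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedFieldExtractor {
    // One complete double-quoted field, allowing backslash-escaped characters inside it.
    private static final Pattern QUOTED_FIELD =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");

    // Hypothetical helper: collects every quoted field, in order of appearance.
    public static List<String> extractFields(String line) {
        List<String> fields = new ArrayList<>();
        Matcher matcher = QUOTED_FIELD.matcher(line);
        while (matcher.find()) {
            fields.add(matcher.group());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample input (an assumption): "","x,y","z [\"p,q\"]"
        String line = "\"\",\"x,y\",\"z [\\\"p,q\\\"]\"";
        for (String field : extractFields(line)) {
            System.out.println(field);
        }
    }
}
```

Anything outside the quotes, such as the separating commas or a trailing newline, is simply skipped, so this variant would silently drop unquoted fields if the data contained any.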


Source: stackoverflow