I want to split an input string on a regex pattern using the Pattern.split(String) API. The regex uses both lookaheads and lookbehinds (positive and negative). It is supposed to split on a delimiter (,) but ignore the delimiter when it is enclosed in double quotes ("x,y").
The regex is:
(?<!(?<!\Q\\E)\Q\\E)\Q,\E(?=(?:[^\Q"\E]*(?<=\Q,\E)\Q"\E[[^\Q,\E|\Q"\E] | [\Q"\E]]+[^\Q"\E]*[^\Q\\E]*[\Q"\E]*)*[^\Q"\E]*$)
The input string for which this split call times out is:
"","1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]","QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
I have read that lookarounds are expensive and can cause timeouts if the string is too long. If I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex is able to match and split.
With the above string, the pattern also fails to detect the first delimiter at [STIFFENER]","QH20426AD3. Again, if I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex detects it.
I am not very experienced with lookarounds in regex. Can someone please give hints on how I can optimize this regex and avoid timeouts? Any pointers or article links are appreciated!
Answer
If you want to split on a comma only when it is directly followed by a string that runs from an opening double quote to its closing double quote, you can use:
,(?="[^"\\]*(?:\\.[^"\\]*)*")
The pattern matches:
, : Match a comma
(?= : Positive lookahead
"[^"\\]* : Match " and then 0+ times any char except " or \
(?:\\.[^"\\]*)*" : Optionally repeat matching \ followed by any char (the . is what lets \ escape any char), then again any chars other than " and \, and finally match the closing "
) : Close lookahead
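To see which commas the lookahead accepts, a small check along these lines can help. This is only a sketch: the class name LookaheadDemo, the method splitPositions and the short sample string are made up for illustration and are not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // Regex: ,(?="[^"\\]*(?:\\.[^"\\]*)*")
    // Written as a Java string literal, so every \ and " is escaped once more.
    private static final Pattern DELIMITER =
            Pattern.compile(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");

    // Returns the index of every comma that the pattern treats as a delimiter.
    public static List<Integer> splitPositions(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = DELIMITER.matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Hypothetical sample; its content is: "a,b","c [\"d,e\"]"
        String sample = "\"a,b\",\"c [\\\"d,e\\\"]\"";
        System.out.println(sample);
        System.out.println(splitPositions(sample));
    }
}
```

For this sample only the comma at index 5 is reported. The commas inside a,b and d,e are skipped because they are not directly followed by an opening double quote.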
String string = "\"\",\"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]\",\"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\\\"BOLT,HI-JOK\\\"]\"\n";
// Regex: ,(?="[^"\\]*(?:\\.[^"\\]*)*")
String[] parts = string.split(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");
for (String part : parts)
    System.out.println(part);
Output
""
"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]"
"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
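Since the question uses Pattern.split(String), the same expression can also be compiled once and reused, which avoids recompiling the regex on every call when many lines are processed. The following is a minimal sketch under that assumption. The class name QuotedFieldSplitter and the shortened sample line are hypothetical.

```java
import java.util.regex.Pattern;

public class QuotedFieldSplitter {
    // Regex: ,(?="[^"\\]*(?:\\.[^"\\]*)*")
    // Compiled once so repeated calls reuse the same Pattern instance.
    private static final Pattern DELIMITER =
            Pattern.compile(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");

    // Splits on commas that are directly followed by a complete quoted field.
    public static String[] split(String line) {
        return DELIMITER.split(line);
    }

    public static void main(String[] args) {
        // Hypothetical short line; its content is: "","x,- [A]","y [\"B,C\"]"
        String line = "\"\",\"x,- [A]\",\"y [\\\"B,C\\\"]\"";
        for (String part : split(line)) {
            System.out.println(part);
        }
    }
}
```

On this sample the call returns three parts: the empty quoted field, "x,- [A]" and "y [\"B,C\"]", each still wrapped in its double quotes.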