I have the following example string:
A|B\|C\\|D\\\|E\\\\|F
with | being the delimiter and \ being the escape character. A proper split should look as follows:
A B\|C\\ D\\\|E\\\\ F
I also need this logic to be generally applicable in case the delimiter or the escape consists of multiple characters.
I already have a regex which splits at the correct position, but it does not produce the desired output:
Regex:
(?<!\Q\\E)(?:(\Q\\\E)*)\Q|\E
Output:
A B\|C D\\\|E F
I usually test at https://regex101.com/, but I am working in Java, so I have a few more capabilities available.
I also tried the following, without success (it doesn’t work on the web page at all, and in Java it simply doesn’t produce the desired result):
(?=(\Q\\\E){0,5})(?<!\Q\\E)\Q|\E
Answer
Extracting approach
A matching approach is the most stable option, as it allows an arbitrary number of escaping chars. You can use
(?s)(?:\\.|[^\\|])+
See the regex demo. Details:
(?s) – Pattern.DOTALL embedded flag option
(?:\\.|[^\\|])+ – one or more repetitions of \ and then any one char, or any char but \ and |.
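See the Java demo:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The input is A|B\|C\\|D\\\|E\\\\|F; every backslash is doubled in Java source.
String s = "A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F";
Pattern pattern = Pattern.compile("(?:\\\\.|[^\\\\|])+", Pattern.DOTALL);
Matcher matcher = pattern.matcher(s);
List<String> results = new ArrayList<>();
while (matcher.find()) {
    results.add(matcher.group()); // each match is one complete field
}
System.out.println(results); // => [A, B\|C\\, D\\\|E\\\\, F]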
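The question also asks for delimiters and escape sequences that are longer than one character, which a character class such as [^\\|] cannot express. One possible generalization, sketched here under a few assumptions, tokenizes the input with a single alternation built from Pattern.quote: an escape sequence first, then a bare delimiter, then any other character. The class and method names (EscapedTokenSplitter, split) are invented for this illustration, both tokens are assumed to be non-empty, and unlike the + based pattern this variant keeps empty fields, so treat it as a sketch rather than a drop-in replacement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class EscapedTokenSplitter {

    // Splits input at unescaped delimiters; escape sequences are kept verbatim.
    // Assumes delimiter and escape are both non-empty.
    static List<String> split(String input, String delimiter, String escape) {
        String d = Pattern.quote(delimiter); // literal delimiter, any length
        String e = Pattern.quote(escape);    // literal escape, any length
        // Alternatives are tried left to right:
        //   1. escape followed by a delimiter, an escape, or any one char
        //   2. a bare delimiter (captured in group 1)
        //   3. any other single char
        Pattern token = Pattern.compile(
                e + "(?:" + d + "|" + e + "|.)|(" + d + ")|.", Pattern.DOTALL);

        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Matcher m = token.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                // Unescaped delimiter: close the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                // Escape sequence or ordinary char: keep it as is.
                current.append(m.group());
            }
        }
        fields.add(current.toString()); // last field (may be empty)
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F", "|", "\\"));
        // => [A, B\|C\\, D\\\|E\\\\, F]
        System.out.println(split("a::b^^::c::::d", "::", "^^"));
        // => [a, b^^::c, , d]
    }
}
```

Listing the escape alternative first is what makes an escaped delimiter part of the field: by the time the bare-delimiter alternative is tried, any escaped occurrence has already been consumed.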
Splitting approach (workaround for split)
You may (ab)use the constrained-width lookbehind pattern support in Java regex and use a limiting quantifier like {0,1000} instead of the * quantifier. A workaround would look like
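import java.util.Arrays;

// The input is A|B\|C\\|D\\\|E\\\\|F; every backslash is doubled in Java source.
String s = "A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F";
// Split at a | that is preceded by an even number (0 to 2000) of backslashes.
String[] results = s.split("(?<=(?<!\\\\)(?:\\\\{2}){0,1000})\\|");
System.out.println(Arrays.toString(results));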
See this Java demo.
Note that the (?:\\{2}){0,1000} part only allows up to 1000 escaped backslashes (2000 backslash characters) directly before the delimiter. That should suffice in most cases, I believe, but you might want to test this first. I’d still recommend the first solution.
Details:
(?<= – start of a positive lookbehind:
(?<!\\) – a location not immediately preceded with a \
(?:\\{2}){0,1000} – zero to one thousand occurrences of a double backslash
) – end of the positive lookbehind
\| – a | char.
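To cover the multi-character requirement from the question with the splitting workaround as well, the same lookbehind can be assembled from quoted literals. The following is only a sketch under assumptions: the class and method names (LookbehindSplitter, splitPattern, split) are invented here, the limit of 1000 is carried over from the pattern above, and an escape sequence whose characters can overlap with themselves or with the delimiter may need extra care.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class LookbehindSplitter {

    // Builds (?<=(?<!ESC)(?:ESC ESC){0,max})DELIM with both tokens quoted literally.
    static Pattern splitPattern(String delimiter, String escape, int maxEscapedEscapes) {
        String d = Pattern.quote(delimiter);
        String e = Pattern.quote(escape);
        String regex = "(?<=(?<!" + e + ")(?:" + e + e + "){0," + maxEscapedEscapes + "})" + d;
        return Pattern.compile(regex);
    }

    // Splits at delimiters preceded by an even number of escape sequences.
    static String[] split(String input, String delimiter, String escape) {
        return splitPattern(delimiter, escape, 1000).split(input);
    }

    public static void main(String[] args) {
        String[] parts = split("A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F", "|", "\\");
        System.out.println(Arrays.toString(parts));
    }
}
```

Like String.split, Pattern.split without a limit argument drops trailing empty strings; pass a negative limit if trailing empty fields matter.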