
How to properly split on a non escaped delimiter?

I have the following example string:

A|B\|C\\|D\\\|E\\\\|F

with | being the delimiter and \ being the escape character. A proper split should look as follows:

A
B\|C\\
D\\\|E\\\\
F

I also need this logic to be generally applicable in case the delimiter or the escape consists of multiple characters.

I already have a regex which splits at the correct position, but it does not produce the desired output:

Regex:

(?<!\Q\\E)(?:(\Q\\\E)*)\Q|\E

Output:

A
B\|C
D\\\|E
F

I usually test at https://regex101.com/, but I am working in Java, so I have a few more capabilities.

I also tried the following, with no success either (it doesn't work on the webpage, and in Java it just doesn't produce the desired result):

(?=(\Q\\\E){0,5})(?<!\Q\\E)\Q|\E


Answer

Extracting approach

You can use a matching approach, as it is the most stable and allows an arbitrary number of escaping chars. You can use

(?s)(?:\\.|[^\\|])+

See the regex demo. Details:

  • (?s) – Pattern.DOTALL embedded flag option
  • (?:\\.|[^\\|])+ – one or more repetitions of either a \ followed by any one char, or any char other than \ and |.

See the Java demo:

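import java.util.*;
import java.util.regex.*;

String s = "A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F";
// Each match is a run of escaped chars (backslash + any char) or chars other than \ and |
Pattern pattern = Pattern.compile("(?:\\\\.|[^\\\\|])+", Pattern.DOTALL);
Matcher matcher = pattern.matcher(s);
List<String> results = new ArrayList<>();
while (matcher.find()) {
    results.add(matcher.group());
}
System.out.println(results);
// => [A, B\|C\\, D\\\|E\\\\, F]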
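
Since the question also asks about delimiters or escapes made of several characters, here is a minimal sketch of how the matching approach might be generalized. The class name EscapedTokenizer and the method tokenize are hypothetical, and the sketch assumes that both strings are non-empty and that an escape protects either another escape, a whole delimiter, or any single char. The delimiter is matched as a separate alternative so that a multi-character delimiter is skipped as a whole:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedTokenizer {

    // Hypothetical helper: returns the non-empty tokens found between
    // unescaped delimiters. Escape sequences are kept verbatim in the tokens.
    static List<String> tokenize(String input, String delimiter, String escape) {
        String d = Pattern.quote(delimiter); // literal delimiter
        String e = Pattern.quote(escape);    // literal escape
        // A token is one or more of: an escape followed by an escape, a delimiter
        // or any one char; or any char that does not start a delimiter.
        String token = "(?:" + e + "(?:" + e + "|" + d + "|.)|(?!" + d + ").)+";
        // An unescaped delimiter is consumed by the first alternative;
        // tokens are captured in group 1.
        Pattern p = Pattern.compile(d + "|(" + token + ")", Pattern.DOTALL);
        Matcher m = p.matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            if (m.group(1) != null) {
                out.add(m.group(1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F", "|", "\\"));
        // prints [A, B\|C\\, D\\\|E\\\\, F]
        System.out.println(tokenize("a^::b::c", "::", "^"));
        // prints [a^::b, c]
    }
}
```

Like the pattern above, this sketch only returns non-empty tokens, so A||B yields [A, B].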

Splitting approach (workaround for split)

You may (ab)use the constrained-width lookbehind support in Java regex and use a limiting quantifier like {0,1000} instead of the * quantifier. A workaround would look like

import java.util.Arrays;
String s = "A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F";
String[] results = s.split("(?<=(?<!\\\\)(?:\\\\{2}){0,1000})\\|");
System.out.println(Arrays.toString(results)); // => [A, B\|C\\, D\\\|E\\\\, F]

See this Java demo.

Note that the (?:\\{2}){0,1000} part only allows up to 1000 pairs of escaping backslashes before a delimiter. That should suffice in most cases, I believe, but you might want to test this first. I'd still recommend the first solution.

Details:

  • (?<= – start of a positive lookbehind:
    • (?<!\\) – a location not immediately preceded with a \
    • (?:\\{2}){0,1000} – zero to one thousand occurrences of a double backslash
  • ) – end of the positive lookbehind
  • \| – a | char.
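
The same workaround can be parameterized. The sketch below builds the split regex with Pattern.quote. The class name SplitRegexBuilder, the method build and the maxPairs parameter are hypothetical names, and the sketch assumes a non-empty delimiter and an escape string that cannot overlap with itself:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitRegexBuilder {

    // Hypothetical helper: builds a regex that matches a delimiter preceded by
    // an even number (0, 2, ..., 2 * maxPairs) of escape sequences.
    static String build(String delimiter, String escape, int maxPairs) {
        String d = Pattern.quote(delimiter); // literal delimiter
        String e = Pattern.quote(escape);    // literal escape
        // Bounded repetition keeps the lookbehind width finite, which Java requires.
        return "(?<=(?<!" + e + ")(?:" + e + e + "){0," + maxPairs + "})" + d;
    }

    public static void main(String[] args) {
        String s = "A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F";
        String[] parts = s.split(build("|", "\\", 1000));
        System.out.println(Arrays.toString(parts));
        // prints [A, B\|C\\, D\\\|E\\\\, F]
    }
}
```

Unlike the matching approach, String.split keeps empty tokens between adjacent delimiters, and it drops trailing empty strings unless a negative limit is passed.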