How to properly split on a non escaped delimiter?

I have the following example string:

JavaScript

with | being the delimiter and \ being the escape character. A proper split should look as follows:

JavaScript

I also need this logic to be generally applicable in case the delimiter or the escape consists of multiple characters.

I already have a regex which splits at the correct position, but it does not produce the desired output:

Regex:

JavaScript

Output:

JavaScript

I usually test at https://regex101.com/, but I am working in Java, so I have a few more capabilities.

I also tried the following without success (it does not work on the webpage, and in Java it simply does not produce the desired result):

JavaScript

Answer

Extracting approach

You can use a matching approach, as it is the most stable and allows an arbitrary number of escaping chars. You can use

(?s)(?:\\.|[^\\|])+

See the regex demo. Details:

  • (?s) – Pattern.DOTALL embedded flag option
  • (?:\\.|[^\\|])+ – one or more repetitions of a \ followed by any one char, or any char other than \ and |.

See the Java demo:

JavaScript
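A minimal sketch of how the matching approach could be used from Java, assuming the pattern described in the details above. The class name, the helper method tokens and the sample input are hypothetical and only serve as illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedSplitMatch {
    // (?s) lets '.' match line breaks as well. Each token is a run of either
    // a backslash followed by any one char, or any char other than '\' and '|'.
    private static final Pattern TOKEN = Pattern.compile("(?s)(?:\\\\.|[^\\\\|])+");

    // Hypothetical helper: collects every match as one field.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input (not from the question): a|b\|c|d\\|e
        String input = "a|b\\|c|d\\\\|e";
        System.out.println(tokens(input)); // prints [a, b\|c, d\\, e]
    }
}
```

Because the pattern ends in +, every match contains at least one character, so empty fields (as in a||b) are skipped rather than returned as empty strings. The escape characters are also left in the tokens and would need to be removed in a separate step if unescaped values are required.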

Splitting approach (workaround for split)

You may (ab)use the constrained-width lookbehind pattern support in Java regex and use a limiting quantifier like {0,1000} instead of the * quantifier. A workaround would look like

(?<=(?<!\\)(?:\\{2}){0,1000})\|

See this Java demo.

Note that the (?:\\{2}){0,1000} part only allows up to 1000 pairs of escaping backslashes before the delimiter. That should suffice in most cases, I believe, but you might want to test this first. I'd still recommend the first solution.

Details:

  • (?<= – start of a positive lookbehind:
    • (?<!\\) – a location not immediately preceded by a \
    • (?:\\{2}){0,1000} – zero to one thousand occurrences of a double backslash
  • ) – end of the positive lookbehind
  • \| – a literal | char.
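The splitting workaround could be sketched in Java as follows, assuming the lookbehind pattern broken down in the details above. The class name, the helper method split and the sample input are hypothetical, and passing -1 as the limit is an added choice so that trailing empty fields are kept:

```java
import java.util.Arrays;
import java.util.List;

class EscapedSplitLookbehind {
    // Split at '|' only when it is preceded by an even number of backslashes
    // (0 to 1000 pairs), i.e. when the '|' itself is not escaped.
    private static final String DELIMITER = "(?<=(?<!\\\\)(?:\\\\{2}){0,1000})\\|";

    // Hypothetical helper: limit -1 keeps trailing empty fields.
    static List<String> split(String input) {
        return Arrays.asList(input.split(DELIMITER, -1));
    }

    public static void main(String[] args) {
        // Assumed sample input (not from the question): a|b\|c|d\\|e
        String input = "a|b\\|c|d\\\\|e";
        System.out.println(split(input)); // prints [a, b\|c, d\\, e]
    }
}
```

Unlike the matching approach, this variant keeps empty fields, so a||b yields three elements with an empty string in the middle.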