I need to be able to turn a string, for instance "This and <those> are.", into a string array of the form ["This and ", "<those>", " are."]. I have been trying to use String.split(), and I've gotten as far as this regex:
"(?=[<>])"
However, this just gets me ["This and ", "<those", "> are."]. I can't figure out a good regex that keeps both brackets in the same element. The brackets also must not contain spaces, so for instance "This and <hey there> are." should not be split at all and should simply come back as ["This and <hey there> are."]. Ideally I'd like to rely solely on split for this operation. Can anyone point me in the right direction?
Answer
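Not really possible with split alone. Since the 'separator' needs to match 0 characters, it has to be all lookahead/lookbehind. Lookahead can scan as far as it likes, but to split right after the closing '>' you need a lookbehind that checks back to the matching '<' for spaces, and Java's lookbehind generally requires a bounded maximum length, while your bracketed text can be arbitrarily long. So, what you want? Not doable in the general case.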
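To see the problem concretely, here is a small sketch of what the lookahead-only separator from the question does. The class name SplitDemo and the method naiveSplit are made up for illustration; the regex is the one from the question.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
final class SplitDemo {
    // Splits at every position that is immediately followed by '<' or '>'.
    static String[] naiveSplit(String in) {
        return in.split("(?=[<>])");
    }

    public static void main(String[] args) {
        // The closing '>' ends up at the start of the next element.
        System.out.println(Arrays.toString(naiveSplit("This and <those> are.")));
        // prints: [This and , <those, > are.]

        // A bracket pair containing a space is still split, which is not wanted.
        System.out.println(Arrays.toString(naiveSplit("This and <hey there> are.")));
        // prints: [This and , <hey there, > are.]
    }
}
```

The lookahead alone has no way to put the '>' back with its '<', which is why finding the tokens directly is easier than splitting around them.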
Just write a regexp that FINDS the construct you want; that's a lot simpler. Simply Pattern.compile("<\\w+>") (taking a few liberties with what you intend a thing-in-brackets to look like; if it truly can be ANYTHING except spaces and the closing bracket, "<[^ >]+>" is what you want).
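If it helps to pick between the two, here is a rough sketch comparing them on one sample string. The class BracketPatterns, the constant names and the helper findAll are invented for this illustration, and the sample input is an assumption rather than anything from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison class, not part of the original answer.
final class BracketPatterns {
    // Strict: only letters, digits and underscore between the brackets.
    static final Pattern WORD_ONLY = Pattern.compile("<\\w+>");
    // Loose: anything except a space or the closing bracket.
    static final Pattern NO_SPACE = Pattern.compile("<[^ >]+>");

    // Collects every match of the given pattern, in order of appearance.
    static List<String> findAll(Pattern p, String in) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(in);
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        String sample = "<those> <hey there> <a-b>";
        System.out.println(findAll(WORD_ONLY, sample)); // prints: [<those>]
        System.out.println(findAll(NO_SPACE, sample));  // prints: [<those>, <a-b>]
    }
}
```

Both reject the bracket pair with a space in it; they only differ on punctuation such as the hyphen in <a-b>.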
Then, just loop through, finding as you go:
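    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Note the doubled backslash: "\\w" in Java source is the regex \w.
    private static final Pattern TOKEN_FINDER = Pattern.compile("<\\w+>");

    static List<String> parse(String in) {
        Matcher m = TOKEN_FINDER.matcher(in);
        // No bracketed token at all: return the input unsplit.
        if (!m.find()) return List.of(in);
        var out = new ArrayList<String>();
        int pos = 0;
        do {
            int s = m.start();
            // Text between the previous token and this one.
            if (s > pos) out.add(in.substring(pos, s));
            out.add(m.group());
            pos = m.end();
        } while (m.find());
        // Trailing text after the last token.
        if (pos < in.length()) out.add(in.substring(pos));
        return out;
    }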
Let’s try it:
    System.out.println(parse("This and <those> are."));
    System.out.println(parse("This and <hey there> are."));
    System.out.println(parse("<edgecase>"));
    System.out.println(parse("<edgecase>2"));
    System.out.println(parse("3<edgecase>"));
prints:
    [This and , <those>,  are.]
    [This and <hey there> are.]
    [<edgecase>]
    [<edgecase>, 2]
    [3, <edgecase>]
Seems like what you wanted.