Get all possible matches in a regex match (In Java)?

I am using a regex to match a few possible values in a string that comes with my objects. I need to get every possible value in that string that matches, as shown below.

Say my string value is "This is the code ABC : xyz use for something".

Here is the code I am using to extract the matches:

String my_regex = "(ABC|ABC :).*";

List<String> matchers = Pattern.compile(my_regex, Pattern.CASE_INSENSITIVE)
                               .matcher(my_string)
                               .results()
                               .map(MatchResult::group)
                               .collect(Collectors.toList());

I am expecting 2 list items as the output, {"ABC", "ABC :"}, but I am only getting one. Help would be highly appreciated.

Answer

What you describe just isn’t how regex engines work. They do not enumerate every way the pattern could match. They scan the input from left to right, report one match at each place where the pattern succeeds, and then resume searching after the end of that match. In other words, had you written:

String my_regex = "(ABC|ABC :)"; // note, get rid of the .*
String myString = "This is the code ABC : xyz use for something ABC again";

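Then you’d get 2 results back: ABC and ABC. To get ABC : as the first result, list the longer alternative first: (ABC :|ABC).

Yes, the regex could just as easily have matched the ABC : part instead of just the ABC part and that would still be valid, but the engine reports only one match per position. For an alternation, java.util.regex tries the alternatives from left to right and keeps the first one that lets the overall match succeed; it does not look for the longest. Quantifiers are a separate matter: by default they are ‘greedy’ and match as much as they can. For quantifiers such as * and + you can use the non-greedy variants, *? and +?, which match as little as possible.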
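
How the alternation itself is resolved is easy to check on a real engine. The following is a minimal sketch, not part of the original question: the class name AlternationOrder and the helper allMatches are made up for illustration, and Matcher.results() needs Java 9 or later. It runs the pattern with the alternatives in both orders:

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AlternationOrder {
    // Collects the text of every match that repeated find() calls produce.
    static List<String> allMatches(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                .matcher(input)
                .results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "This is the code ABC : xyz use for something ABC again";
        // Shorter alternative listed first.
        System.out.println(allMatches("(ABC|ABC :)", input));
        // Longer alternative listed first.
        System.out.println(allMatches("(ABC :|ABC)", input));
    }
}
```

On java.util.regex this prints [ABC, ABC] and then [ABC :, ABC]: alternatives are tried from left to right and the first one that succeeds is kept, so the order in which you write them decides the result.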

In other words, given:

String regex = "(a*?)(a+)";
String myString = "aaaaa";

Then group 1 would match the empty string (that’s the shortest string that (a*?) can match whilst still letting the entire regex match the input), and group 2 would be aaaaa.
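
To see the group assignment for yourself, here is a small sketch. The class name GroupSplit and the helper groups are hypothetical; it uses matches() so that the whole input has to be consumed, and it runs the greedy form of the same pattern as well for contrast:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSplit {
    // Returns {group 1, group 2} when regex matches the whole input, or null otherwise.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] lazy = groups("(a*?)(a+)", "aaaaa");
        System.out.println("[" + lazy[0] + "] [" + lazy[1] + "]");     // [] [aaaaa]

        String[] greedy = groups("(a*)(a+)", "aaaaa");
        System.out.println("[" + greedy[0] + "] [" + greedy[1] + "]"); // [aaaa] [a]
    }
}
```

Both patterns match the same five characters; only the division between the two groups differs.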

If, on the other hand, you wrote (a*)(a+), then group 1 would be aaaa and group 2 would be a. It is not possible to ask the regex engine for the combinatorial explosion of every possible split of the a’s between the groups, which appears to be what you want. The regex API that ships with Java has no option for this, nor does any other regex API I know of, so you’d have to write that yourself. I admit I haven’t scoured the web for every alternative regex engine for Java; there are a number of third-party libraries, and perhaps one of them can do it.
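
As a rough idea of what "writing it yourself" could look like, the sketch below brute-forces the problem by testing every substring of the input for a full match. This is an assumption about what is wanted, not a built-in feature: the class AllPossibleMatches and method enumerate are made-up names, it lists matching substrings rather than group splits, and it runs the regex on the order of n² times, so it only suits short inputs.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllPossibleMatches {
    // Tries every non-empty substring [start, end) and keeps those the regex matches in full.
    static Set<String> enumerate(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        Set<String> found = new LinkedHashSet<>();
        for (int start = 0; start < input.length(); start++) {
            for (int end = start + 1; end <= input.length(); end++) {
                m.region(start, end); // resets the matcher and restricts it to this substring
                if (m.matches()) {    // the whole region must match, so every alternative gets a chance
                    found.add(m.group());
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "This is the code ABC : xyz use for something";
        System.out.println(enumerate("ABC|ABC :", input)); // [ABC, ABC :]
    }
}
```

Note that the pattern here has no trailing .*; with it, every substring that starts at ABC would count as a match.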

NB: I said at the start: Get rid of the .*. That’s because otherwise it’s still just the one match. The .* is greedy and swallows the rest of the input, so the single match is ABC : xyz use for something ABC again. That is a valid ‘interpretation’ of your string (1 match) that leaves nothing behind for a second find() to discover.
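
The effect of the trailing .* on the number of matches can be checked with a plain find() loop. A minimal sketch, with the hypothetical names TrailingWildcard and countMatches:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingWildcard {
    // Counts how many times find() succeeds before the input is exhausted.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "This is the code ABC : xyz use for something ABC again";
        System.out.println(countMatches("(ABC|ABC :).*", input)); // 1: .* consumes the rest of the input
        System.out.println(countMatches("(ABC|ABC :)", input));   // 2: each ABC is found separately
    }
}
```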

NB2: Greediness can never change whether a regex matches at all. It only changes which part of the input is assigned to which group and, when find()ing more than once (which .results() does: it find()s until no more matches are found), which matches you get.
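
A short illustration of that last point, assuming Java 9 or later for Matcher.results(); the class name GreedyVsLazyFind and the helper allMatches are made up for this sketch. Both quantifiers find something in the input, but repeated find() calls carve it up differently:

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class GreedyVsLazyFind {
    // Collects the text of every match reported by Matcher.results().
    static List<String> allMatches(String regex, String input) {
        return Pattern.compile(regex)
                .matcher(input)
                .results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(allMatches("a+", "aaa"));  // [aaa]: one match that takes everything
        System.out.println(allMatches("a+?", "aaa")); // [a, a, a]: three minimal matches
    }
}
```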

User contributions licensed under: CC BY-SA