I'm trying to write a lexer in Java using regex for a custom markdown “language” I'm making. It's my first time working with this stuff, so I'm a little lost on a few things.
An example of a possible syntax in it is:
Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!
I was able to capture a few things. For example, I'm using (?<hex><#\w+>)
to capture the “hex” and (?<action>\[[^]]*]\([^]]*\))
to get the entire “action” block.
My problem is capturing everything together, i.e. how to combine the patterns. For example, the lexer needs to output something like:
TEXT - Some
HEX - <#000000>
TEXT - *text*
ACTION - [<#ffffff>Some more](action: Other <#gradient>text)
TEXT - and **finally** some more
HEX - <#000>
TEXT - text!
I’ll handle the bold and italic later.
Would love just some suggestions on how to combine all of them!
Answer
One option is an alternation that matches each of the separate parts. For the text part you could use, for example, a character class [\w!* ]+
In Java, you can then check which named capturing group matched.
(?<hex><#\w+>)|(?<action>\[[^]]*]\([^]]*\))|(?<text>[\w!* ]+)
Explanation
(?<hex><#\w+>) Capture group hex, match <# followed by 1+ word chars and >
| Or
(?<action> Capture group action
\[[^]]*]\([^]]*\) Match [...] followed by (...)
) Close group
| Or
(?<text>[\w!* ]+) Capture group text, match 1+ times any char listed in the character class
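One thing to keep in mind with the text character class: Matcher.find() silently skips any character that none of the alternatives match (a comma or a period, for example). Here is a minimal sketch, assuming you want to notice such gaps, that compares the start of each match with the end of the previous one. The class name GapCheck and the method findGaps are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GapCheck {
    // Same alternation as above, with backslashes doubled for a Java string literal.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<hex><#\\w+>)|(?<action>\\[[^]]*]\\([^]]*\\))|(?<text>[\\w!* ]+)");

    // Returns every stretch of input that none of the alternatives matched.
    public static List<String> findGaps(String input) {
        List<String> gaps = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int lastEnd = 0;
        while (m.find()) {
            if (m.start() > lastEnd) {
                // find() jumped over these characters to reach the next match
                gaps.add(input.substring(lastEnd, m.start()));
            }
            lastEnd = m.end();
        }
        if (lastEnd < input.length()) {
            // Unmatched tail after the last token
            gaps.add(input.substring(lastEnd));
        }
        return gaps;
    }

    public static void main(String[] args) {
        // The comma and the period are not in [\w!* ], so they show up as gaps.
        System.out.println(findGaps("Hello, <#fff>world."));
    }
}
```

If the list comes back non-empty, you would probably either widen the text character class or report the skipped characters as an error, depending on how strict the language should be.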
Example code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Backslashes are doubled because the regex lives in a Java string literal.
String regex = "(?<hex><#\\w+>)|(?<action>\\[[^]]*]\\([^]]*\\))|(?<text>[\\w!* ]+)";
String string = "Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!";

Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);

// For each match, exactly one of the named groups is non-null.
while (matcher.find()) {
    if (matcher.group("hex") != null) {
        System.out.println("HEX - " + matcher.group("hex"));
    }
    if (matcher.group("text") != null) {
        System.out.println("TEXT - " + matcher.group("text"));
    }
    if (matcher.group("action") != null) {
        System.out.println("ACTION - " + matcher.group("action"));
    }
}
Output
TEXT - Some
HEX - <#000000>
TEXT - *text*
ACTION - [<#ffffff>Some more](action: Other <#gradient>text)
TEXT - and **finally** some more
HEX - <#000>
TEXT - text!
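Since the goal is a lexer, you will probably want to collect the matches rather than print them. The following is only a sketch under that assumption: the names LexerSketch, Type, Token and tokenize are hypothetical, and it reuses the same alternation to build a list of typed tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LexerSketch {
    // Hypothetical token kinds, one per named group in the regex.
    public enum Type { HEX, ACTION, TEXT }

    // Hypothetical value holder for one lexed piece of input.
    public static final class Token {
        public final Type type;
        public final String value;

        public Token(Type type, String value) {
            this.type = type;
            this.value = value;
        }

        @Override
        public String toString() {
            return type + " - " + value;
        }
    }

    private static final Pattern TOKEN = Pattern.compile(
        "(?<hex><#\\w+>)|(?<action>\\[[^]]*]\\([^]]*\\))|(?<text>[\\w!* ]+)");

    // Walks the input left to right and records which alternative matched.
    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group("hex") != null) {
                tokens.add(new Token(Type.HEX, m.group("hex")));
            } else if (m.group("action") != null) {
                tokens.add(new Token(Type.ACTION, m.group("action")));
            } else {
                // Only the text alternative is left
                tokens.add(new Token(Type.TEXT, m.group("text")));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text)")) {
            System.out.println(t);
        }
    }
}
```

With the tokens in a list, a later pass could run the same kind of matching on the value of each ACTION token to pick out the hex codes nested inside it, and handle bold and italic inside TEXT tokens.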