Java and regex lexer




I am trying to write a lexer in Java using regex for a custom Markdown-like “language” I’m making. It’s my first time working with this stuff, so I’m a little lost on a few things.
An example of a possible syntax in it is:
Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!
I was able to capture a few things. For example, I’m using (?<hex><#\w+>) to capture the “hex” and (?<action>\[[^]]*]\([^]]*\)) to get the entire “action” block.
My problem is capturing it all together: how do I combine these patterns into one? For example, the lexer needs to output something like:

TEXT - Some
HEX - <#000000>
TEXT - *text*
ACTION - [<#ffffff>Some more](action: Other <#gradient>text)
TEXT - and **finally** some more
HEX - <#000>
TEXT - text!

I’ll handle the bold and italic later.
Would love just some suggestions on how to combine all of them!

Answer

One option is to use an alternation that matches each of the separate parts, and for the text part use, for example, a character class such as [\w!* ]+

In Java, you could check for the name of the capturing group.

(?<hex><#\w+>)|(?<action>\[[^]]*]\([^]]*\))|(?<text>[\w!* ]+)

Explanation

  • (?<hex><#\w+>) Capture group hex: match <#, then 1+ word chars, then >
  • | Or
  • (?<action> Capture group action
    • \[[^]]*]\([^]]*\) Match [...] followed by (...)
  • ) Close group
  • | Or
  • (?<text>[\w!* ]+) Capture group text: match 1+ times any char listed in the character class
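One detail of the action part may be worth checking before relying on it. The part in parentheses uses [^]]*, which excludes ] rather than ), so it is greedy up to the last ) that has no ] before it. The sketch below compares it with a stricter variant that stops at the first ). The class name ActionPatternCheck, the helper firstMatch and the STRICT pattern are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ActionPatternCheck {
    // The action alternative exactly as used in the answer.
    static final Pattern ORIGINAL = Pattern.compile("\\[[^]]*]\\([^]]*\\)");

    // Hypothetical stricter variant: the parenthesised part stops at the first ')'.
    static final Pattern STRICT = Pattern.compile("\\[[^]]*]\\([^)]*\\)");

    // Returns the first match of the pattern in the input, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "[a](b) and (c)";
        System.out.println(firstMatch(ORIGINAL, input)); // [a](b) and (c)
        System.out.println(firstMatch(STRICT, input));   // [a](b)
    }
}
```

For the example string in the question both patterns return the same action block, because no second ) follows it. The difference only shows up when plain text after an action contains parentheses.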


Example code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Backslashes are doubled because the regex is written as a Java string literal.
String regex = "(?<hex><#\\w+>)|(?<action>\\[[^]]*]\\([^]]*\\))|(?<text>[\\w!* ]+)";
String string = "Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!";

Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);

// Each match fills exactly one of the named groups; the other two are null.
while (matcher.find()) {
    if (matcher.group("hex") != null) {
        System.out.println("HEX - " + matcher.group("hex"));
    } else if (matcher.group("text") != null) {
        System.out.println("TEXT - " + matcher.group("text"));
    } else if (matcher.group("action") != null) {
        System.out.println("ACTION - " + matcher.group("action"));
    }
}

Output

TEXT - Some 
HEX - <#000000>
TEXT - *text* 
ACTION - [<#ffffff>Some more](action: Other <#gradient>text)
TEXT -  and **finally** some more 
HEX - <#000>
TEXT - text!
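
If the tokens are needed as objects rather than printed lines, the same loop could be wrapped roughly as follows. This is only a sketch under a few assumptions: the names MarkupLexer, Type, Token and lex are invented here, text tokens are trimmed to match the output asked for in the question, and input that no alternative matches is treated as an error. The last point matters because find() silently skips characters outside the pattern, so a comma in "Hello, world" would otherwise disappear without notice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkupLexer {
    // Hypothetical token kinds; the names mirror the labels in the output above.
    enum Type { TEXT, HEX, ACTION }

    // Hypothetical value class holding one token.
    static final class Token {
        final Type type;
        final String value;

        Token(Type type, String value) {
            this.type = type;
            this.value = value;
        }

        @Override
        public String toString() {
            return type + " - " + value;
        }
    }

    // Same alternation as in the answer. At each position the alternatives are
    // tried left to right and the first one that matches wins.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<hex><#\\w+>)|(?<action>\\[[^]]*]\\([^]]*\\))|(?<text>[\\w!* ]+)");

    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0; // end index of the previous match
        while (m.find()) {
            // A gap between two matches means find() skipped unrecognised input.
            if (m.start() != pos) {
                throw new IllegalArgumentException("Unexpected input at index "
                        + pos + ": " + input.substring(pos, m.start()));
            }
            if (m.group("hex") != null) {
                tokens.add(new Token(Type.HEX, m.group("hex")));
            } else if (m.group("action") != null) {
                tokens.add(new Token(Type.ACTION, m.group("action")));
            } else {
                // Assumption: surrounding whitespace is not significant.
                String text = m.group("text").trim();
                if (!text.isEmpty()) {
                    tokens.add(new Token(Type.TEXT, text));
                }
            }
            pos = m.end();
        }
        if (pos != input.length()) {
            throw new IllegalArgumentException("Unexpected input at index "
                    + pos + ": " + input.substring(pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text)"
                + " and **finally** some more <#000>text!";
        for (Token token : lex(input)) {
            System.out.println(token);
        }
    }
}
```

Whether to trim is a design choice: if whitespace matters when rendering, it is probably safer to keep the text tokens untrimmed. To accept punctuation instead of rejecting it, the text character class can be widened.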


Source: stackoverflow