
Separate definitions of decimal number and word in ANTLR grammar

I’m working on defining a grammar in ANTLR4 which includes words and numbers separately.

Numbers are described:

NUM
   : INTEGER+ ('.' INTEGER+)?
   ;

fragment INTEGER
   : ('0' .. '9')
   ;

and words are described:

WORD
   : VALID_CHAR +
   ;

fragment VALID_CHAR
   : ('a' .. 'z') | ('A' .. 'Z') 
   ;

The simplified grammar below describes addition between words and numbers (and needs to be defined recursively like this):

expression
   :  left = expression '+' right = expression #addition
   |  value = WORD #word
   |  value = NUM #num
   ;

The issue is that when I enter ‘d3’ into the parser, I get a returned instance of a Word ‘d’. Similarly, entering 3f returns a Number of value 3. Is there a way to ensure that ‘d3’ or any similar strings returns an error message from the grammar?

I’ve looked at the ‘~’ symbol but that seems to be ‘everything except’, rather than ‘only’.

To summarize, I’m looking for a way to ensure that ONLY a series of letters can be parsed to a Word, and contain no other symbols. Currently, the grammar seems to ignore any additional disallowed characters.

Similar to the message received when ‘3+’ is entered:

simpleGrammar::compileUnit:1:2: mismatched input '<EOF>' expecting {WORD, NUM}

At present, the following occurs:

d --> (d) (word) (correct)

22.3 --> (22.3) number (correct)

d3 --> d (word) (incorrect)
 
22f.4 --> 22 (number) (incorrect)

But ideally the following would happen :

d --> (d) (word) (correct)

22.3 --> (22.3) number (correct)

d3 --> (error)

22f.4 --> (error)

Answer

[Revised in response to the revised question and comments]

ANTLR will attempt to match what it can in your input stream and then stop once it has reached the longest recognizable input. That means the best ANTLR could do with your input was to recognize a word (‘d’), and then it quit, because it could not match the rest of your input to any of your rules (using the root expression rule).

You can add a rule to tell ANTLR that it needs to consume the entire input, with a top-level rule something like:

root: expression EOF;

With this rule in place you’ll get ‘mismatched input’ at the ‘3’ in ‘d3’.

This same rule would give a ‘mismatched input’ at the ‘f’ character in ‘22f.4’.
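To see why the trailing EOF matters, here is a minimal sketch in plain Python (not ANTLR itself). It assumes a hand-written tokenizer and a tiny parser for operand ('+' operand)*; the names TOKEN, tokenize, parse and require_eof are hypothetical and only mimic the grammar in the question. Without the EOF check the parser is satisfied after ‘d’ and silently leaves ‘3’ unread; with it, the leftover token becomes an error.

```python
import re

# Hypothetical token pattern mirroring NUM, WORD and '+' from the question.
TOKEN = re.compile(r"(?P<NUM>[0-9]+(?:\.[0-9]+)?)|(?P<WORD>[a-zA-Z]+)|(?P<PLUS>\+)")

def tokenize(text):
    """Turn text into (type, lexeme) pairs and append an explicit EOF token."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"token recognition error at {text[pos]!r}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    tokens.append(("EOF", "<EOF>"))
    return tokens

def parse(text, require_eof):
    """Parse operand ('+' operand)* and return the lexemes actually consumed."""
    tokens = tokenize(text)
    i = 0
    consumed = []

    def operand():
        nonlocal i
        kind, lexeme = tokens[i]
        if kind not in ("WORD", "NUM"):
            raise SyntaxError(f"mismatched input {lexeme!r} expecting WORD or NUM")
        consumed.append(lexeme)
        i += 1

    operand()
    while tokens[i][0] == "PLUS":
        consumed.append("+")
        i += 1
        operand()
    # This check plays the role of the EOF in "root: expression EOF;".
    if require_eof and tokens[i][0] != "EOF":
        raise SyntaxError(f"unexpected input {tokens[i][1]!r} expecting <EOF>")
    return consumed

print(parse("d3", require_eof=False))  # ['d']
try:
    parse("d3", require_eof=True)
except SyntaxError as err:
    print(err)  # unexpected input '3' expecting <EOF>
```

The exact wording of ANTLR's own message will differ; the point is only that the leftover token is reported instead of ignored.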


That should address the specific question you’ve asked, and, hopefully, is sufficient to meet your needs. The following discussion is reading a bit into your comment, and maybe assuming too much about what you want in the way of error messages.

Your comment (sort of) implies that you’d prefer to see error messages along the lines of “you have a digit in your word” or “you have a letter in your number”.

It helps to understand ANTLR’s pipeline for processing your input. First it processes your input stream using the Lexer rules (rules beginning with capital letters) to create a stream of tokens.

Your ‘d3’ input produces a stream of 2 tokens with your current grammar:

WORD ('d')
NUM ('3')

This stream of tokens is what is being matched against in your parser rules (i.e. expression).

‘22f.4’ results in the stream:

NUM ('22')
WORD ('f') 
(I would expect an error here as there is no Lexer rule that matches a stream of characters beginning with a '.')

As soon as ANTLR saw something other than a digit (or ‘.’) while matching your NUM rule, it considered what it had matched so far to be the contents of the NUM token, put it into the token stream, and moved on. (The same happens when it finds a digit while matching a WORD.)

This is standard lexing/parsing behavior.
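The tokenization described above can be imitated with a short Python sketch. This is not ANTLR; it is a simulation that assumes two regular expressions standing in for the NUM and WORD lexer rules, and the names RULES and tokenize are made up for illustration. At each position it takes the longest match, and a character that no rule can start with is recorded as an error and skipped.

```python
import re

# Hypothetical stand-ins for the NUM and WORD lexer rules in the question.
RULES = [
    ("NUM", re.compile(r"[0-9]+(?:\.[0-9]+)?")),
    ("WORD", re.compile(r"[a-zA-Z]+")),
]

def tokenize(text):
    """Split text into (type, lexeme) pairs: longest match, earlier rule wins ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            # No rule can start here: record the character and skip it.
            tokens.append(("ERROR", text[pos]))
            pos += 1
        else:
            tokens.append((best[0], text[pos:best[1]]))
            pos = best[1]
    return tokens

print(tokenize("d3"))     # [('WORD', 'd'), ('NUM', '3')]
print(tokenize("22f.4"))  # [('NUM', '22'), ('WORD', 'f'), ('ERROR', '.'), ('NUM', '4')]
```

Neither input is a lexing problem on its own terms: each piece is a perfectly valid token, which is why the complaint can only come from the parser.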

You can implement your own ErrorListener, where ANTLR will hand you the details of the error it encountered, and you could word your error message as you see fit, but I think you’ll find it tricky to hit what seems to be your target. You would not have enough context in the error handler to know what came immediately before, etc., and even if you did, this would get very complicated very fast.

If you always want some sort of whitespace to occur between NUMs and WORDs, you could do something like defining the following Lexer rule:

BAD_ATOM: (INTEGER|VALID_CHAR|'.')+;

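(Put it last in the grammar. The lexer prefers the longest match, and when two rules match the same text the one defined first wins, so valid input still comes out as NUM or WORD.)

Then when a parser rule errors out on a BAD_ATOM token, you could inspect it and provide a more specific error message.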

Warning: This is a bit unorthodox, and could introduce constraints on what you could allow as you build up your grammar. That said, it’s not uncommon to find a “catch-all” Lexer rule at the bottom of a grammar that some people use for better error messages and/or error recovery.
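As a rough illustration of how the catch-all interacts with the valid rules, here is a Python sketch (again a simulation, not ANTLR). It assumes regular expressions standing in for NUM, WORD and BAD_ATOM; first_token and describe are hypothetical helpers, and the message wording is only an example of what you might produce after inspecting the token.

```python
import re

# Hypothetical stand-ins for the lexer rules; BAD_ATOM is listed last on purpose.
RULES = [
    ("NUM", re.compile(r"[0-9]+(?:\.[0-9]+)?")),
    ("WORD", re.compile(r"[a-zA-Z]+")),
    ("BAD_ATOM", re.compile(r"[0-9a-zA-Z.]+")),
]

def first_token(text):
    """Return (type, lexeme) for the first token: longest match, earliest rule on ties."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text)
        # BAD_ATOM only wins when it matches strictly more text than NUM or WORD.
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    if best is None:
        raise SyntaxError(f"token recognition error at {text[:1]!r}")
    return best[0], text[:best[1]]

def describe(text):
    """Turn a BAD_ATOM into a more specific message (wording is illustrative only)."""
    kind, lexeme = first_token(text)
    if kind != "BAD_ATOM":
        return f"{kind} {lexeme!r}"
    if lexeme[0].isalpha():
        return f"error: {lexeme!r} looks like a word but contains non-letters"
    return f"error: {lexeme!r} looks like a number but is malformed"

for sample in ["d", "22.3", "d3", "22f.4"]:
    print(sample, "->", describe(sample))
```

Valid input ties with BAD_ATOM on length and so keeps its NUM or WORD type, while ‘d3’ and ‘22f.4’ are each swallowed whole as a single BAD_ATOM that the parser can then reject with a tailored message.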