Skip to content
Advertisement

How to parse a Clickhouse-SQL statement using ANTRL4?

Objective : Add an additional WHERE clause to any given Clickhouse statement.

I’m using the following Antlr grammars to generate Java classes for a lexer & parser.

Lexer grammar

https://github.com/ClickHouse/ClickHouse/blob/master/utils/antlr/ClickHouseLexer.g4

Parser grammar

https://github.com/ClickHouse/ClickHouse/blob/master/utils/antlr/ClickHouseParser.g4

Problem : I cannot figure out/understand how to interact or create the appropriate POJOs for use with the generated classes that Antlr produces.

Example of statement

String query = "INSERT INTO t VALUES (1, 'Hello, world'), (2, 'abc'), (3, 'def')"

Goal of SQL (enrichment code)

String enrichedQuery = SqlParser.enrich(query);
System.out.println(enrichedQuery);

//Output
>> INSERT INTO t VALUES (1, 'Hello, world'), (2, 'abc'), (3, 'def') (WHERE X IN USERS)

I have the follow Java main

public class Hello {

    public static void main( String[] args) throws Exception{
        
        String query = "INSERT INTO t VALUES (1, 'Hello, world'), (2, 'abc'), (3, 'def')"
        ClickhouseLexer = new ClickhouseLexer(new ANTLRInputStream(query));
        
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        ClickHouseParser = new ClickHouseParser (tokens);          
        ParseTreeWalker walker = new ParseTreeWalker();
    }
}

Advertisement

Answer

I’d suggest taking a look at TokenStreamRewriter.

First, let’s get the grammars ready.

1 – with TokenStreamRewriter we’ll want to preserve whitespace, so let’s change the -> skip directives to ->channel(HIDDEN)

At the end of the Lexer grammar:

// Comments and whitespace

MULTI_LINE_COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
SINGLE_LINE_COMMENT: '--' ~('n'|'r')* ('n' | 'r' | EOF) -> channel(HIDDEN);
WHITESPACE: [ u000Bu000Ctrn] -> channel(HIDDEN);  // 'n' can be part of multiline single query

2 – The C++ specific stuff just guards against using keywords more than once. You don’t really need that check for your purposes (and it could be done in a post-parse Listener if you DID need it). So let’s just lose the language specific stuff:

engineClause: engineExpr (
    orderByClause
    | partitionByClause
    | primaryKeyClause
    | sampleByClause
    | ttlClause
    | settingsClause
)*
;

and

dictionaryAttrDfnt
    : identifier columnTypeExpr (
        DEFAULT literal
        | EXPRESSION columnExpr
        | HIERARCHICAL
        | INJECTIVE
        | IS_OBJECT_ID
    )*
    ;
dictionaryEngineClause
    : dictionaryPrimaryKeyClause? (
        sourceClause
        | lifetimeClause
        | layoutClause
        | rangeClause
        | dictionarySettingsClause
    )*
    ;

NOTE: There seems to be an issue with the grammar not accepting the actual values for an insert statement:

insertStmt
    : INSERT INTO TABLE? (
        tableIdentifier
        | FUNCTION tableFunctionExpr
    ) columnsClause? dataClause
    ;

columnsClause
    : LPAREN nestedIdentifier (COMMA nestedIdentifier)* RPAREN
    ;
dataClause
    : FORMAT identifier              # DataClauseFormat
    | VALUES                         # DataClauseValues // <- problem on this line
    | selectUnionStmt SEMICOLON? EOF # DataClauseSelect
    ;

(I’m not going to try to fix that part, so I’ve commented your input to accommodate)

(It would also help if the top level rule needed with an EOF token; without that ANTLR just stops parsing after VALUE. An EOF at the end of a root rule is considered a best practice for exactly this reason.)

The Main program:

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenStreamRewriter;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class TSWDemo {
    public static void main(String... args) {
        new TSWDemo().run(CharStreams.fromString("INSERT INTO t VALUES /* (1, 'Hello, world'), (2, 'abc'), (3, 'def') */"));
    }

    public void run(CharStream charStream) {
        var lexer = new ClickHouseLexer(charStream);
        var tokenStream = new CommonTokenStream(lexer);
        var parser = new ClickHouseParser(tokenStream);
        var tsw = new TokenStreamRewriter(tokenStream);
        var listener = new TSWDemoListener(tsw);

        var queryStmt = parser.queryStmt();
        ParseTreeWalker.DEFAULT.walk(listener, queryStmt);

        System.out.println(tsw.getText());
    }
}

The Listener:

import org.antlr.v4.runtime.TokenStreamRewriter;

public class TSWDemoListener extends ClickHouseParserBaseListener {
    private TokenStreamRewriter tsw;

    public TSWDemoListener(TokenStreamRewriter tsw) {
        this.tsw = tsw;
    }
    @Override
    public void exitInsertStmt(ClickHouseParser.InsertStmtContext ctx) {
        tsw.insertAfter(ctx.getStop(), " (WHERE X IN USERS)");
    }

}

Output:

INSERT INTO t VALUES (WHERE X IN USERS) /* (1, 'Hello, world'), (2, 'abc'), (3, 'def') */
Advertisement