Skip to content

How to parse a Clickhouse-SQL statement using ANTRL4?

Objective : Add an additional WHERE clause to any given Clickhouse statement.

I’m using the following Antlr grammars to generate Java classes for a lexer & parser.

Lexer grammar

Parser grammar

Problem : I cannot figure out/understand how to interact or create the appropriate POJOs for use with the generated classes that Antlr produces.

Example of statement

String query = "INSERT INTO t VALUES (1, 'Hello, world'), (2, 'abc'), (3, 'def')"

Goal of SQL (enrichment code)

String enrichedQuery = SqlParser.enrich(query);

>> INSERT INTO t VALUES (1, 'Hello, world'), (2, 'abc'), (3, 'def') (WHERE X IN USERS)

I have the follow Java main

public class Hello {

    public static void main( String[] args) throws Exception{
        String query = "INSERT INTO t VALUES (1, 'Hello, world'), (2, 'abc'), (3, 'def')"
        ClickhouseLexer = new ClickhouseLexer(new ANTLRInputStream(query));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        ClickHouseParser = new ClickHouseParser (tokens);          
        ParseTreeWalker walker = new ParseTreeWalker();



I’d suggest taking a look at TokenStreamRewriter.

First, let’s get the grammars ready.

1 – with TokenStreamRewriter we’ll want to preserve whitespace, so let’s change the -> skip directives to ->channel(HIDDEN)

At the end of the Lexer grammar:

// Comments and whitespace

MULTI_LINE_COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
SINGLE_LINE_COMMENT: '--' ~('n'|'r')* ('n' | 'r' | EOF) -> channel(HIDDEN);
WHITESPACE: [ u000Bu000Ctrn] -> channel(HIDDEN);  // 'n' can be part of multiline single query

2 – The C++ specific stuff just guards against using keywords more than once. You don’t really need that check for your purposes (and it could be done in a post-parse Listener if you DID need it). So let’s just lose the language specific stuff:

engineClause: engineExpr (
    | partitionByClause
    | primaryKeyClause
    | sampleByClause
    | ttlClause
    | settingsClause


    : identifier columnTypeExpr (
        DEFAULT literal
        | EXPRESSION columnExpr
        | INJECTIVE
        | IS_OBJECT_ID
    : dictionaryPrimaryKeyClause? (
        | lifetimeClause
        | layoutClause
        | rangeClause
        | dictionarySettingsClause

NOTE: There seems to be an issue with the grammar not accepting the actual values for an insert statement:

        | FUNCTION tableFunctionExpr
    ) columnsClause? dataClause

    : LPAREN nestedIdentifier (COMMA nestedIdentifier)* RPAREN
    : FORMAT identifier              # DataClauseFormat
    | VALUES                         # DataClauseValues // <- problem on this line
    | selectUnionStmt SEMICOLON? EOF # DataClauseSelect

(I’m not going to try to fix that part, so I’ve commented your input to accommodate)

(It would also help if the top level rule needed with an EOF token; without that ANTLR just stops parsing after VALUE. An EOF at the end of a root rule is considered a best practice for exactly this reason.)

The Main program:

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenStreamRewriter;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class TSWDemo {
    public static void main(String... args) {
        new TSWDemo().run(CharStreams.fromString("INSERT INTO t VALUES /* (1, 'Hello, world'), (2, 'abc'), (3, 'def') */"));

    public void run(CharStream charStream) {
        var lexer = new ClickHouseLexer(charStream);
        var tokenStream = new CommonTokenStream(lexer);
        var parser = new ClickHouseParser(tokenStream);
        var tsw = new TokenStreamRewriter(tokenStream);
        var listener = new TSWDemoListener(tsw);

        var queryStmt = parser.queryStmt();
        ParseTreeWalker.DEFAULT.walk(listener, queryStmt);


The Listener:

import org.antlr.v4.runtime.TokenStreamRewriter;

public class TSWDemoListener extends ClickHouseParserBaseListener {
    private TokenStreamRewriter tsw;

    public TSWDemoListener(TokenStreamRewriter tsw) {
        this.tsw = tsw;
    public void exitInsertStmt(ClickHouseParser.InsertStmtContext ctx) {
        tsw.insertAfter(ctx.getStop(), " (WHERE X IN USERS)");



INSERT INTO t VALUES (WHERE X IN USERS) /* (1, 'Hello, world'), (2, 'abc'), (3, 'def') */