Fundamentals 11 min read

How to Use Antlr4 for Custom SQL Parsing in Spark Projects

This guide explains common business scenarios that require custom SQL parsing, walks through setting up Antlr4 in IntelliJ IDEA, configuring Maven dependencies, generating parser code, and provides Java examples for extracting table names from Spark SQL statements, including handling of prediction modes and execution results.

政采云技术

Dec 6, 2022

How to Use Antlr4 for Custom SQL Parsing in Spark Projects

In many real‑world applications, standard SQL parsers such as Druid cannot handle newer or non‑standard SQL dialects, so developers often need a custom solution. Typical use cases include extracting table and column names for authorization, building data lineage graphs, and detecting whether an expression is a window or aggregate function.

1. Business Scenarios

Authorization: Parse SQL to obtain referenced tables and fields.

Data Lineage: Identify input and output tables as well as the fields flowing between them.

Other Cases: Determine if an expression is a window function, aggregate function, etc.

Some queries (e.g., SELECT * FROM test_tab) or ambiguous column origins cannot be resolved by pure syntactic analysis; Spark’s physical plan can provide the missing metadata.

2. Setting Up Antlr4 in IntelliJ IDEA

Install the Antlr4 plugin.

Obtain the Spark SQL grammar files (e.g., SparkSqlBaseLexer.g4 and SparkSqlBaseParser.g4) from the Spark source tree

/spark/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser

Add the following Maven dependencies and plugin configuration to pom.xml:

<antlr4.version>4.10.1</antlr4.version>

<dependency>
   <groupId>org.antlr</groupId>
   <artifactId>antlr4-runtime</artifactId>
   <version>${antlr4.version}</version>
</dependency>

<plugin>
   <groupId>org.antlr</groupId>
   <artifactId>antlr4-maven-plugin</artifactId>
   <version>${antlr4.version}</version>
   <executions>
      <execution>
         <goals>
            <goal>antlr4</goal>
         </goals>
         <phase>none</phase>
      </execution>
   </executions>
   <configuration>
      <sourceDirectory>src/main/scala/cn/zcy/antlr4/spark/g4</sourceDirectory>
      <includes>
         <include>SparkSqlBaseLexer.g4</include>
         <include>SparkSqlBaseParser.g4</include>
      </includes>
      <visitor>true</visitor>
      <listener>true</listener>
   </configuration>
</plugin>

3. Generating Source Files

After configuring the plugin, right‑click the grammar folder in IDEA and select “Generate Sources”. The generated files include: SparkSql.tokens: token type definitions. SparkSqlLexer and SparkSqlParser: lexical and syntactic analyzers. SparkSqlBaseListener / SparkSqlListener and SparkSqlBaseVisitor / SparkSqlVisitor: default Listener and Visitor implementations.

4. Code Implementation

4.1 Viewing the Parse Tree

By feeding a sample SQL to the generated parser, you can inspect the parse tree rooted at the statement rule.

4.2 Parsing Example – Extracting Table Names

import cn.zhengcaiyun.idata.connector.parser.spark.SparkSqlLexer;
import cn.zhengcaiyun.idata.connector.parser.spark.SparkSqlParser;
import cn.zhengcaiyun.idata.connector.util.model.UpperCaseCharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.atn.PredictionMode;

public class SparkSqlHelper {
    public static void main(String[] args) {
        String sql = "SELECT T1.COL1,T2.COL2 
" +
                     "FROM TAB1 T1 
" +
                     "LEFT JOIN TAB2 T2 ON T1.ID=T2.ID";
        System.out.println(SparkSqlHelper.parse(sql));
    }

    public static List<String> parse(String sql) {
        UpperCaseCharStream charStream = new UpperCaseCharStream(CharStreams.fromString(sql));
        SparkSqlLexer lexer = new SparkSqlLexer(charStream);
        CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        SparkSqlParser parser = new SparkSqlParser(tokenStream);
        parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
        SparkSqlAst visitor = new SparkSqlAst();
        return (List<String>) visitor.visit(parser.singleStatement());
    }
}

The SparkSqlAst visitor extracts table identifiers:

import cn.zcy.antlr4.spark.parser.SparkSqlBaseVisitor;
import cn.zcy.antlr4.spark.parser.SparkSqlParser;

public class SparkSqlAst extends SparkSqlBaseVisitor {
    private List<String> tableList = new ArrayList<>();

    @Override
    public List<String> visitSingleStatement(SparkSqlParser.SingleStatementContext ctx) {
        super.visitSingleStatement(ctx);
        return tableList;
    }

    @Override
    public List<String> visitTableIdentifier(SparkSqlParser.TableIdentifierContext ctx) {
        String db = ctx.db == null ? "" : ctx.db.getText();
        String tableName = ctx.table == null ? "" : ctx.table.getText();
        tableList.add(db.isEmpty() ? tableName : db + "." + tableName);
        return null;
    }
}

4.3 Prediction Modes

Antlr4 offers several prediction modes:

SLL: Fast, ignores full context; may fail on ambiguous inputs.

LL: Falls back when SLL throws ParseCancellationException, providing correct parse trees for most inputs.

LL_EXACT_AMBIG_DETECTION: Guarantees exhaustive ambiguity detection but offers no guarantees for syntactically invalid inputs.

5. Execution Result

The helper prints a list of tables found in the sample query, e.g., [TAB1, TAB2].

6. Reference

For deeper understanding, see “The Definitive ANTLR 4 Reference” by Terence Parr.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

backend Java SQL parsing Spark Parser Antlr4

Written by

政采云技术

ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.