How to Use Antlr4 for Custom SQL Parsing in Spark Projects
This guide explains common business scenarios that require custom SQL parsing, walks through setting up Antlr4 in IntelliJ IDEA, configuring Maven dependencies, generating parser code, and provides Java examples for extracting table names from Spark SQL statements, including handling of prediction modes and execution results.
In many real‑world applications, standard SQL parsers such as Druid cannot handle newer or non‑standard SQL dialects, so developers often need a custom solution. Typical use cases include extracting table and column names for authorization, building data lineage graphs, and detecting whether an expression is a window or aggregate function.
1. Business Scenarios
Authorization: Parse SQL to obtain referenced tables and fields.
Data Lineage: Identify input and output tables as well as the fields flowing between them.
Other Cases: Determine if an expression is a window function, aggregate function, etc.
Some queries (e.g., SELECT * FROM test_tab) or ambiguous column origins cannot be resolved by pure syntactic analysis; Spark’s physical plan can provide the missing metadata.
2. Setting Up Antlr4 in IntelliJ IDEA
Install the Antlr4 plugin.
Obtain the Spark SQL grammar files (e.g., SparkSqlBaseLexer.g4 and SparkSqlBaseParser.g4) from the Spark source tree
/spark/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser.
Add the following Maven dependencies and plugin configuration to pom.xml:
<antlr4.version>4.10.1</antlr4.version>
<dependency>
<groupId>org.antlr</groupId>
<artifactId>antlr4-runtime</artifactId>
<version>${antlr4.version}</version>
</dependency>
<plugin>
<groupId>org.antlr</groupId>
<artifactId>antlr4-maven-plugin</artifactId>
<version>${antlr4.version}</version>
<executions>
<execution>
<goals>
<goal>antlr4</goal>
</goals>
<phase>none</phase>
</execution>
</executions>
<configuration>
<sourceDirectory>src/main/scala/cn/zcy/antlr4/spark/g4</sourceDirectory>
<includes>
<include>SparkSqlBaseLexer.g4</include>
<include>SparkSqlBaseParser.g4</include>
</includes>
<visitor>true</visitor>
<listener>true</listener>
</configuration>
</plugin>3. Generating Source Files
After configuring the plugin, right‑click the grammar folder in IDEA and select “Generate Sources”. The generated files include: SparkSql.tokens: token type definitions. SparkSqlLexer and SparkSqlParser: lexical and syntactic analyzers. SparkSqlBaseListener / SparkSqlListener and SparkSqlBaseVisitor / SparkSqlVisitor: default Listener and Visitor implementations.
4. Code Implementation
4.1 Viewing the Parse Tree
By feeding a sample SQL to the generated parser, you can inspect the parse tree rooted at the statement rule.
4.2 Parsing Example – Extracting Table Names
import cn.zhengcaiyun.idata.connector.parser.spark.SparkSqlLexer;
import cn.zhengcaiyun.idata.connector.parser.spark.SparkSqlParser;
import cn.zhengcaiyun.idata.connector.util.model.UpperCaseCharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.atn.PredictionMode;
public class SparkSqlHelper {
public static void main(String[] args) {
String sql = "SELECT T1.COL1,T2.COL2
" +
"FROM TAB1 T1
" +
"LEFT JOIN TAB2 T2 ON T1.ID=T2.ID";
System.out.println(SparkSqlHelper.parse(sql));
}
public static List<String> parse(String sql) {
UpperCaseCharStream charStream = new UpperCaseCharStream(CharStreams.fromString(sql));
SparkSqlLexer lexer = new SparkSqlLexer(charStream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
SparkSqlParser parser = new SparkSqlParser(tokenStream);
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
SparkSqlAst visitor = new SparkSqlAst();
return (List<String>) visitor.visit(parser.singleStatement());
}
}The SparkSqlAst visitor extracts table identifiers:
import cn.zcy.antlr4.spark.parser.SparkSqlBaseVisitor;
import cn.zcy.antlr4.spark.parser.SparkSqlParser;
public class SparkSqlAst extends SparkSqlBaseVisitor {
private List<String> tableList = new ArrayList<>();
@Override
public List<String> visitSingleStatement(SparkSqlParser.SingleStatementContext ctx) {
super.visitSingleStatement(ctx);
return tableList;
}
@Override
public List<String> visitTableIdentifier(SparkSqlParser.TableIdentifierContext ctx) {
String db = ctx.db == null ? "" : ctx.db.getText();
String tableName = ctx.table == null ? "" : ctx.table.getText();
tableList.add(db.isEmpty() ? tableName : db + "." + tableName);
return null;
}
}4.3 Prediction Modes
Antlr4 offers several prediction modes:
SLL: Fast, ignores full context; may fail on ambiguous inputs.
LL: Falls back when SLL throws ParseCancellationException, providing correct parse trees for most inputs.
LL_EXACT_AMBIG_DETECTION: Guarantees exhaustive ambiguity detection but offers no guarantees for syntactically invalid inputs.
5. Execution Result
The helper prints a list of tables found in the sample query, e.g., [TAB1, TAB2].
6. Reference
For deeper understanding, see “The Definitive ANTLR 4 Reference” by Terence Parr.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
政采云技术
ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
