How SQL Engines Turn Queries into Results: Inside ASTs and ANTLR4 Parsing
This article explores the inner workings of SQL engines, detailing the end‑to‑end process from query submission through lexical analysis, parsing, abstract syntax tree generation with ANTLR4, and subsequent logical and physical plan optimization, while also demonstrating practical Java examples for SQL parsing and rewriting.
1. Introduction
We often write various SQL statements, from simple to complex, and expect the database to return the desired result set. For example, the query select * from t_user where user_id > 10; retrieves all records with user_id greater than 10. This article asks what actually happens between submitting a SQL statement and obtaining the result set.
2. SQL Engine Overview
From MySQL, Oracle, TiDB, ClickHouse to Hive, HBase, Spark, most relational and big‑data engines rely on a SQL engine that accepts a SQL string and returns query results. Their core execution logic follows a similar workflow, typically consisting of lexical analysis, syntax analysis (producing an abstract syntax tree, AST), logical plan generation, optimization, and physical plan execution.
The blue part of the diagram represents the basic workflow of a SQL engine, where lexical and syntax analysis lead to the creation of an AST.
3. Abstract Syntax Tree (AST)
3.1 Concept
High‑level language parsing relies on a parse tree; an AST is a simplified version that discards some syntactic details, representing the source code structure as a tree of nodes (ASTNode). Each node corresponds to a language construct, and different languages have their own AST implementations.
When a SQL statement is submitted, it first undergoes lexical analysis (tokenization) and then syntax analysis, which builds an AST.
3.2 ANTLR4
Many SQL engines use ANTLR4 for lexical and syntax parsing. ANTLR4 works by defining a .g4 grammar file that describes the language using regular expressions and specific syntax rules. The generated parser then tokenizes and parses SQL statements according to this grammar.
3.3 Example: Parsing SQL in Java
3.3.1 Dependency
<dependency>
<groupId>org.antlr</groupId>
<artifactId>antlr4-runtime</artifactId>
<version>4.7</version>
</dependency>Install the ANTLR4 plugin in IDEA, then use the generated Java classes (e.g., SqlBaseLexer, SqlBaseParser) to parse a SQL statement and extract table names.
The generated parser creates an AST; visitors (e.g., HelloVisitor) or listeners can traverse the tree to retrieve information such as table identifiers.
By overriding methods in SqlBaseBaseVisitor, you can extract specific nodes like TableIdentifierContext that represent table definitions.
4. AST‑Based SQL Parser and Rewriting
Using ANTLR4 to generate and parse an AST is a low‑level approach; in production, many projects adopt existing tools for parsing.
Popular Java SQL parsers include:
fdb‑sql‑parser (FoundationDB, now unmaintained)
jsqlparser (based on JavaCC)
Apache Calcite (dynamic data‑management framework, strong standard‑SQL support, weaker dialect support)
Alibaba Druid (JDBC pool with built‑in SQL parser supporting multiple dialects)
Apache ShardingSphere and Mycat (both use Druid’s parser)
4.1 Application Scenarios
Syntax validation: enforce rules and check legality of SQL.
Query optimization: analyze WHERE, GROUP BY, JOIN, etc., and suggest index improvements.
SQL rewriting: add, remove, or modify AST nodes to transform the query.
SQL fingerprinting: generate unique signatures for slow‑query analysis.
4.2 SQL Rewriting via AST
Instead of simple string replacement or regex, rewriting via AST ensures syntactic correctness. The process involves generating an AST from the SQL string, modifying target nodes, and converting the AST back to a SQL string.
Using Druid’s SQLUtils class, you can implement such rewrites.
4.2.1 Adding a Flag Column
Original SQL:
After AST‑based rewriting, the query automatically includes an environment flag (e.g., flag='pre‑prod' or flag='prod') without manual changes.
4.2.2 Query Rewriting Examples
Simple JOIN:
JOIN with implicit WHERE:
Complex UNION + JOIN + sub‑query:
5. Conclusion
This research, conducted in an isolated environment, demonstrates that AST‑based SQL rewriting can be used for data‑isolation purposes. By deploying a plugin that intercepts SQL before it reaches MySQL, developers can automatically add environment‑specific flags to every CRUD statement, ensuring that each application only accesses data belonging to its own environment.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
JD Cloud Developers
JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
