Databases 11 min read

How SQL Engines Turn Queries into Results: Inside ASTs and ANTLR4 Parsing

This article explores the inner workings of SQL engines, detailing the end‑to‑end process from query submission through lexical analysis, parsing, abstract syntax tree generation with ANTLR4, and subsequent logical and physical plan optimization, while also demonstrating practical Java examples for SQL parsing and rewriting.

JD Cloud Developers

Oct 18, 2022

How SQL Engines Turn Queries into Results: Inside ASTs and ANTLR4 Parsing

1. Introduction

We often write various SQL statements, from simple to complex, and expect the database to return the desired result set. For example, the query select * from t_user where user_id > 10; retrieves all records with user_id greater than 10. This article asks what actually happens between submitting a SQL statement and obtaining the result set.

2. SQL Engine Overview

From MySQL, Oracle, TiDB, ClickHouse to Hive, HBase, Spark, most relational and big‑data engines rely on a SQL engine that accepts a SQL string and returns query results. Their core execution logic follows a similar workflow, typically consisting of lexical analysis, syntax analysis (producing an abstract syntax tree, AST), logical plan generation, optimization, and physical plan execution.

The blue part of the diagram represents the basic workflow of a SQL engine, where lexical and syntax analysis lead to the creation of an AST.

3. Abstract Syntax Tree (AST)

3.1 Concept

High‑level language parsing relies on a parse tree; an AST is a simplified version that discards some syntactic details, representing the source code structure as a tree of nodes (ASTNode). Each node corresponds to a language construct, and different languages have their own AST implementations.

When a SQL statement is submitted, it first undergoes lexical analysis (tokenization) and then syntax analysis, which builds an AST.

3.2 ANTLR4

Many SQL engines use ANTLR4 for lexical and syntax parsing. ANTLR4 works by defining a .g4 grammar file that describes the language using regular expressions and specific syntax rules. The generated parser then tokenizes and parses SQL statements according to this grammar.

3.3 Example: Parsing SQL in Java

3.3.1 Dependency

<dependency>
  <groupId>org.antlr</groupId>
  <artifactId>antlr4-runtime</artifactId>
  <version>4.7</version>
</dependency>

Install the ANTLR4 plugin in IDEA, then use the generated Java classes (e.g., SqlBaseLexer, SqlBaseParser) to parse a SQL statement and extract table names.

The generated parser creates an AST; visitors (e.g., HelloVisitor) or listeners can traverse the tree to retrieve information such as table identifiers.

By overriding methods in SqlBaseBaseVisitor, you can extract specific nodes like TableIdentifierContext that represent table definitions.

4. AST‑Based SQL Parser and Rewriting

Using ANTLR4 to generate and parse an AST is a low‑level approach; in production, many projects adopt existing tools for parsing.

Popular Java SQL parsers include:

fdb‑sql‑parser (FoundationDB, now unmaintained)

jsqlparser (based on JavaCC)

Apache Calcite (dynamic data‑management framework, strong standard‑SQL support, weaker dialect support)

Alibaba Druid (JDBC pool with built‑in SQL parser supporting multiple dialects)

Apache ShardingSphere and Mycat (both use Druid’s parser)

4.1 Application Scenarios

Syntax validation: enforce rules and check legality of SQL.

Query optimization: analyze WHERE, GROUP BY, JOIN, etc., and suggest index improvements.

SQL rewriting: add, remove, or modify AST nodes to transform the query.

SQL fingerprinting: generate unique signatures for slow‑query analysis.

4.2 SQL Rewriting via AST

Instead of simple string replacement or regex, rewriting via AST ensures syntactic correctness. The process involves generating an AST from the SQL string, modifying target nodes, and converting the AST back to a SQL string.

Using Druid’s SQLUtils class, you can implement such rewrites.

4.2.1 Adding a Flag Column

Original SQL:

After AST‑based rewriting, the query automatically includes an environment flag (e.g., flag='pre‑prod' or flag='prod') without manual changes.

4.2.2 Query Rewriting Examples

Simple JOIN:

JOIN with implicit WHERE:

Complex UNION + JOIN + sub‑query:

5. Conclusion

This research, conducted in an isolated environment, demonstrates that AST‑based SQL rewriting can be used for data‑isolation purposes. By deploying a plugin that intercepts SQL before it reaches MySQL, developers can automatically add environment‑specific flags to every CRUD statement, ensuring that each application only accesses data belonging to its own environment.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

java sql AST parsing Antlr4

Written by

JD Cloud Developers

JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.