
How ANTLR Powers SQL Parsing in Sharding‑Sphere: From Lexical Analysis to Sharding Context

This article explains the fundamentals of SQL parsing, compares popular parsers, details how ANTLR defines lexical and syntactic rules for SQL, and shows how Sharding‑Sphere leverages custom and ANTLR‑based parsers to extract sharding‑relevant context while balancing performance and compatibility.

dbaplus Community

1. Concepts of SQL Parsing

SQL, although simpler than many programming languages, still requires lexical analysis (tokenizing) and syntactic analysis (building an abstract syntax tree, AST). The lexer splits the input into tokens such as keywords, identifiers, literals, operators, and delimiters; the parser then assembles these tokens according to grammar rules to produce an AST.
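As a rough illustration of the lexical-analysis step, the following Python sketch splits a SQL string into typed tokens with the standard `re` module. The token classes and the tiny keyword list are assumptions made for this example; a real SQL lexer recognizes far more.

```python
import re

# Hypothetical token classes for illustration only; order matters because
# the alternation tries KEYWORD before IDENTIFIER.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:SELECT|FROM|WHERE|AND|OR)\b"),
    ("NUMBER",     r"\d+"),
    ("STRING",     r"'[^']*'"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR",   r"<=|>=|<>|=|<|>"),
    ("DELIMITER",  r"[(),.;]"),
    ("WS",         r"\s+"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.IGNORECASE,
)

def tokenize(sql):
    """Return a list of (token_type, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(sql):
        match = MASTER.match(sql, pos)
        if match is None:
            raise ValueError(f"unexpected character {sql[pos]!r} at {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("SELECT id FROM t_order WHERE user_id = 10"))
```

The parser would then consume this flat token list and arrange it into a tree according to the grammar rules.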

Common open‑source SQL parsers include JSQLParser, FDB, and Druid. JSQLParser provides a one‑stop solution but cannot generate custom parsers, supports only a subset of dialects, and hides the AST behind a Visitor pattern, making direct tree access cumbersome.

2. ANTLR Overview

ANTLR (ANother Tool for Language Recognition) is a parser generator written in Java that can emit parsers in many target languages (Java, Go, C++, and others). Lexer and parser rules can live in separate grammars or in a single combined grammar; rules are written in an EBNF‑like notation, and token definitions are fully customizable.

Key grammar elements:

Lexer : defines token patterns (e.g., SELECT: [Ss][Ee][Ll][Ee][Cc][Tt];).

Parser : defines how tokens combine into language constructs.

Tree : provides visitor interfaces to traverse the AST.

Combined : contains both lexer and parser rules in a single grammar.

Example lexer grammar for a simple SELECT statement:

lexer grammar SelectLexer;
SELECT: [Ss][Ee][Ll][Ee][Cc][Tt];
FROM:   [Ff][Rr][Oo][Mm];
WHERE:  [Ww][Hh][Ee][Rr][Ee];
ID:     [a-zA-Z0-9]+;
WS:     [ \t\r\n]+ -> skip;

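When several lexer rules match at the same position, ANTLR takes the longest match; if the lengths tie, the rule defined first wins, which is why keyword rules such as SELECT must come before ID. Without a whitespace rule such as WS, every space or line break in the input produces a token recognition error.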
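The matching policy can be sketched in plain Python (this is not ANTLR itself, only an imitation of its behavior): at each position take the longest match, and on a tie keep the rule listed first. The rule table mirrors the SelectLexer grammar above; the helper names `next_token` and `lex` are made up for this example.

```python
import re

# Rules in grammar order, mirroring the SelectLexer example.
RULES = [
    ("SELECT", re.compile(r"[Ss][Ee][Ll][Ee][Cc][Tt]")),
    ("FROM",   re.compile(r"[Ff][Rr][Oo][Mm]")),
    ("WHERE",  re.compile(r"[Ww][Hh][Ee][Rr][Ee]")),
    ("ID",     re.compile(r"[a-zA-Z0-9]+")),
    ("WS",     re.compile(r"[ \t\r\n]+")),
]

def next_token(text, pos):
    """Pick the longest match at pos; on equal length the earlier rule wins."""
    best = None  # (match_length, rule_name)
    for name, pattern in RULES:
        match = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so an earlier rule survives a tie.
        if match and (best is None or len(match.group()) > best[0]):
            best = (len(match.group()), name)
    if best is None:
        raise ValueError(f"token recognition error at {pos}: {text[pos]!r}")
    length, name = best
    return name, text[pos:pos + length]

def lex(text):
    """Tokenize text, skipping WS tokens like '-> skip' does."""
    pos, tokens = 0, []
    while pos < len(text):
        name, value = next_token(text, pos)
        if name != "WS":
            tokens.append((name, value))
        pos += len(value)
    return tokens

print(lex("select selected from t1"))
```

Here `select` is matched by both SELECT and ID with the same length, so the earlier SELECT rule is kept, while `selected` becomes an ID because ID matches more characters.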

3. ANTLR Grammar Details

Rule names that start with an uppercase letter are lexer rules; names that start with a lowercase letter are parser rules. Every rule ends with a semicolon, and alternatives are separated by ‘|’. The suffix operators ‘*’, ‘+’, and ‘?’ mean zero or more, one or more, and optional, respectively.

Example parser grammar demonstrating greedy matching and rule precedence:

grammar Test;
ID: [a-zA-Z0-9]+;
WS: [ \t\r\n]+ -> skip;

testAll: test1 | test2 | test3 | test21;

test1: ID;

test2: ID ID;

test3: ID ID ID;

test21: ID ID;

test4: test1+;

When parsing “a1 a2 a3”, the longest matching rule (test3) is chosen; for “a1 a2”, test2 wins over test21 because it appears earlier; an unrecognizable token (e.g., “#”) triggers an error.

4. Sharding‑Sphere SQL Parsing

Sharding‑Sphere originally used Druid as its SQL parser (pre‑1.5.x) and later switched to a custom parser that performs a “half‑understanding” of SQL, extracting only the context needed for data‑sharding (select items, table names, sharding conditions, primary keys, ORDER BY, GROUP BY, LIMIT, etc.).
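To make the idea of "half‑understanding" concrete, here is a minimal Python sketch that scans the token stream once and keeps only a few sharding‑relevant pieces (table names, equality conditions, LIMIT) without building a full syntax tree. It is an illustration under simplifying assumptions, not Sharding‑Sphere's actual parser; the function name `extract_context` and the shape of the result are hypothetical.

```python
import re

# Very rough tokenizer: words, quoted strings, and a few symbols.
TOKEN = re.compile(r"\s*(\w+|'[^']*'|<=|>=|<>|[=<>(),.*?])")

def extract_context(sql):
    """Collect tables, 'column = value' conditions, and LIMIT in one pass."""
    tokens = TOKEN.findall(sql)
    upper = [t.upper() for t in tokens]
    ctx = {"tables": [], "conditions": [], "limit": None}
    i = 0
    while i < len(tokens):
        word = upper[i]
        if word in ("FROM", "JOIN") and i + 1 < len(tokens):
            # The token after FROM/JOIN is taken as a table name.
            ctx["tables"].append(tokens[i + 1])
            i += 2
        elif (i + 2 < len(tokens) and tokens[i + 1] == "="
              and tokens[i].isidentifier()):
            # 'column = value' is a candidate sharding condition.
            ctx["conditions"].append((tokens[i], tokens[i + 2]))
            i += 3
        elif word == "LIMIT" and i + 1 < len(tokens) and tokens[i + 1].isdigit():
            ctx["limit"] = int(tokens[i + 1])
            i += 2
        else:
            # Everything else is skipped rather than understood.
            i += 1
    return ctx

print(extract_context(
    "SELECT * FROM t_order WHERE user_id = 10 AND status = 'PAID' "
    "ORDER BY id LIMIT 5"
))
```

Skipping whatever is not needed is what keeps this style of parser fast; the price is that unusual syntax (subqueries, aliases, comma‑separated table lists) needs extra handling that this sketch omits.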

From version 3.x, Sharding‑Sphere experiments with ANTLR as the parsing engine, aiming to replace its internal parser step by step in the order DDL → TCL → DAL → DCL → DML → DQL. ANTLR improves compatibility with complex statements (recursive queries, window functions), but it is roughly three times slower than the handcrafted parser.

To mitigate the slowdown, Sharding‑Sphere caches the AST generated from prepared‑statement SQL and offers a configuration that lets users choose between the fast custom parser and the more compatible ANTLR parser.
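Caching works well for prepared statements because the SQL text stays identical across executions while only the bound parameters change, so the text itself can serve as the cache key. A minimal sketch of such a cache, assuming a hypothetical `parse_fn` that stands in for the expensive parse and a simple least‑recently‑used eviction policy (the source does not say which policy Sharding‑Sphere uses):

```python
from collections import OrderedDict

class CachingParser:
    """Wraps a parse function and reuses results keyed by the SQL text."""

    def __init__(self, parse_fn, max_entries=1024):
        self._parse_fn = parse_fn        # hypothetical expensive parser
        self._max_entries = max_entries
        self._cache = OrderedDict()      # sql text -> parse result
        self.misses = 0                  # how often parse_fn really ran

    def parse(self, sql):
        if sql in self._cache:
            # Cache hit: mark as most recently used and reuse the tree.
            self._cache.move_to_end(sql)
            return self._cache[sql]
        self.misses += 1
        tree = self._parse_fn(sql)
        self._cache[sql] = tree
        if len(self._cache) > self._max_entries:
            # Evict the least recently used entry.
            self._cache.popitem(last=False)
        return tree

# Usage: the second call with the same text does not parse again.
parser = CachingParser(lambda sql: tuple(sql.split()))
first = parser.parse("SELECT * FROM t_order WHERE user_id = ?")
second = parser.parse("SELECT * FROM t_order WHERE user_id = ?")
print(first is second, parser.misses)
```

A cached tree must be treated as read‑only, since every execution of the same statement shares it.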

5. Visual Example

Below is an illustration of an AST for a sample SELECT statement, where keywords appear in green, variables in red, and nodes requiring further splitting in gray.

AST diagram

6. Summary of Parsing Process

The overall parsing workflow consists of:

Lexical analysis : Tokenize the SQL string into keywords, identifiers, literals, operators, and delimiters.

Syntactic analysis : Apply grammar rules (custom or ANTLR‑generated) to build an AST, handling branching, recursion, and greedy matching.

Context extraction : Traverse the AST to collect sharding‑relevant information such as tables, conditions, and pagination.
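The context‑extraction step can be sketched as a recursive walk over a tree. The `Node` class, the node kinds, and the hand‑built tree below are assumptions for illustration; a tree produced by ANTLR or by Sharding‑Sphere's own parser has a different, richer shape.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # e.g. "select", "table", "equals"
    text: str = ""
    children: list = field(default_factory=list)

def collect(node, ctx=None):
    """Walk the tree depth-first and gather sharding-relevant facts."""
    if ctx is None:
        ctx = {"tables": [], "conditions": [], "limit": None}
    if node.kind == "table":
        ctx["tables"].append(node.text)
    elif node.kind == "equals":
        column, value = node.children
        ctx["conditions"].append((column.text, value.text))
    elif node.kind == "limit":
        ctx["limit"] = int(node.text)
    for child in node.children:
        collect(child, ctx)
    return ctx

# Hand-built tree for: SELECT id FROM t_order WHERE user_id = 10 LIMIT 5
tree = Node("select", children=[
    Node("column", "id"),
    Node("from", children=[Node("table", "t_order")]),
    Node("where", children=[
        Node("equals", children=[
            Node("column", "user_id"),
            Node("literal", "10"),
        ]),
    ]),
    Node("limit", "5"),
])

print(collect(tree))
```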

Together, these three steps let Sharding‑Sphere achieve high performance while maintaining sufficient SQL compatibility for most sharding scenarios.
