Databases 17 min read

Why Hand‑Written SQL Parsers Outperform Auto‑Generated Ones: Design & Performance Insights

This article examines the trade‑offs between automatically generated and manually crafted SQL parsers, detailing lexical‑syntax analysis techniques, performance challenges in OLAP workloads, and the authors’ custom parser design that achieves up to 30‑50× speed gains in complex queries and bulk inserts.

Alibaba Cloud Developer

Jun 10, 2021

Why Hand‑Written SQL Parsers Outperform Auto‑Generated Ones: Design & Performance Insights

SQL (Structured Query Language) is a domain‑specific language originally used for relational databases to manage structured data. It consists of DDL, DCL, and DML, and each database implements it differently. A lexical‑syntax analyzer parses SQL text through lexical analysis, syntax analysis, and semantic analysis, builds an AST, which is then optimized into an execution plan and executed by the engine. The diagram (MySQL architecture) shows where the analyzer fits.

This article introduces lexical‑syntax analysis techniques, industry practices, problems encountered with auto‑generated parsers, and shares the design and practice of a self‑developed SQL parser that brings performance and functional improvements.

1. How Industry Products Develop SQL Parsers

Parser development can be divided into two approaches:

1.1 Automatic Generation

Tools such as Flex, Lex, Bison (C/C++), ANTLR, and JavaCC (Java) generate lexical and syntax code from grammar files written in EBNF. They use LL(k) parsing algorithms to build the AST. Many databases and big‑data systems (e.g., Presto, Spark, Hive) adopt this method.

1.2 Hand‑Written

Popular databases like InfluxDB, H2, and ClickHouse implement their SQL parsers manually.

Clear code logic, easier debugging.

Better performance through custom algorithms and data structures.

Fully controllable, no licensing constraints, higher readability and maintainability.

No dependency on third‑party generation tools.

Drawbacks:

Higher technical requirements; developers must understand compiler theory.

Large development effort to cover all MySQL syntax branches.

Requires extensive testing to reach stability.

2. Problems and Challenges

2.1 Complex Query Performance – In real‑time analytical databases, thousands of complex or deeply nested queries cause deep thread stacks in auto‑generated parsers, leading to severe performance degradation during lexical‑syntax analysis.

2.2 Bulk‑Insert Throughput – High‑concurrency write scenarios generate many temporary AST objects, increasing GC overhead. Auto‑generated parsers also create many objects for the VALUES clause.

2.3 Query Rewriting Flexibility – Efficient AST traversal and modification are needed for query rewriting; auto‑generated parsers lack this flexibility.

3. Technical Highlights of the Self‑Developed Parser

3.1 Lexical & Syntax Analysis

Manual parsers use simple branch prediction instead of state‑machine DFA, reducing computation and call‑stack depth. Tokens are identified (e.g., identifiers, literals, keywords) and fed into a top‑down parser that builds the AST.

Fast Token Comparison

SELECT c1 FROM T1;

Keywords are case‑insensitive; they are stored in a map with uppercase keys for O(1) lookup, using ASCII‑based conversion for speed.

Fast Numeric Analysis

Numeric literals are parsed directly with Integer.parseInt, Float.parseFloat, etc., avoiding intermediate string objects and reducing memory usage.

Avoiding Backtracking

Only one look‑ahead token is usually needed; when more are required, the parser saves the token position to allow quick rollback, minimizing backtracking overhead.

Expression Replacement

Each AST node implements a Replaceable interface, enabling in‑place subtree replacement for query rewriting without rebuilding the whole tree.

public interface Replaceable {
    boolean replace(Node expr, Node target);
}

Other Optimizations

AST cloning for safe modifications (e.g., adding hints, pruning predicates).

Bidirectional parent‑child links for faster traversal.

public abstract class Node {
    public abstract List<Node> getChildren();
}

public class BetweenNode extends Node {
    public Node beginExpr;
    public Node endExpr;
    @Override
    public List<Node> getChildren() {
        return Arrays.asList(beginExpr, endExpr);
    }
    @Override
    public BetweenNode clone() {
        BetweenNode x = new BetweenNode();
        if (beginExpr != null) x.setBeginExpr(beginExpr.clone());
        if (endExpr != null) x.setEndExpr(endExpr.clone());
        return x;
    }
    public void setBeginExpr(Node beginExpr) {
        if (beginExpr != null) beginExpr.setParent(this);
        this.beginExpr = beginExpr;
    }
    public void setEndExpr(Node endExpr) {
        if (endExpr != null) endExpr.setParent(this);
        this.endExpr = endExpr;
    }
}

3.4 Semantic Analysis & Write‑Event Callbacks

During lexical analysis, a user‑provided InsertValueHandler can process each value directly, avoiding creation of intermediate AST nodes for bulk inserts.

public interface InsertValueHandler {
    Object newRow() throws SQLException;
    void processInteger(Object row, int index, Number value);
    void processString(Object row, int index, String value);
    // ... other type handlers ...
    void processComplete();
}

3.5 Query Rewriting

Manual parsers integrate rewriting rules (constant folding, function conversion, predicate push‑down) directly on the AST, achieving significant speedups.

-- Constant folding example
SELECT * FROM T1 WHERE c_week BETWEEN CAST(date_format(date_add('day', -day_of_week('20180605'), date('20180605')), '%Y%m%d') AS BIGINT)
    AND CAST(date_format(date_add('day', -day_of_week('20180606'), date('20180606')), '%Y%m%d') AS BIGINT);

-- After folding
SELECT * FROM T1 WHERE c_week BETWEEN 20180602 AND 20180603;

4. Conclusion

Performance tests on TPC‑DS (99 queries) show the hand‑written parser is 20× faster than an ANTLR‑generated parser and 30× faster than a JavaCC‑generated parser; in bulk‑insert scenarios the speedup reaches 30‑50×.

The article demonstrates that, for OLAP workloads with high throughput and complex SQL, a manually crafted parser offers superior performance, easier extensibility for semantic analysis, and better stability.

References [1] Pattis, Richard E. "EBNF: A Notation to Describe Syntax". [2] Parr, Terence and Fisher, Kathleen. "LL(*) the foundation of the ANTLR parser generator". [3] Rosenkrantz, D. J.; Stearns, R. E. "Properties of Deterministic Top Down Grammars". [4] Gurari, Eitan. "CIS 680: DATA STRUCTURES: Chapter 19: Backtracking Algorithms". [5] Pirahesh, Hamid; Hellerstein, Joseph M. "Extensible/Rule Based Query Rewrite Optimization in Starburst".

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance SQL Database lexical analysis Parser

Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.