Mastering SQL LIKE: From Regex to State Machine for High‑Performance Matching

This article examines how major SQL parsers like ANTLR and Calcite implement the LIKE operator, compares regex‑based, naïve, and finite‑state‑machine approaches, and presents optimizations that achieve linear‑time performance for most pattern matching scenarios.

Alibaba Cloud Developer
Alibaba Cloud Developer
Alibaba Cloud Developer
Mastering SQL LIKE: From Regex to State Machine for High‑Performance Matching

Introduction

When optimizing the LIKE operator in SQL, the author examines how mainstream parsers such as ANTLR and Calcite implement the LIKE syntax, compares several implementation strategies, and finally adopts a state‑machine based solution with step‑by‑step performance optimizations.

ANTLR and Calcite

ANTLR is a powerful parser generator widely used in big‑data SQL frameworks (e.g., Hive, Presto). It does not provide a concrete LIKE implementation. Calcite builds on ANTLR, offering a standard SQL language, query optimization, and a pluggable architecture. The article shows Calcite’s implementation of LIKE using Java regex.

/** SQL {@code LIKE} function. */
public static boolean like(String s, String pattern) {
    final String regex = Like.sqlToRegexLike(pattern, null);
    return Pattern.matches(regex, s);
}

/** Translates a SQL LIKE pattern to Java regex pattern. */
static String sqlToRegexLike(String sqlPattern, char escapeChar) {
    // conversion logic …
}

Other Implementations

Examples from other compilers (e.g., TDDL) also rely on regex conversion. A straightforward Java regex version is presented, but its performance suffers due to repeated replace operations.

public static boolean like(final String dest, final String pattern) {
    String regex = regexParse(pattern);
    regex = regex.replace("_", ".").replace("%", ".*?");
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    return p.matcher(dest).matches();
}

State‑Machine Approach

The author designs a finite‑state machine (FSM) with three components—state, event, action. Various implementation techniques (enumeration, lookup table, state pattern) are discussed. The FSM is built by parsing the pattern into a linked list (compile O(n)) and matching with O(n·m) complexity, later optimized to O(n) by eliminating backtracking.

public void compile(final String pattern) {
    LikeStateMachine machine = LikeStateMachine.build(pattern);
}

Optimizations

Several optimizations are applied: caching compiled Pattern objects, using lazy or exclusive regex modes, handling common cases (equality, startsWith, endsWith, contains) directly, and classifying patterns into LEFT, RIGHT, SURROUND types to avoid backtracking. These changes reduce matching time to linear complexity for most scenarios.

Benchmark Results

Performance tests on typical short and long patterns show that the FSM implementation consistently outperforms the generic regex and double‑pointer algorithms, especially as the input string grows.

Conclusion

Considering maintainability, extensibility, and speed, the state‑machine implementation is preferred over direct regex or naïve algorithms for SQL LIKE matching.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

performanceSQLstate machineregexlike
Alibaba Cloud Developer
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.