Mastering SQL LIKE: From Regex to State Machine for High‑Performance Matching
This article examines how major SQL parsers like ANTLR and Calcite implement the LIKE operator, compares regex‑based, naïve, and finite‑state‑machine approaches, and presents optimizations that achieve linear‑time performance for most pattern matching scenarios.
Introduction
When optimizing the LIKE operator in SQL, the author examines how mainstream parsers such as ANTLR and Calcite implement the LIKE syntax, compares several implementation strategies, and finally adopts a state‑machine based solution with step‑by‑step performance optimizations.
ANTLR and Calcite
ANTLR is a powerful parser generator widely used in big‑data SQL frameworks (e.g., Hive, Presto). It does not provide a concrete LIKE implementation. Calcite builds on ANTLR, offering a standard SQL language, query optimization, and a pluggable architecture. The article shows Calcite’s implementation of LIKE using Java regex.
/** SQL {@code LIKE} function. */
public static boolean like(String s, String pattern) {
final String regex = Like.sqlToRegexLike(pattern, null);
return Pattern.matches(regex, s);
}
/** Translates a SQL LIKE pattern to Java regex pattern. */
static String sqlToRegexLike(String sqlPattern, char escapeChar) {
// conversion logic …
}Other Implementations
Examples from other compilers (e.g., TDDL) also rely on regex conversion. A straightforward Java regex version is presented, but its performance suffers due to repeated replace operations.
public static boolean like(final String dest, final String pattern) {
String regex = regexParse(pattern);
regex = regex.replace("_", ".").replace("%", ".*?");
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
return p.matcher(dest).matches();
}State‑Machine Approach
The author designs a finite‑state machine (FSM) with three components—state, event, action. Various implementation techniques (enumeration, lookup table, state pattern) are discussed. The FSM is built by parsing the pattern into a linked list (compile O(n)) and matching with O(n·m) complexity, later optimized to O(n) by eliminating backtracking.
public void compile(final String pattern) {
LikeStateMachine machine = LikeStateMachine.build(pattern);
}Optimizations
Several optimizations are applied: caching compiled Pattern objects, using lazy or exclusive regex modes, handling common cases (equality, startsWith, endsWith, contains) directly, and classifying patterns into LEFT, RIGHT, SURROUND types to avoid backtracking. These changes reduce matching time to linear complexity for most scenarios.
Benchmark Results
Performance tests on typical short and long patterns show that the FSM implementation consistently outperforms the generic regex and double‑pointer algorithms, especially as the input string grows.
Conclusion
Considering maintainability, extensibility, and speed, the state‑machine implementation is preferred over direct regex or naïve algorithms for SQL LIKE matching.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
