Spark SQL Expression Optimizations: LIKE ALL/ANY, TRIM Function Improvements, and Constant Folding
This article examines Spark SQL expression-level optimizations: redesigning LIKE ALL and LIKE ANY to reduce memory and stack usage, refactoring the TRIM functions for better code reuse and performance, and applying constant folding so constant expressions are computed once rather than per row, thereby improving query efficiency in big-data workloads.
The presentation introduces several Spark SQL expression optimizations aimed at improving query efficiency and reducing resource consumption in large‑scale data processing.
Big Data Product Overview
CyberEngine, the core of the DataCyber platform, provides a unified metadata service and supports resource managers such as Kubernetes and YARN. It integrates multiple machine-learning frameworks and streaming/batch engines (e.g., Spark, Flink), and offers its own SQL engine (CyberSQL) and scheduler (CyberScheduler).
The focus of the session is on Spark‑related performance and stability enhancements within CyberEngine, with similar stability work also applied to Flink.
Spark SQL Course Recap
Previous lectures covered Spark core principles, the Spark SQL execution pipeline, and detailed analyses of the parsing and analysis layers, setting the stage for today’s discussion on expression‑level optimizations.
Analysis Layer Principles
Expressions are not created during the parsing stage; they are generated in the analysis layer, when the analyzer converts the unresolved logical plan into a resolved, fully typed logical plan using components such as the Catalog Manager and the function registry.
LIKE ALL and LIKE ANY Optimization
In SQL, LIKE performs pattern matching. LIKE ALL requires a value to match every pattern, so evaluation can stop at the first miss; LIKE ANY succeeds if the value matches any one pattern, so evaluation can stop at the first hit. Both therefore allow short-circuit evaluation.
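These semantics can be illustrated with a small, self-contained sketch (class and method names here are illustrative, not Spark's; the LIKE-to-regex translation is simplified and does not escape regex metacharacters):

```java
import java.util.List;

public class LikeDemo {
    // Translate a SQL LIKE pattern into a regex: % -> .*, _ -> .
    // (Simplified: assumes patterns contain no regex metacharacters.)
    static boolean like(String value, String pattern) {
        String regex = pattern.replace("%", ".*").replace("_", ".");
        return value.matches(regex);
    }

    // LIKE ALL: every pattern must match; allMatch stops at the first miss.
    static boolean likeAll(String value, List<String> patterns) {
        return patterns.stream().allMatch(p -> like(value, p));
    }

    // LIKE ANY: one match suffices; anyMatch stops at the first hit.
    static boolean likeAny(String value, List<String> patterns) {
        return patterns.stream().anyMatch(p -> like(value, p));
    }

    public static void main(String[] args) {
        List<String> patterns = List.of("%oo%", "f%");
        System.out.println(likeAll("foo", patterns)); // true: matches both
        System.out.println(likeAny("boo", patterns)); // true: matches %oo%
        System.out.println(likeAll("boo", patterns)); // false: misses f%
    }
}
```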
Spark SQL originally implemented these operators by expanding them into separate LIKE expression objects combined with AND (for LIKE ALL) or OR (for LIKE ANY). With many patterns, this produces deep expression trees, high memory consumption, and frequent garbage collection.
The new design introduces a base class LikeAllBase that stores all patterns in a single sequence (using UTF8String for efficient string handling). Its eval method checks for nulls and then iterates over the patterns, avoiding the creation of many intermediate expression objects and keeping the expression tree, and therefore the evaluation stack, shallow.
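The flattened design can be sketched as follows (a simplified, hypothetical rendering: the real LikeAllBase operates on UTF8String and Spark's internal row format, and compiles LIKE patterns through Spark's own escaping logic):

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch of a flattened LIKE ALL expression: one object holds every
// pattern, instead of a deep And(Like, And(Like, ...)) tree.
public class LikeAllExpr {
    private final List<Pattern> compiled; // compiled once, reused for every row

    public LikeAllExpr(List<String> likePatterns) {
        this.compiled = likePatterns.stream()
                .map(LikeAllExpr::toRegex)
                .toList();
    }

    // % -> any run of characters, _ -> any single character.
    static Pattern toRegex(String p) {
        StringBuilder sb = new StringBuilder();
        for (char c : p.toCharArray()) {
            if (c == '%') sb.append(".*");
            else if (c == '_') sb.append('.');
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // Null-aware evaluation: a NULL input yields NULL (SQL three-valued logic).
    public Boolean eval(String value) {
        if (value == null) return null;
        for (Pattern p : compiled) {
            if (!p.matcher(value).matches()) return false; // short-circuit
        }
        return true;
    }

    public static void main(String[] args) {
        LikeAllExpr e = new LikeAllExpr(List.of("%spark%", "%sql%"));
        System.out.println(e.eval("spark sql engine")); // true
        System.out.println(e.eval("flink engine"));     // false
        System.out.println(e.eval(null));               // null
    }
}
```

The key point is that one loop over a plain sequence replaces a chain of nested expression objects, so memory use and stack depth stay constant no matter how many patterns appear.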
TRIM Function Improvements
Both the standard TRIM and its variants LTRIM (trim leading characters) and RTRIM (trim trailing characters) share similar logic. The refactor extracts the common logic into a parent class, String2TrimExpression, eliminating duplicated code across the three functions.
Additionally, the refactor consolidates the code-generation path, so a single helper emits flat Java code for the trimming logic, reducing method-call overhead and improving JVM performance.
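The shared-parent idea can be sketched like this (illustrative names and a plain-String signature; Spark's actual String2TrimExpression works on UTF8String and also centralizes code generation):

```java
public class TrimDemo {
    // Shared parent: the character-scanning logic lives here exactly once.
    abstract static class String2Trim {
        final String trimChars;
        String2Trim(String trimChars) { this.trimChars = trimChars; }
        abstract String eval(String input);

        // Index of the first character that is NOT in the trim set.
        int firstKept(String s) {
            int i = 0;
            while (i < s.length() && trimChars.indexOf(s.charAt(i)) >= 0) i++;
            return i;
        }
        // Index of the last character that is NOT in the trim set.
        int lastKept(String s) {
            int i = s.length() - 1;
            while (i >= 0 && trimChars.indexOf(s.charAt(i)) >= 0) i--;
            return i;
        }
    }

    // Each variant is now a thin subclass combining the shared helpers.
    static class TrimBoth extends String2Trim {
        TrimBoth(String t) { super(t); }
        String eval(String s) {
            int lo = firstKept(s), hi = lastKept(s);
            return lo > hi ? "" : s.substring(lo, hi + 1);
        }
    }
    static class TrimLeft extends String2Trim {
        TrimLeft(String t) { super(t); }
        String eval(String s) { return s.substring(firstKept(s)); }
    }
    static class TrimRight extends String2Trim {
        TrimRight(String t) { super(t); }
        String eval(String s) { return s.substring(0, lastKept(s) + 1); }
    }

    public static void main(String[] args) {
        System.out.println("[" + new TrimBoth(" ").eval("  spark  ") + "]");  // [spark]
        System.out.println("[" + new TrimLeft(" ").eval("  spark  ") + "]");  // [spark  ]
        System.out.println("[" + new TrimRight(" ").eval("  spark  ") + "]"); // [  spark]
    }
}
```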
Constant Folding
Constant folding evaluates constant expressions (e.g., 1 + 2) during the query-optimization phase, replacing them with their computed values so they are not recomputed for each row.
In Spark SQL, this optimization applies to functions like array_insert. When the index argument is a constant expression such as 2 + 3, Spark computes the index once during planning and caches the result, avoiding repeated evaluation during execution.
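A minimal sketch of constant folding over a toy expression tree (names are illustrative, not Catalyst's; the real rule walks the full logical plan and handles all foldable expression types):

```java
public class ConstantFoldingDemo {
    sealed interface Expr permits Lit, Col, Add { }
    record Lit(int v) implements Expr { }          // constant value
    record Col(String name) implements Expr { }    // non-constant: depends on the row
    record Add(Expr l, Expr r) implements Expr { }

    // Fold bottom-up: any subtree whose operands are all literals is
    // evaluated once at "planning" time and replaced by its value.
    static Expr fold(Expr e) {
        if (e instanceof Add a) {
            Expr l = fold(a.l()), r = fold(a.r());
            if (l instanceof Lit ll && r instanceof Lit rr) {
                return new Lit(ll.v() + rr.v()); // computed once, not per row
            }
            return new Add(l, r);
        }
        return e; // literals and columns are already in simplest form
    }

    public static void main(String[] args) {
        // col + (2 + 3)  folds to  col + 5
        Expr plan = new Add(new Col("x"), new Add(new Lit(2), new Lit(3)));
        System.out.println(fold(plan)); // Add[l=Col[name=x], r=Lit[v=5]]
    }
}
```

The mixed case above is the interesting one: the row-dependent column reference survives, while the constant subtree collapses to a single literal before execution begins.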
These optimizations collectively reduce memory usage, lower garbage‑collection pressure, and improve overall query performance in big‑data environments.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.