Spark SQL Expression Optimizations: LIKE ALL/ANY, TRIM Function Improvements, and Constant Folding
This article examines Spark SQL expression-level optimizations: redesigning LIKE ALL and LIKE ANY to reduce memory and stack usage, refactoring the TRIM functions for better code reuse and performance, and applying constant folding so constant expressions are computed once rather than per row, thereby improving query efficiency in big-data workloads.
The presentation introduces several Spark SQL expression optimizations aimed at improving query efficiency and reducing resource consumption in large‑scale data processing.
Big Data Product Overview
CyberEngine, the core of the DataCyber platform, provides a unified metadata service and supports resource managers such as Kubernetes and YARN. It integrates multiple machine-learning frameworks and streaming/batch engines (e.g., Spark, Flink), and offers its own SQL engine (CyberSQL) and scheduler (CyberScheduler).
The focus of the session is on Spark‑related performance and stability enhancements within CyberEngine, with similar stability work also applied to Flink.
Spark SQL Course Recap
Previous lectures covered Spark core principles, the Spark SQL execution pipeline, and detailed analyses of the parsing and analysis layers, setting the stage for today’s discussion on expression‑level optimizations.
Analysis Layer Principles
Expressions are not created during the parsing stage; they are generated in the analysis layer, when the analyzer converts the unresolved logical plan into a resolved, fully typed logical plan using components such as the Catalog Manager and the function registry.
LIKE ALL and LIKE ANY Optimization
In SQL, LIKE performs pattern matching. LIKE ALL requires a value to match every pattern, so evaluation can stop at the first miss; LIKE ANY succeeds if the value matches any one pattern, so evaluation can stop at the first hit. Both therefore allow short-circuit evaluation.
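These semantics can be illustrated with a small, self-contained sketch (class and method names here are illustrative, not Spark's; the LIKE-to-regex translation is simplified and does not escape regex metacharacters):

```java
import java.util.List;

public class LikeDemo {
    // Translate a SQL LIKE pattern into a regex: % -> .*, _ -> .
    // (Simplified: assumes patterns contain no regex metacharacters.)
    static boolean like(String value, String pattern) {
        String regex = pattern.replace("%", ".*").replace("_", ".");
        return value.matches(regex);
    }

    // LIKE ALL: every pattern must match; allMatch stops at the first miss.
    static boolean likeAll(String value, List<String> patterns) {
        return patterns.stream().allMatch(p -> like(value, p));
    }

    // LIKE ANY: one match suffices; anyMatch stops at the first hit.
    static boolean likeAny(String value, List<String> patterns) {
        return patterns.stream().anyMatch(p -> like(value, p));
    }

    public static void main(String[] args) {
        List<String> patterns = List.of("%oo%", "f%");
        System.out.println(likeAll("foo", patterns)); // true: matches both
        System.out.println(likeAny("boo", patterns)); // true: matches %oo%
        System.out.println(likeAll("boo", patterns)); // false: misses f%
    }
}
```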
Spark SQL originally implemented these operators by expanding them into separate LIKE expression objects combined with AND (for LIKE ALL) or OR (for LIKE ANY). With many patterns, this produces deep expression trees, high memory consumption, and frequent garbage collection.
The new design introduces a base class LikeAllBase that stores all patterns in a single sequence (using UTF8String for efficient string handling). Its eval method checks for nulls and then iterates over the patterns, avoiding the creation of many intermediate expression objects and keeping the expression tree, and therefore the evaluation stack, shallow.
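The flattened design can be sketched as follows (a simplified, hypothetical rendering: the real LikeAllBase operates on UTF8String and Spark's internal row format, and compiles LIKE patterns through Spark's own escaping logic):

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch of a flattened LIKE ALL expression: one object holds every
// pattern, instead of a deep And(Like, And(Like, ...)) tree.
public class LikeAllExpr {
    private final List<Pattern> compiled; // compiled once, reused for every row

    public LikeAllExpr(List<String> likePatterns) {
        this.compiled = likePatterns.stream()
                .map(LikeAllExpr::toRegex)
                .toList();
    }

    // % -> any run of characters, _ -> any single character.
    static Pattern toRegex(String p) {
        StringBuilder sb = new StringBuilder();
        for (char c : p.toCharArray()) {
            if (c == '%') sb.append(".*");
            else if (c == '_') sb.append('.');
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // Null-aware evaluation: a NULL input yields NULL (SQL three-valued logic).
    public Boolean eval(String value) {
        if (value == null) return null;
        for (Pattern p : compiled) {
            if (!p.matcher(value).matches()) return false; // short-circuit
        }
        return true;
    }

    public static void main(String[] args) {
        LikeAllExpr e = new LikeAllExpr(List.of("%spark%", "%sql%"));
        System.out.println(e.eval("spark sql engine")); // true
        System.out.println(e.eval("flink engine"));     // false
        System.out.println(e.eval(null));               // null
    }
}
```

The key point is that one loop over a plain sequence replaces a chain of nested expression objects, so memory use and stack depth stay constant no matter how many patterns appear.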
TRIM Function Improvements
Both the standard TRIM and its variants LTRIM (trim leading characters) and RTRIM (trim trailing characters) share similar logic. The refactor extracts the common logic into a parent class, String2TrimExpression, eliminating duplicated code across the three functions.
Additionally, the refactor consolidates the code-generation path, so a single helper emits flat Java code for the trimming logic, reducing method-call overhead and improving JVM performance.
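The shared-parent idea can be sketched like this (illustrative names and a plain-String signature; Spark's actual String2TrimExpression works on UTF8String and also centralizes code generation):

```java
public class TrimDemo {
    // Shared parent: the character-scanning logic lives here exactly once.
    abstract static class String2Trim {
        final String trimChars;
        String2Trim(String trimChars) { this.trimChars = trimChars; }
        abstract String eval(String input);

        // Index of the first character that is NOT in the trim set.
        int firstKept(String s) {
            int i = 0;
            while (i < s.length() && trimChars.indexOf(s.charAt(i)) >= 0) i++;
            return i;
        }
        // Index of the last character that is NOT in the trim set.
        int lastKept(String s) {
            int i = s.length() - 1;
            while (i >= 0 && trimChars.indexOf(s.charAt(i)) >= 0) i--;
            return i;
        }
    }

    // Each variant is now a thin subclass combining the shared helpers.
    static class TrimBoth extends String2Trim {
        TrimBoth(String t) { super(t); }
        String eval(String s) {
            int lo = firstKept(s), hi = lastKept(s);
            return lo > hi ? "" : s.substring(lo, hi + 1);
        }
    }
    static class TrimLeft extends String2Trim {
        TrimLeft(String t) { super(t); }
        String eval(String s) { return s.substring(firstKept(s)); }
    }
    static class TrimRight extends String2Trim {
        TrimRight(String t) { super(t); }
        String eval(String s) { return s.substring(0, lastKept(s) + 1); }
    }

    public static void main(String[] args) {
        System.out.println("[" + new TrimBoth(" ").eval("  spark  ") + "]");  // [spark]
        System.out.println("[" + new TrimLeft(" ").eval("  spark  ") + "]");  // [spark  ]
        System.out.println("[" + new TrimRight(" ").eval("  spark  ") + "]"); // [  spark]
    }
}
```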
Constant Folding
Constant folding evaluates constant expressions (e.g., 1 + 2) during the query-optimization phase, replacing them with their computed values so they are not recomputed for each row.
In Spark SQL, this optimization applies to functions like array_insert. When the index argument is a constant expression such as 2 + 3, Spark computes the index once during planning and caches the result, avoiding repeated evaluation during execution.
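A minimal sketch of constant folding over a toy expression tree (names are illustrative, not Catalyst's; the real rule walks the full logical plan and handles all foldable expression types):

```java
public class ConstantFoldingDemo {
    sealed interface Expr permits Lit, Col, Add { }
    record Lit(int v) implements Expr { }          // constant value
    record Col(String name) implements Expr { }    // non-constant: depends on the row
    record Add(Expr l, Expr r) implements Expr { }

    // Fold bottom-up: any subtree whose operands are all literals is
    // evaluated once at "planning" time and replaced by its value.
    static Expr fold(Expr e) {
        if (e instanceof Add a) {
            Expr l = fold(a.l()), r = fold(a.r());
            if (l instanceof Lit ll && r instanceof Lit rr) {
                return new Lit(ll.v() + rr.v()); // computed once, not per row
            }
            return new Add(l, r);
        }
        return e; // literals and columns are already in simplest form
    }

    public static void main(String[] args) {
        // col + (2 + 3)  folds to  col + 5
        Expr plan = new Add(new Col("x"), new Add(new Lit(2), new Lit(3)));
        System.out.println(fold(plan)); // Add[l=Col[name=x], r=Lit[v=5]]
    }
}
```

The mixed case above is the interesting one: the row-dependent column reference survives, while the constant subtree collapses to a single literal before execution begins.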
These optimizations collectively reduce memory usage, lower garbage‑collection pressure, and improve overall query performance in big‑data environments.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.