ByteDance’s Core Optimization Practices on Spark SQL
ByteDance’s data warehouse team shares its core optimizations for Spark SQL: an architecture overview, Bucket Join enhancements, materialized columns and views, and shuffle stability and performance improvements. These are practical techniques that boost query efficiency and job reliability in large-scale big-data environments.
ByteDance data warehouse architecture leader Guo Jun presents the core optimization practices applied to Spark SQL across ByteDance’s product lines.
The team designs data‑warehouse architecture for almost all ByteDance products, supporting Spark SQL and Druid development and optimization.
The presentation is divided into three parts: an overview of Spark SQL architecture, engine‑level optimizations, and shuffle stability and performance enhancements.
Spark SQL Architecture: After a SQL statement is submitted, it is parsed into an Unresolved Logical Plan, enriched via the Catalog (typically Hive Metastore), resolved by the Analyzer, optimized by the Optimizer through rule‑based transformations, and finally turned into one or more Physical Plans by the Query Planner. A cost model selects the best Physical Plan, and Adaptive Execution can adjust plans at runtime.
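The rule-based optimization step can be pictured as repeatedly rewriting a plan tree. The toy sketch below (hypothetical, not Catalyst’s real API) represents a logical plan as nested tuples and applies a constant-folding rule bottom-up, the same shape of transformation the Optimizer performs:

```python
# Toy sketch of rule-based plan optimization (illustrative only, not
# Spark's actual Catalyst classes): a plan is a nested tuple, and a
# rule is a function applied bottom-up to every node of the tree.

def transform(plan, rule):
    """Apply `rule` bottom-up to every node of the plan tree."""
    if isinstance(plan, tuple):
        plan = tuple(transform(child, rule) for child in plan)
    return rule(plan)

def fold_constants(node):
    # ("add", 1, 2) -> 3, mimicking a constant-folding optimizer rule
    if (isinstance(node, tuple) and node[0] == "add"
            and all(isinstance(x, int) for x in node[1:])):
        return sum(node[1:])
    return node

plan = ("filter", ("gt", "col_a", ("add", ("add", 1, 2), 3)))
print(transform(plan, fold_constants))  # ("filter", ("gt", "col_a", 6))
```

In the real engine, many such rules run in batches until a fixed point before the Query Planner produces Physical Plans.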
Engine Optimizations – Bucket Join Improvements: The team identifies the high cost of Shuffle in Sort‑Merge Join and introduces four major improvements to Bucket Join: (1) Hive compatibility mode to align bucket files between Hive and Spark SQL; (2) support for bucket count multiples, enabling joins when bucket numbers are in a ratio; (3) degradation handling that falls back to non‑bucket joins for older partitions; and (4) superset support allowing joins when the join key set contains the bucket key set.
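Improvements (2) and (4) can be condensed into a single eligibility check. The sketch below is a hypothetical simplification (names are illustrative, not ByteDance’s implementation): a shuffle can be skipped when the two sides’ bucket counts are equal or one is a whole multiple of the other, and the join keys contain the bucket keys:

```python
# Hypothetical sketch of the bucket-join eligibility rules summarized
# above; not the actual ByteDance/Spark implementation.

def can_bucket_join(left_buckets, right_buckets, bucket_keys, join_keys):
    # (2) bucket-count multiple support: counts equal or in a whole ratio
    counts_ok = (left_buckets % right_buckets == 0
                 or right_buckets % left_buckets == 0)
    # (4) superset support: join keys must contain all bucket keys
    keys_ok = set(bucket_keys) <= set(join_keys)
    return counts_ok and keys_ok

print(can_bucket_join(1024, 512, ["user_id"], ["user_id", "date"]))  # True
print(can_bucket_join(1024, 768, ["user_id"], ["user_id"]))          # False
```

When the check fails (or an old partition lacks compatible bucketing, case (3)), the planner degrades to an ordinary shuffle-based Sort-Merge Join rather than failing the query.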
Materialized Columns: To avoid costly reads of nested columnar data (e.g., Map fields), a primitive column is added as a materialized field. During data ingestion the materialized value is generated automatically, and query rewriting transparently replaces nested field accesses with the primitive column, eliminating unnecessary I/O, enabling vectorized reads, filter push‑down, and reducing repeated JSON parsing.
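The query-rewrite idea can be sketched in a few lines. Column names here are invented for illustration; the point is only that a nested-field access is textually mapped to its pre-materialized flat column, so the reader never touches the expensive Map column:

```python
# Minimal sketch of materialized-column query rewriting (hypothetical
# column names; the real rewrite operates on the logical plan, not on
# SQL text).

# nested field access -> flat primitive column generated at ingestion
MATERIALIZED = {"props['os']": "props_os"}

def rewrite(sql):
    for nested, flat in MATERIALIZED.items():
        sql = sql.replace(nested, flat)
    return sql

q = "SELECT count(*) FROM events WHERE props['os'] = 'android'"
print(rewrite(q))  # SELECT count(*) FROM events WHERE props_os = 'android'
```

Because the rewritten predicate now references a plain primitive column, it benefits from vectorized reads and filter push-down in the columnar format.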
Materialized Views: Frequently used Group‑By/Aggregate queries are pre‑computed into materialized views, allowing Spark SQL to rewrite user queries to read from these views, thereby removing redundant computation.
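A minimal sketch of view matching, under the assumption that a view is keyed by its (table, group-by keys, aggregate) signature (names and the string-based matching are illustrative only):

```python
# Hypothetical sketch of materialized-view query rewriting: if a query's
# (table, group-by keys, aggregate) signature matches a registered view,
# read the pre-computed view instead of recomputing the aggregate.

VIEWS = {
    ("events", ("country",), "count(*)"): "mv_events_by_country",
}

def rewrite_to_view(table, group_keys, aggregate):
    view = VIEWS.get((table, tuple(group_keys), aggregate))
    if view:
        return f"SELECT * FROM {view}"          # hit: read the view
    # miss: fall back to the original aggregation
    keys = ", ".join(group_keys)
    return f"SELECT {keys}, {aggregate} FROM {table} GROUP BY {keys}"

print(rewrite_to_view("events", ["country"], "count(*)"))
# SELECT * FROM mv_events_by_country
```

The production version must also prove the view is fresh enough and that its grouping subsumes the query’s, which this toy lookup omits.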
Other Spark SQL Engine Optimizations: The talk also covers additional engine-level improvements (presented as diagrams in the original slides) that further enhance execution efficiency.
Shuffle Stability and Performance: The traditional mapper‑centric shuffle writes data locally, causing stability issues when a node fails and performance bottlenecks due to random reads. The team first tried writing shuffle outputs directly to HDFS, then introduced a hybrid approach in which shuffle data is uploaded to HDFS after the local write and read from HDFS on failure, covering more than 57% of shuffle data and improving overall job performance by 14% (18% for day‑scale jobs, 12% for hour‑scale jobs). They also propose a reducer‑centric shuffle service architecture that writes each reducer’s data to dedicated shuffle services, turning massive random I/O into sequential I/O and enabling storage‑compute separation.
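The reducer-centric layout is the key idea: instead of each mapper writing its own local file (forcing every reducer to perform small random reads across all mapper files), map outputs are grouped per reducer so each reducer reads one contiguous stream. A toy illustration of that regrouping, with purely illustrative data:

```python
# Toy illustration of the reducer-centric shuffle layout: map outputs,
# tagged with their target reducer, are appended to one stream per
# reducer, so each reducer later reads sequentially instead of probing
# every mapper's file. Illustrative only, not the real shuffle service.

def reducer_centric(map_outputs, num_reducers):
    streams = [[] for _ in range(num_reducers)]
    for mapper_parts in map_outputs:            # one entry per mapper
        for reducer_id, record in mapper_parts:
            streams[reducer_id].append(record)  # sequential append
    return streams

# two mappers, each emitting records for reducers 0 and 1
outputs = [[(0, "a"), (1, "b")], [(0, "c"), (1, "d")]]
print(reducer_centric(outputs, 2))  # [['a', 'c'], ['b', 'd']]
```

In the mapper-centric design, reading reducer 0’s data means one small read per mapper file; here it is a single sequential scan of stream 0, which is what makes remote storage (and storage-compute separation) practical.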
Results: The HDFS‑based solution and reducer‑centric service reduce stage retries, improve job stability, and deliver measurable performance gains.
Conclusion: The talk summarizes practical, production‑grade optimizations for Spark SQL that enhance query speed, resource utilization, and reliability in large‑scale big‑data workloads.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.