Design, Optimization, and Practice of Baidu's Fusion Compute Engine for Data Warehouse
Baidu’s Fusion Compute Engine, built on Spark with a one‑layer wide‑table model, combines data‑skipping, push‑down, code‑generation, vectorization and extensive tuning to cut ad‑hoc query latency to seconds, shrink storage by ~30 %, and accelerate ETL workloads while maintaining stability for massive data‑warehouse workloads.
This article introduces the design principles, optimizations, and practical applications of Baidu's data‑warehouse fusion compute engine. It explains how, even as internet products iterate rapidly, a one‑layer wide‑table model can deliver query latency on the order of ten seconds.
Business Background : Internet companies generate massive data (hundreds of PB) across many product lines, requiring a stable and efficient compute engine for ad‑hoc analysis and ETL tasks.
Data Evolution and Engine Selection : The evolution from single‑machine analysis → MapReduce/Hive (disk‑based, minutes) → Spark (memory‑based, seconds) led to the selection of Spark SQL for its Hive compatibility, large‑scale join performance, columnar storage support, and UDF extensibility.
Problems Faced : High query latency, large storage consumption, redundant tables, and increasing data‑driven business demands.
Technical Solution :
A fusion compute engine built on Apache Spark, consisting of a WebServer, a Master, and Workers. The Worker is a customized build of the Spark source code (secondary development) and keeps a resident Container for resource reuse.
Performance optimizations: data skipping, push‑down, code generation, vectorization, and extensive tuning.
ETL support: The engine naturally handles routine ETL workloads, supporting single‑statement, multi‑statement, and complex SparkSQL syntax.
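The data skipping and push‑down named among the optimizations can be illustrated with a small sketch. The Python below is not the engine's code. It assumes a hypothetical layout in which each storage block keeps min/max statistics for one column, and shows how a range predicate pushed down to the scan lets whole blocks be skipped without reading their rows. The names Block, build_blocks, and scan_with_skipping are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical storage block: rows of one column plus min/max statistics."""
    min_val: int
    max_val: int
    rows: List[int]


def build_blocks(values: List[int], block_size: int) -> List[Block]:
    """Split values into fixed-size blocks and record min/max for each block."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(min(chunk), max(chunk), chunk))
    return blocks


def scan_with_skipping(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Evaluate lo <= v <= hi at scan time.

    Returns the matching rows and the number of blocks whose rows were read.
    """
    matched: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Data skipping: the statistics prove no row in this block can match.
        if block.max_val < lo or block.min_val > hi:
            continue
        blocks_read += 1
        # Pushed-down predicate: filter rows while scanning the block.
        matched.extend(v for v in block.rows if lo <= v <= hi)
    return matched, blocks_read


# Assumption: the column is sorted, so each block covers a narrow value range.
values = list(range(1000))
blocks = build_blocks(values, block_size=100)
rows, blocks_read = scan_with_skipping(blocks, lo=250, hi=260)
print(len(rows), blocks_read, len(blocks))
```

With sorted or clustered data the statistics are tight, so this narrow filter reads one block out of ten. On randomly ordered data most blocks would overlap the requested range and little would be skipped, which is why data layout matters as much as the skipping logic itself.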
Performance Results : Compared with stock Spark, ad‑hoc queries run five times faster and complete in tens of seconds, and ETL jobs use 20% fewer resources while running four times faster. Overall storage is reduced by ~30%, and query performance improves by ~300%.
Conclusion : The fusion compute engine combined with a one‑layer wide‑table model greatly enhances data‑driven business efficiency, reduces storage costs, and delivers high‑performance, stable query processing for massive data warehouses.
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.