How Baidu’s Fusion Compute Engine Cuts Query Time to Seconds on Petabyte‑Scale Data
This article analyzes Baidu's fusion compute engine for its data warehouse, detailing its architecture, optimization techniques such as data skipping, Parquet column indexing, ProjectLimit and CodeGen, and demonstrates how these innovations reduce query latency to seconds while cutting storage costs by about 30% on multi‑petabyte workloads.
Data Landscape and Evolution of Analysis Engines
Internet companies generate hundreds of petabytes of data daily across multiple product lines. Analysts, product managers, operations staff and data engineers all require fast, stable compute for ad‑hoc queries and ETL jobs. The industry progressed from single‑node TB‑scale analysis to MapReduce/Hive (PB‑scale, minutes per query) and finally to Apache Spark (hundreds of PB, seconds per query). Baidu evaluated engines on Hive compatibility, large‑join performance, columnar storage, concurrency and overall suitability, selecting Spark SQL for its strong Hive ecosystem integration and wide‑table columnar support.
Business Trends and Challenges
Rapid product iteration and data‑driven decision making increase cross‑business analysis needs. Daily ad‑hoc workloads now touch tens of petabytes and ETL workloads are of similar scale. Users are shifting from data engineers to analysts and product/operations staff, demanding lower query latency and simpler usage.
Fusion Compute Engine Overview
Architecture
The Fusion Compute Engine is Baidu’s internally built SQL engine on top of Apache Spark. It consists of three components:
WebServer : receives SQL requests and forwards them to the scheduler.
Master : coordinates job planning, resource allocation and fault tolerance.
Worker : Spark‑based executors wrapped in a custom container that enables long‑lived resource reuse.
Performance Optimizations
Data Skipping (Reducing I/O)
Partition Skipping : Only partitions that satisfy the query predicate are read, providing the largest speedup.
Parquet Columnar Storage : Wide tables (500‑1300 columns, typically <15 columns queried) benefit from column‑level encoding and compression.
RowGroup Statistics : Parquet footers store min/max, sum and Bloom filter per RowGroup, enabling predicate push‑down to skip irrelevant groups.
Parquet ColumnIndex (Spark 3.2+) : Page‑level min/max filtering after sorting data reduces I/O dramatically; in production this cut I/O from TB to GB and lowered query time by 43% for a daily increase of 3 trillion rows.
Compute Speed Enhancements
ProjectLimit Push‑Down : For queries with LIMIT, unnecessary shuffle (Exchange) stages are removed, turning minute‑level scans into second‑level responses (≈100× speedup).
CodeGen : Dynamic Java code generation with JIT compilation replaces interpreted execution for expressions and whole‑stage computation, delivering order‑of‑magnitude gains.
Vectorized Columnar Reads : Reduces virtual‑function overhead; Baidu replaced Spark’s default LIKE and JSON operators with hyperscan and simdJson, adding an extra ~12% query speedup.
Stability Tuning
Beyond Spark’s built‑in task‑level retry, Baidu tuned dozens of configuration parameters (heap/off‑heap memory, concurrency, shuffle, speculation, scheduling, serialization, file I/O) to achieve both high performance and robust production stability.
ETL Use Cases
Because the engine is Spark‑based, it naturally supports complex multi‑statement SparkSQL scripts, making it well‑suited for routine ETL pipelines.
Benefits and Performance Results
Unified query and ETL engine reduces learning curve and operational cost.
Supports ultra‑large tables (hundreds of PB) with daily growth of 3 trillion rows.
Ad‑hoc query latency drops to tens of seconds, roughly 5× faster than vanilla Spark.
ETL workloads see ~20% resource savings and 4× faster execution.
Overall storage reduced by ~30% and query performance improved by ~300% thanks to the fusion engine combined with a single wide‑table model.
Conclusion
The Fusion Compute Engine together with a one‑layer wide‑table model matches the needs of fast‑iteration, data‑driven businesses, delivering substantial efficiency gains. While storage and query performance improve dramatically, the approach increases engine pressure and data‑model maintenance costs, indicating ongoing optimization opportunities.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
