
Design, Optimization, and Practice of Baidu's Fusion Compute Engine for Data Warehouse

Baidu’s Fusion Compute Engine, built on Spark with a one‑layer wide‑table model, combines data‑skipping, push‑down, code‑generation, vectorization and extensive tuning to cut ad‑hoc query latency to tens of seconds, shrink storage by ~30%, and accelerate ETL workloads while maintaining stability for massive data‑warehouse workloads.

Baidu Tech Salon

This article introduces the overall design principles, optimizations, and practical applications of Baidu's data‑warehouse fusion compute engine. It explains how, while internet products iterate rapidly, a one‑layer wide‑table model can deliver query latency on the order of ten seconds.

Business Background : Internet companies generate massive data (hundreds of PB) across many product lines, requiring a stable and efficient compute engine for ad‑hoc analysis and ETL tasks.

Data Evolution and Engine Selection : The evolution from single‑machine analysis → MapReduce/Hive (disk‑based, minutes) → Spark (memory‑based, seconds) led to the selection of Spark SQL for its Hive compatibility, large‑scale join performance, columnar storage support, and UDF extensibility.

Problems Faced : High query latency, large storage consumption, redundant tables, and increasing data‑driven business demands.

Technical Solution :

Design of a fusion compute engine built on Apache Spark, consisting of a WebServer, a Master, and Workers. The Worker is a customized build of the Spark source code and keeps a resident Container so that resources are reused across queries.
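
The resident Container idea can be illustrated with a minimal sketch. This is not the engine's code: the class names `ResidentWorker` and `Master`, the `submit` method, and the round-robin dispatch are all hypothetical, chosen only to show that a long-lived container pays its startup cost once per Worker rather than once per query.

```python
import itertools


class ResidentWorker:
    """Hypothetical stand-in for a Worker that holds a long-lived container."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.container_starts = 0   # how many times the container was launched
        self.queries_run = 0
        self._container_ready = False

    def _ensure_container(self):
        # Launch the container only on first use; later queries reuse it.
        if not self._container_ready:
            self.container_starts += 1
            self._container_ready = True

    def run(self, sql):
        self._ensure_container()
        self.queries_run += 1
        return f"worker-{self.worker_id} ran: {sql}"


class Master:
    """Hypothetical Master that dispatches queries round-robin (an assumption)."""

    def __init__(self, workers):
        self.workers = workers
        self._next = itertools.cycle(workers)

    def submit(self, sql):
        return next(self._next).run(sql)


workers = [ResidentWorker(i) for i in range(2)]
master = Master(workers)
results = [master.submit(f"SELECT {n}") for n in range(6)]
total_starts = sum(w.container_starts for w in workers)
print(total_starts, "container starts for", len(results), "queries")
```

In this toy run, six queries trigger only two container launches, one per Worker. A design that started a fresh container for every query would pay that cost six times.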

Performance optimizations: data skipping, push‑down, code generation, vectorization, and extensive tuning.
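
Data skipping, one of the optimizations the article names, can be sketched in a few lines. The sketch assumes each storage block carries min/max statistics for a column, which is a common way to implement the technique. The article does not say how the engine implements it, so `Block`, `make_block`, and `scan_greater_than` are hypothetical names used for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A chunk of column values plus the min/max statistics stored with it."""
    values: list
    min_value: int
    max_value: int


def make_block(values):
    return Block(values=values, min_value=min(values), max_value=max(values))


def scan_greater_than(blocks, threshold):
    """Return values above `threshold` and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        # Skip: if the block's max is not above the threshold, no row can match.
        if block.max_value <= threshold:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if v > threshold)
    return matches, blocks_read


blocks = [make_block([1, 5, 9]), make_block([12, 18, 20]), make_block([25, 31, 40])]
matches, blocks_read = scan_greater_than(blocks, threshold=19)
print(matches, blocks_read)  # [20, 25, 31, 40] 2
```

The first block is never read because its maximum (9) rules out any match. On sorted or clustered data, the same check can eliminate most blocks before any values are decoded.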

ETL support: The engine naturally handles routine ETL workloads, supporting single‑statement, multi‑statement, and complex SparkSQL syntax.
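
As a rough illustration of multi‑statement support, the sketch below splits a script into individual statements and runs them in order. It is an assumption about how such handling could work, not the engine's implementation. The helpers `split_statements` and `run_script` and the table names are hypothetical, and the splitter only understands single‑quoted strings (no comments, backticks, or double quotes).

```python
def split_statements(script):
    """Split a SQL script on semicolons that are outside single-quoted strings."""
    statements, current, in_quote = [], [], False
    for ch in script:
        if ch == "'":
            in_quote = not in_quote
        if ch == ";" and not in_quote:
            stmt = "".join(current).strip()
            if stmt:
                statements.append(stmt)
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


def run_script(script, execute):
    """Run each statement in order; `execute` is a caller-supplied callable."""
    return [execute(stmt) for stmt in split_statements(script)]


# Hypothetical two-statement ETL script; the table names are made up.
script = """
INSERT OVERWRITE TABLE daily_stats SELECT dt, COUNT(*) FROM events GROUP BY dt;
SELECT 'a;b' AS label;
"""
executed = run_script(script, execute=lambda stmt: stmt.upper())
print(len(executed))  # 2
```

In a PySpark session, `execute` could be `spark.sql`; here a plain lambda stands in so the sketch runs without a cluster.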

Performance Results : Compared with ordinary Spark, ad‑hoc queries run five times faster (completing in tens of seconds), and ETL jobs use 20% fewer resources and run four times faster. Overall storage is reduced by ~30%, and query performance improves by ~300%.

Conclusion : The fusion compute engine combined with a one‑layer wide‑table model greatly enhances data‑driven business efficiency, reduces storage costs, and delivers high‑performance, stable query processing for massive data warehouses.

Written by

Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
