Baidu’s Secret to Faster Search Data: Wide‑Table Modeling & Fusion Engine
This article outlines Baidu’s innovative approach to building its search data platform, detailing the design of wide‑table models, the upgrade to a Spark‑based fusion computation engine, and the new Turing 3.0 service delivery framework, which together deliver higher efficiency, lower cost, and faster, more reliable analytics.
Overview
Baidu rebuilt its search data platform along three main directions: wide‑table model design, computation‑engine optimization, and a next‑generation service delivery model (Turing 3.0). Together these address the shortcomings of a traditional layered data warehouse in search scenarios, delivering high efficiency, stability, and low cost.
Key Innovations
Wide‑Table Model
A subject‑oriented wide‑table model keeps the ODS and DWD layers at the same granularity and integrates all downstream fields, dimensions, and metrics into a single table. This eliminates cross‑layer redundancy, unifies metric definitions, and supports multi‑dimensional analysis across business needs.
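The core idea can be sketched in a few lines: records from different sources that share the same granularity key are merged into one wide row, so every consumer reads a single table. This is an illustrative sketch, not Baidu's actual pipeline; the field names (device, clicks, and so on) are hypothetical.

```python
# Illustrative sketch of wide-table assembly: join dimension and metric
# sources at the same granularity (one row per query event) into one wide row.
# All field names are hypothetical.

def build_wide_rows(dims_by_key, metrics_by_key):
    """Merge dimension and metric records that share the same granularity key."""
    wide = {}
    for key, dims in dims_by_key.items():
        row = dict(dims)                          # dimensions (device, region, ...)
        row.update(metrics_by_key.get(key, {}))   # metrics (clicks, latency, ...)
        wide[key] = row
    return wide

dims = {"q1": {"device": "mobile", "region": "beijing"}}
metrics = {"q1": {"clicks": 3, "latency_ms": 120}}
rows = build_wide_rows(dims, metrics)
```

Because the join happens once, upstream, every downstream analysis sees the same unified field definitions instead of re-deriving them per report.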
Fusion Computation Engine
The legacy C++ MapReduce (UPI) framework was replaced with a Spark‑based fusion engine that reuses resources through a long‑lived application context, writes Parquet directly without extra ETL scripts, and shortens job startup. The upgrade cut ETL processing from 40 minutes to 10 and improved resource utilization by roughly 20%.
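The resource-reuse argument reduces to simple arithmetic: a process-per-job framework pays the runtime startup cost on every job, while a long-lived context pays it once. The toy model below makes that explicit; the cost numbers are purely illustrative, not measurements from the article.

```python
# Toy cost model for the long-lived-context idea. A per-job framework pays
# startup on every submission; a shared context amortizes it across jobs.
# STARTUP_COST and JOB_COST are illustrative numbers, not real measurements.

STARTUP_COST = 30   # e.g., seconds to allocate executors / warm the runtime
JOB_COST = 5        # per-job compute cost

def per_job_total(n_jobs):
    """Each job starts its own runtime: startup is paid n times."""
    return n_jobs * (STARTUP_COST + JOB_COST)

def long_lived_total(n_jobs):
    """One shared context: startup is paid once, jobs reuse executors."""
    return STARTUP_COST + n_jobs * JOB_COST

print(per_job_total(10))     # 350
print(long_lived_total(10))  # 80
```

The gap widens with job count, which is why the win is largest for pipelines made of many short stages.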
New Service Delivery (Turing 3.0)
Turing 3.0 integrates three products—Turing Data Engine (TDE), Turing Data Studio (TDS), and Turing Data Analysis (TDA)—to form a unified development paradigm. Data sets become the core artifact, enabling a closed loop of data set ↔ visual analysis ↔ dashboard, reducing delivery cycles from weeks to days and empowering self‑service analytics.
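The "data set as core artifact" pattern can be sketched as a single registered definition with many consumers: redefine the data set once and every attached analysis and dashboard sees the change. The class and field names below are hypothetical, not the actual TDE/TDS/TDA APIs.

```python
# Sketch of a dataset-centric closed loop: one data set definition feeds both
# ad-hoc analyses and dashboards, so a schema change propagates to all
# consumers. Names are hypothetical, not real Turing 3.0 interfaces.

class DataSet:
    def __init__(self, name, fields):
        self.name = name
        self.fields = fields      # unified dimension/metric definitions
        self.consumers = []       # analyses and dashboards built on this set

    def attach(self, consumer):
        self.consumers.append(consumer)

    def redefine(self, fields):
        """Change the definition once; every consumer reads the new schema."""
        self.fields = fields

ds = DataSet("search_wide", ["query", "clicks"])
ds.attach("retention_analysis")
ds.attach("exec_dashboard")
ds.redefine(["query", "clicks", "latency_ms"])
```

Centralizing the definition is what shortens delivery: adding a metric is one edit to the data set rather than one edit per report.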
Performance Gains
Ad‑hoc query latency fell from tens of seconds to a few seconds (≈5× speedup).
Flattening complex (nested) fields improved query performance by about 2.1×.
Parquet columnar storage with ZSTD compression and bucket sorting cut storage by ~30% and improved I/O efficiency.
MERGE INTO on Iceberg reduced back‑fill time by ~54% compared with INSERT OVERWRITE.
The warehouse shrank from hundreds of tables to ~20, with a 30% reduction in storage and a 30% drop in operational cost.
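The Merge‑Into gain comes from updating only the rows that changed rather than rewriting entire partitions, as INSERT OVERWRITE does. A generic Iceberg‑style statement looks like the sketch below; the table and column names are made up for illustration, and in Spark the string would be passed to spark.sql(...).

```python
# Hypothetical Iceberg-style back-fill via MERGE INTO: only matched rows are
# updated, instead of rewriting whole partitions as INSERT OVERWRITE would.
# Table and column names are illustrative only.

merge_sql = """
MERGE INTO dwd.search_wide t
USING staging.search_fix s
  ON t.query_id = s.query_id AND t.dt = s.dt
WHEN MATCHED THEN UPDATE SET t.clicks = s.clicks
WHEN NOT MATCHED THEN INSERT *
"""

print(merge_sql.strip())
```

Because a back-fill typically touches a small fraction of rows, row-level merge avoids most of the write amplification of partition overwrites, which is consistent with the ~54% time reduction reported above.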
Future Outlook
The team plans to extend the platform with generic data‑flow solutions, automated logging (including no‑code tracing), abstracted wide‑table model layers, and AI‑assisted development to further accelerate data‑driven product iteration.
