How Baidu’s New Search Data Warehouse Architecture Boosts Performance by 5×
This article explains how Baidu’s search data team redesigned its data warehouse with wide‑table modeling, Parquet columnar storage, and a Spark‑ClickHouse fusion engine, eliminating redundancy, cutting query latency from minutes to seconds, and enabling self‑service analytics for thousands of users.
Background and Problem
Rapid business iteration and massive data generation in internet products lead to data silos, redundant storage, and slow queries. The problem is especially acute for Baidu’s search business, where data volume reaches hundreds of PB.
Traditional Warehouse Limitations
The classic ODS → DWD → DWS → ADS layered model suffers from large tables, slow queries, storage redundancy, and inconsistent metrics.
New Architecture Overview
Baidu introduced the Turing ecosystem: TDS (data development & governance), TDA (visual BI), and TDE (Spark + ClickHouse fusion engine). The solution adopts wide‑table + dataset modeling, Parquet columnar storage with ZSTD compression, and a unified compute engine to replace the legacy C++ MapReduce (UPI) framework.
Modeling Approach
For each business theme a wide table is built by flattening nested fields, keeping ODS/DWD granularity, and integrating all required dimensions and metrics. This reduces table count, eliminates redundancy, and aligns data definitions with business needs.
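The flattening step can be illustrated with a minimal Python sketch. It is not Baidu's implementation: the `flatten` helper, the underscore naming rule, and the sample event fields are all assumptions made for illustration. It only shows how a nested log record becomes one row of a wide table while keeping the original record granularity.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts into a single-level dict of wide-table columns."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        # Column name is the path to the field, joined by the separator.
        column = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested structures so every leaf becomes a column.
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hypothetical search-log event; all field names are invented for illustration.
event = {
    "query": "weather",
    "user": {"id": 42, "region": "beijing"},
    "click": {"position": 3, "doc": {"id": "d-17", "type": "web"}},
}

wide_row = flatten(event)
print(wide_row)
```

Each nested leaf ends up as its own column (`user_region`, `click_doc_id`, and so on), so a columnar reader can fetch only the columns a query touches instead of parsing the whole nested record.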
Compute Engine Upgrade
The legacy C++ MapReduce (MR) framework relied on disk‑based shuffle and was prone to instability. Spark provides in‑memory DAG execution, multi‑threading, and higher resource efficiency. With the fusion engine, ETL jobs dropped from 40 min to 10 min (a 4× reduction), and ad‑hoc queries run up to 5× faster.
Key Optimizations
Parquet columnar storage with bucket sorting and ZSTD compression achieves high compression ratios and enables data skipping.
Merge‑Into replaces insert‑overwrite for incremental updates, cutting backfill (data back‑tracking) time by ~54%.
Row reordering combined with lightweight encodings (RLE, Delta) lowers storage cost by ~20%.
Automatic field‑frequency statistics and pruning reduce resource consumption by 50%.
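Why sorting helps both compression and data skipping can be shown with a small, self-contained Python sketch. It is a toy model, not Parquet's actual encoder or Baidu's configuration: the helper names, the chunk size of 4, and the sample columns are assumptions. The per-chunk (min, max) pairs only stand in for the statistics a columnar reader might consult.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(values: List[str]) -> List[Tuple[str, int]]:
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def chunk_stats(values: List[int], chunk_size: int) -> List[Tuple[int, int]]:
    """Return (min, max) per fixed-size chunk, a stand-in for row-group statistics."""
    return [
        (min(values[i:i + chunk_size]), max(values[i:i + chunk_size]))
        for i in range(0, len(values), chunk_size)
    ]


def chunks_to_scan(stats: List[Tuple[int, int]], target: int) -> List[int]:
    """Indexes of chunks whose [min, max] range could contain the target value."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= target <= hi]


# Hypothetical column data, invented for illustration.
regions = ["bj", "sh", "bj", "gz", "sh", "bj", "gz", "sh"]
user_ids = [7, 2, 9, 4, 1, 8, 3, 6]

# Sorting groups equal values together, so RLE needs fewer runs.
unsorted_runs = rle_encode(regions)
sorted_runs = rle_encode(sorted(regions))

# Sorting also narrows each chunk's value range, so more chunks can be skipped.
unsorted_stats = chunk_stats(user_ids, chunk_size=4)
sorted_stats = chunk_stats(sorted(user_ids), chunk_size=4)

print(len(unsorted_runs), len(sorted_runs))
print(chunks_to_scan(unsorted_stats, 3), chunks_to_scan(sorted_stats, 3))
```

In this toy data the unsorted column needs one run per value, while the sorted column collapses into three runs. A lookup for `user_id == 3` has to read both chunks of the unsorted column but only the first chunk of the sorted one.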
Benefits
Wide‑table + fusion engine delivers faster query response (seconds vs minutes), consistent metric definitions, 30% storage reduction, and a 2–3× improvement in data‑lineage maintenance efficiency.
Summary and Outlook
The new model shifts from “demand‑delivery” to a data‑set‑centric paradigm, enabling self‑service analytics for most users, shortening delivery cycles from weeks to days, and laying the foundation for future AI‑driven data engineering.
