
How Baidu’s New Search Data Warehouse Architecture Boosts Performance by 5×

This article explains how Baidu’s search data team redesigned its data warehouse with wide‑table modeling, Parquet columnar storage, and a Spark‑ClickHouse fusion engine, eliminating redundancy, cutting query latency from minutes to seconds, and enabling self‑service analytics for thousands of users.

Architect

Jul 7, 2025

Background and Problem

Rapid business iteration and massive data generation in internet products create challenges such as data silos, redundant storage, and slow query performance. These problems are especially acute in Baidu’s search business, where data volume runs to hundreds of PB.

Traditional Warehouse Limitations

The classic ODS → DWD → DWS → ADS layered model suffers from large tables, slow queries, storage redundancy, and inconsistent metrics.

New Architecture Overview

Baidu introduced the Turing ecosystem: TDS (data development & governance), TDA (visual BI), and TDE (Spark + ClickHouse fusion engine). The solution adopts wide‑table + dataset modeling, Parquet columnar storage with ZSTD compression, and a unified compute engine to replace the legacy C++ MapReduce (UPI) framework.

Modeling Approach

For each business theme a wide table is built by flattening nested fields, keeping ODS/DWD granularity, and integrating all required dimensions and metrics. This reduces table count, eliminates redundancy, and aligns data definitions with business needs.
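As a rough illustration of the flattening step, the sketch below turns one nested record into a single wide row while keeping its original granularity. The event shape, the field names, and the `flatten` helper are hypothetical and are not taken from Baidu's pipeline.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "", sep: str = "_") -> Dict[str, Any]:
    """Recursively turn nested dicts into a single-level dict of columns."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        # Build the wide-table column name from the path to the field.
        column = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hypothetical search-log event: one record per query (detail-level granularity).
event = {
    "query_id": "q-001",
    "user": {"id": 42, "region": "beijing"},
    "result": {"clicks": 3, "latency_ms": 120},
}

# Still one row per query, but every dimension and metric is a top-level column.
wide_row = flatten(event)
print(wide_row)
```

Because each nested field becomes its own column, a columnar format can read only the columns a query touches, which is one reason wide tables pair well with Parquet.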

Compute Engine Upgrade

The legacy C++ MR framework relied on disk‑based shuffle and was unstable. Spark provides in‑memory DAG execution, multi‑threading, and higher resource efficiency. With the fusion engine, ETL jobs drop from 40 min to 10 min, and ad‑hoc queries run up to 5× faster.

Key Optimizations

Parquet columnar storage with bucket‑sort and ZSTD achieves a high compression ratio and enables data skipping.

Merge‑Into replaces insert‑overwrite for incremental updates, cutting data backfill time by ~54%.

Data reordering, combined with RLE and Delta encoding, lowers storage cost by ~20%.

Automatic field‑frequency statistics and pruning reduce resource consumption by 50%.
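To see why sorting helps both compression and data skipping, the sketch below uses a toy column in plain Python. It is not Parquet's implementation: the column values, the chunk size of 4, and the helper names are assumptions made for illustration, with fixed-size chunks standing in for row groups that carry min/max statistics.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(values: List[str]) -> List[Tuple[str, int]]:
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def chunk_stats(values: List[str], chunk_size: int) -> List[Tuple[str, str]]:
    """Record (min, max) per fixed-size chunk, similar in spirit to row-group statistics."""
    return [
        (min(values[i:i + chunk_size]), max(values[i:i + chunk_size]))
        for i in range(0, len(values), chunk_size)
    ]


def chunks_to_scan(stats: List[Tuple[str, str]], target: str) -> List[int]:
    """Return the chunk indexes whose [min, max] range could contain the target."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= target <= hi]


# Hypothetical low-cardinality column, in arrival order and then sorted.
unsorted_col = ["web", "app", "web", "api", "app", "web", "api", "app"]
sorted_col = sorted(unsorted_col)

# Sorting groups equal values together, so RLE needs far fewer runs.
print(len(rle_encode(unsorted_col)), "runs unsorted")
print(len(rle_encode(sorted_col)), "runs sorted")

# Sorting also narrows each chunk's min/max range, so a filter can skip chunks.
print(chunks_to_scan(chunk_stats(unsorted_col, 4), "web"))
print(chunks_to_scan(chunk_stats(sorted_col, 4), "web"))
```

In this toy case the unsorted column needs 8 runs and both chunks must be scanned for `"web"`, while the sorted column needs 3 runs and only the second chunk is read.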

Benefits

Together, the wide‑table model and the fusion engine deliver faster query response (seconds instead of minutes), consistent metric definitions, a 30% storage reduction, and a 2–3× improvement in data‑lineage maintenance efficiency.

Summary and Outlook

The new model shifts from a “demand‑delivery” approach to a dataset‑centric paradigm, enabling self‑service analytics for most users, shortening delivery cycles from weeks to days, and laying the foundation for future AI‑driven data engineering.
