How Baidu Scaled Its Data Warehouse to Handle Billions of PVs and Petabytes
This article details Baidu APP's massive data‑warehouse overhaul, describing the two‑step strategy that stabilized log cleaning, modernized the ETL framework, introduced wide‑table architectures, and implemented tiered storage to dramatically improve processing speed, reliability, and cost efficiency for petabyte‑scale workloads.
Background
Baidu APP and its product matrix generate billions of page views daily and store over a hundred petabytes of data, causing severe pressure on the log‑cleaning stage, resource consumption, latency, and high storage cost.
Challenges
Log cleaning suffers from unstable processing, high cost, and fragmented storage. Real‑time and offline pipelines are tightly coupled, core and peripheral data are not isolated, and the legacy UDW/ETL framework (C++/MapReduce) lacks flexibility, fault tolerance, and scalability.
Two‑step Upgrade Strategy
Step 1 – Log‑Cleaning Stabilization
Increase scheduling granularity from hourly to 15‑minute intervals.
Pre‑parse over 100 high‑frequency event fields into flat, standardized columns.
Merge small files and apply ZSTD compression, reducing daily file count by 93 % and saving more than 420 TB of storage.
Step 2 – Data‑Warehouse Refactoring
Adopt a “wide‑table” Turing model while keeping a UDW buffer for smooth migration.
Introduce a dual‑track output: optimized UDW tables and hourly Turing tables.
Separate core and non‑core data, decouple real‑time and batch streams.
Upgrade computation framework to TM + Spark, improving fault tolerance and resource scheduling.
Additional Optimizations
AB‑Experiment Refactor
Extract experiment IDs into an independent dimension table, avoiding wide‑table bloat and cutting query latency.
Tiered Storage Governance
Implement a hot‑warm‑cold three‑level storage policy, automatically moving infrequently accessed historical data to cheaper storage while retaining hot data for fast access.
Outcome
The systematic upgrade dramatically improved data‑processing timeliness, stability, and cost efficiency, supporting Baidu’s massive traffic growth and providing a scalable foundation for future data‑driven innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
