Big Data 25 min read

How Baidu Scaled Its Data Warehouse to Handle Billions of PVs and Petabytes

This article details Baidu APP's massive data‑warehouse overhaul, describing the two‑step strategy that stabilized log cleaning, modernized the ETL framework, introduced wide‑table architectures, and implemented tiered storage to dramatically improve processing speed, reliability, and cost efficiency for petabyte‑scale workloads.

Baidu Geek Talk
Baidu Geek Talk
Baidu Geek Talk
How Baidu Scaled Its Data Warehouse to Handle Billions of PVs and Petabytes

Background

Baidu APP and its product matrix generate billions of page views daily and store over a hundred petabytes of data, causing severe pressure on the log‑cleaning stage, resource consumption, latency, and high storage cost.

Challenges

Log cleaning suffers from unstable processing, high cost, and fragmented storage. Real‑time and offline pipelines are tightly coupled, core and peripheral data are not isolated, and the legacy UDW/ETL framework (C++/MapReduce) lacks flexibility, fault tolerance, and scalability.

Two‑step Upgrade Strategy

Step 1 – Log‑Cleaning Stabilization

Increase scheduling granularity from hourly to 15‑minute intervals.

Pre‑parse over 100 high‑frequency event fields into flat, standardized columns.

Merge small files and apply ZSTD compression, reducing daily file count by 93 % and saving more than 420 TB of storage.

Step 2 – Data‑Warehouse Refactoring

Adopt a “wide‑table” Turing model while keeping a UDW buffer for smooth migration.

Introduce a dual‑track output: optimized UDW tables and hourly Turing tables.

Separate core and non‑core data, decouple real‑time and batch streams.

Upgrade computation framework to TM + Spark, improving fault tolerance and resource scheduling.

Additional Optimizations

AB‑Experiment Refactor

Extract experiment IDs into an independent dimension table, avoiding wide‑table bloat and cutting query latency.

Tiered Storage Governance

Implement a hot‑warm‑cold three‑level storage policy, automatically moving infrequently accessed historical data to cheaper storage while retaining hot data for fast access.

Outcome

The systematic upgrade dramatically improved data‑processing timeliness, stability, and cost efficiency, supporting Baidu’s massive traffic growth and providing a scalable foundation for future data‑driven innovations.

Performance optimizationbig dataData Warehousescalable architectureETL
Baidu Geek Talk
Written by

Baidu Geek Talk

Follow us to discover more Baidu tech insights.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.