How Baidu Feed Evolved Its Data Warehouse with Multi‑Version Wide Tables
This article traces the evolution of Baidu's Feed data warehouse stage by stage: from traditional layered modeling to hour‑level core tables, then real‑time wide tables, and finally a flow‑batch integrated multi‑version wide‑table architecture. It covers the motivations, design choices, challenges, and resulting benefits.
Feed is the personalized recommendation stream in the Baidu App, aggregating articles, videos, and image collections. As the business grew, the mobile ecosystem data team needed to redesign the Feed data warehouse to reduce redundancy, simplify query logic, and improve efficiency.
Background
Initially, the warehouse followed a classic ODS → DWD → DWS → ADS layered model with additional dimension tables. Data was scattered across dozens of tables totaling nearly 50 PB, which made extraction expensive and downstream joins complex.
Stage 1: Hour‑Level Core Table + Topic Wide Table
To address high latency and complexity, the team built a 15‑minute streaming batch log table (log_qi) and an hour‑level detailed wide table (log_hi) using the internal TM streaming framework. Topic wide tables were also created as intermediate tables that aggregate various dimensions, aligning with the DWD and ADS layers.
15‑minute batch log (log_qi) parses raw Feed logs and embeds simple business logic.
Hour‑level detail table (log_hi) contains richer business logic for external services.
Topic wide tables serve as intermediate aggregation layers.
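To make the relationship between the 15‑minute and hour‑level tables concrete, here is a minimal Python sketch. It assumes, which the article does not state, that an hourly log_hi partition is built only after all four 15‑minute log_qi batches for that hour have landed. The function names are hypothetical and are not part of the TM framework.

```python
from datetime import datetime, timedelta

QUARTER = timedelta(minutes=15)


def quarter_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 15-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


def hour_is_complete(landed: set[datetime], hour: datetime) -> bool:
    """True once all four 15-minute batches of `hour` have landed (assumed rule)."""
    hour = hour.replace(minute=0, second=0, microsecond=0)
    return all(hour + i * QUARTER in landed for i in range(4))


# Three of the four quarter-hour batches for 10:00 have landed so far.
landed = {quarter_start(datetime(2024, 1, 1, 10, m)) for m in (3, 17, 31)}
print(hour_is_complete(landed, datetime(2024, 1, 1, 10)))  # False

# The 10:45 batch arrives, so the hourly table could now be built.
landed.add(quarter_start(datetime(2024, 1, 1, 10, 59)))
print(hour_is_complete(landed, datetime(2024, 1, 1, 10)))  # True
```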
Stage 2: Real‑Time Wide Table
The business needed near‑real‑time data for experiment validation and monitoring. This prompted the creation of a real‑time wide table (log_5mi) whose schema mirrors the hour‑level table but embeds complex logic, enabling fast, low‑cost access to real‑time data.
Stage 3: Flow‑Batch Integrated Multi‑Version Wide Table
After building hour‑level, topic, and real‑time tables, new business requirements exposed issues such as inconsistent metrics between streaming and batch data, disparate data sources, duplicated processing, and high join costs (up to 30 TB with data skew).
The solution was a day‑level user‑resource detail table (log_di) that consolidates the three previous tables, providing a unified data source for both real‑time and offline use.
Key design points:
Four‑level partitioning (source, user‑behavior, business direction, etc.) to isolate data and reduce join volume.
Versioned outputs (v1‑v6) with different freshness: real‑time, hourly, daily (T+1), and extended daily versions for resource, user, and fan‑relationship dimensions.
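The effect of multi‑level partitioning on scan and join volume can be sketched in a few lines of Python. This is an illustration only: the key names (source, behavior, business) follow the levels the article lists, the fourth level is omitted because the article does not name it, and all partition values are hypothetical.

```python
from typing import Mapping

Partition = Mapping[str, str]


def prune(partitions: list[Partition], **wanted: str) -> list[Partition]:
    """Keep only partitions whose keys match every requested value."""
    return [p for p in partitions if all(p.get(k) == v for k, v in wanted.items())]


# Hypothetical partition values; the article names the levels, not their values.
partitions = [
    {"source": "app", "behavior": "click", "business": "video"},
    {"source": "app", "behavior": "show", "business": "video"},
    {"source": "app", "behavior": "click", "business": "article"},
    {"source": "web", "behavior": "click", "business": "video"},
]

# A query that filters on two partition levels touches only matching partitions.
to_scan = prune(partitions, source="app", behavior="click")
print(len(to_scan), "of", len(partitions), "partitions scanned")
```

Because filters on partition keys are resolved before any rows are read, a downstream job that needs a single behavior or business line reads and joins only that slice instead of the full table.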
Benefits of the Re‑engineered Warehouse
Unified data source and export: downstream users query a single wide table, ensuring consistency across teams.
Multi‑version outputs allow switching between real‑time, hourly, and daily data without changing query logic.
Higher freshness and multidimensional integration improve reporting speed and support diverse analytical needs.
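As a rough illustration of switching freshness without changing query logic, the Python sketch below builds the same query against log_di and varies only the version value. The mapping of v1 through v6 to freshness levels follows the order in which the article lists them and is an assumption, as are the column names event_day and version.

```python
# Assumed mapping: the article lists six versions and six freshness levels,
# but does not say which version label corresponds to which level.
VERSION_FRESHNESS = {
    "v1": "real-time",
    "v2": "hourly",
    "v3": "daily (T+1)",
    "v4": "extended daily: resource dimensions",
    "v5": "extended daily: user dimensions",
    "v6": "extended daily: fan-relationship dimensions",
}


def build_query(version: str, day: str) -> str:
    """Return a page-view count query for one version of one day."""
    if version not in VERSION_FRESHNESS:
        raise ValueError(f"unknown version: {version}")
    if not (len(day) == 8 and day.isdigit()):
        raise ValueError(f"day must be YYYYMMDD, got: {day}")
    # event_day and version are hypothetical partition column names.
    return (
        "SELECT COUNT(*) AS pv FROM log_di "
        f"WHERE event_day = '{day}' AND version = '{version}'"
    )


realtime_sql = build_query("v1", "20240101")
daily_sql = build_query("v3", "20240101")
print(realtime_sql)
print(daily_sql)
```

The two statements differ only in the version literal, so a dashboard could start on the real‑time version and later move to the T+1 version without rewriting its metric logic.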
The restructured architecture is illustrated below:
Conclusion & Future Plans
Continuous business growth drives evolving data‑warehouse requirements. The current flow‑batch integrated multi‑version wide‑table system has simplified the Feed warehouse, but future scaling and complexity will demand further tool enhancements and architectural refinements to support decision‑making and innovation.
