How Baidu Feed Evolved Its Data Warehouse with Multi‑Version Wide Tables

This article traces the step‑by‑step evolution of Baidu's Feed data warehouse: from traditional layered modeling, to hour‑level core tables, to real‑time wide tables, and finally to a stream‑batch integrated multi‑version wide‑table architecture. It covers the motivations, design choices, challenges, and resulting benefits of each stage.

Baidu Tech Salon

Feed is the personalized recommendation stream in the Baidu App, aggregating articles, videos, and image collections. As the business grew, the mobile ecosystem data team had to redesign the Feed data warehouse to reduce redundancy, simplify query logic, and improve efficiency.

Background

Initially, the warehouse followed a classic ODS → DWD → DWS → ADS layered model with additional dimension tables. Data was scattered across dozens of tables totaling nearly 50 PB, which made extraction expensive and forced downstream users to write complex joins.
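
For readers less familiar with layered modeling, the sketch below walks a few log lines through the four layers. It is a toy illustration only: the log format, field names, and aggregation are hypothetical and are not taken from Baidu's actual tables.

```python
from collections import Counter

# ODS: raw log lines as they arrive (hypothetical "user|action|resource" format).
ods = ["u1|click|a9", "u2|show|a9", "u1|show|b3", "bad-line"]

def to_dwd(lines):
    """DWD: parse raw lines into cleaned detail records, dropping malformed ones."""
    records = []
    for line in lines:
        parts = line.split("|")
        if len(parts) == 3:
            records.append({"user": parts[0], "action": parts[1], "resource": parts[2]})
    return records

def to_dws(records):
    """DWS: lightly aggregate detail records into (resource, action) counts."""
    return Counter((r["resource"], r["action"]) for r in records)

def to_ads(summary):
    """ADS: shape the summary into a report-ready click count per resource."""
    return {res: n for (res, action), n in summary.items() if action == "click"}

dwd = to_dwd(ods)
dws = to_dws(dwd)
ads = to_ads(dws)
print(ads)  # {'a9': 1}
```

Each layer adds its own tables, which is how a warehouse of this style can grow to dozens of tables that downstream users must join back together.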

Stage 1: Hour‑Level Core Table + Topic Wide Table

To address high latency and complexity, the team used the internal TM streaming framework to build a log table produced in 15‑minute streaming batches (log_qi) and an hour‑level detailed wide table (log_hi). Topic wide tables were also created as intermediate tables that aggregate data along various dimensions, aligning with the DWD and ADS layers.

15‑minute batch log (log_qi) parses raw Feed logs and embeds simple business logic.

Hour‑level detail table (log_hi) contains richer business logic for external services.

Topic wide tables serve as intermediate aggregation layers.
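
To make the two granularities concrete, here is a minimal sketch of how an event timestamp could map to a 15‑minute batch and to an hourly partition. The partition key formats are assumptions for illustration; the article does not describe how log_qi and log_hi are actually keyed.

```python
from datetime import datetime

def bucket_start(ts: datetime, minutes: int) -> datetime:
    """Floor a timestamp to the start of its N-minute bucket within the hour."""
    if minutes <= 0 or 60 % minutes != 0:
        raise ValueError("minutes must be a positive divisor of 60")
    floored_minute = ts.minute - ts.minute % minutes
    return ts.replace(minute=floored_minute, second=0, microsecond=0)

def partition_keys(ts: datetime) -> dict:
    """Return hypothetical partition keys for the 15-minute and hourly tables."""
    return {
        "log_qi": bucket_start(ts, 15).strftime("%Y%m%d%H%M"),
        "log_hi": bucket_start(ts, 60).strftime("%Y%m%d%H"),
    }

event_time = datetime(2024, 5, 1, 13, 47, 12)
keys = partition_keys(event_time)
print(keys)  # {'log_qi': '202405011345', 'log_hi': '2024050113'}
```

Under this scheme, four consecutive 15‑minute batches cover exactly one hourly partition.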

Stage 1 diagram

Stage 2: Real‑Time Wide Table

Business teams needed near‑real‑time data for experiment validation and monitoring. This prompted the creation of a real‑time wide table (log_5mi) whose schema mirrors the hour‑level table and which already embeds the complex business logic, giving users fast, low‑cost access to real‑time data.

Real‑time data flow

Stage 3: Stream‑Batch Integrated Multi‑Version Wide Table

With the hour‑level, topic, and real‑time tables in place, new business requirements exposed several problems: metrics that disagreed between streaming and batch data, disparate data sources, duplicated processing, and expensive joins (up to 30 TB, with data skew).

The solution was a day‑level user‑resource detail table (log_di) that consolidates the three previous tables, providing a unified data source for both real‑time and offline use.

Key design points:

Four‑level partitioning (source, user‑behavior, business direction, etc.) to isolate data and reduce join volume.

Versioned outputs (v1‑v6) with different freshness: real‑time, hourly, daily (T+1), and extended daily versions for resource, user, and fan‑relationship dimensions.
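
The partitioning point above can be illustrated with a small pruning sketch: a query that filters on partition values only has to read the matching partitions, which shrinks the data that reaches any join. Only the three partition levels named in the text are modeled; the partition values and sizes are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Partition:
    source: str      # log source (hypothetical values)
    behavior: str    # user-behavior type (hypothetical values)
    direction: str   # business direction (hypothetical values)
    size_gb: int     # made-up size, used only to show the effect of pruning

PARTITIONS = [
    Partition("app", "click", "video", 40),
    Partition("app", "show", "video", 400),
    Partition("app", "click", "article", 30),
    Partition("web", "click", "video", 10),
    Partition("web", "show", "article", 120),
]

def prune(partitions, **filters):
    """Keep only partitions whose fields equal every given filter value."""
    return [
        p for p in partitions
        if all(getattr(p, field) == value for field, value in filters.items())
    ]

full_scan_gb = sum(p.size_gb for p in PARTITIONS)
needed = prune(PARTITIONS, behavior="click", direction="video")
pruned_scan_gb = sum(p.size_gb for p in needed)
print(full_scan_gb, pruned_scan_gb)  # 600 50
```

In this toy data set, a query about video clicks reads 50 GB instead of 600 GB, because the high-volume "show" partitions are never touched.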

Versioned table design

Benefits of the Re‑engineered Warehouse

Unified data source and export: downstream users query a single wide table, ensuring consistency across teams.

Multi‑version outputs allow switching between real‑time, hourly, and daily data without changing query logic.

Higher freshness and multidimensional integration improve reporting speed and support diverse analytical needs.
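
The version‑switching benefit can be sketched as a query template in which the version is the only thing that changes. This is an assumption‑laden illustration: the article lists v1‑v6 and the freshness tiers but does not say which label maps to which tier, and the column names (event_day, version, action, resource_id) are hypothetical.

```python
from datetime import datetime

# Hypothetical label-to-freshness mapping; the source does not specify it.
VERSIONS = {
    "v1": "real-time",
    "v2": "hourly",
    "v3": "daily (T+1)",
    "v4": "daily + resource dimensions",
    "v5": "daily + user dimensions",
    "v6": "daily + fan-relationship dimensions",
}

# One query body shared by every version; column names are illustrative.
QUERY_TEMPLATE = (
    "SELECT resource_id, COUNT(*) AS clicks\n"
    "FROM log_di\n"
    "WHERE event_day = '{day}' AND version = '{version}' AND action = 'click'\n"
    "GROUP BY resource_id"
)

def build_query(day: str, version: str) -> str:
    """Render the shared query for one day and one data version."""
    if version not in VERSIONS:
        raise ValueError(f"unknown version: {version!r}")
    datetime.strptime(day, "%Y%m%d")  # raises ValueError if day is malformed
    return QUERY_TEMPLATE.format(day=day, version=version)

realtime_sql = build_query("20240501", "v1")
daily_sql = build_query("20240501", "v3")
print(daily_sql)
```

Moving a report from real‑time to T+1 data then means changing one argument, while the metric definition itself stays identical across versions.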

The restructured architecture is illustrated below:

Final warehouse diagram

Conclusion & Future Plans

Continuous business growth keeps changing what is required of the data warehouse. The current stream‑batch integrated multi‑version wide‑table system has simplified the Feed warehouse, but future growth in scale and complexity will demand further tool enhancements and architectural refinements to support decision‑making and innovation.

Written by

Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
