
How Baidu Feed Evolved Its Data Warehouse with Multi‑Version Wide Tables

This article outlines the step‑by‑step evolution of Baidu's Feed data warehouse: from traditional layered modeling to hour‑level core tables, then real‑time wide tables, and finally a flow‑batch integrated multi‑version wide‑table architecture. It highlights the motivations, design choices, challenges, and resulting benefits of each stage.

Baidu Tech Salon

Jul 11, 2024


Feed is the personalized recommendation stream in the Baidu App, aggregating articles, videos, and image collections. As the business grew, the mobile ecosystem data team needed to redesign the Feed data warehouse to reduce redundancy, simplify query logic, and improve efficiency.

Background

Initially, the warehouse followed a classic ODS -> DWD -> DWS -> ADS layered model with additional dimension tables. Data was scattered across dozens of tables totaling nearly 50 PB, which led to high extraction costs and complex downstream joins.

Stage 1: Hour‑Level Core Table + Topic Wide Table

To address high latency and complexity, the team built a 15‑minute streaming batch log table (log_qi) and an hour‑level detailed wide table (log_hi) using the internal TM streaming framework. Topic‑wide intermediate tables were also created to aggregate various dimensions, aligning with the DWD and ADS layers.

15‑minute batch log (log_qi) parses raw Feed logs and embeds simple business logic.

Hour‑level detail table (log_hi) contains richer business logic for external services.

Topic wide tables serve as intermediate aggregation layers.
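As a rough illustration of how one raw log event could be assigned to both the 15‑minute table and the hour‑level table, the sketch below floors an event timestamp to its window start. The partition key formats and the function names are assumptions made for this example; the article does not describe how the TM framework actually keys its output.

```python
from datetime import datetime


def window_start(ts: datetime, minutes: int) -> datetime:
    """Floor ts to the start of its fixed-size window within the hour."""
    if minutes <= 0 or 60 % minutes != 0:
        raise ValueError("window size must be a positive divisor of 60")
    floored_minute = ts.minute - ts.minute % minutes
    return ts.replace(minute=floored_minute, second=0, microsecond=0)


def partition_keys(ts: datetime) -> dict:
    """Return hypothetical partition keys for the two Stage 1 tables."""
    return {
        # 15-minute batch log table (key format is an assumption)
        "log_qi": window_start(ts, 15).strftime("%Y%m%d%H%M"),
        # hour-level detail table (key format is an assumption)
        "log_hi": window_start(ts, 60).strftime("%Y%m%d%H"),
    }


if __name__ == "__main__":
    event_time = datetime(2024, 7, 11, 10, 47, 30)
    print(partition_keys(event_time))
```

An event logged at 10:47:30 falls into the 10:45 window of log_qi and the 10:00 window of log_hi, so the same record is available at two latencies.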

Stage 2: Real‑Time Wide Table

The business needed near‑real‑time data for experiment validation and monitoring. This prompted the creation of a real‑time wide table (log_5mi) whose schema mirrors the hour‑level table but which also embeds complex logic, enabling fast, low‑cost access to real‑time data.

Stage 3: Flow‑Batch Integrated Multi‑Version Wide Table

After building hour‑level, topic, and real‑time tables, new business requirements exposed issues such as inconsistent metrics between streaming and batch data, disparate data sources, duplicated processing, and high join costs (up to 30 TB with data skew).

The solution was a day‑level user‑resource detail table (log_di) that consolidates the three previous tables, providing a unified data source for both real‑time and offline use.

Key design points:

Four‑level partitioning (source, user‑behavior, business direction, etc.) to isolate data and reduce join volume.

Versioned outputs (v1‑v6) with different freshness: real‑time, hourly, daily (T+1), and extended daily versions for resource, user, and fan‑relationship dimensions.
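The two design points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's implementation: the article names only three of the four partition levels, so level_4 is a placeholder; the assignment of v1 through v6 to specific freshness classes is assumed from the order in which they are listed; and the path layout and sample values such as "feed" or "click" are hypothetical.

```python
from typing import Dict, Iterable, List

# Three level names come from the article; the fourth is not named there,
# so "level_4" is a placeholder, not a real partition column.
PARTITION_LEVELS = ("source", "user_behavior", "business_direction", "level_4")

# Assumed mapping: the article lists six versions and six freshness classes
# but does not state which version number corresponds to which class.
VERSION_FRESHNESS = {
    "v1": "real-time",
    "v2": "hourly",
    "v3": "daily (T+1)",
    "v4": "daily, extended with resource dimensions",
    "v5": "daily, extended with user dimensions",
    "v6": "daily, extended with fan-relationship dimensions",
}


def partition_path(table: str, version: str, values: Dict[str, str]) -> str:
    """Build a hypothetical partition path with one segment per level."""
    if version not in VERSION_FRESHNESS:
        raise ValueError(f"unknown version: {version}")
    missing = [level for level in PARTITION_LEVELS if level not in values]
    if missing:
        raise ValueError(f"missing partition levels: {missing}")
    segments = [f"{level}={values[level]}" for level in PARTITION_LEVELS]
    return "/".join([table, f"version={version}", *segments])


def prune(paths: Iterable[str], **wanted: str) -> List[str]:
    """Keep only the partitions whose path has every wanted level=value segment."""
    needles = [f"{level}={value}" for level, value in wanted.items()]
    return [p for p in paths if all(n in p.split("/") for n in needles)]
```

Because each level is its own path segment, a job that only needs one source or one behavior type can skip every other partition before any join runs, which is the mechanism the article credits for reducing join volume.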

Benefits of the Re‑engineered Warehouse

Unified data source and export: downstream users query a single wide table, ensuring consistency across teams.

Multi‑version outputs allow switching between real‑time, hourly, and daily data without changing query logic.

Higher freshness and multidimensional integration improve reporting speed and support diverse analytical needs.
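To make the "switch versions without changing query logic" benefit concrete, here is a small sketch of a query template in which the version is the only thing that varies. The column names (event_day, version, source) and the idea that the version is exposed as a filterable column are assumptions for illustration; the article does not show the table's schema or query interface.

```python
import re


def build_query(version: str, day: str) -> str:
    """Return the same aggregation query for any version of log_di."""
    if not re.fullmatch(r"v[1-6]", version):
        raise ValueError("version must be one of v1..v6")
    if not re.fullmatch(r"\d{8}", day):
        raise ValueError("day must be formatted as YYYYMMDD")
    return (
        "SELECT source, COUNT(*) AS events\n"
        "FROM log_di\n"
        f"WHERE event_day = '{day}' AND version = '{version}'\n"
        "GROUP BY source"
    )


if __name__ == "__main__":
    # Same logic, different freshness: only the version literal differs.
    print(build_query("v1", "20240711"))
    print(build_query("v3", "20240711"))
```

A dashboard could start on a fresher version for monitoring and move to a daily version for final reporting by changing one parameter, leaving the rest of the query untouched.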

The restructured architecture is illustrated below:

[Figure: restructured Feed data warehouse architecture]

Conclusion & Future Plans

Continuous business growth drives evolving data‑warehouse requirements. The current flow‑batch integrated multi‑version wide‑table system has simplified the Feed warehouse, but future scaling and complexity will demand further tool enhancements and architectural refinements to support decision‑making and innovation.


