
Evolution of Feed Data Warehouse Wide-Table Modeling at Baidu App

Baidu's Mobile Ecology team transformed its Feed data warehouse in three progressive stages: hour-level core tables, a real-time wide table, and a unified day-level multi-version table. The result consolidates traffic, content, and user data into a single partitioned wide-table architecture that resolves granularity inconsistencies, cuts processing cost, and serves diverse analytics at latencies ranging from real-time to daily.

Baidu Geek Talk

Feed is a personalized recommendation stream on Baidu App that aggregates various content types such as articles, videos, and image collections. As the business grew, the Mobile Ecology Data R&D team iterated on the Feed data warehouse's wide-table model to integrate traffic, content, and user data, build multi-version wide tables, achieve warehouse consistency, simplify extraction logic, and reduce cost while improving efficiency.

The evolution is divided into three stages:

Stage 1 – Hour‑level core tables + topic wide tables: A 15‑minute streaming‑batch log table (log_qi) and an hour‑level detail wide table (log_hi) were built using the in‑house TM streaming framework. These tables embed basic business logic and serve as the main data source for downstream services. Topic wide tables act as intermediate tables that aggregate other thematic data.
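The rollup from 15-minute micro-batches into the hour-level detail table can be sketched as follows. This is a minimal pure-Python illustration; the field names (`user_id`, `resource_id`, `clicks`, `duration_ms`) and the aggregation key are hypothetical placeholders, not Baidu's actual log_qi/log_hi schema.

```python
from collections import defaultdict

def rollup_quarter_to_hour(qi_batches):
    """Merge 15-minute log_qi micro-batches into hourly log_hi rows,
    keyed by (user_id, resource_id). Schema is illustrative only."""
    hourly = defaultdict(lambda: {"clicks": 0, "duration_ms": 0})
    for batch in qi_batches:
        for row in batch:
            key = (row["user_id"], row["resource_id"])
            hourly[key]["clicks"] += row["clicks"]
            hourly[key]["duration_ms"] += row["duration_ms"]
    return {k: dict(v) for k, v in hourly.items()}

# Four 15-minute batches belonging to the same hour.
batches = [
    [{"user_id": "u1", "resource_id": "r1", "clicks": 1, "duration_ms": 3000}],
    [{"user_id": "u1", "resource_id": "r1", "clicks": 2, "duration_ms": 5000}],
    [],  # an empty quarter-hour
    [{"user_id": "u2", "resource_id": "r9", "clicks": 1, "duration_ms": 800}],
]
log_hi = rollup_quarter_to_hour(batches)
```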

Stage 2 – Real‑time wide table: To meet stricter latency requirements, a real‑time wide table (log_5mi) was created, mirroring the schema of the hour‑level table but with more complex business logic pushed down. This enables near‑real‑time monitoring and rapid iteration of experiments or policies.

Stage 3 – Stream‑batch unified multi‑version wide tables: After completing hour‑level, topic, and real‑time tables, a day‑level user‑resource detail table (log_di) was introduced to unify data sources and resolve inconsistencies between real‑time and offline data. The table is partitioned into four dimensions (log source, user behavior, business direction, and a fourth custom dimension) and split into six versions (v1–v6) with different production cycles (real‑time, hourly, daily) and downstream applications.
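The four-level partition scheme can be sketched as a directory layout plus a pruning step. Everything here is a hypothetical illustration: the dimension names (`log_source`, `behavior`, `biz`, `custom`), their values, and the path convention are assumptions, not the production layout.

```python
from itertools import product

def partition_path(table, event_day, log_source, behavior, biz, custom):
    # Hypothetical Hive-style path for one four-dimension partition.
    return (f"/warehouse/{table}/event_day={event_day}/"
            f"log_source={log_source}/behavior={behavior}/"
            f"biz={biz}/custom={custom}")

def prune(paths, **filters):
    """Keep only partitions matching every key=value filter, so a
    query touches a small slice of the table instead of a full scan."""
    def match(p):
        return all(f"{k}={v}" in p for k, v in filters.items())
    return [p for p in paths if match(p)]

# Enumerate every partition for one day (2 x 2 x 2 x 1 = 8 partitions).
all_parts = [partition_path("log_di", "20240101", s, b, z, c)
             for s, b, z, c in product(["feed", "search"],
                                       ["click", "show"],
                                       ["video", "article"],
                                       ["default"])]

# A query for feed clicks only scans the matching partitions.
clicked_feed = prune(all_parts, log_source="feed", behavior="click")
```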

Key challenges addressed include:

Inconsistent data granularity between streaming and batch pipelines.

Disparate data sources (e.g., search, live streaming) causing integration complexity.

Redundant data processing and high join costs, with intermediate data volumes reaching ~30 TB and data skew.

Design solutions involve:

Unifying data sources into a single wide table and pushing complex logic downstream.

Four‑level partitioning to limit data scanned per query.

Versioned outputs to satisfy different latency requirements without changing query logic.

Pre‑filtering of resource and follower tables and using Spark AQE to mitigate data skew.
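The pre-filtering idea above can be sketched in a few lines: shrink the resource (or follower) dimension table to only the keys that actually appear in the day's fact data before joining, which cuts the intermediate volume fed into the join. This is a plain-Python stand-in for the Spark join, with illustrative field names; in production the residual skewed keys are handled by Spark AQE (the standard Spark 3 flags `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`).

```python
def prefilter_then_join(facts, dim, key):
    """Filter the dimension table to keys seen in the facts,
    then hash-join. Mimics the pre-filter step before a Spark join."""
    needed = {row[key] for row in facts}
    slim = {row[key]: row for row in dim if row[key] in needed}
    return [{**f, **slim[f[key]]} for f in facts if f[key] in slim]

facts = [{"resource_id": "r1", "clicks": 3},
         {"resource_id": "r2", "clicks": 1}]
resources = [{"resource_id": "r1", "type": "video"},
             {"resource_id": "r2", "type": "article"},
             {"resource_id": "r3", "type": "image"}]  # dropped by pre-filter

joined = prefilter_then_join(facts, resources, "resource_id")
```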

The final architecture consists of three aligned schemas (log_hi, log_5mi, log_di) that together form a multi‑version wide‑table warehouse. Six versions (v1–v6) provide real‑time, hourly, and daily data with varying dimensions (resource, user, follower relationships) for use cases such as real‑time monitoring, core metric dashboards, and downstream reporting.
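The version routing described above can be sketched as picking the slowest (and therefore cheapest) version that still meets a consumer's freshness budget, so downstream query logic never changes. The version-to-table mapping and latency figures here are illustrative assumptions, not the actual v1–v6 definitions.

```python
# Hypothetical mapping of three of the six versions to tables and
# production latencies (seconds). Values are illustrative only.
VERSIONS = {
    "v1": {"table": "log_5mi", "latency_s": 300},    # real-time
    "v3": {"table": "log_hi",  "latency_s": 3600},   # hourly
    "v6": {"table": "log_di",  "latency_s": 86400},  # daily
}

def pick_version(max_latency_s):
    """Choose the highest-latency version within the caller's budget:
    it is the cheapest to produce and scan while still fresh enough."""
    ok = [(v, m) for v, m in VERSIONS.items()
          if m["latency_s"] <= max_latency_s]
    if not ok:
        raise ValueError("no version meets the latency budget")
    return max(ok, key=lambda kv: kv[1]["latency_s"])[0]

# A real-time monitor needs <= 5 min freshness; a daily report tolerates a day.
```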

Benefits of the reconstructed warehouse include unified data source and export, multi‑version outputs for diverse latency needs, and high‑timeliness multi‑dimensional integration, enabling downstream teams to query a single table while supporting a wide range of analytical scenarios.

Tags: real-time, Big Data, Data Modeling, Data Warehouse, Spark, feed, wide table
Written by Baidu Geek Talk