How the Feed Real‑Time Data Warehouse Was Re‑Engineered for Speed and Cost Savings
This article explains how Baidu’s Feed real‑time data warehouse was rebuilt using a pure streaming architecture, detailing the limitations of the previous stream‑batch design, the technical solutions—including core/non‑core data separation, metric calculation in streaming, and Parquet storage with Apache Arrow—and the resulting cost reductions, latency improvements, and future roadmap.
Introduction
The Feed real‑time data warehouse builds a 15‑minute‑granularity stream‑batch log table from feed logs, providing the most fine‑grained user‑level data and serving as the foundational wide table for Feed.
Challenges in the Existing Architecture
A complex and costly compute pipeline (stream + batch), leading to 45‑50 minutes of end‑to‑end latency.
Inconsistent metrics and dimensions across downstream services.
Core and non‑core data mixed in the same pipeline, causing stability issues and resource contention.
Re‑architected Solution
The new design replaces the stream‑batch hybrid with a pure streaming architecture based on the TM framework, integrates field‑format unification and metric calculation into the streaming job, and separates core and non‑core data at the ingestion layer.
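The separation of core and non‑core data at the ingestion layer can be sketched as a simple routing step at the front of the streaming job. The field names and the set of core sources below are illustrative assumptions, not Baidu's actual configuration:

```python
# Minimal sketch of ingestion-time core/non-core separation.
# CORE_SOURCES and the log schema are hypothetical examples.

CORE_SOURCES = {"feed_click", "feed_show"}  # business-critical log streams

def route(log: dict) -> str:
    """Tag each incoming log so core and non-core records flow into
    separate downstream pipelines and never contend for resources."""
    return "core" if log.get("source") in CORE_SOURCES else "non_core"

logs = [
    {"source": "feed_click", "user_id": 1},
    {"source": "debug_trace", "user_id": 2},
]
print([route(l) for l in logs])  # ['core', 'non_core']
```

Routing at ingestion, rather than filtering downstream, is what lets a failure or backlog in a non‑core pipeline leave the core tables untouched.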
Key Improvements
Field extraction stays unchanged, while metric calculation moves into the streaming job, eliminating the need for downstream recomputation.
Unified field format and direct ORC/Parquet output in the streaming job, reducing resource usage by ~200 k CNY/year.
Adopted Parquet (with Apache Arrow) for columnar storage, enabling predicate push‑down, schema evolution, and better compression.
Optimized TMsinker task‑fetching and data‑batching parameters to reduce small‑file generation.
Switched ZSTD compression to the advanced API to exploit multithreaded compression.
Results and Future Plans
After migration, compute cost dropped ~50 %, data‑to‑application latency shortened by 30 minutes, query efficiency improved by 90 %, and core/non‑core data isolation increased system stability. Future work includes moving to a modern streaming engine, refactoring C++ jobs, and positioning the warehouse as an internal real‑time data platform.