How the Feed Real‑Time Data Warehouse Was Re‑Engineered for Speed and Cost Savings
This article explains how Baidu’s Feed real‑time data warehouse was rebuilt using a pure streaming architecture, detailing the limitations of the previous stream‑batch design, the technical solutions—including core/non‑core data separation, metric calculation in streaming, and Parquet storage with Apache Arrow—and the resulting cost reductions, latency improvements, and future roadmap.
Introduction
The Feed real‑time data warehouse builds a 15‑minute‑granularity stream‑batch log table from feed logs, providing the most fine‑grained user‑level data and serving as the foundational wide table for Feed.
Challenges in the Existing Architecture
A complex and costly compute pipeline (stream + batch), leading to 45‑50 minutes of end‑to‑end latency.
Inconsistent metrics and dimensions across downstream services.
Core and non‑core data mixed in the same pipeline, causing stability issues and resource contention.
Re‑architected Solution
The new design replaces the stream‑batch hybrid with a pure streaming architecture based on the TM framework, integrates field‑format unification and metric calculation into the streaming job, and separates core and non‑core data at the ingestion layer.
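The separation of core and non‑core data at the ingestion layer can be sketched as a simple routing step at the front of the streaming job. The field names and the set of core sources below are illustrative assumptions, not Baidu's actual configuration:

```python
# Minimal sketch of ingestion-time core/non-core separation.
# CORE_SOURCES and the log schema are hypothetical examples.

CORE_SOURCES = {"feed_click", "feed_show"}  # business-critical log streams

def route(log: dict) -> str:
    """Tag each incoming log so core and non-core records flow into
    separate downstream pipelines and never contend for resources."""
    return "core" if log.get("source") in CORE_SOURCES else "non_core"

logs = [
    {"source": "feed_click", "user_id": 1},
    {"source": "debug_trace", "user_id": 2},
]
print([route(l) for l in logs])  # ['core', 'non_core']
```

Routing at ingestion, rather than filtering downstream, is what lets a failure or backlog in a non‑core pipeline leave the core tables untouched.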
Key Improvements
Field extraction stays unchanged, while metric calculation moves into the streaming job, eliminating the need for downstream recomputation.
Unified field format and direct ORC/Parquet output in the streaming job, reducing resource usage by ~200 k CNY/year.
Adopted Parquet (with Apache Arrow) for columnar storage, enabling predicate push‑down, schema evolution, and better compression.
Optimized TMsinker task‑fetching and data‑batching parameters to reduce small‑file generation.
Switched ZSTD compression to the advanced API to exploit multithreaded compression.
Results and Future Plans
After migration, compute cost dropped ~50 %, data‑to‑application latency shortened by 30 minutes, query efficiency improved by 90 %, and core/non‑core data isolation increased system stability. Future work includes moving to a modern streaming engine, refactoring C++ jobs, and positioning the warehouse as an internal real‑time data platform.