
How Xiaohongshu Built a Minute‑Level Near‑Real‑Time Data Warehouse with Incremental Computing

Facing billions of daily logs and the need for minute‑level experiment metrics, Xiaohongshu partnered with Yunqi Tech to design a generic incremental‑compute solution that delivers near‑real‑time data warehousing with lower cost, higher accuracy, simplified pipelines, and improved query performance.

Xiaohongshu Tech REDtech

1. Business Background

With the explosion of mobile‑internet content, Xiaohongshu generates daily logs at the hundred‑billion level, and algorithm experiment iteration demands minute‑level latency. Traditional batch architectures cannot balance low cost and low latency, prompting a shift toward incremental computing for near‑real‑time data processing.

2. Problem Analysis

Realtime‑offline metric discrepancy: The existing Lambda architecture maintains separate streaming and batch pipelines, so the same metric is computed twice by different code paths, producing inconsistent results and doubling resource consumption.

Complex pipeline maintenance: Separate Flink streams for log ingestion and dimension‑table updates increase operational cost.

Large‑window limitations: Long aggregation windows force Flink to hold large amounts of state, and the resulting checkpoint and state‑management pressure causes latency spikes.

High resource cost: Long‑running Flink jobs consume significant resources as log volume grows.

3. Transformation Goals

The aim is to build a near‑real‑time pipeline whose data is more accurate, more complete, and fresher, delivered through simpler pipelines with lower operational overhead.

4. Incremental Computing Solution

4.1 Paimon + Iceberg Architecture

Combining Paimon’s dynamic table and partial‑update capabilities with Iceberg’s open‑format storage and StarRocks for fast OLAP queries yields a minute‑level pipeline. However, the component count is high and stability at the hundred‑billion scale remains a concern.
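Paimon's partial‑update behavior, in which rows sharing a key are merged and non‑null incoming fields overwrite stored ones, can be pictured with a small Python sketch. The schema and field names here are hypothetical illustrations, not Xiaohongshu's actual tables:

```python
def partial_update(snapshot: dict, incoming: dict) -> dict:
    """Merge incoming rows into the snapshot by key: any non-None
    field in an incoming row overwrites the stored value, while
    None fields leave the existing value untouched."""
    for key, row in incoming.items():
        merged = dict(snapshot.get(key, {}))
        for field, value in row.items():
            if value is not None:
                merged[field] = value
        snapshot[key] = merged
    return snapshot

# Two upstream streams each fill in part of the same user row.
snap = {}
partial_update(snap, {"u1": {"city": "Shanghai", "age": None}})
partial_update(snap, {"u1": {"city": None, "age": 30}})
# snap["u1"] == {"city": "Shanghai", "age": 30}
```

This is why partial update removes the need for a separate Flink join job: each stream writes only the columns it knows, and the table's merge step assembles the full row.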

4.2 Yunqi Generic Incremental Compute (GIC)

Yunqi’s Lakehouse engine offers a generic incremental compute model that updates only changed data, merging results with previous snapshots. Built on Iceberg, it supports dynamic tables, partial updates, and can be scheduled at minute intervals, achieving low latency with minimal resource usage.
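The core idea of generic incremental compute, processing only the newly arrived delta and folding the partial result into the previous snapshot rather than rescanning history, can be sketched for an additive aggregate such as a per‑user event count. This is a minimal illustration of the concept, not Yunqi's actual API:

```python
from collections import Counter

def incremental_step(prev_snapshot: Counter, delta_logs: list) -> Counter:
    """Aggregate only the newly arrived logs, then merge the partial
    result into the previous snapshot; cost scales with the delta,
    not with total history."""
    partial = Counter(log["user_id"] for log in delta_logs)
    prev_snapshot.update(partial)  # merge: old totals + new partials
    return prev_snapshot

snapshot = Counter()
snapshot = incremental_step(snapshot, [{"user_id": "u1"}, {"user_id": "u2"}])
snapshot = incremental_step(snapshot, [{"user_id": "u1"}])
# snapshot == Counter({"u1": 2, "u2": 1})
```

Scheduling such a step every few minutes gives the minute‑level freshness described above while touching only the changed data each run.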

4.3 Simplified Tech Stack

The Lakehouse engine unifies real‑time ETL, batch ETL, and online queries under standard SQL, reducing development complexity and enabling reuse of existing offline logic.

4.4 Model Design

Minute‑level DWS layer: Aggregates billions of raw log rows into 5‑minute user‑level partitions, shrinking row counts from billions to hundreds of millions and achieving sub‑10‑second query latency.
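The windowing step behind this rollup can be sketched as flooring each log timestamp to its 5‑minute window start and aggregating on the (window, user) key; field names here are illustrative:

```python
def five_min_bucket(ts: int) -> int:
    """Floor an epoch-second timestamp to its 5-minute window start."""
    return ts - ts % 300

def aggregate_dws(logs: list) -> dict:
    """Roll raw log rows up to (window_start, user_id) granularity,
    collapsing many raw rows into one aggregate row each."""
    agg = {}
    for log in logs:
        key = (five_min_bucket(log["ts"]), log["user_id"])
        agg[key] = agg.get(key, 0) + 1
    return agg

logs = [{"ts": 10, "user_id": "u1"},
        {"ts": 290, "user_id": "u1"},
        {"ts": 301, "user_id": "u1"}]
# aggregate_dws(logs) == {(0, "u1"): 2, (300, "u1"): 1}
```

Because each user produces at most one row per 5‑minute window, the DWS layer is orders of magnitude smaller than the raw log layer it summarizes.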

Realtime user dimension (DIM) layer: Updates high‑frequency dimensions every minute via Kafka, while low‑frequency dimensions are refreshed in batch.

Experiment array optimization: Stores experiment IDs as JSON arrays, leveraging built‑in JSON functions and automatic type inference to avoid wide tables.

User‑experiment dimension table: Uses array<bigint> with inverted indexing, boosting query speed by ~20×.
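The speedup from the inverted index comes from turning "which users carry experiment X" from a scan over every user's experiment array into a direct lookup. A minimal sketch of that idea, with illustrative data rather than the production index implementation:

```python
from collections import defaultdict

def build_inverted_index(user_exps: dict) -> dict:
    """Map each experiment ID back to the set of users carrying it,
    so membership queries become a dictionary lookup instead of a
    scan over every user's experiment array."""
    index = defaultdict(set)
    for user, exp_ids in user_exps.items():
        for exp in exp_ids:
            index[exp].add(user)
    return index

users = {"u1": [101, 202], "u2": [202], "u3": [303]}
idx = build_inverted_index(users)
# idx[202] == {"u1", "u2"}
```

An OLAP engine's array inverted index applies the same principle at storage level, which is where the roughly 20× query speedup reported above comes from.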

5. Comparative Evaluation

Compared with pure offline (day‑level) and pure realtime (resource‑heavy) pipelines, the near‑real‑time solution meets experiment observation needs, balances freshness and cost, and reduces resource consumption to roughly 36% of the original realtime system while keeping data deviation under 1%.

6. Business Impact

The new pipeline, built on open‑format Iceberg and Yunqi’s incremental engine, integrates with RedBI, delivers faster and more accurate experiment metrics, and enables self‑service addition of JSON‑based feature fields without schema changes, dramatically shortening development cycles.

7. Future Outlook

With minute‑level freshness, the near‑real‑time architecture bridges batch and streaming, supporting unified tables for realtime, offline, and near‑real‑time workloads, and is expected to become the backbone for increasingly many business scenarios at Xiaohongshu.

Tags: big data, Flink, Paimon, data lake, Iceberg, incremental computing, near real-time data warehouse
Written by

Xiaohongshu Tech REDtech

The official account of the Xiaohongshu tech team, sharing technical innovations and problem‑solving insights.
