How Xiaohongshu Re‑engineered Its Data Architecture for the Big AI Data Era

Xiaohongshu, with over 350 million monthly users and daily logs in the hundreds of billions, migrated its data platform from AWS to Alibaba Cloud and iterated on its architecture four times, moving from a ClickHouse-based ad-hoc layer to a Lambda architecture and finally to a Lakehouse with incremental compute. The result cut architecture complexity, resource cost and development effort each to about one-third while delivering second-level analytics on trillion-scale data.

DataFunTalk

Jun 24, 2026

Xiaohongshu is a lifestyle community app with more than 350 million monthly active users. Its business (community notes, live streaming, e‑commerce, etc.) is heavily data‑driven, generating daily logs of several hundred billion events, which creates massive real‑time and offline data demands.

The existing data platform follows the industry‑standard data‑warehouse model and includes a self‑built scheduling system, operations platform, asset‑management, governance, and reporting tools. Data value output is divided into four categories: analytics for executives and front‑line teams, data products for advertisers and merchants, data services for recommendation/search algorithms, and AI‑driven insights.

In 2024 the infrastructure layer was migrated from AWS to Alibaba Cloud, moving 500 PB of data across 110 000 tasks with 1 500 participants from more than 40 departments—setting an industry record for migration scale.

To reduce the cost of acquiring data, improve usage efficiency, and lower the barrier for product teams, Xiaohongshu iterated on its architecture four times.

1.0 – ClickHouse‑based ad‑hoc analysis

The initial architecture used ClickHouse for instant queries after Spark‑SQL batch processing, reducing latency from minutes to seconds. However, it suffered from high cluster cost, difficult scaling due to ClickHouse’s compute‑storage coupling, and poor data freshness because data arrived after a T+1 Spark batch.

2.0 – Store‑compute separation & Lambda integration

ClickHouse’s MergeTree files were synchronized to object storage and local SSD, extending the queryable time range and lowering storage cost. A Lambda layer merged Flink‑produced real‑time tables with Spark‑produced batch tables in ClickHouse, delivering day‑level to real‑time insights. Daily, about 600 billion rows of user‑behavior logs are ingested into ClickHouse.
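The Lambda merge described above can be pictured as a query-time union: batch rows answer everything up to the last completed batch day, and real-time rows cover whatever arrived after it. The following is a minimal Python sketch of that idea, not Xiaohongshu's ClickHouse implementation; the `(event_date, count)` row shape, the `batch_watermark` cutoff and the function name are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def merged_daily_counts(batch_rows, realtime_rows, batch_watermark):
    """Combine batch and real-time event counts per day.

    batch_rows / realtime_rows: iterables of (event_date, count) pairs
    (a hypothetical shape chosen for this sketch).
    batch_watermark: last date fully covered by the batch pipeline.
    Dates up to the watermark are served from batch and later dates from
    real time, so overlapping days are never double counted.
    """
    totals = Counter()
    for day, count in batch_rows:
        if day <= batch_watermark:
            totals[day] += count
    for day, count in realtime_rows:
        if day > batch_watermark:
            totals[day] += count
    return dict(totals)


batch = [(date(2024, 5, 1), 100), (date(2024, 5, 2), 120)]
realtime = [(date(2024, 5, 2), 118), (date(2024, 5, 3), 40), (date(2024, 5, 3), 5)]
result = merged_daily_counts(batch, realtime, batch_watermark=date(2024, 5, 2))
```

The real-time figure for May 2 is discarded in favor of the batch figure, which reflects the usual Lambda convention that batch output is authoritative once it lands.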

Three optimizations kept queries fast at this scale:

A local join on the user dimension, to support feature-rich analysis.

Materialized views covering 70 % of queries, which compressed 600 billion rows to roughly 200 billion rows.

A Bloom-filter index on user ID, for fast lookup of an individual user's behavior.

These optimizations enabled second‑level analysis on trillion‑scale data for over 200 internal products.
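To show why a Bloom-filter index speeds up single-user lookups, the sketch below builds one small filter per block of rows and skips any block whose filter rules the user out. It is a toy model written in Python under stated assumptions: the `BloomFilter` class, the block layout and the filter sizes are hypothetical and do not reproduce ClickHouse's internal index format.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_index(blocks):
    """Build one filter per block of (user_id, event) rows."""
    filters = []
    for rows in blocks:
        bloom = BloomFilter()
        for user_id, _event in rows:
            bloom.add(user_id)
        filters.append(bloom)
    return filters


def lookup(blocks, filters, user_id):
    """Return the user's events and how many blocks had to be scanned."""
    hits, scanned = [], 0
    for rows, bloom in zip(blocks, filters):
        if not bloom.might_contain(user_id):
            continue  # the filter proves the user is absent: skip the block
        scanned += 1
        hits.extend(event for uid, event in rows if uid == user_id)
    return hits, scanned


blocks = [
    [("u1", "view"), ("u2", "like")],
    [("u3", "view"), ("u4", "share")],
    [("u1", "buy"), ("u5", "view")],
]
filters = build_index(blocks)
events, blocks_scanned = lookup(blocks, filters, "u1")
```

Because a Bloom filter never reports a present key as absent, skipping is always safe; a false positive only costs an unnecessary block scan.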

3.0 – Lakehouse (Iceberg + Flink + Spark + StarRocks)

The 2.0 architecture left three pain points: dual storage (object store vs ClickHouse files), dual compute frameworks (Flink vs Spark) with semantic gaps, and lack of ETL capability in ClickHouse. The Lakehouse unifies storage (Iceberg) and compute, allowing both batch and real‑time pipelines to share the same data files.

Key optimizations include automatic Z‑Order sorting and intelligent sorting based on StarRocks query logs. Scanned data per table dropped from >5.5 TB to ~600 GB (≈10× reduction), achieving P90 query latency around 5 seconds and overall query performance three times higher than the previous ClickHouse‑only setup.
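The reduction in scanned data comes from file skipping: when rows are clustered on several columns at once, min/max statistics let the engine rule out more files for filters on any of those columns. The Python sketch below illustrates the idea with a two-column Z-order (Morton) key on a toy 8×8 grid. The helper names, the grid and the file size are assumptions for illustration, and the query-log-driven choice of sort columns mentioned above is not modeled.

```python
def z_value(x, y, bits=16):
    """Interleave the bits of two non-negative ints into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies the odd bit positions
    return z


def split_into_files(rows, rows_per_file):
    """Cut an ordered row list into fixed-size 'files'."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_scanned(files, column, value):
    """Count files whose min/max statistics cannot rule out `value`."""
    scanned = 0
    for rows in files:
        values = [row[column] for row in rows]
        if min(values) <= value <= max(values):
            scanned += 1
    return scanned


# Toy table with two columns (a, b), each ranging over 0..7.
rows = [(a, b) for a in range(8) for b in range(8)]

linear_files = split_into_files(sorted(rows), 8)  # sorted by a, then b
z_files = split_into_files(sorted(rows, key=lambda r: z_value(r[0], r[1])), 8)

# Total files touched by two point filters: a == 3 and b == 3.
linear_cost = files_scanned(linear_files, 0, 3) + files_scanned(linear_files, 1, 3)
z_cost = files_scanned(z_files, 0, 3) + files_scanned(z_files, 1, 3)
```

In this toy grid the linear sort scans 1 file for the filter on `a` but all 8 for the filter on `b`, while the Z-order layout scans 4 and 2 respectively. Z-ordering gives up some selectivity on the leading column in exchange for usable pruning on the others, which is why it pays off when queries filter on more than one column.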

General Incremental Compute

The article defines "general incremental compute" as the fourth generation of data processing, one that targets high performance and low latency at the same time and fits both the Kappa architecture and the medallion ("medal") model. It introduces the SPOT standards:

S: The system must support a unified full-data expression for all operators.

P: High performance with low cost.

O: Open, so that both data and AI engines can consume the same data.

T: Tunable, so that the business can adjust parameters without code changes.

Yunqi's implementation for Xiaohongshu achieved three one-third reductions: resource cost, component count, and development effort each fell to roughly one-third of the previous baseline.
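The core idea behind incremental compute can be shown with a small aggregate: instead of rescanning all rows on every refresh, the engine applies only the inserted and deleted rows to the previous result, and the outcome must match a full recompute. The sketch below is a minimal Python illustration under that assumption. The function names and the `(key, value)` row shape are hypothetical, and it says nothing about how Yunqi's engine is actually built.

```python
from collections import defaultdict


def full_aggregate(rows):
    """Recompute per-key (count, sum) from all rows: the batch-style baseline."""
    state = defaultdict(lambda: [0, 0])
    for key, value in rows:
        state[key][0] += 1
        state[key][1] += value
    return {key: tuple(entry) for key, entry in state.items()}


def apply_delta(state, inserted=(), deleted=()):
    """Derive the new aggregate from the old one by reading only changed rows."""
    new_state = {key: list(entry) for key, entry in state.items()}
    for key, value in inserted:
        entry = new_state.setdefault(key, [0, 0])
        entry[0] += 1
        entry[1] += value
    for key, value in deleted:
        entry = new_state[key]
        entry[0] -= 1
        entry[1] -= value
        if entry[0] == 0:
            del new_state[key]  # the key no longer has any rows
    return {key: tuple(entry) for key, entry in new_state.items()}


base_rows = [("a", 1), ("b", 2), ("a", 3)]
state = full_aggregate(base_rows)

inserted = [("a", 4), ("c", 5)]
deleted = [("b", 2)]
incremental = apply_delta(state, inserted, deleted)

# The same result, obtained the expensive way, for comparison.
recomputed = full_aggregate([("a", 1), ("a", 3), ("a", 4), ("c", 5)])
```

The refresh cost scales with the size of the change rather than the size of the table, which is where the freshness and resource gains reported below would come from.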

Validation Results

Functional parity with Spark; incremental tables passed accuracy checks.

Freshness improved from T+1 to a 5-minute window; incremental tables ran 1-2× faster than the equivalent Spark offline jobs.

Real-time aggregation cost about one quarter of a traditional Flink job.

Json Flatter flattened semi-structured JSON into a columnar format, roughly doubling both compression ratio and query speed.

An inverted index with data skipping boosted query efficiency by 10×.
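The JSON flattening result above rests on a simple transformation: nested fields become separate columns, so each column holds values of one kind and can be compressed and scanned independently. The following Python sketch shows one way to do this, assuming dotted-path column names and `None` for missing fields. The `flatten` and `to_columns` helpers are hypothetical and are not the Json Flatter feature itself.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted-path keys; other values stay as leaves."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def to_columns(json_lines):
    """Turn JSON log lines into a column-name -> list-of-values mapping."""
    records = [flatten(json.loads(line)) for line in json_lines]
    names = sorted({name for record in records for name in record})
    # Fields missing from a record are filled with None to keep columns aligned.
    return {name: [record.get(name) for record in records] for name in names}


logs = [
    '{"user": {"id": 1, "city": "Shanghai"}, "event": "view"}',
    '{"user": {"id": 2}, "event": "like", "note": {"id": 77}}',
]
columns = to_columns(logs)
```

A query that only needs `user.id` can now read that one column instead of parsing every JSON document.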

The unified architecture also lowered development and maintenance overhead for algorithm teams that require rapid, low‑latency data for A/B testing and strategy validation.

Future Outlook

Looking ahead, Xiaohongshu plans to deepen stream‑batch convergence, further accelerate Iceberg query performance, and integrate AI‑driven BI that can automatically generate logical views and materialized accelerations for business users.
