
How Xiaohongshu Cut Data Architecture Costs by Two‑Thirds with Incremental Computing

This article details how Xiaohongshu's platform for massive volumes of user‑generated data evolved from a simple ClickHouse‑based architecture to a multi‑stage lakehouse design, and how a new incremental computing model then cut architecture complexity, resource usage, and development cost by roughly two‑thirds (to about one‑third of the previous level) while improving query performance on petabyte‑scale data.

DataFunSummit

1. Xiaohongshu Data Framework Evolution

Xiaohongshu is a lifestyle community app with over 350 million monthly active users, generating billions of logs daily and requiring both real‑time and offline data processing.

The platform originally used a ClickHouse‑based ad‑hoc analysis architecture (1.0). It returned query results within seconds, but it suffered from three problems:

High hardware cost due to ClickHouse cluster requirements.

Scaling challenges because ClickHouse combines storage and compute.

Data latency caused by Spark T+1 processing before loading into ClickHouse.

To address these issues, Xiaohongshu introduced a Lambda‑style architecture (2.0) that separated storage and compute, syncing ClickHouse MergeTree files to object storage and SSD, and merging real‑time Flink and batch Spark data in ClickHouse.
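The Lambda pattern described above can be sketched in a few lines. This is only an illustration in plain Python, not Xiaohongshu's implementation: in the source the merge happens inside ClickHouse, and the names `batch_view`, `realtime_view`, and `merge_views` are hypothetical.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Answer a query by adding real-time deltas to batch totals.

    batch_view:    metric totals computed by the batch (T+1) job
    realtime_view: metric totals for events that arrived after the batch cutoff
    """
    merged = Counter(batch_view)   # copy the batch totals
    merged.update(realtime_view)   # Counter.update adds counts per key
    return dict(merged)


# Hypothetical metric totals, for illustration only.
batch_view = {"note_view": 1000, "like": 250}
realtime_view = {"note_view": 40, "share": 3}

merged = merge_views(batch_view, realtime_view)
print(merged)
```

The cost of this pattern is that the same metric logic has to exist twice, once in the batch job and once in the streaming job, which is one source of the complexity that later stages of the architecture try to remove.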

Further evolution led to a lakehouse architecture (3.0) that uses Flink for log ingestion, Iceberg for storage, Spark for batch jobs, and StarRocks for fast queries. It returns analysis results within seconds over trillions of rows and has reached a user penetration rate of over 70% for data‑driven decision making.

2. General Incremental Computing

Incremental computing aims to break the data “impossible triangle” of freshness, cost, and performance by providing a unified data model, configurable pipelines, and performance comparable to the best of batch, streaming, and interactive processing.
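The core idea can be shown with a minimal sketch: keep a small piece of state and fold in only the new rows, instead of rescanning all history on every refresh. The class and function names below are hypothetical, and a real engine would handle joins, updates, and deletes, which this toy average does not.

```python
from dataclasses import dataclass


@dataclass
class IncrementalAverage:
    """Running average that is updated from new rows only."""
    count: int = 0
    total: float = 0.0

    def apply(self, new_values):
        # Fold only the delta into the stored state.
        for v in new_values:
            self.count += 1
            self.total += v
        return self.value

    @property
    def value(self):
        return self.total / self.count if self.count else 0.0


def full_recompute(all_values):
    """Batch-style baseline: rescan every row on each refresh."""
    return sum(all_values) / len(all_values) if all_values else 0.0


history = [3.0, 5.0, 4.0]   # rows already processed
delta = [8.0]               # rows that arrived since the last refresh

state = IncrementalAverage()
state.apply(history)
incremental_result = state.apply(delta)          # touches 1 new row
batch_result = full_recompute(history + delta)   # touches all 4 rows

print(incremental_result, batch_result)
```

Both paths give the same answer, but the incremental path does work proportional to the size of the delta rather than the size of the full dataset. How often the delta is applied is the knob that trades freshness against cost.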

The SPOT standards define four requirements for such systems:

S : Support a full set of operators in an incremental fashion.

P : Deliver high performance at low cost.

O : Remain open to multiple engines for AI‑driven workloads.

T : Allow flexible tuning without code changes.

In practice, Xiaohongshu’s incremental compute reduced resource usage to one‑third, cut component count by two‑thirds, and lowered development effort to one‑third of the previous pipeline.

3. Practical Benefits and Outlook

Key optimizations include local joins in ClickHouse, materialized views covering 70% of queries, Bloom filter indexes for user‑level analysis, and Z‑Order sorting to shrink scanned data from 5.5 TB to 600 GB, delivering up to ten‑fold speedups.
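To illustrate why Z‑Order sorting shrinks the scanned data, here is a small sketch under stated assumptions: rows have two integer columns, files hold a fixed number of rows, and the engine skips any file whose min/max statistics do not overlap the filter. The source does not say which columns or file sizes were used, so the grid, `interleave_bits`, and `files_scanned` are hypothetical.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) key of two non-negative integers."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return z


def files_scanned(rows, sort_key, rows_per_file, y_lo, y_hi):
    """Count files whose min/max statistics on y overlap [y_lo, y_hi]."""
    ordered = sorted(rows, key=sort_key)
    files = [ordered[i:i + rows_per_file]
             for i in range(0, len(ordered), rows_per_file)]
    hits = 0
    for f in files:
        ys = [y for _, y in f]
        if min(ys) <= y_hi and max(ys) >= y_lo:
            hits += 1
    return hits, len(files)


# A toy 16 x 16 grid of (x, y) rows, written 16 rows per file.
rows = [(x, y) for x in range(16) for y in range(16)]

# Filter on the second column only: y between 0 and 3.
linear = files_scanned(rows, lambda r: r, 16, 0, 3)
zorder = files_scanned(rows, lambda r: interleave_bits(*r), 16, 0, 3)

print(linear, zorder)
```

On this toy grid, a plain sort by (x, y) leaves every one of the 16 files overlapping the filter on y, while the Z‑Order layout narrows the scan to 4 files, because each file then covers a compact block in both columns. The same mechanism, at much larger scale, is what lets a query skip most of the data.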

Future work focuses on further lakehouse enhancements, AI‑driven BI platforms, and expanding incremental computing to more business scenarios.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
