
How Xiaohongshu Cut Data Architecture Costs by Two‑Thirds with Incremental Computing

This article traces Xiaohongshu's data platform evolution from a simple ClickHouse-based analytics layer, through a Lambda architecture, to a lakehouse design, and shows how adopting a new incremental computing model cut architecture complexity, resource consumption, and development effort each to roughly one third while keeping second-level query performance on petabyte-scale data.

DataFunTalk

Background

Xiaohongshu is a lifestyle community with more than 350 million monthly active users. The app generates billions of log entries per day, creating demanding real-time and offline data requirements for community, commerce, and business analytics.

Data Architecture Evolution

1.0 ClickHouse‑based ad‑hoc analysis

Data was processed in Spark on a T+1 schedule, then loaded into ClickHouse for second-level queries (a minimal sketch of such a job follows the drawbacks below).

High infrastructure cost and difficult scaling, both because ClickHouse couples compute and storage.

Data freshness limited by the batch window.
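
To make the 1.0 flow concrete, below is a minimal PySpark sketch of such a T+1 job; the log path, table and column names, and the JDBC endpoint are illustrative assumptions, not Xiaohongshu's actual setup.

```python
from datetime import date, timedelta

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("t1-log-aggregation").getOrCreate()

# T+1: each morning the job processes yesterday's log partition in full.
yesterday = (date.today() - timedelta(days=1)).isoformat()

daily = (
    spark.read.parquet(f"s3://logs/events/dt={yesterday}")  # hypothetical path
    .groupBy("user_id", "event_type")
    .agg(F.count("*").alias("pv"),
         F.countDistinct("session_id").alias("sessions"))
)

# Load the aggregate into ClickHouse over JDBC for second-level ad-hoc queries.
(daily.write.format("jdbc")
    .option("url", "jdbc:clickhouse://clickhouse:8123/analytics")  # hypothetical
    .option("dbtable", "user_daily_stats")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .mode("append")
    .save())
```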

2.0 Lambda architecture (Flink + Spark + ClickHouse)

Introduced compute-storage separation for ClickHouse: MergeTree files are synced to object storage, with local SSD acting as a cache, which extends the queryable time range and lowers storage cost. Real-time streams from Flink and offline batches from Spark are merged in ClickHouse, covering insights from near-real-time to day-level granularity. Key optimizations (illustrative DDL follows the list):

Local join on user dimension to support user‑centric analysis.

Materialized views that reduce roughly 6 trillion daily rows to ~200 billion; about 70% of queries are served by the views.

Bloom‑filter index on user_id for fast point look‑ups.

Broad use of ClickHouse's multiple join types, materialized views, and index acceleration.
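
For illustration, the following sketch issues ClickHouse DDL for two of these optimizations through the clickhouse-driver Python client; the table and column names are assumptions, not the production schema.

```python
from clickhouse_driver import Client

client = Client(host="clickhouse")  # hypothetical host

# Bloom-filter skip index on user_id: point lookups can skip data granules
# that provably do not contain the requested user.
client.execute("""
    ALTER TABLE events
    ADD INDEX user_id_bf user_id TYPE bloom_filter(0.01) GRANULARITY 4
""")

# Materialized view that pre-aggregates the raw event stream so that the bulk
# of dashboard queries never touch the raw table.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_daily_mv
    ENGINE = SummingMergeTree
    ORDER BY (event_date, event_type)
    AS
    SELECT event_date, event_type, count() AS pv
    FROM events
    GROUP BY event_date, event_type
""")
```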

3.0 Lakehouse (Iceberg + Flink + Spark + StarRocks)

Unified data lake and warehouse with Apache Iceberg as the storage layer, Flink for ingestion, Spark for batch jobs, and StarRocks for fast analytics. Benefits:

Eliminated duplicate storage; all data lives in a single lake.

Automatic Z-Order sorting and intelligent data-file rewrites driven by observed query patterns (see the sketch after this list).

Scanned data per query dropped from 5.5 TB to ~600 GB (≈10× improvement).

Sub‑5‑second P90 latency for typical analytical queries.

Compression ratio roughly doubled compared with the previous ClickHouse solution.

Overall query performance improved ~3×; with the local cache, P90 latency holds at around 5 s.
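
The Z-Order rewrite can be pictured with Iceberg's standard rewrite_data_files stored procedure, invoked here from PySpark. Catalog, table, and sort columns are illustrative, and in the setup described above the rewrite is triggered automatically from StarRocks query logs rather than by hand.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-zorder-rewrite")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hive")
    .getOrCreate()
)

# Iceberg's stored procedure compacts small/unsorted data files and lays rows
# out in Z-order on the hottest filter columns, shrinking per-query scan size.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'dw.user_events',
        strategy => 'sort',
        sort_order => 'zorder(user_id, event_date)'
    )
""")
```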

Incremental Computing and SPOT Standards

Incremental computing is positioned as the fourth generation of data processing, aiming to resolve the “data impossibility triangle” of freshness, cost, and performance. The SPOT criteria define a production-ready incremental engine (a conceptual sketch of the incremental model follows the list):

S: Full-stack support for all operators in incremental mode (no operator is excluded).

P: High performance at low resource cost.

O: Open architecture that allows multiple engines (e.g., Flink, Spark, StarRocks) to consume the same data.

T: Tunable via configuration, without code changes.
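
A toy sketch in plain Python (no engine involved) of the core idea behind incremental computing: each run folds only the new delta into persisted state instead of rescanning all of history.

```python
from collections import Counter
from typing import Iterable

state: Counter = Counter()  # persisted aggregate: user_id -> event count

def full_recompute(all_events: Iterable[str]) -> Counter:
    """Classic batch: scans the entire history every run (cost grows with data)."""
    return Counter(all_events)

def incremental_update(new_events: Iterable[str]) -> Counter:
    """Incremental: folds only the delta into prior state (cost tracks the delta)."""
    state.update(new_events)
    return state

# Every few minutes only the new slice of events is processed:
incremental_update(["u1", "u2", "u1"])
incremental_update(["u2"])
assert state == Counter({"u1": 2, "u2": 2})
```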

Pilot Outcomes at Xiaohongshu

Resource usage reduced to one-third (≈1,800 CPU cores vs. ≈5,000 for the previous Spark T+1 pipeline).

Component count cut by two‑thirds, simplifying operations.

Development effort lowered to one‑third of the legacy workflow.

Freshness improved to 5-minute intervals while matching or exceeding Spark batch performance (1-2× speedup on incremental tables).

Real‑time aggregation tasks cost roughly ¼ of a comparable Flink implementation.

Key Technical Optimizations

Local join on user dimension to combine behavior and profile data efficiently.

Materialized views that compress roughly 6 trillion daily rows to ~200 billion, covering ~70% of queries.

Bloom‑filter index on user_id for rapid point queries.

JSON flattening ("Json Flatter"): columnarizes JSON fields, improving compression and query speed for semi-structured data (see the sketch after this list).

Inverted index for experiment‑group fields, delivering a 10× speedup for cohort analysis.

Automatic Z‑Order rewrite in Iceberg based on StarRocks query logs, reducing scanned data from 5.5 TB to ~600 GB (≈10×).

Overall storage compression doubled relative to ClickHouse, and query latency consistently stays under 5 seconds for 90 % of workloads.
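
As a sketch of the JSON-flattening idea in PySpark: a raw JSON string column is parsed against a declared schema and its fields are promoted to top-level columns, which is what enables columnar compression and pruning. The payload fields shown are invented for illustration.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("json-flatten").getOrCreate()

# One row with a raw JSON payload, as it might arrive from the log stream.
raw = spark.createDataFrame(
    [('{"exp_group": "A", "device": "ios"}',)], ["payload"]
)

schema = StructType([
    StructField("exp_group", StringType()),
    StructField("device", StringType()),
])

# from_json parses the string once; selecting "j.*" columnarizes every field.
flat = raw.select(F.from_json("payload", schema).alias("j")).select("j.*")
flat.show()
# +---------+------+
# |exp_group|device|
# +---------+------+
# |        A|   ios|
# +---------+------+
```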

Future Directions

Continue to push data freshness toward near-real-time, integrate AI-driven analytics such as vector search, and further evolve the lakehouse stack for multimodal data governance and low-cost, high-performance query processing.

Tags: performance optimization, big data, Xiaohongshu, data architecture, lakehouse, incremental computing