
How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

The article details Xiaohongshu's step‑by‑step migration from a simple ClickHouse‑based analytics stack to a Lambda‑style 2.0 architecture and finally to a Lakehouse‑based 3.0 design, highlighting concrete performance numbers, cost reductions, and the definition of a generic incremental‑compute model (SPOT) that underpins the evolution.

DataFunTalk

Xiaohongshu, a lifestyle community with over 350 million monthly active users, runs a data‑driven business that logs several hundred billion events per day, creating heavy demand for both real‑time and offline data processing.

1. Business and Data Overview

The platform supports four data‑value categories: analytics for executives, self‑service data products for merchants and creators, data services for recommendation/search teams, and AI‑driven insight generation. In 2024 the underlying infrastructure was migrated from AWS to Alibaba Cloud, moving 500 PB of data across 110 k tasks with 1 500 participants from more than 40 departments.

2. 1.0 Architecture – ClickHouse‑Centric Ad‑hoc Analytics

Initially the stack used ClickHouse as the sole serving layer, loading wide tables produced by Spark‑SQL into ClickHouse for near‑real‑time queries. This reduced query latency from minutes to seconds but introduced three major drawbacks:

High resource cost – ClickHouse clusters require substantial CPU and memory.

Scaling difficulty – storage‑compute coupling makes data migration painful during rapid growth.

Staleness – data arrived via a Spark T+1 batch, so business users always worked with figures that were a day old.

3. 2.0 Architecture – Lambda with Storage Separation

To address these pain points, Xiaohongshu built a Lambda architecture on top of open‑source ClickHouse. The MergeTree files were synchronized to object storage and local SSDs, which extended the queryable time range and cut storage cost.

Key enhancements:

Real‑time streams from Flink and offline batches from Spark were merged in ClickHouse, delivering metrics at freshness levels ranging from daily to real time.

Materialized views and multi‑type joins accelerated user‑behavior queries.

Daily ingestion reached ~600 billion rows, including user‑generated notes and tags.
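The merge of real‑time and offline data described above follows the usual Lambda serving pattern: a batch view covers everything up to a cutoff, and the speed layer contributes only events newer than that cutoff. The sketch below is a minimal in‑memory illustration of that idea, not Xiaohongshu's implementation; the function `merge_views`, the metric names, and the sample dates are all hypothetical.

```python
from collections import Counter
from datetime import date


def merge_views(batch_view, realtime_events, batch_cutoff):
    """Combine a precomputed batch view with events newer than the cutoff.

    batch_view:      {metric_name: total} computed by the offline job (hypothetical shape)
    realtime_events: iterable of (event_day, metric_name, count) from the stream
    batch_cutoff:    last day already included in batch_view
    """
    merged = Counter(batch_view)
    for event_day, metric, count in realtime_events:
        # Events on or before the cutoff are already counted in the batch view,
        # so only strictly newer events are added from the speed layer.
        if event_day > batch_cutoff:
            merged[metric] += count
    return dict(merged)


# Illustrative data only.
batch_view = {"note_views": 1000, "likes": 200}
realtime_events = [
    (date(2024, 1, 1), "note_views", 50),  # already covered by the batch view
    (date(2024, 1, 2), "note_views", 30),
    (date(2024, 1, 2), "likes", 5),
]
result = merge_views(batch_view, realtime_events, batch_cutoff=date(2024, 1, 1))
print(result)
```

The cutoff comparison is what prevents double counting when the same event is visible to both the batch and the streaming path.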

Performance optimizations such as local joins, materialized views covering 70 % of queries, and Bloom‑filter indexes on user IDs brought query latency down to seconds for >200 internal products.
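To illustrate why a Bloom‑filter index on user IDs helps, the following sketch keeps one small Bloom filter per data block and uses it to skip blocks that definitely do not contain the requested user. It is a simplified stand‑alone model, not ClickHouse's index implementation; the class `BloomFilter`, the helpers `build_index` and `blocks_to_scan`, the filter sizes, and the block names are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_index(blocks):
    """Build one filter per block (hypothetical layout: {block_name: [user_ids]})."""
    index = {}
    for name, user_ids in blocks.items():
        bf = BloomFilter()
        for uid in user_ids:
            bf.add(uid)
        index[name] = bf
    return index


def blocks_to_scan(index, user_id):
    """Return the blocks whose filter cannot rule out user_id."""
    return [name for name, bf in index.items() if bf.might_contain(user_id)]


blocks = {
    "block_0": ["u1", "u2", "u3"],
    "block_1": ["u4", "u5", "u6"],
    "block_2": ["u7", "u8", "u9"],
}
index = build_index(blocks)
candidates = blocks_to_scan(index, "u5")
print(candidates)
```

A Bloom filter never produces false negatives, so the block that really holds the user is always scanned; occasional false positives only cost an extra block read.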

4. 3.0 Architecture – Lakehouse with Incremental Compute

The next iteration unified the data lake and warehouse into a Lakehouse. Flink writes logs to Iceberg tables, Spark processes batch jobs, and StarRocks serves fast T+1 analytics on the DWS wide tables. The design eliminated the dual‑storage problem of the 2.0 stack and reduced ETL duplication.

Performance gains:

Automatic Z‑Order sorting and intelligent re‑sorting cut scanned data from >5.5 TB per query to ~600 GB (≈9× reduction), achieving 80‑90 % Z‑Order hit rates.

Overall query throughput improved ~3× compared with the previous ClickHouse‑only solution, and P90 latency stabilized around 5 seconds.

Compression ratio on lake storage doubled relative to ClickHouse.
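The scan reduction attributed to Z‑Order sorting can be illustrated with a toy model. Z‑ordering interleaves the bits of several columns into a single sort key, so each file covers a small range in every column and per‑file min/max statistics can exclude more files. The sketch below is a minimal illustration on a synthetic 8×8 grid, not the Iceberg or StarRocks implementation; `z_value`, `split_into_files`, and `files_scanned` are hypothetical helper names.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies odd bit positions
    return z


def split_into_files(rows, rows_per_file):
    """Cut an already-sorted row list into fixed-size 'files'."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_scanned(files, column, value):
    """Count files whose min/max statistics on `column` cannot exclude `value`."""
    count = 0
    for f in files:
        vals = [row[column] for row in f]
        if min(vals) <= value <= max(vals):
            count += 1
    return count


# Synthetic table: every (x, y) pair on an 8x8 grid, 4 rows per file.
rows = [(x, y) for x in range(8) for y in range(8)]
linear_files = split_into_files(sorted(rows), 4)
z_files = split_into_files(sorted(rows, key=lambda r: z_value(r[0], r[1])), 4)

# Filter on the second column (y == 5), which a plain (x, y) sort clusters poorly.
linear_scanned = files_scanned(linear_files, 1, 5)
z_scanned = files_scanned(z_files, 1, 5)
print(linear_scanned, z_scanned)
```

With these toy numbers the linear layout touches 8 of 16 files for the filter on y, while the Z‑ordered layout touches 4. A filter on the leading sort column would favour the linear layout instead, which is why Z‑ordering pays off mainly when queries filter on several columns.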

5. Generic Incremental Compute and SPOT Standards

The article defines “generic incremental compute” as the fourth generation of data processing that simultaneously satisfies high performance and low latency, extending the classic batch‑stream‑interactive triangle. Four SPOT standards are proposed:

S – Full‑stack support for incremental operators (no mixed‑mode pipelines).

P – High performance at low cost.

O – Openness to multiple engines (SQL, Python, AI functions).

T – Tunable configuration without code changes.
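The core idea behind incremental compute can be sketched as maintaining a result by applying only the changes since the last refresh, rather than recomputing over the full table. The example below is a minimal, assumption‑laden illustration of that principle for a grouped sum; it is not the engine used at Xiaohongshu, and `IncrementalSum`, `full_recompute`, and the delta format `(key, amount, sign)` are hypothetical.

```python
from collections import defaultdict


class IncrementalSum:
    """Maintains per-key sums by applying deltas instead of rescanning all rows."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.rows_processed = 0

    def apply(self, delta):
        """Apply a batch of changes.

        delta: iterable of (key, amount, sign), where sign is +1 for an
        inserted row and -1 for a retracted (deleted or superseded) row.
        """
        for key, amount, sign in delta:
            self.totals[key] += sign * amount
            self.rows_processed += 1

    def result(self):
        return dict(self.totals)


def full_recompute(rows):
    """Baseline: recompute the same sums from the complete table."""
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)


state = IncrementalSum()
state.apply([("a", 10, +1), ("b", 5, +1)])   # initial load
state.apply([("a", 3, +1)])                  # one new row
state.apply([("b", 5, -1), ("b", 7, +1)])    # an update: retract old, insert new

final_table = [("a", 10), ("a", 3), ("b", 7)]
print(state.result(), full_recompute(final_table))
```

Both paths yield the same totals, but each refresh of the incremental state touches only the changed rows. How often `apply` is called is then a freshness setting rather than a code change, which is the spirit of the "T" criterion above.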

In practice, Xiaohongshu’s incremental pipeline (implemented with Cloud‑Q’s technology) achieved:

Resource cost reduced to one‑third of the previous setup.

Component count cut to one‑third (single storage + single compute engine).

Development effort reduced to one‑third (a single pipeline replaces multiple Spark/Flink jobs).

6. Outcomes and Outlook

After the 3.0 rollout, the data platform supports analytics with latencies measured in seconds on trillion‑scale datasets, serves >200 products, and enables AI‑driven insights with a target of 70 % data‑usage penetration among front‑line teams. Future directions include tighter stream‑batch integration, further query‑performance tuning on Iceberg, and AI‑centric knowledge‑graph services.
