
How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

The article details Xiaohongshu's staged data‑platform evolution—from a simple ClickHouse ad‑hoc setup (1.0), through a Lambda‑based 2.0 design, to a lakehouse‑driven 3.0 architecture, and toward general incremental compute—highlighting cost reductions to roughly one‑third of baseline, performance gains of up to ten‑fold, and the SPOT standards that guide the new system.

DataFunTalk

Xiaohongshu is a lifestyle community app with over 350 million monthly active users; its business (community notes, live streams, e‑commerce) generates daily logs of several hundred billion rows, creating massive real‑time and batch data demands.

1.0 Data Architecture – Built on ClickHouse for ad‑hoc analysis. Data were processed in Spark (T+1) and loaded into ClickHouse, improving query latency from minutes to seconds. However, the design suffered from high resource cost, difficult scaling (storage–compute coupling), and data staleness inherent to the T+1 Spark‑to‑ClickHouse pipeline.
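The freshness limit of the 1.0 design can be sketched in a few lines: a T+1 pipeline materializes yesterday's partition overnight, so the newest data any query can see is always at least a day old. (The function name is illustrative, not from the article.)

```python
from datetime import date, timedelta

def freshest_partition(today: date) -> str:
    """In a T+1 batch pipeline, the freshest queryable
    partition is always the previous day's."""
    return (today - timedelta(days=1)).isoformat()

print(freshest_partition(date(2024, 5, 2)))  # -> 2024-05-01
```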

1.0 architecture diagram

2.0 Data Architecture – Introduced storage–compute separation and a Lambda architecture. Flink produced the real‑time (speed‑layer) data and Spark the batch‑layer data; both paths were merged in ClickHouse, enabling near‑real‑time metrics. Daily ingestion reached ~6 trillion rows, covering over 200 internal products. Performance optimizations included:

Local joins per user to support feature‑rich analysis.

Materialized views on enumerated fields, covering ~70% of queries and compressing the 6 trillion daily rows to ~2 trillion.

Bloom‑filter indexes on user IDs for fast look‑ups.
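The Bloom‑filter idea behind the user‑ID look‑up optimization can be illustrated with a minimal sketch (ClickHouse's actual skipping index is more sophisticated; the class below is a from‑scratch toy, not its implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a negative answer is definitive, so a
    per-data-part filter lets the engine skip parts that cannot
    contain the queried user ID."""
    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k independent positions from salted hashes.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent" -> the data part can be skipped.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
for uid in ("user_1", "user_2", "user_3"):
    bf.add(uid)
print(bf.might_contain("user_2"))    # True
print(bf.might_contain("user_999"))  # False (no false negative is possible)
```

Because the filter only ever returns false positives, never false negatives, point look‑ups on a user ID can prune most of the data without risking missed rows.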

2.0 architecture diagram

3.0 Lakehouse Architecture – Adopted a lakehouse that unifies the data lake (Iceberg) and the data warehouse (StarRocks). Flink writes logs to Iceberg, Spark runs batch jobs, and StarRocks provides fast T+1 analytics on DWS (summary‑layer) wide tables. Automatic Z‑Order sorting and intelligent re‑sorting reduced scanned data from 5.5 TB to 600 GB (≈10× improvement) and cut 90th‑percentile query latency to ~5 s. Compression doubled compared with pure ClickHouse, and overall query performance improved three‑fold.
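Z‑Order sorting works by interleaving the bits of several column values into one sort key (a Morton code), so rows that are close in any of the sorted columns end up physically adjacent and range filters on either column prune far more files. A minimal two‑column sketch (the production implementation in Iceberg handles arbitrary types and column counts):

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two column values (Morton code).
    Sorting rows by this key clusters them in both dimensions."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions <- x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions  <- y
    return key

rows = [(3, 5), (3, 6), (10, 1), (2, 7)]
rows.sort(key=lambda r: z_order_key(*r))
print(rows)  # -> [(3, 5), (3, 6), (2, 7), (10, 1)]
```

After this sort, min/max statistics per file become tight for both columns at once, which is what shrinks the scanned volume for multi‑dimensional filters.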

3.0 lakehouse diagram

The platform also migrated its infrastructure from AWS to Alibaba Cloud in 2024, moving 500 PB of data across 110,000 tasks with 1,500 participants, setting an industry record for migration complexity.

General Incremental Compute – Defined as a fourth‑generation data‑processing model that achieves high performance and low latency at once, breaking the classic "impossible triangle" trade‑off among freshness, cost, and efficiency. The authors propose the SPOT standards:

S : Unified full‑data expression for all operators.

P : High performance at low cost.

O : Open architecture allowing multiple engines to consume the same data.

T : Tunable without code changes.
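The core idea of incremental compute, as opposed to T+1 full recomputation, can be sketched as folding only the newly arrived delta into persisted state, so cost scales with the change rather than the full history. (The class and function names below are illustrative, not Xiaohongshu's API.)

```python
from collections import defaultdict

def full_recompute(events):
    """Batch (T+1) style: rescan every event to rebuild the aggregate."""
    totals = defaultdict(int)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)

class IncrementalAgg:
    """Incremental style: apply only the delta to persisted state,
    so each update costs O(|delta|) instead of O(|history|)."""
    def __init__(self):
        self.totals = defaultdict(int)

    def apply_delta(self, delta):
        for user, amount in delta:
            self.totals[user] += amount
        return dict(self.totals)

history = [("a", 1), ("b", 2), ("a", 3)]
delta = [("b", 4)]

inc = IncrementalAgg()
inc.apply_delta(history)                                   # initial load
same = inc.apply_delta(delta) == full_recompute(history + delta)
print(same)  # -> True: incremental state matches a full rebuild
```

The equivalence check is the key property: the incremental state is always identical to what a full recomputation would produce, but without rescanning history on every refresh.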

SPOT standards

Applying these standards, Xiaohongshu reduced resource cost, component count, and development effort each to roughly one‑third of the previous baseline. For workloads requiring 5‑minute freshness, incremental compute pipelines achieved a 1–2× speed‑up over Spark T+1, and resource consumption for real‑time aggregation fell to about 25% of a comparable Flink job.

Future directions include deeper AI integration (semantic multimodal governance, Memory Lake hierarchical memory), hybrid‑cloud deployment, and expanding the lakehouse to support both structured and unstructured data with AI‑driven functions.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Big Data · Flink · ClickHouse · Spark · Xiaohongshu · Data Architecture · Lakehouse · Incremental Compute
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
