How Xiaohongshu Revamped Its Data Architecture for the Big AI Data Era
Xiaohongshu transformed its data platform from a simple ClickHouse‑based analytics stack into a unified lakehouse with generic incremental compute, cutting architecture complexity, resource cost, and development effort to roughly one‑third of their previous levels while supporting petabyte‑scale, sub‑second queries across its 350‑million‑user app.
Business and Data Overview
Xiaohongshu is a lifestyle community app with over 350 million monthly active users. Its product loops—community notes, live streams, and an e‑commerce marketplace—are heavily data‑driven, generating daily log volumes of several hundred billion events. The platform originally relied on a conventional data‑warehouse stack (custom scheduling, operations, asset‑management, governance, and reporting tools) to deliver four major data‑value categories: analytics, data products, data services, and AI‑driven insights.
In 2024 the underlying infrastructure was migrated from AWS to Alibaba Cloud, a move spanning 500 PB of data, 110,000 tasks, and 1,500 engineers across 40+ departments, reportedly an industry record for migration complexity.
Evolution of the Data Architecture
1.0 – ClickHouse‑Centric Instant Analytics
The first architecture loaded wide‑table data from an offline warehouse into ClickHouse for near‑real‑time analysis. This reduced query latency from minutes (Spark SQL) to seconds (ClickHouse) but introduced three critical drawbacks:
High cost: ClickHouse clusters demand substantial CPU and memory resources.
Scaling difficulty: ClickHouse couples compute and storage, making horizontal scaling and data migration painful.
Poor data freshness: Data arrived via a T+1 (next‑day) Spark pipeline, sketched below, so business users always saw day‑old data.
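For context, the 1.0 load path was essentially a daily Spark batch writing wide tables into ClickHouse over JDBC. A minimal sketch of that pattern follows; the table, host, and driver setup are hypothetical, not Xiaohongshu's actual job:

```python
from pyspark.sql import SparkSession

# Hypothetical T+1 load: read yesterday's partition of the offline wide table
# and bulk-insert it into ClickHouse over JDBC (driver assumed on classpath).
spark = SparkSession.builder.appName("t1-wide-table-load").getOrCreate()

wide = spark.sql(
    "SELECT * FROM dw.user_behavior_wide WHERE dt = date_sub(current_date(), 1)"
)

# The full day of pipeline delay here is exactly the freshness problem
# the 2.0 architecture set out to fix.
(wide.write
    .format("jdbc")
    .option("url", "jdbc:clickhouse://ch-host:8123/analytics")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .option("dbtable", "user_behavior_wide")
    .mode("append")
    .save())
```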
2.0 – Lambda Architecture with Flink + Spark
To address the 1.0 pain points, Xiaohongshu built a Lambda architecture with separated storage and compute on top of open‑source ClickHouse: MergeTree files were synchronized to both object storage and local SSDs, extending the queryable time range and lowering storage cost.
Key enhancements included:
Real‑time ingestion via Flink and batch processing via Spark, both feeding ClickHouse as a unified serving layer.
Daily ingestion of ~600 billion user‑behavior events, plus user‑profile and tag data, enabling cross‑analysis.
Performance optimizations such as user‑sharded local joins (each user's events co‑located on one shard, so joins avoid cross‑node shuffles), materialized views covering 70 % of queries (compressing 600 billion raw rows to ~200 billion aggregated rows), and Bloom‑filter indexes on user IDs.
These changes delivered sub‑second query response on petabyte‑scale data and eliminated the need for ad‑hoc data requests from the data‑services team.
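For illustration, here is a minimal ClickHouse sketch of two of the techniques above: a Bloom‑filter skip index on user IDs and a pre‑aggregating materialized view. The client setup, table, and column names are hypothetical, not Xiaohongshu's production schema:

```python
import clickhouse_connect  # pip install clickhouse-connect

client = clickhouse_connect.get_client(host="localhost")

# Raw behavior-event table; the bloom_filter skip index on user_id lets
# per-user point lookups skip most granules instead of scanning them.
client.command("""
CREATE TABLE IF NOT EXISTS behavior_events (
    event_date Date,
    user_id    UInt64,
    event_type LowCardinality(String),
    note_id    UInt64,
    INDEX idx_user user_id TYPE bloom_filter GRANULARITY 4
)
ENGINE = MergeTree
PARTITION BY event_date
ORDER BY (event_date, user_id)
""")

# Materialized view that pre-aggregates per user/day, standing in for the
# views that reportedly compressed ~600B raw rows into ~200B aggregated rows.
client.command("""
CREATE MATERIALIZED VIEW IF NOT EXISTS behavior_daily_mv
ENGINE = SummingMergeTree
ORDER BY (event_date, user_id, event_type)
AS SELECT event_date, user_id, event_type, count() AS pv
FROM behavior_events
GROUP BY event_date, user_id, event_type
""")
```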
3.0 – Lakehouse with Incremental Compute
The 2.0 design still suffered from dual storage (object storage vs. ClickHouse files) and dual compute engines (Flink vs. Spark), leading to higher storage cost, data inconsistency risk, and divergent code bases. The 3.0 solution adopted a lakehouse model:
Flink processes logs into Iceberg tables (data lake).
Spark runs batch jobs on Iceberg.
StarRocks provides fast T+1 analytics on the DWS‑layer (aggregated summary) wide tables; a minimal pipeline sketch follows this list.
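To make the division of labor concrete, here is a minimal PySpark sketch of the batch leg of this pipeline under assumed names: an Iceberg Hadoop catalog called lake, an ODS log table written by Flink, and a DWS wide table later served by StarRocks. None of these identifiers come from the article:

```python
from pyspark.sql import SparkSession

# Minimal Iceberg-on-Spark setup; catalog name, warehouse path, and table
# names are illustrative assumptions.
spark = (
    SparkSession.builder
    .appName("dws-note-engagement")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://bucket/warehouse")
    .getOrCreate()
)

# Batch-aggregate the Flink-ingested ODS log table into a DWS wide table
# that StarRocks can then query through an Iceberg external catalog.
spark.sql("""
    INSERT OVERWRITE lake.dws.note_engagement_1d
    SELECT dt, note_id,
           count(*)                AS events,
           count(DISTINCT user_id) AS uv
    FROM lake.ods.behavior_events
    WHERE dt = date_sub(current_date(), 1)
    GROUP BY dt, note_id
""")
```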
Automatic Z‑Order sorting and intelligent index management reduced scanned data from 5.5 TB to ~600 GB (≈10× reduction) and improved P90 query latency to ~5 seconds. Compression rates doubled compared with the ClickHouse‑only setup, and overall query performance improved three‑fold.
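Iceberg exposes Z‑Order clustering through its rewrite_data_files maintenance procedure; continuing the Spark session from the sketch above, an illustrative maintenance call (hypothetical table and columns) might look like:

```python
# Z-order rewrite clusters rows on correlated filter columns so that
# min/max file statistics let queries skip most data files entirely.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table      => 'ods.behavior_events',
        strategy   => 'sort',
        sort_order => 'zorder(user_id, note_id)'
    )
""")
```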
Generic Incremental Compute
The article defines “generic incremental compute” as a fourth‑generation data‑processing paradigm that delivers high performance and low latency at once, easing the classic “impossible triangle” trade‑off among data freshness, cost, and query efficiency. It adheres to four “SPOT” standards:
S – A single, complete data expression supporting all operators.
P – High performance at low cost.
O – Open, allowing multiple engines to consume the same data.
T – Tunable without code changes.
In a 2025 proof‑of‑concept with Cloud‑Q Technology, Xiaohongshu rewrote Spark jobs as incremental pipelines, achieving:
Resource cost reduced to one‑third.
Component count reduced to one‑third (single storage + single compute engine).
Development effort reduced to one‑third (one pipeline replaces both batch and streaming).
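The article does not show Cloud‑Q's pipeline code, but the one‑pipeline idea can be sketched with Spark Structured Streaming, which likewise folds only newly arrived input into a running aggregate on each trigger; every name below is illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# One pipeline instead of separate Flink (streaming) and Spark (batch) jobs.
# Illustrative sketch only; the POC used Cloud-Q's incremental engine.
spark = SparkSession.builder.appName("incremental-note-pv").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "behavior_events")
    .load()
    .select(
        F.get_json_object(F.col("value").cast("string"), "$.note_id").alias("note_id"),
        F.col("timestamp"),
    )
)

# Incremental aggregation: each 5-minute trigger folds only new input into
# the running per-note counts instead of recomputing from scratch.
counts = events.groupBy(F.window("timestamp", "5 minutes"), "note_id").count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")  # stand-in sink; a lake table sink would be used in practice
    .trigger(processingTime="5 minutes")
    .start()
)
query.awaitTermination()
```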
Performance tests showed the incremental pipelines sustaining 5‑minute data freshness while running 1–2× faster than pure Spark, with real‑time aggregation tasks consuming only ~25 % of the resources required by traditional Flink implementations.
Practical Benefits and Outlook
Incremental compute also enabled efficient handling of semi‑structured (JSON) data via Json Flatter (columnar materialization of JSON fields) and a 10× boost in data‑skipping query performance through inverted indexes. The lakehouse now stores more than 300 PB of data, growing by 4 PB per day, all ingested in near real time.
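As a rough illustration of the columnar‑materialization idea behind Json Flatter (the article names the feature but not its API), a PySpark job can parse a JSON payload once and persist frequently read fields as native columns, so queries scan compact columnar data instead of re‑parsing JSON; the schema and table names below are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("json-flatten").getOrCreate()

# Fields of interest inside the raw JSON payload (illustrative).
payload_schema = StructType([
    StructField("note_id", LongType()),
    StructField("action", StringType()),
    StructField("device", StringType()),
])

# Source table assumed to carry event_time, user_id, and a `payload` JSON string.
raw = spark.table("lake.ods.raw_events")

# Parse once, then materialize hot fields as top-level columns.
flat = (
    raw.withColumn("p", F.from_json("payload", payload_schema))
       .select("event_time", "user_id", "p.note_id", "p.action", "p.device")
)
flat.writeTo("lake.dwd.events_flat").createOrReplace()
```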
Future directions include tighter integration of AI‑driven analysis, further query‑performance tuning under Iceberg, and continued exploration of unified knowledge‑graph services that combine structured tables, vector indexes, and scalar metadata for conversational AI.
