How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era
The article traces Xiaohongshu's four‑stage data‑platform evolution: from a simple ClickHouse ad‑hoc setup, through a Lambda‑based 2.0 design, to a lakehouse‑driven 3.0 architecture, and on to general incremental compute as the fourth stage. It highlights cost reduction to one‑third of baseline, performance gains of up to ten‑fold, and the SPOT standards that guide the new system.
Xiaohongshu is a lifestyle community app with over 350 million monthly active users; its business (community notes, live streams, e‑commerce) generates daily logs of several hundred billion rows, creating massive real‑time and batch data demands.
1.0 Data Architecture – Built on ClickHouse for ad‑hoc analysis. Data were processed in Spark on a T+1 schedule and loaded into ClickHouse, cutting query latency from minutes to seconds. However, the design suffered from high resource cost, difficult scaling (storage and compute were coupled), and stale data, since results only arrived after the daily Spark‑to‑ClickHouse load.
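To make the 1.0 flow concrete, here is a minimal PySpark sketch of such a T+1 job: aggregate the previous day's log partition, then bulk‑load the result into ClickHouse over JDBC. The table names, columns, paths, and connection URL are illustrative assumptions, not details from the source.

```python
from pyspark.sql import SparkSession, functions as F

# Minimal sketch of a 1.0-era T+1 pipeline: aggregate yesterday's logs in
# Spark, then bulk-load the result into ClickHouse for ad-hoc queries.
# All table names, columns, and connection details are hypothetical.
spark = SparkSession.builder.appName("t1-load-clickhouse").getOrCreate()

daily_metrics = (
    spark.read.parquet("hdfs:///logs/events/dt=2024-01-01")  # yesterday's partition
    .groupBy("product_id", "event_type")
    .agg(F.count("*").alias("pv"), F.countDistinct("user_id").alias("uv"))
)

# Load via the ClickHouse JDBC driver; the whole day is rewritten at once,
# which is why downstream data was always at least one day stale (T+1).
(daily_metrics.write
    .format("jdbc")
    .option("url", "jdbc:clickhouse://ch-host:8123/analytics")
    .option("dbtable", "daily_metrics")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .mode("append")
    .save())
```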
2.0 Data Architecture – Introduced storage–compute separation and a Lambda architecture. Flink produced the real‑time stream, Spark produced the batch layer, and the two were merged in ClickHouse, enabling near‑real‑time metrics. Daily ingestion reached ~6 trillion rows across more than 200 internal products. Performance optimizations included the following (a sketch of the ClickHouse‑side techniques follows the list):
Local joins per user to support feature‑rich analysis.
Materialized views on enumerated fields, which covered ~70 % of queries and compressed the 6 trillion daily rows to ~2 trillion.
Bloom‑filter indexes on user IDs for fast look‑ups.
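A hedged sketch of the latter two optimizations, using the clickhouse_driver Python client: a bloom_filter skip index on user_id and a materialized view that pre‑aggregates an enumerated field. The schema, the `channel` dimension, and all names are invented for illustration.

```python
from clickhouse_driver import Client

# Hypothetical schema illustrating two 2.0-era optimizations: a bloom_filter
# skip index on user_id for fast look-ups, and a materialized view that
# pre-aggregates an enumerated field so most queries never touch raw rows.
client = Client(host="ch-host")

client.execute("""
CREATE TABLE IF NOT EXISTS events (
    dt        Date,
    user_id   UInt64,
    channel   LowCardinality(String),  -- enumerated field
    duration  UInt32,
    INDEX idx_user user_id TYPE bloom_filter GRANULARITY 4  -- user look-ups
) ENGINE = MergeTree
PARTITION BY dt
ORDER BY (channel, user_id)
""")

# Pre-aggregate by the enumerated field; queries grouping by channel hit
# this view instead of the raw table, shrinking the rows scanned.
client.execute("""
CREATE MATERIALIZED VIEW IF NOT EXISTS events_by_channel
ENGINE = SummingMergeTree
PARTITION BY dt
ORDER BY (dt, channel)
AS SELECT dt, channel, count() AS pv, sum(duration) AS total_duration
FROM events
GROUP BY dt, channel
""")
```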
3.0 Lakehouse Architecture – Adopted a lakehouse that unifies the data lake (Iceberg) and the data warehouse (StarRocks). Flink writes logs to Iceberg, Spark runs batch jobs, and StarRocks serves fast T+1 analytics on DWS‑layer wide tables. Automatic Z‑Order sorting and intelligent re‑sorting reduced scanned data from 5.5 TB to 600 GB (≈10× less) and brought 90th‑percentile query latency down to ~5 s. The compression ratio doubled compared with the pure ClickHouse stack, and overall query performance improved three‑fold.
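Z‑Order re‑sorting of this kind can be expressed with Iceberg's `rewrite_data_files` Spark procedure, which rewrites data files in Z‑order so queries filtering on the sort columns can skip most files. A minimal sketch follows; the catalog name `lake`, the table, and the sort columns are placeholder assumptions.

```python
from pyspark.sql import SparkSession

# Sketch of a 3.0-era re-sorting step: Iceberg's rewrite_data_files
# procedure rewrites files in Z-order, enabling file skipping on the
# sort columns. Catalog, table, and columns are placeholders.
spark = (
    SparkSession.builder.appName("iceberg-zorder")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hive")
    .getOrCreate()
)

spark.sql("""
CALL lake.system.rewrite_data_files(
    table => 'dws.note_metrics',
    strategy => 'sort',
    sort_order => 'zorder(user_id, note_id)'
)
""")
```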
The platform also migrated its infrastructure from AWS to Alibaba Cloud in 2024, moving 500 PB of data across 110 000 tasks with 1 500 participants, setting an industry record for migration complexity.
General Incremental Compute – Defined as a fourth‑generation data‑processing model that achieves high performance and low latency at the same time, easing the classic "impossible triangle" of data processing (freshness, cost, and query efficiency), which holds that a system cannot optimize all three at once. The authors propose the SPOT standards (an incremental‑compute sketch follows the list):
S: Unified full‑data expression for all operators.
P: High performance at low cost.
O: Open architecture allowing multiple engines to consume the same data.
T: Tunable without code changes.
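To illustrate what "incremental" means here (as a hedged sketch, not Xiaohongshu's actual implementation), the code below uses Iceberg's snapshot‑based incremental read in Spark: each run processes only rows appended since the last run and merges the delta into a running aggregate, instead of re‑scanning the full table as a T+1 job would. The table names, catalog, and toy checkpoint file are all assumptions.

```python
import json
import pathlib

from pyspark.sql import SparkSession, functions as F

# Hedged sketch of incremental compute over an Iceberg table: each run reads
# only snapshots appended since the previous run and MERGEs the delta into a
# running aggregate, rather than re-scanning all history (T+1 style).
# Table names, the 'lake' catalog, and the checkpoint file are illustrative.
CKPT = pathlib.Path("/tmp/note_pv_ckpt.json")  # toy checkpoint store

def load_checkpoint():
    return json.loads(CKPT.read_text())["snapshot_id"] if CKPT.exists() else None

def save_checkpoint(snapshot_id):
    CKPT.write_text(json.dumps({"snapshot_id": snapshot_id}))

spark = (
    SparkSession.builder.appName("incremental-agg")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hive")
    .getOrCreate()
)

# Read only snapshots newer than the checkpoint; first run reads everything.
reader = spark.read.format("iceberg")
last = load_checkpoint()
if last is not None:
    reader = reader.option("start-snapshot-id", str(last))
delta = reader.load("lake.dwd.events")

# Fold the delta into the serving aggregate instead of rebuilding it.
delta.groupBy("note_id").agg(F.count("*").alias("pv_delta")) \
     .createOrReplaceTempView("inc")
spark.sql("""
MERGE INTO lake.dws.note_pv t
USING inc s ON t.note_id = s.note_id
WHEN MATCHED THEN UPDATE SET t.pv = t.pv + s.pv_delta
WHEN NOT MATCHED THEN INSERT (note_id, pv) VALUES (s.note_id, s.pv_delta)
""")

# Record the latest snapshot via Iceberg's 'snapshots' metadata table.
latest = spark.sql(
    "SELECT snapshot_id FROM lake.dwd.events.snapshots "
    "ORDER BY committed_at DESC LIMIT 1"
).first()[0]
save_checkpoint(latest)
```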
Applying these standards, Xiaohongshu reduced resource cost, component count, and development effort each to roughly one‑third of the previous baseline. Incremental pipelines delivering 5‑minute freshness ran 1–2× faster than the equivalent Spark T+1 jobs, and real‑time aggregation consumed about 25 % of the resources of a comparable Flink job.
Future directions include deeper AI integration (semantic multimodal governance, Memory Lake hierarchical memory), hybrid‑cloud deployment, and expanding the lakehouse to support both structured and unstructured data with AI‑driven functions.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
