How Xiaohongshu Re‑engineered Its Data Architecture for the Big AI Data Era
Xiaohongshu evolved its data platform from a simple ClickHouse-based ad-hoc analysis setup, through a Lambda-style architecture, to a lakehouse built on Iceberg, StarRocks, Flink and Spark. The result cut architecture complexity, resource cost and development cost by about two-thirds while supporting trillion-row daily data volumes with second-level query latency.
Xiaohongshu is a lifestyle community app with over 350 million monthly active users. Its business spans community notes, live streaming, e‑commerce and advertising, generating daily log volumes of several hundred billion events and demanding both real‑time and offline data processing.
1.0 Data Architecture (ClickHouse-based ad-hoc analysis) – Data was first processed by a T+1 Spark batch pipeline and then loaded into ClickHouse for interactive queries. This cut query latency from minutes to seconds but introduced three major drawbacks:
High cost: ClickHouse clusters require substantial CPU and memory resources.
Scaling challenges: ClickHouse couples compute and storage, so data has to be migrated whenever the cluster expands with the business.
Data freshness: the Spark → ClickHouse batch pipeline delays data by a day (T+1), so queries run on stale data.
2.0 Data Architecture (Lambda with storage separation) – Xiaohongshu introduced a Lambda pattern that synchronises ClickHouse's MergeTree files to object storage and local SSDs, extending the time range of queryable data and lowering storage cost. Real-time data from Flink and offline data from Spark are merged in ClickHouse, so analyses can cover anything from day-old to real-time data. Roughly 6 × 10¹² rows are streamed into ClickHouse each day; materialised views serve about 70 % of queries and compress those rows to about 2 × 10¹¹. Three optimisations support this:
Local join on user‑level partitions for behavioural analysis.
Materialised views on enumerated fields (≈70 % hit rate).
Bloom‑filter indexes on user IDs for fast look‑ups.
These changes delivered second‑level analysis on trillion‑scale data and eliminated the need for manual data requests from downstream product teams.
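The effect of a materialised view on enumerated fields can be illustrated with a small sketch. This is not ClickHouse code: the event tuples, field names and helper functions below are hypothetical, and the only figures taken from the text are the 6 × 10¹² and 2 × 10¹¹ row counts. The idea is that pre-aggregating by low-cardinality dimensions lets most queries read a much smaller table.

```python
from collections import defaultdict

# Hypothetical raw events: (user_id, platform, event_type).
# Field names are illustrative, not Xiaohongshu's actual schema.
raw_events = [
    (1, "ios", "view"),
    (2, "ios", "view"),
    (3, "android", "view"),
    (1, "ios", "like"),
    (4, "android", "view"),
    (5, "ios", "view"),
]


def build_rollup(events):
    """Pre-aggregate events by their enumerated dimensions (platform, event_type)."""
    rollup = defaultdict(int)
    for _user_id, platform, event_type in events:
        rollup[(platform, event_type)] += 1
    return dict(rollup)


def count_events(rollup, platform=None, event_type=None):
    """Answer a filtered count from the rollup instead of scanning raw events."""
    return sum(
        n
        for (p, e), n in rollup.items()
        if (platform is None or p == platform)
        and (event_type is None or e == event_type)
    )


rollup = build_rollup(raw_events)
ios_views = count_events(rollup, platform="ios", event_type="view")

# Row reduction implied by the figures quoted in the text.
reduction = 6e12 / 2e11
print(len(raw_events), "raw rows ->", len(rollup), "rollup rows")
print("row reduction from the article's figures:", reduction)
```

With the quoted figures the reduction works out to a factor of 30, which is why serving about 70 % of queries from the aggregated rows pays off so heavily.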
3.0 Lakehouse (Iceberg + StarRocks + Flink + Spark) – The 2.0 design left three pain points: dual storage (object store vs ClickHouse), dual compute engines (Flink vs Spark), and ClickHouse's lack of ETL capability. To address them, the team adopted a lakehouse architecture in which Flink handles log ingestion, Iceberg stores raw data, Spark runs batch jobs, and StarRocks provides fast T+1 analytics on the dws wide tables. Automatic Z-Order sorting and intelligent re-sorting reduce scanned data from 5.5 TB to 600 GB (roughly a 9× reduction), and 80-90 % of queries benefit from the Z-Order layout. The compression ratio on lake storage is double that of ClickHouse, and overall query performance improved three-fold, with P90 latency around 5 seconds.
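Why Z-Order sorting cuts scanned data can be shown with a toy model. The sketch below is an assumption-laden illustration, not Iceberg's or StarRocks' implementation: it interleaves the bits of two integer columns into a Morton code, splits the sorted rows into fixed-size "files", and counts how many files a min/max statistics check would have to open for a point filter. All function names are hypothetical.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z-order) code: bit i of x goes to position 2i, bit i of y to 2i + 1."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def write_files(rows, rows_per_file):
    """Split rows into 'files' and record per-file min/max stats for each column."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        xs = [r[0] for r in chunk]
        ys = [r[1] for r in chunk]
        files.append({"x": (min(xs), max(xs)), "y": (min(ys), max(ys))})
    return files


def files_scanned(files, column, value):
    """Count files whose min/max range for `column` could contain `value`."""
    return sum(1 for f in files if f[column][0] <= value <= f[column][1])


# A 16 x 16 grid of (x, y) pairs stands in for two filter columns.
rows = [(x, y) for x in range(16) for y in range(16)]

# Layout 1: plain sort by x, then y.
linear = write_files(sorted(rows), rows_per_file=16)
# Layout 2: sort by the interleaved Z-order code.
zorder = write_files(sorted(rows, key=lambda r: interleave_bits(*r)), rows_per_file=16)

for name, layout in (("linear", linear), ("z-order", zorder)):
    print(name, "x=5:", files_scanned(layout, "x", 5), "y=5:", files_scanned(layout, "y", 5))
```

In this toy case the linear sort is ideal for filters on the leading column but forces a full scan for the second one, whereas the Z-order layout needs 4 of 16 files for either column. That balanced pruning across several filter columns is the property the text relies on.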
In 2025 the team explored a Snowflake‑like incremental compute model, defining a Dynamic Table that encapsulates all ETL logic (union, joins, etc.). Validation on core offline and real‑time pipelines showed:
Incremental tables are 1‑2× faster than Spark batch for the same freshness interval.
Cost is comparable to Spark for full‑order tables.
Real‑time aggregation jobs consume roughly ¼ of the resources of traditional Flink jobs.
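The intuition behind these results is that an incrementally maintained table only touches newly arrived rows on each refresh, while a batch job re-reads the full history. The following is a minimal sketch of that difference for a simple per-key count; the class, the function and the sample batches are hypothetical and do not represent the Dynamic Table engine described above.

```python
from collections import defaultdict


class IncrementalCount:
    """Maintains per-key counts by applying only newly arrived rows (deltas)."""

    def __init__(self):
        self.state = defaultdict(int)
        self.rows_processed = 0

    def apply_delta(self, delta_rows):
        for key in delta_rows:
            self.state[key] += 1
        self.rows_processed += len(delta_rows)


def full_recompute(all_rows):
    """Batch-style refresh: recompute the counts from the entire history."""
    counts = defaultdict(int)
    for key in all_rows:
        counts[key] += 1
    return dict(counts)


# Three refresh intervals of hypothetical input.
batches = [["a", "b", "a"], ["b", "c"], ["a"]]

inc = IncrementalCount()
history = []
full_rows_processed = 0
results_match = True

for batch in batches:
    history.extend(batch)
    inc.apply_delta(batch)                 # work proportional to the delta
    full_result = full_recompute(history)  # work proportional to all history
    full_rows_processed += len(history)
    results_match = results_match and dict(inc.state) == full_result

print("incremental rows processed:", inc.rows_processed)
print("full recompute rows processed:", full_rows_processed)
```

Both approaches produce the same result at every refresh, but the gap in rows processed widens as history grows. Operators such as joins need more elaborate state than a counter, which is outside the scope of this sketch.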
Additional optimisations include JSON Flatter (columnar storage of JSON fields, which halves the compressed size), inverted-index-based experiment-group queries (10× speedup), and a unified architecture that lowers development and maintenance overhead.
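The speedup for experiment-group queries comes from looking users up by group rather than scanning every user's group list. A rough sketch of that idea follows, assuming a hypothetical mapping from user IDs to experiment-group labels; the labels and function names are invented for illustration and are not the engine's actual index format.

```python
from collections import defaultdict

# Hypothetical data: each user belongs to several experiment groups.
user_groups = {
    101: ["exp_a:treatment", "exp_b:control"],
    102: ["exp_a:control"],
    103: ["exp_a:treatment", "exp_b:treatment"],
    104: ["exp_b:control"],
}


def build_inverted_index(user_groups):
    """Map each experiment group to the set of user IDs in it (a posting list)."""
    index = defaultdict(set)
    for user_id, groups in user_groups.items():
        for group in groups:
            index[group].add(user_id)
    return index


def users_in_all(index, groups):
    """Return users that belong to every listed group by intersecting posting lists."""
    postings = [index.get(group, set()) for group in groups]
    if not postings:
        return set()
    return set.intersection(*postings)


index = build_inverted_index(user_groups)
both = users_in_all(index, ["exp_a:treatment", "exp_b:control"])
print(sorted(both))
```

A query touches only the posting lists for the groups it names, so its cost depends on the size of those groups rather than on the total number of users.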
General Incremental Compute and SPOT Standards – The article describes incremental compute as the fourth generation of data processing, one that aims to reconcile the three corners of the classic "data impossibility triangle" (freshness, cost, performance) instead of trading one off against the others. The SPOT criteria are:
S : Uniform full‑data expression support for all operators.
P : High performance with low cost.
O : Open architecture allowing multiple engines to consume the same data.
T : Tunable without code changes.
Applying SPOT, Xiaohongshu achieved roughly one‑third of the previous resource, component, and development costs while maintaining the required performance and freshness.
Looking ahead, the roadmap aligns with industry trends: unified stream‑batch processing, deeper lake‑warehouse integration, and AI‑driven data services that provide logical views, semantic governance, and materialised acceleration for downstream AI models.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
