How Xiaohongshu Cut Data Architecture Complexity and Cost by One‑Third in the Big AI Data Era
The article details Xiaohongshu's evolution from a simple ClickHouse‑based analytics layer to a Lambda‑enabled 2.0 stack and finally a Lakehouse‑based 3.0 architecture, showing how each iteration reduced infrastructure complexity, resource consumption and development effort by roughly one‑third while supporting trillions of daily events and AI‑driven use cases.
Xiaohongshu is a lifestyle community app with over 350 million monthly active users. Its business spans community notes, live streaming, e‑commerce and advertising, generating daily logs of several hundred billion events that drive both real‑time and offline analytics.
1.0 Data Architecture: The initial stack relied on ClickHouse for ad-hoc analysis. Moving data from Spark to ClickHouse cut query latency from minutes to seconds, but the architecture had three problems: high CPU and memory cost, difficult scaling because ClickHouse couples compute and storage, and stale data because Spark batch jobs introduced a T+1 (next-day) delay.
2.0 Data Architecture: To address 1.0's pain points, Xiaohongshu introduced a Lambda architecture with storage separated from compute. ClickHouse MergeTree files were synchronized to object storage and local SSDs, which extended the queryable time range and lowered storage cost. Real-time streams from Flink and batch data from Spark were merged in ClickHouse to deliver near-real-time metrics. About 6,000 billion rows are ingested daily, including user behavior logs, notes, and tags. Performance optimizations included user-level local joins, materialized views that served 70% of queries (compressing 6,000 billion rows to about 200 billion), and Bloom-filter indexes for fast user-specific lookups.
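To make the Bloom-filter idea concrete, here is a minimal Python sketch of how a per-block filter lets a user-specific lookup skip blocks that cannot contain the user. It is an illustration only, not ClickHouse's index implementation; the class, the `blocks_to_scan` helper, and the sizing parameters are hypothetical names chosen for this example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        # Derive k bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def blocks_to_scan(blocks, user_id):
    """Return the names of blocks whose filter says the user may be present."""
    return [name for name, bf in blocks if bf.might_contain(user_id)]


# Hypothetical usage: two data blocks, each with its own filter.
block_a, block_b = BloomFilter(), BloomFilter()
for uid in ("u1", "u2"):
    block_a.add(uid)
block_b.add("u3")
candidates = blocks_to_scan([("A", block_a), ("B", block_b)], "u1")
```

A block that contains the user is always among the candidates; a block that does not is skipped unless the filter happens to return a false positive, which is the trade-off that keeps the index small.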
3.0 Lakehouse: The third iteration adopted a Lakehouse model: Iceberg stores raw data in the lake, StarRocks provides fast analytical queries over T+1 data, and Flink continues to handle real-time ingestion. Automatic Z-order sorting and intelligent rewrites reduced the data scanned per query from more than 5.5 TB to about 600 GB (roughly a 10× reduction), bringing P90 query latency to around 5 seconds. Overall, resource cost, component count and development effort each fell to roughly one-third of the previous generation.
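Z-order sorting reduces scanned data because rows that are close in several columns at once end up in the same files, so file-level min/max statistics can prune more files. The sketch below shows the core idea for two integer columns by interleaving their bits into a Morton code. It is a simplified illustration, not the rewrite logic used by Iceberg or by Xiaohongshu; the function names are hypothetical.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton code."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key


def sort_rows_by_z_order(rows):
    """Sort (x, y, payload) rows so that nearby (x, y) pairs cluster together."""
    return sorted(rows, key=lambda row: z_order_key(row[0], row[1]))


# Hypothetical usage: four rows on a 2x2 grid.
rows = [(1, 1, "d"), (0, 1, "c"), (1, 0, "b"), (0, 0, "a")]
ordered = sort_rows_by_z_order(rows)
```

Sorting by the interleaved key, rather than by one column and then the other, keeps both columns reasonably clustered, which is why filters on either column can benefit.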
General Incremental Compute: The article defines "general incremental compute" as the fourth generation of data processing, one that delivers high performance and low latency at the same time and so relaxes the classic data impossibility triangle (freshness, cost, efficiency). It introduces the SPOT standards: System (full-stack incremental support), Performance (high performance at low cost), Open (engine-agnostic data sharing), and Tuning (runtime configurability). Using a Dynamic Table abstraction, Xiaohongshu rewrote Spark jobs as incremental pipelines, achieving a 1-2× speedup over Spark batch at comparable cost, and using about one quarter of Flink's resources for real-time aggregation.
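The difference between incremental and batch processing can be sketched in a few lines of Python. The example below keeps aggregation state between refreshes and applies only newly arrived rows, then compares the result with a full rescan. It is a toy model of the idea, not the Dynamic Table implementation; `IncrementalCount` and `full_recompute` are hypothetical names.

```python
from collections import defaultdict


class IncrementalCount:
    """Maintains per-key event counts by applying only newly arrived rows."""

    def __init__(self):
        self.state = defaultdict(int)

    def apply_delta(self, new_rows):
        # Each refresh touches only the delta, not the full history.
        for key in new_rows:
            self.state[key] += 1
        return dict(self.state)


def full_recompute(all_rows):
    """Batch-style baseline: rescan every row on each refresh."""
    counts = defaultdict(int)
    for key in all_rows:
        counts[key] += 1
    return dict(counts)


# Hypothetical usage: two refreshes over the same logical table.
batch1, batch2 = ["a", "b", "a"], ["b", "c"]
inc = IncrementalCount()
inc.apply_delta(batch1)
incremental_result = inc.apply_delta(batch2)
batch_result = full_recompute(batch1 + batch2)
```

Both paths produce the same counts, but the incremental path does work proportional to the size of each delta, while the batch path does work proportional to the whole history on every refresh.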
Additional Optimizations: Json Flatter flattens JSON fields and stores them in columnar form, which halved storage size and improved query speed. An inverted index on experiment-group fields delivered a 10× efficiency gain. The unified architecture also supports AI workloads, with plans for stream-batch convergence, deeper Iceberg-based query acceleration, and AI-driven BI that can surface actionable insights from the lake.
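As a rough illustration of what flattening JSON into columns means, the Python sketch below turns nested records into dotted column names and then pivots them into per-column value lists. It assumes nested objects only (arrays are kept as opaque values) and is not the Json Flatter implementation; both function names are hypothetical.

```python
def flatten_json(obj, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_json(value, path))
        else:
            flat[path] = value
    return flat


def to_columns(records):
    """Pivot flattened records into column -> values (None where missing)."""
    flat_records = [flatten_json(r) for r in records]
    names = sorted({name for r in flat_records for name in r})
    return {name: [r.get(name) for r in flat_records] for name in names}


# Hypothetical usage: two log records with partly overlapping fields.
records = [
    {"user": {"id": 1, "city": "Shanghai"}, "action": "like"},
    {"user": {"id": 2}, "action": "share"},
]
columns = to_columns(records)
```

Once each path is its own column, a query that reads only `user.id` no longer has to parse whole JSON documents, and values of the same type sit together, which tends to compress better.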
The article concludes that Xiaohongshu's data platform now answers analytical queries over trillion-scale data within seconds, supports over 200 internal products, and is positioned to evolve further alongside industry trends toward integrated stream-batch lakehouses and AI-centric data services.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
