How Xiaohongshu Evolved Its Data Architecture for the AI and Big Data Era
Xiaohongshu evolved its data platform from a simple ClickHouse-based ad-hoc analysis setup to a Lambda-style architecture and finally to a lakehouse with generic incremental compute. The result cut architectural complexity, resource consumption, and development effort to about one-third of their previous levels while delivering second-level queries over trillions of rows.
Xiaohongshu is a lifestyle community app with over 350 million monthly active users, generating daily logs of several hundred billion events that drive its community, e‑commerce and commercial services.
The original 1.0 data architecture relied on ClickHouse for ad‑hoc analysis. While ClickHouse reduced query latency from minutes to seconds compared to Spark SQL, it suffered from high resource cost, difficult scaling, and data staleness caused by T+1 batch processing.
High cost: ClickHouse clusters require substantial CPU and memory.
Scaling challenges: storage‑compute integration makes data migration hard during rapid growth.
Staleness: data processed in Spark and then loaded into ClickHouse incurs delay.
To address these issues, the 2.0 architecture introduced a Lambda design with storage separated from compute. ClickHouse MergeTree files were synchronized to object storage and local SSDs, extending the queryable time range and lowering storage cost. Real-time data from Flink and batch data from Spark were merged in ClickHouse, so analysis could span day-level and real-time data. About 6,000 billion user-behavior rows are streamed from Flink into ClickHouse each day, along with user-profile and tag data for joint analysis.
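The way a Lambda design combines a batch view with fresher streaming data can be pictured with a minimal Python sketch. The function name `serve_counts`, the sample data, and the cutoff rule are illustrative assumptions, not Xiaohongshu's actual implementation: the batch view holds precomputed per-user totals up to a cutoff day, and only streaming events after that day are added on top.

```python
from collections import Counter
from datetime import date

def serve_counts(batch_counts, realtime_events, batch_cutoff):
    """Combine a precomputed batch view with events newer than the cutoff."""
    merged = Counter(batch_counts)
    for event_day, user_id in realtime_events:
        # Events on or before the cutoff are already in the batch view,
        # so counting them again would double-count.
        if event_day > batch_cutoff:
            merged[user_id] += 1
    return dict(merged)

# Hypothetical data: batch totals through Jan 1, streaming events after that.
batch_counts = {"u1": 10, "u2": 4}
realtime_events = [
    (date(2024, 1, 2), "u1"),
    (date(2024, 1, 1), "u2"),  # already covered by the batch view
    (date(2024, 1, 2), "u3"),
]
result = serve_counts(batch_counts, realtime_events, date(2024, 1, 1))
```

The cutoff check is the part that tends to cause the inconsistency risk of a two-path design: both paths must agree on exactly where batch coverage ends.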
Performance optimizations in 2.0 included:
Local joins on user‑level data.
Materialized views covering 70% of queries, compressing 6,000 billion daily rows to ~200 billion.
Bloom‑filter indexes on user IDs for fast look‑ups.
These changes delivered sub-10-second query latency on datasets exceeding 10,000 trillion rows, supporting over 200 internal products without requiring data-request tickets.
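To illustrate why a Bloom-filter index on user IDs speeds up look-ups, here is a small Python sketch of the general technique. It is not ClickHouse's implementation; the `BloomFilter` class, its sizes, and the `blocks_to_scan` helper are assumptions made for illustration. Each data block keeps a compact filter of the user IDs it contains, and a look-up skips every block whose filter rules the ID out.

```python
import hashlib

class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

def blocks_to_scan(blocks, user_id):
    """Return names of blocks that cannot be ruled out for this user ID."""
    return [name for name, bloom in blocks if bloom.might_contain(user_id)]

# Hypothetical blocks, each holding 100 user IDs.
block_a, block_b = BloomFilter(), BloomFilter()
for i in range(100):
    block_a.add(f"u{i}")
    block_b.add(f"u{i + 100}")
candidates = blocks_to_scan([("block_a", block_a), ("block_b", block_b)], "u42")
```

A block that really contains the ID is always returned; a block that does not is usually skipped, with a small false-positive rate that depends on the filter size.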
However, the 2.0 architecture still maintained two storage layers (object storage and ClickHouse files) and two compute engines (Flink and Spark), leading to higher storage cost, data inconsistency risk, and duplicated ETL logic.
The 3.0 lakehouse architecture unified storage and compute by adopting Iceberg as the lake table format, StarRocks for fast T+1 queries, and Flink for ingestion. Automatic Z-Order sorting and intelligent re-sorting reduced the data scanned per query from 5.5 TB to ~600 GB (roughly a 9× reduction), with 80-90% of queries hitting Z-Order-sorted files. Overall query performance improved threefold compared to the ClickHouse-only setup, with P90 latency around 5 seconds.
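Why Z-Order sorting cuts scanned data can be shown with a toy Python example. This is a sketch of the general idea, not Xiaohongshu's or Iceberg's code; the grid size, file size, and helper names are assumptions. Rows are sorted either by one column or by an interleaved-bit Z-value, split into fixed-size files, and then pruned using each file's per-column min/max range.

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x and y into a single Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def split_into_files(rows, rows_per_file):
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]

def files_scanned(files, column, value):
    """Count files whose min/max range for `column` could contain `value`."""
    count = 0
    for file_rows in files:
        values = [row[column] for row in file_rows]
        if min(values) <= value <= max(values):
            count += 1
    return count

# A 16 x 16 grid of (x, y) rows, stored 16 rows per file.
rows = [(x, y) for x in range(16) for y in range(16)]
linear = split_into_files(sorted(rows), 16)  # sorted by x, then y
zorder = split_into_files(sorted(rows, key=lambda r: z_value(*r)), 16)

linear_x, linear_y = files_scanned(linear, 0, 5), files_scanned(linear, 1, 5)
zorder_x, zorder_y = files_scanned(zorder, 0, 5), files_scanned(zorder, 1, 5)
```

With a plain sort on `x`, a filter on `x` touches 1 of 16 files but a filter on `y` touches all 16. With Z-Order, either filter touches 4 files, so pruning works for both columns at a modest cost to the best case.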
In parallel, Xiaohongshu defined “generic incremental compute” and the SPOT standards (System, Performance, Open, Tunable). The system must support full‑stack incremental operators, deliver high performance at low cost, be open to multiple engines, and allow configuration‑driven tuning without code changes.
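The core idea of incremental compute, updating a result from changes only rather than recomputing it from all data, can be sketched in a few lines of Python. The `IncrementalSum` class and the `(key, amount, sign)` change format are illustrative assumptions and do not reflect the actual operators of any specific engine.

```python
from collections import defaultdict

class IncrementalSum:
    """Maintains SUM(amount) GROUP BY key from change batches only."""

    def __init__(self):
        self.sums = defaultdict(int)
        self.counts = defaultdict(int)

    def apply(self, delta):
        # Each change is (key, amount, sign): +1 inserts a row, -1 retracts one.
        for key, amount, sign in delta:
            self.sums[key] += sign * amount
            self.counts[key] += sign
            if self.counts[key] == 0:
                # No rows remain for this key, so drop it from the result.
                del self.sums[key], self.counts[key]
        return dict(self.sums)

def full_recompute(rows):
    """Batch-style baseline: rescan every row to rebuild the aggregate."""
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)

view = IncrementalSum()
first = view.apply([("a", 5, +1), ("b", 3, +1)])
latest = view.apply([("a", 2, +1), ("b", 3, -1)])  # touches two changes only
baseline = full_recompute([("a", 5), ("a", 2)])     # rescans all live rows
```

Both paths produce the same result, but the incremental path does work proportional to the size of the change batch, which is where the resource savings of this approach come from.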
Applying these standards, Cloudware’s implementation reduced resource consumption to one‑third, component count to one‑third, and development effort to one‑third. Incremental compute pipelines achieved 1‑2× speedup over Spark batch while matching cost, and real‑time aggregation workloads used only ~¼ of the resources of traditional Flink jobs.
Additional optimizations such as JsonFlatter for columnar JSON storage and inverted-index-based data skipping yielded a 10× boost in specific query patterns. The compression ratio on lake storage was twice that of ClickHouse, and overall query throughput increased threefold.
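The source does not describe how JsonFlatter works internally. As a generic illustration of columnar JSON storage, the Python sketch below flattens nested JSON rows into one list per dotted path, so a query that needs a single field can read one column instead of parsing every document. The function names and sample rows are assumptions.

```python
import json

def flatten(obj, prefix=""):
    """Turn nested dicts into a flat dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def to_columns(json_rows):
    """Convert JSON strings into column lists; missing fields become None."""
    records = [flatten(json.loads(row)) for row in json_rows]
    names = sorted({name for record in records for name in record})
    return {name: [record.get(name) for record in records] for name in names}

# Hypothetical event rows.
rows = [
    '{"user": {"id": 1, "city": "Shanghai"}, "action": "like"}',
    '{"user": {"id": 2}, "action": "share"}',
]
columns = to_columns(rows)
```

Storing each path as its own column also tends to compress better, because values of the same type and meaning sit next to each other.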
Looking ahead, Xiaohongshu plans to further integrate AI‑driven analytics on top of the lakehouse, refine Z‑Order and indexing strategies, and explore faster query paths under Iceberg while maintaining the unified SPOT‑compliant incremental compute framework.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
