How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era
Xiaohongshu, with over 350 million monthly active users and daily logs in the hundreds of billions of events, migrated 500 PB of data to Alibaba Cloud and iterated its data platform through four architecture generations: ClickHouse-based ad-hoc analysis, Lambda, Lakehouse, and a unified incremental compute model. The latest generation cut resource, development, and storage costs to one-third while delivering sub-10-second query latency at petabyte scale.
Xiaohongshu is a lifestyle community app with more than 350 million monthly active users. Its business spans community notes, live streaming, e-commerce and advertising, all driven by massive data pipelines that generate several hundred billion log events per day.
The company classifies data value into four categories: (1) analytical reports for executives, (2) data products for advertisers and merchants, (3) data services such as user profiles for recommendation and search, and (4) AI‑driven insights that generate reports and suggestions.
In 2024 the underlying infrastructure was migrated from AWS to Alibaba Cloud. The migration moved roughly 500 PB of data and 110 000 tasks and involved 1 500 participants across more than 40 departments, which the team describes as an industry record for migration complexity.
1.0 Architecture : A simple ClickHouse-based ad-hoc analysis layer in which the offline warehouse produced wide tables that were then loaded into ClickHouse. This reduced query latency from minutes (Spark SQL) to seconds (ClickHouse) but introduced three major drawbacks: high cluster cost, difficult scaling because ClickHouse couples compute and storage, and stale data, since tables only became queryable after the T+1 Spark batch had finished.
2.0 Architecture : A Lambda architecture that decoupled storage and compute. ClickHouse's MergeTree files were synchronized to object storage and local SSDs, extending the time range of queryable data and lowering storage cost. Real-time streams from Flink and batch data from Spark were merged in ClickHouse, enabling both day-level and real-time metrics. About 600 billion rows are ingested per day, including user behavior logs, note profiles and tags. Performance optimizations included:
Per-user local joins, which keep each user's data on one node to support feature-rich analysis.
Materialized views covering 70 % of queries, compressing the 600 billion rows to roughly 20 billion.
Bloom‑filter indexes on user IDs for fast look‑ups.
These changes let more than 200 internal products run analyses that return in seconds, without filing ad-hoc data requests.
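The Bloom-filter index on user IDs listed above can be illustrated with a small sketch. This is a generic, self-contained Bloom filter used as a block-skipping index, not ClickHouse's implementation; the class `BloomFilter`, the helpers `build_block_filters` and `lookup`, and the sizes `num_bits` and `num_hashes` are all hypothetical names and values chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_block_filters(blocks):
    """Build one filter per data block from the user IDs stored in it."""
    filters = []
    for rows in blocks:
        bf = BloomFilter()
        for user_id, _event in rows:
            bf.add(user_id)
        filters.append(bf)
    return filters


def lookup(blocks, filters, user_id):
    """Return the matching rows and the number of blocks actually scanned."""
    hits, scanned = [], 0
    for rows, bf in zip(blocks, filters):
        if not bf.might_contain(user_id):
            continue  # definitely absent: skip the block without reading it
        scanned += 1
        hits.extend(row for row in rows if row[0] == user_id)
    return hits, scanned
```

Calling `lookup(blocks, build_block_filters(blocks), "u1")` reads only the blocks whose filter admits `"u1"`. Because a Bloom filter never produces false negatives, a skipped block cannot contain the user; an occasional false positive only costs one unnecessary block read.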
3.0 Lakehouse : To address the dual-storage and dual-compute issues of 2.0, Xiaohongshu adopted a Lakehouse built on Iceberg (data lake), Flink (real-time processing), Spark (batch processing) and StarRocks (query engine). Warehouse data (DWS wide tables) is stored in Iceberg and queried directly through StarRocks, while raw ODS data remains available for exploratory analysis. Automatic Z-Order sorting and intelligent re-sorting cut the data scanned for a typical query from 5.5 TB to about 600 GB, close to a tenfold reduction, bringing P90 query latency to about 5 seconds. Overall query performance improved threefold compared with the pure ClickHouse solution.
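To see why Z-Order sorting reduces the data scanned, consider the minimal sketch below. It is a generic illustration of Z-order (Morton) clustering combined with per-file min/max statistics, not Xiaohongshu's or Iceberg's implementation; the function names, the two integer columns `x` and `y`, and the file size are assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z-order) code: interleave the low `bits` bits of x and y."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def write_files(rows, sort_key, rows_per_file):
    """Sort rows, cut them into files, and record per-file min/max per column."""
    ordered = sorted(rows, key=sort_key)
    files = []
    for start in range(0, len(ordered), rows_per_file):
        chunk = ordered[start:start + rows_per_file]
        xs = [r[0] for r in chunk]
        ys = [r[1] for r in chunk]
        files.append({"x": (min(xs), max(xs)), "y": (min(ys), max(ys))})
    return files


def files_scanned(files, column, value):
    """Count files whose min/max range for `column` could contain `value`."""
    return sum(1 for f in files if f[column][0] <= value <= f[column][1])


# 256 rows on a 16 x 16 grid, written as 16 files of 16 rows each.
rows = [(x, y) for x in range(16) for y in range(16)]
linear = write_files(rows, lambda r: r, 16)
zorder = write_files(rows, lambda r: interleave_bits(r[0], r[1]), 16)
```

With a plain sort on `(x, y)`, a filter on `x = 5` touches 1 file but a filter on `y = 5` touches all 16. With the Z-order layout each file covers a 4 x 4 square, so either filter touches 4 files. The layout trades a little selectivity on the leading column for good pruning on every sorted column, which is the effect the scan reduction above relies on.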
Despite these gains, the Lakehouse still faced challenges: two separate storage layers (object storage and ClickHouse files), two compute frameworks (Flink vs Spark) with semantic differences, and ClickHouse’s lack of native ETL capabilities.
To close the gap, the team defined the fourth generation, a unified incremental compute model that aims to balance the classic "impossible triangle" of data platforms: freshness, cost, and performance. The model follows four SPOT standards:
S – Full‑stack support for incremental operators.
P – High performance at low cost.
O – Open architecture allowing multiple engines to consume the same data.
T – Tunable configuration without code changes.
In practice, incremental compute reduced resource consumption to one‑third, cut component count to one‑third, and lowered development effort to one‑third. Benchmarks showed that incremental tables achieved 1–2× speed‑up over Spark T+1 jobs, matched Spark’s cost for full‑order tables, and required only ~¼ of the resources of traditional Flink jobs for real‑time aggregation.
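The core idea behind these savings can be sketched in a few lines: a batch job rescans all history on every run, whereas an incremental operator keeps the previous result as state and processes only the rows that arrived since the last refresh. The example below is an illustrative assumption, not the engine described in the talk; `full_recompute`, `IncrementalSum`, and the `(key, amount)` event shape are hypothetical.

```python
from collections import defaultdict


def full_recompute(all_events):
    """Batch style: rescan every event to rebuild the per-key totals."""
    totals = defaultdict(int)
    for key, amount in all_events:
        totals[key] += amount
    return dict(totals)


class IncrementalSum:
    """Incremental style: keep the result as state and fold in only new rows."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.rows_processed = 0  # total work done across all refreshes

    def apply_delta(self, new_events):
        for key, amount in new_events:
            self.totals[key] += amount
            self.rows_processed += 1
        return dict(self.totals)


history = [("note_a", 1), ("note_b", 2), ("note_a", 3)]
delta = [("note_b", 5), ("note_c", 1)]

inc = IncrementalSum()
inc.apply_delta(history)       # initial load
result = inc.apply_delta(delta)  # later refresh touches only the 2 new rows
```

Here `result` equals `full_recompute(history + delta)`, but the refresh processed 2 rows instead of 5. The same operator can be refreshed every minute or once a day, so one definition can serve both the real-time and the T+1 case by changing only the refresh interval.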
Additional optimizations included:
Json Flatter, which flattens JSON fields into separate columns, doubling both compression ratio and query speed for semi-structured data.
An inverted index with data skipping, delivering a tenfold speed-up for queries over thousands of user-level experiment groups.
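The JSON flattening step listed above can be pictured as follows. This is a minimal sketch of the general technique, not the Json Flatter component itself; the function names and the dotted column-naming scheme are assumptions for illustration.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted column names."""
    columns = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            columns.update(flatten(value, name))
        else:
            columns[name] = value  # scalars and lists are kept as-is
    return columns


def to_columns(json_lines):
    """Turn JSON strings into a column store: one list of values per path."""
    rows = [flatten(json.loads(line)) for line in json_lines]
    names = sorted({name for row in rows for name in row})
    # Missing paths become None so every column has one entry per row.
    return {name: [row.get(name) for row in rows] for name in names}


lines = [
    '{"user": {"id": 1, "city": "SH"}, "action": "like"}',
    '{"user": {"id": 2}, "action": "view"}',
]
cols = to_columns(lines)
```

After flattening, `cols` holds the columns `action`, `user.city` and `user.id`. Storing each path as its own column places similar values next to each other, which generally compresses better and lets a query read only the paths it needs instead of parsing whole JSON strings.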
Looking ahead, Xiaohongshu plans to deepen AI integration: a unified knowledge base combining structured tables, vector and scalar indexes, and AI functions; logical view consolidation to reduce data‑set count to ~300 core datasets; and materialized acceleration to keep query latency within seconds for AI‑driven BI.
The presentation concludes with a call to explore the incremental compute model further and provides a link to the full slide deck.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
