How Xiaohongshu Cut Data Architecture Costs by Two‑Thirds with Incremental Computing
This article traces the evolution of Xiaohongshu's data platform from a simple ClickHouse‑based ad‑hoc system, through a Lambda‑style architecture, to a lakehouse solution. It highlights how adopting a new incremental computing model cut architectural complexity, resource consumption, and development effort each to roughly one‑third of previous levels, while delivering sub‑second query performance on petabyte‑scale data.
Background
Xiaohongshu is a lifestyle community with over 350 million monthly active users, generating billions of logs daily. The massive user base and diverse business scenarios (social feeds, live streaming, e‑commerce) create demanding real‑time and offline data requirements, prompting a continuous evolution of its data platform.
Evolution of Xiaohongshu Data Architecture
1.0 – ClickHouse‑based ad‑hoc analysis
Data processed in Spark with a T+1 delay, then loaded into ClickHouse for interactive queries.
High infrastructure cost due to ClickHouse's CPU‑intensive cluster.
Scaling difficulty because ClickHouse combines storage and compute.
Latency issues: data freshness suffered because of the multi‑hop Spark‑to‑ClickHouse pipeline.
2.0 – Lambda architecture with Flink + Spark
To address 1.0’s shortcomings, Xiaohongshu introduced a Lambda architecture that merges real‑time streams from Flink and batch data from Spark into ClickHouse. ClickHouse’s MergeTree files were synchronized to object storage and local SSDs, expanding the queryable time range and lowering storage cost. Materialized views, local joins per user, and Bloom‑filter indexes were added to accelerate common queries on the 6 trillion‑row daily ingest.
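To see why a Bloom‑filter index accelerates point look‑ups on user IDs, consider a minimal, self‑contained Python sketch (the class, hash scheme, and parameters below are illustrative assumptions, not ClickHouse's actual implementation). A negative answer from the filter lets the engine skip reading a data part entirely; only a positive answer forces a real scan.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-bit array."""

    def __init__(self, m_bits=1 << 20, k=5):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, key: str):
        # Derive k independent bit positions from the key.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent (the data part can be skipped);
        # True means the part must actually be read.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(key))

bf = BloomFilter()
bf.add("user_42")
assert bf.might_contain("user_42")  # inserted keys always hit
```

The trade‑off is a small false‑positive rate (wasted reads) in exchange for never missing a key, which is why Bloom filters suit high‑cardinality columns like user IDs.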
3.0 – Lakehouse (Iceberg + StarRocks)
Recognizing the need for a unified storage‑compute layer, Xiaohongshu adopted a lakehouse built on Iceberg for data lake storage, Flink for streaming ingestion, Spark for batch jobs, and StarRocks for fast analytical queries. The lakehouse eliminated duplicate storage, unified metadata with Gravitino and OpenLineage, and introduced automatic Z‑Order sorting and intelligent index tuning, shrinking scanned data from 5.5 TB to ~600 GB per query (≈10× improvement) and achieving sub‑5‑second P90 latency on petabyte‑scale workloads.
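The Z‑Order sorting mentioned above works by interleaving the bits of several column values into a single sort key, so rows that are close in any of the chosen dimensions tend to land in the same files and can be pruned together. A minimal sketch of the interleaving (the function name and 16‑bit width are illustrative assumptions, not Iceberg's internal code):

```python
def z_interleave(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two column values into one Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bit positions from x
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions from y
    return z

# Sorting rows by the interleaved key clusters them along both columns at once.
rows = [(3, 7), (0, 0), (5, 2), (1, 6)]
rows.sort(key=lambda r: z_interleave(*r))
# rows -> [(0, 0), (5, 2), (1, 6), (3, 7)]
```

Rewriting files in this order is what lets min/max statistics on each file prune effectively for predicates on either column.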
General Incremental Computing
Incremental computing is presented as the fourth generation of data processing, aiming to satisfy the “data triangle” (freshness, cost, performance) by providing a single, unified data expression (SQL or Python) that can be tuned via configuration without code changes. The SPOT standards define its requirements:
S: Full‑stack support for all operators in both batch and incremental modes.
P: High performance with low cost.
O: Open architecture allowing multiple engines to consume the same data.
T: Flexible, configuration‑driven tuning.
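The core idea behind the incremental model — fold in only the delta since the last refresh instead of re‑scanning full history on every run — can be sketched in a few lines of Python (the class below is a hypothetical illustration, not Xiaohongshu's engine):

```python
from collections import defaultdict

class IncrementalCount:
    """Maintain per-key counts by applying only new (delta) rows,
    rather than recomputing over the full history on each refresh."""

    def __init__(self):
        self.counts = defaultdict(int)

    def apply_delta(self, rows):
        # Cost is proportional to the delta size, not the total history.
        for key in rows:
            self.counts[key] += 1

agg = IncrementalCount()
agg.apply_delta(["u1", "u2", "u1"])  # first micro-batch
agg.apply_delta(["u2"])              # later refresh touches only new rows
# agg.counts -> {"u1": 2, "u2": 2}
```

In the SPOT framing, the same aggregation expression would run unchanged in batch mode (replay all rows) or incremental mode (apply deltas), with the choice made by configuration rather than by rewriting the pipeline.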
In Xiaohongshu’s pilot, the incremental model reduced resource usage to one‑third, cut the number of platform components by two‑thirds, and lowered development effort to a single pipeline.
Performance Optimizations
Local joins per user to keep feature‑rich queries on the same node.
Materialized views covering 70 % of queries, compressing 6 trillion daily rows to ~200 billion.
Bloom‑filter indexes on user IDs for rapid look‑ups.
Automatic Z‑Order rewriting of Iceberg files based on query patterns, achieving 80‑90 % Z‑Order hit rate.
JSON flattening to store semi‑structured data in columnar form, halving storage size and boosting query speed.
Inverted index (data skipping) for experiment‑group data, delivering a 10× query speedup.
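The JSON‑flattening optimization turns rows of semi‑structured data into per‑key column arrays, so each key can be compressed and scanned independently instead of parsing whole JSON blobs per row. A minimal sketch of the transformation (the function name and sample fields are illustrative assumptions):

```python
import json

def flatten_to_columns(json_rows):
    """Flatten semi-structured JSON rows into per-key column arrays;
    keys missing from a row become None, mimicking a column-store layout."""
    rows = [json.loads(r) for r in json_rows]
    keys = sorted({k for row in rows for k in row})
    return {k: [row.get(k) for row in rows] for k in keys}

cols = flatten_to_columns([
    '{"uid": 1, "event": "view"}',
    '{"uid": 2, "event": "click", "item": "n42"}',
])
# cols["item"] -> [None, "n42"]
```

Once keys are materialized as columns, a query touching only `uid` and `event` never pays the cost of parsing or reading the other fields, which is where the storage and speed gains come from.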
Future Outlook
By 2025 Xiaohongshu plans to deepen AI integration, turning the lakehouse into an AI‑ready knowledge base with logical views, unified metadata, and materialized acceleration. The goal is to provide AI‑driven insights across 300 curated datasets, ensuring high data freshness while keeping costs and latency within acceptable bounds.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.