
How Xiaohongshu Cut Data Platform Costs by Two‑Thirds with Incremental Computing

This article details Xiaohongshu's journey from a ClickHouse‑based batch analytics stack to a unified lakehouse architecture powered by generic incremental computing, showing how the company reduced architecture complexity, resource consumption and development effort each to roughly one‑third while supporting trillions of daily events with sub‑10‑second query latency.


1. Business and Data Overview

Xiaohongshu is a lifestyle community app with over 350 million monthly active users, generating daily logs on the order of several hundred billion rows. Its business spans community notes, live streaming, e‑commerce and advertising, creating massive real‑time and offline data demands.

Data value is delivered through four categories: (1) analytics for executives and operations, (2) data products for advertisers and merchants, (3) data services such as user‑profile and feature tags for recommendation and search, and (4) AI‑driven insights and automated reporting.

In 2024 the underlying infrastructure migrated from AWS to Alibaba Cloud, a move spanning 500 PB of data and 110,000 jobs and involving 1,500 engineers across 40+ departments, making it a record‑breaking effort.

2. Evolution of the Data Architecture

2.1 Version 1.0 – ClickHouse‑Based Ad‑hoc Analytics

The initial stack ran an offline Spark SQL batch that produced wide tables and loaded them into ClickHouse for instant analytics (a sketch of this pipeline follows the list below). This reduced query latency from minutes to seconds but introduced three major drawbacks:

High cost: ClickHouse clusters require substantial CPU, memory and storage resources.

Scaling difficulty: ClickHouse’s compute‑storage coupling makes data migration painful during rapid growth.

Staleness: Data arrived via a Spark T+1 pipeline, so business users often saw outdated information.
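For concreteness, this T+1 pattern looks roughly like the following PySpark sketch. The table names, columns, and ClickHouse endpoint are illustrative stand‑ins, not Xiaohongshu's actual schema.

```python
# Minimal sketch of a T+1 wide-table batch job (illustrative names only).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("t_plus_1_wide_table").getOrCreate()

# Join yesterday's event log with dimension tables into one denormalized
# "wide" table, the shape ClickHouse serves best.
wide = spark.sql("""
    SELECT e.user_id, e.note_id, e.event_type, e.event_ts,
           u.age_group, u.city, n.category
    FROM   events_yesterday e
    JOIN   dim_users u ON e.user_id = u.user_id
    JOIN   dim_notes n ON e.note_id = n.note_id
""")

# Load the result into ClickHouse over JDBC for second-level ad-hoc queries.
(wide.write
     .format("jdbc")
     .option("url", "jdbc:clickhouse://ch-host:8123/analytics")
     .option("dbtable", "events_wide")
     .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
     .mode("append")
     .save())
```

Because the join runs once a day, anything built on the wide table is up to a day stale, which is exactly the third drawback above.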

2.2 Version 2.0 – Lambda Architecture with Storage‑Compute Separation

To address these pain points, Xiaohongshu built a Lambda architecture on top of open‑source ClickHouse, synchronizing MergeTree files to object storage while keeping hot data on local SSDs. This expanded the time range of queryable data and cut storage cost.

Key enhancements:

Real‑time data from Flink and batch data from Spark were merged in ClickHouse, delivering day‑level to real‑time metrics.

Materialized views, multi‑type joins and index acceleration improved query efficiency.

Approximately 6 trillion rows per day were ingested from Flink into ClickHouse, together with user‑profile and tag data for joint analysis.

Performance optimizations included:

Local joins on user‑level data to satisfy feature‑rich queries.

Materialized views covering 70% of queries, compressing ~6 trillion daily rows to ~200 billion.

Bloom‑filter indexes on user IDs for fast look‑ups (this and the materialized views are sketched in the DDL example below).

Result: sub‑10‑second response on trillion‑scale data, serving over 200 internal products without manual data‑request tickets.
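Both accelerations map onto stock ClickHouse features. A minimal sketch, issued from Python via clickhouse-driver, with invented table and column names:

```python
# Sketch of the two ClickHouse-side accelerations (illustrative names).
from clickhouse_driver import Client

client = Client(host="ch-host")

# Materialized view that pre-aggregates the raw event stream, collapsing
# trillions of daily rows into per-note daily counters at insert time.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_daily_mv
    ENGINE = SummingMergeTree
    ORDER BY (event_date, note_id, event_type)
    AS SELECT
        toDate(event_ts) AS event_date,
        note_id,
        event_type,
        count() AS cnt
    FROM events_wide
    GROUP BY event_date, note_id, event_type
""")

# Bloom-filter data-skipping index on user_id for fast point look-ups.
# (Pre-existing parts would still need ALTER TABLE ... MATERIALIZE INDEX.)
client.execute("""
    ALTER TABLE events_wide
    ADD INDEX idx_user_id user_id TYPE bloom_filter GRANULARITY 4
""")
```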

2.3 Version 3.0 – Lakehouse with Incremental Computing

Version 3.0 introduced a lakehouse that unifies data lake and warehouse, using Flink for ingestion, Iceberg for storage, Spark for batch jobs, and StarRocks for fast queries. The architecture solves three problems of the 2.0 stack:

Dual storage (object storage + ClickHouse) caused cost and consistency issues.

Two compute engines (Flink vs Spark) created semantic gaps and code duplication.

ClickHouse lacked ETL capabilities, making it a dead‑end sink.

Lakehouse benefits:

Iceberg stores raw data; StarRocks provides T+1 analytics on wide tables; real‑time exploration can query Iceberg directly.

Automatic Z‑Order sorting and intelligent re‑sorting reduced scanned data from 5.5 TB to 600 GB (≈ 10×), with 80–90% of queries hitting the Z‑Order layout (a maintenance sketch follows this list).

Compression ratio doubled compared with ClickHouse; P90 query latency fell to ~5 seconds, a 3× speedup over the previous architecture.
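The talk does not publish the exact table layout, but Iceberg exposes this kind of clustering through its built‑in rewrite_data_files Spark procedure. A sketch with an assumed catalog, warehouse path, table, and sort columns:

```python
# Hedged sketch of a Z-Order maintenance pass on an Iceberg table.
# Catalog name, warehouse path, table, and sort columns are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("iceberg_zorder_rewrite")
         .config("spark.sql.extensions",
                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
         .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.lake.type", "hadoop")
         .config("spark.sql.catalog.lake.warehouse", "oss://bucket/warehouse")
         .getOrCreate())

# Compact small files and cluster rows along a Z-Order curve over the
# columns that dominate filter predicates, so scans can skip most files.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table      => 'analytics.events',
        strategy   => 'sort',
        sort_order => 'zorder(user_id, event_date)'
    )
""")
```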

Future plans (2025) include deeper AI integration: logical data views that reduce the cost of understanding the data, and materialized acceleration to serve AI‑driven suggestions.

3. Generic Incremental Computing

3.1 Definition and Motivation

The classic “data triangle” states that freshness, cost and performance cannot be simultaneously optimized. Batch, stream and interactive processing each favor two of the three. Incremental computing aims to achieve all three by providing a unified, high‑performance, low‑latency processing model – the fourth generation after batch, stream and interactive.
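A toy Python example makes the distinction concrete: a batch job pays for all of history on every run, while an incremental update folds only the new delta into persisted state and yields the same answer.

```python
# Toy contrast between full recomputation and incremental maintenance.
from collections import Counter

def full_recompute(all_events):
    """Batch style: cost grows with all of history, every run."""
    return Counter(e["note_id"] for e in all_events)

def incremental_update(state, new_events):
    """Incremental style: cost proportional to the delta, same result."""
    state.update(e["note_id"] for e in new_events)
    return state

state = Counter()                                  # persisted aggregate
delta = [{"note_id": "n1"}, {"note_id": "n2"}, {"note_id": "n1"}]
print(incremental_update(state, delta))            # Counter({'n1': 2, 'n2': 1})
```

Real engines apply this idea per operator (joins, aggregations, deduplication), which is why the S criterion below demands that every operator work incrementally.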

3.2 SPOT Standards

Four criteria define a robust incremental system (SPOT):

S – Full‑data support: every operator must work incrementally, avoiding mixed‑mode pipelines.

P – High performance at low cost.

O – Openness: the system must expose data to multiple engines (e.g., AI, analytics).

T – Tunability: business‑level configuration should adjust behavior without code changes.

3.3 CloudTech Practice on Xiaohongshu

Applying the SPOT principles, CloudTech delivered a solution that reduced resource consumption, component count and development effort each to roughly one‑third:

Resource cost dropped to 1/3 of the previous baseline.

Only one storage layer and one compute engine are needed.

Developers write a single pipeline that serves both real‑time and batch workloads.

4. Incremental Computing in Production

Validation on Xiaohongshu’s core pipelines showed:

Rewriting Spark jobs as incremental pipelines incurred minimal effort; most scripts could be reused.

Data correctness was verified against existing outputs.

Freshness could be tuned from T+1 down to a 5‑minute window, delivering a 1–2× speedup over pure Spark (see the open‑source analogy sketched after this list).

For full‑order tables, incremental cost matched Spark; for real‑time aggregation, resource cost fell to roughly a quarter of the equivalent Flink job.
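CloudTech's engine is proprietary, so its actual API cannot be shown here. As a rough open‑source analogy only, Spark Structured Streaming captures the same "one pipeline definition, tunable freshness" idea through its trigger setting; the paths and schema below are invented for illustration.

```python
# Open-source analogy (not the vendor's interface): one pipeline whose
# freshness is a configuration knob, not a rewrite.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tunable_freshness").getOrCreate()

events = (spark.readStream
          .format("parquet")
          .schema("user_id STRING, note_id STRING, event_ts TIMESTAMP")
          .load("oss://bucket/raw/events"))

counts = (events
          .withWatermark("event_ts", "10 minutes")
          .groupBy(F.window("event_ts", "5 minutes"), "note_id")
          .count())

# trigger(availableNow=True) would drain the backlog once, batch-style;
# a 5-minute processing-time trigger keeps results ~5 minutes fresh.
query = (counts.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "oss://bucket/serving/note_counts")
         .option("checkpointLocation", "oss://bucket/chk/note_counts")
         .trigger(processingTime="5 minutes")
         .start())
query.awaitTermination()
```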

Additional optimizations:

JSON flattening turned string‑encoded JSON into columnar format, halving storage and speeding up queries (sketched after this list).

An inverted index combined with a Bloom filter on experiment groups gave a 10× boost to data‑skipping queries.

Unified pipeline reduced latency for algorithm teams, enabling rapid A/B feedback without separate real‑time stacks.
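A minimal PySpark sketch of the JSON‑flattening step. The payload schema and table names are assumptions, and the write assumes an Iceberg catalog named lake as in the earlier sketch.

```python
# Parse string-encoded JSON once at ingestion and persist typed columns.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("json_flatten").getOrCreate()

# Assumed payload layout; real pipelines would derive this from the logs.
payload_schema = StructType([
    StructField("exp_group", StringType()),
    StructField("duration_ms", LongType()),
    StructField("page", StringType()),
])

raw = spark.table("raw_events")  # assumed: has a string column `payload`

flat = (raw
        .withColumn("p", F.from_json("payload", payload_schema))
        .select("user_id", "event_ts",
                F.col("p.exp_group").alias("exp_group"),
                F.col("p.duration_ms").alias("duration_ms"),
                F.col("p.page").alias("page")))

# Typed, columnar output compresses far better than repeated JSON strings
# and lets the engine prune unread fields at scan time.
flat.writeTo("lake.analytics.events_flat").append()
```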

Overall, the incremental solution now powers community, search, commerce and other scenarios, delivering the same throughput as an 1,800‑core Spark T+1 setup with only a fraction of the resources.

5. Application Summary and Outlook

After the lakehouse rollout, 70 % of front‑line business users can access data directly, and the curated dataset catalog shrank to ~300 core assets, making AI‑driven analysis feasible on a few hundred petabytes rather than the entire data lake.

Key future directions align with industry trends:

Unified stream‑batch execution to simplify development.

Further performance gains on Iceberg (e.g., faster Z‑Order rewrites).

AI‑centric data services that automatically generate logical views and materialized accelerations for conversational interfaces.

In summary, Xiaohongshu’s incremental‑compute‑driven lakehouse demonstrates how a large‑scale consumer app can dramatically lower cost, simplify architecture and improve data freshness, providing a practical blueprint for other data‑intensive enterprises.

Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
