Transforming Real‑Time Analytics: Incremental Computing with Lakehouse Architecture
This article examines how Xiaohongshu replaced its costly Lambda architecture with a real‑time lakehouse built on Iceberg, Paimon, Spark, and StarRocks, achieving minute‑level latency, higher data quality, lower resource consumption, and dramatically faster query performance.
Hello, today I'm sharing a recent Xiaohongshu article about real‑time lakehouse incremental computing (original link provided). I studied the article and recorded the key insights below for reference.
Background
Xiaohongshu is a typical UGC content platform, and many companies face similar scenarios. Its main workload processes user behavior logs (views, likes, saves) across recommendation, search, and e‑commerce, generating billions of incremental records daily. Algorithms require minute‑level tuning and full‑volume calculations, and real‑time and offline metrics must agree to within 1%. The two core business demands are therefore minute‑level timeliness and accurate full‑volume computation.
Current Problems
Before adopting a lakehouse and incremental computing, Xiaohongshu used the Lambda architecture (separate offline and real‑time pipelines) and ran into its classic issues:
High cost: Flink jobs run as long‑lived tasks consuming over 5,000 cores; large state (e.g., wide windows) creates memory pressure, and costs scale linearly with traffic.
Complexity of two pipelines: The real‑time chain (Flink + Redis + ClickHouse) and the offline chain (Spark + Hive) are logically separate, making dimension‑table updates and metric consistency difficult; the KV storage becomes a bottleneck.
High development cost and risk: Very large windows (e.g., 7‑day) cause state explosion, forcing smaller windows that hurt data quality; frequent schema changes lengthen development cycles.
Solution and Technical Details
The final solution adopted:
Iceberg + Paimon for data storage and consumption.
Spark jobs scheduled at minute granularity to produce minute‑level summary data.
StarRocks to read lake data for accelerated queries.
Architecture diagram: (image omitted; see the original article.)
Technical details:
Minute‑level DWS design: The model layer trades granularity for volume: raw logs are aggregated into a 5‑minute + user granularity DWS layer, and minute‑level jobs join in user dimension tables, drastically reducing overall data volume (see the first sketch after this list).
Real‑time dimension table: Kafka streams update the user dimension table at minute granularity, while offline jobs write the daily updates; timestamps in the tables drive on‑demand updates (second sketch below).
Schema design: Algorithm metrics are stored in a JSON column, so users can add or remove metrics themselves without schema evolution, improving development efficiency (third sketch below).
Dimension table design: The experiment dimension table stores exp_ids as an array column; an inverted index on exp_ids greatly speeds up experiment‑filtered queries (fourth sketch below).
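To make the DWS design concrete, here is a minimal PySpark sketch of the 5‑minute + user rollup. The table and column names (lake.ods_behavior_log, event_time, action, lake.dws_user_behavior_5min) are my own placeholders, not from the article:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dws_5min_user").getOrCreate()

# Hypothetical source: raw behavior logs landed in an Iceberg table.
logs = spark.read.table("lake.ods_behavior_log")

# Roll detailed events up to 5-minute + user granularity.
dws = (
    logs.groupBy(
        F.window("event_time", "5 minutes").alias("win"),
        "user_id",
    )
    .agg(
        F.count(F.when(F.col("action") == "view", 1)).alias("view_cnt"),
        F.count(F.when(F.col("action") == "like", 1)).alias("like_cnt"),
        F.count(F.when(F.col("action") == "save", 1)).alias("save_cnt"),
    )
    .select(F.col("win.start").alias("window_start"), "user_id",
            "view_cnt", "like_cnt", "save_cnt")
)

# A minute-level scheduled job appends the latest window into the DWS table.
dws.writeTo("lake.dws_user_behavior_5min").append()
```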
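For the minute‑level dimension‑table refresh, here is a sketch of a scheduled Spark upsert into a Paimon table, assuming a configured Paimon catalog and Paimon's documented MERGE INTO support in Spark SQL; the staging table (drained from Kafka by an upstream step) and all names are assumptions:

```python
from pyspark.sql import SparkSession

# Assumes a Paimon catalog named `paimon` is configured for this session.
spark = SparkSession.builder.appName("dim_user_update").getOrCreate()

# Hypothetical staging table: the latest minute of user-attribute
# changes drained from Kafka by an upstream step.
spark.read.table("paimon.stg.user_dim_updates") \
    .createOrReplaceTempView("user_dim_updates")

# Minute-level upsert; the update_time column kept in the dimension
# table is what lets downstream jobs pull changes on demand.
spark.sql("""
    MERGE INTO paimon.dim.user_dim AS t
    USING user_dim_updates AS s
    ON t.user_id = s.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```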
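Reading metrics out of the JSON column needs no table change; a new metric is just a new JSON key. A small illustrative sketch (table and key names are mine):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("json_metrics").getOrCreate()

# Hypothetical DWS table carrying a `metrics` JSON string column.
dws = spark.read.table("lake.dws_user_behavior_5min")

# Extract one metric; adding or dropping metrics never requires ALTER TABLE.
with_ctr = dws.withColumn(
    "ctr_7d", F.get_json_object("metrics", "$.ctr_7d").cast("double")
)
with_ctr.select("window_start", "user_id", "ctr_7d").show()
```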
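On the query side, StarRocks speaks the MySQL protocol, so an experiment‑filtered lookup against the array column might look like the sketch below. Connection details, the table name, and the experiment id are placeholders, and the exact inverted‑index DDL depends on the StarRocks version; only the array_contains query shape is shown:

```python
import pymysql  # StarRocks is MySQL-protocol compatible

# Hypothetical connection to the StarRocks frontend.
conn = pymysql.connect(host="starrocks-fe", port=9030,
                       user="analyst", password="***", database="dw")

with conn.cursor() as cur:
    # Filter by experiment membership on the exp_ids array column;
    # per the article, an inverted index on exp_ids accelerates this.
    cur.execute(
        "SELECT user_id, view_cnt FROM dws_user_behavior_5min "
        "WHERE array_contains(exp_ids, %s)",
        (12345,),
    )
    rows = cur.fetchall()
conn.close()
```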
Benefits
Timeliness and data quality: Achieved minute‑level latency with dynamic computation periods, and reduced the discrepancy between real‑time and offline results to less than 1%.
Computation and iteration cost: The near‑real‑time pipeline consumes only about 36% of the original resources; pre‑aggregation compresses billions of logs to hundreds of millions of rows, cutting storage cost by roughly 90%; JSON semi‑structured modeling avoids table alterations, boosting development efficiency by over 50%.
Query performance: The minute‑level DWS layer (5‑minute + user aggregation) turns detail‑level queries into queries over pre‑aggregated results, reducing P90 latency from minutes to under 10 seconds.
That concludes the sharing; I hope it helps.
This article was distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
