Exploring Real‑Time Data Lake Practices at Xiaohongshu Using Apache Iceberg
This article details Xiaohongshu's data platform engineering, describing how Apache Iceberg is leveraged for real‑time data lake ingestion, CDC pipelines, multi‑cloud storage, small‑file mitigation, schema evolution, and future plans across storage, compute, and management within a big‑data ecosystem.
Introduction

The talk presents the Xiaohongshu data-flow team's exploration of Apache Iceberg for real-time data-warehouse scenarios. Their multi-cloud architecture focuses on three directions: large-scale log real-time lake ingestion, CDC real-time lake ingestion, and real-time lake analytics.
01 Log Data Ingestion

Xiaohongshu's data platform spans multiple public clouds, using AWS S3 for storage, Kafka and Flink for real-time ETL, and Spark/Hive/Presto for offline analytics. OLAP engines such as ClickHouse, StarRocks, and TiDB serve near-real-time queries. The original APM pipeline suffered from small-file explosion due to uneven partition traffic, causing checkpoint delays and query performance degradation.
By adopting Iceberg's transactional capabilities, they introduced an asynchronous small-file merge job and designed an EvenPartitionShuffle algorithm to balance data distribution. Two evaluation metrics guided the algorithm's tuning: Fanout (the number of downstream sub-task partitions a source writes to) and Residual (the gap between actual and target traffic per sub-task). The result was more uniform sub-task loads and a 30-50% reduction in downstream read latency.
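The talk does not show the EvenPartitionShuffle implementation, but the balancing idea can be sketched as a greedy assignment of traffic-weighted partitions to sub-tasks. The function name, the `(partition -> traffic)` input shape, and the exact Residual formula below are assumptions for illustration, not the team's actual code:

```python
from heapq import heappush, heappop

def even_partition_shuffle(partition_traffic, num_subtasks):
    """Greedily assign source partitions to downstream sub-tasks so that
    per-sub-task traffic approaches the uniform target.

    partition_traffic: dict of partition_id -> observed traffic (e.g. bytes/s).
    Returns (assignment, residual), where residual is the largest relative
    deviation of any sub-task's load from the uniform target.
    """
    target = sum(partition_traffic.values()) / num_subtasks
    # Min-heap of (current load, subtask id): always give the next partition
    # to the currently lightest sub-task, heaviest partitions first.
    heap = [(0.0, i) for i in range(num_subtasks)]
    assignment = {}
    for pid, traffic in sorted(partition_traffic.items(),
                               key=lambda kv: kv[1], reverse=True):
        load, sub = heappop(heap)
        assignment[pid] = sub
        heappush(heap, (load + traffic, sub))
    # Residual metric: how far the worst sub-task strays from the target.
    loads = [0.0] * num_subtasks
    for pid, sub in assignment.items():
        loads[sub] += partition_traffic[pid]
    residual = max(abs(l - target) for l in loads) / target
    return assignment, residual
```

A low Residual means sub-task loads are close to uniform, which is what drives the latency reduction reported above; Fanout would additionally count how many partitions each sub-task receives.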
02 CDC Real-Time Lake

Three CDC pipelines are described: (1) MySQL full load via Sqoop into Hive partitions, (2) incremental CDC using a full-plus-incremental table pattern, and (3) a streamlined CDC-to-Iceberg flow where MySQL binlog is streamed to Kafka, Flink upserts into Iceberg, and schema evolution is handled automatically. Exactly-once semantics are ensured by ordered binlog emission, fixed-bucket hashing, and Flink shuffle keys limited to primary-key subsets.
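The fixed-bucket hashing mentioned above can be sketched as a routing function that depends only on the primary key, never on job parallelism. The function name and MD5 choice are illustrative assumptions; any stable hash would serve:

```python
import hashlib

def bucket_for_key(primary_key, num_buckets=64):
    """Route a CDC record to a fixed bucket derived only from its primary key.

    Because the bucket depends solely on the key and a fixed bucket count,
    every change event for one row lands in the same bucket, preserving the
    per-row ordering that exactly-once upserts rely on, even if the Flink
    job's parallelism is later rescaled.
    """
    digest = hashlib.md5(str(primary_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets
```

This is also why the shuffle key is restricted to primary-key subsets: hashing on any non-key column could send two updates for the same row to different writers and reorder them.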
Iceberg’s upsert sink, merge‑on‑read strategy, and handling of delete‑file overhead are discussed. Optimizations include removing redundant insert events to reduce delete‑file count, leveraging hidden partitions to skip irrelevant data, and implementing auto schema evolution by detecting column additions and restarting writers with the new schema.
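The auto schema evolution step can be illustrated with a small sketch that diffs an incoming record against the writer's known schema and restarts the writer when new columns appear. The dict-based schema model and function names below are simplifying assumptions; a real pipeline would go through Iceberg's schema-update APIs:

```python
def detect_added_columns(current_schema, incoming_record):
    """Report columns present in an incoming CDC record but missing from the
    writer's current schema. Schemas are modeled as {column: type_name}."""
    return {name: type(value).__name__
            for name, value in incoming_record.items()
            if name not in current_schema}

def evolve_and_restart(current_schema, incoming_record, restart_writer):
    """If the record carries new columns, widen the schema and restart the
    writer with it, mirroring the column-addition flow described above."""
    added = detect_added_columns(current_schema, incoming_record)
    if added:
        current_schema = {**current_schema, **added}
        restart_writer(current_schema)  # reopen writers on the new schema
    return current_schema
```

Only additive changes are handled here, matching the talk's focus on detecting column additions rather than renames or drops.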
03 Real-Time Lake Analytics

The team builds a dual-write path from Kafka to Iceberg columnar storage, enabling both streaming and batch consumption. Iceberg external tables are integrated with ClickHouse by mapping Iceberg manifests to ClickHouse part.ck files, allowing object-storage-backed ClickHouse queries.
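The manifest-to-part mapping is not detailed in the talk; one plausible reading is that the data files a manifest lists are grouped per partition and each group is exposed to ClickHouse as one external part. The sketch below encodes only that grouping idea; the tuple layout and function name are assumptions, not the actual integration code:

```python
from collections import defaultdict

def group_files_into_parts(data_files):
    """Group Iceberg data-file entries (as a manifest would list them) by
    partition, so each partition's file set could be presented to ClickHouse
    as one external 'part'.

    data_files: iterable of (path, partition, row_count) tuples.
    Returns {partition: [{"path": ..., "rows": ...}, ...]}.
    """
    parts = defaultdict(list)
    for path, partition, rows in data_files:
        parts[partition].append({"path": path, "rows": rows})
    return dict(parts)
```

Grouping by partition keeps each ClickHouse-visible unit aligned with Iceberg's pruning boundaries, so partition filters skip whole parts.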
Future Planning

The roadmap targets three pillars: storage (optimizing Cloud-Native FileIO, reducing checkpoint spikes, improving cross-cloud stability), compute (integrating more engines such as Spark, ClickHouse, and StarRocks, and adding secondary indexes for data skipping), and management (running Iceberg maintenance jobs as managed services, intelligent job orchestration, and conflict-reduction strategies).
Overall, the presentation showcases practical solutions to real‑time data lake challenges at scale, emphasizing Iceberg’s role in transactionality, schema evolution, and multi‑cloud operability within a big‑data ecosystem.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.