Integrating Data Lake Technologies with Data Warehouse Architecture at Xiaohongshu: Practices and Performance Optimizations
Xiaohongshu’s data‑warehouse team integrated Apache Iceberg‑based data‑lake techniques into its existing warehouse, replacing the legacy Hive/Spark stack with globally sorted, Z‑order‑clustered, and upsert‑enabled tables. The changes cut query latency by up to 90%, improved data freshness by 50%, reduced storage costs by 83%, and save tens of thousands of GB‑hours of compute daily.
In today’s data‑centric business environment, Xiaohongshu’s data‑warehouse team faces massive data processing and analysis challenges. To overcome the speed, flexibility, and cost limitations of traditional data warehouses, the team introduced data‑lake technologies such as Apache Iceberg and combined them with the existing warehouse architecture.
The article first outlines the shortcomings of the legacy Hive/Spark on HDFS stack—high change cost, poor data freshness, slow query performance, and low resource utilization. It then details how Iceberg’s file‑level tracking, asynchronous data re‑organization (e.g., Z‑order), global sorting, and indexing dramatically improve query efficiency and reduce storage costs.
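The idea behind Z‑order re‑organization is to interleave the bits of several column values into one sort key, so rows stay clustered in every interleaved dimension at once and per‑file min‑max statistics can prune on any of them. A minimal sketch of that bit interleaving (illustrative only, not Xiaohongshu's implementation):

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two column values into a single Z-order key.

    Rows sorted by this key stay clustered in BOTH dimensions, so
    file-level min-max statistics remain selective for either column.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bit positions <- x
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions  <- y
    return z

# Sorting by z_value keeps neighbours in (x, y) space near each other
# in the file layout, unlike a lexicographic sort on (x, y).
points = [(3, 7), (0, 0), (2, 2), (7, 3)]
points.sort(key=lambda p: z_value(*p))
```

In practice an engine applies this during an asynchronous rewrite job, then relies on the resulting clustering for data skipping at read time.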
Key practical innovations include:
UBT (User Behavior Tracking) log optimization: Global sorting by point‑ID, min‑max metadata in Iceberg files, and Spark‑level bypass‑hash partitioning cut query latency by 80‑90%.
New split scheme: Automatic conversion of split‑table queries to Iceberg point‑set queries, business‑driven split table creation, and view encapsulation simplify downstream access while eliminating data duplication.
Channel attribution revamp: Replaced offline DistCp tasks with a Flink‑Iceberg pipeline, improving data production timeliness by 90% and delivering substantial cost savings.
Anti‑crawling log compression: Kafka‑to‑Iceberg ingestion with Parquet compression reduced cross‑cloud transfer volume by 83% and shortened data‑arrival time by 85 minutes.
Live‑streaming real‑time pipeline: A unified Flink‑Iceberg architecture delivers near‑real‑time data to both historical lake storage and Kafka, improving latency and resource efficiency.
Upsert support: Iceberg format‑v2 tables enable primary‑key‑based upserts, supporting both incremental and full‑load use cases with efficient “as‑of‑time” queries.
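The UBT optimization above rests on a simple mechanism: after a global sort by point‑ID, each data file covers a narrow, largely disjoint ID range, so a point query only opens the few files whose min‑max range can contain the ID. A toy sketch of that file‑pruning step (file names and ranges are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    """One data file plus the per-column min-max stats kept in metadata."""
    path: str
    min_id: int
    max_id: int

def prune(files: list, point_id: int) -> list:
    """Keep only files whose min-max range can contain point_id.

    With a globally sorted layout, ranges barely overlap, so a point
    query touches one or two files instead of all of them -- the source
    of the reported 80-90% latency reduction.
    """
    return [f for f in files if f.min_id <= point_id <= f.max_id]

# Illustrative globally sorted layout: disjoint ID ranges per file.
files = [
    DataFile("f0.parquet", 0, 999),
    DataFile("f1.parquet", 1000, 1999),
    DataFile("f2.parquet", 2000, 2999),
]
matched = prune(files, 1500)  # only f1.parquet survives pruning
```

Without the global sort, every file's range would span most of the ID domain and pruning would eliminate almost nothing.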
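The upsert and “as‑of‑time” behaviour can be pictured as a series of snapshots, where each commit merges primary‑key upserts into the previous state and a time‑travel read selects the latest snapshot at or before the requested timestamp. A self‑contained sketch under those assumptions (keys, rows, and timestamps are made up; real Iceberg tracks snapshots in table metadata):

```python
def apply_upserts(base: dict, upserts: list) -> dict:
    """Merge primary-key upserts into a snapshot: an incoming row
    replaces any existing row with the same key, else it is inserted."""
    merged = dict(base)
    for key, row in upserts:
        merged[key] = row
    return merged

def read_as_of(snapshots: dict, ts: int) -> dict:
    """Return the latest snapshot committed at or before ts
    (a toy 'as-of-time' query)."""
    return snapshots[max(t for t in snapshots if t <= ts)]

# Each commit yields a new snapshot keyed by its commit timestamp.
t0 = {"u1": {"city": "SH"}, "u2": {"city": "BJ"}}
snapshots = {100: t0}
snapshots[200] = apply_upserts(
    t0, [("u2", {"city": "HZ"}), ("u3", {"city": "GZ"})]
)
```

Reading as of timestamp 150 sees u2 in "BJ"; reading as of 250 sees the upserted "HZ" row plus the newly inserted u3.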
Overall, the integration of data‑lake techniques has delivered measurable benefits: a 50% improvement in data freshness, 83% storage cost reduction, and tens of thousands of GB‑hours of compute saved daily. The team also outlines future directions, including large‑scale sub‑second data solutions, data‑lake + OLAP combos, and continued development of Apache Paimon for upsert‑heavy workloads.
Xiaohongshu Tech REDtech
Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.