Near‑Real‑Time Data Warehousing with Yunqi Lakehouse: Cases from Xiaohongshu, Kuaishou, Meituan
The article examines how Xiaohongshu, Kuaishou and Meituan adopted Yunqi Lakehouse’s General Incremental Computing and Single‑Engine architecture to achieve near‑real‑time data warehouses, cutting resource usage to as low as 1/20 of full‑batch jobs, reducing data latency from days to minutes, and improving query performance.
Background and Motivation
Rapid growth in data volume and AI‑driven data demands have pushed leading internet companies to require sub‑second response times, petabyte‑scale queries, and high concurrency. Traditional Lambda architectures with separate offline and real‑time pipelines struggle with high cost, data staleness, and duplicated code bases.
Yunqi Lakehouse Overview
Yunqi Lakehouse focuses on a single‑engine design powered by General Incremental Computing (GIC). GIC processes only changed data, merging results into existing state, while a cost‑based dynamic execution engine selects optimal plans based on query complexity, change volume, and schedule frequency.
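The idea of processing only changed rows and merging them into existing state can be illustrated with a small Python sketch. This is not Yunqi Lakehouse code: the class `IncrementalSumCount`, the `(key, value, sign)` delta format, and the helper `full_recompute` are all hypothetical names chosen for illustration, and the sketch covers only a per-key average.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class IncrementalSumCount:
    """Hypothetical per-key (sum, count) state that folds in only changed rows."""
    state: dict = field(default_factory=lambda: defaultdict(lambda: [0.0, 0]))

    def apply_delta(self, delta):
        """Merge a batch of changes into the state.

        delta: iterable of (key, value, sign), where sign is +1 for an
        inserted row and -1 for a retracted (deleted) row.
        """
        for key, value, sign in delta:
            acc = self.state[key]
            acc[0] += sign * value
            acc[1] += sign
            if acc[1] == 0:
                # No rows left for this key, so drop its state entirely.
                del self.state[key]

    def averages(self):
        """Current per-key average, derived from the merged state."""
        return {key: total / count for key, (total, count) in self.state.items()}


def full_recompute(rows):
    """Baseline: rebuild the same result from every (key, value) row."""
    agg = IncrementalSumCount()
    agg.apply_delta((key, value, 1) for key, value in rows)
    return agg.averages()
```

Each call to `apply_delta` does work proportional to the size of the change set rather than the size of the table, which is the property the resource figures in the cases below depend on. The result matches `full_recompute` over the final table contents.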
Case 1: Xiaohongshu – Near‑Real‑Time Experiment Data Warehouse
Xiaohongshu processes billions of daily log entries and needs fresh experiment metrics every 30 minutes. Their previous Lambda setup showed a >5% discrepancy between real‑time and offline data, required sampling, and meant maintaining two code bases. By adopting GIC and a unified Single‑Engine architecture with Iceberg tables and standard SQL, they achieved:
Data freshness improved from daily to 5‑minute intervals.
Real‑time and offline metric differences narrowed to <1%.
Resource consumption dropped to 36% of the prior real‑time pipeline.
Case 2: Kuaishou – EB‑Scale Incremental Computing
Kuaishou handles exabyte‑scale data with millions of CPU cores. Full‑batch recomputation for 1% data changes was infeasible. Their GIC tests covered simple, medium, and complex scenarios (the last involving >10 tables, dozens of joins, and window operators). Results:
Simple scenario resource usage <1/20 of full batch.
Medium scenario resource usage <1/3 of full batch.
Complex scenario achieved 30‑minute latency with stable performance.
Reduced risk of “break‑line” failures during peak periods.
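Why a join can be maintained from a small change set, rather than recomputed, follows from the standard delta rule for inner joins: for inserts, Δ(A ⋈ B) = ΔA ⋈ B + A ⋈ ΔB + ΔA ⋈ ΔB. The Python sketch below shows that rule for insert‑only changes on in‑memory lists. It is an illustration under that assumption, not a description of how Kuaishou or Yunqi Lakehouse implements joins; `hash_join` and `incremental_join` are hypothetical helpers.

```python
from collections import defaultdict


def hash_join(left, right):
    """Inner-join two lists of (key, payload) rows on key."""
    index = defaultdict(list)
    for key, payload in right:
        index[key].append(payload)
    return [
        (key, left_payload, right_payload)
        for key, left_payload in left
        for right_payload in index.get(key, ())
    ]


def incremental_join(old_left, old_right, delta_left, delta_right):
    """Return only the NEW output rows caused by inserting the deltas.

    Applies: delta(A join B) = dA join B + A join dB + dA join dB,
    where A and B are the tables before the insert.
    """
    return (
        hash_join(delta_left, old_right)
        + hash_join(old_left, delta_right)
        + hash_join(delta_left, delta_right)
    )
```

Appending the rows from `incremental_join` to the previous join output gives the same multiset of rows as joining the updated tables from scratch. Deletes and updates need signed multiplicities, which this sketch leaves out.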
Case 3: Meituan – BI Platform Performance Boost
Meituan’s BI platform serves millions of queries daily, requiring both sub‑second interactive analytics and deep PB‑scale exploration. After integrating Yunqi Lakehouse in a gradual (gray‑scale) rollout, they observed:
Two‑fold performance improvement over a comparable Trino cluster for complex queries across 84 tables and >1,000 TB of data.
Consistent latency as QPS grew to 80, against an online peak below 10.
Seamless access to existing HDFS data via external tables, no data migration needed.
Key Technical Insights
All three cases share two core advantages:
General Incremental Computing (GIC): By calculating only the delta and dynamically choosing execution plans, the engine adapts to both low‑change (1% delta) and high‑change workloads without falling back to full recomputation.
Single‑Engine Architecture: A unified engine handles batch, streaming, interactive, and incremental workloads using Iceberg open format and standard SQL, allowing near‑zero code changes when moving logic between pipelines.
The architecture also benefits from vectorized C++/SIMD execution, a three‑tier cache (memory, SSD, object storage) achieving >95% hit rates, and elastic scaling from 0 to over 100 instances.
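A tiered lookup of this kind can be sketched in Python as follows. Everything here is an assumption made for illustration: the class `TieredCache`, the LRU eviction policy, promotion to memory on every read, and counting a "hit" as any read served from memory or SSD are not taken from the source, and plain dictionaries stand in for the real storage tiers.

```python
from collections import OrderedDict


class TieredCache:
    """Hypothetical memory -> SSD -> object-storage read path with LRU tiers."""

    def __init__(self, memory_capacity, ssd_capacity, object_store):
        self.memory = OrderedDict()       # fastest tier
        self.ssd = OrderedDict()          # stand-in for a local SSD tier
        self.object_store = object_store  # dict standing in for remote storage
        self.memory_capacity = memory_capacity
        self.ssd_capacity = ssd_capacity
        self.hits = 0                     # reads served from memory or SSD
        self.lookups = 0

    def get(self, key):
        self.lookups += 1
        if key in self.memory:
            self.memory.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.memory[key]
        if key in self.ssd:
            value = self.ssd.pop(key)     # promote from SSD to memory
            self.hits += 1
        else:
            value = self.object_store[key]  # miss: read from remote storage
        self._put_memory(key, value)
        return value

    def _put_memory(self, key, value):
        self.memory[key] = value
        if len(self.memory) > self.memory_capacity:
            # Demote the least recently used entry to the SSD tier.
            old_key, old_value = self.memory.popitem(last=False)
            self._put_ssd(old_key, old_value)

    def _put_ssd(self, key, value):
        self.ssd[key] = value
        if len(self.ssd) > self.ssd_capacity:
            self.ssd.popitem(last=False)  # evicted data remains in object storage

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0
```

With skewed access patterns, where a small set of keys receives most reads, `hit_rate()` rises toward 1.0 because hot keys stay in the two local tiers; the >95% figure above would correspond to fewer than 1 in 20 reads reaching object storage under this definition of a hit.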
Ecosystem Compatibility
Yunqi Lakehouse natively supports Iceberg, standard SQL, Hive UDFs, and enterprise security integrations such as Kerberos, enabling large‑scale adoption without extensive refactoring of existing pipelines.
Implications for Data+AI
Higher data freshness directly improves AI model training effectiveness, as demonstrated by Kuaishou’s plan to feed incremental data into ad‑ranking models. Meituan’s natural‑language data assistant also relies on timely, accurate metric layers.
Overall, the successful deployments at Xiaohongshu, Kuaishou, and Meituan validate that GIC and a Single‑Engine approach can meet the stringent performance, cost, and scalability demands of today’s leading internet services.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
