
Lakehouse Implementations at Leading Companies: Challenges, Solutions, and Benefits

This article reviews how major tech firms such as Alibaba, Tencent, ByteDance, and Kuaishou tackled lakehouse challenges—including architecture fragmentation, cost, scalability, and complex multimodal data—by adopting real‑time lakehouse solutions like Flink + Paimon, Iceberg + StarRocks, Hudi + LAS, and Doris + Alluxio, and outlines the resulting performance and cost gains.

Big Data Technology & Architecture

Background

Major internet companies face severe issues with traditional Lambda architectures, such as fragmented offline and real‑time pipelines, high latency, excessive storage redundancy, and difficulty handling multimodal data at massive scale.

Problems to Solve

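Efficiency bottlenecks caused by the split between offline and real‑time pipelines.

Cost and scalability pressure from redundant storage and massive data volumes.

Growing business complexity, including multimodal data.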

Key Technical Solutions and Practices

1. Alibaba (Flink + Paimon)

Storage layer: Paimon provides primary‑key upsert and high‑throughput append tables, integrated with Pangu storage for exabyte‑level data management.

Compute layer: Flink handles real‑time ETL, while Dolphin (based on Hologres) offers low‑latency queries.

Optimization: Asynchronous compaction, larger checkpoint intervals, dynamic batch‑size memory tuning.

Key techniques: HLL sketch for approximate UV, SST format extension achieving 5 × 10⁴ QPS and <70 ms latency.

Benefits: 60 % reduction in compute resources, 75 % storage cost saving, 10× faster real‑time feature production, and a 2 %+ CTR model improvement.
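
The HLL sketch mentioned under key techniques can be illustrated with a minimal HyperLogLog in Python. This is a textbook-style sketch, not Paimon's or Dolphin's implementation: the class name, the SHA-1 based hash, and the register count (p = 12) are assumptions chosen for illustration. It shows why approximate UV is cheap: each user ID updates one small register, and sketches from different partitions merge by a register-wise maximum.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch for approximate distinct counts (e.g. UV)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                       # number of bits used to pick a register
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # 64-bit hash taken from the first 8 bytes of SHA-1 (illustrative choice)
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit within the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> None:
        # Sketches over different partitions combine by register-wise maximum
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw
```

With p = 12 the sketch holds 4096 registers and has a typical relative error of about 1.6 %, regardless of how many distinct users are added.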

2. Tencent Video (Iceberg + StarRocks)

Lakehouse foundation: Iceberg ensures transactional writes; StarRocks provides compute‑storage separation for hot‑cold data layering.

Development model: Unified SQL‑in‑Jar framework for stream‑batch development; Flink Batch for historical data repair.

Optimization: Materialized views replace traditional ETL, yielding 3‑65× query performance gains.

Key techniques: A unified metric service manages 2000+ indicators via MQL; SLA work cut the task delay rate from 8 % to 1.2 % and alert response time to 15 minutes.

Benefits: 50 % faster development, 99.9 % metric consistency, 80 % reduction in cold‑data storage cost, and sub‑second query latency.
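
The idea behind replacing ETL with materialized views can be sketched in a few lines of Python. In StarRocks a materialized view is declared in SQL; the code below is only a conceptual model, and the row schema (date, video ID, play seconds) and all names are hypothetical. It contrasts a query that scans every raw row with a view that is refreshed incrementally on ingest and answers from a pre-computed aggregate.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical fact row: (date, video_id, play_seconds)
Row = Tuple[str, str, int]


def scan_total_by_date(rows: Iterable[Row]) -> Dict[str, int]:
    """Baseline: aggregate by scanning every raw row at query time."""
    totals: Dict[str, int] = defaultdict(int)
    for date, _video_id, play_seconds in rows:
        totals[date] += play_seconds
    return dict(totals)


class DailyPlayView:
    """Pre-aggregated view kept up to date as rows arrive."""

    def __init__(self) -> None:
        self._totals: Dict[str, int] = defaultdict(int)

    def ingest(self, row: Row) -> None:
        date, _video_id, play_seconds = row
        # Incremental refresh: constant work per incoming row
        self._totals[date] += play_seconds

    def total_for(self, date: str) -> int:
        # The query reads one aggregate instead of the raw rows
        return self._totals.get(date, 0)
```

The view returns the same totals as the full scan, but query cost no longer grows with the number of raw rows, which is the general reason pre-aggregation speeds up repeated metric queries.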

3. ByteDance (Hudi + LAS)

Storage layer: Custom‑enhanced Hudi with second‑level data visibility, backed by BTS in‑memory acceleration that sustains writes at tens of millions of RPS.

Compute layer: Spark, Flink, and Presto work together; Presto reaches three times the performance of the open‑source version.

Optimization: Row‑column hybrid storage with secondary indexes cuts I/O by 40 %.

Key techniques: Schema evolution for real‑time user‑profile updates; intelligent materialized views auto‑generate ADS layers, boosting resource utilization by 30 %.

Benefits: 50 % fewer real‑time warehouse components, 1.85 hours less debugging time per requirement, 60 % storage cost reduction, and query response time dropping from hours to minutes.
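
Schema evolution for user-profile updates can be pictured as reading old records through a newer schema. The following Python sketch is an assumption-laden illustration, not Hudi's implementation: the schema dictionaries, the column names, and the function `read_with_schema` are all hypothetical. It shows the core behavior: a column added later is filled with its default when an older record is read, so existing data does not have to be rewritten.

```python
from typing import Any, Dict

# Hypothetical schema versions: column name -> default value
SCHEMA_V1: Dict[str, Any] = {"user_id": None, "age": None}
SCHEMA_V2: Dict[str, Any] = {"user_id": None, "age": None, "interest_tags": []}


def read_with_schema(record: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Project a stored record onto the reader's schema.

    Columns missing from an older record take the schema default;
    columns the reader does not know about are dropped.
    """
    out: Dict[str, Any] = {}
    for col, default in schema.items():
        if col in record:
            out[col] = record[col]
        else:
            # Copy list defaults so records never share one mutable object
            out[col] = list(default) if isinstance(default, list) else default
    return out


old_record = {"user_id": "u1", "age": 30}                 # written under V1
new_record = {"user_id": "u2", "age": 25, "interest_tags": ["sports"]}

upgraded = read_with_schema(old_record, SCHEMA_V2)        # gains interest_tags = []
downgraded = read_with_schema(new_record, SCHEMA_V1)      # interest_tags is dropped
```

Resolving the schema at read time is what lets a new profile attribute go live without a backfill of historical files.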

4. Kuaishou (Doris + Alluxio)

Lakehouse interaction: Doris queries Hive/Hudi directly, with Alluxio caching achieving 85 % hot‑cold data hit rate.

Automation: Historical queries auto‑generate ADS layers; obsolete models are automatically decommissioned.

Optimization: Consistent‑hashing task distribution raises the cache hit rate to 70 %.

Key techniques: Colocation join avoids shuffle, doubling join performance; global statistics collection enables CBO‑optimized plans.

Benefits: 80 % reduction in data sync tasks, 60 % lower ADS model maintenance cost, query latency staying under 100 ms during peak events, and 40 % higher cluster resource utilization.
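
Consistent-hashing task distribution can be sketched as a hash ring with virtual nodes. The Python below is a generic illustration under stated assumptions, not the Doris scheduler: the class name, the MD5-based hash, the worker names, and the use of file paths as keys are all hypothetical. The point it demonstrates is that the same file is always routed to the same worker, so that worker's cache stays warm, and that adding a worker moves only a fraction of the files.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    # 64-bit position on the ring (MD5 is used only for spreading, not security)
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Hash ring with virtual nodes for stable key-to-worker assignment."""

    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: List[int] = []           # sorted ring positions
        self._owner: Dict[int, str] = {}     # ring position -> worker
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each worker owns many positions so load spreads evenly
        for i in range(self._vnodes):
            pos = _hash(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._ring, pos)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first position after the key, wrapping around
        idx = bisect.bisect_right(self._ring, _hash(key)) % len(self._ring)
        return self._owner[self._ring[idx]]


ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3", "worker-4"])
files = [f"/warehouse/table/part-{i}.parquet" for i in range(2000)]
before = {f: ring.node_for(f) for f in files}

ring.add_node("worker-5")
after = {f: ring.node_for(f) for f in files}
moved = [f for f in files if before[f] != after[f]]
```

Every file in `moved` now belongs to `worker-5`; files that stayed put keep hitting the cache they already populated. A plain `hash(key) % worker_count` scheme would reshuffle most files whenever the worker count changes.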

Key Technical Comparisons

Alibaba focuses on millisecond‑level decision making with Paimon's change‑log mode and Dolphin point‑lookup optimizations. Tencent emphasizes metric consistency via a unified metric middle‑platform and StarRocks compute‑storage separation. ByteDance prioritizes flexibility with schema evolution and intelligent materialized views. Kuaishou targets cost efficiency through Doris‑Alluxio integration and automated ADS generation.

Reference Cases

"Alibaba Mama: Lakehouse Practice with Flink + Paimon"

"From Zero to One: Apache Doris Lakehouse Solution"

"Tencent Video Metric Platform Driving Lakehouse Integration"

"ByteDance’s Lakehouse Solution Based on Apache Hudi"

"ByteDance Best Practices in Lakehouse Integration"

"Doris Lakehouse Practice at Kuaishou"

"10× Efficiency: Paimon + Dolphin Lakehouse Architecture at Alibaba Mama"

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
