
Kuaishou's Lakehouse‑Integrated OLAP Architecture with Apache Doris: Design, Migration, and Optimization

This article describes how Kuaishou moved its high‑traffic OLAP system from a split architecture, with Hive/Hudi as the data lake and ClickHouse as the warehouse, to a unified lakehouse built on Apache Doris. It covers the problems with the old design, the architectural choices, the caching and automatic materialization mechanisms, and the resulting gains in performance and governance.

Big Data Technology & Architecture

Kuaishou's OLAP system serves billions of daily queries across internal and external scenarios. The original architecture separated the offline data lake (Hive/Hudi) and the real‑time data warehouse (ClickHouse), leading to storage redundancy, resource contention, complex governance, and difficult query tuning.

Problems

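Keeping the same data in both the lake and ClickHouse wasted storage and delayed data readiness, because data had to be moved into the warehouse before it could be queried.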

ClickHouse consumed cluster resources during data ingestion and compaction, creating contention under high concurrency.

Data engineers faced heavy manual effort to build and maintain ADS models for ClickHouse ingestion, increasing operational complexity.

Query performance degraded as ClickHouse reached its scaling limits, and tuning it carried high learning and operational costs.

Upgrade Goals and Selection

To eliminate the lake‑warehouse split, Kuaishou sought a lakehouse‑integrated architecture that could query lake data directly without costly data movement. Apache Doris was selected for its real‑time analytical capabilities, materialized‑view rewrite, and emerging lakehouse features.

Lakehouse‑Integrated Architecture Based on Apache Doris

The new architecture consists of three layers:

Data Processing Layer: Data is ingested into the lake (Hive/Hudi) and processed from ODS to DWS, with DWS‑to‑ADS materialization handled by an automatic materialization service.

Data Caching Layer: ADS data is cached in Alluxio for low‑latency access, while metadata is cached in a dedicated metadata service.

Data Query Layer: Apache Doris provides high‑performance queries over the ADS layer.

Cache Service and Optimization

Kuaishou implemented both metadata and data caching. A custom Meta Server listens to changes from Hive Metastore and Alluxio, persists them to a Meta Store, and pushes updates to Doris FE's catalog, ensuring consistency.
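The listen, persist, push flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `MetaServer`, the `on_event` and `subscribe` methods, and the event payload shapes are hypothetical, an in-memory dict stands in for the Meta Store, and a plain callback stands in for a Doris FE catalog.

```python
class MetaServer:
    """Toy stand-in for the Meta Server: persist each change, then push it."""

    def __init__(self):
        self.store = {}        # hypothetical in-memory Meta Store: table -> metadata
        self.subscribers = []  # callbacks standing in for Doris FE catalogs

    def subscribe(self, callback):
        """Register a consumer that wants to be notified of metadata changes."""
        self.subscribers.append(callback)

    def on_event(self, source, table, payload):
        """Handle one change event from a source such as Hive Metastore or Alluxio."""
        entry = self.store.setdefault(table, {})
        entry[source] = payload          # persist first
        snapshot = dict(entry)           # copy so consumers cannot mutate the store
        for push in self.subscribers:    # then push, so consumers never run ahead of the store
            push(table, snapshot)
```

Persisting before pushing is the design choice that matters here: a consumer that misses a push can always recover the latest state from the store.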

The metadata cache reduced average lookup latency from ~800 ms to ~50 ms. The data cache uses Alluxio; Doris checks the is_cached flag to decide whether to read from Alluxio or HDFS.
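As a rough illustration of the read-path decision, the sketch below picks a location based on an is_cached flag. Only the flag name comes from the article; the `PartitionMeta` type, the `resolve_read_path` function, the Alluxio address, and the assumption that Alluxio mirrors the HDFS directory layout are all hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class PartitionMeta:
    hdfs_path: str   # e.g. "hdfs://nn/warehouse/ads/tbl/dt=2024-01-01"
    is_cached: bool  # flag reported by the metadata cache


def resolve_read_path(meta, alluxio_prefix="alluxio://alluxio-master:19998"):
    """Return the Alluxio URI when the data is cached, else the original HDFS path."""
    if not meta.is_cached:
        return meta.hdfs_path
    # Assumption: Alluxio mirrors the HDFS directory layout, so only the
    # scheme and authority need to be swapped.
    return alluxio_prefix + urlparse(meta.hdfs_path).path
```

Falling back to HDFS whenever the flag is unset keeps queries correct when the cache is cold; the cache only changes latency, not results.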

Automatic Materialization System

Materialized view discovery combines expert rules (e.g., dimensions city, gender and metrics sum(time), count(distinct uid)) and historical query analysis to propose view definitions such as:

select sum(time), count(distinct uid) from db.tbl group by city, gender
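The historical-query side of discovery can be sketched as a frequency count over (dimensions, metrics) combinations. This is a minimal sketch, not Kuaishou's implementation: it assumes the query log has already been parsed into dimension and metric lists, and the function name `propose_views` and the `min_hits` threshold are hypothetical.

```python
from collections import Counter


def propose_views(query_log, table, min_hits=2):
    """Propose one view definition per frequently seen (dimensions, metrics) combination.

    query_log: iterable of (dimensions, metrics) pairs extracted from past queries.
    """
    # Sort both lists so that the same combination in a different order counts once.
    counts = Counter(
        (tuple(sorted(dims)), tuple(sorted(metrics))) for dims, metrics in query_log
    )
    proposals = []
    for (dims, metrics), hits in counts.most_common():
        if hits < min_hits:
            continue  # too rare to justify building and refreshing a view
        proposals.append(
            f"select {', '.join(metrics)} from {table} group by {', '.join(dims)}"
        )
    return proposals
```

In practice, expert rules would add or filter candidates on top of such a count; that part is not modeled here.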

Discovered definitions are stored in Meta Server, then submitted to a job scheduler that builds and refreshes materialized views, adjusting priorities based on data lineage.

During query execution, Doris registers materialized views of type KwaiMTMV from Meta Store and rewrites queries to use them, e.g., transforming a count distinct operation into an efficient bitmap aggregation on the appropriate view.
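Why a bitmap is needed for the count distinct rewrite can be shown with a small model. In this sketch Python sets stand in for bitmaps, the view is keyed by (city, gender) as in the example definition, and the function names and sample rows are hypothetical; it does not reproduce Doris's rewrite logic.

```python
from collections import defaultdict

# Hypothetical base rows: (city, gender, uid, time)
ROWS = [
    ("beijing", "f", 1, 10),
    ("shanghai", "f", 1, 5),   # uid 1 is active in two cities
    ("beijing", "f", 2, 7),
    ("beijing", "m", 3, 4),
    ("shanghai", "m", 3, 6),
    ("shanghai", "m", 4, 1),
]


def build_view(rows):
    """Pre-aggregate per (city, gender): total time plus a 'bitmap' (set) of uids."""
    view = defaultdict(lambda: [0, set()])
    for city, gender, uid, time in rows:
        view[(city, gender)][0] += time
        view[(city, gender)][1].add(uid)
    return view


def query_by_gender(view):
    """Answer a coarser query from the view: sums add up, bitmaps are unioned."""
    merged = defaultdict(lambda: [0, set()])
    for (_city, gender), (total, uids) in view.items():
        merged[gender][0] += total
        merged[gender][1] |= uids  # a union removes users counted in several cities
    return {g: (total, len(uids)) for g, (total, uids) in merged.items()}
```

Storing only the per-group distinct counts would not work: adding them up for gender f would count uid 1 twice, once per city. Keeping the set of ids lets the coarser query be answered exactly from the view.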

Lakehouse Query Optimization

External table statistics are collected via Spark and served from Meta Server to guide the optimizer.

Parquet files are written with sorted keys and optimal RowGroup sizes to maximize predicate push‑down.
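The effect of sorted keys on predicate push‑down can be illustrated with a toy model of row-group min/max statistics. The sketch uses plain Python lists rather than a Parquet library, and the function names, data, and group size are hypothetical.

```python
def make_row_groups(values, group_size, sort=False):
    """Split values into row groups and record (min, max) statistics for each."""
    data = sorted(values) if sort else list(values)
    groups = [data[i:i + group_size] for i in range(0, len(data), group_size)]
    return [(min(g), max(g), g) for g in groups]


def groups_to_scan(row_groups, lo, hi):
    """Count row groups whose [min, max] range overlaps the predicate lo <= x <= hi."""
    return sum(1 for mn, mx, _ in row_groups if mx >= lo and mn <= hi)


# 100 distinct keys in scrambled order (37 is coprime with 100, so this is a permutation).
VALUES = [(i * 37) % 100 for i in range(100)]

unsorted_groups = make_row_groups(VALUES, group_size=10)
sorted_groups = make_row_groups(VALUES, group_size=10, sort=True)
```

With sorted data, each row group covers a narrow key range, so `groups_to_scan(sorted_groups, 20, 29)` touches a single group, while the scrambled layout forces several groups to be read for the same predicate. Row-group size is the other knob: smaller groups prune more precisely but add per-group overhead.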

Bucketed tables enable Doris to generate colocated aggregation and join plans, reducing shuffle overhead.
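The idea behind colocated plans can be sketched as follows: if both tables are bucketed on the join key with the same function and bucket count, matching rows always land in the same bucket index, so each bucket pair can be joined locally. The hash function (`zlib.crc32`), the function names, and the row layout here are stand-ins, not Doris's actual bucketing or planner.

```python
import zlib


def bucket_of(key, num_buckets):
    """Deterministic stand-in for a bucketing hash."""
    return zlib.crc32(str(key).encode()) % num_buckets


def bucketize(rows, num_buckets):
    """Distribute rows into buckets by their first column (the join key)."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets


def colocated_join(left_buckets, right_buckets):
    """Join bucket i of the left table with bucket i of the right table only."""
    out = []
    for left, right in zip(left_buckets, right_buckets):
        index = {}
        for r in right:
            index.setdefault(r[0], []).append(r)  # local hash table per bucket
        for l in left:
            for r in index.get(l[0], []):
                out.append(l + r[1:])
    return out
```

No row ever has to move between buckets, which is the shuffle that the colocated plan avoids. The same argument applies to aggregation on the bucket key: each bucket can be aggregated independently.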

Conclusion

By migrating to Apache Doris, Kuaishou unified storage and query processing, eliminated data duplication, and achieved lower latency and higher throughput. The combination of Doris's materialized‑view rewrite and Kuaishou's automatic materialization service delivers flexible data governance and high‑performance analytics. Future work includes expanding Doris to replace Presto for ad‑hoc queries and further migrating internal dashboards to the lakehouse architecture.

Tags: Big Data, Query Optimization, OLAP, Materialized View, Apache Doris, Data Caching
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
