Designing a Lakehouse with Doris and Paimon: Query Acceleration and Unified Modeling
This article summarizes how the Doris‑Paimon lakehouse architecture leverages Doris' high‑performance OLAP engine to accelerate lake queries, provides a unified data analysis gateway, supports unified data integration, and enables open, layered data modeling for modern big‑data workloads.
It has been a while since my last article; this note is a personal summary intended for production practice and interview preparation, drawing on the Doris official website and SelectDB's public sharing.
The concept of "lake‑warehouse integration" is familiar, and solutions vary based on technology stacks and business scenarios.
One viable option is the Doris × Paimon (or other lake components such as Hudi) combination, which this article explains in terms of problems solved and capabilities used.
Lakehouse Design Based on Doris × Paimon
Doris considers four application scenarios when designing lake‑warehouse integration:
Lakehouse query acceleration: Doris, as an efficient OLAP engine with MPP vectorized distributed query processing, can directly accelerate analysis of lake data.
Unified data analysis gateway: It provides query and write capabilities for heterogeneous data sources, mapping external sources to Doris metadata for a consistent query experience.
Unified data integration: Through lake source connectors, data from multiple sources can be incrementally or fully synced to Doris, processed, and either served directly for queries or exported downstream, reducing reliance on external tools.
More open data platform: By using open formats like Parquet/ORC and metadata services such as Iceberg or Hudi, data remains accessible to external systems, lowering migration costs and risks.
The Doris × Paimon lakehouse solution focuses on query acceleration and unified modeling, with the latter built on the former.
Even using Doris solely for query acceleration is worthwhile.
Doris' Own Capabilities
Data Source Integration
Doris supports Multi‑Catalog to connect to major lake and database systems including Hive, Iceberg, Hudi, Paimon, LakeSoul, Elasticsearch, MySQL, Oracle, and SQL Server.
The Doris community has integrated Paimon since version 0.5 and continuously follows its latest features.
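To make the Multi-Catalog connection concrete, the sketch below assembles the Doris SQL that registers a filesystem-based Paimon catalog. The catalog name `paimon_lake` and the warehouse path are illustrative placeholders, not values from the article; treat the exact property keys as an assumption to verify against the Doris documentation for your version.

```python
# Hedged sketch: build the Doris statement that maps a Paimon warehouse into
# Doris' metadata via Multi-Catalog. Catalog name and warehouse path are
# illustrative assumptions.
def create_paimon_catalog_sql(name: str, warehouse: str) -> str:
    """Compose a CREATE CATALOG statement for a filesystem-based Paimon catalog."""
    return (
        f"CREATE CATALOG {name} PROPERTIES (\n"
        '    "type" = "paimon",\n'
        '    "paimon.catalog.type" = "filesystem",\n'
        f'    "warehouse" = "{warehouse}"\n'
        ");"
    )

sql = create_paimon_catalog_sql("paimon_lake", "hdfs://nameservice1/user/paimon")
print(sql)
```

Once the catalog is created, Paimon tables can be queried in place as `paimon_lake.<db>.<table>` with no data movement.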
Query Acceleration
Doris optimizes remote I/O, data caching, and metadata caching.
I/O optimization: Techniques such as small I/O merging, prefetching, and delayed materialization improve throughput and latency when reading remote data.
Data caching: A lightweight local cache stores hot remote data blocks, allowing query performance comparable to native Doris tables.
Metadata caching: Partition and file-list information is cached to avoid frequent remote metadata calls, enabling millisecond‑level query planning even for large tables.
Materialized views: Doris can build asynchronous materialized views on Paimon, Iceberg, Hive, etc., storing them in Doris format and automatically routing queries for transparent acceleration.
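As a sketch of the materialized-view acceleration described above, the helper below composes an asynchronous materialized view that pre-aggregates a lake table on a periodic refresh. The source table `paimon_lake.ods.orders`, the columns `dt`/`amount`, and the hourly schedule are illustrative assumptions; the exact `CREATE MATERIALIZED VIEW` clause set should be checked against the Doris release you run.

```python
# Hedged sketch: an asynchronous materialized view over a Paimon table, stored
# in Doris format and refreshed on a schedule. Table, column, and schedule
# values are illustrative assumptions.
def async_mv_sql(mv_name: str, source_table: str) -> str:
    """Compose an async materialized view that pre-aggregates a lake table."""
    return (
        f"CREATE MATERIALIZED VIEW {mv_name}\n"
        "BUILD IMMEDIATE REFRESH AUTO ON SCHEDULE EVERY 1 HOUR\n"
        "DISTRIBUTED BY HASH(dt) BUCKETS 10\n"
        "AS SELECT dt, SUM(amount) AS total_amount\n"
        f"FROM {source_table}\n"
        "GROUP BY dt;"
    )

print(async_mv_sql("mv_daily_sales", "paimon_lake.ods.orders"))
```

Queries that aggregate `amount` by `dt` can then be transparently rewritten to hit the Doris-resident view instead of scanning the lake.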
Benchmarks on a 1 TB TPC‑DS dataset on Iceberg show Doris completing the 99 queries in one‑third the time Trino needs; in real scenarios it has reduced average query latency by 20% and 95th‑percentile latency by 50% compared to Presto, while using half the resources.
More details are available at https://doris.apache.org/zh-CN/docs/lakehouse/lakehouse-overview.
Unified Modeling
The official sharing describes layered data modeling: the ODS layer resides in the lakehouse, while the DWD, DWS, and ADS layers are processed and served in Doris to leverage its performance; results can also be written back to the lakehouse for backup or further processing.
Although Doris is not expected to replace all traditional data‑warehouse responsibilities, it works well for single‑business or small‑scale scenarios.
The final solution combines Doris, Flink, and Paimon:
Flink and Paimon together support streaming and batch reads/writes in the data‑processing layer.
Doris enables layered processing of Paimon tables and supports write‑back.
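The layered flow above can be sketched as a single Doris `INSERT ... SELECT` that cleans one ODS partition from the Paimon catalog into a native Doris DWD table for serving. All database, table, and column names here (`paimon_lake.ods.orders`, `internal.dw.dwd_orders`, `dt`) are illustrative assumptions, not from the article.

```python
# Hedged sketch of ODS-in-lake, DWD-in-Doris layering: read one partition of a
# Paimon ODS table through the catalog and load it into a Doris-internal DWD
# table. All identifiers are illustrative assumptions.
def dwd_load_sql(dt: str) -> str:
    """Compose an INSERT that cleans one ODS partition into a DWD table."""
    return (
        "INSERT INTO internal.dw.dwd_orders\n"
        "SELECT order_id, user_id, amount, dt\n"
        "FROM paimon_lake.ods.orders\n"
        f"WHERE dt = '{dt}' AND order_id IS NOT NULL;"
    )

print(dwd_load_sql("2024-06-01"))
```

In practice such a statement would be scheduled per partition (e.g. daily), with Flink handling the streaming ingestion into the Paimon ODS layer.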
These points summarize the key aspects of the Doris × Paimon lakehouse solution.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert dedicated to sharing big data technology.
