Design and Technical Details of Apache Doris for Lakehouse Architecture

This article explains how Apache Doris extends its real‑time OLAP capabilities to support Lakehouse architectures, covering unified metadata, query acceleration, elastic compute, performance benchmarks, and the future roadmap for richer data‑source integration and resource isolation.

DataFunTalk

Mar 21, 2023

Introduction

The data warehouse was first defined by Bill Inmon in the early 1990s as a subject‑oriented, integrated, relatively stable collection of historical data for decision‑making. Data lakes emerged later to store massive volumes of heterogeneous data that traditional warehouses could not handle. The Lakehouse paradigm followed, combining the strong analytical capabilities of warehouses with the flexibility of lakes.

Benefits of Lakehouse

Unified data integration avoids redundant storage and heavy ETL pipelines.

Supports ACID, schema evolution, and snapshots for comprehensive governance.

Provides multi‑engine access, plug‑in architecture, and suitability for diverse workloads.

Apache Doris in Lakehouse

Apache Doris, an open‑source real‑time OLAP database, began experimenting with Apache Iceberg in version 0.15 and has since added support for multiple lake table formats (Iceberg, Hudi, Delta Lake). It now offers high‑performance query acceleration, a unified data‑analysis gateway, and robust data integration capabilities, validated in real‑world scenarios.

Key Features

Lakehouse Query Acceleration: Leveraging Doris’s distributed execution engine and local file cache, queries on lake data achieve several‑fold speedups compared with Hive, Presto, or Spark.

Unified Data Analysis Gateway: An extensible catalog framework enables rapid connection to relational databases, data warehouses, and lake engines such as Hive, Iceberg, Hudi, Delta Lake, and Flink Table Store.

Unified Data Integration: Supports both incremental and full‑copy synchronization, data processing, and write‑back to upstream sources, acting as a central data hub.

Open Data Ecosystem: Native support for open formats like Parquet and ORC eliminates vendor lock‑in and reduces migration costs.

Metadata Connection

Doris introduces a Catalog layer (Internal Catalog for native tables and External Catalog for external sources) that provides a unified metadata structure, an extensible connection framework, efficient metadata access, and customizable authentication plugins. Users can switch catalogs with the SWITCH statement and perform cross‑source federated queries.

SELECT * FROM hive.db1.tbl1 a JOIN iceberg.db2.tbl2 b ON a.k1 = b.k1;
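
To make the naming model concrete, here is a minimal Python sketch of how one‑, two‑, and three‑part table names might resolve against a session's current catalog. It is an illustration only, not Doris's implementation: the Session class, its methods, and the choice to clear the current database on a catalog switch are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical session state: the catalog and database currently in use."""
    catalog: str = "internal"
    database: str = ""

    def switch(self, catalog: str) -> None:
        # Models the effect of a SWITCH statement: later unqualified names
        # resolve against the new catalog. Clearing the database is an
        # assumption of this sketch.
        self.catalog = catalog
        self.database = ""

    def use(self, database: str) -> None:
        self.database = database

    def resolve(self, name: str) -> tuple[str, str, str]:
        """Expand a 1-, 2-, or 3-part name to (catalog, database, table)."""
        parts = name.split(".")
        if len(parts) == 3:
            return parts[0], parts[1], parts[2]
        if len(parts) == 2:
            return self.catalog, parts[0], parts[1]
        if len(parts) == 1 and self.database:
            return self.catalog, self.database, parts[0]
        raise ValueError(f"cannot resolve table name: {name!r}")
```

In the federated query above, both tables use fully qualified three‑part names, so they resolve the same way regardless of which catalog the session has switched to.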

Data Access Optimizations

The new Native File Format Reader eliminates double conversion by directly converting Parquet/ORC files to Doris’s in‑memory format, utilizes page‑level indexes, supports predicate push‑down and lazy materialization, and performs data pre‑fetching to reduce remote I/O.
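
The interplay of page‑level indexes, predicate push‑down, and lazy materialization can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the reader's actual code: the Page class and scan function are hypothetical, pages carry min/max statistics for a single integer key column, and the "payload" column stands in for any wide column that is expensive to decode.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical column page with min/max statistics on the key column."""
    min_key: int
    max_key: int
    keys: list[int]      # values of the predicate column
    payload: list[str]   # a wide column, touched only for matching rows


def scan(pages: list[Page], lo: int, hi: int) -> tuple[list[tuple[int, str]], int]:
    """Return rows with lo <= key <= hi and the number of pages actually read."""
    rows: list[tuple[int, str]] = []
    pages_read = 0
    for page in pages:
        # Page-level index: skip pages whose [min, max] range cannot match.
        if page.max_key < lo or page.min_key > hi:
            continue
        pages_read += 1
        # Pushed-down predicate: evaluated on the key column only.
        selected = [i for i, k in enumerate(page.keys) if lo <= k <= hi]
        # Lazy materialization: fetch payload values only for selected rows.
        rows.extend((page.keys[i], page.payload[i]) for i in selected)
    return rows, pages_read
```

With three pages covering keys 1–3, 4–6, and 7–9, a range filter of 5 to 7 reads two pages and skips the first entirely, which is the saving the page index is meant to provide on remote storage.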

File cache introduces block‑level caching with adaptive block sizes (4 KB–4 MB) and consistent‑hashing for cache placement, dramatically lowering read latency and providing near‑local‑table performance.
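
As a rough illustration of the two ideas in this paragraph, the Python sketch below aligns a byte‑range read to fixed‑size cache blocks and places each block on a node with a consistent‑hash ring. Everything here is an assumption for the example: the function and class names, the use of MD5, the 64 virtual nodes per host, and the fixed block size (the adaptive sizing described above is not modeled).

```python
import bisect
import hashlib


def block_range(offset: int, length: int, block_size: int) -> list[int]:
    """Return start offsets of the cache blocks covering a read (length > 0)."""
    first = (offset // block_size) * block_size
    last = ((offset + length - 1) // block_size) * block_size
    return list(range(first, last + 1, block_size))


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical, for illustration)."""

    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        # Each node contributes several points so load spreads more evenly.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

A cache key such as the file path plus the block's start offset can then be passed to node_for. The property that matters for a cache is that removing a node only remaps the blocks that lived on it, so the cached data on the remaining nodes stays valid.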

Execution Engine and Optimizer

Scan node refactoring separates generic query operators (Join, Sort, Agg) from source‑specific access logic, allowing new data sources to be integrated in roughly one person‑week. The optimizer gathers statistics from external catalogs (Hive Metastore, Iceberg, Hudi) and refines cost models, delivering better execution plans for complex federated queries.
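
The separation described here can be pictured as a narrow scanner interface that generic operators consume without knowing the source. The Python sketch below is an assumption‑laden illustration, not Doris's C++ design: the Scanner and ListScanner classes and the hash_join function are hypothetical names, and an in‑memory list stands in for a real Hive or Iceberg reader.

```python
from abc import ABC, abstractmethod
from typing import Iterator

Row = dict[str, object]


class Scanner(ABC):
    """Source-specific access logic sits behind this one method."""

    @abstractmethod
    def scan(self) -> Iterator[Row]:
        ...


class ListScanner(Scanner):
    """In-memory stand-in for a real source such as a Hive or Iceberg table."""

    def __init__(self, rows: list[Row]) -> None:
        self._rows = rows

    def scan(self) -> Iterator[Row]:
        return iter(self._rows)


def hash_join(left: Scanner, right: Scanner, key: str) -> list[Row]:
    """Generic inner hash join: depends only on the Scanner interface."""
    # Build phase: index the right side by join key.
    table: dict[object, list[Row]] = {}
    for row in right.scan():
        table.setdefault(row[key], []).append(row)
    # Probe phase: stream the left side and emit merged rows.
    return [
        {**lrow, **rrow}
        for lrow in left.scan()
        for rrow in table.get(lrow[key], [])
    ]
```

Under this shape, supporting a new source means writing one more Scanner subclass while the join, sort, and aggregation operators stay untouched, which is the kind of saving the one person‑week figure points to.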

Performance Comparison

Benchmarks on ClickBench (wide‑table) and TPC‑H (multi‑table join) show Apache Doris consistently outperforms Presto/Trino by 3–10× on identical hardware and data sets.

Elastic Compute and Load Management

In the upcoming 2.0 release, Doris adds stateless elastic compute nodes dedicated to external‑source queries, enabling rapid horizontal scaling, Kubernetes‑native scheduling, and fine‑grained resource isolation (CPU, I/O, memory).

Case Study

A financial risk‑control platform migrated from Greenplum/CDH to Doris in 2022. By creating a single Hive catalog, thousands of Hive tables became queryable with dramatically improved performance; an Elasticsearch catalog delivered sub‑second analytics, and decoupling batch processing from statistical analysis reduced resource consumption and increased system stability.

Future Roadmap

Future work focuses on richer data‑source support (Hudi MoR incremental queries, Iceberg/Hudi indexes, Delta Lake, Flink Table Store), deeper data integration (CDC, incremental materialized views, Git‑like versioning), and finer‑grained resource isolation and scheduling.

Community

The Lakehouse SIG gathers developers from multiple enterprises to advance Doris’s Lakehouse capabilities. SelectDB, the commercial backing company, provides cloud‑native real‑time data‑warehouse services and contributes to the Apache Doris ecosystem.
