Apache Doris Data Lake Federation Features Overview

This article introduces Apache Doris’s data lake federation capabilities, detailing its lake‑warehouse integration design, supported data sources such as Hive, Iceberg, Hudi, and Elasticsearch, performance optimizations for metadata and file access, case studies, community roadmap, and Q&A on replacing Presto.

DataFunSummit

Jul 18, 2023

With the rapid development of data lake technology, query performance has become the main bottleneck for extracting value from massive data. Apache Doris offers a high‑performance query engine that sits on top of a data lake, providing a seamless lake‑warehouse integration that enables fast, easy analytics.

The core ideas behind Doris's lake-warehouse integration are fourfold: (1) accelerate lake queries using Doris's MPP vectorized engine; (2) provide a unified data analysis gateway that abstracts heterogeneous sources; (3) unify data integration by syncing and processing external data within Doris; and (4) build an open data ecosystem that supports open file formats such as Parquet and ORC, table formats such as Iceberg and Hudi, and metadata services such as Hive Metastore, AWS Glue, and Alibaba Data Lake Formation.

Doris now supports a wide range of data‑lake sources. It can connect to Hive (ExternalTable, ManagedTable) with automatic metadata sync, Iceberg V1/V2 with time‑travel queries, Hudi with copy‑on‑write and merge‑on‑read, and Elasticsearch via a dedicated catalog that auto‑maps indices. JDBC catalog support enables connections to MySQL, PostgreSQL, Oracle, etc., allowing these sources to be treated as dimension tables.

Metadata connection is built on a three‑level hierarchy (catalog → database → table). A unified metadata layer masks differences across sources, while an extensible connector framework simplifies adding new sources. Doris provides efficient metadata caching, real‑time sync via refresh commands or event‑driven updates, and customizable authentication plugins (e.g., Apache Ranger) for fine‑grained access control.
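The three-level hierarchy can be pictured as a name-resolution step: a table reference with one, two, or three parts is completed from the session's current catalog and database. The following is a minimal sketch of that idea, not Doris's actual resolver; `TableRef`, `resolve`, and the default catalog name `internal` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableRef:
    catalog: str
    database: str
    table: str


def resolve(name: str,
            current_catalog: str = "internal",
            current_db: Optional[str] = None) -> TableRef:
    """Expand a 1-, 2-, or 3-part name into catalog.database.table."""
    parts = name.split(".")
    if any(not part for part in parts) or len(parts) > 3:
        raise ValueError(f"malformed table name: {name!r}")
    if len(parts) == 3:
        return TableRef(*parts)
    if len(parts) == 2:
        # database.table: the catalog comes from the session.
        return TableRef(current_catalog, parts[0], parts[1])
    if current_db is None:
        raise ValueError("no current database selected")
    # Bare table name: both catalog and database come from the session.
    return TableRef(current_catalog, current_db, parts[0])


# The same query text can target an external or an internal table
# depending only on the session context.
external = resolve("orders", current_catalog="hive_lake", current_db="sales")
explicit = resolve("hive_lake.sales.orders")
```

Because every source is addressed through the same three-part name, the layers above the metadata interface do not need to know which kind of source a table comes from.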

For data access, Doris rewrote the Parquet reader in C++ to eliminate extra in-memory format conversions and to take advantage of page indexes, Bloom filters, dictionary encoding, and delayed (late) materialization. It also introduced a local file cache with block-level granularity. Consistent hashing routes reads of the same block to the same node, so cached data stays close to the query executor, which reduces remote I/O and improves stability.
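To make the cache placement idea concrete, the sketch below splits a byte range of a remote file into fixed-size blocks and maps each block to a node with a consistent-hash ring. It is a generic illustration of the technique, not Doris's implementation: the 1 MiB block size, the virtual-node count, the MD5-based hash, and the node names are all assumptions.

```python
import bisect
import hashlib

BLOCK_SIZE = 1 << 20  # assumed block size (1 MiB), for illustration only


def _hash(key: str) -> int:
    """Stable 64-bit hash, so placement is the same across processes."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 64):
        if not nodes:
            raise ValueError("at least one node is required")
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


def blocks_for_range(path: str, offset: int, length: int):
    """Cache keys of the blocks covering bytes [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return [f"{path}:{i}" for i in range(first, last + 1)]


ring = HashRing(["be1", "be2", "be3"])
for block in blocks_for_range("s3://bucket/part-0.parquet", 0, 3 * BLOCK_SIZE):
    print(block, "->", ring.node_for(block))
```

The useful property is that removing or adding a node only remaps the blocks that belonged to that node; every other block keeps its owner, so most of the cache stays valid when the cluster changes.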

The execution engine was refactored so that the scan layer handles source‑specific logic while the upper planning layer remains identical for internal and external tables, enabling full use of Doris’s optimizer, predicate push‑down, runtime filters, and resource‑aware scheduling. Benchmarks on ClickBench and TPC‑H show 3‑10× performance gains over Trino.
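Predicate push-down and runtime filters can be illustrated with a toy hash join. In this sketch the scan applies static predicates as it reads rows, and the join derives an extra filter from the build side's keys and hands it to the probe-side scan before any probing happens. It is a simplified model of the general technique, not Doris's execution engine; all function names and sample rows are hypothetical, and a real engine would typically use Bloom filters or min/max ranges rather than an exact key set.

```python
def build_runtime_filter(build_rows, key):
    """Collect the build side's join keys into a membership test."""
    keys = {row[key] for row in build_rows}
    return lambda value: value in keys


def scan(rows, predicates):
    """Scan layer: drop rows as early as possible using pushed-down predicates."""
    for row in rows:
        if all(pred(row) for pred in predicates):
            yield row


def hash_join(build_rows, probe_rows, key, static_predicates=()):
    """Inner hash join that pushes a runtime filter into the probe-side scan."""
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)
    runtime_filter = build_runtime_filter(build_rows, key)
    predicates = list(static_predicates) + [lambda r: runtime_filter(r[key])]
    for probe in scan(probe_rows, predicates):
        # Every surviving probe row has a match, thanks to the runtime filter.
        for build in table[probe[key]]:
            yield {**build, **probe}


regions = [{"region_id": 1, "region": "EU"}, {"region_id": 2, "region": "US"}]
orders = [
    {"order_id": 10, "region_id": 1, "amount": 5},
    {"order_id": 11, "region_id": 3, "amount": 70},
    {"order_id": 12, "region_id": 2, "amount": 50},
    {"order_id": 13, "region_id": 1, "amount": 200},
]
joined = list(
    hash_join(regions, orders, "region_id",
              static_predicates=[lambda r: r["amount"] >= 10])
)
```

Here order 10 is removed by the static predicate and order 11 by the runtime filter, so neither reaches the join. For an external table, the same filtering at the scan layer means fewer rows, and ideally fewer file blocks, are read from remote storage.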

In a financial risk‑control case, Doris unified queries across Elasticsearch, Hive, and Greenplum, eliminating the need for costly data movement and improving timeliness. It also provided a single SQL interface for semi‑structured ES data, streamlining the data‑warehouse service.

Future community plans focus on expanding source support (e.g., Delta Lake, Paimon), enhancing data integration with CDC and materialized views, and introducing pipeline‑based resource isolation and scheduling. Doris 2.0 will also bring compute‑only nodes for elastic scaling and K8s deployment.

Q&A highlights: Doris cannot yet seamlessly replace Presto, though that is the goal. BE nodes currently cannot write results back to external warehouses; data can only be exported via SELECT ... INTO OUTFILE or EXPORT. No direct comparison with Spark exists, but Doris is expected to outperform Spark in many OLAP scenarios.

