Unlocking Apache Doris: How Lakehouse Integration Supercharges Data Analytics
This article walks through Apache Doris's lakehouse‑in‑one architecture: the core value and paradigm of lakehouse technology, the system's components, and its main use cases. It then examines technical challenges such as file‑format diversity and unstable I/O on object storage, presents a suite of optimizations—from predicate push‑down and partition pruning to metadata caching and dynamic scheduling—that substantially improve query performance and resource utilization, and closes with the project's roadmap.
Introduction
The session titled "Apache Doris Lakehouse‑in‑One Technical Analysis" introduces the core value and paradigm of lakehouse technology and outlines the agenda covering lakehouse fundamentals, Doris architecture, use cases, technical challenges, optimization techniques, performance results, future plans, and a Q&A.
Lakehouse Core Value and Paradigm
Lakehouses combine the low‑cost, scalable storage of data lakes (supporting structured, semi‑structured, and unstructured data) with the high‑performance analytics of data warehouses. They use a schema‑on‑read approach for flexibility, but early Hadoop‑based lakes suffered performance bottlenecks that have been mitigated over time.
Traditional lake‑warehouse separation leads to data and application fragmentation, higher costs, and lower efficiency. The lakehouse‑in‑one model unifies storage, metadata, and compute, offering a single API for real‑time queries, batch processing, and AI workloads.
Apache Doris Lakehouse‑in‑One Architecture
Doris serves as a typical lakehouse system, integrating storage layers (S3, HDFS) with file formats (Parquet, ORC) and table formats (Hive, Iceberg, Paimon, Hudi, Delta Lake). Its core components include a high‑performance vectorized engine, pipeline execution, smart cost‑based optimizer, materialized views, JDBC/MySQL access, Arrow Flight for AI, and unified metadata management across multiple catalogs.
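To make the unified‑metadata idea concrete, here is a minimal sketch of multi‑catalog dispatch: one entry point routes a fully qualified table name (`catalog.db.table`) to the metadata source registered for that catalog. The class and resolver names are illustrative assumptions, not Doris's actual internals.

```python
class UnifiedCatalog:
    """Toy sketch of unified metadata management across catalogs.
    Catalog names and resolver callables here are hypothetical."""

    def __init__(self):
        self.catalogs = {}

    def register(self, name, resolver):
        # resolver: callable (db, table) -> metadata handle
        self.catalogs[name] = resolver

    def resolve(self, qualified_name):
        # Split "catalog.db.table" and dispatch to the right metadata source.
        catalog, db, table = qualified_name.split(".")
        return self.catalogs[catalog](db, table)

uc = UnifiedCatalog()
uc.register("hive", lambda db, t: f"hms://{db}/{t}")
uc.register("iceberg", lambda db, t: f"iceberg://{db}/{t}")
print(uc.resolve("hive.sales.orders"))      # → hms://sales/orders
print(uc.resolve("iceberg.sales.orders"))   # → iceberg://sales/orders
```

The point of this shape is that queries, whatever engine feature consumes them, see one naming scheme regardless of whether the table lives in Hive, Iceberg, Paimon, Hudi, or Delta Lake.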
Use Cases
Enterprise‑level lakehouse acceleration for existing Hadoop/Hive/Spark ecosystems.
Federated analytics across heterogeneous sources (Hive, MySQL, etc.) using a unified query engine.
Lightweight data‑warehouse scenarios for mid‑size enterprises and CDP deployments.
Technical Challenges
Key challenges include diverse data formats (Iceberg, Paimon, Hudi, Parquet, ORC), unstable I/O performance on object stores, and complex resource management (concurrent queries, inaccurate statistics).
Optimization Techniques
File‑Read Optimizations
Predicate push‑down using min‑max and Bloom filters on Parquet/ORC.
Partition pruning to skip irrelevant partitions.
Late materialization to read predicate columns first and fetch the remaining columns only for rows that pass the filters.
Dictionary filtering to compare integer codes instead of strings.
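Two of the techniques above can be sketched together: min‑max pruning skips whole row groups whose statistics cannot match the predicate, and dictionary filtering rewrites a string predicate into integer codes once per row group. The `RowGroup`, `can_skip`, and `dict_codes` names below are illustrative, not Doris or Parquet APIs.

```python
# A minimal sketch of min-max row-group pruning and dictionary filtering,
# in the style of a Parquet/ORC reader. All names here are hypothetical.
from dataclasses import dataclass
from typing import List, Set

@dataclass
class RowGroup:
    min_val: int            # min statistic from the file footer
    max_val: int            # max statistic from the file footer
    dictionary: List[str]   # dictionary page for a string column

def can_skip(rg: RowGroup, lo: int, hi: int) -> bool:
    """A row group whose [min, max] range misses the predicate [lo, hi]
    is skipped without reading any of its data pages."""
    return rg.max_val < lo or rg.min_val > hi

def dict_codes(rg: RowGroup, wanted: Set[str]) -> Set[int]:
    """Translate a string predicate into dictionary codes once, so the
    scan compares small integers instead of full strings."""
    return {i for i, v in enumerate(rg.dictionary) if v in wanted}

groups = [RowGroup(0, 99, ["a", "b"]),
          RowGroup(120, 380, ["b", "c"]),
          RowGroup(500, 900, ["c", "d"])]
survivors = [g for g in groups if not can_skip(g, 100, 400)]
print(len(survivors))                      # 1: only the [120, 380] group
print(dict_codes(survivors[0], {"c"}))     # {1}
```

In a real reader the same idea extends to Bloom filters for equality predicates, which can reject a row group even when its min‑max range overlaps the filter.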
I/O Optimizations
Merge small I/O requests into larger ranges.
Local block caching of remote object‑store reads.
Special handling for tiny files, ORC stripes, and row‑store formats.
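The first item, coalescing small reads, can be sketched as a simple range merge: nearby byte ranges are combined so each object‑store request fetches more data per round trip. The `max_gap` and `max_merged` thresholds below are made‑up parameters for illustration; Doris's actual tuning values are internal.

```python
def merge_ranges(ranges, max_gap=4096, max_merged=8 << 20):
    """Coalesce small, nearby byte ranges into larger reads.
    Hypothetical sketch: merge two ranges when the gap between them is
    small and the merged read stays under a size cap."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap \
                and end - merged[-1][0] <= max_merged:
            merged[-1][1] = max(merged[-1][1], end)   # extend previous read
        else:
            merged.append([start, end])               # start a new read
    return merged

print(merge_ranges([(0, 100), (150, 300), (10_000, 10_050)]))
# → [[0, 300], [10000, 10050]]
```

Reading a few bytes of slack in the merged range is usually far cheaper than paying an extra high‑latency round trip to S3 or HDFS, which is why this trade‑off pays off on object storage.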
Metadata Optimizations
Batch split assignment with locality awareness and consistent hashing.
Multi‑level metadata caches (catalog, schema, partition, file list).
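The locality‑aware assignment above relies on consistent hashing: the same file split keeps mapping to the same backend node, so that node's local block cache stays hot. Below is a minimal ring sketch; the class name, replica count, and hash choice are assumptions, and the real scheduler also balances batch sizes across nodes.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Minimal consistent-hash ring for assigning file splits to backend
    nodes. A sketch, not Doris's actual split scheduler."""

    def __init__(self, nodes, replicas=64):
        # Each node gets several virtual points on the ring to smooth skew.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, split_path: str) -> str:
        # First ring point clockwise from the split's hash owns the split.
        idx = bisect_right(self.keys, self._hash(split_path)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["be1", "be2", "be3"])
# The same split maps to the same backend across calls → stable cache hits.
print(ring.node_for("s3://bucket/part-0.parquet"))
```

The practical benefit is that adding or removing one backend reshuffles only a small fraction of splits, instead of invalidating every node's cache.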
Business Scheduling Optimizations
Join runtime filters (JRF), built from the small side of a join and pushed down to the probe‑side scan of the large table.
Dynamic partition pruning based on runtime filters.
Dynamic priority scheduling to prevent query starvation.
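The runtime‑filter idea can be sketched in a few lines: collect the join keys from the small (build) side, then apply them while scanning the large table so non‑matching rows never reach the join operator. A plain set stands in here for what would typically be an IN‑set or Bloom filter; the function names are illustrative.

```python
def build_runtime_filter(build_side_keys):
    """Collect join keys from the small (build) side of a hash join.
    Sketch: a set stands in for an IN-set or Bloom filter."""
    return set(build_side_keys)

def probe_scan(rows, key_col, rf):
    """Apply the runtime filter during the scan of the large table, so
    rows that cannot match are dropped before the join."""
    return [row for row in rows if row[key_col] in rf]

rf = build_runtime_filter([10, 20])
big = [{"id": 10, "v": "a"}, {"id": 30, "v": "b"}, {"id": 20, "v": "c"}]
print(probe_scan(big, "id", rf))   # rows with id 30 never reach the join
```

Dynamic partition pruning is the same mechanism one level up: when the runtime filter covers a partition column, entire partitions of the large table can be skipped at scan time.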
Statistics Optimizations
Collecting statistics from Hive Metastore, Iceberg metadata tables, and JDBC system tables to guide join order, execution strategy, and data‑skew handling.
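As a toy illustration of how such statistics guide join order, the sketch below orders inputs by estimated row count so the smallest tables are joined first, keeping intermediate results small. This is a deliberately simplified heuristic under assumed numbers, not the cost model of Doris's optimizer.

```python
def greedy_join_order(table_stats):
    """Toy heuristic: given row-count estimates (as collected from a
    Hive Metastore, Iceberg metadata tables, or JDBC system tables),
    join the smallest estimated inputs first."""
    return [name for name, rows in sorted(table_stats.items(),
                                          key=lambda kv: kv[1])]

# Hypothetical row counts for illustration only.
stats = {"orders": 10_000_000, "customers": 200_000, "regions": 25}
print(greedy_join_order(stats))   # → ['regions', 'customers', 'orders']
```

Accurate counts matter: if `customers` were badly underestimated, the same heuristic could pick a build side too large to fit in memory, which is exactly the data‑skew and strategy risk the statistics collection is meant to avoid.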
Other Optimizations
SIMD for fixed‑length fields, reducing virtual function calls, and optimizing nullable handling.
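The nullable‑handling idea can be shown in miniature: compute the result for every slot unconditionally, then combine the null bitmaps once, instead of branching on null per row. The Python below only models the control flow; in the engine the same pattern runs over SIMD‑friendly fixed‑length columns.

```python
def add_nullable(a, b, null_a, null_b):
    """Branch-light nullable addition, as a sketch: values are computed
    for all slots, and nullness is resolved by OR-ing the bitmaps,
    avoiding a per-row null branch in the hot loop."""
    values = [x + y for x, y in zip(a, b)]            # unconditional compute
    nulls = [na or nb for na, nb in zip(null_a, null_b)]
    return values, nulls

vals, nulls = add_nullable([1, 2, 3], [10, 20, 30],
                           [False, True, False], [False, False, True])
print(vals, nulls)   # [11, 22, 33] [False, True, True]
```

The slots computed for null rows are garbage that is never read, which is the usual trade: a little wasted arithmetic in exchange for a branch‑free, vectorizable loop.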
Performance Results
In batch processing (TPC‑DS, 10 TB), Doris outperforms Spark by roughly 30%. In interactive analytics (TPC‑DS, 1 TB), it is up to 3× faster than Trino on Iceberg tables and 10× faster on Doris‑native tables. Real‑world deployments report up to a 40% reduction in 95th‑percentile latency alongside significant CPU savings.
Future Roadmap
Support for writing to external lake formats (Paimon, Iceberg rewrite, snapshots).
Extended format support (Iceberg V3, Variant, Geo types).
Integration with open data catalogs (Gravitino, Unity Catalog, Polaris).
Q&A Highlights
Answers covered Doris’s write capabilities to Hive/Iceberg, current limitations for CDC streaming, the role of materialized views in balancing freshness versus performance, and upcoming enhancements for write‑path optimizations.
Conclusion
The presentation demonstrates how Apache Doris’s lakehouse‑in‑one design delivers unified, high‑performance analytics while addressing the traditional trade‑offs between data freshness and query speed, and outlines a clear path for future feature expansion.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.