An Overview of Apache Doris: Architecture, Key Technologies, and Real‑World Use Cases
This article introduces Apache Doris, a massively parallel processing (MPP) distributed database. It covers Doris's background, its core architecture, its key technologies (reliability, ease of operations, MySQL compatibility, and materialized views, among others), and real‑world applications at Baidu and beyond.
Doris is a distributed, MPP‑based database designed for interactive analytical queries, originally developed by Baidu and later contributed to the Apache community.
It targets PB‑scale structured data, delivering query latencies in seconds or milliseconds, and is used internally by over 200 Baidu products on more than 1,000 machines.
Use Cases: Baidu Statistics (450 M sites, 2 TB of daily ingest, 30 ms query latency) and Baidu Cloud services (TB‑level data, multi‑dimensional analysis).
Architecture: Similar to TiDB, Doris separates a front‑end (FE) layer from a back‑end (BE) layer. FE nodes speak the MySQL protocol and handle metadata management and query parsing; BE nodes, written in C++, handle storage and execution. Each table is split into tablets, and each tablet keeps multiple replicas on different BE nodes for high availability.
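To make the tablet and replica idea concrete, here is a minimal Python sketch. It is not Doris's actual placement logic, which the article does not describe: the function names `tablet_for_key` and `place_replicas`, the MD5-based hashing, and the round-robin placement are all assumptions chosen for illustration.

```python
import hashlib


def tablet_for_key(key: str, num_tablets: int) -> int:
    """Map a row key to a tablet index with a stable hash (illustrative choice)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_tablets


def place_replicas(tablet_id: int, backends: list, replication: int) -> list:
    """Pick `replication` distinct BE nodes for a tablet.

    Hypothetical round-robin placement: start at tablet_id modulo the number
    of backends and take consecutive nodes, so no replica shares a node.
    """
    if replication > len(backends):
        raise ValueError("not enough BE nodes for the requested replica count")
    start = tablet_id % len(backends)
    return [backends[(start + i) % len(backends)] for i in range(replication)]


# Example: 4 BE nodes, 3 replicas per tablet (node names are made up).
BACKENDS = ["be1", "be2", "be3", "be4"]
REPLICAS_OF_TABLET_5 = place_replicas(5, BACKENDS, 3)
```

Because every replica of a tablet sits on a different node, losing a single BE in this toy model still leaves two copies of each tablet available.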
Key Technologies:
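Reliability: FE metadata is held in memory and persisted through checkpoints and a journal; the journal is replicated with BDBJE (Berkeley DB Java Edition), which provides HA and data safety.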
Easy Operations: No external dependencies, online schema change, automatic replica balancing, built‑in monitoring via Prometheus/Grafana.
MySQL Compatibility: Supports MySQL network protocol and syntax, allowing use of standard MySQL clients and tools.
MPP Support: Queries are automatically parallelized across nodes; for example, the following SQL is split and executed in parallel: SELECT k1, SUM(v1) FROM A, B WHERE A.k2 = B.k2 GROUP BY k1 ORDER BY SUM(v1)
Data Model: Key‑Value columnar storage with globally ordered keys, enabling fast lookups and efficient aggregation.
Materialized Views: Transparent to users, providing pre‑aggregated data while maintaining atomic updates.
Two‑Level Partitioning & Tiered Storage: Separates hot and cold data across SSD and SATA, simplifying scaling and shard adjustments.
Elasticsearch Integration: Allows creation of external ES tables, supporting full‑text search and multi‑table joins.
Kafka Loading: Directly subscribes to Kafka streams for real‑time ingestion with low latency and automatic partition handling.
Additional Features: Atomicity, multi‑disk support, vectorized execution, UDFs, built‑in HLL type for UV estimation.
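The MPP item above says the example query is split and run in parallel. One common way to do this is a shuffle join followed by two-phase aggregation, sketched below in plain Python. This is an assumption about the general technique, not a description of Doris's actual query plan. The sketch also assumes that `k1` and `v1` come from table `A`, that both tables are hash-partitioned on `k2`, and that the "nodes" are simulated one after another in a single process; all function names and sample rows are hypothetical.

```python
from collections import defaultdict


def hash_partition(rows, key, num_nodes):
    """Shuffle step: send each row to a node chosen by hashing the join key."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts


def local_join_and_aggregate(a_part, b_part):
    """Per-node work: join on k2, then compute partial SUM(v1) per k1."""
    b_counts = defaultdict(int)
    for row in b_part:
        b_counts[row["k2"]] += 1
    partial = defaultdict(int)
    for row in a_part:
        matches = b_counts.get(row["k2"], 0)
        if matches:
            # Each matching B row produces one joined row carrying v1.
            partial[row["k1"]] += row["v1"] * matches
    return dict(partial)


def merge_and_sort(partials):
    """Coordinator work: merge partial sums, then ORDER BY SUM(v1)."""
    total = defaultdict(int)
    for partial in partials:
        for k1, subtotal in partial.items():
            total[k1] += subtotal
    return sorted(total.items(), key=lambda kv: (kv[1], kv[0]))


def run_query(table_a, table_b, num_nodes=3):
    a_parts = hash_partition(table_a, "k2", num_nodes)
    b_parts = hash_partition(table_b, "k2", num_nodes)
    # In a real cluster these per-node calls would run concurrently.
    partials = [local_join_and_aggregate(a, b) for a, b in zip(a_parts, b_parts)]
    return merge_and_sort(partials)


# Made-up sample data for illustration only.
TABLE_A = [
    {"k1": "x", "k2": 1, "v1": 10},
    {"k1": "y", "k2": 2, "v1": 5},
    {"k1": "x", "k2": 3, "v1": 7},
    {"k1": "y", "k2": 4, "v1": 1},
]
TABLE_B = [{"k2": 1}, {"k2": 2}, {"k2": 2}, {"k2": 3}]
```

Calling `run_query(TABLE_A, TABLE_B)` gives the same result for any node count, because rows with equal `k2` always land on the same node and the partial sums are merged before sorting.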
Overall, Doris combines MPP performance with MySQL compatibility, making it suitable for both reporting and multi‑dimensional analytical workloads in large‑scale data environments.
DataFunTalk