
An Overview of Apache Doris: Minimal Architecture, Simplicity, Rich Features, and Open‑Source Design

Apache Doris is an open-source MPP OLAP database that combines a minimalist architecture and ease of use with rich features such as partition-bucket pruning, materialized views, and bitmap indexes, providing high-performance, scalable, and reliable data warehousing for big-data analytics.

DataFunTalk

Jun 25, 2023


Doris is an MPP-based SQL analytical database system that delivers millisecond-level query response in massive OLAP scenarios. Its architecture originates from Apache Impala and Google Mesa; extensive refactoring and optimization have since turned it into an elegant, high-performance, feature-rich, and easy-to-use OLAP database.

Figure 1 Doris technical decomposition diagram

Architecturally, Doris has only two types of processes: FE (Frontend), the management node that handles user request entry, query-plan parsing, metadata storage, and cluster management; and BE (Backend), which handles data storage and query-plan execution. Both process types are horizontally scalable, and Doris does not depend on any third-party systems such as HDFS or ZooKeeper, which greatly reduces operational costs.

FE nodes include three roles: Leader, Follower, and Observer. Only one Leader is allowed per cluster, while multiple Followers and Observers can exist. The Leader and Followers form a Paxos group; if the Leader fails, the remaining Followers automatically elect a new Leader, ensuring high availability. Observers sync data from the Leader but do not participate in elections.
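The availability claim can be made concrete with a little arithmetic. The sketch below assumes the usual majority-quorum rule for Paxos-style groups (the article does not spell out the rule) and counts only the Leader and Followers as voters, since Observers do not take part in elections. The function names are illustrative and are not Doris APIs.

```python
def quorum(voters: int) -> int:
    """Smallest majority of a group with `voters` election-capable FE nodes."""
    if voters < 1:
        raise ValueError("need at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters may fail while a majority can still elect a Leader."""
    return voters - quorum(voters)


if __name__ == "__main__":
    # Leader + Followers only; Observers are not counted (assumption stated above).
    for n in (1, 3, 5):
        print(f"{n} voters: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

Under this assumption, an even number of voters adds no extra fault tolerance over the next lower odd number, which is why odd-sized Leader/Follower groups are the common choice.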

FE modules consist of Store Manager, State Store, Coordinator, StoreMeta, and StoreMeta Cache. Store Manager handles all metadata (databases, tables, tablets, replicas) as well as user authentication, authorization, and data import tasks. State Store tracks BE liveness and query load, providing a publish‑subscribe interface. Coordinator receives user requests, parses statements, generates execution plans, and schedules them based on cluster state. StoreMeta manages metadata read/write (only the FE Leader has write permission), and StoreMeta Cache synchronizes metadata for Followers and Observers.

There is no fixed limit on the number of BE nodes, and all BEs are peers. In a sufficiently large cluster, some BEs can go offline without affecting service. Each BE comprises a Store Engine and a Query Executor. The Store Engine manages local tablet data, handles replica transmission, and periodically merges versions to reduce storage usage. It also serves read and bulk-import requests from the Query Executor. When a query runs on the MPP cluster, it is broken into a tree of PlanFragments; each fragment is assigned to a BE's Query Executor for execution.

02 Simplicity

Doris not only has a simple architecture but is also extremely easy to develop and use. For an OLAP database, performance alone is insufficient; usability determines long‑term adoption. Doris was designed from the start with user friendliness in mind, covering the entire data‑analysis lifecycle: data modeling, data import, user analysis, and ongoing maintenance.

In data modeling, Doris supports Aggregate, Unique, and Duplicate models, satisfying diverse OLAP scenarios. Table creation statements extend MySQL syntax with distributed‑system features such as distribution keys and bucket numbers, making them intuitive for users familiar with MySQL.
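The difference between the three models is easiest to see by loading the same rows into each. The following is a toy Python model, not Doris code: it assumes a single key column and a single value column, and it uses SUM as the aggregation, which is only one of the possible choices. All function names are hypothetical.

```python
def duplicate_model(rows):
    """Duplicate model: keep every loaded row, even when keys repeat."""
    return list(rows)


def unique_model(rows):
    """Unique model: for each key, the most recently loaded row wins."""
    latest = {}
    for key, value in rows:
        latest[key] = value
    return sorted(latest.items())


def aggregate_model(rows):
    """Aggregate model: rows sharing a key are combined (SUM assumed here)."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())


if __name__ == "__main__":
    loaded = [("alice", 10), ("bob", 5), ("alice", 7)]
    print(duplicate_model(loaded))
    print(unique_model(loaded))
    print(aggregate_model(loaded))
```

In this sketch the Duplicate model keeps all three rows, the Unique model keeps one row per key, and the Aggregate model keeps one row per key with the values summed.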

For data import, Doris offers multiple methods (see Figure 2). Users can choose the appropriate source and benefit from atomicity guarantees. Both batch imports via Broker Load and single‑row inserts are transactional, ensuring all rows in a batch become visible atomically.

Figure 2 Doris data import options

Each import job carries a unique Label, which the database uses to guarantee at-most-once semantics: reusing a Label results in an error. Combined with at-least-once delivery from the upstream system, this yields exactly-once data ingestion (see Figure 3).

Figure 3 Doris data import workflow
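The way Labels turn at-least-once delivery into exactly-once ingestion can be sketched in a few lines. This is a simplified in-memory model under stated assumptions: `ToyTable`, `LabelAlreadyUsed`, and `deliver_at_least_once` are hypothetical names, and the sketch only tracks Labels of successful loads.

```python
class LabelAlreadyUsed(Exception):
    """Raised when a load is submitted with a Label that was already accepted."""


class ToyTable:
    def __init__(self):
        self.rows = []
        self.used_labels = set()

    def load(self, label, batch):
        # At-most-once: a Label that already succeeded is rejected.
        if label in self.used_labels:
            raise LabelAlreadyUsed(label)
        # The whole batch becomes visible together, then the Label is recorded.
        self.rows.extend(batch)
        self.used_labels.add(label)


def deliver_at_least_once(table, label, batch, sends=3):
    """Simulate a sender that resends the same batch because it never saw an ack.

    Returns how many of the sends were actually accepted by the table.
    """
    accepted = 0
    for _ in range(sends):
        try:
            table.load(label, batch)
            accepted += 1
        except LabelAlreadyUsed:
            pass  # Already ingested; the sender treats this as success.
    return accepted


if __name__ == "__main__":
    table = ToyTable()
    print(deliver_at_least_once(table, "orders-batch-0001", [1, 2, 3]))
    print(table.rows)
```

However many times the sender retries, the batch lands in the table once, provided the sender reuses the same Label for the same batch.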

In SQL development, Doris supports standard SQL with MySQL compatibility, handling simple aggregations to complex joins, subqueries, and window functions. It excels at high‑throughput ad‑hoc queries and in‑database ETL, and can replace offline systems like Hive for TB‑scale workloads. Doris also supports advanced syntax such as GROUPING SETS and extensibility via UDF/UDAF.

Tool integration is seamless: the FE module implements the MySQL protocol, allowing connections via MySQL clients and popular IDEs (DBeaver, DataGrip, Navicat). JDBC/ODBC interfaces enable use from C, Python, Java, Shell, etc. BI tools (FineReport, GuanYuan, YongHong, Tableau) and ETL platforms (Kettle, DolphinScheduler) are also supported.

For cluster reliability, Doris stores metadata in memory, persists it through checkpoints and image logs, and replicates it using BDBJE (Berkeley DB Java Edition), whose Paxos-like protocol provides high availability. It manages multiple replicas and automatic repair internally, ensuring data durability even when some servers fail. Deployments require only the BE and FE modules, with no external dependencies.

Scaling and upgrading are straightforward: adding or removing a node takes a single SQL command, and Doris automatically rebalances data without service interruption. Upgrades involve replacing binaries and performing rolling restarts; the system is forward-compatible and supports gradual (gray-release) upgrades.

03 Rich Features

Doris offers a wealth of capabilities to suit various application scenarios. Notably, it provides partition‑bucket pruning: data can be partitioned by range or list and further bucketed by hash, enabling queries to touch only a few tablets and dramatically improving concurrency (see Figure 4).

Figure 4 Doris data distribution example
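Two-level pruning can be illustrated with a small model. The sketch below assumes three monthly range partitions, four hash buckets, and CRC32 as the bucketing hash; these are illustrative choices, not a description of the hash or layout Doris uses internally. A tablet is modeled as a (partition, bucket) pair.

```python
import zlib
from datetime import date

# Hypothetical layout: range partitions as [start, end) date intervals.
PARTITIONS = {
    "p202301": (date(2023, 1, 1), date(2023, 2, 1)),
    "p202302": (date(2023, 2, 1), date(2023, 3, 1)),
    "p202303": (date(2023, 3, 1), date(2023, 4, 1)),
}
BUCKETS = 4


def bucket_of(user_id: int) -> int:
    """Map a distribution-key value to a bucket (CRC32 is an assumption)."""
    return zlib.crc32(str(user_id).encode()) % BUCKETS


def tablets_to_scan(day=None, user_id=None):
    """Return the tablets a query with these equality filters has to read."""
    partitions = [
        name for name, (start, end) in PARTITIONS.items()
        if day is None or start <= day < end
    ]
    buckets = range(BUCKETS) if user_id is None else [bucket_of(user_id)]
    return [(p, b) for p in partitions for b in buckets]


if __name__ == "__main__":
    print(len(tablets_to_scan()))                           # no filter
    print(len(tablets_to_scan(day=date(2023, 2, 15))))      # partition pruning
    print(tablets_to_scan(day=date(2023, 2, 15), user_id=42))  # both levels
```

With filters on both the partition column and the distribution key, the query touches one tablet instead of twelve, which is what leaves capacity free for other concurrent queries.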

Doris also supports both SQL‑level and partition‑level query caching. SQL‑level caching stores results keyed by the query hash, ideal for infrequently updated but heavily read data. Partition‑level caching intelligently caches results per partition, combining cached and fresh data for subsequent queries, reducing redundant computation.
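A minimal sketch of the partition-level idea follows. It assumes each partition exposes a version number that changes whenever its data changes, and that the query result is a SUM that can be assembled from per-partition partial results. `PartitionCache` and its fields are hypothetical and do not mirror Doris internals.

```python
class PartitionCache:
    def __init__(self, compute):
        self.compute = compute  # function: partition name -> partial result
        self.cache = {}         # (partition, version) -> partial result
        self.computed = []      # log of partitions that were really scanned

    def query(self, versions):
        """Answer a SUM query; `versions` maps partition name -> data version."""
        total = 0
        for partition, version in versions.items():
            key = (partition, version)
            if key not in self.cache:
                # Cache miss: the partition is new or its data changed.
                self.cache[key] = self.compute(partition)
                self.computed.append(partition)
            total += self.cache[key]
        return total


if __name__ == "__main__":
    data = {"06-23": [3, 4], "06-24": [5], "06-25": [1]}
    cache = PartitionCache(lambda p: sum(data[p]))
    print(cache.query({"06-23": 1, "06-24": 1, "06-25": 1}))
    data["06-25"].append(2)  # only today's partition receives new rows
    print(cache.query({"06-23": 1, "06-24": 1, "06-25": 2}))
    print(cache.computed)
```

On the second query only the partition whose version changed is recomputed; the older partitions are served from the cache and merged with the fresh partial result.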

The system includes a Bitmap data type that stores integers as bitmaps, enabling efficient high‑cardinality deduplication and set operations. Functions such as intersect_count() simplify funnel and retention analysis.
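To show why bitmaps suit deduplication and retention analysis, here is a toy version that uses Python integers as bitmaps, with bit i set when user id i is present. It is a conceptual sketch only: the helper names are hypothetical, the `intersect_count` below takes plain bitmaps and does not reproduce the signature of the Doris SQL function of the same name, and production systems use compressed bitmap formats instead of raw integers.

```python
def to_bitmap(user_ids):
    """Build a bitmap with one bit per distinct non-negative integer id."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def cardinality(bitmap):
    """Number of distinct ids in the bitmap (count of set bits)."""
    return bin(bitmap).count("1")


def intersect_count(first, *rest):
    """Count ids present in every bitmap passed in."""
    result = first
    for bitmap in rest:
        result &= bitmap
    return cardinality(result)


if __name__ == "__main__":
    day1 = to_bitmap([1, 2, 3, 5, 8, 8])  # the repeated id is deduplicated
    day2 = to_bitmap([2, 3, 8, 13])
    print(cardinality(day1))             # distinct users on day 1
    print(intersect_count(day1, day2))   # users retained from day 1 to day 2
    print(cardinality(day1 | day2))      # distinct users across both days
```

Retention becomes a bitwise AND followed by a bit count, and distinct counts across days become a bitwise OR, so neither needs to re-scan the detail rows.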

Materialized views are a core feature: pre‑computed result sets are stored as transparent tables, allowing fast queries on fixed dimensions while preserving the ability to query raw detail data. Doris keeps materialized views synchronized with base tables and automatically selects the optimal view during query planning.
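The selection step can be sketched as follows. This toy model assumes SUM as the only aggregate, two pre-built views, and a simple rule that picks the smallest view whose dimensions cover the query's GROUP BY columns; Doris's actual selection logic is more involved, and every name here is hypothetical.

```python
from collections import defaultdict

BASE = [
    {"day": "06-24", "city": "Beijing", "amount": 10},
    {"day": "06-24", "city": "Shanghai", "amount": 20},
    {"day": "06-25", "city": "Beijing", "amount": 5},
    {"day": "06-25", "city": "Beijing", "amount": 7},
]


def aggregate(rows, dims):
    """SUM(amount) grouped by `dims`, computed from detail rows."""
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[d] for d in dims)] += row["amount"]
    return dict(out)


# Pre-computed views, keyed by their dimension columns.
VIEWS = {dims: aggregate(BASE, dims) for dims in [("day",), ("day", "city")]}


def choose_view(group_by):
    """Smallest view covering `group_by`, or None to fall back to the base table."""
    candidates = [dims for dims in VIEWS if set(group_by) <= set(dims)]
    return min(candidates, key=lambda dims: len(VIEWS[dims]), default=None)


def query(group_by):
    dims = choose_view(group_by)
    if dims is None:
        return aggregate(BASE, group_by)
    # Roll the view's rows up to the requested dimensions (valid for SUM).
    out = defaultdict(int)
    for key, amount in VIEWS[dims].items():
        row = dict(zip(dims, key))
        out[tuple(row[d] for d in group_by)] += amount
    return dict(out)


if __name__ == "__main__":
    print(choose_view(("day",)), query(("day",)))
    print(choose_view(("city",)), query(("city",)))
```

The caller always writes the query against the base table; whether a view answers it is decided by `choose_view`, which is the sense in which the views are transparent.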

Doris also supports primary-key-based updates via the Unique model, which uses a Merge-on-Read approach. Partial column updates are possible with the REPLACE_IF_NOT_NULL aggregation type, which keeps the existing value whenever the incoming value is NULL. Features such as Marked Delete and Sequence Column enable reliable synchronization with upstream transactional databases.
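How a sequence column and delete markers interact at read time can be sketched like this. It is a simplified model under stated assumptions: every write is kept as a row version, the version with the highest sequence value wins for each key regardless of arrival order, and a winning version flagged as deleted hides the key. The field names `id`, `seq`, `value`, and `deleted` are hypothetical.

```python
def merge_on_read(versions):
    """Resolve row versions at read time.

    `versions` is a list of dicts in arrival order. Ties on `seq` go to the
    later arrival.
    """
    winners = {}
    for row in versions:
        current = winners.get(row["id"])
        if current is None or row["seq"] >= current["seq"]:
            winners[row["id"]] = row
    return {
        key: row["value"]
        for key, row in winners.items()
        if not row.get("deleted", False)
    }


if __name__ == "__main__":
    versions = [
        {"id": 1, "seq": 2, "value": "shipped"},
        {"id": 1, "seq": 1, "value": "created"},   # arrives late, must not win
        {"id": 2, "seq": 1, "value": "created"},
        {"id": 2, "seq": 2, "value": None, "deleted": True},  # upstream delete
    ]
    print(merge_on_read(versions))
```

Because ordering is decided by the sequence value and not by arrival time, out-of-order delivery from the upstream database does not roll a row back to an older state.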

04 Open Source

Doris is fully open-source under the Apache License 2.0, a widely adopted OSI-approved license that permits free distribution, modification, and commercial use. It requires that the original copyright and license notices be preserved, and it includes an express patent grant from contributors.

About the author: Wang Chunbo, senior big‑data architect, currently a senior data‑warehouse engineer at an internet company, with extensive experience in banking and retail data‑analysis projects, and author of “Efficient Use of Greenplum: From Basics to Data‑Center”.

