Databases 22 min read

Apache Doris Architecture and Common Q&A: Read/Write Flow, Replication Consistency, Storage, and High Availability

This article provides a comprehensive overview of Apache Doris, explaining its frontend and backend nodes, storage structures such as tablets, rowsets, and segments, replication mechanisms, partitioning versus bucketing, indexing types, compaction processes, and high‑availability strategies through a detailed Q&A format.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Apache Doris Architecture and Common Q&A: Read/Write Flow, Replication Consistency, Storage, and High Availability

Apache Doris is a distributed analytical database that separates query processing (frontend, FE) from data storage (backend, BE). This article explains Doris's core concepts, storage layout, replication, indexing, compaction, and high‑availability mechanisms in a question‑and‑answer style.

Terminology

FE : Frontend node, handles client requests, metadata, query planning, and cluster management.

BE : Backend node, responsible for data storage, management, and query execution.

BDBJE : Oracle Berkeley DB Java Edition, used by Doris for persisting metadata logs and FE high‑availability.

Tablet : Physical storage unit of a table; a table is divided into partitions and buckets, each bucket corresponds to a Tablet.

RowSet : A collection of data changes (imports, deletes, updates) for a Tablet; each RowSet contains one or more Segments.

Version : Consists of Start and End fields, representing the version range of a RowSet.

Segment : A data slice inside a RowSet; multiple Segments form a RowSet.

Compaction : Merges consecutive RowSets to reduce file count and improve query performance.

Key / Value columns : Columns designated as KEY (dimensions) or VALUE (metrics) in a table definition.

Data model : Doris supports three models – AGGREGATE, UNIQUE, and DUPLICATE.

Base table : User‑created table that stores raw data.

ROLLUP table : Materialized view built on a Base table, stored independently.

Q1: What is the difference between Partition and Bucket?

Doris uses two‑level data division. The first level, Partition , supports RANGE and LIST partitioning similar to MySQL and is the logical smallest management unit. The second level, Bucket (also called Tablet), supports HASH and RANDOM distribution; each Tablet stores a disjoint set of rows and is the physical smallest storage unit.

Partitions can be omitted; if no partition clause is provided, Doris creates a default transparent partition.

Q2: Why does Doris need Bucketing?

Bucket (Tablet) distribution reduces data skew, enables IO parallelism, and allows different Tablet replicas to be placed on different machines, improving query performance.

Q3: How are physical files stored?

Each import creates a RowSet, which contains multiple Segments. BE stores Segment files under

${storage_root_path}/data/${shard}/${tablet_id}/${schema_hash}/${rowset_id}_${segment_id}.dat

. The path components are: ${shard}: Randomly created sub‑directory under the storage root. ${tablet_id}: Bucket identifier. ${schema_hash}: Version identifier for a table schema. ${rowset_id} and ${segment_id}: RowSet and Segment identifiers.

Segment files are split into a Data Region (column data, 64 KB pages), an Index Region (per‑column indexes), and a Footer (metadata and checksum).

Q4: DML restrictions for different table models

Update : Supported only for UNIQUE model, and only on Value columns.

Delete : For AGGREGATE and UNIQUE models, Delete can only specify conditions on Key columns and also removes data from associated Rollup indexes.

Insert : Allowed for all three models.

Insert implementation

AGGREGATE : New data is appended as a RowSet; queries use Merge on Read to aggregate duplicate keys at read time.

DUPLICATE : Similar to AGGREGATE but without any aggregation during reads.

UNIQUE : Since version 1.2, Doris uses Merge on Write : during insert, existing rows with the same key are marked deleted in a Delete Bitmap, and the new row is written to a new RowSet, making reads return only the latest version.

Update implementation

Update for the UNIQUE model is performed as a Select + Insert workflow: the engine filters rows to be updated, marks the old rows in the Delete Bitmap, and writes the new rows as a new RowSet.

Q5: How is Delete implemented?

Delete also generates a RowSet that records delete conditions in metadata. During Base Compaction, these conditions are merged into the base version. In the UNIQUE model, bulk deletes can be performed via LOAD_DELETE, which writes a delete flag into the data and later compacts it away.

Q6: What indexes does Doris provide?

Doris supports two categories of indexes:

Built‑in intelligent indexes: Prefix Index and ZoneMap Index .

User‑created secondary indexes: Inverted Index , BloomFilter Index , Ngram BloomFilter Index , and Bitmap Index .

All indexes are BE‑local; Doris does not have a global index, although partition and bucket keys act as coarse‑grained global identifiers.

Index storage format

Indexes are stored in the Segment file's Index Region. For example, the Short Key Index (prefix index) stores KeyBytes (index entries) and OffsetBytes (offsets) for every configurable number of rows (default 1024).

How queries hit indexes

Initially a row_bitmap marks rows to be read.

If a prefix index matches, the corresponding row range is intersected with row_bitmap.

Bitmap, BloomFilter, and ZoneMap indexes further intersect matching rows.

After all intersections, the engine reads the relevant data pages for each column.

Q7: How does Compaction work?

Compaction merges adjacent RowSets into a new RowSet, extending the version range ( Start, End) and reducing file count. Two types exist:

Base Compaction : Merges long‑term RowSets.

Cumulative Compaction : Merges recent incremental RowSets, separated by the cumulative_point marker.

Q8: Cross‑cluster data replication

Doris uses a Binlog mechanism. When enabled, FE and BE persist DDL/DML changes to Meta Binlog (ordered log IDs) and Data Binlog (rowset metadata and segment file links). Binlog replay enables data replay and recovery across clusters.

Q9: Write‑path consistency and quorum

Each Tablet has three replicas with no primary/secondary roles. Doris employs a quorum algorithm: a write is considered successful when a majority of replicas for every Tablet acknowledge the write, after which the transaction is marked COMMITTED and later PUBLISHED to become visible.

Q10: FE high‑availability

Metadata is stored using Paxos, BDBJE, and a combination of in‑memory, checkpoint, and journal mechanisms. Only the Leader FE accepts writes, logs them as key‑value pairs in BDBJE, and replicates the logs to Followers/Observers. Periodic checkpoints create new Image files, which Followers pull to stay synchronized. This three‑node setup (Leader, Follower, Observer) provides reliable, low‑latency metadata service.

For more detailed diagrams and code examples, refer to the original article.

big dataIndexingcompactionStorage EngineDatabase ArchitecturereplicationApache Doris
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.