MemTable Optimization and Single‑Replica Load in Apache Doris 2.0
The article explains how Apache Doris 2.0 improves data import performance by redesigning MemTable handling, introducing write‑path optimizations, parallel segment flushing, and a single‑replica load mode that reduces resource consumption and boosts throughput for both single‑ and multi‑concurrent workloads.
Apache Doris uses a storage engine similar to an LSM‑Tree where incoming rows are first written to a MemTable (implemented with a SkipList). When the MemTable is full it is flushed to disk in two stages: converting the row‑store layout to a column‑store layout with per‑column indexes, and then writing the column data into a Segment file.
During import, the BE (backend) module is split into upstream and downstream parts. The upstream BE parses raw data (Scan) and then packages it (Sink) for RPC delivery to the downstream BE. The downstream BE batches rows in MemTable, sorts and aggregates them, and finally flushes the MemTable to a Segment on disk.
Two main inefficiencies were identified:
The upstream‑downstream RPC follows a ping‑pong pattern, so a slow downstream MemTable processing step blocks the next request, reducing data transfer efficiency.
When loading data into a multi‑replica table, each replica repeats the MemTable processing, consuming extra CPU and memory and lowering overall throughput.
In Doris 2.0 the MemTable processing pipeline was optimized: the downstream BE no longer maintains key order on every insert; ordering is deferred until just before the MemTable is flushed. A cache-friendly column-first sort using pdqsort replaces std::sort, improving sort performance and letting the downstream BE acknowledge RPCs promptly instead of blocking the upstream.
Parallel flushing was introduced. Previously, MemTable flush tasks were executed serially to preserve key order across Segments, which limited I/O and CPU utilization. Doris 2.0 assigns Segment IDs when submitting flush tasks, enabling concurrent flushes while preserving correct Segment ordering, and the Rowset builder was updated to handle non‑contiguous Segment IDs.
A new "single‑replica load" mode selects one replica as the primary for computation; only the primary performs sorting and compression, while other replicas simply pull the generated files. This reduces memory usage to one‑third and CPU consumption to roughly half compared with three‑replica loading, and significantly improves throughput under the same resource budget.
Performance tests (TPC‑H SF100 Lineitem) on a 3‑BE cluster (16 CPU / 64 GB each) show that for single‑concurrent Stream Load the overall import speed of Doris 2.0 is 2‑7× faster than version 1.2.6, and with single‑replica load the gain reaches 2‑8×. For multi‑concurrent INSERT INTO workloads, single‑replica load yields about a 50% speed increase.
To enable the new features, set the configuration flag enable_single_replica_load = true and, optionally, the session variable:

SET experimental_enable_single_replica_insert = true;

These changes make Doris 2.0 more efficient for a wide range of import methods, including Stream Load, INSERT INTO, Broker Load, and S3 Load.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert dedicated to sharing big data technology.
