
RaftKeeper v2.1.0: 118% Faster Mixed Workloads and Snapshot Optimizations

The article details how RaftKeeper v2.1.0, a high‑performance distributed consensus service compatible with ZooKeeper, achieves up to 118% throughput gains in mixed read/write scenarios and significant latency reductions through engineering optimizations such as parallel response serialization, list‑request redesign, system‑call pruning, thread‑pool adjustments, and asynchronous snapshot handling.

JD Cloud Developers

Jul 15, 2024


RaftKeeper is a high‑performance distributed consensus service fully compatible with ZooKeeper. It is widely deployed with ClickHouse and other big‑data components such as HBase to overcome ZooKeeper's performance bottlenecks.

Version v2.1.0, released after v2.0.0, introduces several new features—including asynchronous snapshot creation—and delivers notable performance gains: write‑request throughput improves by 11% and mixed read/write workloads see a 118% increase.

1. Performance Optimization Effects

Benchmarks were run with the raftkeeper‑bench tool on a three‑node cluster (16 CPU cores, 32 GB RAM, 100 GB storage per node). The test compared RaftKeeper v2.1.0, RaftKeeper v2.0.4, and ZooKeeper 3.7.1 using default configurations.

Two test groups were defined:

Group 1 measured pure create operations (value size 100 bytes). RaftKeeper v2.1.0 outperformed v2.0.4 by 11% and ZooKeeper by 143%.

Group 2 used a mixed request ratio of create 1%, set 8%, get 45%, list 45%, delete 1% (list results contain 100 child nodes, each 50 bytes; other operations use 100‑byte values). RaftKeeper v2.1.0 achieved a 118% improvement over v2.0.4 and a 198% improvement over ZooKeeper.

RaftKeeper v2.1.0 also showed better average response time (avgRT) and TP99 metrics compared to v2.0.4.

2. Performance Optimizations

1. Parallel Response Serialization

Flame‑graph analysis of a large RaftKeeper cluster revealed that the ResponseThread spent a large portion of CPU time on response serialization. By moving serialization to IO threads and allowing concurrent execution, latency was reduced.

Additionally, the sdallocx_default function (jemalloc memory release) consumed many CPU cycles due to mutex‑protected queues. The fix releases the response memory before calling tryPop:

/// responses_queue is a mutex‑protected sync queue; releasing response_for_session in tryPop adds lock time
responses_queue.tryPop(response_for_session, std::min(max_wait, static_cast<UInt64>(1000)))

After moving the memory release earlier, throughput increased by 31% and average response time decreased by 32% at a concurrency level of 10.
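The idea of freeing the previous response before entering the queue's critical section can be illustrated with a minimal sketch. The `Response`, `SyncQueue`, and `popNext` names below are hypothetical stand-ins rather than RaftKeeper's real types, and the timeout argument of the quoted `tryPop` call is omitted for brevity.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for a response object; not RaftKeeper's real type.
struct Response {
    std::string payload;
};
using ResponsePtr = std::shared_ptr<Response>;

// Minimal mutex-protected queue, loosely modelled on the tryPop call quoted
// above (assumption: non-blocking, no timeout).
class SyncQueue {
public:
    void push(ResponsePtr r) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(std::move(r));
    }

    bool tryPop(ResponsePtr & out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty())
            return false;
        // If `out` still owned the previous response, this assignment would
        // free it right here, while the lock is held.
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::deque<ResponsePtr> items_;
};

// Consumer step: drop the previous response first, so that its deallocation
// happens outside the queue's critical section.
inline bool popNext(SyncQueue & queue, ResponsePtr & current) {
    current.reset();  // free memory before taking the lock
    return queue.tryPop(current);
}
```

With this ordering the lock only covers moving a pointer out of the queue, so other threads pushing responses wait for less time.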

2. List‑Request Optimization

Flame‑graphs showed that request‑processor threads spent almost all CPU time handling List requests, which allocate memory for each string in a std::vector<string>. Two bottlenecks were identified: string memory allocation and vector insertion.

The solution introduced a compact string storage layout using separate data and offset buffers (CompactStrings). After applying this design, CPU usage for List handling dropped from 5.46% to 3.37%, benchmark throughput rose from about 458 k/s to 619 k/s, and TP99 fell from 1.515 ms to 0.381 ms. The two benchmark runs (before, then after):

read requests 14826483, write requests 0, Read RPS: 458433, Read MiB/s: 2441.74, TP99 1.515 msec
read requests 14172371, write requests 0, Read RPS: 619388, Read MiB/s: 3156.67, TP99 0.381 msec
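A data-plus-offsets layout of the kind described above might look like the following sketch. It is an illustration only: `CompactStringsSketch` and its methods are hypothetical and are not claimed to match RaftKeeper's actual CompactStrings class.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Illustrative compact string container: all bytes live in one buffer and
// offsets_[i] records where string i ends.
class CompactStringsSketch {
public:
    // Reserve both buffers up front so that appending does not reallocate.
    void reserve(std::size_t count, std::size_t total_bytes) {
        offsets_.reserve(count);
        data_.reserve(total_bytes);
    }

    void push_back(std::string_view s) {
        data_.insert(data_.end(), s.begin(), s.end());
        offsets_.push_back(data_.size());
    }

    std::size_t size() const { return offsets_.size(); }

    // Returns a view into the shared buffer; valid until the next push_back.
    std::string_view operator[](std::size_t i) const {
        assert(i < offsets_.size());
        const std::size_t begin = (i == 0) ? 0 : offsets_[i - 1];
        return std::string_view(data_.data() + begin, offsets_[i] - begin);
    }

private:
    std::vector<char> data_;            // concatenated string bytes
    std::vector<std::size_t> offsets_;  // end offset of each string
};
```

Compared with a `std::vector<std::string>`, this needs at most two allocations per List response regardless of how many children are returned, which targets both bottlenecks named above.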

3. Reducing Unnecessary System Calls

Profiling with bpftrace revealed excessive getsockname and getsockopt calls originating from log‑trace statements. Removing these calls eliminated a significant amount of kernel‑user context‑switch overhead.

BPFTRACE_MAX_PROBES=1024 bpftrace -p 4179376 -e '
tracepoint:syscalls:sys_enter_* { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_* /@start[tid]/ {
    @time[probe] = sum(nsecs - @start[tid]);
    delete(@start[tid]);
    @cc[probe] = sum(1);
}
interval:s:10 { exit(); }'

4. Thread‑Pool Optimization

Flame‑graph analysis of a read‑write (4:6) benchmark showed that the request‑processor thread spent over 60% of its CPU time waiting on condition variables. By removing the thread‑pool for read requests and processing them in a single thread, TPS increased by 13%.

thread_size,tps,avgRT(microsecond),TP90(microsecond),TP99(microsecond),TP999(microsecond),failRate
200,84416,2407.0,3800.0,4500.0,8300.0,0.0
200,108950,1846.0,3100.0,4000.0,5600.0,0.0

3. Snapshot Optimizations

1. Asynchronous Snapshot

Creating a snapshot on the main thread blocks user requests, especially with large data volumes (e.g., 60 M entries took 180 s). The new async snapshot copies the DataTree on the main thread and serializes the copy in the background, reducing user‑visible blocking time from 180 s to 4.5 s, at the cost of ~50% additional memory.

Further vectorized copying of the DataTree using SSE instructions lowered copy time from 4.5 s to 3.5 s:

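#include <emmintrin.h>  // SSE2 intrinsics: _mm_loadu_si128, _mm_storeu_si128
#include <cstring>      // ::memcpy

inline void memcopy(char * __restrict dst, const char * __restrict src, size_t n)
{
    auto aligned_n = n / 16 * 16;  // bytes that can be copied in 16-byte chunks
    auto left = n - aligned_n;     // remaining tail, fewer than 16 bytes
    while (aligned_n > 0)
    {
        // Copy 16 bytes with one unaligned SSE load and store.
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dst), _mm_loadu_si128(reinterpret_cast<const __m128i *>(src)));
        dst += 16;
        src += 16;
        aligned_n -= 16;
        // Compiler barrier: prevents reordering of memory accesses across iterations.
        __asm__ __volatile__("" : : : "memory");
    }
    ::memcpy(dst, src, left);  // copy the tail
}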
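The copy-then-serialize flow of the asynchronous snapshot can be sketched as follows. This is a simplified illustration under stated assumptions: `DataTreeSketch` is a plain `std::map` standing in for the real DataTree, and the text format produced by `serializeTree` is made up for the example.

```cpp
#include <cassert>
#include <future>
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for the in-memory tree: path -> value.
using DataTreeSketch = std::map<std::string, std::string>;

// Serializes a tree as "path=value\n" lines (illustrative format only).
inline std::string serializeTree(const DataTreeSketch & tree) {
    std::ostringstream out;
    for (const auto & [path, value] : tree)
        out << path << '=' << value << '\n';
    return out.str();
}

// Copies the tree on the calling thread (the only step that blocks requests),
// then serializes the private copy on a background thread.
inline std::future<std::string> snapshotAsync(const DataTreeSketch & live_tree) {
    DataTreeSketch copy = live_tree;  // blocking step, proportional to tree size
    return std::async(std::launch::async,
                      [copy = std::move(copy)] { return serializeTree(copy); });
}
```

Because the background task owns its own copy, writes to the live tree after `snapshotAsync` returns do not affect the snapshot. The copy is also where the extra memory cost mentioned above comes from.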

2. Snapshot Load Speed

Older versions took 180 s to load a 60 M‑entry snapshot on NVMe storage. The second step of loading, which builds parent‑child relationships and previously ran on a single thread, was parallelized across DataTree buckets, cutting load time to 99 s. Subsequent lock, format, and copy optimizations reduced it further to 22 s.
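One way to parallelize the parent-child step across buckets without locks is sketched below. The partitioning scheme (each worker writes only to parents stored in its own bucket) is an assumption chosen for illustration, and `BucketedTreeSketch`, `NodeSketch`, and `buildChildrenParallel` are hypothetical names, not RaftKeeper's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

struct NodeSketch {
    std::set<std::string> children;  // child names, filled in by the build step
};

// Hypothetical bucketed tree: each path lives in bucket hash(path) % N.
struct BucketedTreeSketch {
    explicit BucketedTreeSketch(std::size_t n) : buckets(n) {}

    std::size_t bucketOf(const std::string & path) const {
        return std::hash<std::string>{}(path) % buckets.size();
    }
    void insert(const std::string & path) { buckets[bucketOf(path)][path]; }
    const NodeSketch & at(const std::string & path) const {
        return buckets[bucketOf(path)].at(path);
    }

    std::vector<std::unordered_map<std::string, NodeSketch>> buckets;
};

// Assumes absolute paths such as "/a/b".
inline std::string parentOf(const std::string & path) {
    const auto pos = path.rfind('/');
    return pos == 0 ? "/" : path.substr(0, pos);
}

// Worker b only writes to parents stored in bucket b, so no two threads
// modify the same node and no locks are needed.
inline void buildChildrenParallel(BucketedTreeSketch & tree) {
    std::vector<std::thread> workers;
    for (std::size_t b = 0; b < tree.buckets.size(); ++b) {
        workers.emplace_back([&tree, b] {
            for (const auto & bucket : tree.buckets) {
                for (const auto & entry : bucket) {
                    const std::string & path = entry.first;
                    if (path == "/") continue;  // root has no parent
                    const std::string parent = parentOf(path);
                    if (tree.bucketOf(parent) != b) continue;
                    auto it = tree.buckets[b].find(parent);
                    if (it != tree.buckets[b].end())
                        it->second.children.insert(path.substr(path.rfind('/') + 1));
                }
            }
        });
    }
    for (auto & w : workers) w.join();
}
```

In this sketch every worker scans all paths, which trades extra reads for the absence of locking; a real implementation could pre-group nodes by parent bucket instead.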

4. Production Impact

In a ClickHouse cluster with heavy ZooKeeper traffic (≈170 k QPS, mostly List requests), upgrading from ZooKeeper to RaftKeeper v2.0.4 reduced performance, but RaftKeeper v2.1.0 delivered substantial gains, confirming the effectiveness of the optimizations.
