How We Cut etcd’s Storage Latency from Seconds to Milliseconds at 100 GB Scale
This article explains why Alibaba needed to boost etcd's storage capacity, analyzes the bottleneck caused by BoltDB's freelist spill operation, presents a hash-based redesign that reduces allocation from O(n) and release from O(n log n) to O(1), and shows benchmark results confirming dramatic latency improvements for large-scale deployments.
Overview
etcd is an open-source distributed key-value store that Kubernetes uses as the backing store for all of its cluster metadata. Alibaba's large-scale clusters required storage capacities beyond the limits of the original etcd implementation, prompting a performance-focused redesign of the underlying BoltDB storage engine.
Optimization Background
Initially Alibaba wrapped etcd with an etcd‑proxy that off‑loaded data to Tair (a Redis‑like store). This solved the capacity problem but introduced high latency and operational overhead, leading the team to investigate the root cause of the storage bottleneck.
etcd Internal Storage Working Principle
etcd stores data in two layers: an in‑memory B‑tree index and a persistent BoltDB disk layer. BoltDB is a lightweight (<3 KLOC) embedded key/value database that maps a single file into memory via mmap. Data is stored in fixed‑size pages (default 4 KB). When a key is deleted, the page is not returned to the OS; instead it is placed in a freelist for later reuse.
The freelist holds free page IDs in an array f.ids. Allocation scans this array linearly to find a contiguous block of pages, and after a block is found the algorithm copies array elements (via copy) to remove the allocated pages. Both the linear scan and the copy become extremely slow when the freelist contains many small fragments.
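For reference, here is a minimal sketch of that original array-based freelist; the field names follow bbolt's freelist.go, and unrelated bookkeeping fields are omitted:

type pgid uint64
type txid uint64

type freelist struct {
    ids     []pgid          // sorted IDs of all free, reusable pages
    pending map[txid][]pgid // pages freed per transaction, not yet reusable
    cache   map[pgid]bool   // fast membership test: is this page free?
}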
Problem Analysis
Stress tests showed that once etcd stored more than ~40 GB, a compaction triggered a dramatic increase in the spill operation latency—from ~1 ms to >8 s—causing read/write operations to become hundreds of times slower. Profiling identified the linear scan in arrayAllocate and the subsequent array copy as the primary performance culprits.
// Simplified from bbolt's freelist.go; transaction bookkeeping is omitted.
func (f *freelist) arrayAllocate(txid txid, n int) pgid {
    var initial, previd pgid
    for i, id := range f.ids { // O(n) linear scan of the sorted free-page IDs
        if previd == 0 || id-previd != 1 {
            initial = id // contiguous run broken; restart it at this page
        }
        if (id-initial)+1 == pgid(n) { // found a run of n contiguous pages
            copy(f.ids[i-n+1:], f.ids[i+1:]) // O(n) copy to close the gap
            f.ids = f.ids[:len(f.ids)-n]
            return initial
        }
        previd = id
    }
    return 0 // no contiguous block of size n; the caller grows the file
}
Optimization Scheme
Inspired by memory allocators such as tcmalloc, the team replaced the naive array freelist with a hash-based structure built from three maps (an allocation sketch follows the list):
freemaps: map[uint64]pidSet – key is the span size; value is the set of start page IDs of the free spans of that size.
forwardMap: map[pgid]uint64 – maps a span's start page ID to its size.
backwardMap: map[pgid]uint64 – maps a span's end page ID to its size.
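To make the O(1) claim concrete, here is a simplified sketch of allocation from this structure, modeled on bbolt's hashmapAllocate; transaction bookkeeping and cache maintenance are omitted, and the addSpan/delSpan helpers it calls are sketched further below:

func (f *freelist) hashmapAllocate(txid txid, n int) pgid {
    // Fast path: a span of exactly n pages exists, so take any one of them.
    if bm, ok := f.freemaps[uint64(n)]; ok {
        for pid := range bm {
            f.delSpan(pid, uint64(n))
            return pid
        }
    }
    // Otherwise split the first larger span found and re-add the remainder.
    for size, bm := range f.freemaps {
        if size < uint64(n) {
            continue
        }
        for pid := range bm {
            f.delSpan(pid, size)
            f.addSpan(pid+pgid(n), size-uint64(n))
            return pid
        }
    }
    return 0 // no span is large enough; the caller grows the file instead
}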
This design enables O(1) allocation and O(1) release. The new mergeWithExistingSpan function merges a freed page with its adjacent spans via the forward and backward maps, eliminating the costly O(n log n) sort-and-merge step of the array implementation.
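The merge relies on two small helpers, addSpan and delSpan, that keep all three maps consistent with a constant number of map operations. A minimal sketch, assuming pidSet is a set of page IDs (this follows the shape of bbolt's hashmap freelist):

type pidSet map[pgid]struct{} // set of span start page IDs

// addSpan records a free span of size pages starting at start in all three
// maps; delSpan removes one such span.
func (f *freelist) addSpan(start pgid, size uint64) {
    f.backwardMap[start-1+pgid(size)] = size // index by the span's last page
    f.forwardMap[start] = size               // index by the span's first page
    if _, ok := f.freemaps[size]; !ok {
        f.freemaps[size] = make(pidSet)
    }
    f.freemaps[size][start] = struct{}{}
}

func (f *freelist) delSpan(start pgid, size uint64) {
    delete(f.forwardMap, start)
    delete(f.backwardMap, start+pgid(size-1))
    delete(f.freemaps[size], start)
    if len(f.freemaps[size]) == 0 {
        delete(f.freemaps, size) // drop empty size buckets
    }
}

With these in place, the merge performed when a page is freed looks like this: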
func (f *freelist) mergeWithExistingSpan(pid pgid) {
    prev := pid - 1
    next := pid + 1

    preSize, mergeWithPrev := f.backwardMap[prev] // span ending just before pid?
    nextSize, mergeWithNext := f.forwardMap[next] // span starting just after pid?
    newStart := pid
    newSize := uint64(1)

    if mergeWithPrev {
        // Absorb the preceding span by extending our start backwards.
        start := prev + 1 - pgid(preSize)
        f.delSpan(start, preSize)
        newStart -= pgid(preSize)
        newSize += preSize
    }
    if mergeWithNext {
        // Absorb the following span by extending our size forwards.
        f.delSpan(next, nextSize)
        newSize += nextSize
    }
    f.addSpan(newStart, newSize)
}
Actual Optimization Effect
Benchmarks compared three setups: the original etcd, the etcd-proxy + Tair solution, and the new hash-based freelist. In a simulated workload of 100 concurrent clients issuing 1 million random puts (≈20-30 GB of data in total), the new algorithm cut client-side latency from several seconds to millisecond levels. In larger-scale tests (≈100 GB of data) the improvement was even more pronounced, effectively making a 100 GB store behave like a 2 GB one. All tests ran on a single etcd node to isolate storage effects; network latency was excluded.
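For readers who want to approximate this workload, here is a hypothetical Go sketch of the put side using etcd's clientv3 API; the endpoint, key space, value size, and concurrency are illustrative stand-ins for the original benchmark harness, not its exact configuration:

package main

import (
    "context"
    "fmt"
    "math/rand"
    "sync"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"127.0.0.1:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    val := string(make([]byte, 25600)) // ~25 KB values grow the store quickly
    var wg sync.WaitGroup
    for c := 0; c < 100; c++ { // 100 concurrent clients
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < 10000; i++ { // 100 x 10,000 = 1M puts in total
                key := fmt.Sprintf("key-%d", rand.Intn(1000000))
                if _, err := cli.Put(context.Background(), key, val); err != nil {
                    fmt.Println("put error:", err)
                    return
                }
            }
        }()
    }
    wg.Wait()
}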
Summary
The redesign transformed BoltDB’s freelist allocation from O(n) to O(1) and release from O(n log n) to O(1), eliminating the spill‑operation bottleneck that crippled etcd at massive data volumes. The change is fully backward‑compatible, requires no data migration, and has been merged upstream (see https://github.com/etcd-io/bbolt/pull/141), benefiting the broader etcd and BoltDB community.