How Alibaba’s Tair MDB Leverages NVM to Cut Cache Costs and Boost Performance
This article details Alibaba’s first production use of non‑volatile memory in the Tair MDB cache service, covering deployment, performance challenges such as write imbalance and lock overhead, the optimizations applied, and design guidelines for building NVM‑based caching systems.
Introduction
This article introduces the first production use of non‑volatile memory (NVM) in Alibaba Group’s environment, describing the online deployment, the problems encountered when using NVM, the optimization process, and finally summarizing design guidelines for building cache services based on NVM.
Background
Tair MDB is a widely used cache service within Alibaba’s ecosystem. It uses NVM alongside DRAM as a back‑end storage medium to extend memory capacity. It has run in production since the 618 shopping festival, has been through two full‑link stress tests, and has operated stably.
During NVM usage, Tair MDB faced issues such as write imbalance and lock overhead, which were mitigated through optimization, yielding significant improvements.
The engineering team distilled design principles for implementing cache services on NVM/PMEM, which can guide other products that wish to leverage NVM.
Production Environment
Effect
With the same software version, end‑to‑end read/write latency on NVM is comparable to DRAM, and service behavior remains normal. Production load has not yet reached a node’s limit; later sections discuss the issues found during stress testing and how they were solved.
Cost
Because a single NVM DIMM offers much larger capacity than a DRAM DIMM, the same capacity costs less. Using NVM to supplement memory capacity can dramatically reduce cluster size, lowering overall cost by roughly 30%–50% when accounting for hardware, electricity, and rack expenses.
Principle
Usage Method
Tair MDB uses the NVM devices as block devices, formatted with a pmem‑aware file system that is mounted in DAX mode. To allocate space, it creates and opens a file under the file‑system path and reserves the space with posix_fallocate.
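The create, reserve, and map sequence described here can be sketched as follows. This is a minimal Python illustration, not Tair MDB's code: the function name map_pmem_file and the file name are hypothetical, and the example maps a file in a temporary directory, so it exercises the same calls without needing a real DAX mount.

```python
import mmap
import os
import tempfile

def map_pmem_file(path, size):
    """Create (or open) a file at `path`, reserve `size` bytes, and map it."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Reserve the blocks up front so later stores cannot run out of space.
        os.posix_fallocate(fd, 0, size)
        # Shared, read/write mapping. On a DAX mount the kernel can map the
        # media directly instead of going through the page cache.
        return mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        # The mapping remains valid after the descriptor is closed.
        os.close(fd)

# Usage on an ordinary temporary directory (a stand-in for the DAX mount).
with tempfile.TemporaryDirectory() as mount_dir:
    region = map_pmem_file(os.path.join(mount_dir, "mdb.pool"), 4 << 20)
    region[0:4] = b"tair"   # plain stores into the mapped region
    region.close()
```

On a real deployment the path would point into the DAX mount; everything after the mapping step is ordinary memory access.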
Memory Allocator
Although NVM is non‑volatile, Tair MDB is a cache and treats it as a volatile device. This removes the need for explicit cache‑line flushes and crash‑recovery handling.
When using DRAM, allocators such as tcmalloc or jemalloc are available. For NVM, the space is exposed as a file, so a suitable allocator is required. The open‑source libmemkind library provides a malloc‑like API for persistent memory.
Tair MDB does not use libmemkind. Instead, it employs a slab‑based memory layout, allocating a large block at startup and managing metadata, data pages, and other structures within that block.
Memory Layout
The slab mechanism divides the pre‑allocated memory into several parts:
Cache Meta – stores metadata such as maximum shard count and slab manager indexes.
Slab Manager – manages fixed‑size slabs.
Hashmap – a global hash table that resolves collisions with linear collision chains (chaining).
Page Pool – memory pool split into 1 MiB pages; slab managers request pages from this pool.
At startup, Tair MDB initializes all available memory, eliminating the need for dynamic OS allocations during operation. NVM is mmap‑ed to obtain a virtual address space, allowing the internal memory manager to use it transparently without further malloc/free calls.
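The relationship between the page pool and a slab manager can be sketched as follows. This is a simplified Python illustration under stated assumptions, not Tair MDB's implementation: the class names PagePool and SlabManager, the use of integer offsets in place of real addresses, and the 128-byte slot size are all invented for the example. Only the 1 MiB page size comes from the text.

```python
PAGE_SIZE = 1 << 20  # 1 MiB pages, as described above

class PagePool:
    """Hands out page offsets from one pre-allocated region."""

    def __init__(self, region_size, base=0):
        self.free_pages = [base + i * PAGE_SIZE
                           for i in range(region_size // PAGE_SIZE)]

    def take_page(self):
        if not self.free_pages:
            raise MemoryError("page pool exhausted")
        return self.free_pages.pop()

class SlabManager:
    """Serves fixed-size slots carved out of pages taken from the pool."""

    def __init__(self, pool, slot_size):
        self.pool = pool
        self.slot_size = slot_size
        self.free_slots = []

    def alloc(self):
        if not self.free_slots:
            # Out of slots: request one more page and split it.
            page = self.pool.take_page()
            count = PAGE_SIZE // self.slot_size
            self.free_slots = [page + i * self.slot_size for i in range(count)]
        return self.free_slots.pop()

    def free(self, offset):
        self.free_slots.append(offset)

# Usage: a 4 MiB region and one slab class for 128-byte slots (hypothetical).
pool = PagePool(4 * PAGE_SIZE)
slab = SlabManager(pool, slot_size=128)
first = slab.alloc()
second = slab.alloc()
```

Because every slot in a slab class has the same size, freeing and reallocating never fragments a page, and no call into the operating system is needed after startup.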
Stress Test
When using NVM as a DRAM supplement, Tair MDB encountered several issues during stress testing, which are detailed below.
Problem
Testing with 100‑byte entries revealed that read QPS/latency matches DRAM, but write TPS is about one‑third of DRAM.
Analysis
Performance profiling showed that most write overhead stems from lock contention around page writes, likely due to NVM’s higher write latency compared to DRAM.
Monitoring with PCM (the Intel Performance Counter Monitor tool, not phase‑change memory) revealed severe write imbalance on one DIMM, with its write bandwidth roughly twice that of the others.
Placement Strategy
The system uses a 2‑2‑1 placement: four NVM DIMMs per socket, each on a different channel, with interleaving at 4 KiB granularity.
Root Cause of Imbalance
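Because memory is interleaved at 4 KiB granularity, a frequently written region smaller than one interleave unit falls entirely on a single DIMM. Writes concentrated on such a region therefore all land on the same DIMM, which became the overloaded one and produced the observed imbalance.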
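The effect can be reproduced with a little address arithmetic. The sketch below assumes a simple round-robin mapping of 4 KiB chunks onto the four DIMMs of a socket; the real hardware mapping may differ, and the workload numbers are invented for illustration only.

```python
INTERLEAVE = 4096  # 4 KiB interleave granularity, from the text
NUM_DIMMS = 4      # four NVM DIMMs per socket, from the text

def dimm_of(addr):
    """Assumed round-robin mapping from a byte address to a DIMM index."""
    return (addr // INTERLEAVE) % NUM_DIMMS

def writes_per_dimm(write_addrs):
    """Count how many writes land on each DIMM."""
    counts = [0] * NUM_DIMMS
    for addr in write_addrs:
        counts[dimm_of(addr)] += 1
    return counts

# Invented workload: 4,000 data writes, one per 4 KiB chunk, plus 1,000
# metadata updates that all hit the same small field at offset 64.
data_writes = [i * INTERLEAVE for i in range(4000)]
meta_writes = [64] * 1000
print(writes_per_dimm(data_writes + meta_writes))  # [2000, 1000, 1000, 1000]
```

The data writes spread evenly, but the small hot field never leaves DIMM 0, so that DIMM absorbs the entire metadata load.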
Optimization
To eliminate the hotspot, the team first identified the hot region using Intel Pin to instrument memory writes. The hotspot was located in page metadata.
Solutions considered included padding, distributing hot data across DIMMs, and moving the hotspot structures back to DRAM. The final choice was to relocate the slab manager and page info to DRAM.
After this change, write TPS increased from 850,000 to 1.4 million and write latency dropped from 40 µs to 12 µs.
Lock overhead remained high at 1.4 million TPS. Profiling showed pthread_spin_lock consuming significant time, because the writes that initialize a page item were performed while the lock was held.
By moving item initialization out of the critical section and further reducing NVM writes inside locks, the team brought lock overhead back to normal: TPS rose to 1.7 million and write latency fell to 9 µs.
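The restructuring can be illustrated with a small sketch. It is not Tair MDB's code: Python's threading.Lock stands in for pthread_spin_lock, a list stands in for a page's item chain, and the names Page, build_item, put_slow, put_fast, and run_writers are hypothetical. The point is only where the expensive initialization happens relative to the lock.

```python
import threading

class Page:
    def __init__(self):
        self.lock = threading.Lock()  # stand-in for a spin lock
        self.items = []               # stand-in for the page's item chain

def build_item(key, value):
    """Item initialization: on NVM this is the slow, write-heavy step."""
    header = len(key).to_bytes(2, "little") + len(value).to_bytes(4, "little")
    return header + key + value

def put_slow(page, key, value):
    # Before: initialization runs while the lock is held, so every other
    # writer waits for the full duration of the slow writes.
    with page.lock:
        item = build_item(key, value)
        page.items.append(item)

def put_fast(page, key, value):
    # After: initialize first, then hold the lock only to link the item.
    item = build_item(key, value)
    with page.lock:
        page.items.append(item)

def run_writers(page, count):
    """Run `count` concurrent writers that use the short critical section."""
    threads = [threading.Thread(target=put_fast,
                                args=(page, b"key%d" % i, b"value"))
               for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

page = Page()
run_writers(page, 8)
```

Both versions store the same bytes; the second simply keeps the slow part outside the window in which other threads are blocked.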
Design Guidelines
Hardware Characteristics
Higher density and lower cost than DRAM.
Higher latency and lower bandwidth than DRAM.
Read/write imbalance; write latency higher than read.
Wear‑out concerns with frequent writes to the same location.
Guideline A: Avoid Write Hotspots
Separate metadata from data and place metadata in DRAM.
Implement copy‑on‑write at the upper layer to reduce in‑place updates.
Continuously detect hot writes, migrate them to DRAM, and perform write merging.
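One way to picture the detect-and-migrate idea in Guideline A is a per-region write counter. The sketch below is an assumption-laden illustration rather than anything described in the source: the class name HotWriteTracker, the 4 KiB region size, and the fixed threshold are invented, and a real system would also need time windows or decay, plus the actual data movement.

```python
from collections import Counter

class HotWriteTracker:
    """Counts writes per region and promotes hot regions to DRAM."""

    def __init__(self, region_size=4096, threshold=1000):
        self.region_size = region_size
        self.threshold = threshold
        self.counts = Counter()  # writes seen per NVM-resident region
        self.in_dram = set()     # regions already migrated to DRAM

    def record_write(self, addr):
        """Record one write and return the medium that serves it."""
        region = addr // self.region_size
        if region in self.in_dram:
            return "dram"
        self.counts[region] += 1
        if self.counts[region] >= self.threshold:
            # Hot: later writes to this region are served from DRAM.
            self.in_dram.add(region)
            del self.counts[region]
        return "nvm"

# Usage with a deliberately tiny threshold.
tracker = HotWriteTracker(threshold=3)
served = [tracker.record_write(100) for _ in range(5)]
print(served)  # ['nvm', 'nvm', 'nvm', 'dram', 'dram']
```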
Guideline B: Reduce Critical‑Section Access
Because NVM write latency amplifies the impact of locks, designs should minimize lock‑protected NVM operations, favoring lock‑free or RCU‑based approaches.
Guideline C: Use an Appropriate Allocator
Support fragmentation mitigation; avoid in‑place updates and prefer fixed‑size allocations.
Provide thread‑local quotas to reduce global contention.
Be capacity‑aware to allow dynamic scaling of managed space.
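The thread-local quota idea in Guideline C can be sketched as follows. This is a hypothetical illustration, not the allocator Tair MDB uses: the class name QuotaAllocator, the batch size, and the use of slot indexes instead of addresses are assumptions. Each thread refills a private cache in batches, so the global lock is taken once per batch rather than once per allocation.

```python
import threading

class QuotaAllocator:
    """Fixed-size slot allocator with per-thread caches refilled in batches."""

    def __init__(self, total_slots, batch=64):
        self._global_free = list(range(total_slots))
        self._global_lock = threading.Lock()
        self._batch = batch
        self._local = threading.local()
        self.global_lock_acquisitions = 0  # instrumentation for the example

    def alloc(self):
        cache = getattr(self._local, "cache", None)
        if cache is None:
            cache = self._local.cache = []
        if not cache:
            # Local quota used up: take one batch under the global lock.
            with self._global_lock:
                self.global_lock_acquisitions += 1
                taken = self._global_free[-self._batch:]
                del self._global_free[-self._batch:]
            if not taken:
                raise MemoryError("no free slots left")
            cache.extend(taken)
        return cache.pop()

# Usage: 128 allocations from one thread touch the global lock only twice.
allocator = QuotaAllocator(total_slots=256, batch=64)
slots = [allocator.alloc() for _ in range(128)]
print(allocator.global_lock_acquisitions)  # 2
```

A larger batch reduces contention further but leaves more slots stranded in idle threads, which is the trade-off a capacity-aware allocator has to manage.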
Future Work
The team plans to further exploit NVM’s non‑volatile nature, extracting more value from the hardware for cache services and other upper‑layer applications.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
