
How Alibaba’s Tair MDB Leverages NVM to Cut Cache Costs and Boost Performance

This article details Alibaba’s first production use of non‑volatile memory in the Tair MDB cache service, covering deployment, performance challenges such as write imbalance and lock overhead, the optimizations applied, and design guidelines for building NVM‑based caching systems.

Alibaba Cloud Developer

Introduction

This article describes the first production use of non-volatile memory (NVM) at Alibaba Group: the online deployment, the problems encountered when using NVM, and the optimizations applied. It closes with design guidelines for building cache services on NVM.

Background

Tair MDB is a widely used cache service within Alibaba's ecosystem. It uses NVM alongside DRAM as a back-end storage medium to supplement DRAM capacity. It has run in production since the 618 shopping festival, has gone through two full-link stress tests, and has operated stably.

During NVM usage, Tair MDB faced issues such as write imbalance and lock overhead, which were mitigated through optimization, yielding significant improvements.

The engineering team distilled design principles for implementing cache services on NVM/PMEM, which can guide other products that wish to leverage NVM.

Production Environment

Effect

With the same software version, end-to-end read/write latency on NVM is comparable to DRAM, and service behavior is normal. Production load has not yet reached a node's limit; later sections discuss the issues found under stress testing and how they were solved.

Cost

A single NVM DIMM offers much larger capacity than a DRAM DIMM and costs less for the same capacity. Using NVM to supplement memory capacity can dramatically reduce cluster size, lowering overall cost by roughly 30%–50% once hardware, electricity, and rack expenses are accounted for.

Principle

Usage Method

Tair MDB uses NVM as a block device, formatted with a pmem-aware file system and mounted in DAX mode. To allocate NVM space, it creates and opens a file under the mount path and reserves space with posix_fallocate.
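The allocation path above can be sketched in a few lines. This is a minimal illustration in Python, not Tair MDB's code: the function name `map_region`, the file name, and the 4 MiB size are hypothetical, and a temporary directory stands in for the DAX mount point so the sketch runs on any Linux machine.

```python
import mmap
import os
import tempfile

def map_region(path: str, size: int) -> mmap.mmap:
    """Create a file, reserve `size` bytes for it, and map it into memory."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Reserve the space up front so later stores cannot fail for lack of space.
        os.posix_fallocate(fd, 0, size)
        # Shared mapping: on a DAX mount, loads and stores would reach the media
        # without going through the page cache.
        return mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed

# Demo: a temporary directory stands in for the (assumed) DAX mount path.
with tempfile.TemporaryDirectory() as mount_dir:
    region = map_region(os.path.join(mount_dir, "mdb.cache"), 4 * 1024 * 1024)
    region[0:5] = b"hello"      # plain memory stores, no write() system calls
    print(len(region))
    region.close()
```

Once the region is mapped, the rest of the cache can treat it as ordinary memory, which is what lets a custom allocator manage it.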

Memory Allocator

NVM is non-volatile, but because Tair MDB is a cache service it treats NVM as a volatile device. This removes the need for explicit cache-line flushes or crash-recovery handling.

When using DRAM, allocators such as tcmalloc or jemalloc are available. For NVM, the space is exposed as a file, so a suitable allocator is required. The open‑source libmemkind library provides a malloc‑like API for persistent memory.

Tair MDB does not use libmemkind. Instead, it employs a slab‑based memory layout, allocating a large block at startup and managing metadata, data pages, and other structures within that block.

Memory Layout

The slab mechanism divides the pre‑allocated memory into several parts:

Cache Meta – stores metadata such as maximum shard count and slab manager indexes.

Slab Manager – manages fixed‑size slabs.

Hashmap – a global hash table that resolves collisions with linked chains.

Page Pool – memory pool split into 1 MiB pages; slab managers request pages from this pool.

At startup, Tair MDB initializes all available memory, eliminating the need for dynamic OS allocations during operation. NVM is mmap‑ed to obtain a virtual address space, allowing the internal memory manager to use it transparently without further malloc/free calls.
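To make the relationship between the page pool and the slab managers concrete, here is a minimal sketch in Python. It tracks offsets into a pre-allocated region only; the cache metadata and the hash table are omitted. The class names, method names, and the 128-byte slot size in the usage lines are assumptions for illustration, not Tair MDB's actual structures.

```python
PAGE_SIZE = 1 << 20  # 1 MiB pages, as in the layout above

class PagePool:
    """Hands out page offsets from one pre-allocated region."""
    def __init__(self, region_size: int):
        self.free_pages = list(range(0, region_size - PAGE_SIZE + 1, PAGE_SIZE))

    def take_page(self) -> int:
        if not self.free_pages:
            raise MemoryError("page pool exhausted")
        return self.free_pages.pop()

class SlabManager:
    """Carves pages into fixed-size slots and recycles freed slots."""
    def __init__(self, pool: PagePool, slot_size: int):
        self.pool = pool
        self.slot_size = slot_size
        self.free_slots = []

    def alloc(self) -> int:
        if not self.free_slots:
            # Out of slots: request one more page from the pool and split it.
            base = self.pool.take_page()
            per_page = PAGE_SIZE // self.slot_size
            self.free_slots = [base + i * self.slot_size for i in range(per_page)]
        return self.free_slots.pop()

    def free(self, offset: int) -> None:
        self.free_slots.append(offset)

pool = PagePool(4 * PAGE_SIZE)
slab = SlabManager(pool, slot_size=128)
first = slab.alloc()
slab.free(first)
```

Because every slot in a slab has the same size, freeing and reallocating never fragments a page, and no call to the operating system is needed after startup.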

Stress Test

When using NVM as a DRAM supplement, Tair MDB encountered several issues during stress testing, which are detailed below.

Problem

Testing with 100‑byte entries revealed that read QPS/latency matches DRAM, but write TPS is about one‑third of DRAM.

Analysis

Performance profiling showed that most write overhead stems from lock contention around page writes, likely due to NVM’s higher write latency compared to DRAM.

PCM monitoring revealed severe write imbalance on one DIMM, with its write bandwidth roughly twice that of the others.

Placement Strategy

The system uses a 2‑2‑1 placement: four NVM DIMMs per socket, each on a different channel, with interleaving at 4 KiB granularity.

Root Cause of Imbalance

With 4 KiB interleaving, each 4 KiB chunk of the address space maps to a single DIMM. Writes concentrated in a small region therefore all land on the same DIMM, which produced the observed imbalance.
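The effect can be reproduced with simple arithmetic. The sketch below assumes a plain round-robin mapping of 4 KiB chunks onto four DIMMs; real address decoding is more involved, and the write patterns and counts are made up for illustration rather than taken from the production trace.

```python
INTERLEAVE = 4096   # 4 KiB interleave granularity, from the text
DIMMS = 4           # NVM DIMMs per socket, from the text

def dimm_of(addr: int) -> int:
    """Assumed round-robin mapping of an offset to a DIMM index."""
    return (addr // INTERLEAVE) % DIMMS

def writes_per_dimm(write_addrs) -> list:
    counts = [0] * DIMMS
    for addr in write_addrs:
        counts[dimm_of(addr)] += 1
    return counts

# Data writes spread over 1 MiB are balanced across the DIMMs.
data_writes = [i * 256 for i in range(4096)]
# Metadata writes that stay inside one 4 KiB chunk all hit the same DIMM.
meta_writes = [8192 + (i % 2) * 32 for i in range(1024)]

print(writes_per_dimm(data_writes))                # [1024, 1024, 1024, 1024]
print(writes_per_dimm(data_writes + meta_writes))  # [1024, 1024, 2048, 1024]
```

Even a modest stream of writes to one small structure is enough to double the load on a single DIMM while the others stay balanced.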

Optimization

To eliminate the hotspot, the team first identified the hot region using Intel Pin to instrument memory writes. The hotspot was located in page metadata.

Solutions considered included padding, distributing hot data across DIMMs, and moving the hotspot structures back to DRAM. The final choice was to relocate the slab manager and page info to DRAM.

After this change, write TPS increased from 850,000 to 1.4 million and write latency dropped from 40 µs to 12 µs.

Lock overhead remained high at 1.4 million TPS. Profiling showed pthread_spin_lock consuming significant time, because page item initialization writes to NVM were performed while the lock was held.

After item initialization was moved out of the critical section and NVM writes inside locks were reduced further, lock overhead returned to normal. TPS rose to 1.7 million and write latency fell to 9 µs.
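The shape of this change can be sketched as follows. This is an illustrative Python model, not the Tair MDB code: `threading.Lock` stands in for pthread_spin_lock, a Python list stands in for NVM-resident slots, and the `Page`, `put_slow`, and `put_fast` names are hypothetical.

```python
import threading

class Page:
    def __init__(self, slots: int):
        self.lock = threading.Lock()   # stands in for pthread_spin_lock
        self.items = [None] * slots    # stands in for NVM-resident slots
        self.next_free = 0

def put_slow(page: Page, key: bytes, value: bytes) -> int:
    """Before: the item is initialized while the lock is held."""
    with page.lock:
        idx = page.next_free
        page.next_free += 1
        page.items[idx] = (key, value, len(value))  # slow write inside the lock
        return idx

def put_fast(page: Page, key: bytes, value: bytes) -> int:
    """After: only the slot reservation is serialized."""
    with page.lock:
        idx = page.next_free
        page.next_free += 1
    # The slot now belongs to this thread, so it can be filled without the lock.
    page.items[idx] = (key, value, len(value))
    return idx
```

In this sketch the slot must not become visible to readers (for example through the hash table) until after it has been filled; the lock only protects the reservation of the slot index.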

Design Guidelines

Hardware Characteristics

Higher density and lower cost than DRAM.

Higher latency and lower bandwidth than DRAM.

Asymmetric reads and writes; write latency is higher than read latency.

Wear‑out concerns with frequent writes to the same location.

Guideline A: Avoid Write Hotspots

Separate metadata from data and place metadata in DRAM.

Implement copy‑on‑write at the upper layer to reduce in‑place updates.

Continuously detect hot writes, migrate them to DRAM, and perform write merging.
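As a rough illustration of the detection step in Guideline A, the sketch below counts writes per 4 KiB block over a sampling window and reports the blocks that exceed a threshold. The class name, the threshold, and the windowing scheme are all assumptions; the source does not describe how detection would be implemented, and migration to DRAM and write merging are left out.

```python
from collections import Counter

BLOCK = 4096  # track writes per 4 KiB block (assumed to match the interleave size)

class HotWriteDetector:
    def __init__(self, threshold: int):
        self.threshold = threshold   # hypothetical: writes per window that count as hot
        self.counts = Counter()

    def record(self, addr: int) -> None:
        self.counts[addr // BLOCK] += 1

    def end_window(self) -> list:
        """Return the hot block numbers for this window and reset the counters."""
        hot = sorted(b for b, n in self.counts.items() if n >= self.threshold)
        self.counts.clear()
        return hot

detector = HotWriteDetector(threshold=100)
for i in range(500):
    detector.record(8192 + (i % 4) * 16)   # repeated writes inside block 2
    detector.record(i * BLOCK)             # one write each to many blocks
print(detector.end_window())
```

Blocks reported as hot are the candidates for relocation to DRAM, which is the remedy the team applied by hand to the slab manager and page info.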

Guideline B: Reduce Critical‑Section Access

Because NVM write latency amplifies the impact of locks, designs should minimize lock‑protected NVM operations, favoring lock‑free or RCU‑based approaches.

Guideline C: Use an Appropriate Allocator

Support fragmentation mitigation; avoid in‑place updates and prefer fixed‑size allocations.

Provide thread‑local quotas to reduce global contention.

Be capacity‑aware to allow dynamic scaling of managed space.
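The thread-local quota idea in Guideline C can be sketched as a two-level allocator: each thread takes slots from the global pool in batches and then serves allocations from its own batch without locking. The class names and the batch size are hypothetical, and returning freed slots to the pool is omitted for brevity.

```python
import threading

class GlobalPool:
    def __init__(self, total_slots: int):
        self.lock = threading.Lock()
        self.free = list(range(total_slots))

    def take_batch(self, n: int) -> list:
        with self.lock:
            batch, self.free = self.free[:n], self.free[n:]
            return batch

class ThreadLocalAllocator:
    def __init__(self, pool: GlobalPool, quota: int):
        self.pool = pool
        self.quota = quota               # hypothetical batch size per refill
        self.local = threading.local()

    def alloc(self) -> int:
        cache = getattr(self.local, "cache", None)
        if not cache:
            # Refill: one lock acquisition pays for `quota` allocations.
            cache = self.pool.take_batch(self.quota)
            if not cache:
                raise MemoryError("pool exhausted")
            self.local.cache = cache
        return cache.pop()
```

With a quota of 50, a thread touches the global lock once per 50 allocations, which keeps the contended path short regardless of how slow the underlying medium is.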

Future Work

The team plans to further exploit NVM’s non‑volatile nature, extracting more value from the hardware for cache services and other upper‑layer applications.
