
How Alibaba’s Tair Cache Engine Scaled to 500M QPS for Double 11

Alibaba’s Tair, a high‑performance distributed key/value cache, evolved through several versions to carry the massive traffic of Double 11. Multi‑region deployment, hotspot hashing, memory merging, user‑space networking, and client optimizations cut latency, improved scalability, and reduced operating costs.

Alibaba Cloud Developer

Tair Overview

Tair is Alibaba’s high‑performance, distributed, scalable, and highly reliable key/value storage system, widely used across e‑commerce, video, and many other Alibaba business units.

Development Timeline

2010.04 – Tair v1.0 launched in core Taobao systems.

2012.06 – v2.0 introduced LDB persistence.

2012.10 – RDB cache with Redis‑like interface.

2013.03 – Fastdump for bulk import.

2014.07 – v3.0 released with several‑fold performance boost.

2016.11 – Intelligent operations platform for Double 11.

2017.11 – Performance leap, hotspot hashing, resource scheduling for trillion‑scale traffic.

Key Features

High performance: supports up to 5 × 10⁸ QPS during Double 11 with sub‑millisecond latency.

High availability: automatic failover, rate limiting, multi‑zone and multi‑region redundancy.

Scalability: deployed across global data centers and all Alibaba BUs.

Broad business coverage: e‑commerce, Ant Financial, Cainiao, Amap, Alibaba Health, etc.

Typical Use Cases

MDB – cache to reduce backend DB pressure, temporary data storage.

LDB – general KV, transaction snapshots, high‑QPS counters.

RDB – complex data structures such as playlists and live rooms.

FastDump – rapid bulk import for low‑latency online reads.

Double 11 Challenges

Cache traffic grew faster than the transaction peak itself, so scaling at low latency and acceptable cost became a critical challenge. Hotspot keys also became a severe problem, prompting the development of hotspot hashing and of the multi‑region, multi‑unit architecture.

Multi‑Region, Multi‑Unit Architecture

The system spans multiple regions, data centers, and units, and separates the traffic ingress, application, middleware, and data layers. Tair sits in the data layer alongside the databases and keeps data synchronized across units, so that the business services above it can remain stateless.

Elastic Site‑Building

A dedicated operations platform (Taido) orchestrates site‑building tasks, validates connectivity, and ensures zero‑downtime deployment. Before each full‑chain stress test, resource utilization (the "water level") is balanced across clusters.

Data Synchronization

Multi‑unit deployments require fast data synchronization. During Double 11, synchronization peaked at about ten million records per second, with mechanisms in place to resolve conflicting writes made in different units.
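The article does not say which conflict‑resolution mechanism Tair uses. As a minimal sketch of one common approach, assuming a last‑writer‑wins rule keyed on a timestamp with the unit name as tie‑breaker, the following shows how two units can converge on the same value regardless of the order in which they receive each other's writes. The names `Versioned`, `resolve`, and `apply_sync` are hypothetical and are not part of Tair.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Versioned:
    """A value tagged with the write time and the unit that wrote it (assumed format)."""
    value: str
    timestamp_ms: int
    unit: str


def resolve(local: Optional[Versioned], remote: Versioned) -> Versioned:
    """Last-writer-wins: the newer timestamp wins, the unit name breaks ties."""
    if local is None:
        return remote
    if (remote.timestamp_ms, remote.unit) > (local.timestamp_ms, local.unit):
        return remote
    return local


def apply_sync(store: Dict[str, Versioned], key: str, remote: Versioned) -> Versioned:
    """Apply a record received from another unit to the local store."""
    store[key] = resolve(store.get(key), remote)
    return store[key]


# Two units write the same key concurrently, then exchange their writes.
write_a = Versioned("cart=3", timestamp_ms=1000, unit="unit-a")
write_b = Versioned("cart=4", timestamp_ms=1005, unit="unit-b")
store_a = {"user:1": write_a}
store_b = {"user:1": write_b}
apply_sync(store_a, "user:1", write_b)
apply_sync(store_b, "user:1", write_a)
```

Because the comparison is a total order over (timestamp, unit), both stores end up holding the same record. A rule like this silently drops the older write, which is acceptable for cache data but not for every workload.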

Performance Optimizations

Server‑side improvements focus on lock reduction, lock‑free structures, and a user‑space network stack (DPDK + Alisocket). Client‑side upgrades replace Mina with Netty and adopt Kryo/Hessian serialization, boosting throughput.

Memory Data Structure

Tair allocates large memory blocks organized with slab allocators, hash maps, and memory pools, employing LRU chains for eviction. Fine‑grained locks, lock‑free structures, CPU‑local data, and RCU increase parallelism.
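To make the layout concrete, here is a toy sketch of the combination described above: size classes standing in for slabs, a hash index, and one LRU chain per size class. It is an illustration under simplifying assumptions (single‑threaded, a fixed number of slots per class, no locking or RCU), not Tair's actual implementation; `SlabCache` and its parameters are hypothetical.

```python
from collections import OrderedDict


class SlabCache:
    """Toy cache: a fixed slot budget per size class, LRU eviction within a class."""

    def __init__(self, size_classes=(64, 256, 1024), slots_per_class=2):
        self.size_classes = sorted(size_classes)
        self.slots_per_class = slots_per_class
        # One LRU chain per size class; an OrderedDict keeps the least recent entry first.
        self.lru = {size: OrderedDict() for size in self.size_classes}
        # Hash index from key to the size class that currently holds it.
        self.index = {}

    def _class_for(self, value: bytes) -> int:
        """Pick the smallest size class that fits the value."""
        for size in self.size_classes:
            if len(value) <= size:
                return size
        raise ValueError("value larger than the largest size class")

    def put(self, key, value: bytes) -> None:
        self.delete(key)  # the new value may belong to a different size class
        size = self._class_for(value)
        chain = self.lru[size]
        if len(chain) >= self.slots_per_class:
            victim, _ = chain.popitem(last=False)  # evict the least recently used entry
            del self.index[victim]
        chain[key] = value
        self.index[key] = size

    def get(self, key):
        size = self.index.get(key)
        if size is None:
            return None
        chain = self.lru[size]
        chain.move_to_end(key)  # mark as most recently used
        return chain[key]

    def delete(self, key) -> None:
        size = self.index.pop(key, None)
        if size is not None:
            del self.lru[size][key]
```

Keeping a separate LRU chain per size class means an eviction always frees a slot of exactly the size the incoming item needs. In a concurrent server, each of these structures is a contention point, which is where the fine‑grained locks and lock‑free techniques mentioned above come in.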

User‑Space Protocol Stack

DPDK + Alisocket moves packet processing into user space, outperforming both kernel‑mode stacks and Seastar by more than 10%.

Memory Merging

Unused pages within partially filled slabs are merged, freeing significant memory and improving utilization in multi‑tenant environments.
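The idea can be sketched as a compaction pass over the pages of one size class: items from the emptiest pages are moved into the free slots of the fullest pages until the donors are empty and can be released. This is a simplified model, assuming pages are plain lists and that no page starts out above capacity; `merge_pages` is a hypothetical helper, not a Tair function.

```python
def merge_pages(pages, capacity):
    """Compact the live items of one size class into as few pages as possible.

    `pages` is a list of lists; each inner list holds the live items of one page.
    Returns (compacted_pages, freed_page_count). The input is not modified.
    """
    # Fullest pages first: they are the merge targets, the emptiest are the donors.
    ordered = sorted((list(page) for page in pages), key=len, reverse=True)
    left, right = 0, len(ordered) - 1
    while left < right:
        target, donor = ordered[left], ordered[right]
        if len(target) == capacity:
            left += 1            # target is full, move on to the next one
        elif not donor:
            right -= 1           # donor is drained, its page can be freed
        else:
            # A real engine would also update the hash index entry for the moved item.
            target.append(donor.pop())
    kept = [page for page in ordered if page]
    return kept, len(pages) - len(kept)
```

For example, `merge_pages([["a"], ["b", "c", "d"], ["e"], []], capacity=4)` packs the five items into two pages and reports two pages freed. In a multi‑tenant cluster, pages freed this way can be handed to whichever tenant or size class currently needs them.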

Client Optimizations

The network framework was switched to Netty with coroutine support, raising throughput by 40%; serialization was switched to Kryo/Hessian, adding a further 16% gain.

Hotspot Solutions

Hotspot hashing introduces hotzones on data nodes. Hot keys are identified with multi‑level LRU weighting and dynamically redistributed across the cluster, which reduced per‑node load (the "water level") from over 130% to safe levels during peak traffic.
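A rough sketch of the routing side of this idea follows. It assumes a plain hit counter with a fixed threshold in place of the multi‑level LRU weighting, and a hotzone made of the key's home node plus its neighbors; both choices are simplifications for illustration. `HotspotRouter`, `hot_threshold`, and `hotzone_size` are hypothetical names.

```python
import hashlib
from collections import Counter
from itertools import count


class HotspotRouter:
    """Route cold keys to one home node and spread hot keys over a small hotzone."""

    def __init__(self, nodes, hot_threshold=100, hotzone_size=3):
        self.nodes = list(nodes)
        self.hot_threshold = hot_threshold
        self.hotzone_size = min(hotzone_size, len(self.nodes))
        self.hits = Counter()   # simplified stand-in for multi-level LRU weighting
        self._tick = count()    # round-robin cursor for hot reads

    def _home_index(self, key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def hotzone(self, key: str):
        """Nodes that would hold replicas of `key` once it is considered hot."""
        start = self._home_index(key)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.hotzone_size)]

    def route_read(self, key: str) -> str:
        self.hits[key] += 1
        if self.hits[key] <= self.hot_threshold:
            return self.nodes[self._home_index(key)]   # cold: single home node
        zone = self.hotzone(key)
        return zone[next(self._tick) % len(zone)]      # hot: rotate over the hotzone
```

With three nodes in the hotzone, each node sees roughly a third of the reads for a hot key instead of one node absorbing all of them. The sketch covers routing only; replicating the hot value into the hotzone and keeping it consistent are left out.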

Write Hotspot Handling

Hot write requests are merged by a dedicated thread and flushed periodically, dramatically lowering engine pressure.
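As a minimal sketch of write merging, assuming the hot writes are counter increments that can be summed, the class below buffers increments in memory and applies one merged write per key when flushed. In a real deployment a dedicated thread would call `flush()` on a timer; here it is called explicitly to keep the example deterministic. `WriteMerger` and `engine_apply` are hypothetical names.

```python
import threading
from collections import defaultdict


class WriteMerger:
    """Buffer increments per key and hand the storage engine one merged write per flush."""

    def __init__(self, engine_apply):
        self.engine_apply = engine_apply    # callable(key, delta): the expensive engine write
        self.pending = defaultdict(int)
        self.lock = threading.Lock()

    def incr(self, key, delta=1):
        """Cheap path taken by request threads: only touches the in-memory buffer."""
        with self.lock:
            self.pending[key] += delta

    def flush(self):
        """Swap out the buffer and apply one write per key. Returns the number of keys written."""
        with self.lock:
            batch, self.pending = self.pending, defaultdict(int)
        for key, delta in batch.items():
            self.engine_apply(key, delta)
        return len(batch)


# Example: 1000 increments on one hot key reach the engine as a single write.
store = {}
engine_calls = []

def apply_to_engine(key, delta):
    engine_calls.append(key)
    store[key] = store.get(key, 0) + delta

merger = WriteMerger(apply_to_engine)
for _ in range(1000):
    merger.incr("item:42")
merger.incr("item:7", 5)
written = merger.flush()
```

The trade‑off is durability and freshness: increments that are still buffered are lost if the process dies before the next flush, and readers see a value that lags by up to one flush interval.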

Result

Through these combined techniques, Tair eliminated both read and write hotspots, sustained massive traffic, and achieved substantial cost reductions.
