How Alibaba’s Tair Cache Engine Scaled to 500M QPS for Double 11
Alibaba’s Tair, a high‑performance distributed key/value cache, evolved through multiple versions to support massive traffic during Double 11, employing multi‑region deployment, hotspot hashing, memory merging, user‑space networking, and client optimizations that dramatically cut latency, improve scalability, and reduce operational costs.
Tair Overview
Tair is Alibaba’s high‑performance, distributed, scalable, and highly reliable key/value storage system, widely used across e‑commerce, video, and many other Alibaba business units.
Development Timeline
2010.04 – Tair v1.0 launched in core Taobao systems.
2012.06 – v2.0 introduced LDB persistence.
2012.10 – RDB cache with Redis‑like interface.
2013.03 – Fastdump for bulk import.
2014.07 – v3.0 released with several‑fold performance boost.
2016.11 – Intelligent operations platform for Double 11.
2017.11 – Performance leap, hotspot hashing, resource scheduling for trillion‑scale traffic.
Key Features
High performance: supports up to 5 × 10⁸ QPS during Double 11 with sub‑millisecond latency.
High availability: automatic failover, rate limiting, multi‑zone and multi‑region redundancy.
Scalability: deployed across global data centers and all Alibaba BUs.
Broad business coverage: e‑commerce, Ant Financial, Cainiao, Amap, Alibaba Health, etc.
Typical Use Cases
MDB – cache to reduce backend DB pressure, temporary data storage.
LDB – general KV, transaction snapshots, high‑QPS counters.
RDB – complex data structures such as playlists and live rooms.
FastDump – rapid bulk import for low‑latency online reads.
Double 11 Challenges
Tair's peak access traffic grew faster than the peak transaction rate, making low‑latency, cost‑effective scaling a critical challenge. Hotspot problems also became severe, prompting the development of hotspot hashing and multi‑region, multi‑unit architectures.
Multi‑Region, Multi‑Unit Architecture
The system spans multiple regions, data centers, and units, separating traffic ingress, application, middleware, and data layers. Tair sits in the data layer alongside the databases and keeps data synchronized across units, so that the application layer can remain stateless.
Elastic Site‑Building
A dedicated operations platform (Taido) orchestrates tasks, validates connectivity, and ensures zero‑downtime deployment. Resource load (water level) is balanced across clusters before each full‑chain stress test.
Data Synchronization
Multi‑unit deployments require fast data synchronization between units. During Double 11, synchronization throughput reached the level of ten million records per second, with mechanisms to resolve conflicts when the same key is written in more than one unit.
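The article does not say which conflict‑resolution mechanism Tair uses. As a minimal sketch of one common approach, last‑writer‑wins, the example below assumes every write carries a timestamp and the name of the unit that accepted it. The `Versioned` type, its fields, and `resolve` are hypothetical names for illustration, not Tair APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """A value tagged with write metadata (hypothetical layout)."""
    value: str
    timestamp_ms: int
    unit: str


def resolve(local, remote):
    """Last-writer-wins: the newer timestamp wins, the unit name breaks ties.

    Using the same deterministic rule in every unit lets all replicas
    converge on the same value regardless of arrival order.
    """
    return max(local, remote, key=lambda v: (v.timestamp_ms, v.unit))


a = Versioned("paid", 100, "unit-a")
b = Versioned("cancelled", 100, "unit-b")
c = Versioned("refunded", 101, "unit-a")

winner_tie = resolve(a, b)    # same timestamp, so "unit-b" wins the tie
winner_newer = resolve(b, c)  # c has the newer timestamp
```

Last‑writer‑wins is simple but silently drops the losing write, so it only suits data where that is acceptable.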
Performance Optimizations
Server‑side improvements focus on lock reduction, lock‑free structures, and a user‑space network stack (DPDK + Alisocket). Client‑side upgrades replace Mina with Netty and adopt Kryo/Hessian serialization, boosting throughput.
Memory Data Structure
Tair allocates large memory blocks organized with slab allocators, hash maps, and memory pools, employing LRU chains for eviction. Fine‑grained locks, lock‑free structures, CPU‑local data, and RCU increase parallelism.
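The pairing of a hash map with an LRU chain can be illustrated in a few lines. This is a simplified, single‑threaded sketch that counts items rather than bytes and ignores slabs and locking; the `LruCache` class is a hypothetical name, not part of Tair.

```python
from collections import OrderedDict


class LruCache:
    """Hash map plus LRU chain with a fixed item capacity (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        # OrderedDict gives hash lookup, and its ordering acts as the LRU chain.
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry


cache = LruCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" becomes the most recently used entry
cache.put("c", 3)   # capacity exceeded, so "b" is evicted
```

A single chain like this needs one lock around every access, which is the contention that fine‑grained locks and lock‑free structures are meant to reduce.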
User‑Space Protocol Stack
DPDK + Alisocket moves packet processing into user space, bypassing the kernel network stack, and outperforms kernel‑mode stacks and Seastar by more than 10%.
Memory Merging
Unused pages within partially filled slabs are merged, freeing significant memory and improving utilization in multi‑tenant environments.
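The idea can be sketched as a compaction pass over the pages of one slab class: items from the emptiest pages are moved into the free slots of fuller pages until whole pages become empty and can be released. The sketch models each page as a list with a fixed number of slots; `merge_pages` and its parameters are hypothetical, and the article does not describe Tair's actual algorithm at this level.

```python
def merge_pages(pages, slots_per_page):
    """Pack items into as few pages as possible.

    Returns (kept_pages, freed_count), where freed_count is the number of
    pages that end up empty and could be returned to the memory pool.
    """
    total = len(pages)
    # Work on copies, fullest pages first, so the sparsest pages are drained.
    work = sorted((list(p) for p in pages if p), key=len, reverse=True)
    dst, src = 0, len(work) - 1
    while dst < src:
        if len(work[dst]) == slots_per_page:
            dst += 1            # destination page is full, move on
        elif not work[src]:
            src -= 1            # source page is drained, move on
        else:
            work[dst].append(work[src].pop())  # relocate one item
    kept = [p for p in work if p]
    return kept, total - len(kept)


pages = [["a", "b", "c"], ["d"], ["e", "f"], ["g"]]
kept, freed = merge_pages(pages, slots_per_page=4)
# Seven items fit into two pages of four slots, so two pages are freed.
```

A real engine would also have to update the hash‑map entries that point at relocated items, which this sketch leaves out.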
Client Optimizations
Network framework switched to Netty with coroutine support, raising throughput by 40%; serialization switched to Kryo/Hessian, adding another 16% gain.
Hotspot Solutions
Hotspot hashing introduces hotzones on data nodes. Hot keys are identified with multi‑level LRU weighting and dynamically redistributed to the hotzones, which spreads their load across the cluster. During peak traffic this reduced per‑node load (water level) from over 130% to safe levels.
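The routing effect of hotspot hashing can be shown with a toy router. Everything in this sketch is an assumption made for illustration: the `HotspotRouter` class, the simple hit‑count threshold standing in for multi‑level LRU weighting, and round‑robin over a group of neighbouring nodes standing in for the hotzones. It also assumes a copy of each hot key is available on every node in the group.

```python
import zlib
from collections import Counter


class HotspotRouter:
    """Route normal keys to one home node and hot keys across several nodes."""

    def __init__(self, nodes, hot_threshold, hotzone_size):
        self.nodes = list(nodes)
        self.hot_threshold = hot_threshold  # reads before a key counts as hot
        self.hotzone_size = hotzone_size    # nodes that share a hot key's reads
        self.hits = Counter()
        self.next_slot = Counter()

    def _home_index(self, key):
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        return zlib.crc32(key.encode("utf-8")) % len(self.nodes)

    def route_read(self, key):
        self.hits[key] += 1
        home = self._home_index(key)
        if self.hits[key] <= self.hot_threshold:
            return self.nodes[home]
        # Hot key: rotate over a small group of nodes starting at the home node.
        offset = self.next_slot[key] % self.hotzone_size
        self.next_slot[key] += 1
        return self.nodes[(home + offset) % len(self.nodes)]


router = HotspotRouter(["n0", "n1", "n2", "n3"], hot_threshold=2, hotzone_size=3)
cold_targets = [router.route_read("item:1") for _ in range(2)]
hot_targets = [router.route_read("item:1") for _ in range(6)]
# The first two reads hit one node; later reads rotate over three nodes.
```

Without the hot path, every read of a popular key lands on its single home node, which is how one node's load can exceed its capacity while the rest of the cluster is idle.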
Write Hotspot Handling
Hot write requests are merged by a dedicated thread and flushed periodically, dramatically lowering engine pressure.
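A minimal sketch of write merging for counter‑style updates follows. It assumes the hot writes are increments that can be summed before reaching the engine. The `WriteMerger` class and the dict standing in for the storage engine are hypothetical. In a real server a dedicated thread would call `flush` on a timer; here it is called explicitly to keep the example deterministic.

```python
import threading
from collections import defaultdict


class WriteMerger:
    """Buffer increments per key and apply them to the engine in batches."""

    def __init__(self, engine):
        self.engine = engine              # stand-in for the storage engine
        self.pending = defaultdict(int)   # merged deltas awaiting a flush
        self.lock = threading.Lock()
        self.engine_writes = 0            # counts writes that reach the engine

    def incr(self, key, delta=1):
        with self.lock:
            self.pending[key] += delta    # merge in memory, no engine access

    def flush(self):
        # Swap the buffer out under the lock so writers are blocked only briefly.
        with self.lock:
            batch, self.pending = self.pending, defaultdict(int)
        for key, delta in batch.items():
            self.engine[key] = self.engine.get(key, 0) + delta
            self.engine_writes += 1


engine = {}
merger = WriteMerger(engine)
for _ in range(1000):
    merger.incr("stock:42")
merger.flush()
# 1000 client writes to the hot key reach the engine as a single write.
```

The trade‑off is that writes become visible only after the next flush, so the flush interval bounds how stale a hot key can be.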
Result
Through these combined techniques, Tair eliminated both read and write hotspots, sustained massive traffic, and achieved substantial cost reductions.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
