Evolution of Ximalaya KV Storage and XCache Architecture
Ximalaya’s KV storage evolved from a simple Redis master‑slave setup to client‑side sharding, then to Codis clustering for elastic scaling, and on to a disk‑based store built on Pika with cold‑hot data separation. Along the way it added KV‑blob separation, fast‑slow command pools, seconds‑scale expansion, an ehash data type with field‑level expiry, large‑key circuit breaking, and multi‑active data‑center replication; the roadmap now targets cloud‑native deployment, richer features, and AI‑driven operations.
Ximalaya KV Storage Development History
KV storage is a core component of Ximalaya, handling billions of requests daily. This document introduces the evolution of Ximalaya's KV storage, covering its history, self‑developed + community cache system, and future plans.
Redis Master‑Slave Mode
Initially, Ximalaya used a simple Redis master‑slave architecture with VIP‑based failover. While easy to deploy, single‑node constraints capped throughput (under roughly 100,000 QPS) and data size (under roughly 10 GB).
Sharding Mode
To overcome single‑node limits, a client‑side sharding scheme was introduced, hashing keys to different Redis nodes. This improved QPS but lacked elastic scaling and required client code changes when adding or removing nodes.
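The client-side scheme above can be sketched as follows. This is a minimal illustration, not Ximalaya's actual client code: the node list and hash choice are assumptions, and the modulo mapping is exactly what makes adding or removing nodes painful, since it remaps most keys.

```python
import hashlib

# Illustrative node list; a real deployment would load this from config.
NODES = ["redis-0:6379", "redis-1:6379", "redis-2:6379"]

def node_for(key: str) -> str:
    # Stable hash so every client routes the same key to the same node.
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]
```

Because the mapping depends on `len(NODES)`, changing the node count reshuffles almost every key, which is why this design lacks elastic scaling.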
Cluster Mode Selection (2016)
Three popular solutions were evaluated:
Redis Cluster: the official, decentralized solution; easy to deploy, but tightly coupled with the server and hard to upgrade.
Twemproxy: Twitter’s proxy, but unfriendly to smooth scaling.
Codis Redis: an open‑source proxy compatible with the Twemproxy protocol, with better performance, smooth scaling, and visual management.
Codis Redis was chosen as the cluster solution.
Codis Redis Architecture
Codis uses a proxy layer that hashes keys (CRC32) to route commands to backend Redis shards. ZooKeeper provides service discovery; new proxies register automatically, and Jodis clients listen for changes. Codis also includes a web UI (Codis‑fe) and Sentinel management.
Elastic Scaling in Codis
Codis divides the keyspace into 1024 slots. When expanding from 2 to 4 shards, slots are migrated gradually (e.g., slots 256‑511 move from group‑1 to group‑3) and the proxy’s slot map is updated, achieving smooth scaling.
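The routing and slot migration described above can be sketched as follows. The CRC32‑modulo‑1024 routing and the 2‑to‑4 shard example follow the text; the dict‑based slot map and group names are illustrative stand‑ins for the proxy's real routing table.

```python
import zlib

SLOTS = 1024

def slot_of(key: bytes) -> int:
    # Codis-style routing: CRC32 of the key modulo 1024 slots.
    return zlib.crc32(key) % SLOTS

# Initial 2-group layout: slots 0-511 on group-1, 512-1023 on group-2.
slot_map = {s: "group-1" if s < 512 else "group-2" for s in range(SLOTS)}

# Expansion to 4 groups: half of each group's slots move to a new group,
# and proxies pick up the updated map as migration completes.
for s in range(256, 512):
    slot_map[s] = "group-3"
for s in range(768, 1024):
    slot_map[s] = "group-4"
```

In the real system the slots are migrated one at a time and the map is updated incrementally, so clients never see a gap in coverage.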
Limitations of Codis Redis
Issues include memory consumption (all data in RAM), high operational cost for many instances, long restart times, and costly master‑slave failover.
Codis Pika
Pika (360’s open‑source Redis‑compatible KV store) stores data on disk using RocksDB, breaking the memory limit of Redis. It supports most Redis commands but suffers from disk‑read latency spikes and high IO during compaction.
Cold‑Hot Data Separation
Because Pika stores everything on disk, a layer of Redis is added to cache hot data in memory while keeping cold data on disk, dramatically reducing response latency.
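The hot/cold split is essentially a read‑through cache, sketched below with dicts standing in for the Redis (memory) and Pika (disk) tiers; a real deployment would use the actual client libraries and an eviction policy on the hot tier.

```python
# Dict-based stand-ins for the two tiers.
hot = {}             # memory tier (Redis role)
cold = {"k": "v"}    # disk tier (Pika role)

def get(key):
    if key in hot:            # hot hit: served from memory
        return hot[key]
    value = cold.get(key)     # miss: fall back to disk
    if value is not None:
        hot[key] = value      # promote to the hot tier for later reads
    return value
```

After the first disk read of a key, subsequent reads are served from memory, which is what cuts the response latency for hot data.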
KV Separation Storage
RocksDB’s LSM‑tree causes write amplification. By separating large values into a dedicated blob file and keeping only keys and indexes in SST files, compaction writes are reduced, SSD wear is lowered, and overall storage size shrinks.
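The idea can be sketched as follows: large values append to a blob log, and the LSM tree keeps only a small pointer per key, so compaction rewrites pointers rather than full values. The in‑memory structures here are illustrative stand‑ins for the blob file and SST index.

```python
blob = bytearray()   # stand-in for the on-disk blob file
index = {}           # key -> (offset, length); this is what lives in SSTs

def put(key: str, value: bytes):
    offset = len(blob)
    blob.extend(value)            # value goes to the blob log once
    index[key] = (offset, len(value))  # LSM stores only the small pointer

def get(key: str) -> bytes:
    offset, length = index[key]
    return bytes(blob[offset:offset + length])
```

Since compaction now moves only (key, pointer) pairs, write amplification and SSD wear drop roughly in proportion to the value size.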
Fast‑Slow Command Separation
A dual thread‑pool model isolates fast commands (e.g., GET/SET) from slow commands (e.g., HGETALL). Each pool runs independently, preventing slow operations from blocking fast ones.
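A minimal sketch of the dual‑pool dispatch, assuming an illustrative fast‑command set and pool sizes (the real classification and sizing are configuration‑driven):

```python
from concurrent.futures import ThreadPoolExecutor

# Two independent pools so slow commands cannot exhaust the threads
# serving fast ones. Command sets and worker counts are illustrative.
FAST = {"GET", "SET"}
fast_pool = ThreadPoolExecutor(max_workers=8)
slow_pool = ThreadPoolExecutor(max_workers=2)

def dispatch(command: str, handler, *args):
    # Route each request to the pool matching its command class.
    pool = fast_pool if command in FAST else slow_pool
    return pool.submit(handler, *args)
```

Even if every slow‑pool thread is stuck on an HGETALL over a huge hash, GET/SET latency is unaffected because the fast pool's threads are never borrowed.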
Seconds‑Scale Cluster Expansion
When expanding from 2 to 4 shards, slave instances are reassigned to new groups and the proxy’s slot routing is updated without moving any data, achieving near‑instant scaling.
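This works because each slave already holds a full copy of its group's data, so promotion plus a slot‑map update is all that's needed. A sketch, with illustrative instance and group names:

```python
# Each group has a master, a slave with a full data copy, and a slot range.
groups = {
    "group-1": {"master": "pika-1a", "slave": "pika-1b", "slots": range(0, 512)},
    "group-2": {"master": "pika-2a", "slave": "pika-2b", "slots": range(512, 1024)},
}

def expand():
    # Promote group-1's slave as master of a new group-3 and split the
    # slot range; no key is copied, only metadata changes.
    groups["group-3"] = {"master": groups["group-1"].pop("slave"),
                         "slave": None, "slots": range(256, 512)}
    groups["group-1"]["slots"] = range(0, 256)
```

After the split, each side deletes the slots it no longer owns in the background, reclaiming space without blocking traffic.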
EHash Data Type
To allow field‑level expiration in hash structures, an “ehash” type adds a timestamp to each field. Expired fields are removed during reads and the deletion is logged to the binlog for replication.
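The lazy‑expiry behaviour can be sketched as follows. Function names and the in‑memory store are illustrative; the real implementation persists fields in RocksDB and, as noted above, also writes the deletion to the binlog so replicas converge.

```python
import time

store = {}  # key -> {field: (value, expire_at or None)}

def ehset(key, field, value, ttl=None):
    # ttl in seconds; None means the field never expires.
    expire_at = time.time() + ttl if ttl else None
    store.setdefault(key, {})[field] = (value, expire_at)

def ehget(key, field):
    entry = store.get(key, {}).get(field)
    if entry is None:
        return None
    value, expire_at = entry
    if expire_at is not None and time.time() >= expire_at:
        del store[key][field]   # lazy deletion on read
        return None
    return value
```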
Large‑Key Detection and Circuit Breaking
The proxy monitors request sizes (e.g., string values > monitor_max_value_len, list length > monitor_max_batchsize, range > 1000, HGETALL) and classifies them as large‑key requests. Configurable circuit‑break strategies can blacklist offending keys, block specific commands, or reject a configurable percentage of requests.
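A sketch of the classification and rejection logic, reusing the config names from the text; the concrete threshold values and the random‑rejection strategy shown here are illustrative assumptions:

```python
import random

monitor_max_value_len = 1 << 20   # bytes, for string values (assumed value)
monitor_max_batchsize = 1000      # elements, for list/range commands (assumed)
reject_ratio = 0.5                # fraction of large-key requests to reject

def is_large(command: str, value_len: int = 0, batch: int = 0) -> bool:
    if command in ("GET", "SET") and value_len > monitor_max_value_len:
        return True
    if command == "HGETALL":      # whole-hash reads are always flagged
        return True
    return batch > monitor_max_batchsize

def allow(command: str, **kw) -> bool:
    # Circuit-break policy: let normal requests through, reject a
    # configurable share of large-key requests.
    if not is_large(command, **kw):
        return True
    return random.random() >= reject_ratio
```

The other strategies mentioned above (blacklisting keys, blocking commands outright) plug into the same decision point with different policies.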
Multi‑Active Data Centers (Intra‑City Active‑Active)
To avoid single‑datacenter bottlenecks and improve fault tolerance, a dual‑read architecture is deployed across two data centers. In a failure, read traffic switches to the backup site while writes remain in the primary site. Future work aims for dual‑write active‑active mode.
Future Development Plans
Key directions for XCache include:
Feature enhancements such as Lua scripting, transactions, strong consistency, multi‑tenant support, bulk loading, multi‑get optimization, hot‑key detection, and static data analysis.
Cloud‑native transformation to automate deployment, scaling, and management of clusters.
Intelligent operations leveraging data‑driven SRE practices, machine learning, and AI (e.g., ChatGPT) to automate decision‑making and task orchestration.
Ximalaya Technology Team
The official account of Ximalaya's technology team, sharing distilled technical experience and insights.