Redis Cluster Slot Migration Optimization Based on Master-Slave Replication Mechanism
vivo's storage team redesigned Redis cluster slot migration by disguising the target node as a slave of the source and transferring data over the master-slave replication protocol with progressive loading. The approach eliminates ask-move round trips, reduces latency spikes, consolidates topology updates, and keeps access stable and the cluster highly available during horizontal scaling.
This article discusses the optimization of Redis cluster horizontal scaling operations by vivo's storage team. The native Redis cluster slot migration causes significant latency spikes during data migration, affecting business operations and even leading to cluster node failures.
Problem Analysis:
Native Redis migration operates at key granularity: for each key, the source serializes the key-value pair, sends it over the network, waits for the target's acknowledgment, and then deletes the local copy. These synchronous, blocking steps occupy the worker thread for extended periods, causing service latency spikes, especially for large keys. In addition, the ask-move redirection mechanism doubles client network round trips, and the frequent topology changes during migration trigger concentrated topology-refresh requests from clients, causing resource contention.
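The per-key cost can be sketched with an in-memory stand-in (plain Python dicts rather than real Redis nodes; `migrate_slot_keys` is illustrative, not the actual MIGRATE implementation):

```python
import pickle

# Toy stand-ins for two cluster nodes. Real Redis uses DUMP/RESTORE via the
# MIGRATE command, moving one key at a time.
source = {"user:1": "a", "user:2": "b", "user:3": "c"}
target = {}

def migrate_slot_keys(source, target, keys):
    """Native-style migration: each key is serialized, sent, acknowledged,
    and deleted before the worker thread can move on to the next key."""
    for key in keys:
        payload = pickle.dumps((key, source[key]))  # serialize the key-value pair
        k, v = pickle.loads(payload)                # network transfer + restore (simulated)
        target[k] = v                               # target stores it and acknowledges
        del source[key]                             # clean up the local copy

migrate_slot_keys(source, target, list(source))
print(source, target)  # → {} {'user:1': 'a', 'user:2': 'b', 'user:3': 'c'}
```

Every iteration blocks the worker thread for a full serialize-send-ack-delete cycle, which is why a large key translates directly into a latency spike.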
Solution:
vivo developed a custom slot migration mechanism built on master-slave replication. The target node is disguised as a slave of the source node and receives data over the master-slave replication protocol. Key modifications include: exchanging slot information between the migrating nodes, reorganizing the RDB file so data is laid out in slot order, and appending per-slot offset metadata at the end of the RDB file. The target node loads progressively, applying data incrementally as each network packet arrives rather than blocking until the full dump is received.
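A simplified sketch of the two layout ideas, using JSON chunks as a stand-in for real RDB encoding (function names and the byte layout are hypothetical, not vivo's actual format): slots are written in slot order with a length-prefixed header each, a slot-to-offset index is appended at the end, and the loader applies each slot's chunk as soon as its bytes have fully arrived.

```python
import json
import struct

def build_slot_ordered_dump(slot_data):
    """Lay out slots in slot order, each chunk prefixed with (slot, length),
    then append a slot->offset index plus its 4-byte length at the end."""
    body, index = b"", {}
    for slot in sorted(slot_data):
        payload = json.dumps(slot_data[slot]).encode()
        index[slot] = len(body)  # remember where this slot's chunk starts
        body += struct.pack(">II", slot, len(payload)) + payload
    trailer = json.dumps(index).encode()
    return body + trailer + struct.pack(">I", len(trailer))

def load_progressively(packets):
    """Apply each slot's data as soon as its chunk is complete, instead of
    blocking until the entire dump has been received."""
    buf, loaded = b"", {}
    for pkt in packets:
        buf += pkt  # a network packet arrives
        while len(buf) >= 8:
            slot, length = struct.unpack(">II", buf[:8])
            if len(buf) < 8 + length:
                break  # chunk incomplete; wait for more packets
            loaded[slot] = json.loads(buf[8:8 + length])  # apply this slot now
            buf = buf[8 + length:]
    return loaded

data = {0: {"k0": "v0"}, 7: {"k7": "v7"}, 42: {"k42": "v42"}}
dump = build_slot_ordered_dump(data)

# The trailing index lets a reader seek straight to any one slot's chunk:
trailer_len = struct.unpack(">I", dump[-4:])[0]
body = dump[:-4 - trailer_len]

# Progressive loading: feed the body in small packets.
packets = [body[i:i + 8] for i in range(0, len(body), 8)]
print(load_progressively(packets))
```

The trailing offset index gives random access to individual slots (useful for resuming or migrating a subset), while the inline chunk headers are what make per-packet progressive loading possible.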
Optimization Results:
1) Minimal latency impact: progressive loading greatly reduces the impact of migrating large keys.
2) Stable data access: no ask-move mechanism is needed during migration, preserving read-write consistency.
3) Fewer topology changes: multiple slots can be migrated together with a single topology update.
4) High availability: migration state is synced to slave nodes, so migration continues automatically after a failover.
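The stable-access benefit can be illustrated with a toy round-trip count (illustrative names, not a real Redis client API): under native migration a key that has already moved costs the client an extra redirect, while under the replication-based scheme the source keeps serving everything until the handover.

```python
# Toy round-trip count for one read during migration; names are
# illustrative, not the real Redis client API.
def round_trips_native(key_already_moved):
    """Native migration: if the key has already moved, the source answers
    with an ASK redirect and the client retries on the target (ASKING + GET),
    doubling the round trips for that read."""
    return 2 if key_already_moved else 1

def round_trips_replication_based():
    """Replication-based migration: the source serves all reads and writes
    until the final handover, so one round trip always suffices."""
    return 1

print(round_trips_native(True), round_trips_replication_based())  # → 2 1
```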
Test results show that the custom migration maintains stable latency on both source and target nodes during data transfer, significantly outperforming native migration.
vivo Internet Technology