How Grab Scaled Its Token Service: Migrating Redis to AWS ElastiCache
This article details Grab's migration from a single‑node Redis to AWS ElastiCache: the solutions evaluated, the architecture chosen for horizontal read scalability, and the six‑step migration process that delivered a safe, zero‑downtime transition.
Background
Grab is a ride‑hailing giant in Southeast Asia with 55 million app downloads and 1.2 million drivers.
The app authenticates requests with a token stored in Redis and persisted to MySQL. The original single‑node Redis could not keep up with rapid user growth.
Solution Options
Alternatives
(1) Multi‑node replication – improves fault tolerance, but writes pause during a master failover.
(2) Build a Redis Cluster – solves availability, but sharding relies on the client, which increases client complexity, and adding new shards is cumbersome: you must design migration logic and move a set of users.
(3) Use AWS ElastiCache – provides on‑demand replica nodes and handles data partitioning, but does not support adding new shards.
Choice
Grab needed horizontal scalability for read‑heavy traffic. AWS ElastiCache best met this need: each shard can add replica nodes dynamically, which scales reads, although write capacity is limited by the fixed number of shards.
Read load dominates and write load is modest, so that trade‑off was acceptable. After capacity planning, Grab selected three shards, each with one primary and two replicas (nine nodes total).
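The node count and the way keys land on shards can be sketched in a few lines of Python. This is only an illustration: it assumes the cluster uses the standard Redis Cluster hash‑slot scheme (CRC16 of the key modulo 16384) and, purely for simplicity, that slots are split into equal contiguous ranges across the shards. The helper names are hypothetical and not taken from Grab's post.

```python
import binascii

HASH_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def key_slot(key: str) -> int:
    """CRC16 (XMODEM variant) of the key, modulo 16384.

    Hash tags ({...}) are ignored in this simplified sketch.
    """
    return binascii.crc_hqx(key.encode("utf-8"), 0) % HASH_SLOTS


def shard_for_slot(slot: int, shards: int) -> int:
    """Assumption: slots are divided into equal contiguous ranges per shard."""
    slots_per_shard = -(-HASH_SLOTS // shards)  # ceiling division
    return slot // slots_per_shard


def total_nodes(shards: int, replicas_per_shard: int) -> int:
    """Each shard has one primary plus its replicas."""
    return shards * (1 + replicas_per_shard)


if __name__ == "__main__":
    print(total_nodes(shards=3, replicas_per_shard=2))  # 3 * (1 + 2) nodes
    print(shard_for_slot(key_slot("token:example-user"), shards=3))
```

Adding replicas raises the second argument of total_nodes and therefore read capacity, while the number of shards (and with it the write capacity) stays fixed.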
Migration Process
After choosing AWS ElastiCache, Grab split the migration into six steps to keep it safe.
Step 1
Move data from the old Redis nodes to the new cluster using the SCAN, DUMP, and RESTORE commands. SCAN iterates over the keyspace incrementally instead of blocking the server, which minimizes the impact on the legacy nodes.
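A minimal sketch of such a copy loop is shown below. It is not Grab's tooling: it assumes both clients expose redis-py style methods (scan, dump, pttl, restore), and the function name copy_keys is hypothetical. TTLs are carried over so that tokens still expire on schedule in the new cluster.

```python
def copy_keys(src, dst, batch_size=100):
    """Copy every key from src to dst using SCAN, DUMP, PTTL and RESTORE.

    Assumption: src and dst behave like redis-py clients, i.e. they offer
    scan(cursor=..., count=...), dump(key), pttl(key) and
    restore(key, ttl, value, replace=...).
    """
    copied = 0
    cursor = 0
    while True:
        # SCAN returns a new cursor and a small batch of keys per call.
        cursor, keys = src.scan(cursor=cursor, count=batch_size)
        for key in keys:
            payload = src.dump(key)
            if payload is None:
                continue  # key expired or was deleted after SCAN saw it
            ttl_ms = src.pttl(key)
            if ttl_ms == -2:
                continue  # key vanished between DUMP and PTTL
            # PTTL reports -1 for "no expiry"; RESTORE expects 0 for that case.
            ttl = 0 if ttl_ms == -1 else max(ttl_ms, 1)
            # replace=True keeps the copy idempotent if SCAN repeats a key.
            dst.restore(key, ttl, payload, replace=True)
            copied += 1
        if cursor == 0:  # SCAN signals the end of the iteration with cursor 0
            break
    return copied
```

Because SCAN may return a key more than once, the returned count is an upper bound on distinct keys rather than an exact figure.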
Step 2
Applications keep writing to the old nodes as before and additionally write to the new cluster asynchronously. A failed cluster write does not affect the request, so the cluster can be verified without touching production behavior.
Step 3
Switch to synchronous writes to the cluster, fully involving it in the business flow; any errors now affect real API calls, providing a live test.
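The difference between steps 2 and 3 can be sketched as a small write wrapper. This is an illustrative sketch only: the class name DualWriter, the mode strings, and the on_shadow_error callback are hypothetical, and both stores are assumed to expose a set(key, value) method as redis-py clients do.

```python
from concurrent.futures import ThreadPoolExecutor


class DualWriter:
    """Write to the old store and, depending on mode, to the new cluster.

    mode "old_only":  only the old store is written.
    mode "async_new": step 2, cluster writes run in the background and
                      their failures are reported but never raised.
    mode "sync_new":  step 3, cluster writes run inline and their failures
                      propagate to the caller.
    """

    def __init__(self, old, new, mode="old_only", on_shadow_error=None):
        self.old = old
        self.new = new
        self.mode = mode
        self._on_shadow_error = on_shadow_error or (lambda exc: None)
        self._pool = ThreadPoolExecutor(max_workers=4)

    def _shadow_set(self, key, value):
        try:
            self.new.set(key, value)
        except Exception as exc:  # a background failure must not break requests
            self._on_shadow_error(exc)

    def set(self, key, value):
        self.old.set(key, value)  # the old store remains authoritative
        if self.mode == "async_new":
            self._pool.submit(self._shadow_set, key, value)
        elif self.mode == "sync_new":
            self.new.set(key, value)  # errors now reach the API caller

    def close(self):
        """Wait for queued background writes to finish."""
        self._pool.shutdown(wait=True)
```

Moving from step 2 to step 3 is then a change of the mode value, and moving back is the same change in reverse.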
Step 4
Reads are still served from the old nodes, but each read is also issued against the cluster asynchronously and the two results are compared to validate data consistency.
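A shadow read of this kind might look like the sketch below. The names ShadowReader and on_mismatch are hypothetical, and both stores are assumed to expose a get(key) method; the source does not describe how Grab recorded mismatches, so the callback is a placeholder for whatever logging or metrics are in use.

```python
from concurrent.futures import ThreadPoolExecutor


class ShadowReader:
    """Serve reads from the old store and compare against the new cluster."""

    def __init__(self, old, new, on_mismatch):
        self.old = old
        self.new = new
        self._on_mismatch = on_mismatch  # called as on_mismatch(key, old, new)
        self._pool = ThreadPoolExecutor(max_workers=4)

    def _compare(self, key, expected):
        try:
            actual = self.new.get(key)
        except Exception as exc:
            # A cluster error is reported like a mismatch, never raised.
            self._on_mismatch(key, expected, exc)
            return
        if actual != expected:
            self._on_mismatch(key, expected, actual)

    def get(self, key):
        value = self.old.get(key)  # the caller only ever sees the old value
        self._pool.submit(self._compare, key, value)
        return value

    def close(self):
        """Wait for queued comparisons to finish."""
        self._pool.shutdown(wait=True)
```

A write that lands between the two reads can produce a spurious mismatch, so in practice a small, stable mismatch rate would need to be interpreted with that in mind.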
Step 5
All reads are migrated to the cluster and the old Redis is retired from read traffic.
Step 6
Stop all writes to the old Redis, cut any remaining interaction, and complete the migration. Each step is configuration‑driven, allowing rapid rollback if needed.
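One way to picture the configuration‑driven rollout is as a table of flags, one row per step. The sketch below is an assumption about how such flags could be organized; the field names and the rollback helper are hypothetical and do not come from Grab's post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    write_old: bool        # keep writing to the old Redis nodes
    write_new: str         # "off", "async", or "sync" writes to the cluster
    read_from: str         # "old" or "new": which store serves responses
    shadow_read_new: bool  # also read the cluster and compare results


STAGES = {
    1: Stage(write_old=True, write_new="off", read_from="old", shadow_read_new=False),
    2: Stage(write_old=True, write_new="async", read_from="old", shadow_read_new=False),
    3: Stage(write_old=True, write_new="sync", read_from="old", shadow_read_new=False),
    4: Stage(write_old=True, write_new="sync", read_from="old", shadow_read_new=True),
    5: Stage(write_old=True, write_new="sync", read_from="new", shadow_read_new=False),
    6: Stage(write_old=False, write_new="sync", read_from="new", shadow_read_new=False),
}


def rollback(step: int) -> int:
    """Return the previous step; step 1 is the floor."""
    return max(step - 1, 1)
```

Until step 6 the old nodes still receive every write, which is what keeps a rollback from step 5 to step 4 safe: reads can return to the old nodes without any data being missing there.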
Conclusion
The migration was straightforward, and Grab’s analytical approach and rigorous process are worth emulating.
This article is a translation of Grab’s engineering post.
http://engineering.grab.com/migrating-existing-datastores
