Lossless Scaling of Tencent Cloud Redis: Architecture, Challenges, and Solutions
During the COVID‑19 surge, Tencent Cloud Redis scaled losslessly using a proprietary slot‑specific RDB sync method that avoids downtime and correctly handles large keys, Lua scripts, and multi‑key commands. The approach outperforms both the open‑source migration scheme and DTS-based migration, while minimizing resource duplication and keeping the service available 24/7.
During the COVID‑19 pandemic, the demand for online meetings and remote work surged dramatically. Tencent Meeting expanded its cloud capacity by one million cores in eight days, and the Redis cluster was scaled up by dozens of times, with each expansion operation completing in under 30 minutes. This article summarizes the speaker Wu Xufei's presentation at the "YunJia Community Salon Online", detailing the practice and challenges of lossless scaling for Tencent Cloud Redis.
1. Challenges Caused by the Pandemic
The rapid increase in remote work and online education led to a daily expansion of 15,000 hosts from January 29 to February 6. Services must run 24/7 without interruption; even a one‑minute outage would affect millions of users.
2. Open‑Source Redis Scaling Scheme
The speaker describes three dimensions of scaling: single‑node capacity (e.g., expanding a 4 GB shard to 8 GB), replica scaling (adding read replicas for read‑write separation), and shard (slot) scaling. Slot‑based scaling is bounded by CPU capacity, since each additional shard essentially adds CPU resources.
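For context, Redis Cluster maps every key to one of 16,384 slots by CRC16(key) mod 16384, hashing only the content of a {hash tag} when one is present. A minimal Python sketch of that mapping:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM (polynomial 0x1021, init 0), the checksum Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of the 16,384 cluster slots, honoring {hash tags}."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:  # non-empty tag: hash only its content
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384
```

Keys sharing a hash tag, such as `{user1000}.following` and `{user1000}.followers`, land in the same slot, which is what makes multi‑key commands on them safe in a cluster.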
The traditional open‑source approach involves:
Calculating slot memory usage and assigning target slots.
Setting the target slot to IMPORTING and the source slot to MIGRATING.
Transferring keys synchronously, which can cause blocking and latency.
A critical pitfall is the order of setting IMPORTING and MIGRATING. If the source slot is set to MIGRATING first, a request for a key that does not exist on the source may be redirected to the target before the target slot is ready, causing an infinite redirection loop.
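This ordering pitfall can be shown with a toy simulation (not real Redis code; real Redis adds an ASKING handshake on the target): a MIGRATING source replies ASK for missing keys, and a target that is not yet IMPORTING replies MOVED back to the source.

```python
class Node:
    """Toy model of one cluster node's view of a single slot."""
    def __init__(self, name):
        self.name = name
        self.importing = False     # slot marked IMPORTING here
        self.migrating_to = None   # target Node while slot is MIGRATING
        self.moved_to = None       # owner Node if this node doesn't serve the slot
        self.keys = set()

    def get(self, key):
        if self.moved_to is not None and not self.importing:
            return ("MOVED", self.moved_to)       # not serving this slot
        if key in self.keys:
            return "value"
        if self.migrating_to is not None:
            return ("ASK", self.migrating_to)     # owner mid-migration: try target
        return None                               # clean miss

def client_get(node, key, max_hops=5):
    """Follow redirects like a cluster client; give up after max_hops."""
    hops = 0
    while hops < max_hops:
        reply = node.get(key)
        if not isinstance(reply, tuple):
            return reply, hops
        node = reply[1]
        hops += 1
    return "LOOP", hops
```

With `importing` unset on the target, a lookup for a missing key ping‑pongs between the two nodes until the hop limit; once the target is marked IMPORTING first, the same lookup resolves in one hop.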
3. Lossless Scaling Challenges
3.1 Large‑Key Problem
Migrating large keys (hundreds of megabytes) can block the migration process for seconds to minutes, exceeding typical Redis timeout settings (100‑200 ms) and causing widespread request timeouts.
3.2 Lua Script Problem
Lua scripts loaded via SCRIPT LOAD are not transferred during key migration, leading to EVALSHA failures on the target node.
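A common client‑side mitigation (independent of how migration is done) is to fall back from EVALSHA to EVAL when the server's script cache misses, since EVAL re‑sends and re‑caches the script body. A sketch against a minimal stand‑in client (real clients such as redis‑py expose the same pattern via a NOSCRIPT error):

```python
import hashlib

class FakeRedis:
    """Minimal stand-in for a Redis server's script cache (illustration only)."""
    def __init__(self):
        self.scripts = {}  # sha1 -> script body

    def script_load(self, script):
        sha = hashlib.sha1(script.encode()).hexdigest()
        self.scripts[sha] = script
        return sha

    def evalsha(self, sha):
        if sha not in self.scripts:
            raise KeyError("NOSCRIPT")  # real servers return a NOSCRIPT error
        return f"ran:{self.scripts[sha]}"

    def eval(self, script):
        self.script_load(script)  # EVAL also caches the script, like real Redis
        return f"ran:{script}"

def run_script(client, script, sha):
    """Try EVALSHA; on a NOSCRIPT-style miss (e.g. after migration), fall back to EVAL."""
    try:
        return client.evalsha(sha)
    except KeyError:
        return client.eval(script)
```

After the fallback runs once against a node with an empty cache, subsequent EVALSHA calls on that node succeed again.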
3.3 Multi‑Key Commands / Slave Reads
Commands like MGET that span keys located on different nodes can fail during migration. Additionally, read‑write separation via slaves can cause loading errors when the slave is still syncing large data sets.
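Cluster‑aware clients typically work around this by splitting a multi‑key command into one sub‑command per slot, so each group can be routed to (or retried against) the right node. A sketch of the splitting step; `key_slot` here uses `zlib.crc32` as a stand‑in hash purely for illustration, whereas real Redis Cluster uses CRC16(key) mod 16384:

```python
import zlib
from collections import defaultdict

SLOTS = 16384

def key_slot(key: str) -> int:
    # Stand-in hash for illustration; real Redis Cluster uses CRC16(key) % 16384.
    return zlib.crc32(key.encode()) % SLOTS

def split_mget(keys):
    """Group an MGET's keys by slot so each group can be sent to its owning node."""
    groups = defaultdict(list)
    for k in keys:
        groups[key_slot(k)].append(k)
    return dict(groups)
```

During a migration only the groups touching the moving slot need ASK‑style retries; the rest proceed normally.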
4. Industry Alternatives (DTS)
Data Transfer Service (DTS) creates a pseudo‑slave to perform full sync and incremental sync, translating RDB data into commands for the target. DTS solves the large‑key, Lua, and multi‑key issues and allows faster data transfer, but introduces complexity, temporary unavailability (30 s‑1 min), and double resource consumption during migration.
5. Tencent Cloud Redis Proprietary Scaling Solution
The proprietary solution avoids third‑party components and minimizes resource duplication. Key steps include:
Preparation : Compute memory usage per slot and decide which slots to move.
Migration : The target node initiates a slot‑specific SYNC command; the source forks, generates an RDB containing only the selected slots, and streams it to the target.
Switch : After all slots are transferred and the offset is small, a failover is triggered. The target node becomes the primary, and the source node is removed.
This approach resolves the large‑key issue (by forking and transferring RDB), the Lua problem (Lua scripts are included in the RDB), and multi‑key command issues (no impact on client traffic). It also reduces HA impact, eliminates the need for client disconnection, and cuts resource costs because only the final nodes are created.
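The three phases above can be sketched end to end with a toy model (all names illustrative, not Tencent's implementation): a point‑in‑time copy of only the selected slots stands in for the slot‑specific RDB, writes arriving during the copy go into a backlog that stands in for the incremental stream, and cutover replays the tail and removes the slots from the source.

```python
def migrate_slots(source: dict, slots: set):
    """Toy slot-filtered migration. source maps slot -> {key: value}.

    Returns (write, cutover): write() is how the caller applies traffic during
    the migration; cutover() replays the backlog and returns the target's data.
    """
    # 1. "Fork": point-in-time copy of only the selected slots (slot-specific RDB)
    target = {s: dict(source[s]) for s in slots if s in source}

    # 2. Incremental phase: writes to migrating slots are buffered for replay
    backlog = []

    def write(slot, key, value):
        source.setdefault(slot, {})[key] = value
        if slot in slots:
            backlog.append((slot, key, value))

    # 3. Cutover: replay the remaining backlog, then drop the slots from the source
    def cutover():
        while backlog:
            s, k, v = backlog.pop(0)
            target.setdefault(s, {})[k] = v
        for s in slots:
            source.pop(s, None)
        return target

    return write, cutover
```

In the real system the cutover is a failover triggered only once the replication offset gap is small, so the final replay, and hence the switch, is brief.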
6. Q&A Highlights
• The number of slots remains 16,384, regardless of cluster size.
• Slot memory is calculated in real time on a slave to avoid over‑allocation.
• If a migration fails, the system falls back to extra resources without affecting customers.
• Future scaling may adopt serverless, on‑demand models, but manual intervention will still be required for large‑scale workloads.
The presentation concludes with a brief lecturer bio and recruitment information for Tencent Cloud Redis engineers.