
Redis Cluster Migration Lessons: Real‑World Failures and Practical Solutions

This article recounts a series of July Redis incidents—including network‑card saturation, connection‑limit exhaustion, suspected split‑brain, Bgsave‑induced OOM, and master‑restart data loss—detailing the migration to Redis Cluster with a Smart Proxy, the challenges faced, and actionable remediation strategies.

Efficient Ops

Redis Cluster Migration Path

Our Redis deployment was centralized, with N machines dedicated to each product line and a traditional Twemproxy layer. Initially we operated a single default cluster that all PHP services accessed; as traffic grew we began splitting functionality across clusters.

I joined the company in May and drove the migration from Twemproxy to Redis Cluster, defining the following plan: Redis Cluster => Smart Proxy => PHP. Cluster mode enables automatic scaling and allows machines to be treated as a resource pool.

We placed a Cluster‑aware Smart Proxy in front of PHP to hide Redis Cluster complexity. Because the company maintained custom Redis and Twemproxy versions, a real‑time synchronization tool was required for a seamless migration.
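To illustrate the kind of detail the proxy hides: Redis Cluster maps every key to one of 16384 hash slots (CRC16 of the key, or of its hash tag when one is present), and a client has to send each command to the node that owns the slot. The following is a minimal Python sketch of that mapping, not the Smart Proxy's actual code; the function name `key_slot` is chosen here for illustration.

```python
import binascii

HASH_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def key_slot(key: bytes) -> int:
    """Return the cluster hash slot for a key, honoring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    # crc_hqx with initial value 0 is the CRC16/XMODEM variant Redis Cluster uses.
    return binascii.crc_hqx(key, 0) % HASH_SLOTS


print(key_slot(b"foo"))
# Keys that share a hash tag land in the same slot, so multi-key commands stay legal.
print(key_slot(b"{user:1}:profile") == key_slot(b"{user:1}:orders"))
```

A cluster-aware proxy keeps a slot-to-node table on top of this and follows redirections when slots move, which is why PHP code behind it does not need to know about the topology.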

We used Redis‑Port by @goroutine (thanks to Codis author Liu Qi), a Swiss‑army knife for Redis that supports:

Real‑time sync of two clusters

Cross‑datacenter synchronization

Selective key sync

Key deletion

Memory usage statistics

Migration steps:

Redis Master → Redis‑Port → Smart Proxy → Redis Cluster (Redis‑Port reads from the old master and writes to the new cluster)

Update PHP configuration, publish via GitLab, and switch to the new cluster settings

Decommission the old Twemproxy cluster

This approach allowed migration without stopping the business.

Note:

The custom Smart Proxy is essential to shield applications from Redis Cluster intricacies.

Although the plan appears simple, the Bgsave that a sync triggers on the source master can cause latency spikes. Run the sync during low‑traffic periods and stagger it across masters; never point Redis‑Port at several masters simultaneously.
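One way to enforce the staggering is to plan the syncs up front. The sketch below is hypothetical: the low-traffic window, the per-master duration estimate, the master addresses, and the helper `plan_syncs` are all assumptions made for illustration and are not part of Redis‑Port. It gives each master its own non-overlapping slot and refuses a plan that would spill outside the window.

```python
from datetime import datetime, timedelta


def plan_syncs(masters, window_start, window_end, per_master):
    """Assign each master a non-overlapping start time inside the window."""
    schedule = []
    start = window_start
    for master in masters:
        if start + per_master > window_end:
            raise ValueError(f"window too short: {master} has to wait for the next window")
        schedule.append((master, start))
        start += per_master  # the next sync starts only after this one should be done
    return schedule


# Hypothetical masters and a 02:00-05:00 low-traffic window, 45 minutes per sync.
masters = ["10.0.0.1:6379", "10.0.0.2:6379", "10.0.0.3:6379", "10.0.0.4:6379"]
for master, start in plan_syncs(
    masters,
    datetime(2024, 1, 1, 2, 0),
    datetime(2024, 1, 1, 5, 0),
    timedelta(minutes=45),
):
    print(master, start.strftime("%H:%M"))
```

The duration estimate should be generous, since a sync that overruns its slot would overlap with the next master's Bgsave.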

Issue 1: Network‑Card Saturation

During a Friday night peak (23:00), the 1 Gb NICs on four machines (20 instances each) were saturated, causing request failures. Because we could not scale out immediately, the RD team degraded services and dropped 30 % of traffic; service only recovered after the peak had passed.

Subsequent market pushes produced another small peak, forcing a further 30 % traffic drop. We then built two new Redis Cluster instances from a cold start and gradually shifted cache traffic onto them; during the shift, MySQL replicas briefly spiked to 8 K QPS.

Solutions:

Bond all Redis NICs for 2 Gb throughput.

Disperse product‑line deployments to avoid resource contention.

Introduce NIC‑traffic monitoring with alerts at 60 % utilization.
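The 60 % alert reduces to simple arithmetic on two samples of an interface byte counter. Below is a minimal sketch; the function names and the sample values are illustrative, and the counters are assumed to come from something like the `rx_bytes`/`tx_bytes` files Linux exposes under `/sys/class/net/<iface>/statistics/`. Counter wrap-around is not handled.

```python
ONE_GBIT = 1_000_000_000  # link capacity in bits per second


def nic_utilization(bytes_before, bytes_after, interval_s, link_bps):
    """Fraction of link capacity used between two byte-counter samples."""
    bits = (bytes_after - bytes_before) * 8
    return bits / (interval_s * link_bps)


def should_alert(utilization, threshold=0.60):
    """True once utilization reaches the alert threshold (60 % by default)."""
    return utilization >= threshold


# Example: 800 MB sent in a 10-second sampling interval on a 1 Gb link.
u = nic_utilization(0, 800_000_000, 10, ONE_GBIT)
print(f"{u:.0%}", should_alert(u))
```

On a 1 Gb link the threshold corresponds to about 75 MB/s in one direction; with bonded NICs, `link_bps` would be the aggregate capacity.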

Reflection

Insufficient alerting and delayed response highlighted the need for better monitoring and proactive capacity planning.

Issue 2: Connection‑Limit Exhaustion

At around 08:40, Redis reported connection errors as several instances hit their maxclients limit. Two causes combined: a bug in the PHP Redis extension prevented connections from being released, and the systemd unit lacked a proper LimitNOFILE setting, which capped maxclients at roughly 4 000.

Due to the severity, leadership rolled back to the legacy Twemproxy version for the two most critical product lines.
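The cap of roughly 4 000 follows from how Redis derives its client limit from the file-descriptor limit: it keeps a small reserve of descriptors for itself and lowers maxclients when the limit cannot cover the requested value. The sketch below is a simplified model of that rule. The 4096 descriptor limit is an assumption chosen to match the figure reported above, and `effective_maxclients` is an illustrative name, not a Redis API.

```python
RESERVED_FDS = 32  # descriptors Redis keeps for its own files, listeners and replication


def effective_maxclients(requested, nofile_limit):
    """Simplified model: clients are capped by the fd limit minus the reserve."""
    return max(0, min(requested, nofile_limit - RESERVED_FDS))


print(effective_maxclients(10000, 4096))   # 4064: limit left at an assumed 4096
print(effective_maxclients(10000, 65536))  # 10000: after raising LimitNOFILE
```

Setting `LimitNOFILE` in the systemd unit comfortably above maxclients plus the reserve removes this hidden cap.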

Reflection

Architecture changes were not thoroughly tested; bugs that could be reproduced offline still reached production because validation was insufficient.

Operations awareness was lacking; systemd limits were not audited.

Even “the best language” (PHP) can cause issues; a proxy layer between PHP and Redis protects the backend from misbehaving clients.

Issue 3: Suspected Cluster Split‑Brain

Early morning alarms indicated data inconsistency: some reads returned no data, others returned stale data. Investigation revealed divergent cluster node configurations. A recent migration had not propagated the new config to all PHP services, so some of them were still using the old node list.

After pushing the updated configuration, the issue resolved.
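Drift of this kind can be caught before it causes inconsistent reads by comparing the node list each service host actually has. A minimal sketch follows, assuming the node lists have already been collected per host; the host names, addresses, and helper functions are hypothetical.

```python
import hashlib
from collections import Counter


def fingerprint(nodes):
    """Order-independent hash of a node list such as ['10.0.0.1:7000', ...]."""
    joined = ",".join(sorted(nodes))
    return hashlib.sha1(joined.encode()).hexdigest()


def find_divergent(configs):
    """Return the hosts whose node list differs from the most common one."""
    if not configs:
        return []
    prints = {host: fingerprint(nodes) for host, nodes in configs.items()}
    majority, _ = Counter(prints.values()).most_common(1)[0]
    return sorted(host for host, fp in prints.items() if fp != majority)


configs = {
    "php-01": ["10.0.0.1:7000", "10.0.0.2:7000"],
    "php-02": ["10.0.0.2:7000", "10.0.0.1:7000"],  # same nodes, different order
    "php-03": ["10.0.0.9:7000", "10.0.0.2:7000"],  # still lists an old node
}
print(find_divergent(configs))
```

Comparing against the majority is only a heuristic: if most hosts are stale, the check should compare against the intended configuration instead.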

Reflection

Configuration changes must consider the impact on all environments; lack of automatic discovery is a drawback.

Shielding details behind a Smart Proxy proved vital.

Human error remains a major source of incidents.

Issue 4: Bgsave‑Induced OOM

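A classic failure: Redis ran out of memory during Bgsave. Bgsave forks the process, and under write load copy‑on‑write can push memory use well beyond the size of the dataset.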

Solution

Rotate Bgsave across the instances (ports) on each machine so that they do not all save at once. If memory is insufficient, first evict cache entries, abort the Bgsave, and raise an alert.
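The rotation and the memory check can be sketched as a planning step that runs before any fork. Everything below is an assumption made for illustration: the helper name, the action labels, and the `headroom` rule, which treats the worst case as copy-on-write duplicating the whole dataset of the instance being saved.

```python
GIB = 1024 ** 3

BGSAVE = "bgsave"
SKIP = "evict_cache_skip_bgsave_alert"


def plan_bgsave_rotation(instances, available_bytes, headroom=1.0):
    """Return one (port, action) step per instance, to be executed strictly in order.

    instances: list of (port, used_memory_bytes) for a single machine.
    headroom: assumed worst-case copy-on-write overhead as a fraction of used memory.
    """
    steps = []
    for port, used_bytes in instances:
        if available_bytes >= used_bytes * headroom:
            steps.append((port, BGSAVE))
        else:
            steps.append((port, SKIP))
    return steps


# Two instances on one machine with 8 GiB currently available.
for port, action in plan_bgsave_rotation([(6379, 4 * GIB), (6380, 12 * GIB)], 8 * GIB):
    print(port, action)
```

A real runner would re-measure available memory before each step and start the next save only after the previous one has finished, for example once `rdb_bgsave_in_progress` in `INFO persistence` is back to 0.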

Issue 5: Master Restart Flushes Slave

Backups were performed only on the slave, so the master had no usable dump.rdb of its own. When the master restarted, systemd relaunched it before the cluster could elect a new master, so it came back as master after loading an empty dump.rdb. It then replicated this empty state to the slave, wiping the slave's data.

Solution

During backup, rsync dump.rdb to the master’s data directory as well.

For Redis instances used as storage, disable the auto‑restart option in systemd.
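Both systemd pitfalls from these incidents, a missing `LimitNOFILE` and an automatic restart on storage instances, can be checked mechanically. The following is a hedged sketch of such an audit: the unit text, its paths, the `audit_unit` helper, and the 10032 threshold (Redis's default maxclients of 10000 plus a 32-descriptor reserve) are assumptions for illustration.

```python
import configparser

# Hypothetical unit file for a Redis storage instance.
UNIT_TEXT = """\
[Unit]
Description=Redis storage instance (hypothetical example)

[Service]
ExecStart=/usr/bin/redis-server /etc/redis/6379.conf
Restart=always
"""


def audit_unit(text, min_nofile=10032):
    """Return a list of findings for a Redis service unit file."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # systemd keys are case-sensitive
    parser.read_string(text)
    service = parser["Service"] if parser.has_section("Service") else {}
    findings = []

    nofile = service.get("LimitNOFILE")
    if nofile is None:
        findings.append("LimitNOFILE is not set")
    elif nofile != "infinity":
        soft = int(nofile.split(":")[0])  # "soft:hard" or a single value
        if soft < min_nofile:
            findings.append(f"LimitNOFILE soft limit {soft} is below {min_nofile}")

    restart = service.get("Restart", "no")
    if restart != "no":
        findings.append(f"Restart={restart} relaunches the instance automatically")
    return findings


for finding in audit_unit(UNIT_TEXT):
    print(finding)
```

For cache-only instances an automatic restart may be acceptable, so the second finding is a prompt for review rather than an error in every case.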

Other Typical Problems

Application design issues: oversized hashes (more than 480 k records written by HSET into a single key) causing frequent pauses.

Using Redis for counters consumes excessive memory. Packing counters into small hashes or lists, which Redis stores in a compact linear layout, can mitigate this, though MGET limitations prevented adoption.

Mixed deployment leads to resource contention across product lines.

IDC rack failure isolates an entire cabinet, breaking all mixed‑deployment services.

Confusion between cache and storage keys, with missing TTLs causing memory bloat.
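The counter problem above is commonly addressed by bucketing many counters into small hashes, so that each hash stays within Redis's compact encoding. The sketch below shows only the key mapping. The bucket size of 500 is an assumption chosen to stay below the default `hash-max-ziplist-entries` threshold of 512, and the `likes` prefix and the helper name are illustrative.

```python
BUCKET_SIZE = 500  # assumption: keeps each hash under the default compact-encoding limit


def counter_location(prefix, item_id):
    """Map a numeric counter to (hash key, field) instead of its own top-level key."""
    bucket, field = divmod(item_id, BUCKET_SIZE)
    return f"{prefix}:{bucket}", str(field)


print(counter_location("likes", 1234567))  # ('likes:2469', '67')
```

Increments would then use `HINCRBY` on the returned key and field. The trade-off is that reading many counters takes one `HMGET` per bucket rather than a single `MGET`, which fits the adoption problem noted above.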

Final Thoughts

The company heavily relies on Redis for all non‑image data and is actively migrating to Redis Cluster while still operating a Twemproxy cache layer. Outstanding challenges include:

Instance‑level and rack‑level high availability.

Resource isolation (e.g., Docker) for mixed deployments.

Decoupling application languages from Redis via a stable Smart Proxy interface.

Centralized configuration management for cluster build and delivery.

Addressing memory waste in low‑QPS clusters, exploring persistent Redis‑protocol stores such as ARDB or LedisDB, and gaining team trust for custom development.

Currently, production runs two setups: Twemproxy with auto_eject_hosts for cache clusters, and Redis Cluster + Smart Proxy for storage.
