Automating Redis Resource Balancing to Cut DBA Effort
To handle growing memory pressure across thousands of Redis hosts, the platform runs an automated daily resource‑balancing scheduler. It identifies overloaded hosts, picks the nodes to move based on per‑instance node count, priority tier, and placement rules, and then migrates those nodes through a multi‑step process that is validated at each step.
Why Resource Balancing Is Needed
The DeWu Redis management platform oversees hundreds of clusters, tens of thousands of Redis‑server nodes, and thousands of host machines. As business load grows, memory usage on individual hosts rises, threatening stability and limiting vertical scaling. To keep each host’s memory usage below a safe threshold while supporting rapid vertical expansion, the platform performs daily automated inspections and redistributes nodes from overloaded hosts to under‑utilized ones.
Why Automation Is Required
Initially, DBAs migrated nodes by hand whenever a host's memory usage exceeded its limit. Even after the team added batch‑operation tools for adding replicas, switching masters, and removing nodes, manual migration still demanded significant daily DBA effort and introduced stability risks such as missed, duplicated, or erroneous operations, especially as the number of hosts and nodes grew.
Node Selection Strategy
The platform follows a set of optimal‑selection rules to choose which nodes to migrate, aiming to minimize business impact and achieve even distribution across hosts:
Prioritize instances with the most nodes so that the same cluster’s nodes become more evenly spread.
Prefer non‑P0 instances to avoid high‑priority workloads.
Prefer replica (slave) nodes because moving them rarely affects business.
Prefer medium‑sized instances (1‑4 GB): if nodes are too small, more of them must be moved to free the same amount of memory; if they are too large, data transfer takes longer and the risk of failure rises.
Avoid selecting nodes from the same cluster group simultaneously to reduce concentration.
The selection algorithm proceeds in multiple rounds, first targeting non‑P0 clusters with 1‑4 GB nodes, then 1‑5 GB nodes, and finally all clusters including P0 with 1‑4 GB nodes. If the required memory is still unmet, any remaining nodes are chosen in order. Throughout, the algorithm ensures that nodes from the same cluster group are not selected together.
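The multi-round selection can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the `Node` fields, the `ROUNDS` table, and `select_nodes` are hypothetical names, "most nodes" is assumed to mean the cluster's node count on the overloaded host, and "not selected together" is modelled as at most one node per cluster per run.

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass(frozen=True)
class Node:
    node_id: str
    cluster: str
    is_p0: bool
    is_replica: bool
    memory_bytes: int
    cluster_nodes_on_host: int  # assumed ranking signal: this cluster's nodes on the host


# One entry per round: (allow P0 clusters?, min size, max size), mirroring the text.
ROUNDS = [
    (False, 1 * GB, 4 * GB),
    (False, 1 * GB, 5 * GB),
    (True, 1 * GB, 4 * GB),
    (True, 0, float("inf")),  # fallback: any remaining node, in ranked order
]


def select_nodes(nodes, bytes_to_free):
    """Pick nodes to move off one host until bytes_to_free is covered."""
    chosen, used_clusters, freed = [], set(), 0
    # Within a round: clusters with the most nodes first, replicas before masters.
    ranked = sorted(nodes, key=lambda n: (-n.cluster_nodes_on_host, not n.is_replica))
    for allow_p0, lo, hi in ROUNDS:
        for n in ranked:
            if freed >= bytes_to_free:
                return chosen
            if n.cluster in used_clusters:
                continue  # never take two nodes of the same cluster in one run
            if n.is_p0 and not allow_p0:
                continue
            if not (lo <= n.memory_bytes <= hi):
                continue
            chosen.append(n)
            used_clusters.add(n.cluster)
            freed += n.memory_bytes
    return chosen
```

Encoding the rounds as data keeps the relaxation order visible in one place, so adding or reordering a round does not require touching the loop.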
Migration Process and Reliability Checks
The automated migration runs daily at 5 AM and consists of six tightly controlled steps, each validated before proceeding.
1. Add Replica Node
A new replica is provisioned on a host that satisfies placement rules (same availability zone, different host, same resource group, matching specs and version). The deployment also respects constraints such as not placing multiple Redis‑Server/Proxy nodes of the same cluster on the same host and limiting total host memory usage to 90%.
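A target-host filter along these lines could enforce the placement rules. It is only a sketch under assumptions: the `Host` fields and function names are hypothetical, spec and version matching are left out for brevity, and choosing the least-utilized candidate is an illustrative tie-break that the text does not specify.

```python
from dataclasses import dataclass, field

MAX_USAGE = 0.90  # host memory ceiling taken from the text


@dataclass
class Host:
    name: str
    zone: str
    resource_group: str
    total_bytes: int
    used_bytes: int
    clusters: set = field(default_factory=set)  # clusters already present on this host


def can_place(target, source, cluster, node_bytes):
    """Return True if target may receive a replica of `cluster` moved from source."""
    if target.name == source.name:
        return False  # must be a different host
    if target.zone != source.zone:
        return False  # same availability zone
    if target.resource_group != source.resource_group:
        return False  # same resource group
    if cluster in target.clusters:
        return False  # no two nodes of one cluster on the same host
    return (target.used_bytes + node_bytes) / target.total_bytes <= MAX_USAGE


def pick_target(hosts, source, cluster, node_bytes):
    """Assumed policy: the least-utilized host that passes every rule, else None."""
    candidates = [h for h in hosts if can_place(h, source, cluster, node_bytes)]
    return min(candidates, key=lambda h: h.used_bytes / h.total_bytes, default=None)
```

Returning `None` when no host qualifies gives the caller a clear signal to stop and raise the "insufficient host resources" failure rather than place the node somewhere unsafe.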
2. Verify Data Synchronization
After the replica is added, the system checks that the node is correctly allocated and that replication is healthy. It runs INFO REPLICATION on both master and replica, confirming state=online and non‑zero offsets. The check is performed twice, one minute apart, to ensure stability.
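The synchronization check could look something like the sketch below, which works on the raw text returned by INFO REPLICATION so that it stays independent of any client library. The function names are hypothetical, and `fetch` is an assumed callable that returns the master's and the replica's INFO output as a pair of strings.

```python
import time


def parse_info(text):
    """Turn raw INFO output ('key:value' lines) into a dict, skipping '#' headers."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def replica_entries(master_info):
    """Yield one dict per slaveN line, e.g. slave0:ip=..,port=..,state=online,offset=.."""
    for key, value in master_info.items():
        if key.startswith("slave") and key[5:].isdigit():
            yield dict(pair.split("=", 1) for pair in value.split(","))


def replication_healthy(master_text, replica_text, replica_ip, replica_port):
    """Master lists the replica as online with a non-zero offset, and the replica agrees."""
    master, replica = parse_info(master_text), parse_info(replica_text)
    seen_online = any(
        e.get("ip") == replica_ip
        and e.get("port") == str(replica_port)
        and e.get("state") == "online"
        and int(e.get("offset", 0)) > 0
        for e in replica_entries(master)
    )
    return (
        seen_online
        and replica.get("role") == "slave"
        and int(replica.get("slave_repl_offset", 0)) > 0
    )


def check_twice(fetch, replica_ip, replica_port, interval=60, sleep=time.sleep):
    """Run the health check twice, `interval` seconds apart; both runs must pass."""
    for attempt in range(2):
        if attempt:
            sleep(interval)
        if not replication_healthy(*fetch(), replica_ip, replica_port):
            return False
    return True
```

Passing `sleep` as a parameter is a convenience for testing; a scheduler would normally leave it at the default.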
3. Perform Master‑Slave Switch
If the original node is a master, the new replica is promoted to master using SLAVEOF NO ONE, and the former master and other replicas are re‑configured to follow the new master.
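The order of commands in the switch can be made explicit with a small planning function. This is a sketch only: `build_switch_plan` is a hypothetical helper, nodes are represented as `(host, port)` tuples, and the plan deliberately ignores how in-flight writes are handled during promotion, which the text does not describe. Redis 5 and later also accept REPLICAOF as the newer name for SLAVEOF.

```python
def build_switch_plan(new_master, old_master, other_replicas):
    """Return ordered (target_node, command) pairs for promoting new_master.

    Each node is a (host, port) tuple; each command is a tuple of Redis
    command words ready to be sent to target_node by whatever client is used.
    """
    host, port = new_master
    # 1. Promote the freshly synced replica.
    plan = [(new_master, ("SLAVEOF", "NO", "ONE"))]
    # 2. Re-point the former master and every remaining replica at it.
    for node in [old_master, *other_replicas]:
        plan.append((node, ("SLAVEOF", host, str(port))))
    return plan
```

Building the plan separately from executing it makes the sequence easy to log and review before any command reaches a live node.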
4. Verify Master‑Slave Switch
The system re‑runs replication checks from the perspective of the new master and all replicas, ensuring the new master reports role=master with active replicas and that each replica shows master_link_status:up and valid offsets.
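A post-switch check in this spirit might be written as below. The names are hypothetical, the inputs are raw INFO REPLICATION text from each node, and the comparison of `master_host` and `master_port` against the new master is an extra safeguard assumed here rather than something the text states.

```python
def parse_replication(text):
    """Parse 'key:value' lines of INFO output; section headers have no colon and are skipped."""
    return dict(line.strip().split(":", 1) for line in text.splitlines() if ":" in line)


def switch_verified(master_text, replica_texts, new_master_host, new_master_port):
    """Check the promoted node and every replica after the master-slave switch."""
    master = parse_replication(master_text)
    if master.get("role") != "master":
        return False
    # The new master should see at least as many replicas as we expect to follow it.
    if int(master.get("connected_slaves", 0)) < len(replica_texts):
        return False
    for text in replica_texts:
        r = parse_replication(text)
        ok = (
            r.get("role") == "slave"
            and r.get("master_host") == new_master_host      # assumed extra check
            and r.get("master_port") == str(new_master_port)  # assumed extra check
            and r.get("master_link_status") == "up"
            and int(r.get("slave_repl_offset", 0)) > 0
        )
        if not ok:
            return False
    return True
```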
5. Delete Original Node
Once the switch is confirmed, the original node (now a replica or an unused master) is safely taken offline. Additional safeguards verify that no business traffic is directed to the node and that it is not a primary host for critical data.
6. Send Notification
If any step fails—e.g., due to insufficient host resources—the system sends a notification to the responsible DBA, allowing manual intervention.
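The overall control flow, running each step only after the previous one succeeded and notifying a DBA on the first failure, can be sketched as a small runner. `run_migration` and the `notify` callback are hypothetical; how the platform actually delivers notifications is not described in the text.

```python
def run_migration(steps, notify):
    """Run (name, callable) steps in order; stop and notify on the first failure.

    A step fails if it returns a falsy value or raises an exception.
    Returns True only when every step succeeded.
    """
    for name, step in steps:
        try:
            ok = step()
        except Exception as exc:  # treat any error as a failed step
            ok, detail = False, repr(exc)
        else:
            detail = "step returned False"
        if not ok:
            notify(f"migration halted at '{name}': {detail}")
            return False  # later steps, such as deleting the original node, never run
    return True
```

Because the runner stops at the first failure, a failed synchronization check can never be followed by the switch or the deletion of the original node.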
Task Management and Monitoring
Generated migration tasks are scheduled for 5 AM execution but can be viewed, cancelled, rescheduled, or triggered immediately via the task dashboard. The dashboard displays pending tasks with node details, as well as completed historical tasks for audit purposes.
Conclusion
Through daily intelligent automated balancing, the platform keeps host memory usage at a high yet safe level, dynamically controlling usage below configured thresholds while preserving enough headroom for sudden spikes. The migration logic spreads cluster nodes across many hosts, reducing single‑point failure risk. The same automation can accelerate host decommissioning by pre‑migrating all nodes. Future enhancements include automated cluster deployment, vertical scaling, and automatic recovery from host failures.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
