Inside DeWu’s Self‑Built Redis: Architecture, Automation & High‑Availability
This article details DeWu's self‑built high‑performance distributed Redis cache system, covering its Proxy‑based architecture, core components like ConfigServer, Redis‑Proxy and Redis‑Server, the automated operations platform for deployment and scaling, as well as monitoring, alerting, stability measures and future roadmap.
Architecture Overview
The self‑built Redis system adopts a proxy architecture composed of three core components:
ConfigServer – a highly available service deployed across multiple availability zones using the Raft consensus protocol. It registers proxies, manages groups, adds/removes Redis‑Server instances, performs manual master‑slave switches, handles horizontal scaling and data migration, and conducts fault detection with automatic failover.
Redis‑Proxy – a stateless gateway that accepts client connections, parses the Redis RESP protocol, computes the target slot for each key (slot = crc32(key) % 1024), and forwards commands to the appropriate Redis‑Server instance. It supports same‑city active‑active routing, asynchronous dual‑write for migration, and horizontal scaling.
Redis‑Server – based on open‑source Redis with extensions for slot synchronization, asynchronous migration, and async‑fork optimizations. Nodes are organized into independent groups (one master + N slaves) following a Share‑Nothing design, and the whole cluster is divided into 1024 slots.
All components are deployed on at least three nodes across different zones to guarantee high availability. Custom TCP protocols are used for inter‑ConfigServer communication to reduce failover latency.
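The slot formula above can be sketched in a few lines of Go. This is a minimal illustration, not the proxy's actual code: the function name slotForKey is hypothetical, and the choice of the IEEE CRC-32 polynomial (and hashing the whole key, with no hash-tag handling) is an assumption, since the article only says crc32(key) % 1024.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// numSlots is the slot count stated in the article.
const numSlots = 1024

// slotForKey maps a key to a slot as slot = crc32(key) % 1024.
// Assumption: IEEE polynomial, whole key hashed (no hash tags).
func slotForKey(key string) uint32 {
	return crc32.ChecksumIEEE([]byte(key)) % numSlots
}

func main() {
	for _, k := range []string{"user:1001", "order:42"} {
		fmt.Printf("%s -> slot %d\n", k, slotForKey(k))
	}
}
```

Because the mapping depends only on the key, any stateless proxy instance computes the same slot for the same key, which is what lets proxies be added or removed freely.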
ConfigServer Details
ConfigServer runs a Raft cluster (minimum three nodes) in separate availability zones. Its responsibilities include:
Registering and deregistering Redis‑Proxy instances.
Creating and deleting groups, adding or removing Redis‑Server nodes.
Detecting Redis‑Server failures by periodically sending PING and INFO commands to each node.
Marking nodes as subjectively down on timeout, propagating this state, and promoting to objectively down when a majority of ConfigServers agree.
Executing automatic failover: selecting the best slave (based on health, slave‑priority, replication offset, and runid), promoting it with SLAVEOF NO ONE, and reconfiguring remaining slaves.
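The subjective/objective down decision described above can be sketched as a simple majority check. The type and function names below are hypothetical, and the article does not say how votes are exchanged between ConfigServers; this only shows the counting rule.

```go
package main

import "fmt"

// NodeState is the failure-detection state of one Redis-Server node.
type NodeState int

const (
	Up NodeState = iota
	SubjectivelyDown
	ObjectivelyDown
)

// evaluate combines the views of all ConfigServers for one node.
// downVotes counts ConfigServers whose PING/INFO probe timed out;
// total is the size of the ConfigServer cluster.
func evaluate(downVotes, total int) NodeState {
	switch {
	case downVotes*2 > total: // a strict majority agrees
		return ObjectivelyDown
	case downVotes > 0: // at least one ConfigServer saw a timeout
		return SubjectivelyDown
	default:
		return Up
	}
}

func main() {
	fmt.Println(evaluate(1, 3) == SubjectivelyDown) // one of three timed out
	fmt.Println(evaluate(2, 3) == ObjectivelyDown)  // majority reached
}
```

Requiring a majority means a single ConfigServer with a bad network path cannot trigger a failover on its own.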
The failover process typically completes within 12 seconds for a Redis‑Server master.
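The slave-selection step might look like the sketch below. The article lists the criteria (health, slave‑priority, replication offset, runid) but not their direction, so the ordering here is an assumption borrowed from Redis Sentinel: a lower non-zero priority wins, then the larger replication offset, then the lexicographically smaller runid. The Slave type and pickSlave function are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Slave holds the fields used to rank promotion candidates.
type Slave struct {
	Addr       string
	Healthy    bool
	Priority   int   // slave-priority; 0 means "never promote" (Sentinel convention)
	ReplOffset int64 // how much of the master's stream was received
	RunID      string
}

// pickSlave returns the best promotion candidate, or false if none qualifies.
func pickSlave(slaves []Slave) (Slave, bool) {
	var c []Slave
	for _, s := range slaves {
		if s.Healthy && s.Priority > 0 {
			c = append(c, s)
		}
	}
	if len(c) == 0 {
		return Slave{}, false
	}
	sort.Slice(c, func(i, j int) bool {
		a, b := c[i], c[j]
		if a.Priority != b.Priority {
			return a.Priority < b.Priority // lower priority value first
		}
		if a.ReplOffset != b.ReplOffset {
			return a.ReplOffset > b.ReplOffset // most up-to-date first
		}
		return a.RunID < b.RunID // deterministic tie-break
	})
	return c[0], true
}

func main() {
	best, ok := pickSlave([]Slave{
		{Addr: "10.0.0.1:6379", Healthy: true, Priority: 100, ReplOffset: 900, RunID: "b"},
		{Addr: "10.0.0.2:6379", Healthy: true, Priority: 100, ReplOffset: 1000, RunID: "c"},
		{Addr: "10.0.0.3:6379", Healthy: false, Priority: 1, ReplOffset: 2000, RunID: "a"},
	})
	fmt.Println(best.Addr, ok)
}
```

After the winner is promoted with SLAVEOF NO ONE, the remaining slaves would be pointed at it, as the list above describes.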
Redis‑Proxy Details
Redis‑Proxy is a lightweight, stateless service written in Go. It parses incoming RESP commands, calculates the slot using the CRC32 algorithm, and forwards the command to the Redis‑Server responsible for that slot. Performance optimizations include:
Reduction of temporary object allocations by ~20×, minimizing GC pressure.
Improved short‑connection handling, yielding ~10 % higher QPS.
Key features:
Same‑city active‑active – writes are directed to the nearest master, reads prefer the nearest healthy slave; if no local slave exists, reads fall back to the master or a remote slave.
Asynchronous dual‑write – during migration, the proxy writes to both the source cloud Redis and the target self‑built Redis without blocking the client. The dual‑write mode can be switched online (cloud‑first or self‑built‑first).
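The same-city read-routing rule from the key features above could be sketched as follows. The Node type and pickReadTarget function are hypothetical. The article says reads fall back to "the master or a remote slave" without giving an order, so trying the master before a remote slave is an assumption of this sketch.

```go
package main

import "fmt"

// Node is one Redis-Server instance within a group.
type Node struct {
	Addr    string
	Zone    string
	Master  bool
	Healthy bool
}

// pickReadTarget prefers a healthy slave in the client's zone, then the
// master, then any healthy remote slave (fallback order is assumed).
func pickReadTarget(group []Node, clientZone string) (Node, bool) {
	var master, remote *Node
	for i := range group {
		n := &group[i]
		if !n.Healthy {
			continue
		}
		switch {
		case n.Master:
			master = n
		case n.Zone == clientZone:
			return *n, true // nearest healthy slave wins immediately
		case remote == nil:
			remote = n
		}
	}
	if master != nil {
		return *master, true
	}
	if remote != nil {
		return *remote, true
	}
	return Node{}, false
}

func main() {
	group := []Node{
		{Addr: "m:6379", Zone: "zone-a", Master: true, Healthy: true},
		{Addr: "s1:6379", Zone: "zone-b", Healthy: true},
	}
	n, _ := pickReadTarget(group, "zone-b")
	fmt.Println(n.Addr) // local slave in zone-b
}
```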
Redis‑Server Details
Redis‑Server extends the official Redis codebase with:
Slot synchronization and asynchronous migration capabilities.
Share‑Nothing group architecture: each group contains one master and multiple slaves that do not communicate with other groups.
Async‑fork support: fork‑related operations (AOF rewrite, RDB snapshot, full sync) complete in ~200 µs, independent of data size, reducing latency spikes (TP100 ≈ 1‑2 ms) and cutting fork time by 98 % compared to upstream Redis.
The server retains all native Redis data structures (String, Hash, List, Set, ZSet), AOF persistence, master‑slave replication, and Lua scripting.
Automation Operations Platform
The platform provides end‑to‑end management of Redis instances, consisting of:
Redis Management Console – visual UI for creating, expanding, and migrating instances.
Kv‑Admin – schedules deployment tasks, recommends machines, allocates ports, binds SLBs, lists instances, performs offline data analysis, and generates resource reports.
Kv‑Agent – runs on each ECS, executes deployment/start‑stop commands, and exports metrics for Prometheus.
APM / Prometheus – collects and visualizes metrics from all components.
Deployment rules ensure:
ConfigServer spans three zones.
Redis‑Server and Redis‑Proxy instances of the same cluster are not co‑located on a single ECS.
Memory usage per ECS does not exceed 90 % of its capacity.
Slot distribution is automatically rebalanced when groups are added or removed.
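Two of the deployment rules above (no co-location within a cluster, and the 90 % memory ceiling) can be expressed as a placement check. This is a hedged sketch: the ECS struct, its fields, and canPlace are hypothetical, and the co-location rule is interpreted here as "at most one instance of a given cluster per host".

```go
package main

import "fmt"

// ECS describes one host as the scheduler might see it.
type ECS struct {
	Name       string
	MemTotalMB int64
	MemUsedMB  int64           // memory already committed to instances
	Clusters   map[string]bool // clusters that already have an instance here
}

// maxMemPercent is the per-host memory ceiling stated in the article.
const maxMemPercent = 90

// canPlace reports whether an instance of cluster needing needMB fits on h.
func canPlace(h ECS, cluster string, needMB int64) bool {
	if h.Clusters[cluster] {
		return false // same cluster already present on this host
	}
	// Integer arithmetic avoids floating-point rounding at the boundary.
	return (h.MemUsedMB+needMB)*100 <= h.MemTotalMB*maxMemPercent
}

func main() {
	h := ECS{Name: "ecs-1", MemTotalMB: 16000, MemUsedMB: 10000,
		Clusters: map[string]bool{"orders": true}}
	fmt.Println(canPlace(h, "users", 4000))  // fits under 90 %
	fmt.Println(canPlace(h, "users", 5000))  // would exceed 90 %
	fmt.Println(canPlace(h, "orders", 1000)) // co-location conflict
}
```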
Scaling operations:
Vertical scaling – adjusts maxmemory on Redis‑Server nodes.
Horizontal scaling – adds new groups (master+slaves), rebalances slots, and triggers asynchronous data migration. Progress is visualized in the console.
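The slot-rebalancing part of horizontal scaling can be illustrated with a small planner that decides which slots move to the new groups. This sketch covers scale-out only, and the Move type and planRebalance function are hypothetical; the article does not describe the platform's actual algorithm. It aims for an even split while moving as few slots as possible.

```go
package main

import "fmt"

const numSlots = 1024

// Move describes one slot migration from one group to another.
type Move struct{ Slot, From, To int }

// planRebalance takes owner[slot] = current group index and the group
// count after scale-out (indices 0..groups-1), and returns the moves
// needed so that every group holds either floor or ceil of slots/groups.
func planRebalance(owner []int, groups int) []Move {
	target := make([]int, groups)
	for g := range target {
		target[g] = len(owner) / groups
		if g < len(owner)%groups {
			target[g]++ // spread the remainder over the first groups
		}
	}
	count := make([]int, groups)
	for _, g := range owner {
		count[g]++
	}
	var moves []Move
	to := 0 // next group that still has room
	for slot, g := range owner {
		if count[g] <= target[g] {
			continue // this group is not over its target
		}
		for count[to] >= target[to] {
			to++
		}
		moves = append(moves, Move{Slot: slot, From: g, To: to})
		count[g]--
		count[to]++
	}
	return moves
}

func main() {
	owner := make([]int, numSlots)
	for s := numSlots / 2; s < numSlots; s++ {
		owner[s] = 1 // two groups, 512 slots each
	}
	moves := planRebalance(owner, 3) // add a third group
	fmt.Println(len(moves), "slots to migrate")
}
```

Each planned move would then be carried out by the asynchronous migration mechanism, with the console showing progress.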
Monitoring & Alerting
Metrics are collected for:
ECS: CPU, load, memory, network traffic, packet loss, disk I/O.
Proxy: QPS, latency percentiles (TP999/TP9999/TP100), active connections, CPU, memory, GC count, goroutine count.
Server: CPU, memory, network, connections, QPS, key count, hit rate, request latency.
Alerting covers resource saturation, node failures, master‑slave inconsistencies, and SLB traffic anomalies.
Stability Governance
To mitigate resource contention:
ECS instances are labeled and isolated per business tag.
Proxy CPU usage can be limited via the GOMAXPROCS setting.
Health checks run periodically, and a scoring system evaluates instances based on weighted metrics (CPU, memory, QPS, latency, etc.) to surface risky instances. Fault‑injection drills have demonstrated recovery times of ~12 s for Redis‑Server master failover and ~5 s for Proxy failures.
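A weighted scoring scheme of the kind described could be sketched as below. The article does not give the real weights, limits, or formula, so everything here (the Metric type, riskScore, the 0 to 100 scale, and the example numbers) is hypothetical and only shows the general shape of a weighted score.

```go
package main

import "fmt"

// Metric is one observed value with its alerting limit and weight.
type Metric struct {
	Name   string
	Value  float64 // current reading
	Limit  float64 // reading considered fully saturated
	Weight float64 // relative importance
}

// riskScore returns 0 (idle) to 100 (every metric at or above its limit)
// as the weighted average of each metric's utilisation.
func riskScore(ms []Metric) float64 {
	var sum, wsum float64
	for _, m := range ms {
		if m.Limit <= 0 || m.Weight <= 0 {
			continue // skip misconfigured metrics
		}
		u := m.Value / m.Limit
		if u > 1 {
			u = 1
		}
		if u < 0 {
			u = 0
		}
		sum += m.Weight * u
		wsum += m.Weight
	}
	if wsum == 0 {
		return 0
	}
	return 100 * sum / wsum
}

func main() {
	score := riskScore([]Metric{
		{Name: "cpu", Value: 50, Limit: 100, Weight: 1},
		{Name: "memory", Value: 100, Limit: 100, Weight: 1},
	})
	fmt.Println(score)
}
```

Instances whose score crosses a chosen threshold would be the ones surfaced for review.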
Future Work
Upgrade Redis‑Server to the latest community version (e.g., 7.0) to reduce custom maintenance.
Add hot‑key statistics and a local cache layer for frequently accessed keys.
Rewrite the high‑QPS Proxy in Rust to eliminate Go GC bottlenecks.
dbaplus Community
