How DeWu Cut Redis RT by 90% with a Full‑Scale Self‑Built Redesign
The article details the three-year evolution of DeWu's self-built Redis platform, covering architecture, access-method changes, version upgrades, proxy rate limiting, and automated operations. Together these changes reduced request latency by over 90% while supporting more than 1,000 clusters, 160 TB of memory, and a peak of nearly 10 million QPS.
For over three years, DeWu has continuously iterated on its self-built Redis system, focusing on architecture, performance, cost reduction, and automated operations. The platform now manages over 1,000 clusters, 160 TB of memory, and more than 100,000 data nodes, and peaks at nearly 10 million QPS.
The core architecture consists of three components: Redis‑server for data storage (supporting master‑slave and multi‑AZ deployment), Redis‑proxy which abstracts the cluster and provides same‑zone read preference and rate‑limiting features, and ConfigServer for high‑availability management.
Access methods evolved from domain name + load balancer (LB) to a self-developed DRedis SDK that connects directly to the proxy. The original LB approach suffered from a 5 Gb traffic limit, traffic skew, susceptibility to network attacks, and TCP-level errors. The SDK removes the LB from the request path and with it these bottlenecks, selects a same-zone proxy by default, and supports near-read (reading from a replica in the caller's own zone). It is available for Java (Redisson-based, with a Jedis-based version planned), Golang (go-redis v9), and soon C++ (brpc).
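The same-zone default can be pictured with a minimal sketch. This is not the DRedis SDK's code: the `Proxy` type, the `pick_proxy` function, and the zone labels are hypothetical names introduced here, and the fallback to any proxy when the local zone has none is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Proxy:
    """Hypothetical description of one Redis-proxy endpoint."""
    address: str
    zone: str


def pick_proxy(proxies, client_zone, rng=random):
    """Prefer a proxy in the client's zone; otherwise fall back to any proxy."""
    if not proxies:
        raise ValueError("no proxies available")
    same_zone = [p for p in proxies if p.zone == client_zone]
    # Assumption: cross-zone traffic is acceptable only when the local
    # zone has no proxy at all.
    candidates = same_zone or list(proxies)
    return rng.choice(candidates)


proxies = [
    Proxy("10.0.0.1:6379", "az-a"),
    Proxy("10.0.0.2:6379", "az-a"),
    Proxy("10.0.1.1:6379", "az-b"),
]
print(pick_proxy(proxies, "az-a").zone)  # always a proxy from az-a
```

Because the client chooses the endpoint itself, no shared load balancer sits in the request path, which is where the bandwidth cap and traffic skew described above came from.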
The DRedis SDK brings noticeable stability and performance gains; after one community application upgraded to it, RT dropped dramatically.
Adoption is widespread: more than 300 business domains have migrated, with over 300 Java and Golang instances now using the SDK.
For same-city active-active deployments (two availability zones in one city), near-read means writing centrally but reading locally. The proxy automatically prefers servers in its own zone. Before the SDK, the LB could not route requests to a same-zone proxy, so near-read required a service-based approach that introduced operational complexity, higher cost, increased RT, load imbalance, and limited customization. The SDK offers two mechanisms to enable near-read: (1) exact-key or key-prefix matching, and (2) declarative markers in code, namely the @NearRead annotation in Java or about 80 dedicated near-read commands in Golang.
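The first mechanism, exact-key or prefix matching, can be sketched as follows. The names `NearReadRules`, `route`, and `READ_COMMANDS` are hypothetical, the command list is an illustrative subset, and sending non-matching reads to the master is an assumption rather than documented SDK behavior.

```python
from dataclasses import dataclass, field


@dataclass
class NearReadRules:
    """Hypothetical rule set: keys eligible for near-read."""
    exact_keys: set = field(default_factory=set)
    prefixes: tuple = ()

    def matches(self, key: str) -> bool:
        # str.startswith accepts a tuple of prefixes.
        return key in self.exact_keys or key.startswith(self.prefixes)


# Illustrative subset of read-only commands, not the SDK's real list.
READ_COMMANDS = {"GET", "MGET", "HGET", "HGETALL", "EXISTS"}


def route(command: str, key: str, rules: NearReadRules) -> str:
    """Return 'nearby' for matching reads and 'master' for everything else."""
    if command.upper() in READ_COMMANDS and rules.matches(key):
        return "nearby"
    # Writes always go to the master ("write centrally").
    return "master"


rules = NearReadRules(exact_keys={"config:home"}, prefixes=("feed:",))
print(route("GET", "feed:42", rules))   # nearby
print(route("SET", "feed:42", rules))   # master
print(route("GET", "order:1", rules))   # master
```

Keeping the rule check on the client side means only keys that tolerate replica lag are read locally, while everything else keeps read-your-writes behavior against the master.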
Redis-server versions progressed from 4.0 to 6.2, with 6.2 now the default for new clusters. According to the article, both versions in DeWu's build support multithreaded I/O, real-time hot-key statistics, and asynchronous slot migration (upstream Redis introduced threaded I/O only in 6.0, so on 4.0 this would be a DeWu-specific capability). The I/O thread pool noticeably improves read/write throughput. Hot-key statistics are displayed on the management console for quick diagnosis.
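The source does not describe how hot keys are counted inside Redis-server. As a purely illustrative sketch of the idea, the hypothetical `HotKeyWindow` class below counts key accesses during one statistics window and reports the keys at or above a threshold; the class name, the threshold, and the window reset are all assumptions.

```python
from collections import Counter


class HotKeyWindow:
    """Hypothetical per-window access counter for hot-key detection."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, key: str) -> None:
        """Count one access to the given key."""
        self.counts[key] += 1

    def hot_keys(self):
        """Return (key, count) pairs at or above the threshold, hottest first."""
        return [(k, n) for k, n in self.counts.most_common() if n >= self.threshold]

    def reset(self) -> None:
        """Start a new statistics window."""
        self.counts.clear()


window = HotKeyWindow(threshold=3)
for key in ["item:1", "item:1", "item:2", "item:1"]:
    window.record(key)
print(window.hot_keys())  # [('item:1', 3)]
```

An exact counter like this grows with the number of distinct keys; a production implementation would more likely bound memory with sampling or an approximate top-k structure.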
Asynchronous horizontal scaling reduces migration time for billions of keys from an average of four hours to ten minutes (240 minutes down to 10, roughly a 24× speedup, i.e. more than 20×) while cutting the RT impact during migration by more than 90%.
Instance architecture defaults to a clustered setup hidden behind the proxy, making the cluster appear as a single Redis instance to applications. For low‑volume scenarios, a single‑master‑slave mode is also available. Replication specifications include one‑master‑one‑slave (default), one‑master‑two‑slaves, and one‑master‑three‑slaves, each with defined placement across availability zones.
The proxy provides fine-grained rate limiting at three levels: per-key QPS thresholds, per-command QPS thresholds, and a command blacklist that disables specific commands on selected clusters.
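The three controls can be combined in one admission check, sketched below. This is not the Redis-proxy implementation: the `ProxyLimiter` class and its parameters are hypothetical, and the fixed one-second counting window is an assumed algorithm (the source does not say whether the proxy uses fixed windows, token buckets, or something else).

```python
import time
from collections import defaultdict


class ProxyLimiter:
    """Hypothetical limiter: command blacklist plus per-key and per-command QPS caps."""

    def __init__(self, key_qps=None, command_qps=None, blacklist=(), clock=time.monotonic):
        self.key_qps = dict(key_qps or {})
        self.command_qps = dict(command_qps or {})
        self.blacklist = {c.upper() for c in blacklist}
        self.clock = clock
        self._window = None            # current one-second window id
        self._counts = defaultdict(int)

    def allow(self, command, key=None):
        command = command.upper()
        if command in self.blacklist:
            return False               # disabled on this cluster
        window = int(self.clock())
        if window != self._window:     # new second: reset all counters
            self._window = window
            self._counts.clear()
        checks = []
        if command in self.command_qps:
            checks.append((("cmd", command), self.command_qps[command]))
        if key is not None and key in self.key_qps:
            checks.append((("key", key), self.key_qps[key]))
        if any(self._counts[slot] >= limit for slot, limit in checks):
            return False               # at least one threshold already reached
        for slot, _ in checks:
            self._counts[slot] += 1
        return True


limiter = ProxyLimiter(key_qps={"hot:item": 2}, command_qps={"KEYS": 1}, blacklist=["FLUSHALL"])
print(limiter.allow("FLUSHALL"))  # False: blacklisted command
```

Counters are incremented only after every applicable check passes, so a request rejected by the per-key limit does not consume per-command budget.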
Automation is a cornerstone of the platform. The system offers a lifecycle management suite: resource-pool balancing based on memory and CPU usage, automatic vertical scaling when memory usage exceeds 80%, automated deployment and decommissioning via work orders, and alarm handling that automatically restarts failed nodes. Over 80% of operational scenarios, including instance creation, password and permission requests, key deletion, scaling, and cluster shutdown, are fully automated.
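The 80% memory trigger for automatic vertical scaling can be expressed as a small decision function. Only the 80% threshold comes from the source; the function name `plan_vertical_scale` and the 1.5× growth factor are assumptions made for illustration.

```python
def plan_vertical_scale(used_bytes, max_bytes, threshold=0.80, growth=1.5):
    """Return a new memory limit if usage exceeds the threshold, else None.

    threshold=0.80 follows the article; growth=1.5 is an assumed policy.
    """
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    if used_bytes / max_bytes <= threshold:
        return None                      # within budget, nothing to do
    return int(max_bytes * growth)


GIB = 1024 ** 3
print(plan_vertical_scale(7 * GIB, 8 * GIB) // GIB)  # 12 (87.5% used, grow 8 GiB to 12 GiB)
print(plan_vertical_scale(4 * GIB, 8 * GIB))         # None (50% used)
```

In a real platform this check would run against monitoring data and would also need to confirm that the resource pool has capacity before resizing the instance.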
In summary, DeWu's self-built Redis platform now runs Redis 6.2 by default with multithreaded I/O, hot-key monitoring, and asynchronous horizontal scaling. It offers both clustered and single-master-slave deployment options, flexible replication specifications, proxy-level rate limiting, and an automated operations platform, which together deliver a latency reduction of over 90% and significant cost savings.
dbaplus Community
