
How Our Self‑Built Redis Evolved: Architecture, SDK, and Performance Gains

This article details the three‑year evolution of a self‑built Redis service: its massive scale, architectural redesign, the migration from load‑balancer (LB) access to a custom DRedis SDK, same‑city active‑active near‑read support, Redis‑server version upgrades, instance specifications, proxy rate limiting, and the extensive automation that together boosts performance while cutting costs.


System Architecture

The self‑built Redis service consists of three core components:

Redis‑server: stores data, supports master‑slave replication across multiple availability zones, and provides high availability and performance.

Redis‑proxy: a gateway that presents the cluster as a single endpoint. It implements zone‑aware near‑read, key‑level and command‑level rate limiting, and command blacklists.

ConfigServer: coordinates high‑availability metadata such as proxy topology.

Clients can access the cluster via domain + LB, via service‑based access, or through the recommended direct SDK connection.

Access Method Evolution

Initially traffic entered through a load balancer (LB) before reaching the proxy. The LB imposed a 5 Gbps ceiling, was vulnerable to network attacks, and caused TCP‑level errors under heavy load.

To eliminate these bottlenecks, a custom DRedis SDK was built. The SDK registers with a service registry, fetches proxy topology from ConfigServer, and selects a proxy using weighted round‑robin, preferring proxies in the same zone. It extends the RESP protocol with custom commands that enable optional near‑read on a per‑key or per‑request basis.
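A minimal sketch of how such zone‑aware weighted round‑robin selection can work. The DRedis SDK's actual types are not public, so the Proxy and Selector names, fields, and fallback policy below are all assumptions.

```go
// Illustrative only: types, fields, and fallback policy are assumptions,
// not the DRedis SDK's real implementation.
package main

import (
	"fmt"
	"sync/atomic"
)

// Proxy describes one redis-proxy endpoint as the SDK might see it
// after fetching topology from ConfigServer.
type Proxy struct {
	Addr   string
	Zone   string // availability zone, e.g. "az-1"
	Weight int    // relative traffic share, assumed >= 1
}

// Selector picks a proxy by weighted round-robin, preferring the
// client's own zone and falling back to all proxies otherwise.
type Selector struct {
	localZone string
	proxies   []Proxy
	counter   atomic.Uint64
}

func (s *Selector) Pick() Proxy {
	// Keep only same-zone proxies; fall back to the full list if none.
	candidates := make([]Proxy, 0, len(s.proxies))
	for _, p := range s.proxies {
		if p.Zone == s.localZone {
			candidates = append(candidates, p)
		}
	}
	if len(candidates) == 0 {
		candidates = s.proxies
	}

	// Expand each proxy by its weight, then rotate atomically.
	var weighted []Proxy
	for _, p := range candidates {
		for i := 0; i < p.Weight; i++ {
			weighted = append(weighted, p)
		}
	}
	n := s.counter.Add(1)
	return weighted[int(n)%len(weighted)]
}

func main() {
	s := &Selector{
		localZone: "az-1",
		proxies: []Proxy{
			{Addr: "10.0.1.10:6379", Zone: "az-1", Weight: 2},
			{Addr: "10.0.1.11:6379", Zone: "az-1", Weight: 1},
			{Addr: "10.0.2.10:6379", Zone: "az-2", Weight: 2},
		},
	}
	for i := 0; i < 4; i++ {
		fmt.Println(s.Pick().Addr) // only az-1 proxies are chosen
	}
}
```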

[Figure: DRedis SDK Architecture]

Supported languages:

Java – built on Redisson (a Jedis‑based variant is planned)

Golang – built on go-redis v9

C++ – based on brpc (upcoming)

Near‑Read

Near‑read can be enabled globally or per request. In Java it is activated with the @NearRead annotation; in Golang the SDK provides ~80 "xxxNearby" commands. This allows fine‑grained routing of reads to proxies in the same availability zone, reducing latency and cost.
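To convey the shape of a per‑request near‑read in Go, here is a hypothetical wrapper over go-redis v9. The article only names the "xxxNearby" command family, so the GETNEARBY command string and the NearClient type are invented for illustration.

```go
// Hypothetical sketch: GETNEARBY and NearClient are assumptions, not the
// DRedis SDK's actual API; only the go-redis calls are real.
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// NearClient wraps a go-redis client and adds a near-read variant.
type NearClient struct {
	*redis.Client
}

// GetNearby sends an extended command (name assumed) instead of plain GET;
// a DRedis proxy would serve it from a replica in the caller's zone.
func (c *NearClient) GetNearby(ctx context.Context, key string) (string, error) {
	return c.Do(ctx, "GETNEARBY", key).Text()
}

func main() {
	c := &NearClient{redis.NewClient(&redis.Options{Addr: "127.0.0.1:6379"})}
	v, err := c.GetNearby(context.Background(), "user:42")
	fmt.Println(v, err)
}
```

In Java, the same per‑request toggle is expressed declaratively via the @NearRead annotation rather than a separate command family.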

Same‑City Active‑Active (Center‑Write, Near‑Read)

The SDK automatically prefers proxies in the same zone. If no proxy is available in the same zone, it falls back to cross‑zone proxies. Service‑based access can also enable near‑read, but it applies globally and cannot be toggled per request.

Redis‑Server Versions and Capabilities

Both Redis 4.0 (legacy) and Redis 6.2 (default for new clusters) are supported. Key capabilities:

Multithreaded I/O: Redis 6.2 natively supports I/O threading; the feature was back‑ported to 4.0. Tests show significant read/write throughput gains (the standard 6.x directives are sketched after this list).

Real‑time hot‑key statistics: hot‑key metrics are exposed in the management console for rapid troubleshooting.

Asynchronous horizontal scaling: slot migration runs concurrently, reducing a typical 4‑hour migration of billions of keys to ~10 minutes (≈20× faster) and cutting RT impact by >90%.
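For reference, these are the standard redis.conf directives that enable I/O threading in stock Redis 6.x; the back‑ported 4.0 build would use its own, undocumented settings.

```
# Standard Redis 6.x I/O threading (redis.conf); 1 disables threading.
io-threads 4             # total threads used for socket I/O
io-threads-do-reads yes  # also handle reads/parsing on I/O threads
```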

[Figure: Horizontal Scaling Performance]

Instance Architecture and Specifications

Clusters run in a proxy‑masked mode that is transparent to clients (the proxy hides the cluster topology behind a single endpoint), while a single‑node master‑slave mode is also available for lightweight workloads.

Replica configurations:

1‑master‑1‑slave (default)

1‑master‑2‑slaves

1‑master‑3‑slaves

Read‑write separation can be enabled to route reads to slaves, improving read throughput.
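The platform enables read‑write separation behind the proxy; for comparison, a plain go-redis v9 cluster client exposes an analogous client‑side switch. This is stock go-redis API, not the DRedis SDK.

```go
// Client-side read-write separation with plain go-redis v9, shown only as
// an analogy to the platform feature described above.
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:          []string{"10.0.1.10:6379"},
		ReadOnly:       true, // allow read-only commands on replicas
		RouteByLatency: true, // pick the lowest-latency node (implies ReadOnly)
	})
	defer rdb.Close()

	ctx := context.Background()
	// Writes still go to the master; this GET may be served by a slave.
	fmt.Println(rdb.Get(ctx, "user:42").Result())
}
```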

[Figure: Replica Configurations]

Proxy Rate Limiting

The proxy enforces:

Key‑level QPS thresholds (a per‑key token‑bucket sketch follows this list)

Command‑level QPS thresholds

Command blacklists to disable risky operations (e.g., large‑key scans) per cluster
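A minimal sketch of a per‑key token bucket like the one the proxy might use for key‑level thresholds. The limiter granularity, in‑memory storage, and rejection behavior here are assumptions, not the proxy's actual implementation.

```go
// Per-key token-bucket rate limiting; granularity and eviction policy
// are assumptions for illustration.
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

// keyLimiter lazily creates one token bucket per key.
type keyLimiter struct {
	mu     sync.Mutex
	limits map[string]*rate.Limiter
	qps    rate.Limit
	burst  int
}

func newKeyLimiter(qps float64, burst int) *keyLimiter {
	return &keyLimiter{
		limits: map[string]*rate.Limiter{},
		qps:    rate.Limit(qps),
		burst:  burst,
	}
}

// Allow reports whether one more command on this key fits the budget.
func (k *keyLimiter) Allow(key string) bool {
	k.mu.Lock()
	l, ok := k.limits[key]
	if !ok {
		l = rate.NewLimiter(k.qps, k.burst)
		k.limits[key] = l
	}
	k.mu.Unlock()
	return l.Allow()
}

func main() {
	lim := newKeyLimiter(2, 2) // 2 QPS per key, burst of 2
	for i := 0; i < 4; i++ {
		fmt.Println(lim.Allow("hot:key")) // true, true, false, false
	}
}
```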

Automation Operations

The platform provides end‑to‑end automation for the full Redis cluster lifecycle:

Provisioning, scaling, and decommissioning via work‑order approval.

Intelligent resource‑pool balancing based on memory and CPU utilization, with scheduled maintenance migrations.

Automatic vertical scaling when memory usage exceeds 80% (a trigger sketch follows this list).

Automated fault detection and recovery, including node restart on machine failure.
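A sketch of the 80% memory trigger described above; the metric source, the doubling step, and how the resulting work order is filed are all assumptions, since the article does not document the platform's actual policy.

```go
// Hypothetical vertical-scaling trigger: the 80% threshold comes from the
// article; the doubling step is an assumed policy.
package main

import "fmt"

// nextSpec returns the new memory limit and whether scaling is needed.
func nextSpec(usedBytes, maxBytes int64) (int64, bool) {
	const threshold = 0.8
	if float64(usedBytes) < threshold*float64(maxBytes) {
		return maxBytes, false // usage below threshold: keep current spec
	}
	return maxBytes * 2, true // assumed step: double the memory limit
}

func main() {
	newMax, scale := nextSpec(7<<30, 8<<30) // 7 GiB used of 8 GiB
	fmt.Println(newMax, scale)              // 17179869184 true
}
```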

[Figure: Automation Overview]

Scale Status (as of writing)

The service manages over 1,000 clusters, 160 TB of memory, more than 100,000 data nodes, and thousands of machines. Dozens of clusters exceed 1 TB per node, and the largest cluster handles close to ten million QPS.

Key Takeaways

The DRedis SDK removes LB bottlenecks, provides zone‑aware near‑read, and supports per‑request control via annotations or specialized commands.

Both Redis 4.0 and 6.2 offer multithreaded I/O, hot‑key monitoring, and fast asynchronous scaling.

Flexible deployment options (cluster mode, single‑node mode) and multiple replica specifications enable HA and read‑write separation.

Proxy rate‑limiting and command blacklisting improve stability under burst traffic.

Comprehensive automation reduces operational overhead and enables automatic vertical scaling.

Tags: Performance, SDK, Architecture, Cache, Automation, Database, Redis
Written by DeWu Technology

A platform for sharing and discussing tech knowledge, guiding you toward the cloud of technology.