Mastering Redis at Scale: Real‑World Use Cases, Performance Tweaks, and High‑Availability Strategies

This technical guide, compiled by Tencent engineers, covers common Redis data-structure use cases and latency and memory considerations, compares distributed deployment options, and offers practical optimization, high-availability, and troubleshooting techniques for large-scale Redis (Codis) deployments.

dbaplus Community

1. Common Redis Use Cases

Redis supports several core data structures, each suited to specific scenarios:

String: counters, user-ID mapping, uniqueness checks, bitmaps.

Hash: storing object attributes such as user profiles.

List: comment storage, message queues.

Set: eligibility checks (e.g., reward claims), data deduplication.

Sorted Set: leaderboards, delayed queues.

Other: distributed lock design (see two referenced articles on Redis Redlock reasoning).

2. Redis Selection Considerations

Latency is the sum of backend request overhead, network delay, and database lookup time. Cutting the number of requests and the cost of each lookup lowers latency substantially; a single-threaded, in-memory Redis instance can handle more than 100k ops/s, well above what disk-based databases typically achieve.

Memory consumption per simple SET operation (e.g., "hello" → "world") involves four structures:

dictEntry: 24 B allocated as 32 B.

redisObject: 16 B allocated as 16 B.

SDS key: 5 B + 9 B header = 14 B allocated as 16 B.

SDS value: 5 B + 9 B header = 14 B allocated as 16 B.

Total per entry ≈ 80 bytes.
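The arithmetic above can be reproduced with a short sketch. It assumes the allocator rounds each request up to the next size class from a fixed list (8, 16, 32, 48, 64, ... bytes, modeled on jemalloc's small classes); the list, the function names, and the 9-byte SDS overhead constant are illustrative and taken from the figures quoted in this section, not read from a running Redis.

```python
# Assumed small-allocation size classes in bytes (modeled on jemalloc).
SIZE_CLASSES = [8, 16, 32, 48, 64, 80, 96, 112, 128]

# Per-string SDS overhead in bytes, as quoted in the text above.
SDS_OVERHEAD = 9


def allocated_size(requested: int) -> int:
    """Round a request up to the smallest size class that fits it."""
    for size in SIZE_CLASSES:
        if requested <= size:
            return size
    raise ValueError("request exceeds the size classes modeled here")


def set_entry_cost(key: bytes, value: bytes) -> dict:
    """Estimate the memory taken by one SET of a short key and value."""
    parts = {
        "dictEntry": allocated_size(24),
        "redisObject": allocated_size(16),
        "SDS key": allocated_size(len(key) + SDS_OVERHEAD),
        "SDS value": allocated_size(len(value) + SDS_OVERHEAD),
    }
    parts["total"] = sum(parts.values())
    return parts


cost = set_entry_cost(b"hello", b"world")
for name, size in cost.items():
    print(f"{name}: {size} B")
```

Under these assumptions the four allocations are 32, 16, 16, and 16 bytes, which matches the 80-byte total above. Longer keys or values move into larger size classes, so the per-entry cost grows in steps rather than byte by byte.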

3. Comparison of Three Redis Distributed Solutions

Of the three, Codis, an open-source product, offers low operational cost and smooth scaling.

4. Redis Distributed Architecture

Codis adopts a two-layer design: proxy + storage. Compared with a CKV-only design, the proxy layer simplifies the architecture and lets the system scale by adding proxies rather than expanding the data layer, though it adds deployment overhead when connection counts are low.

5. Redis Bottlenecks and Optimizations

5.1 HGETALL Performance

Fetching three months of data with one HGETALL per day (90 calls) at about 30 ms per call takes 90 × 30 ms = 2700 ms in total, far slower than the nanosecond-level speed expected of in-memory reads. The real bottleneck is I/O (the network card and user-kernel copies), not CPU.

5.2 Pipeline Optimization

The original approach performed 90 I/O operations, one per HGETALL. With a Redis pipeline, this dropped to 6 I/O operations, saving about 1000 ms.

The Go client redisgo buffers outgoing commands in a bufio buffer and flushes to the kernel once the buffered data exceeds the default 4096 bytes. Each HGETALL command occupies 45 bytes, so 90 days × 45 B = 4050 B < 4096 B: all 90 commands fit in the buffer and can be written in a single I/O.
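The buffer arithmetic can be checked with a small simulation. This is a sketch in Python rather than the Go client itself: `CountingBuffer` is a hypothetical stand-in for a 4096-byte write buffer, and the key pattern `user:stats:<date>` is an assumption chosen so that each command, encoded in the Redis wire format, comes to the 45 bytes quoted above.

```python
from datetime import date, timedelta

BUFFER_SIZE = 4096  # default buffer size quoted in the text


def encode_command(*args: bytes) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    out = b"*%d\r\n" % len(args)
    for arg in args:
        out += b"$%d\r\n%s\r\n" % (len(arg), arg)
    return out


class CountingBuffer:
    """Hypothetical write buffer that counts flushes to the kernel."""

    def __init__(self, capacity: int = BUFFER_SIZE) -> None:
        self.capacity = capacity
        self.pending = bytearray()
        self.io_count = 0

    def write(self, data: bytes) -> None:
        # Flush first if the new data would not fit in the buffer.
        if len(self.pending) + len(data) > self.capacity:
            self.flush()
        self.pending += data

    def flush(self) -> None:
        if self.pending:
            self.io_count += 1
            self.pending.clear()


start = date(2024, 1, 1)
commands = [
    encode_command(
        b"HGETALL",
        b"user:stats:" + (start + timedelta(days=i)).isoformat().encode(),
    )
    for i in range(90)
]

# Pipelined: queue every command, flush once at the end.
pipelined = CountingBuffer()
for cmd in commands:
    pipelined.write(cmd)
pipelined.flush()

# Serial: flush after every command, as a request/response loop would.
serial = CountingBuffer()
for cmd in commands:
    serial.write(cmd)
    serial.flush()

print(len(commands[0]), sum(map(len, commands)))
print(pipelined.io_count, serial.io_count)
```

With these assumed keys the 90 commands total 4050 bytes, so the pipelined path issues one write while the serial path issues 90. A longer key or a 91st day would push the total past 4096 bytes and add a second write.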

5.3 Throughput vs QPS Trade‑off

Increasing pipeline command count improves throughput but may cause QPS to drop due to network saturation and Redis command‑queue buildup.

6. High Availability and Disaster Recovery

Reliability relies on Redis persistence (RDB snapshots and AOF logs) and remote hot‑backup via master‑slave replication. Periodic cold‑backup (48‑hour rolling local backup plus an external system) protects against accidental data loss.

Codis achieves HA through:

Proxy cluster failover via ZooKeeper or L5.

Redis cluster HA using Sentinel.

Sentinel monitors masters and slaves, promotes a replica when a master fails, and continues serving requests.

7. Split‑Brain Handling

Split‑brain occurs when network partitions cause multiple masters. To mitigate:

Deploy five Sentinel nodes; a failover is triggered only when four of them agree that the master is down.

Agents on Redis servers detect loss of ZooKeeper connectivity and issue a downgrade command, sacrificing availability for consistency.
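To see why a quorum of four out of five Sentinels guards against split-brain, the sketch below enumerates every way a network partition can divide five Sentinels into two groups and checks which group could still reach the quorum. It is a simplified model with illustrative names; real Sentinel failover involves further steps, such as leader election, that are not modeled here.

```python
SENTINELS = 5  # number of Sentinel nodes, from the text
QUORUM = 4     # agreeing Sentinels required before a failover


def failover_allowed(votes_down: int, quorum: int = QUORUM) -> bool:
    """A side may fail over only if enough Sentinels agree the master is down."""
    return votes_down >= quorum


def partition_outcomes(total: int = SENTINELS, quorum: int = QUORUM) -> dict:
    """Map each two-way split (side_a, side_b) to whether each side can fail over."""
    outcomes = {}
    for side_a in range(total + 1):
        side_b = total - side_a
        outcomes[(side_a, side_b)] = (
            failover_allowed(side_a, quorum),
            failover_allowed(side_b, quorum),
        )
    return outcomes


outcomes = partition_outcomes()
for split, allowed in outcomes.items():
    print(split, allowed)
```

In this model no split lets both sides fail over, so two masters cannot be promoted at once. The cost is availability: a 3/2 split leaves neither side able to promote a replica.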

8. Practical Pitfalls and Fixes

8.1 Master‑Slave Switch Verification

After a switch, confirm the new configuration with:

grep "Generated by CONFIG REWRITE" -C 10 {redis_conf_path}/*.conf

8.2 Data Migration

Before large migrations, back up both the data and the shard metadata. Use slotsmgrt-async-status on the Codis server to monitor ongoing shard moves; a shard with a million keys typically finishes in about 20 seconds.

8.3 Exception Handling After Crash

If Redis crashes and the AOF is left partially written, repair it with:

VIP_CodisAdmin/bin/redis-check-aof --fix appendonly.aof

Then restart the instance.

8.4 Client Timeouts

Potential causes:

Network congestion – check with NOC tools.

Listen queue overflow – inspect net.core.somaxconn; if it is too low, raise it in /etc/sysctl.conf and apply the change with sysctl -p.

Slow queries – use slowlog get and tune slowlog-log-slower-than and slowlog-max-len.

8.5 Fork Overhead

Redis forks a child process for RDB snapshots and AOF rewrites. Fork duration scales with instance memory, so keep each Redis instance under 10 GB. Prefer physical machines or virtualization that handles fork efficiently, and consider disabling transparent huge pages:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

8.6 Accidental FLUSHDB

If appendonly is no, raise the RDB save thresholds so that no new snapshot is triggered, back up the RDB file, and avoid running BGREWRITEAOF manually. If appendonly is yes, raise the AOF rewrite thresholds (or kill the process) so that the AOF is not rewritten, then remove the FLUSHDB command from a backup copy of the AOF before restoring from it.
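Removing the FLUSHDB entry from an AOF backup can be sketched as follows. The code assumes the AOF is a plain sequence of commands in the Redis wire format (no RDB preamble) and operates on the bytes of a backup copy; the helper names are illustrative. It parses each command by its length prefixes instead of doing a text search, so a value that happens to contain the word flushdb is left alone.

```python
def encode_command(*args: bytes) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    out = b"*%d\r\n" % len(args)
    for arg in args:
        out += b"$%d\r\n%s\r\n" % (len(arg), arg)
    return out


def iter_commands(buf: bytes):
    """Yield (args, raw_bytes) for each command in a RESP-encoded AOF."""
    pos = 0
    while pos < len(buf):
        start = pos
        if buf[pos:pos + 1] != b"*":
            raise ValueError(f"expected '*' at offset {pos}")
        line_end = buf.index(b"\r\n", pos)
        argc = int(buf[pos + 1:line_end])
        pos = line_end + 2
        args = []
        for _ in range(argc):
            if buf[pos:pos + 1] != b"$":
                raise ValueError(f"expected '$' at offset {pos}")
            line_end = buf.index(b"\r\n", pos)
            length = int(buf[pos + 1:line_end])
            pos = line_end + 2
            args.append(buf[pos:pos + length])
            pos += length + 2  # skip the argument and its trailing CRLF
        yield args, buf[start:pos]


def drop_flushdb(buf: bytes) -> bytes:
    """Return the AOF bytes with every FLUSHDB command removed."""
    return b"".join(
        raw
        for args, raw in iter_commands(buf)
        if not (args and args[0].upper() == b"FLUSHDB")
    )


# Example: a tiny AOF that ends with the accidental FLUSHDB.
aof = (
    encode_command(b"SELECT", b"0")
    + encode_command(b"SET", b"hello", b"world")
    + encode_command(b"FLUSHDB")
)
cleaned = drop_flushdb(aof)
print(cleaned)
```

In practice one would read the backup file in binary mode, write the cleaned bytes to a new file, and check the result before restoring from it.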

8.7 Switching from RDB to AOF

Do not edit the config file directly. Instead, back up the RDB file, enable AOF via CONFIG SET appendonly yes, rewrite the config with CONFIG REWRITE, and trigger BGSAVE or BGREWRITEAOF to persist data.

---

Author: Tencent Technical Team (source: 技术领导力)
