When a 200GB Redis Crashes: Real‑World Lessons on Cache Design and Governance
A large‑scale system that stacked a relational database, MongoDB, and a 200GB Redis cache suffered cascading failures, prompting a thorough analysis of Redis misuse and a redesign that introduces monitoring, controlled client libraries, proxy layers, sharding, and robust high‑availability strategies.
Case Overview
A product line built a massive price‑storage system using a relational database, MongoDB for document storage, and a 200GB Redis cache to accelerate reads. The architecture formed a three‑tier stack: DB → MongoDB → Redis.
Redis absorbed the highest request volume, which made it a single point of failure. When the MongoDB cluster crashed because of a kernel bug, Redis kept serving traffic. A Redis outage, however, would push all traffic onto the underlying database and risk total collapse.
Problem Analysis
The core issue is over‑reliance on Redis. Its convenience led to indiscriminate use across diverse workloads, turning a cache into an implicit storage layer without proper governance. Specific symptoms included:
Redis blocked by heavy KEYS commands
Keepalived virtual‑IP failover failures
CPU saturation from Redis‑based calculations
Master‑slave sync failures triggering full sync and network saturation
Connection‑count explosions
These problems were amplified by a lack of monitoring, missing persistence (AOF and RDB both disabled), and an ad‑hoc logging service that stored millions of log entries in Redis. A single 7 MB log entry eventually choked the cache.
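The first symptom, Redis blocked by heavy KEYS commands, comes from KEYS walking the whole keyspace in a single call. A common alternative is cursor-based SCAN, which returns a small batch per call. The sketch below is illustrative and not taken from the team's code: iter_keys and FakeRedis are hypothetical names, and the client is assumed to expose a scan(cursor, match, count) method shaped like the one in redis-py. The in-memory fake exists only so the example runs without a server.

```python
from typing import Iterator


def iter_keys(client, pattern: str = "*", batch: int = 500) -> Iterator[str]:
    """Yield matching keys batch by batch using cursor-based SCAN."""
    cursor = 0
    while True:
        # Each call returns the next cursor and one batch of keys.
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch)
        for key in keys:
            yield key
        if cursor == 0:  # a cursor of 0 means the iteration is complete
            break


class FakeRedis:
    """In-memory stand-in (hypothetical) that only supports prefix patterns."""

    def __init__(self, keys):
        self._keys = sorted(keys)
        self.calls = 0

    def scan(self, cursor=0, match=None, count=10):
        self.calls += 1
        chunk = self._keys[cursor:cursor + count]
        next_cursor = cursor + count
        if next_cursor >= len(self._keys):
            next_cursor = 0
        prefix = (match or "*").rstrip("*")
        return next_cursor, [k for k in chunk if k.startswith(prefix)]


if __name__ == "__main__":
    fake = FakeRedis([f"price:{i}" for i in range(25)] + ["log:1", "log:2"])
    print(len(list(iter_keys(fake, "price:*", batch=10))), "keys,", fake.calls, "calls")
```

Against a real server, SCAN may return the same key more than once, so callers that need uniqueness should deduplicate. redis-py also ships a scan_iter helper that wraps the same loop.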
Proposed Solutions
To regain control, the team outlined five key actions:
Deploy a comprehensive monitoring system that alerts before issues become critical.
Introduce a custom Redis client that records operation metrics (latency, key size, errors) and enforces usage policies.
Reclassify Redis from a storage role to a pure cache role, limiting its responsibilities.
Implement a bespoke persistence layer built on the Redis protocol for cases that truly need durability.
Design high‑availability patterns tailored to each usage scenario rather than a one‑size‑fits‑all approach.
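The second action, a client that records operation metrics and enforces usage policies, can be sketched as a thin wrapper around any backend with get and set methods. Everything here is an assumption made for illustration: the names MeteredCache, OpRecord, PolicyViolation, and DictBackend are hypothetical, and the 1 MB value limit is an arbitrary example rather than a figure from the article.

```python
import time
from dataclasses import dataclass
from typing import List, Optional

MAX_VALUE_BYTES = 1024 * 1024  # assumed policy limit, not from the article


class PolicyViolation(Exception):
    """Raised when an operation breaks a usage policy."""


@dataclass
class OpRecord:
    command: str
    key: str
    value_bytes: int
    latency_ms: float
    error: Optional[str]


class MeteredCache:
    """Wraps a get/set backend, recording latency, value size, and errors."""

    def __init__(self, backend, max_value_bytes: int = MAX_VALUE_BYTES):
        self._backend = backend
        self._max = max_value_bytes
        self.records: List[OpRecord] = []

    def _record(self, command, key, size, started, error=None):
        latency_ms = (time.perf_counter() - started) * 1000.0
        self.records.append(OpRecord(command, key, size, latency_ms, error))

    def set(self, key: str, value: bytes) -> None:
        started = time.perf_counter()
        if len(value) > self._max:
            # Reject oversized values before they reach the cache.
            self._record("SET", key, len(value), started, "value too large")
            raise PolicyViolation(f"{key}: {len(value)} bytes exceeds {self._max}")
        try:
            self._backend.set(key, value)
        except Exception as exc:
            self._record("SET", key, len(value), started, repr(exc))
            raise
        self._record("SET", key, len(value), started)

    def get(self, key: str) -> Optional[bytes]:
        started = time.perf_counter()
        try:
            value = self._backend.get(key)
        except Exception as exc:
            self._record("GET", key, 0, started, repr(exc))
            raise
        self._record("GET", key, len(value) if value is not None else 0, started)
        return value


class DictBackend:
    """In-memory backend (hypothetical) so the sketch runs without Redis."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)
```

With a limit like this in place, a write the size of the 7 MB log entry described earlier would be rejected and recorded at the client instead of reaching the cache. In practice the records would be shipped to the monitoring system rather than kept in a list.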
Implementation Details
The redesign, codenamed “Phoenix,” includes:
Monitoring: End‑to‑end visibility from client request to data return.
Custom Client: Replaces existing .NET clients (BookSleeve, ServiceStack.Redis) with a unified library that logs every command and routes traffic via a configuration center.
Command Splitting: Safe commands are allowed directly; unsafe commands require approval through the config center.
Deployment Model: Replace Keepalived virtual‑IP failover with master‑slave replication plus Sentinel.
Sharding: Hash‑based partitioning of large Redis instances into multiple shards, transparent to applications.
Redis Proxy: Provides a single entry point, supports multi‑instance deployment, and abstracts language‑specific client differences.
These changes aim to keep Redis as a controlled, observable cache while preventing uncontrolled data growth and ensuring that any failure can be isolated without cascading to the underlying database.
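The command splitting rule can be pictured as an allow-list check that runs before a command is sent. This is a minimal sketch under stated assumptions: the article does not say which commands count as safe or unsafe, so the two sets below are illustrative guesses, and CommandGate with its approvals mapping is a hypothetical stand-in for whatever the configuration center actually stores.

```python
from typing import Dict, Set

# Assumed classification for illustration only; the article does not list commands.
SAFE_COMMANDS = frozenset({"GET", "SET", "DEL", "EXPIRE", "TTL", "MGET", "INCR"})
GATED_COMMANDS = frozenset({"KEYS", "FLUSHALL", "FLUSHDB", "SORT", "EVAL"})


class CommandGate:
    """Decides whether an application may run a given command."""

    def __init__(self, approvals: Dict[str, Set[str]]):
        # approvals maps an application name to the gated commands it may run,
        # standing in for entries held by the configuration center.
        self._approvals = {
            app: {cmd.upper() for cmd in cmds} for app, cmds in approvals.items()
        }

    def is_allowed(self, app: str, command: str) -> bool:
        cmd = command.upper()
        if cmd in SAFE_COMMANDS:
            return True
        if cmd in GATED_COMMANDS:
            return cmd in self._approvals.get(app, set())
        return False  # commands in neither set are rejected by default


if __name__ == "__main__":
    gate = CommandGate({"reporting": {"sort"}})
    for app, cmd in [("pricing", "GET"), ("pricing", "KEYS"), ("reporting", "SORT")]:
        print(app, cmd, gate.is_allowed(app, cmd))
```

Rejecting unknown commands by default is a design choice in this sketch: a new command must be classified before any application can use it.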
Challenges and Outlook
The migration must occur without disrupting business, especially the complex sharding process that risks data loss if mishandled. Nevertheless, the team believes the new architecture will deliver a stable, governable caching layer that supports future scaling needs.
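One reason resharding is delicate can be shown with a small experiment. The article says only that partitioning is hash-based, so the modulo scheme below is an assumption, and shard_for and moved_fraction are hypothetical helper names. With simple modulo hashing, changing the shard count reassigns most keys to a different shard, and each of those keys has to be copied or invalidated during the migration.

```python
import zlib
from typing import Iterable


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (CRC32) and modulo."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


def moved_fraction(keys: Iterable[str], old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    keys = list(keys)
    if not keys:
        return 0.0
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)


if __name__ == "__main__":
    sample = [f"price:{i}" for i in range(10_000)]
    print(f"4 -> 5 shards moves {moved_fraction(sample, 4, 5):.0%} of keys")
```

For a uniformly distributed hash, going from 4 to 5 shards keeps a key in place only when its hash gives the same remainder under both counts, which holds for about one key in five. Schemes such as consistent hashing or fixed hash slots are commonly used to reduce this movement. Whether Phoenix uses one of them is not stated in the article.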
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
