Backend Development 13 min read

When a 200GB Redis Crashes: Real‑World Lessons on Cache Design and Governance

A large‑scale system that stacked a relational database, MongoDB, and a 200GB Redis cache suffered cascading failures, prompting a thorough analysis of Redis misuse and a redesign that introduces monitoring, controlled client libraries, proxy layers, sharding, and robust high‑availability strategies.

dbaplus Community

Feb 22, 2021

When a 200GB Redis Crashes: Real‑World Lessons on Cache Design and Governance

Case Overview

A product line built a massive price‑storage system using a relational database, MongoDB for document storage, and a 200GB Redis cache to accelerate reads. The architecture formed a three‑tier stack: DB → MongoDB → Redis.

Redis handled the highest request volume, becoming a single point of failure. When the MongoDB cluster crashed due to a kernel bug, Redis continued serving traffic, but any Redis outage forced all traffic onto the underlying database, risking total collapse.

Problem Analysis

The core issue is over‑reliance on Redis . Its convenience led to indiscriminate use across diverse workloads, turning a cache into an implicit storage layer without proper governance. Specific symptoms included:

Redis blocked by heavy KEYS commands

Keepalived virtual‑IP failover failures

CPU saturation from Redis‑based calculations

Master‑slave sync failures triggering full sync and network saturation

Connection‑count explosions

These problems were amplified by lack of monitoring, missing persistence (AOF/RDB disabled), and an ad‑hoc logging service that stored millions of logs in Redis, eventually choking the cache with a 7 MB log entry.

Proposed Solutions

To regain control, the team outlined five key actions:

Deploy a comprehensive monitoring system that alerts before issues become critical.

Introduce a custom Redis client that records operation metrics (latency, key size, errors) and enforces usage policies.

Reclassify Redis from a storage role to a pure cache role, limiting its responsibilities.

Implement a bespoke persistence layer built on the Redis protocol for cases that truly need durability.

Design high‑availability patterns tailored to each usage scenario rather than a one‑size‑fits‑all approach.

Implementation Details

The redesign, codenamed “Phoenix,” includes:

Monitoring: End‑to‑end visibility from client request to data return.

Custom Client: Replaces existing .NET clients (BookSleeve, ServiceStack.Redis) with a unified library that logs every command and routes traffic via a configuration center.

Command Splitting: Safe commands are allowed directly; unsafe commands require approval through the config center.

Deployment Model: Switch from Keepalived to master‑slave plus Sentinel for failover.

Sharding: Hash‑based partitioning of large Redis instances into multiple shards, transparent to applications.

Redis Proxy: Provides a single entry point, supports multi‑instance deployment, and abstracts language‑specific client differences.

These changes aim to keep Redis as a controlled, observable cache while preventing uncontrolled data growth and ensuring that any failure can be isolated without cascading to the underlying database.

Challenges and Outlook

The migration must occur without disrupting business, especially the complex sharding process that risks data loss if mishandled. Nevertheless, the team believes the new architecture will deliver a stable, governable caching layer that supports future scaling needs.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Redis caching

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.