How a Redis Crash Slowed Our Shop to 20 s and the Multi‑Layer Caching Fix
A sudden Redis outage caused a shopping homepage to take 20 seconds to load, prompting a rapid analysis of data‑volume growth, a post‑mortem of the fallback logic, and the design of a combined local‑cache, MongoDB, and database strategy to restore fast, reliable service.
Background
A new recommendation feature was released on an e‑commerce homepage. After deployment, the page response time rose to about 20 seconds, and users began to complain.
Root Cause
The feature stored recommendation data in Redis, keyed by region and category. The data grew to several hundred times the volume assumed in the original design and exceeded the 1 GB of memory allocated to the Alibaba Cloud Redis instance. When Redis ran out of memory it crashed, and the application fell back to querying the relational database directly on every request, which dramatically slowed response time.
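The capacity problem can be made concrete with a back-of-the-envelope check. The sketch below is illustrative only: the key format, the counts, the per-item size, and the 70% headroom factor are all hypothetical values, not figures from the incident, and the estimate ignores Redis's own per-key overhead, so it is a lower bound.

```python
GIB = 1024 ** 3

def cache_key(region: str, category: str) -> str:
    """Hypothetical key layout: one Redis key per (region, category) pair."""
    return f"rec:{region}:{category}"

def estimated_bytes(regions: int, categories: int,
                    items_per_key: int, bytes_per_item: int) -> int:
    """Lower-bound payload size: number of keys x items per key x item size."""
    return regions * categories * items_per_key * bytes_per_item

def fits(estimate: int, limit_bytes: int, headroom: float = 0.7) -> bool:
    """True if the estimate stays under an assumed safe fraction of the limit."""
    return estimate <= limit_bytes * headroom

# Hypothetical design-time assumption versus a 300x growth in volume.
planned = estimated_bytes(regions=10, categories=10,
                          items_per_key=50, bytes_per_item=1000)
grown = planned * 300
```

Running such a check against the real key count before release, and again whenever the number of regions or categories changes, would flag the point at which the dataset no longer fits the instance.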
Immediate Mitigation
The quickest remedy was to enlarge the Redis instance from 1 GB to 4 GB, which restored normal latency. This fix does not address the underlying design problems.
Post‑mortem Findings
**Inefficient data schema** – many fields were cached that were either null or never used by the front‑end, inflating memory consumption.
**Fallback risk** – the code unconditionally queried the relational database when Redis was unavailable, which could overload the DB under higher traffic.
Design Alternatives Evaluated
Static page generation – rejected because traffic is still low and the required front‑end/back‑end changes are extensive.
Local in‑process cache – adds a fast layer but may exhaust the application server’s memory if all recommendation data is cached locally.
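Replace Redis with MongoDB – MongoDB stores data on disk and keeps frequently accessed documents in memory (through memory‑mapped files in the legacy MMAPv1 storage engine; the default WiredTiger engine in current versions uses its own cache alongside the operating system's file cache), making it suitable for large document sets.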
Combine local cache with MongoDB – keep hot recommendation data in a local cache refreshed every 5 minutes; fall back to MongoDB for the remaining data.
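The combined option can be sketched as a bounded in-process cache that reloads an entry once it is older than five minutes. This is an illustration under assumptions: `HotCache` and its parameters are hypothetical names, and `loader` stands in for whatever function queries MongoDB. The entry cap addresses the memory concern raised for the local-cache option.

```python
import threading
import time
from collections import OrderedDict

class HotCache:
    """Bounded in-process cache; entries expire ttl_seconds after loading."""

    def __init__(self, loader, ttl_seconds=300.0, max_entries=1000,
                 clock=time.monotonic):
        self._loader = loader        # stand-in for the MongoDB query
        self._ttl = ttl_seconds      # 300 s = the 5-minute refresh interval
        self._max = max_entries      # cap so the app server's memory is bounded
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        with self._lock:
            hit = self._entries.get(key)
            if hit is not None and hit[0] > now:
                self._entries.move_to_end(key)  # mark as recently used
                return hit[1]
        # Miss or expired: load outside the lock so other keys are not blocked.
        # Two threads may load the same key concurrently; the last write wins.
        value = self._loader(key)
        with self._lock:
            self._entries[key] = (now + self._ttl, value)
            self._entries.move_to_end(key)
            while len(self._entries) > self._max:
                self._entries.popitem(last=False)  # evict least recently used
        return value
```

Only keys that are actually requested occupy memory, and the least recently used ones are evicted first, so the local layer holds the hot subset while the rest stays in the backing store.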
Layered Fallback Strategy
Use Apollo configuration to provide a default recommendation set when MongoDB is unavailable.
If Apollo defaults fail, query the relational database directly.
As a last resort, read from Redis (which only holds hot items).
Introduce a secondary local cache that stores the default data for 24 hours after the first DB fetch.
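The ordering above can be expressed as a chain of sources tried one after another. The sketch below assumes each layer is wrapped in a plain callable; `first_available` and `CachedDefault` are hypothetical names, and attaching the 24-hour cache to the database fetch is one reading of the strategy rather than a confirmed detail.

```python
import logging
import time

log = logging.getLogger("recommendations")

def first_available(key, sources):
    """Try each (name, fetch) pair in order; return the first non-empty result."""
    for name, fetch in sources:
        try:
            result = fetch(key)
        except Exception:
            # A failing layer is logged and skipped, not propagated.
            log.warning("source %s failed for key %r", name, key, exc_info=True)
            continue
        if result:
            return name, result
    return None, []

class CachedDefault:
    """Remember the default dataset for ttl_seconds after a successful fetch."""

    def __init__(self, fetch, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = 0.0

    def __call__(self, key):
        now = self._clock()
        if self._value and now < self._expires_at:
            return self._value
        value = self._fetch(key)  # may raise; first_available handles that
        if value:                 # only cache a non-empty dataset
            self._value = value
            self._expires_at = now + self._ttl
        return value
```

A caller would pass the layers in the order listed, for example `[("mongodb", ...), ("apollo", ...), ("database", CachedDefault(...)), ("redis", ...)]`, and could record the returned source name as a metric to see how often each fallback is used.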
Final Architecture
The production solution adopts a composite caching layer:
Local in‑process cache – holds hot recommendation data, refreshed every 5 minutes.
MongoDB – persistent store for the full recommendation dataset; data lives on disk while frequently accessed documents are served from memory.
Default local cache – stores a static fallback dataset (e.g., Beijing Dongcheng district recommendations) for 24 hours.
Relational database – ultimate source of truth when all caches miss.
This layered design provides fast response times, mitigates single‑point failures, and balances memory usage across the stack.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
