Cache Governance and High‑Availability Practices for Redis in a Hotel Quoting System
The article describes how a hotel quoting team identified recurring Redis‑related outages, designed a cache governance plan covering fast recovery, multi‑replica caching, manual degradation, and parameter tuning, and followed a structured process that improved system reliability and operational readiness.
Author Introduction
Zheng Jiming joined the domestic hotel quotation center team in August 2019, focusing on quotation system development and architecture optimization. He has strong interests in high‑concurrency, high‑availability systems, experience building distributed systems handling tens of millions of daily orders, and has participated twice in the ACM‑ICPC regional preliminaries.
Background
In September 2019 the team experienced several cache‑related incidents:
A DBA operational error cleared core data stored in Redis, causing a sudden drop in order volume (ATP). A scheduled task rewrote the data to Redis, and service recovered within half an hour.
Heavy PC‑side crawler traffic saturated the Redis connection pool, causing requests to wait up to 500 ms for a connection, which in turn filled the Tomcat thread pool and stalled the service, even though the Redis server itself was not under pressure.
Other cache‑related failures had also occurred, and the review showed that critical scenarios relied heavily on Redis without adequate safeguards against its failure.
To avoid repeat incidents, the team launched a dedicated cache governance project.
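The second incident can be made concrete with a rough capacity estimate. The sketch below applies Little's law (concurrency = throughput × time per request) to a fixed worker thread pool. The 500 ms wait comes from the incident description; the pool size of 200 threads and the 5 ms healthy latency are illustrative assumptions, not figures from the team.

```python
def max_throughput(threads: int, seconds_per_request: float) -> float:
    """Upper bound on requests/second a fixed thread pool can sustain (Little's law)."""
    if threads <= 0 or seconds_per_request <= 0:
        raise ValueError("threads and seconds_per_request must be positive")
    return threads / seconds_per_request


# Assumed values, for illustration only.
POOL_THREADS = 200          # hypothetical web-server worker thread count
HEALTHY_LATENCY_S = 0.005   # hypothetical: a request is served in about 5 ms
STALLED_LATENCY_S = 0.5     # each request waits up to 500 ms for a Redis connection

healthy_capacity = max_throughput(POOL_THREADS, HEALTHY_LATENCY_S)
stalled_capacity = max_throughput(POOL_THREADS, STALLED_LATENCY_S)

print(f"healthy: {healthy_capacity:.0f} req/s")
print(f"stalled: {stalled_capacity:.0f} req/s")
```

Under these assumptions the same thread pool sustains one hundredth of its normal throughput once every request blocks on the connection pool. This is how a healthy Redis server can still stall the service in front of it.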
Governance Plan
1. High‑Availability Governance: Business high‑availability should not be fully dependent on a single component. The primary goal is rapid recovery, preferably within 2 minutes for ATP‑impacting scenarios and within 10 minutes for user‑experience‑critical cases.
2. Fast Recovery: Use scheduled tasks or manual API triggers to clean and restore data quickly, prioritizing hot data.
3. Multi‑Replica Strategy: Cache critical data across different Redis namespaces or combine multiple cache components (e.g., Redis + Tair, Redis + in‑memory) so that traffic can switch to a standby when one of them fails.
4. Manual Degradation: Bypass the cache or switch to alternative channels, preferring lossless degradation and resorting to lossy degradation only when necessary.
5. Parameter Tuning: Optimize Redis configuration such as timeouts and thread counts. Monitoring showed that most read/write operations complete within a few milliseconds, yet many deployments still used outdated configurations with timeouts of hundreds of milliseconds.
6. Additional Governance Details:
Replace Memcache with Redis to unify cache operations under DBA management.
Standardize configuration file formats for quick retrieval during incidents.
Enhance monitoring to capture call volume, latency, and error rates for each business scenario.
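The multi‑replica strategy in item 3 can be sketched as a thin wrapper that writes to every replica and reads from the first one that returns a value. This is a minimal illustration under stated assumptions, not the team's implementation: `InMemoryCache` is a hypothetical dict‑backed stand‑in for a real Redis or Tair client, and `ReplicatedCache` and the key name are invented for the example.

```python
from typing import Dict, Optional, Protocol


class Cache(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryCache:
    """Hypothetical dict-backed stand-in for a cache client; can simulate an outage."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self.available = True

    def get(self, key: str) -> Optional[str]:
        if not self.available:
            raise ConnectionError("cache unavailable")
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.available:
            raise ConnectionError("cache unavailable")
        self._data[key] = value


class ReplicatedCache:
    """Writes to every replica; reads from the first replica that returns a value."""

    def __init__(self, *replicas: Cache) -> None:
        if not replicas:
            raise ValueError("at least one replica is required")
        self._replicas = replicas

    def set(self, key: str, value: str) -> None:
        failures = 0
        for replica in self._replicas:
            try:
                replica.set(key, value)
            except ConnectionError:
                failures += 1  # tolerate one failed replica, keep writing to the rest
        if failures == len(self._replicas):
            raise ConnectionError("all replicas failed")

    def get(self, key: str) -> Optional[str]:
        for replica in self._replicas:
            try:
                value = replica.get(key)
            except ConnectionError:
                continue  # this replica is down: try the standby
            if value is not None:
                return value
        return None  # miss (or outage) on every replica


primary, standby = InMemoryCache(), InMemoryCache()
cache = ReplicatedCache(primary, standby)
cache.set("hotel:42:price", "318.00")

primary.available = False               # simulate the primary namespace failing
result = cache.get("hotel:42:price")    # served from the standby
print(result)
```

Writing to both replicas on every update keeps the standby warm, at the cost of doubling write traffic. A real deployment would also need to decide how to reconcile replicas after the failed one comes back.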
Governance Process
Identify core scenarios that rely on cache, documenting impact, data volume, and failure modes.
Draft an overall governance solution, then review details with system owners to fill gaps.
Conduct joint reviews among developers, application owners, and QA to finalize standards.
Implement and self‑test the solutions; iterate if issues arise.
Compile emergency handbooks for each scenario during development.
Pass the changes to QA for verification and conduct beta‑environment drills.
Deploy to production and create monitoring dashboards per application.
Perform live‑environment rehearsals during low‑traffic periods, refining the process until expectations are met.
Results and Summary
Most P1 systems in the team have completed cache governance and rehearsals, consuming over 60 person‑days. Developers gained deep Redis knowledge, and the produced wiki, monitoring panels, and emergency handbooks proved valuable for onboarding, daily inspections, and rapid fault isolation.
Notably, in a recent incident, hotel image data corrupted in the DB also polluted Redis. After restoring the DB, the team used the pre‑planned degradation path to switch image retrieval to a Dubbo service, restoring service within one minute.
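A manual degradation path of this kind can be sketched as an operator‑controlled switch in front of the cache read. The example below is an assumption‑laden illustration, not the team's code: `DegradationSwitch` and `load_image_url` are hypothetical names, plain dictionaries stand in for Redis and for the remote service, and in production the flag would more likely live in a configuration center than in process memory.

```python
import threading
from typing import Callable, Optional

Lookup = Callable[[str], Optional[str]]


class DegradationSwitch:
    """Thread-safe flag that an operator flips to bypass the cache."""

    def __init__(self) -> None:
        self._bypass = threading.Event()

    def enable(self) -> None:
        self._bypass.set()

    def disable(self) -> None:
        self._bypass.clear()

    @property
    def bypass_cache(self) -> bool:
        return self._bypass.is_set()


def load_image_url(hotel_id: str, cache_get: Lookup, fallback_get: Lookup,
                   switch: DegradationSwitch) -> Optional[str]:
    """Read from the cache unless the switch is on; fall through on a cache miss."""
    if switch.bypass_cache:
        return fallback_get(hotel_id)   # degraded path: skip the cache entirely
    cached = cache_get(hotel_id)
    return cached if cached is not None else fallback_get(hotel_id)


# Hypothetical data: the cache holds a polluted entry, the remote service a good one.
polluted_cache = {"42": "corrupted-bytes"}
remote_service = {"42": "https://img.example.com/hotel/42.jpg"}
switch = DegradationSwitch()

before = load_image_url("42", polluted_cache.get, remote_service.get, switch)
switch.enable()                         # operator triggers the degradation
after = load_image_url("42", polluted_cache.get, remote_service.get, switch)
print(before)
print(after)
```

The switch matters here because a polluted cache does not raise errors; it answers quickly with bad data, so automatic failover would never trigger. A human decision, backed by a prepared path, is what makes recovery fast.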
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
