Design and Implementation of a Systematic Cache Architecture for High‑Performance Business Systems
This article describes how a multi‑layered cache system—including a DAO‑level Redis cache, shared cache components, and global cache planning—was designed, implemented, and deployed to reduce database QPS by over 80%, cut response latency, and improve overall system stability in a high‑traffic e‑commerce platform.
Author Chen Li joined Qunar Travel in 2020 and now leads the foundational technology and performance optimization for the ticket value‑added business, bringing over ten years of full‑stack development experience.
In high‑performance system design, caching is indispensable, but a single cache solution rarely solves all performance problems; different business scenarios require a combination of cache strategies. In the first half of 2021, Qunar upgraded its cache architecture, achieving more than 80% reduction in database QPS, decreasing interface latency from 150 ms to 90 ms, cutting RPC calls between subsystems by over 40%, and maintaining cache hit rates above 80%.
1. The First Cache Component Development
1.1 Development Background
During the post‑Christmas traffic rebound in 2020, the DB QPS of the value‑added system exceeded 30 k per instance, causing core interface latency spikes and high load. Investigation revealed massive IO blocking, thread counts over 2000, and heavy context‑switch overhead. The most effective optimization was adding a specialized DAO‑layer cache.
Existing Redis and Guava caches were insufficient for DB‑level caching due to:
DB connection overload and alerts.
Large codebase (over 300 k lines) with many call branches: caching at the request entry point would yield a low hit rate, whereas targeted caching of the six critical tables delivers most of the benefit.
Long, iterative processes with repeated DB queries, making refactoring costly.
Stringent consistency requirements for transaction systems.
Traditional DB‑related cache solutions also have limitations:
MySQL query cache invalidates all entries on any data change, unsuitable for frequent order updates.
MyBatis second‑level cache has poor consistency and coarse granularity.
Annotation‑based Redis caching lacks distributed consistency.
Binlog‑based caching introduces second‑level latency, unacceptable for transaction systems.
1.2 Design Considerations
The component, named daoCache, sits on the DAO layer and aims to offload DB queries to Redis at roughly a 1:1 ratio while preserving high consistency in a distributed environment.
Consistency theory (CAP, BASE) dictates a trade-off among strong consistency, development cost, and performance overhead; extensive up-front analysis was performed to mitigate the associated risks.
Internal factors: Dedicated data‑layer developers provide high‑quality interfaces; the design minimizes impact on business‑layer code.
External factors: Company policies restrict Redis client capabilities (no batch ops, Lua scripts disabled), influencing interaction patterns.
Consistency handling: each of the four common update orderings (update DB then update cache, update cache then DB, delete cache then update DB, update DB then delete cache) carries its own concurrency hazards; a full treatment of each is beyond the scope of this article.
Performance considerations include minimizing calls to external resources, keeping the cache hit rate above a 30% floor, handling exceptional flows without corrupting data, and reducing lock contention.
Cost considerations focus on lightweight solutions due to rising traffic.
1.3 Implementation Details
The cache uses Redis as the middleware, integrated via a MyBatis plugin that scans DAO methods annotated with daoCache at startup, pre‑analyzes intercept points, and applies the following flow:
Read path: cache hit incurs minimal overhead; miss adds DB access plus 2‑3 Redis calls.
Cache updates are passive (deletion only); data is repopulated on miss.
No distributed locks are used; an auto‑expire marker (default 10 s) blocks repopulation during the invalidation window, preventing ABA‑style stale writes.
All cached entries have expiration (10‑20 min) for eventual consistency.
Additional safeguards handle edge cases such as immediate post‑put checks, refund‑related queries, and special handling for order‑creation threads.
Key data structures map business keys (e.g., orderId) to cache entries using Redis sets and hashes, enabling efficient invalidation of all related cache items when a business key changes.
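The read and invalidation paths described above can be sketched as follows. This is an illustrative simulation, not Qunar's actual code: Redis strings and sets are stood in by in‑memory dicts, and all names (`cached_query`, `invalidate`, the key format) are hypothetical. It shows the cache‑aside read with passive, deletion‑only updates, plus the business‑key index that lets every entry tied to an `orderId` be purged in one step.

```python
# Simulated stores (Redis in production; plain dicts here for illustration).
cache = {}          # string keys: cache_key -> row data
key_index = {}      # sets: business key (orderId) -> {cache_key, ...}
db = {("order:1001", "findById"): {"id": 1001, "status": "PAID"}}

def cached_query(order_id, method):
    """Read path: return from cache on a hit; on a miss, query the DB,
    repopulate the cache, and register the entry under the business key."""
    cache_key = f"{order_id}:{method}"
    if cache_key in cache:                      # hit: no DB access
        return cache[cache_key]
    row = db.get((order_id, method))            # miss: fall through to DB
    if row is not None:
        cache[cache_key] = row                  # repopulate on miss only
        key_index.setdefault(order_id, set()).add(cache_key)
    return row

def invalidate(order_id):
    """Write path: after the DB commit, delete every cached entry indexed
    under the changed business key (passive update, deletion only)."""
    for cache_key in key_index.pop(order_id, set()):
        cache.pop(cache_key, None)
```

In production the index would live in a Redis set so any instance that updates an order can invalidate entries written by its peers.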
1.4 Deployment
Gradual rollout is performed by enabling the cache for low‑risk, low‑update APIs first, using feature switches to revert if needed. Continuous diffing between cache and DB results identifies inconsistencies, triggering alerts and automatic cache cleaning.
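The continuous diffing safety net can be sketched as below; the function name and shape are hypothetical. The idea from the rollout process above: sample live reads, compare the cached value against a fresh DB read, and on any mismatch raise an alert and evict the stale entry so it is rebuilt from the DB.

```python
def diff_check(cache, db_fetch, cache_key, alerts):
    """Compare one cached entry with the source of truth in the DB.
    Returns True when consistent; on mismatch, alerts and cleans the key."""
    cached = cache.get(cache_key)
    fresh = db_fetch(cache_key)
    if cached == fresh:
        return True
    alerts.append(f"cache/DB mismatch on {cache_key}")
    cache.pop(cache_key, None)       # automatic cache cleaning
    return False
```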
2. Global Cache System Design and Planning
2.1 Background
After the first cache, new issues emerged: slow service startup and performance jitter caused by full‑load product‑cache refreshes across multiple microservices.
Problems identified:
Scattered cache usage makes maintenance difficult and hampers global tuning.
Shared cache keys across services lead to reluctance in cleaning unused keys, causing resource waste.
2.2 Cache System Overview
The cache is divided into four categories:
Level‑1 cache: In‑process memory cache (no I/O), suitable for short‑lived, stateless requests.
Level‑2 cache: External shared cache (Redis) used by multiple instances or services.
Application‑specific cache: Dedicated to a single service for high‑sensitivity data (locks, queues, idempotency).
Shared memory cache: read‑only hot data replicated into the local memory of multiple services, with eventual consistency.
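The level‑1/level‑2 lookup order can be sketched as follows (a minimal illustration with invented names; level 2 is Redis in production but a plain dict here): the in‑process level‑1 cache is consulted first with no I/O, and on a miss the shared level‑2 cache is consulted and the result is promoted into level 1 with a short TTL.

```python
import time

class TwoLevelCache:
    def __init__(self, l2, l1_ttl=60.0):
        self.l1 = {}          # level 1: in-process, per-instance, no I/O
        self.l2 = l2          # level 2: shared across instances/services
        self.l1_ttl = l1_ttl  # keep level-1 entries short-lived

    def get(self, key):
        entry = self.l1.get(key)
        if entry is not None:
            value, expires = entry
            if time.monotonic() < expires:   # level-1 hit
                return value
            del self.l1[key]                 # expired: fall through
        value = self.l2.get(key)             # level-2 (network) lookup
        if value is not None:                # promote into level 1
            self.l1[key] = (value, time.monotonic() + self.l1_ttl)
        return value
```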
3. Shared Cache Component Development
3.1 Architecture Overview
Shared cache adopts a Provider‑Consumer model. Providers own the data source and can read/write the cache; Consumers only read and trigger a provider fetch on a miss.
Redis serves both as the second‑level cache and as a registration center.
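The Provider‑Consumer split can be sketched as below (class and method names are illustrative, and the shared cache is a dict standing in for Redis): the provider owns the data source and is the only role that writes the shared cache, while consumers read it and delegate to the provider on a miss.

```python
class Provider:
    """Owns the authoritative data source; sole writer of the shared cache."""
    def __init__(self, shared_cache, source):
        self.cache = shared_cache
        self.source = source

    def fetch(self, key):
        value = self.source.get(key)
        if value is not None:
            self.cache[key] = value          # only providers write the cache
        return value

class Consumer:
    """Read-only view of the shared cache; a miss triggers a provider fetch."""
    def __init__(self, shared_cache, provider):
        self.cache = shared_cache
        self.provider = provider

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        return self.provider.fetch(key)
```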
3.2 Value Proposition
The solution targets data with low‑frequency updates and extremely high read rates, offering eventual consistency within seconds and robust fallback mechanisms.
3.3 Core Implementation
Shared cache reuses the generic cache framework but adds a business‑key‑centric data model. Redis stores metadata, full data versions, and supports byte‑prefix versioning. Locally, four indexes (primary‑key, conditional set, full‑set, reverse index) enable fast look‑ups and incremental updates.
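A minimal sketch of the four local indexes, with invented field names: a primary‑key map, conditional sets grouping records by a query condition, a full set of cached keys, and a reverse index from each key back to its condition slot so incremental updates touch only the changed entries.

```python
class LocalIndexes:
    def __init__(self):
        self.by_pk = {}       # primary-key index: pk -> record
        self.by_cond = {}     # conditional sets: condition value -> {pk}
        self.full = set()     # full set of cached primary keys
        self.reverse = {}     # reverse index: pk -> condition value

    def upsert(self, pk, record, cond):
        self.remove(pk)       # incremental update: clear old slots first
        self.by_pk[pk] = record
        self.by_cond.setdefault(cond, set()).add(pk)
        self.full.add(pk)
        self.reverse[pk] = cond

    def remove(self, pk):
        cond = self.reverse.pop(pk, None)
        if cond is not None:
            self.by_cond.get(cond, set()).discard(pk)
        self.by_pk.pop(pk, None)
        self.full.discard(pk)
```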
Consistency is ensured by:
MQ‑driven local cache invalidation based on changed business keys.
Provider‑side marking and periodic cleanup of stale markers.
Periodic comparison of recent business‑key changes between local nodes and Redis.
Leader‑node full sync to Redis every 20 minutes as a safety net.
Manual or scheduled full‑cache refresh for major releases.
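Two of the safeguards above can be sketched as follows (hypothetical function names; versions are plain integers here, while production would track them per business key in Redis): MQ‑driven invalidation drops only the entries named in a message, and a periodic pass refreshes any key whose local version has drifted from the version recorded in Redis.

```python
def on_mq_message(local_cache, changed_keys):
    """MQ-driven invalidation: drop only the entries named in the message."""
    for key in changed_keys:
        local_cache.pop(key, None)

def periodic_compare(local_versions, redis_versions, refresh):
    """Safety net: refresh every key whose local version differs from Redis."""
    stale = [k for k, v in redis_versions.items()
             if local_versions.get(k) != v]
    for key in stale:
        refresh(key)
    return stale
```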
3.4 Typical Use Cases
Shared cache excels in scenarios where many microservices need consistent, read‑heavy data (e.g., product catalogs). It reduces message traffic by deduplicating and batching MQ notifications, avoiding network storms during bulk updates.
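The dedup‑and‑batch behavior can be sketched as below (an illustrative class, not the real component): changed business keys are collected into a set during a short window, so a bulk update touching the same products emits one message of unique keys instead of a notification storm.

```python
class BatchNotifier:
    def __init__(self, max_batch=100):
        self.pending = set()       # dedup: each key appears at most once
        self.max_batch = max_batch

    def mark_changed(self, key):
        """Record a changed business key during the collection window."""
        self.pending.add(key)

    def flush(self):
        """Window closed: emit one deduplicated, size-capped batch."""
        batch = sorted(self.pending)[: self.max_batch]
        self.pending.difference_update(batch)
        return batch
```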
4. Conclusion
Deploying the cache system dramatically reduced DB QPS and eliminated peak‑time spikes, with near‑zero inconsistency incidents. Limitations include placement at the request flow tail, DAO‑specific constraints, and the need for careful annotation configuration. Nonetheless, the design principles are reusable for other caching solutions, and future work aims to build a universal cache‑solution library.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.