How to Diagnose and Fix Cache Consistency Issues in High‑Concurrency Systems
This article walks through a real‑world cache consistency bug in a high‑traffic push service, explains cache penetration, breakdown, and avalanche, compares strong and eventual consistency models, and presents practical cache update and invalidation strategies to prevent data mismatches.
In the Tmall International Push Center, a null plan exception was traced to a cache consistency problem caused by the configuration service deleting the cache before the push service refreshed it.
Cache Issues Overview
Cache Penetration occurs when requests query data that does not exist, bypassing the cache and hitting the database, potentially overwhelming it.
Mitigation: use a Bloom filter or cache empty results.
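A minimal sketch of the "cache empty results" mitigation, assuming a plain dict stands in for the database. The class name `NullCachingLoader`, the sentinel, and both TTL values are illustrative and not taken from the push service. A miss in the database is cached for a short time, so repeated lookups of a nonexistent key stop reaching the database.

```python
import time

_MISSING = object()  # sentinel: "the database has no such row"


class NullCachingLoader:
    """Cache-aside reader that also caches misses (hypothetical helper)."""

    def __init__(self, db, ttl=300.0, null_ttl=30.0):
        self.db = db              # dict standing in for the database
        self.cache = {}           # key -> (value_or_sentinel, expires_at)
        self.ttl = ttl            # lifetime of real values, in seconds
        self.null_ttl = null_ttl  # shorter lifetime for cached misses
        self.db_hits = 0          # counts how often the database is queried

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self.cache.get(key)
        if entry is not None and entry[1] > now:
            value = entry[0]
            return None if value is _MISSING else value
        # Cache miss or expired entry: go to the database once.
        self.db_hits += 1
        value = self.db.get(key, _MISSING)
        lifetime = self.null_ttl if value is _MISSING else self.ttl
        self.cache[key] = (value, now + lifetime)
        return None if value is _MISSING else value
```

The miss lifetime is kept short on purpose: if the row is created later, readers see it once the cached miss expires.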
Cache Breakdown happens when a hot key expires and many requests simultaneously hit the database.
Mitigation: set hot data to never expire, or use a lock or a queue to serialize updates so that only one request rebuilds the entry.
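The locking mitigation can be sketched as a per-key mutex with a second cache check inside the lock. This is an in-process illustration only: `SingleFlightCache` and `loader` are hypothetical names, and a multi-node deployment would need a distributed lock instead of `threading.Lock`.

```python
import threading


class SingleFlightCache:
    """Lets only one thread rebuild a missing key (hypothetical helper)."""

    def __init__(self, loader):
        self.loader = loader      # function key -> value, e.g. a DB query
        self.cache = {}
        self.locks = {}           # key -> lock guarding that key's rebuild
        self.locks_guard = threading.Lock()
        self.loads = 0            # how many times the loader actually ran

    def _lock_for(self, key):
        with self.locks_guard:
            return self.locks.setdefault(key, threading.Lock())

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        with self._lock_for(key):
            # Re-check: another thread may have filled the key while we waited.
            if key in self.cache:
                return self.cache[key]
            self.loads += 1
            value = self.loader(key)
            self.cache[key] = value
            return value
```

Threads that arrive while the rebuild is in progress block on the lock and then return the freshly cached value instead of querying the database themselves.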
Cache Avalanche is when many keys expire at once, flooding the database.
Mitigation: randomize expiration times, use high‑availability cache clusters, serve hot data as pre‑generated static content.
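Randomized expiration can be as small as adding jitter to a base TTL. The sketch below assumes a symmetric spread of 20 percent; the function name and the default are illustrative choices.

```python
import random


def jittered_ttl(base_ttl, spread=0.2, rng=random):
    """Return base_ttl scaled by a random factor in [1 - spread, 1 + spread]."""
    if not 0 <= spread < 1:
        raise ValueError("spread must be in [0, 1)")
    return base_ttl * rng.uniform(1 - spread, 1 + spread)
```

With a base of 300 seconds and the default spread, keys written together expire somewhere between roughly 240 and 360 seconds later instead of all at the same instant.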
Data Consistency Models
Strong Consistency guarantees that all clients see the same data at any moment; it is suitable for transactional systems and relies on protocols such as three‑phase commit (3PC), distributed locks, or Paxos/Raft algorithms.
Eventual Consistency allows temporary divergence, with mechanisms like asynchronous replication, read‑repair, background synchronization, and versioning.
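As one illustration of versioning, a replica or cache can refuse any write whose version is not newer than what it already holds, so updates that arrive late or out of order cannot overwrite fresher data. The sketch assumes the writer supplies a monotonically increasing version number per key; `VersionedCache` is a hypothetical name.

```python
class VersionedCache:
    """Keeps only the newest version of each key (hypothetical helper)."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def put(self, key, version, value):
        """Store value if version is newer; return whether it was applied."""
        current = self._data.get(key)
        if current is not None and current[0] >= version:
            return False  # stale or duplicate update: ignore it
        self._data[key] = (version, value)
        return True

    def get(self, key):
        entry = self._data.get(key)
        return None if entry is None else entry[1]
```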
Cache Update & Invalidation Strategies
Write‑through cache
Write‑back cache
Write‑around cache
Active update (push on DB change)
Timed expiration
Lazy loading
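The article only names these strategies. The sketch below shows the write paths and the lazy‑loading read path as they are commonly defined, with dicts standing in for the database and the cache. `StrategyDemo` is a hypothetical class, and dropping the cached entry in `write_around` is one possible choice for avoiding a stale read.

```python
class StrategyDemo:
    """Toy store contrasting common cache write strategies."""

    def __init__(self):
        self.db = {}
        self.cache = {}
        self.dirty = set()  # keys written to the cache but not yet to the DB

    def write_through(self, key, value):
        # Write the database and the cache in the same operation.
        self.db[key] = value
        self.cache[key] = value

    def write_back(self, key, value):
        # Write only the cache now; the database is updated on flush().
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        # Persist all deferred write-back entries.
        for key in list(self.dirty):
            self.db[key] = self.cache[key]
        self.dirty.clear()

    def write_around(self, key, value):
        # Write the database and bypass the cache; drop any stale entry.
        self.db[key] = value
        self.cache.pop(key, None)

    def read(self, key):
        # Lazy loading: populate the cache only when a read misses.
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value
```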
Push Center Architecture
The system consists of a configuration service (writes/updates push plans) and a push service (reads plans). Both use a two‑level cache: local memory and Tair (a Redis‑like store). The configuration service previously used a “delete‑then‑refresh” strategy, creating a window where the push service read a null value.
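A two‑level read path of this kind might look like the sketch below. A dict stands in for Tair, since no real Tair client calls are assumed here, and the fallback to the database on a remote miss is an assumption: the article does not say whether the push service falls back or simply uses the null it read.

```python
class TwoLevelCache:
    """Local memory in front of a remote cache (hypothetical helper)."""

    def __init__(self, remote, load_from_db):
        self.local = {}                    # level 1: in-process memory
        self.remote = remote               # level 2: dict standing in for Tair
        self.load_from_db = load_from_db   # function key -> value or None

    def get(self, key):
        if key in self.local:
            return self.local[key]
        value = self.remote.get(key)
        if value is None:
            # Remote miss: assumed fallback to the database.
            value = self.load_from_db(key)
            if value is not None:
                self.remote[key] = value
        if value is not None:
            self.local[key] = value
        return value
```

Without the fallback branch, a remote entry deleted by another service would surface to the caller as `None`, which matches the symptom described above.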
Investigation Process
Log analysis revealed an error burst recurring every five minutes, matching the schedule of the configuration service’s refresh task. The timing showed the sequence of events: the cache entry was deleted, the database was updated, and the push service read null in the window before it refreshed Tair.
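The window can be reproduced deterministically by replaying the writer's steps and letting a reader look at the cache after a chosen step. This is a simplified model with dicts as stand‑ins and a hypothetical function name, not the services' actual code.

```python
def replay_delete_then_refresh(cache, db, key, new_plan, reader_reads_after):
    """Run the three writer steps; return what a reader sees in the cache
    immediately after step number reader_reads_after (1, 2 or 3)."""

    def delete_cache():
        cache.pop(key, None)

    def update_db():
        db[key] = new_plan

    def refresh_cache():
        cache[key] = db[key]

    observed = None
    for number, step in enumerate([delete_cache, update_db, refresh_cache], start=1):
        step()
        if number == reader_reads_after:
            observed = cache.get(key)  # reader consults only the cache
    return observed
```

A reader arriving after step 1 or step 2 gets `None`; only a reader arriving after step 3 sees the new plan.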
Solution
Replace the delete‑then‑refresh approach with a double‑write strategy: after updating the database, immediately refresh the cache, or use a delayed double‑delete to reduce the inconsistency window.
Update DB then cache – stronger consistency but possible dirty reads.
Update cache then DB – avoids dirty reads but risks permanent inconsistency if DB update fails.
Delete cache then update DB – simple but creates a consistency window.
Delayed double‑delete – reduces the window but adds operational complexity.
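Two of the options above, sketched with dicts standing in for the database and the cache. The function names and the 0.5 second default delay are assumptions for illustration. In practice the delay is chosen to exceed the time a concurrent reader needs to load from the database and repopulate the cache.

```python
import threading


def update_then_refresh(db, cache, key, value):
    """Update the database, then overwrite the cache entry immediately."""
    db[key] = value
    cache[key] = value


def delayed_double_delete(db, cache, key, value, delay=0.5):
    """Delete the cache, update the database, then delete the cache again
    after `delay` seconds to evict anything a concurrent reader put back."""
    cache.pop(key, None)   # first delete
    db[key] = value        # database update
    timer = threading.Timer(delay, cache.pop, args=(key, None))  # second delete
    timer.daemon = True
    timer.start()
    return timer           # caller may join() it, e.g. in tests
```

With `update_then_refresh` the cache never holds an empty slot for the key, which removes the null‑read window. With `delayed_double_delete` a reader can still miss between the deletes, so it only suits readers that fall back to the database on a miss.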
Conclusion
There is no universally best solution; the optimal strategy balances performance, system complexity, and the required consistency level for the specific business scenario.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
