How to Diagnose and Fix Cache Consistency Issues in High‑Concurrency Systems

This article walks through a real‑world cache consistency bug in a high‑traffic push service, explains cache penetration, breakdown, and avalanche, compares strong and eventual consistency models, and presents practical cache update and invalidation strategies to prevent data mismatches.

Alibaba Cloud Developer

In the Tmall International Push Center, a null plan exception was traced to a cache consistency problem caused by the configuration service deleting the cache before the push service refreshed it.

Cache Issues Overview

Cache Penetration occurs when requests query data that does not exist, bypassing the cache and hitting the database, potentially overwhelming it.

Mitigation: use a Bloom filter to reject keys that cannot exist, or cache empty results with a short expiration time.
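
Caching empty results can be sketched as follows. This is a minimal in-process illustration, not the article's implementation: the class name `NullCachingReader`, the `load_from_db` callback, and both TTL values are assumptions made for the example.

```python
import time

_MISSING = object()  # sentinel meaning "the database has no such row"


class NullCachingReader:
    """Cache-aside reader that also remembers misses for a short time."""

    def __init__(self, load_from_db, ttl=300.0, miss_ttl=30.0, clock=time.monotonic):
        self._load = load_from_db   # hypothetical loader: key -> value or None
        self._ttl = ttl             # lifetime of real values, in seconds
        self._miss_ttl = miss_ttl   # shorter lifetime for cached "not found"
        self._clock = clock
        self._entries = {}          # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            value = entry[0]
            return None if value is _MISSING else value
        # Cache miss or expired entry: fall through to the database once.
        value = self._load(key)
        if value is None:
            self._entries[key] = (_MISSING, self._clock() + self._miss_ttl)
        else:
            self._entries[key] = (value, self._clock() + self._ttl)
        return value
```

Repeated lookups of a nonexistent key then reach the database only once per `miss_ttl` window. The short miss TTL limits how long a newly created row stays invisible.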

Cache Breakdown happens when a hot key expires and many requests simultaneously hit the database.

Mitigation: configure hot data so that it never expires, or use a lock or a queue so that only one request rebuilds the entry while the others wait.
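
The locking option can be sketched with a per-key mutex and a second check after the lock is acquired. This is a single-process illustration under assumed names (`HotKeyCache`, `load_from_db`); a real deployment across several machines would need a distributed lock instead.

```python
import threading

_ABSENT = object()  # sentinel so that a cached None is still a hit


class HotKeyCache:
    """Lets only one thread rebuild a missing key; the rest wait and reuse it."""

    def __init__(self, load_from_db):
        self._load = load_from_db          # hypothetical loader: key -> value
        self._data = {}
        self._locks = {}
        self._locks_guard = threading.Lock()

    def _lock_for(self, key):
        # Create at most one lock per key, even under concurrent callers.
        with self._locks_guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        value = self._data.get(key, _ABSENT)
        if value is not _ABSENT:
            return value
        with self._lock_for(key):
            # Re-check: another thread may have rebuilt the entry while we waited.
            value = self._data.get(key, _ABSENT)
            if value is _ABSENT:
                value = self._load(key)
                self._data[key] = value
            return value

    def expire(self, key):
        self._data.pop(key, None)
```

With this shape, a burst of readers arriving right after `expire("hot")` results in a single call to the loader rather than one call per reader.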

Cache Avalanche occurs when many keys expire at once, flooding the database.

Mitigation: randomize expiration times, use high‑availability cache clusters, and serve hot data from pre‑generated static content.
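
Randomized expiration can be as small as a helper that spreads each TTL around a base value. The function name `jittered_ttl` and the 20% default spread are assumptions chosen for illustration.

```python
import random


def jittered_ttl(base_ttl, spread=0.2, rng=random):
    """Return base_ttl scaled by a random factor in [1 - spread, 1 + spread]."""
    if not 0 <= spread < 1:
        raise ValueError("spread must be in [0, 1)")
    return base_ttl * rng.uniform(1 - spread, 1 + spread)


# Keys written in the same batch now expire at different moments.
ttls = [jittered_ttl(300) for _ in range(5)]
```

With a base of 300 seconds and the default spread, each key receives a TTL between roughly 240 and 360 seconds, so a batch load no longer expires in the same instant.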

Data Consistency Models

Strong Consistency guarantees that all clients see the same data at any moment; it is suitable for transactional systems and relies on protocols such as three‑phase commit (3PC), distributed locks, or Paxos/Raft algorithms.

Eventual Consistency allows temporary divergence, with mechanisms like asynchronous replication, read‑repair, background synchronization, and versioning.

Cache Update & Invalidation Strategies

Write‑through cache

Write‑back cache

Write‑around cache

Active update (push on DB change)

Timed expiration

Lazy loading
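
The write policies and lazy loading listed above differ mainly in when the cache and the database are touched. The sketch below uses two plain dictionaries in place of a real cache and database; the class `TinyStore` and its method names are hypothetical, and dropping the cached copy in `write_around` is a choice made here to avoid serving a stale value.

```python
class TinyStore:
    """Two dicts standing in for a cache and a database."""

    def __init__(self):
        self.db = {}
        self.cache = {}
        self.dirty = set()   # keys written to the cache but not yet to the db

    def write_through(self, key, value):
        # Write to the database and the cache in the same operation.
        self.db[key] = value
        self.cache[key] = value
        self.dirty.discard(key)

    def write_back(self, key, value):
        # Write to the cache only; the database catches up on flush().
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()

    def write_around(self, key, value):
        # Write to the database only, bypassing the cache.
        self.db[key] = value
        self.cache.pop(key, None)
        self.dirty.discard(key)

    def read_lazy(self, key):
        # Lazy loading: populate the cache on the first read that misses.
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value
```

Write-back gives the fastest writes but can lose data if the cache fails before `flush()` runs, which is why it is usually paired with durable or replicated caches.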

Push Center Architecture

The system consists of a configuration service (writes/updates push plans) and a push service (reads plans). Both use a two‑level cache: local memory and Tair (a Redis‑like store). The configuration service previously used a “delete‑then‑refresh” strategy, creating a window where the push service read a null value.

Investigation Process

Log analysis revealed a burst of errors recurring every five minutes, which matched the schedule of the configuration service’s refresh task. The timeline showed that the cache entry was deleted and the database was updated, and that the push service then read Tair before the entry had been refreshed, so it received null.
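
The window can be reproduced in a few lines by collapsing both services into plain functions. This is a simplified, single-threaded illustration: the dictionaries stand in for Tair and the database, and the `reader_during_window` callback is a hypothetical hook that represents a push-service read landing between the delete and the refresh.

```python
def delete_then_refresh(db, cache, key, new_value, reader_during_window):
    cache.pop(key, None)       # 1. delete the cached plan
    db[key] = new_value        # 2. update the database
    reader_during_window()     # a read arriving before the cache is refreshed
    cache[key] = new_value     # 3. the cache is finally repopulated


db = {"plan:1": "v1"}
cache = {"plan:1": "v1"}
observed = []

delete_then_refresh(
    db, cache, "plan:1", "v2",
    reader_during_window=lambda: observed.append(cache.get("plan:1")),
)
print(observed)  # [None]
```

Any reader that treats a missing cache entry as "no plan" instead of falling back to the database will fail during that gap, which is consistent with the null plan exception described above.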

Solution

Replace the delete‑then‑refresh approach with a double‑write strategy: after updating the database, immediately refresh the cache, or use a delayed double‑delete to reduce the inconsistency window.

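Update DB then cache – closer to strong consistency, but readers can still see the old cached value until the refresh lands, and concurrent writers can leave a stale value behind.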

Update cache then DB – avoids dirty reads but risks permanent inconsistency if DB update fails.

Delete cache then update DB – simple but creates a consistency window.

Delayed double‑delete – reduces the window but adds operational complexity.
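
Two of the options above can be sketched side by side. As before, dictionaries stand in for the database and the cache; the function names and the `delay_seconds` default are assumptions for illustration, and a production system would typically schedule the second delete through a message queue or job scheduler instead of an in-process timer.

```python
import threading


def update_db_then_refresh_cache(db, cache, key, value):
    db[key] = value
    cache[key] = value   # overwrite in place, so the key never goes missing


def delayed_double_delete(db, cache, key, value, delay_seconds=0.5):
    cache.pop(key, None)   # first delete, before the database write
    db[key] = value
    # Second delete after a delay: evicts any stale value that a concurrent
    # reader loaded from the old database row and wrote back in the meantime.
    timer = threading.Timer(delay_seconds, cache.pop, args=(key, None))
    timer.daemon = True
    timer.start()
    return timer
```

The delay should be longer than the time a reader needs to load a row and write it to the cache; choosing that value is the operational cost the list above refers to.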

Conclusion

There is no universally best solution; the optimal strategy balances performance, system complexity, and the required consistency level for the specific business scenario.
