How Leading Companies Implement Distributed Caching – Lessons and Pitfalls
The article examines the benefits and hidden risks of using distributed caches, analyzes real‑world failure cases from Facebook, Google and AWS, and provides detailed strategies to mitigate availability and consistency problems through careful refresh policies, incremental updates, and compensation mechanisms.
Cache technology speeds up data access and is used by almost every software system. Since its introduction in 1968, many cache frameworks have emerged, especially in distributed architectures where cache reliability is critical.
Historical Failures Highlighting Risks
2012: Facebook’s Memcached update error displayed incorrect information to users.
2013: Google Spanner cache update failure made millions of users unable to use Google services.
2016: AWS Elastic Load Balancer cache inconsistency caused widespread service outages.
These incidents raise two key questions: Do we truly understand how to use cache, and can every scenario tolerate its drawbacks?
Availability Risks Introduced by Cache
Adding a cache layer creates an additional point of failure, making the system more fragile. In a member‑system example, a naïve cache‑refresh strategy that expires all entries simultaneously triggers a massive burst of database queries, overloading the DB and causing a service outage.
Solution: stagger expiration times by adding random jitter to the TTL, thereby spreading refresh traffic and preventing a spike.
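A minimal sketch of the jitter idea. The helper name ttl_with_jitter and the values of 600 and 120 seconds are illustrative assumptions, not taken from the article:

```python
import random

def ttl_with_jitter(base_ttl_s, max_jitter_s, rng=random):
    """Return base_ttl_s plus a uniform random offset in [0, max_jitter_s]."""
    if base_ttl_s <= 0 or max_jitter_s < 0:
        raise ValueError("base_ttl_s must be positive and max_jitter_s non-negative")
    return base_ttl_s + rng.uniform(0, max_jitter_s)

# Entries written at the same moment now expire at different times.
rng = random.Random(42)  # seeded only to make the sketch reproducible
ttls = [ttl_with_jitter(600, 120, rng) for _ in range(5)]
```

The jitter range is a trade-off: a wider range flattens the refresh spike more, but lets some entries live longer than the nominal TTL.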
Cache‑Refresh Pitfalls and Safer Approaches
A faulty refresh can leave the cache empty. If the refresh clears the cache and the subsequent remote call (e.g., userService.queryAllUsers) then fails because of network jitter, the cache stays empty, and each request that finds it empty triggers another refresh attempt until the retries snowball into a service‑wide avalanche.
Improved method: replace the cached value only after the remote call has succeeded, so that a failed refresh leaves the previous data in place. A full refresh still stresses resources, however.
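One way to sketch the safer refresh, assuming a hypothetical UserCache class whose loader argument stands in for userService.queryAllUsers:

```python
class UserCache:
    """Holds a full snapshot and replaces it only after a successful load."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical stand-in for userService.queryAllUsers
        self._users = {}        # last known good snapshot

    def refresh(self):
        try:
            fresh = self._loader()   # may raise, e.g. on network jitter
        except Exception:
            return False             # keep serving the previous snapshot
        self._users = dict(fresh)    # swap in one assignment, never clear first
        return True

    def get(self, user_id):
        return self._users.get(user_id)
```

Because the old snapshot is never cleared before the new one arrives, a failed call degrades to stale data rather than an empty cache.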
Better practice: push incremental updates when the underlying data changes, reducing load and keeping the cache more up‑to‑date.
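A rough sketch of incremental updates. The event shape (op, key, value) is an assumption made for illustration; the article does not specify a message format:

```python
def apply_change(cache, event):
    """Apply one change event to the cache instead of reloading everything."""
    op, key = event["op"], event["key"]
    if op == "upsert":
        cache[key] = event["value"]
    elif op == "delete":
        cache.pop(key, None)   # deleting an absent key is not an error
    else:
        raise ValueError(f"unknown op: {op}")

cache = {1: "alice", 2: "bob"}
for event in [
    {"op": "upsert", "key": 2, "value": "bobby"},
    {"op": "delete", "key": 1},
    {"op": "upsert", "key": 3, "value": "carol"},
]:
    apply_change(cache, event)
```

Each event touches a single key, so the cost of keeping the cache current scales with the number of changes rather than with the size of the data set.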
Local Cache Risks
When the cache JAR is embedded in the client, excessive refresh tasks can dramatically increase load, and large cached data volumes may cause frequent Full GC, degrading response time and throughput.
Distributed Cache Penetration
Storing all data in a distributed cache improves query speed, but if lookups for non‑existent entries are not cached with a placeholder, every such request misses the cache and falls through to the database. Repeated lookups of this kind can snowball and overwhelm the database.
Mitigation: pre‑warm the cache and store special empty objects for non‑existent data to block penetration.
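The empty-object idea can be sketched as follows. PenetrationGuard, db_lookup and the _MISSING sentinel are hypothetical names, and a plain dict stands in for the distributed cache:

```python
_MISSING = object()  # sentinel cached for keys that do not exist in the DB

class PenetrationGuard:
    def __init__(self, db_lookup):
        self._db_lookup = db_lookup   # hypothetical callable: key -> value or None
        self._cache = {}
        self.db_hits = 0              # counts how often the DB was actually queried

    def get(self, key):
        if key in self._cache:
            cached = self._cache[key]
            return None if cached is _MISSING else cached
        self.db_hits += 1
        value = self._db_lookup(key)
        # Cache the sentinel on a miss so repeated lookups stop at the cache.
        self._cache[key] = _MISSING if value is None else value
        return value
```

In a real cache the placeholder would normally get a short TTL, so that a record created later becomes visible without manual invalidation.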
Data‑Inconsistency Scenarios
The article enumerates five typical patterns:
Pure write: after a transaction, write‑through to cache; failures may leave cache empty, requiring a compensation read‑write.
Pure delete: deleting the cache entry before the DB change commits causes inconsistency if the commit fails. Deleting the cache only after the DB commit avoids this.
Read‑then‑write: on cache miss, read DB and write result back; this pattern maintains consistency for read‑only workloads.
Pure update: non‑atomic DB and cache updates lead to windows of inconsistency; only eventual consistency can be achieved.
Combined operations: concurrent reads during a delete can repopulate stale data, creating brief inconsistency.
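The combined-operations race can be replayed step by step. This is a single-threaded illustration with plain dicts standing in for the DB and the cache; the key name user:1 is made up:

```python
db = {"user:1": "v1"}
cache = {}

# 1. Reader misses the cache and loads the current DB value.
reader_value = cache.get("user:1") or db["user:1"]

# 2. Writer commits a new value and deletes the cache entry.
db["user:1"] = "v2"
cache.pop("user:1", None)

# 3. Reader finishes and writes back what it read in step 1.
cache["user:1"] = reader_value

# The cache now holds the old value until the next delete or expiry.
stale = cache["user:1"] != db["user:1"]
```

The interleaving is forced here for clarity; in practice it depends on timing, which is why the resulting inconsistency is usually brief but hard to rule out.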
Reducing the Inconsistency Window
Recommended steps:
Within the local transaction, update the DB and persist a cache‑compensation task containing the new data.
After the transaction commits, execute the compensation task to push the latest model into the distributed cache.
Include version information (e.g., last‑modified timestamp) in cached entries to detect staleness.
On cache miss, trigger a compensation flow that reads from the primary DB and writes back to the cache.
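The steps above can be sketched with an in-memory SQLite database as the primary DB and a dict as the distributed cache. The table names, the cache_tasks layout, and the use of an integer version counter instead of a last‑modified timestamp are all assumptions made for this illustration:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)")
conn.execute("CREATE TABLE cache_tasks (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice', 1)")
conn.commit()

cache = {}  # stands in for the distributed cache

def update_user(user_id, name):
    """Update the row and persist the compensation task in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE users SET name = ?, version = version + 1 WHERE id = ?",
            (name, user_id),
        )
        row = conn.execute(
            "SELECT id, name, version FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        payload = json.dumps({"id": row[0], "name": row[1], "version": row[2]})
        conn.execute("INSERT INTO cache_tasks (payload) VALUES (?)", (payload,))

def run_compensation_tasks():
    """After commit, push each pending model into the cache, then clear the task."""
    rows = conn.execute("SELECT id, payload FROM cache_tasks ORDER BY id").fetchall()
    for task_id, payload in rows:
        model = json.loads(payload)
        cache[f"user:{model['id']}"] = model  # version travels with the entry
        with conn:
            conn.execute("DELETE FROM cache_tasks WHERE id = ?", (task_id,))

def get_user(user_id):
    """On a cache miss, read from the primary DB and write the result back."""
    key = f"user:{user_id}"
    if key not in cache:
        row = conn.execute(
            "SELECT id, name, version FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        if row is None:
            return None
        cache[key] = {"id": row[0], "name": row[1], "version": row[2]}
    return cache[key]
```

Because the task row is written in the same transaction as the data change, a crash between commit and cache push leaves a durable record that a later run can pick up.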
Additional safeguards:
Use exclusive locks during updates or deletions to prevent concurrent dirty writes.
If exclusive locks are unavailable, read the cache again before writing to merge changes.
Always run compensation tasks against the primary database to avoid replication lag.
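A sketch of the first two safeguards combined: the write takes an exclusive lock, re-reads the current entry, and refuses to overwrite a newer version. VersionedCache and put_if_newer are hypothetical names, and threading.Lock is only an in-process stand-in for whatever distributed lock a real deployment would use:

```python
import threading

class VersionedCache:
    def __init__(self):
        self._data = {}                 # key -> (value, version)
        self._lock = threading.Lock()   # in-process stand-in for a distributed lock

    def put_if_newer(self, key, value, version):
        """Write only if no entry exists or the incoming version is strictly newer."""
        with self._lock:
            current = self._data.get(key)   # read again before writing
            if current is not None and current[1] >= version:
                return False                # a newer or equal write already landed
            self._data[key] = (value, version)
            return True

    def get(self, key):
        with self._lock:
            entry = self._data.get(key)
            return None if entry is None else entry[0]
```

With this check in place, a delayed compensation task carrying an old version cannot overwrite a fresher entry.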
Guidelines for When Not to Use Cache
Cache is unsuitable for scenarios demanding strong timeliness or zero data loss (e.g., user balances). It cannot replace a database’s ACID guarantees, especially for operations that depend on those guarantees, such as idempotency checks.
In summary, cache is a powerful tool but must be applied judiciously, with careful design to mitigate availability risks, consistency gaps, and resource pressure.
Programmer XiaoFu
xiaofucode.com – a programmer learning guide driven by the pursuit of profit
