How Meta Achieves Near‑Perfect Cache Consistency: Lessons from Polaris
This article explains why cache consistency is critical for Meta, how the company measures and monitors consistency, and how the Polaris system detects and resolves stale cache entries, closing with a concrete Python-style example that illustrates the challenges and solutions.
Introduction
Caching is a fundamental technique for reducing latency and scaling workloads in large-scale backend systems. Meta relies heavily on caching, and cache invalidation—ensuring that stale entries are removed when the underlying data changes—is a critical reliability problem. Meta has improved cache consistency from six nines (99.9999%) to ten nines (99.99999999%), meaning fewer than one inconsistency per 10 billion writes.
Cache Invalidation and Consistency
A cache stores a copy of the data, not the authoritative source. When the source is updated, the cache must be invalidated; otherwise the two copies diverge. A common approach is to expire entries with a TTL, but Meta assumes invalidations are triggered out of band by external systems rather than by expiry.
Typical inconsistency scenario (the four events below occur in increasing timestamp order):
The cache attempts to fill a key from the database, reading the value 42.
Before the value 42 reaches the cache, the database updates the key to 43.
The database sends an invalidation carrying version 43; it arrives before the pending fill, so the cache stores 43.
The stale fill with value 42, started earlier, completes last and overwrites the cache with 42, creating the inconsistency.
Version fields can mitigate this race condition, but at Meta’s scale even version‑based conflict resolution can be insufficient.
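To make the mitigation concrete, here is a minimal sketch of version-based conflict resolution. The `VersionedCache` class and its `apply` method are hypothetical names, not Meta's actual implementation; the idea is simply that a write only takes effect if it carries a version at least as new as what is already cached, so the late-arriving fill from the race above is rejected.

```python
class VersionedCache:
    """In-memory cache that resolves write races by comparing versions."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        """Apply a fill or invalidation; reject it if a newer version is cached."""
        current = self._data.get(key)
        if current is not None and current[0] >= version:
            return False  # stale write: drop it instead of overwriting
        self._data[key] = (version, value)
        return True

    def get(self, key):
        entry = self._data.get(key)
        return entry[1] if entry else None
```

Replaying the scenario above: the invalidation for version 43 lands first, then the stale fill for version 42 arrives and is silently discarded, leaving the cache correct. As the article notes, at Meta's scale even this scheme can fail in rare interleavings, which is why measurement matters.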
Why Consistency Matters
From Meta’s perspective, a cache inconsistency is as severe as data loss.
From the user’s perspective, inconsistency leads to broken experiences, such as messages being delivered to a region that lacks the recipient’s prior messages.
Monitoring
Accurate measurement of cache consistency is essential. Logging every cache state change is infeasible at >10 trillion writes per day; instead, Meta records only events that could cause state changes and raises alerts when inconsistencies are detected. The primary metric is “99.99999999 % of cache writes are consistent within M minutes.”
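The metric above can be computed from sampled observations rather than exhaustive logs. The sketch below is an assumption about how such a number could be derived: each sample pairs a write timestamp with the time the cache was observed consistent (or `None` if never observed consistent).

```python
from datetime import datetime, timedelta

def consistency_within(samples, window_minutes):
    """Fraction of sampled cache writes that became consistent within
    `window_minutes` of the write.

    `samples` is a list of (write_time, consistent_time_or_None) pairs.
    """
    window = timedelta(minutes=window_minutes)
    ok = sum(
        1 for written, consistent in samples
        if consistent is not None and consistent - written <= window
    )
    return ok / len(samples) if samples else 1.0
```

At >10 trillion writes per day only a sampled stream is tractable, so the reported "ten nines" figure is an estimate over the sampled population, alerted on when it dips.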
Polaris System
Polaris acts as a client of stateful services without assuming knowledge of their internal structure. Upon receiving an invalidation, Polaris queries all replicas to verify whether any violate the invariant that the cache eventually matches the database. It marks inconsistent replicas, re‑fetches the correct data, and reports violations over configurable time windows (e.g., 1 min, 5 min, 10 min).
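The core check can be sketched as follows. This is a simplified model, not Polaris's real interface: `replicas` is a hypothetical mapping from replica name to a client exposing a `get_version(key)` method, and the function flags replicas whose view is older than the invalidation it just received.

```python
def check_invalidation(replicas, key, version):
    """Ask every replica for its view of `key` after an invalidation
    at `version`, and return the names of replicas that still hold an
    older (or missing) version."""
    violators = []
    for name, client in replicas.items():
        seen = client.get_version(key)  # None means the key is absent
        if seen is None or seen < version:
            violators.append(name)
    return violators
```

A real checker would re-fetch the correct data for flagged replicas and aggregate violations into the configurable reporting windows (1 min, 5 min, 10 min) mentioned above.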
Example Scenario
If Polaris receives an invalidation for key x at version 4 but finds no entry, it marks the key as inconsistent. Two possibilities exist:
The key was invisible at version 3 and the version 4 write is the latest—this is a true inconsistency.
A later version 5 write deleted the key, so Polaris sees a newer view than the invalidation.
Polaris delays expensive database checks until the inconsistency persists beyond a threshold (e.g., 1–5 minutes) to avoid overloading the database.
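The delayed-verification idea can be sketched like this. `DelayedChecker` and `db_lookup` are illustrative names, not part of Polaris: suspected inconsistencies are queued with a first-seen timestamp, and the authoritative database is consulted only once a suspicion has persisted past the threshold.

```python
import time

class DelayedChecker:
    """Queue suspected inconsistencies; only hit the expensive database
    once a suspicion has persisted for `threshold_s` seconds."""

    def __init__(self, threshold_s, db_lookup):
        self.threshold_s = threshold_s
        self.db_lookup = db_lookup  # key -> authoritative version
        self.pending = {}           # key -> first_seen timestamp

    def report(self, key, now=None):
        now = time.monotonic() if now is None else now
        self.pending.setdefault(key, now)  # keep the earliest sighting

    def due_checks(self, now=None):
        """Verify (and clear) suspicions older than the threshold."""
        now = time.monotonic() if now is None else now
        results = {}
        for key, first_seen in list(self.pending.items()):
            if now - first_seen >= self.threshold_s:
                results[key] = self.db_lookup(key)
                del self.pending[key]
        return results
```

Transient races (such as the version-5 delete above) usually resolve themselves within the threshold, so most suspicions evaporate before a database query is ever issued.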
Code Simulation
The following Python‑like snippet reproduces a cache‑invalidation race and demonstrates how Polaris‑style checks could be added.
import threading
import time

cache_data = {}               # key -> cached value
cache_version = {}            # key -> version the cached value was filled at
meta_data_table = {"1": 42}   # authoritative data store
version_table = {"1": 4}      # authoritative version store

def read_value(key):
    value = read_value_from_cache(key)
    if value is not None:
        return value
    return meta_data_table[key]

def read_value_from_cache(key):
    if key in cache_data:
        return cache_data[key]
    # Cache miss: fill asynchronously and fall back to the database.
    threading.Thread(target=fill_cache, args=(key,)).start()
    return None

def fill_cache(key):
    cache_data[key] = meta_data_table[key]
    cache_version[key] = version_table[key]

def write_value(key, value):
    version = version_table.get(key, 0) + 1
    write_in_database(key, value, version)
    time.sleep(3)  # Simulate invalidation latency, widening the race window
    invalidate_cache(key, version)

def write_in_database(key, data, version):
    meta_data_table[key] = data
    version_table[key] = version

def invalidate_cache(key, version):
    try:
        # Intentional error: cached values are ints, so subscripting
        # them raises TypeError and forces the failure path.
        _ = cache_data[key]["value"]
    except Exception:
        drop_cache(key, version)

def drop_cache(key, version):
    # Only drop the entry if the invalidation is newer than the cached fill.
    if version > cache_version.get(key, -1):
        cache_data.pop(key, None)
        cache_version.pop(key, None)

If the invalidation fails, the exception handler drops the cache entry only conditionally, which can leave stale data behind and cause inconsistency.
Consistency Tracking
Polaris logs only events that could change cache state. Operators verify three questions when an alert fires:
Did the cache server receive the invalidation?
Did it process the invalidation correctly?
Did the entry become inconsistent after processing?
Meta built a state‑tracking library that records cache mutations; complex interactions that trigger errors are captured for post‑mortem analysis.
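A minimal version of such a state-tracking library might look like the sketch below. `CacheEventLog` is a hypothetical name: it records only the events that can change cache state (fills, invalidations, drops) in a bounded ring buffer, which is enough to answer the three questions above during a post-mortem.

```python
import collections
import threading

class CacheEventLog:
    """Record cache-mutating events in a bounded, thread-safe buffer
    for later post-mortem analysis."""

    def __init__(self, capacity=1000):
        self._events = collections.deque(maxlen=capacity)  # oldest evicted first
        self._lock = threading.Lock()

    def record(self, event, key, version):
        with self._lock:
            self._events.append((event, key, version))

    def history(self, key):
        """Return all recorded events for `key`, oldest first."""
        with self._lock:
            return [e for e in self._events if e[1] == key]
```

When an alert fires, the per-key history shows whether the invalidation was received, whether it was processed, and what state the entry ended in.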
Conclusion
Reliable monitoring and selective logging are indispensable for any distributed system. Polaris provides rapid alerts, and the consistency‑tracking data enables on‑call engineers to pinpoint root causes within minutes.
References
Meta engineering article: https://engineering.fb.com/2022/06/08/core-infra/cache-made-consistent/
Sample implementation repository: https://github.com/Mayank-Sharma-27/meta-cache-made-consistent
JavaEdge
First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.