How Meta Reached 99.99999999% Cache Consistency and What You Can Learn
This article explains Meta's approach to cache invalidation and consistency, why ultra‑high consistency matters for user experience, the monitoring infrastructure they built, the Polaris system that detects and repairs inconsistencies, and provides a concrete Python‑style code example illustrating the problem and solution.
Introduction
Cache is a fundamental technique used throughout computer systems, especially in backend services, to reduce latency, scale workloads, and cut costs. Meta relies heavily on caching, which makes cache invalidation and consistency critical problems.
What Are Cache Invalidation and Consistency?
Because a cache does not store the original data source, any change to the source must trigger an invalidation process; otherwise stale entries remain and diverge from the source. The article illustrates a typical race where a write to the database arrives after a stale value has already been cached, creating inconsistency.
Cache reads from the database.
Before the cached value arrives, the database is updated.
The database emits an invalidation event that reaches the cache before the stale write, so the cache is set to the new value.
The stale write later arrives, overwriting the correct value and causing inconsistency.
Version fields can resolve such conflicts, but at Meta’s scale the solution must handle billions of writes per second.
Why Meta Prioritizes Cache Consistency
For Meta, an inconsistent cache is as bad as lost database data because it directly degrades user experience. The article gives the example of private messages (DMs) on Instagram: if different replicas store different versions of a user’s inbox, messages can be lost or delivered to the wrong region, leading to a poor experience.
Monitoring Cache Consistency
Accurate measurement is the first step. Meta’s monitoring must emit alerts only for genuine inconsistencies; false positives would be ignored by on‑call engineers, rendering the metric useless. Simple logging of every cache state change is infeasible because Meta processes over 10 trillion cache fills per day.
Polaris: The Consistency‑Detection System
Polaris assumes “the cache should eventually be consistent with the database.” When an invalidation event arrives, Polaris queries all replicas to verify that no other violations exist. It aggregates inconsistencies over configurable time windows (e.g., 1 minute, 5 minutes) and reports a metric such as “99.99999999% of writes are consistent within five minutes.”
Code Example Demonstrating the Bug
The following simplified Python‑style code shows a cache that stores data and version numbers, a read path that falls back to the database, and an invalidation routine that may drop stale entries. The example reproduces the race where a stale invalidation overwrites a newer value.
cache_data = {}
cache_version = {}
meta_data_table = {"1": 42}
version_table = {"1": 4} def read_value(key):
value = read_value_from_cache(key)
if value is not None:
return value
else:
return meta_data_table[key]
def read_value_from_cache(key):
if key in cache_data:
return cache_data[key]
else:
fill_cache_thread = threading.Thread(target=fill_cache, args=(key,))
fill_cache_thread.start()
return None def fill_cache(key):
fill_cache_metadata(key)
fill_cache_version(key)
def fill_cache_metadata(key):
meta_data = meta_data_table[key]
print("Filling cache meta data for", meta_data)
cache_data[key] = meta_data
def fill_cache_version(key):
time.sleep(2)
version = version_table[key]
print("Filling cache version data for", version)
cache_version[key] = version
def write_value(key, value):
version = version_table.get(key, 1) + 1
write_in_database_transactionally(key, value, version)
time.sleep(3)
invalidate_cache(key, value, version)
def write_in_database_transactionally(key, data, version):
meta_data_table[key] = data
version_table[key] = version
def invalidate_cache(key, metadata, version):
try:
cache_data[key] = metadata # Simulated error
except:
drop_cache(key, version)
def drop_cache(key, version):
if version > cache_version.get(key, 0):
cache_data.pop(key, None)
cache_version.pop(key, None)Multiple threads read and write concurrently, and if the invalidation fails to drop a stale entry, the cache can retain outdated metadata indefinitely.
Consistency Tracking in Practice
When Polaris reports an inconsistency, on‑call engineers first verify whether the cache server received the invalidation request, whether it processed it correctly, and whether the entry became inconsistent. Meta built a state‑tracking library that records only the changes that could lead to inconsistency, reducing logging overhead while still providing enough context to debug bugs quickly.
Conclusion
Reliable monitoring and logging are essential for any distributed system. Meta’s Polaris system detects cache inconsistencies, alerts engineers, and provides enough tracked data to locate the root cause within minutes, demonstrating how ultra‑high cache consistency can be achieved at massive scale.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
