How Meta Achieves Near‑Perfect Cache Consistency: Lessons from Polaris
This article explains Meta's approach to cache invalidation and consistency: why ultra‑high consistency matters, how the Polaris monitoring system detects inconsistencies, and how a simplified Python example illustrates the underlying mechanisms and challenges.
Introduction
Caching is a fundamental technique used throughout computer systems, from hardware caches to operating‑system and web‑browser caches. In large‑scale backend services, caching reduces latency, absorbs massive workloads, and cuts costs, but it also creates the problem of cache invalidation.
Meta has raised its cache‑consistency level from six nines (99.9999 %) to ten nines (99.99999999 %), meaning fewer than one inconsistency per 10 billion cache writes.
Cache Invalidation and Consistency
Because a cache is not the source of truth for the data it holds, any change to the source must proactively invalidate stale cache entries. Failure to do so leads to inconsistency between cache and source.
A cache can expire entries on its own with a time‑to‑live (TTL), but this article focuses on invalidation triggered by a component outside the cache.
Example of an inconsistency scenario (timestamps 1‑4):
1. The cache misses on x and tries to fetch its value from the database.
2. Before the fill response carrying x=42 reaches the cache, the database is updated to x=43.
3. The database sends an invalidation event for x=43, which reaches the cache before the x=42 fill response, so the cache is set to 43.
4. The delayed fill response carrying x=42 then arrives and overwrites the cache with 42. The cache now holds 42 while the database holds 43.
A common mitigation is to attach a version field to each entry and only apply updates with a newer version, preventing older events from overwriting newer data. This works for most workloads but can be insufficient at Meta’s scale.
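The race and the version‑based mitigation can be sketched as follows. This is a minimal illustration rather than Meta's implementation: the class names and the `apply` method are hypothetical, and the version numbers 4 and 5 are assumed for x=42 and x=43.

```python
class UnversionedCache:
    """Applies every incoming event, so arrival order decides the final value."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value, version):
        self.data[key] = value  # last arrival wins, even if it is older


class VersionedCache:
    """Keeps a version per key and ignores events that are not newer."""

    def __init__(self):
        self.data = {}      # key -> value
        self.versions = {}  # key -> version of the stored value

    def apply(self, key, value, version):
        if version <= self.versions.get(key, -1):
            return False  # stale event: drop it
        self.data[key] = value
        self.versions[key] = version
        return True


# Events in the order they ARRIVE at the cache: the invalidation for
# x=43 (version 5) overtakes the delayed fill for x=42 (version 4).
arrivals = [("x", 43, 5), ("x", 42, 4)]

naive, guarded = UnversionedCache(), VersionedCache()
for key, value, version in arrivals:
    naive.apply(key, value, version)
    guarded.apply(key, value, version)

print(naive.data["x"])    # 42: the stale fill overwrote the newer value
print(guarded.data["x"])  # 43: the stale fill was rejected
```

The guard only helps when the value and its version are updated together; if they are filled in separate steps, a stale value can end up paired with a new version.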
Why Consistency Matters at Scale
From a user‑experience perspective, cache inconsistency can be as severe as data loss. For example, Instagram direct‑message routing relies on a cached mapping from each user to the storage region that holds that user's messages, typically the region closest to them. If cache replicas in different regions hold divergent mappings, a message may be routed to a region that lacks the recipient's data, and the message is lost.
Monitoring Cache Consistency
Accurate measurement is the first step. Alerts must be trustworthy; otherwise engineers will ignore them. Recording every cache state change is infeasible for workloads that process >10 trillion cache fills per day, so Meta’s solution records only mutations that could lead to inconsistencies.
Polaris System
At a high level, Polaris acts as a client of the stateful service and assumes no knowledge of its internals. Its guiding principle is “the cache should eventually be consistent with the database.” Polaris receives invalidation events, queries all cache replicas, and flags any replica that violates that principle.
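A rough sketch of that check, assuming each cache replica can be read as a simple mapping from key to (value, version). The names `InvalidationEvent`, `find_violations`, and the replica identifiers are hypothetical and are not Polaris APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvalidationEvent:
    key: str
    value: int
    version: int


def find_violations(event, replicas):
    """Return the names of replicas whose entry is older than the event.

    `replicas` maps a replica name to a dict of key -> (value, version).
    A replica that does not hold the key is skipped, since a missing
    entry cannot be judged from the cache alone.
    """
    stale = []
    for name, entries in replicas.items():
        entry = entries.get(event.key)
        if entry is None:
            continue
        _, cached_version = entry
        if cached_version < event.version:
            stale.append(name)
    return stale


replicas = {
    "replica-a": {"x": (4, 4)},
    "replica-b": {"x": (3, 3)},  # has not applied the invalidation yet
    "replica-c": {},             # key not cached
}
event = InvalidationEvent(key="x", value=4, version=4)
print(find_violations(event, replicas))  # ['replica-b']
```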
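Polaris reports inconsistencies on multiple time scales (for example 1 min, 5 min, and 10 min). A replica that looks stale right after an invalidation may simply not have applied it yet, so Polaris requeues the sample and checks the same replica again later. To avoid excessive load, it queries the database only once an inconsistent sample has persisted past the reporting time scale.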
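One way to keep database load low is to re‑check a flagged sample against the cache first and go to the database only after the sample has stayed inconsistent past a reporting time scale. The sketch below works under that assumption; `Sample`, `Checker`, the injected `now` clock, and the version‑only dictionaries are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    key: str
    expected_version: int
    first_seen: float  # time in seconds when the sample was first flagged


class Checker:
    def __init__(self, cache, database, timescale_s):
        self.cache = cache          # key -> version held by the cache
        self.database = database    # key -> version held by the database
        self.timescale_s = timescale_s
        self.db_queries = 0         # counts the expensive reads

    def check(self, sample, now):
        """Return 'consistent', 'requeue', or 'inconsistent'."""
        cached = self.cache.get(sample.key)
        # Assumption of this sketch: a missing entry counts as consistent,
        # because a cache miss cannot serve stale data.
        if cached is None or cached >= sample.expected_version:
            return "consistent"
        if now - sample.first_seen < self.timescale_s:
            return "requeue"  # too early to report; check the cache again later
        # The sample outlived the time scale: pay for one database read to
        # confirm the cache really is behind before reporting it.
        self.db_queries += 1
        db_version = self.database.get(sample.key, -1)
        return "inconsistent" if cached < db_version else "consistent"


checker = Checker(cache={"x": 3}, database={"x": 4}, timescale_s=60)
sample = Sample(key="x", expected_version=4, first_seen=0.0)
early = checker.check(sample, now=10.0)
late = checker.check(sample, now=61.0)
print(early, late, checker.db_queries)  # requeue inconsistent 1
```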
The system emits a metric of the form “N nines of cache writes are consistent within M minutes.” At the time of the source post, Polaris reported 99.99999999 % of cache writes consistent within five minutes.
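The arithmetic behind such a metric can be sketched as below. The function names and the sample data are made up for illustration; the source does not describe how Polaris computes the figure internally.

```python
import math


def nines(total_writes, inconsistent_writes):
    """Number of nines of consistency: log10(total / inconsistent)."""
    if inconsistent_writes == 0:
        return math.inf
    return math.log10(total_writes / inconsistent_writes)


def consistency_report(seconds_to_consistent, window_minutes):
    """Count sampled writes that were NOT consistent within the window.

    `seconds_to_consistent` holds, per sampled write, how long the cache
    took to match the database (None if it never did while observed).
    Returns (total, late, nines).
    """
    window_s = window_minutes * 60
    total = len(seconds_to_consistent)
    late = sum(1 for s in seconds_to_consistent if s is None or s > window_s)
    return total, late, nines(total, late)


# Hypothetical data: 9,999 writes converge in 0.2 s, one takes 400 s.
samples = [0.2] * 9_999 + [400.0]
total, late, n = consistency_report(samples, window_minutes=5)
print(total, late, round(n, 2))  # 10000 1 4.0
```

The same samples yield different results per window: the 400‑second write counts against the one‑minute and five‑minute metrics but not against the ten‑minute one.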
Example Implementation (Python)
import threading
import time

# In-memory stand-ins for the cache and the two database tables.
cache_data = {}
cache_version = {}
meta_data_table = {"1": 42}
version_table = {"1": 4}

def read_value(key):
    value = read_value_from_cache(key)
    if value is not None:
        return value
    # Cache miss: serve the read from the database.
    return meta_data_table[key]

def read_value_from_cache(key):
    if key in cache_data:
        return cache_data[key]
    # Cache miss: fill the cache in the background.
    fill_cache_thread = threading.Thread(target=fill_cache, args=(key,))
    fill_cache_thread.start()
    return None

def fill_cache(key):
    # The value and its version are filled in two separate, non-atomic steps.
    fill_cache_metadata(key)
    fill_cache_version(key)

def fill_cache_metadata(key):
    meta_data = meta_data_table[key]
    print("Filling cache meta data for", meta_data)
    cache_data[key] = meta_data

def fill_cache_version(key):
    time.sleep(2)  # a write can land between the two fill steps
    version = version_table[key]
    print("Filling cache version data for", version)
    cache_version[key] = version

def write_value(key, value):
    version = version_table.get(key, 0) + 1
    write_in_database_transactionally(key, value, version)
    time.sleep(3)  # simulate async propagation delay
    invalidate_cache(key, value, version)

def write_in_database_transactionally(key, data, version):
    meta_data_table[key] = data
    version_table[key] = version

def invalidate_cache(key, metadata, version):
    try:
        # Intentional error: subscripting the cached int raises TypeError,
        # standing in for a transient failure while applying the invalidation.
        _ = cache_data[key][metadata]
    except Exception:
        drop_cache(key, version)

def drop_cache(key, version):
    # Error handler: drop the entry only if the invalidation is newer
    # than the version the cache believes it holds.
    cache_version_value = cache_version.get(key, -1)
    if version > cache_version_value:
        cache_data.pop(key, None)
        cache_version.pop(key, None)

# Example usage
read_thread = threading.Thread(target=read_value, args=("1",))
write_thread = threading.Thread(target=write_value, args=("1", 43))
read_thread.start()
time.sleep(1)  # let the cache fill read 42 before the write lands
write_thread.start()
# Outcome: the cache pairs the stale value 42 with the new version 5, so the
# failed invalidation (also version 5) does not drop it; the database holds 43.
Consistency Tracking
When a Polaris alert fires, on‑call engineers examine the selective logs to locate the root cause. By logging only mutations that could cause inconsistencies, engineers can pinpoint bugs within ~30 minutes without overwhelming the system.
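The idea of logging only the mutations that could matter can be sketched as a tracer that records cache mutations for a key only during a short window after that key was invalidated. The window length, the class name `SelectiveTracer`, and the log format are assumptions for illustration; the source does not describe the tracing library's interface.

```python
class SelectiveTracer:
    def __init__(self, window_s):
        self.window_s = window_s
        self.last_invalidated = {}  # key -> time of the latest invalidation
        self.log = []               # recorded (time, key, action) tuples

    def on_invalidation(self, key, now):
        self.last_invalidated[key] = now
        self.log.append((now, key, "invalidate"))

    def on_cache_mutation(self, key, action, now):
        """Record the mutation only if the key was invalidated recently."""
        invalidated_at = self.last_invalidated.get(key)
        if invalidated_at is not None and now - invalidated_at <= self.window_s:
            self.log.append((now, key, action))


tracer = SelectiveTracer(window_s=5.0)
tracer.on_cache_mutation("x", "fill", now=0.0)   # no recent invalidation: skipped
tracer.on_invalidation("x", now=10.0)
tracer.on_cache_mutation("x", "fill", now=11.0)  # inside the window: logged
tracer.on_cache_mutation("x", "fill", now=30.0)  # window has passed: skipped
tracer.on_cache_mutation("y", "fill", now=11.0)  # different key: skipped
print(tracer.log)  # [(10.0, 'x', 'invalidate'), (11.0, 'x', 'fill')]
```

Because most mutations fall outside any such window, the log stays small enough to keep and to search during an incident.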
Conclusion
Reliable monitoring and selective logging are essential for any distributed system to capture and quickly resolve cache‑related bugs. Meta’s Polaris system demonstrates how precise inconsistency detection, multi‑time‑scale alerts, and targeted consistency tracking can achieve near‑perfect cache consistency at massive scale.
Reference: https://engineering.fb.com/2022/06/08/core-infra/cache-made-consistent/
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
