How Meta Reached 99.99999999% Cache Consistency and What You Can Learn

This article explains Meta's approach to cache invalidation and consistency, why ultra‑high consistency matters for user experience, the monitoring infrastructure they built, the Polaris system that detects and repairs inconsistencies, and provides a concrete Python‑style code example illustrating the problem and solution.

dbaplus Community

Introduction

Caching is a fundamental technique used throughout computer systems, especially in backend services, to reduce latency, scale workloads, and cut costs. Meta relies heavily on caching, which makes cache invalidation and consistency critical problems.

What Are Cache Invalidation and Consistency?

Because a cache holds a copy of the data rather than the source of truth, any change to the source must trigger an invalidation; otherwise stale entries remain and diverge from the source. A typical race arises when a cache fill carrying an old value is delayed and lands after the invalidation for a newer write:

1. The cache misses on a key and reads the value from the database.

2. Before that value reaches the cache, the database is updated.

3. The database emits an invalidation event for the update. The event reaches the cache before the value read in step 1, so the cache is set to the new value.

4. The value read in step 1 finally arrives and overwrites the newer value, leaving the cache inconsistent with the database.

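A version field on each entry can resolve such conflicts, because the cache can reject any write that is older than what it already holds. At Meta’s scale, however, the solution must handle billions of writes per second.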
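The version-based conflict resolution mentioned above can be sketched as follows. This is a minimal illustration, not Meta's implementation: the `VersionedCache` class, the `set_if_newer` method, and the key, values, and version numbers are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    value: int
    version: int


class VersionedCache:
    """Toy cache that only accepts writes newer than what it already holds."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def set_if_newer(self, key: str, value: int, version: int) -> bool:
        """Apply a fill or invalidation only if its version is newer."""
        current = self._entries.get(key)
        if current is not None and version <= current.version:
            return False  # stale or duplicate write: ignore it
        self._entries[key] = Entry(value, version)
        return True

    def get(self, key: str) -> Optional[int]:
        entry = self._entries.get(key)
        return None if entry is None else entry.value


# Replay the race with hypothetical values: the database moved from
# x=42 (version 1) to x=43 (version 2) while a fill for version 1 was in flight.
cache = VersionedCache()
# The invalidation for version 2 reaches the cache first.
applied_invalidation = cache.set_if_newer("x", 43, version=2)
# The delayed fill carrying the old value arrives afterwards and is rejected.
applied_stale_fill = cache.set_if_newer("x", 42, version=1)

print(cache.get("x"))  # 43
```

With the version check in place, the order in which the fill and the invalidation arrive no longer matters for the final cached value in this scenario.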

Why Meta Prioritizes Cache Consistency

For Meta, an inconsistent cache is as bad as lost database data because it directly degrades user experience. The article gives the example of private messages (DMs) on Instagram: if different replicas store different versions of a user’s inbox, messages can be lost or delivered to the wrong region, leading to a poor experience.

Monitoring Cache Consistency

Accurate measurement is the first step. Meta’s monitoring must emit alerts only for genuine inconsistencies; false positives would be ignored by on‑call engineers, rendering the metric useless. Simple logging of every cache state change is infeasible because Meta processes over 10 trillion cache fills per day.
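To put that figure in perspective, the short calculation below converts it to a per-second rate. It treats "over 10 trillion" as exactly 10 trillion, so the result is a lower bound rather than a measured number.

```python
FILLS_PER_DAY = 10 * 10**12     # "over 10 trillion" from the text, used as a lower bound
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

fills_per_second = FILLS_PER_DAY / SECONDS_PER_DAY
print(f"{fills_per_second:,.0f} cache fills per second")  # 115,740,741 cache fills per second
```

At well over a hundred million fills per second, writing a log record for every state change would be a large workload in its own right.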

Polaris: The Consistency‑Detection System

Polaris is built on the assumption that “the cache should eventually be consistent with the database.” When an invalidation event arrives, Polaris queries all replicas to check whether any of them still violates that assumption. It aggregates inconsistencies over configurable time windows (e.g., 1 minute, 5 minutes) and reports a metric such as “99.99999999% of writes are consistent within five minutes.”
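The idea can be sketched as a small checker. This is only an illustration under stated assumptions, not Polaris itself: the `Replica`, `PendingCheck`, and `ConsistencyChecker` names, the rule that a missing key counts as consistent, and the use of explicit `now` timestamps instead of a real clock are all choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Replica:
    """Hypothetical cache replica mapping key -> (value, version)."""
    name: str
    entries: Dict[str, Tuple[int, int]]


@dataclass
class PendingCheck:
    key: str
    version: int
    received_at: float


class ConsistencyChecker:
    def __init__(self, replicas: List[Replica], window_seconds: float) -> None:
        self.replicas = replicas
        self.window_seconds = window_seconds
        self.pending: List[PendingCheck] = []
        self.checked = 0
        self.violations = 0

    def on_invalidation(self, key: str, version: int, now: float) -> None:
        """Remember an invalidation event so replicas can be verified against it."""
        self.pending.append(PendingCheck(key, version, now))

    def _is_stale(self, check: PendingCheck) -> bool:
        for replica in self.replicas:
            cached = replica.entries.get(check.key)
            # Assumption: a missing key is fine, the next read will refill it.
            if cached is not None and cached[1] < check.version:
                return True
        return False

    def run_checks(self, now: float) -> None:
        """Count a violation only if a replica is still stale after the window."""
        still_pending = []
        for check in self.pending:
            if not self._is_stale(check):
                self.checked += 1
            elif now - check.received_at >= self.window_seconds:
                self.checked += 1
                self.violations += 1
            else:
                still_pending.append(check)  # stale, but still inside the window
        self.pending = still_pending


east = Replica("east", {"user:1": (43, 5)})
west = Replica("west", {"user:1": (42, 4)})  # has not applied version 5 yet
checker = ConsistencyChecker([east, west], window_seconds=300)

checker.on_invalidation("user:1", version=5, now=0.0)
checker.run_checks(now=60.0)                 # west is stale but inside the window
pending_after_one_minute = len(checker.pending)
west.entries["user:1"] = (43, 5)             # west catches up
checker.run_checks(now=120.0)

checker.on_invalidation("user:2", version=9, now=120.0)
west.entries["user:2"] = (7, 8)              # west stays on an old version
checker.run_checks(now=500.0)                # 380 s later: counted as a violation

print(checker.checked, checker.violations)   # 2 1
```

Waiting for the window to expire before counting a violation is what keeps transient lag out of the metric; only entries that remain stale past the window are reported.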

Code Example Demonstrating the Bug

The following simplified Python‑style code models a cache that stores metadata and version numbers separately, a read path that falls back to the database on a miss, and an invalidation routine whose error handler drops an entry only when the incoming version is newer than the cached one. It reproduces a race in which a failed invalidation leaves stale metadata in the cache.

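import threading
import time

# In-memory stand-ins for the cache and for the database tables.
cache_data = {}
cache_version = {}
meta_data_table = {"1": 42}
version_table = {"1": 4}

# Set to False to let invalidations succeed instead of hitting the error path.
SIMULATE_INVALIDATION_ERROR = True


def read_value(key):
    value = read_value_from_cache(key)
    if value is not None:
        return value
    # Cache miss: serve from the database while the fill runs in the background.
    return meta_data_table[key]


def read_value_from_cache(key):
    if key in cache_data:
        return cache_data[key]
    # Cache miss: start an asynchronous fill and report the miss to the caller.
    fill_cache_thread = threading.Thread(target=fill_cache, args=(key,))
    fill_cache_thread.start()
    return None


def fill_cache(key):
    # Metadata and version are copied in two separate, non-atomic steps.
    fill_cache_metadata(key)
    fill_cache_version(key)


def fill_cache_metadata(key):
    meta_data = meta_data_table[key]
    print("Filling cache meta data for", meta_data)
    cache_data[key] = meta_data


def fill_cache_version(key):
    time.sleep(2)  # widen the window in which a concurrent write can land
    version = version_table[key]
    print("Filling cache version data for", version)
    cache_version[key] = version


def write_value(key, value):
    version = version_table.get(key, 1) + 1
    write_in_database_transactionally(key, value, version)
    time.sleep(3)  # the invalidation is delivered some time after the commit
    invalidate_cache(key, value, version)


def write_in_database_transactionally(key, data, version):
    # Stand-in for a single transaction that updates both tables together.
    meta_data_table[key] = data
    version_table[key] = version


def invalidate_cache(key, metadata, version):
    try:
        if SIMULATE_INVALIDATION_ERROR:
            raise RuntimeError("simulated invalidation failure")
        cache_data[key] = metadata
    except RuntimeError:
        # Error handler: fall back to dropping the entry.
        drop_cache(key, version)


def drop_cache(key, version):
    # Drops only when the incoming version is strictly newer than the cached one.
    if version > cache_version.get(key, 0):
        cache_data.pop(key, None)
        cache_version.pop(key, None)


if __name__ == "__main__":
    # Start a read that misses the cache, then a concurrent write one second later.
    read_thread = threading.Thread(target=read_value, args=("1",))
    write_thread = threading.Thread(target=write_value, args=("1", 43))
    read_thread.start()
    time.sleep(1)  # let the fill copy the metadata before the write commits
    write_thread.start()
    read_thread.join()
    write_thread.join()
    # With this timing the database holds 43 while the cache still holds 42.
    print("database:", meta_data_table["1"], version_table["1"])
    print("cache:", cache_data.get("1"), cache_version.get("1"))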

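When a read and a write for the same key run concurrently, the fill can pair the old metadata (42) with the new version (5). If the subsequent invalidation then fails, the error handler refuses to drop the entry because the incoming version is not greater than the cached one, so the cache retains outdated metadata indefinitely.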
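The same interleaving can be replayed deterministically, without threads or sleeps, which makes the failure easier to follow. The sketch below is an illustration only: the `replay` function and its `drop_on_equal_version` flag are hypothetical, and dropping on an equal version is shown as one possible mitigation for this scenario, not as the fix Meta applied.

```python
def replay(drop_on_equal_version: bool) -> dict:
    """Replay the problematic interleaving step by step, without threads."""
    db = {"value": 42, "version": 4}
    cache = {}

    # 1. A read misses; the fill copies the metadata first.
    cache["value"] = db["value"]        # cache holds 42, no version yet
    # 2. A writer commits value 43 with version 5 to the database.
    db = {"value": 43, "version": 5}
    # 3. The fill copies the version, pairing old metadata with the new version.
    cache["version"] = db["version"]    # cache holds (42, version 5)
    # 4. The invalidation for (43, version 5) fails, so the error handler runs.
    incoming_version = db["version"]
    cached_version = cache.get("version", 0)
    if drop_on_equal_version:
        should_drop = incoming_version >= cached_version
    else:
        should_drop = incoming_version > cached_version
    if should_drop:
        cache.clear()
    return cache


print(replay(drop_on_equal_version=False))  # {'value': 42, 'version': 5}
print(replay(drop_on_equal_version=True))   # {}
```

With the strict comparison the stale pair survives the error handler. With the relaxed comparison the entry is dropped, so the next read would refill it from the database.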

Consistency Tracking in Practice

When Polaris reports an inconsistency, on‑call engineers first check three things: whether the cache server received the invalidation request, whether it processed the request correctly, and whether the entry became inconsistent afterwards. Meta built a state‑tracking library that records only the changes that could lead to inconsistency, reducing logging overhead while still providing enough context to diagnose bugs quickly.
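One way to record only the risky changes is to log cache mutations for a key only during a short window after an invalidation for that key. The sketch below assumes that heuristic for illustration; the `MutationTracer` class, its method names, and the one-second window are hypothetical and do not describe Meta's library.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple


class MutationTracer:
    """Record cache mutations only shortly after an invalidation for the same key."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._last_invalidation: Dict[str, float] = {}
        self.trace: DefaultDict[str, List[Tuple[float, str]]] = defaultdict(list)

    def on_invalidation(self, key: str, now: float) -> None:
        # An invalidation opens (or restarts) the tracing window for this key.
        self._last_invalidation[key] = now
        self.trace[key].append((now, "invalidation received"))

    def on_mutation(self, key: str, description: str, now: float) -> None:
        started = self._last_invalidation.get(key)
        if started is None or now - started > self.window_seconds:
            return  # outside the risky window: skip logging entirely
        self.trace[key].append((now, description))


tracer = MutationTracer(window_seconds=1.0)
tracer.on_mutation("k", "fill value=42", now=0.0)           # not logged
tracer.on_invalidation("k", now=10.0)
tracer.on_mutation("k", "fill value=42 (stale)", now=10.4)  # logged
tracer.on_mutation("k", "fill value=43", now=20.0)          # not logged

print(tracer.trace["k"])
# [(10.0, 'invalidation received'), (10.4, 'fill value=42 (stale)')]
```

The resulting trace answers the on‑call questions directly: it shows that the invalidation was received and which mutation followed it.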

Conclusion

Reliable monitoring and logging are essential for any distributed system. Meta’s Polaris system detects cache inconsistencies, alerts engineers, and provides enough tracked data to locate the root cause within minutes, demonstrating how ultra‑high cache consistency can be achieved at massive scale.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
