How Meta Achieves Near‑Perfect Cache Consistency: Lessons from Polaris

This article explains Meta's approach to cache invalidation and consistency. It covers why ultra-high consistency matters, how the Polaris monitoring system detects inconsistencies, and how a simplified Python example reproduces the kind of race that makes cache invalidation hard.

ITPUB

Introduction

Caching is a fundamental technique used throughout computer systems, from hardware caches to operating-system and web-browser caches. In large-scale backend services, caching reduces latency, absorbs massive workloads, and cuts costs, but it also introduces the problem of cache invalidation.

Meta has raised its cache-consistency level from six nines (99.9999 %) to ten nines (99.99999999 %), meaning fewer than one inconsistency per 10 billion cache writes.
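
As a rough illustration of what each level means in absolute terms, the sketch below converts a number of nines into an expected count of inconsistent writes. The daily write volume is an assumed round figure chosen only to make the arithmetic concrete; it is not a number reported for Meta's cache writes.

```python
from fractions import Fraction

def inconsistency_rate(nines: int) -> Fraction:
    """Fraction of writes allowed to be inconsistent at a given number of nines."""
    return Fraction(1, 10 ** nines)

def expected_inconsistencies(writes: int, nines: int) -> Fraction:
    """Expected inconsistent writes for a given volume (simple proportion)."""
    return writes * inconsistency_rate(nines)

# Hypothetical volume, used only to make the arithmetic concrete.
ASSUMED_WRITES_PER_DAY = 10 ** 13

six = expected_inconsistencies(ASSUMED_WRITES_PER_DAY, 6)
ten = expected_inconsistencies(ASSUMED_WRITES_PER_DAY, 10)
print(f"six nines: {six} per day, ten nines: {ten} per day")
# six nines: 10000000 per day, ten nines: 1000 per day
```

Under that assumed volume, moving from six to ten nines shrinks the expected number of inconsistent writes by a factor of 10,000.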

Cache Invalidation and Consistency

Because a cache is not the source of truth for the data it holds, any change to the source must actively invalidate the stale cache entries. Failing to do so leaves the cache inconsistent with the source.

A common approach is to let entries expire after a time-to-live (TTL). This article instead assumes that invalidation is triggered explicitly by a component outside the cache.
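
To make the two styles concrete, here is a minimal sketch of a cache that supports both TTL expiry and explicit invalidation. The class name, method names, and the injectable clock are assumptions made for illustration and do not come from Meta's implementation.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

class TTLCache:
    """Minimal TTL cache: entries expire a fixed number of seconds after being set."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[Any, float]] = {}

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (value, self._clock() + self._ttl)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: behave like a miss
            return None
        return value

    def invalidate(self, key: str) -> None:
        """Explicit invalidation, as an external component would trigger it."""
        self._entries.pop(key, None)
```

With TTL alone, a stale value can be served until the entry expires; explicit invalidation removes it as soon as the event is processed, which is why the ordering of those events matters in the scenario that follows.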

Example of an inconsistency scenario (timestamps 1 to 4):

1. The cache misses on x and tries to fetch its value from the database.

2. Before the response x=42 reaches the cache, the database is updated to x=43.

3. The database sends an invalidation event for x=43, which arrives before the x=42 response and sets the cache to 43.

4. The delayed x=42 response finally arrives and overwrites the cache with 42, leaving the cache inconsistent with the database.
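
The four timestamps above can be replayed with two dictionaries standing in for the database and the cache. This is only a single-threaded replay of the message ordering, not a model of a real cache; the variable names are illustrative.

```python
from typing import Dict

database: Dict[str, int] = {"x": 42}
cache: Dict[str, int] = {}

# Timestamp 1: the cache misses on x and reads 42; the response is still in flight.
in_flight_fill = database["x"]

# Timestamp 2: the database is updated before the fill response arrives.
database["x"] = 43

# Timestamp 3: the invalidation event for the new value reaches the cache first.
cache["x"] = 43

# Timestamp 4: the delayed fill response arrives and overwrites the newer value.
cache["x"] = in_flight_fill

print(cache["x"], database["x"])  # 42 43
```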

A common mitigation is to attach a version field to each entry and only apply updates with a newer version, preventing older events from overwriting newer data. This works for most workloads but can be insufficient at Meta’s scale.
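
A minimal sketch of that version check is shown below, assuming every fill and invalidation carries the version of the write it reflects. The `VersionedCache` class and its `apply` method are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Entry:
    value: int
    version: int

class VersionedCache:
    """Cache that rejects any update older than what it already holds."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def apply(self, key: str, value: int, version: int) -> bool:
        """Apply a fill or invalidation; return True if it was accepted."""
        current = self._entries.get(key)
        if current is not None and version <= current.version:
            return False  # stale or duplicate: keep the newer entry
        self._entries[key] = Entry(value, version)
        return True

    def get(self, key: str) -> Optional[int]:
        entry = self._entries.get(key)
        return None if entry is None else entry.value

cache = VersionedCache()
cache.apply("x", 43, version=2)             # invalidation for the newer write arrives first
accepted = cache.apply("x", 42, version=1)  # delayed fill carrying the older version
print(accepted, cache.get("x"))  # False 43
```

The check only helps when the value and its version travel together; if they are stored or filled separately, a stale value can end up tagged with a newer version.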

Why Consistency Matters at Scale

From a user-experience perspective, cache inconsistency can be as severe as data loss. For example, Instagram private-message routing relies on a mapping from each user to the storage in the region closest to them. If cache replicas in different regions hold divergent copies of that mapping, a message may be routed to a region that lacks the recipient's data, and the message is effectively lost.

Monitoring Cache Consistency

Accurate measurement is the first step. Alerts must be trustworthy; otherwise engineers will ignore them. Recording every cache state change is infeasible for workloads that process >10 trillion cache fills per day, so Meta’s solution records only mutations that could lead to inconsistencies.

Polaris System

Polaris acts as a client of the stateful service it monitors and assumes no knowledge of the service's internals. Its guiding principle is “the cache should eventually be consistent with the database.” Polaris receives invalidation events, queries all cache replicas, and flags any replica that violates that principle.
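
The check described above might look roughly like the following sketch. It assumes each invalidation event names a key and the version replicas should reach, and it models each replica as a plain dictionary; both are simplifications introduced here, not Polaris's actual interfaces.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class InvalidationEvent:
    key: str
    version: int  # version the replicas are expected to reach

# A replica is modeled as a plain dict: key -> (value, version). Hypothetical shape.
Replica = Dict[str, Tuple[int, int]]

def find_violations(event: InvalidationEvent, replicas: Dict[str, Replica]) -> List[str]:
    """Return the names of replicas still holding a version older than the event's."""
    stale = []
    for name, replica in replicas.items():
        cached = replica.get(event.key)
        if cached is None:
            continue  # not cached: nothing stale to serve
        _, cached_version = cached
        if cached_version < event.version:
            stale.append(name)
    return stale

replicas = {
    "region-a": {"x": (43, 5)},
    "region-b": {"x": (42, 4)},  # missed the invalidation
    "region-c": {},              # key not cached
}
print(find_violations(InvalidationEvent("x", 5), replicas))  # ['region-b']
```

Comparing versions alone would not catch an entry whose value and version disagree, so this is only the simplest possible form of the check.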

Polaris reports inconsistencies on multiple time scales (1 min, 5 min, 10 min). To avoid excessive load, it defers the expensive database query until an inconsistent sample has persisted past the reporting time scale, rather than querying the database for every suspected mismatch.
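
One way to keep database load down while reporting on several time scales is to postpone the expensive comparison until a suspect sample is old enough to matter for a reporting window. The sketch below takes that approach as an assumption; `DeferredChecker`, `observe`, `tick`, and the `still_inconsistent` callback are hypothetical names, and the callback stands in for the real cache-versus-database comparison.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

REPORT_TIMESCALES = (60, 300, 600)  # seconds: 1 min, 5 min, 10 min

@dataclass
class Sample:
    key: str
    first_seen: float  # when the mismatch was first observed
    reported: List[int] = field(default_factory=list)

class DeferredChecker:
    """Re-check suspect keys only when they cross a reporting timescale."""

    def __init__(self, still_inconsistent: Callable[[str], bool]):
        # still_inconsistent stands in for the expensive cache-vs-database comparison.
        self._still_inconsistent = still_inconsistent
        self._samples: Dict[str, Sample] = {}

    def observe(self, key: str, now: float) -> None:
        """Record a suspected mismatch without touching the database."""
        self._samples.setdefault(key, Sample(key, now))

    def tick(self, now: float) -> List[Tuple[str, int]]:
        """Return (key, timescale) pairs confirmed inconsistent at this tick."""
        confirmed = []
        for sample in list(self._samples.values()):
            age = now - sample.first_seen
            due = [t for t in REPORT_TIMESCALES if age >= t and t not in sample.reported]
            if not due:
                continue  # too young: no expensive check yet
            if self._still_inconsistent(sample.key):
                for t in due:
                    sample.reported.append(t)
                    confirmed.append((sample.key, t))
            else:
                del self._samples[sample.key]  # converged; stop tracking
        return confirmed
```

Samples that converge before the first time scale never trigger the expensive check at all, which is where the load saving would come from in this design.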

The system emits a metric of the form “N nines of cache writes are consistent within M minutes.” At the time of the referenced post, it reported 99.99999999 % of cache writes consistent within a five-minute window.
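
For a sense of how such a metric could be derived from raw counts, the sketch below computes the number of nines from a total and an inconsistent count within one window. The function names and the exact output wording are assumptions for illustration.

```python
import math

def nines_of_consistency(total_writes: int, inconsistent_writes: int) -> float:
    """Number of nines: 1 inconsistent write in 10**n gives n."""
    if total_writes <= 0:
        raise ValueError("total_writes must be positive")
    if inconsistent_writes < 0 or inconsistent_writes > total_writes:
        raise ValueError("inconsistent_writes must be between 0 and total_writes")
    if inconsistent_writes == 0:
        return math.inf  # no violations observed in this window
    return math.log10(total_writes) - math.log10(inconsistent_writes)

def format_metric(total: int, inconsistent: int, window_minutes: int) -> str:
    nines = nines_of_consistency(total, inconsistent)
    return f"{nines:.1f} nines of cache writes consistent within {window_minutes} minutes"

print(format_metric(10 ** 10, 1, 5))
# 10.0 nines of cache writes consistent within 5 minutes
```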

Example Implementation (Python)

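import threading
import time

# In-memory stand-ins for the cache and the database. The cache keeps the
# value and its version in separate maps, mirroring the two database tables.
cache_data = {}
cache_version = {}
meta_data_table = {"1": 42}
version_table = {"1": 4}

def read_value(key):
    # Serve from the cache when possible; on a miss, fall back to the database.
    value = read_value_from_cache(key)
    if value is not None:
        return value
    return meta_data_table[key]

def read_value_from_cache(key):
    if key in cache_data:
        return cache_data[key]
    # Cache miss: fill asynchronously and let the caller read the database.
    fill_cache_thread = threading.Thread(target=fill_cache, args=(key,))
    fill_cache_thread.start()
    return None

def fill_cache(key):
    # The two fills are not atomic, which is what opens the race window.
    fill_cache_metadata(key)
    fill_cache_version(key)

def fill_cache_metadata(key):
    meta_data = meta_data_table[key]
    print("Filling cache meta data for", meta_data)
    cache_data[key] = meta_data

def fill_cache_version(key):
    time.sleep(2)  # simulate a slow version fill; the write lands in between
    version = version_table[key]
    print("Filling cache version data for", version)
    cache_version[key] = version

def write_value(key, value):
    version = version_table.get(key, 0) + 1
    write_in_database_transactionally(key, value, version)
    time.sleep(3)  # simulate async propagation delay
    invalidate_cache(key, value, version)

def write_in_database_transactionally(key, data, version):
    meta_data_table[key] = data
    version_table[key] = version

def invalidate_cache(key, metadata, version):
    try:
        # Simulate a transient failure while applying the invalidation.
        raise RuntimeError("simulated invalidation failure")
    except Exception:
        # Error handler: fall back to dropping the entry.
        drop_cache(key, version)

def drop_cache(key, version):
    cache_version_value = cache_version.get(key, -1)
    # Bug being illustrated: the cached version is already 5 (filled after
    # the write), so 5 > 5 is False and the stale value 42 is never dropped.
    if version > cache_version_value:
        cache_data.pop(key, None)
        cache_version.pop(key, None)

# Example usage
read_thread = threading.Thread(target=read_value, args=("1",))
write_thread = threading.Thread(target=write_value, args=("1", 43))
read_thread.start()
time.sleep(0.1)  # let the fill read the old value 42 before the write lands
write_thread.start()
read_thread.join()
write_thread.join()  # returns after the invalidation, about 3 seconds later
# The cache still holds 42 (tagged with version 5) while the database holds 43.
print("cache:", cache_data, cache_version, "database:", meta_data_table)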

Consistency Tracking

When a Polaris alert fires, on‑call engineers examine the selective logs to locate the root cause. By logging only mutations that could cause inconsistencies, engineers can pinpoint bugs within ~30 minutes without overwhelming the system.
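
A sketch of what selective logging could look like is shown below. It assumes, purely for illustration, that a cache mutation is only worth recording if it happens within a short window after a write to the same key, since that is when a race can occur. The `SelectiveTracer` class, its methods, and the window rule are hypothetical and not taken from Meta's tracing system.

```python
from typing import Dict, List, Tuple

class SelectiveTracer:
    """Log cache mutations only for keys written within the last `window` seconds."""

    def __init__(self, window_seconds: float):
        self._window = window_seconds
        self._last_write: Dict[str, float] = {}
        self.log: List[Tuple[float, str, str]] = []  # (time, key, description)

    def on_write(self, key: str, now: float) -> None:
        """Record a database write; this opens the logging window for the key."""
        self._last_write[key] = now
        self.log.append((now, key, "write"))

    def on_cache_mutation(self, key: str, description: str, now: float) -> None:
        last = self._last_write.get(key)
        if last is None or now - last > self._window:
            return  # no recent write: a race is unlikely, so skip logging
        self.log.append((now, key, description))

tracer = SelectiveTracer(window_seconds=5.0)
tracer.on_cache_mutation("x", "fill", now=0.0)          # ignored: no recent write
tracer.on_write("x", now=1.0)
tracer.on_cache_mutation("x", "fill version", now=3.0)  # logged: inside the window
tracer.on_cache_mutation("x", "fill", now=10.0)         # ignored: window has closed
print([entry[2] for entry in tracer.log])  # ['write', 'fill version']
```

The log stays small because the vast majority of cache fills happen far from any write to the same key and are never recorded.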

Conclusion

Reliable monitoring and selective logging are essential for any distributed system to capture and quickly resolve cache‑related bugs. Meta’s Polaris system demonstrates how precise inconsistency detection, multi‑time‑scale alerts, and targeted consistency tracking can achieve near‑perfect cache consistency at massive scale.

Reference: https://engineering.fb.com/2022/06/08/core-infra/cache-made-consistent/
