How Netflix’s Write‑Ahead Log Powers a Resilient Data Platform at Massive Scale
Netflix built a distributed write‑ahead log (WAL) system to guarantee data consistency, durability, and high throughput across services, tackling challenges like data loss, cross‑store entropy, multi‑partition updates, replication, and reliable retries for real‑time pipelines.
Introduction
Operating at Netflix’s massive scale requires a data platform that can keep content and features consistent, reliable, and efficient for hundreds of millions of members. A core component of this platform is a distributed write‑ahead log (WAL) that abstracts persistence and delivery of data changes to downstream consumers.
Key Challenges
Unexpected data loss and corruption in databases.
System entropy when writing to multiple stores (e.g., Cassandra and Elasticsearch).
Updating many partitions (e.g., secondary indexes built on top of NoSQL stores).
In‑region and cross‑region data replication.
Reliable retry mechanisms for large‑scale real‑time pipelines.
Batch deletions that exhaust key‑value node memory.
These issues often caused production incidents, required custom one‑off solutions, and generated technical debt. In one incident, a single ALTER TABLE mistake corrupted data; recovery was rapid only because the team could extend cache TTLs and replay writes through Kafka.
WAL Overview
The WAL is a distributed system that captures data changes, provides strong durability guarantees, and reliably forwards those changes to downstream consumers. It supports use cases such as secondary indexing, cross‑region replication for non‑replicated stores, and delayed‑queue semantics.
API
The public API is intentionally simple, exposing only the necessary parameters. The primary endpoint is WriteToLog:
rpc WriteToLog (WriteToLogRequest) returns (WriteToLogResponse) { ... }

/** WAL request message */
message WriteToLogRequest {
string namespace = 1;
Lifecycle lifecycle = 2;
bytes payload = 3;
Target target = 4;
}
/** WAL response message */
message WriteToLogResponse {
Trilean durable = 1;
string message = 2;
}

A namespace defines where and how data is stored, allowing logical isolation and per‑namespace queue selection (Kafka, SQS, or a combination). Namespaces also hold configuration such as back‑off multipliers and max retry attempts.
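The response’s `Trilean durable` field is notable: besides a definite success or failure, a write can land in an unknown state (e.g. a client‑side timeout mid‑flight). A minimal sketch of how a caller might model this (the type names mirror the proto above, but the retry helper is purely illustrative):

```python
from dataclasses import dataclass
from enum import Enum


class Trilean(Enum):
    """Three-valued durability outcome: confirmed, rejected, or unknown
    (e.g. the request timed out before a response arrived)."""
    TRUE = 1
    FALSE = 2
    UNKNOWN = 3


@dataclass
class WriteToLogRequest:
    namespace: str   # selects storage/queue config and retry policy
    payload: bytes   # opaque change record; lifecycle/target omitted here


@dataclass
class WriteToLogResponse:
    durable: Trilean
    message: str


def should_retry(resp: WriteToLogResponse) -> bool:
    # FALSE means the WAL definitively rejected the write, so retry.
    # UNKNOWN also warrants a retry: since WAL delivery is at-least-once,
    # downstream consumers must tolerate duplicates anyway.
    return resp.durable is not Trilean.TRUE
```

Treating UNKNOWN as retryable is safe precisely because the system commits to at‑least‑once semantics rather than exactly‑once.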
Roles and Namespace Configurations
Role 1 – Delayed Queue
For a product‑data system that uses SQS, the namespace configuration enables delayed messages:
{
"persistenceConfigurations": {
"persistenceConfiguration": [
{
"physicalStorage": { "type": "SQS" },
"config": {
"wal-queue": ["dgwwal-dq-pds"],
"wal-dlq-queue": ["dgwwal-dlq-pds"],
"queue.poll-interval.secs": 10,
"queue.max-messages-per-poll": 100
}
}
]
}
}

Role 2 – General Cross‑Region Replication
For EVCache cross‑region replication the namespace uses Kafka as the underlying transport:
{
"persistence_configurations": {
"persistence_configuration": [
{
"physical_storage": { "type": "KAFKA" },
"config": {
"consumer_stack": "consumer",
"context": "cross region replication for evcache_foobar",
"target": {
"euwest1": "dgwwal.foobar.cluster.eu-west-1.netflix.net",
"useast1": "dgwwal.foobar.cluster.us-east-1.netflix.net",
"useast2": "dgwwal.foobar.cluster.us-east-2.netflix.net",
"uswest2": "dgwwal.foobar.cluster.us-west-2.netflix.net"
},
"wal-kafka-topics": ["evcache_foobar"],
"wal.kafka.bootstrap.servers.prefix": "kafka-foobar"
}
}
]
}
}

Role 3 – Multi‑Partition Updates
For a key‑value service that needs multi‑ID or multi‑table mutations, the namespace includes both Kafka and durable storage to support a two‑phase commit:
{
"persistence_configurations": {
"persistence_configuration": [
{
"physical_storage": { "type": "KAFKA" },
"config": {
"consumer_stack": "consumer",
"context": "WAL to support multi-id/namespace mutations for dgwkv.foobar",
"durable_storage": {
"namespace": "foobar_wal_type",
"shard": "walfoobar",
"type": "kv"
},
"wal-kafka-topics": ["foobar_kv_multi_id"],
"wal.kafka.bootstrap.servers.prefix": "kaas_kafka-dgwwal_foobar7102"
}
}
]
}
}

All WAL requests carry at‑least‑once delivery semantics, a consequence of the retry‑based underlying implementation.
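At‑least‑once delivery means a consumer may see the same record more than once, so downstream appliers are typically made idempotent. A minimal sketch of the idea (illustrative only, not Netflix’s code), deduplicating on a per‑record sequence number:

```python
class IdempotentApplier:
    """Applies WAL records at most once by tracking seen sequence numbers.
    A production consumer would persist this state; here it is in memory."""

    def __init__(self):
        self.applied = {}   # seq -> payload, standing in for the real mutation
        self._seen = set()

    def apply(self, seq: int, payload: str) -> bool:
        if seq in self._seen:
            return False            # duplicate delivery: skip silently
        self._seen.add(seq)
        self.applied[seq] = payload  # the actual store mutation would go here
        return True
```

With this pattern, redelivery after a retry or consumer restart is harmless: the second apply is a no‑op.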
Underlying Principles
The architecture consists of several cooperating components:
Producer‑Consumer Separation: Producers ingest client messages into a queue; consumers read from the queue and deliver to the target store (Cassandra, Memcached, Kafka, etc.). The control plane makes the producer/consumer model pluggable.
Default Dead‑Letter Queues: Each namespace’s queue has an associated DLQ to handle transient and hard errors.
Flexible Targeting: Messages can be routed to any target (database, cache, another queue, or upstream service) based on namespace configuration.
These principles enable a plug‑in architecture where the underlying queue technology can be swapped without code changes.
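The pluggability described above can be sketched as a small transport interface with a config‑driven factory (all names here are hypothetical; real implementations would wrap Kafka or SQS clients):

```python
from typing import Optional, Protocol


class WalQueue(Protocol):
    """Transport abstraction: the control plane picks an implementation per
    namespace, so producers and consumers never hard-code Kafka or SQS."""

    def enqueue(self, payload: bytes) -> None: ...
    def dequeue(self) -> Optional[bytes]: ...


class InMemoryQueue:
    """Stand-in implementation used here so the sketch is self-contained."""

    def __init__(self):
        self._items: list = []

    def enqueue(self, payload: bytes) -> None:
        self._items.append(payload)

    def dequeue(self) -> Optional[bytes]:
        return self._items.pop(0) if self._items else None


def make_queue(namespace_config: dict) -> WalQueue:
    # Hypothetical factory keyed on the namespace's physical_storage type,
    # mirroring the JSON configurations shown earlier.
    backends = {"MEMORY": InMemoryQueue}
    return backends[namespace_config["physical_storage"]["type"]]()
```

Swapping SQS for Kafka then becomes a one‑line change in the namespace configuration, not a code change in producers or consumers.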
Deployment Model
WAL runs on Netflix’s Data‑Gateway infrastructure, inheriting mTLS, connection management, authentication, and runtime configuration. Each WAL instance is deployed as a shard; a shard is a set of physical machines. Different services (e.g., ad‑event service, game‑directory service) get dedicated shards to avoid “noisy‑neighbor” interference.
Shards can host multiple namespaces, and operators can add more queues to a namespace if throughput becomes a bottleneck. The configuration is stored in a globally replicated relational database for consistency.
Auto‑scaling groups adjust producer and consumer instance counts based on CPU and network thresholds, using Netflix’s adaptive load‑shedding library and Envoy for request throttling. WAL can be deployed across multiple regions, each with its own instance group.
Key Use Cases
Delayed Queue
Applications that need to execute a request after a delay write it to WAL, which guarantees delivery once the specified delay elapses. Example: Netflix Live Origin offloads bulk delete requests to WAL with random jitter, smoothing the request curve and preventing Cassandra overload.
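The jitter idea can be sketched as follows (a minimal illustration with hypothetical parameter names; the real delay mechanism rides on the queue, e.g. SQS delayed messages):

```python
import random


def schedule_bulk_deletes(keys, base_delay_s=0.0, max_jitter_s=3600.0,
                          rng=random.random):
    """Assign each delete a randomized delivery delay so a burst of deletes
    is spread across a time window instead of hitting the store at once.
    Returns (key, delay_seconds) pairs ready to enqueue via WAL."""
    return [(key, base_delay_s + rng() * max_jitter_s) for key in keys]
```

Spreading N deletes uniformly over an hour turns a spike of N concurrent requests into a roughly constant trickle of N/3600 per second.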
Cross‑Region Replication
WAL enables global replication for services like EVCache. A client writes to its local WAL shard; the WAL consumer reads from Kafka and forwards the change to target regions, where writer groups apply the mutation to regional EVCache clusters.
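The fan‑out step can be sketched as a simple loop over the namespace’s region‑to‑endpoint map (as in the Role 2 configuration above); the `send` callable stands in for the real transport, and all names are illustrative:

```python
def replicate(record, local_region, targets, send):
    """Forward one WAL record to every target region except the origin.
    `targets` mirrors the namespace's region -> endpoint map; `send` is
    the transport callable (gRPC/HTTP in practice)."""
    delivered = []
    for region, endpoint in sorted(targets.items()):
        if region == local_region:
            continue  # the local write already happened before WAL ingestion
        send(endpoint, record)
        delivered.append(region)
    return delivered
```

Because the consumer reads from Kafka and retries failed sends, each target region eventually receives the mutation even if it was briefly unreachable.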
Multi‑Table Mutations
Key‑Value services use WAL to implement a MutateItems API that supports atomic multi‑table, multi‑ID changes via a two‑phase commit. The API definition:
message MutateItemsRequest {
repeated MutationRequest mutations = 1;
message MutationRequest {
oneof mutation {
PutItemsRequest put = 1;
DeleteItemsRequest delete = 2;
}
}
}

WAL guarantees eventual consistency: each operation is assigned a sequence number, persisted, and later replayed by consumers. Failed mutations are re‑queued for retry.
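The persist‑then‑replay idea can be sketched as follows (an illustrative model, not the production algorithm): a mutation stays in the log until its apply succeeds, so a failure only delays it rather than losing it.

```python
from collections import deque


class MutationLog:
    """Persist-then-apply: each mutation gets a sequence number and remains
    pending until the apply succeeds, so crashes and transient failures can
    be retried without losing the change."""

    def __init__(self, apply_fn):
        self.apply_fn = apply_fn
        self.next_seq = 0
        self.pending = deque()   # durably persisted in the real system
        self.applied = []        # sequence numbers applied so far

    def write(self, mutation):
        self.pending.append((self.next_seq, mutation))
        self.next_seq += 1

    def drain(self):
        retries = deque()
        while self.pending:
            seq, mutation = self.pending.popleft()
            try:
                self.apply_fn(mutation)
                self.applied.append(seq)
            except Exception:
                retries.append((seq, mutation))  # re-queue for a later pass
        self.pending = retries
```

A multi‑table mutation becomes a batch of such records; once every record in the batch is applied, all tables converge, which is exactly the eventual‑consistency guarantee described above.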
Conclusions
Building a generic WAL taught Netflix several lessons:
Pluggable architecture is essential; configuration‑driven targets avoid code changes.
Leverage existing control‑plane and key‑value abstractions to focus on WAL‑specific challenges.
Separate concerns (producer vs. consumer) to enable independent scaling and reduce blast radius.
Expect failures: WAL itself can experience traffic spikes, slow consumers, and non‑transient errors, requiring careful trade‑offs and back‑pressure strategies.
Future Work
Planned extensions include adding secondary indexes to the key‑value service and using WAL to fan‑out requests to multiple data stores (e.g., database plus backup or queue).