
The Four Phases of Netflix’s Trillion‑Scale Real‑Time Data Infrastructure

This article chronicles Netflix’s evolution from a failing batch pipeline to a cloud‑native, multi‑tenant streaming platform across four phases, detailing the motivations, challenges, strategic bets, and patterns that enabled the company to scale real‑time data processing to trillions of events per day.

Architects Research Society

Xu Zhenzhong, a founding engineer of Netflix’s real‑time data infrastructure team, reflects on his experience building a streaming‑first platform that grew from zero to over 2,000 use cases. Along the way the team delivered four products: Keystone, Mantis, a managed Flink platform, and a managed Kafka service.

The journey is divided into four phases:

Phase 1 – Rescue Netflix logs from a failing batch pipeline (2015)

During rapid growth, the existing Chukwa/Hadoop/Hive batch pipeline could not keep up as log volume surged to roughly 500 billion events per day. The team built Keystone, a stream‑first architecture, to move logs from the edge to the data warehouse, reducing latency for analytics and operations.

Key challenges included a six‑month deadline, limited resources, and an immature streaming ecosystem (Kafka, Samza, Flink). Strategic bets focused on building an MVP for a few high‑priority internal customers, partnering with external streaming experts, and separating concerns between producers and consumers.

Phase 2 – Scale to hundreds of data‑movement use cases (2016)

After the Keystone MVP, demand for data movement grew. The team built a self‑serve, fully‑managed platform supporting 100+ use cases, emphasizing simplicity, automation, and multi‑tenant isolation.

Challenges included increasing operational burden, diverse customer requirements, and frequent breakages of dependent services. Strategic bets emphasized simplicity over exposing infrastructure complexity, investing in a multi‑tenant control plane, and accelerating DevOps practices.

+-----------------------------------------------------------------+
| Pattern               | Product  | Example Use Cases            |
|-----------------------|----------|------------------------------|
| Data Routing          | Keystone | Logging, Data Movement       |
|                       |          | (+ At scale)                 |
| RT Data Sampling/     | Mantis   | Cost-effective RT Insights   |
| Discovery             |          |                              |
| RT Alerts / Dashboard | Mantis,  | SPS Alert,                   |
|                       | Kafka    | + Infrastructure Health      |
|                       |          | Monitoring (Cassandra &      |
|                       |          | Elasticsearch),              |
|                       |          | + RT QoE monitoring          |
+-----------------------------------------------------------------+
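The Data Routing pattern in the table can be illustrated with a minimal sketch. It assumes a route is just a declarative record (source stream, sink, optional filter), so that users describe what should move rather than how. The names `Route`, `route_events`, `keep_all`, and the stream and sink names are hypothetical and are not Keystone's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, object]


def keep_all(event: Event) -> bool:
    """Default filter: forward every event."""
    return True


@dataclass
class Route:
    """One declarative route (hypothetical shape): source stream, sink, optional filter."""
    source: str
    sink: str
    predicate: Callable[[Event], bool] = keep_all


def route_events(stream: str, events: List[Event], routes: List[Route]) -> Dict[str, List[Event]]:
    """Fan events from one source stream out to every sink whose route matches."""
    delivered: Dict[str, List[Event]] = {}
    for route in routes:
        if route.source != stream:
            continue  # this route reads from a different stream
        matched = [event for event in events if route.predicate(event)]
        delivered.setdefault(route.sink, []).extend(matched)
    return delivered


# Usage: one log stream feeds two sinks, one of them filtered.
routes = [
    Route(source="playback_logs", sink="warehouse"),
    Route(source="playback_logs", sink="search_index",
          predicate=lambda event: event.get("level") == "error"),
]
events = [{"id": 1, "level": "info"}, {"id": 2, "level": "error"}]
delivered = route_events("playback_logs", events, routes)
```

Keeping the route as plain data is what lets a control plane validate, deploy, and isolate each route independently, which is the kind of simplicity this phase emphasizes.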

Phase 3 – Support custom workloads and exceed 1,000 use cases (2017‑2019)

Netflix’s platform needed to handle custom streaming jobs such as real‑time recommendation label computation and large‑scale joins, requiring advanced windowing, state management, and observability.
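To make the label-computation join concrete, here is a minimal in-memory sketch of a time-bounded stream-to-stream join. It assumes two hypothetical streams, impressions and plays, and labels an impression as positive when a play for the same key arrives within a fixed window after it. A real streaming engine would hold this per-key state incrementally and expire it as event time advances; the function name `interval_join` and the tuple layout are illustrative assumptions only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

KeyedEvent = Tuple[str, int]  # (key, event timestamp in seconds); hypothetical layout


def interval_join(
    impressions: List[KeyedEvent],
    plays: List[KeyedEvent],
    window_seconds: int,
) -> List[Tuple[str, int, int]]:
    """Return (key, impression_ts, label) with label 1 if a play for the same
    key occurs in [impression_ts, impression_ts + window_seconds], else 0."""
    # Index the plays by key; this stands in for per-key state in a stream processor.
    plays_by_key: Dict[str, List[int]] = defaultdict(list)
    for key, ts in plays:
        plays_by_key[key].append(ts)

    labels = []
    for key, ts in impressions:
        matched = any(ts <= play_ts <= ts + window_seconds
                      for play_ts in plays_by_key.get(key, []))
        labels.append((key, ts, 1 if matched else 0))
    return labels


# Usage: only the first impression of "a" is followed by a play within 60 seconds.
labels = interval_join(
    impressions=[("a", 100), ("b", 100), ("a", 500)],
    plays=[("a", 130), ("b", 400)],
    window_seconds=60,
)
```

The window bound is what keeps state finite: once event time has passed `impression_ts + window_seconds`, the impression can be emitted with a negative label and dropped from state.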

Challenges involved balancing flexibility with simplicity, increasing operational complexity, and convincing teams to migrate from local solutions to the central platform.

Strategic bets included building a new Flink‑based entry point, focusing first on streaming ETL and observability use cases, and gradually sharing operational responsibility with customers.

+-----------------------------------------------------------------------+
| Pattern                  | Product  | Example Use Cases               |
|--------------------------|----------|---------------------------------|
| Stream-to-stream joins   | Flink    | Take‑fraction computation,      |
| (ETL)                    |          | Recsys label computation        |
| Stream-to-table joins    | Flink    | Side input: join streams with   |
| (ETL)                    |          | slow‑moving Iceberg table       |
| Streaming sessionization | Flink    | Personalization sessionization, |
| (ETL)                    |          | Metrics sessionization          |
| RT Observability         | Mantis   | Distributed tracing,            |
|                          |          | Chaos experiment monitoring,    |
|                          |          | Application monitoring          |
+-----------------------------------------------------------------------+
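The sessionization pattern in the table can be sketched in a few lines. This version assumes a gap-based definition: consecutive events from the same user belong to one session as long as they are no more than `gap_seconds` apart. The function name and event layout are hypothetical, and the sketch sorts a finite batch, whereas a streaming job would merge sessions incrementally as events arrive.

```python
from typing import Dict, List, Tuple


def sessionize(events: List[Tuple[str, int]], gap_seconds: int) -> Dict[str, List[List[int]]]:
    """Group (user, timestamp) events into per-user sessions.

    A new session starts whenever the gap since the user's previous
    event exceeds gap_seconds.
    """
    sessions: Dict[str, List[List[int]]] = {}
    # Sort by user, then by time, so each user's events are seen in order.
    for user, ts in sorted(events, key=lambda event: (event[0], event[1])):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap_seconds:
            user_sessions[-1].append(ts)  # still inside the current session
        else:
            user_sessions.append([ts])    # gap exceeded or first event: open a new session
    return sessions


# Usage: with a 30-second gap, user u1 has two sessions and u2 has one.
result = sessionize([("u1", 0), ("u1", 20), ("u1", 100), ("u2", 5)], gap_seconds=30)
```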

Phase 4 – Expand streaming responsibilities (2020‑present)

With streaming now pervasive across Netflix, new challenges arise: coordinating diverse data technologies, steep learning curves, under‑utilized ML pipelines, and scaling the central platform model.

Opportunities include stream‑to‑stream data integration, higher‑level abstractions (e.g., streaming SQL), and tighter support for machine‑learning workflows.

+---------------------------------------------------------------------+
| Pattern               | Product   | Example Use Cases               |
|-----------------------|-----------|---------------------------------|
| Streaming Backfill /  | Flink     | Pipeline failure mitigation,    |
| Restatement           |           | Avoid cold start                |
| Data Quality Control  | Keystone, | Schema evolution management,    |
|                       | Flink     | Data Quality SLA,               |
|                       |           | Cost reduction via Avro         |
| Source/Sink Agnostic  | Keystone, | Delta, Data Mesh,               |
| Data Synchronization  | Flink     | Operational reporting           |
| Near‑real‑time (NRT)  | Flink     | Customer service recommendation |
| Streaming SQL         | Flink     | Dynamic feature engineering     |
+---------------------------------------------------------------------+

The author concludes by inviting feedback and noting that many technical details are omitted. He also announces a new venture that is building a streaming‑first ML platform and is seeking a founding infrastructure engineer.
