Design Analysis of Netflix’s Cloud‑Based Microservices Architecture
The article examines Netflix’s migration to AWS and its micro‑service‑based cloud architecture. It details the client, backend, and CDN components; the design goals of high availability, low latency, scalability, and resilience; and how these goals are met through services such as EC2, S3, DynamoDB, Cassandra, Zuul, Hystrix, and Open Connect.
1 Overview
For years Netflix has been the world’s best‑rated subscription video‑streaming service, consuming more than 15% of global Internet bandwidth. In 2019 it had 167 million subscribers, adding about 5 million each quarter and serving more than 200 countries and regions.
Netflix subscribers watch more than 165 million hours of video daily across 4,000 movies and 47,000 TV episodes, illustrating the high availability and scalability of its video‑streaming system.
The company spent more than eight years building the current IT system. The transformation began in August 2008 after a three‑day outage of its DVD‑rental service, prompting a move from a single‑point‑of‑failure data center to a fault‑tolerant public‑cloud architecture based on micro‑services.
Netflix chose AWS because it provides reliable databases, massive cloud storage, and a global network of data centers, allowing Netflix to focus on delivering high‑quality streaming rather than managing its own data centers.
Netflix was also one of the early adopters of micro‑services, which break monolithic applications into small, independently deployable components, improving scalability, deployment speed, and fault isolation.
2 Architecture
Netflix runs on Amazon Web Services (AWS) together with its own content‑delivery network, Open Connect. The three major parts of the software architecture are the client, the backend, and the CDN.
Client: Supported web browsers on laptops and desktops, and native iOS/Android apps on phones and tablets, plus apps on smart TVs and streaming devices. Netflix provides an SDK that controls playback, adapts to network conditions, and selects the optimal Open Connect Appliance (OCA) server.
Backend: Fully hosted on AWS; includes compute (EC2), storage (S3), business‑logic micro‑services, distributed databases (DynamoDB, Cassandra), big‑data processing (EMR, Hadoop, Spark, Flink, Kafka), and video‑transcoding tools.
Open Connect CDN: A network of OCAs deployed inside ISPs and Internet exchange points (IXPs) worldwide, optimized for storing and streaming large video files.
2.1 Playback Architecture
When a user clicks Play, the client works with backend services on AWS and with Open Connect to start streaming:
1. OCAs constantly report their health and stored content to a cache‑control service running in AWS.
2. The playback request is handled by the playback service, which validates the user’s subscription and the title’s licensing.
3. A steering service selects the best OCAs for the client, based on the client’s IP address and ISP, and returns their addresses.
4. The client probes the returned OCA list and picks the best‑performing OCA.
5. The selected OCA streams the video to the client.
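The steering and probing steps above can be sketched as a toy ranking function. The ISP‑match and load fields here are simplified stand‑ins for the signals the real steering service weighs (BGP data, OCA health reports, content availability); all names are illustrative.

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch of OCA steering: rank candidate OCAs so the
// client can probe them in order of expected quality.
public class OcaSteering {
    static class Oca {
        final String host; final String isp; final double load;
        Oca(String host, String isp, double load) {
            this.host = host; this.isp = isp; this.load = load;
        }
    }

    // Prefer OCAs inside the client's own ISP, then the least-loaded ones.
    static List<String> rank(List<Oca> candidates, String clientIsp) {
        return candidates.stream()
                .sorted(Comparator
                        .comparing((Oca o) -> !o.isp.equals(clientIsp)) // same-ISP first
                        .thenComparingDouble(o -> o.load))              // then lowest load
                .map(o -> o.host)
                .collect(Collectors.toList());
    }
}
```

The client would then probe the hosts in the returned order and connect to the first one that responds well.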
2.2 Backend Architecture
The backend processes registration, login, billing, transcoding, and recommendation workloads using a micro‑service architecture. A typical request path is:
1. A client request reaches the AWS Elastic Load Balancer (ELB).
2. The ELB forwards the request to the API‑gateway service (Zuul), which performs routing, filtering, and security checks.
3. Zuul routes the request to the appropriate Application API (e.g., the Playback API).
4. The Application API invokes one or more micro‑services.
5. Micro‑service calls are isolated by Hystrix commands, and results may be cached in EVCache.
6. Micro‑services read and write data in storage services (Cassandra, S3, etc.).
7. Events are sent to the stream‑processing pipeline for real‑time personalization or batch analytics.
8. Processed data are persisted to storage such as S3, Hadoop HDFS, or Cassandra.
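The Hystrix‑style isolation used in the request path above can be hand‑rolled in miniature: run a dependency call with a timeout and fall back to a cached or default value when the call fails or is too slow. This is a sketch of the pattern, not the Hystrix API; real Hystrix adds circuit breaking, thread‑pool isolation, and metrics on top.

```java
import java.util.concurrent.*;
import java.util.function.Supplier;

// Minimal sketch of dependency isolation with a timeout and a fallback.
public class IsolatedCall {
    private static final ExecutorService pool =
            Executors.newCachedThreadPool(r -> {
                Thread t = new Thread(r);
                t.setDaemon(true);   // do not keep the JVM alive
                return t;
            });

    static <T> T execute(Supplier<T> primary, T fallback, long timeoutMs) {
        Future<T> f = pool.submit(primary::get);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {      // dependency failure or timeout
            f.cancel(true);
            return fallback;         // serve stale/cached data instead of an error
        }
    }
}
```

Serving a slightly stale fallback keeps the request path responsive even when a downstream micro‑service is degraded.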
API Gateway (Zuul): Handles inbound, endpoint, and outbound filters; integrates with Eureka for service discovery; and supports traffic routing, load testing, and dynamic endpoint selection.
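As a toy illustration of the filter idea (not Zuul’s actual API): an inbound filter enforces a security check, then a routing filter maps the request path to a backend. The route table, header name, and service names are hypothetical.

```java
import java.util.*;

// Hypothetical sketch of gateway-style inbound + routing filters.
public class GatewaySketch {
    static class Request {
        final String path; final Map<String, String> headers;
        Request(String path, Map<String, String> headers) {
            this.path = path; this.headers = headers;
        }
    }

    // Illustrative route table: path prefix -> backend service name.
    static final Map<String, String> routes = Map.of(
            "/playback", "playback-api",
            "/signup", "signup-api");

    static String route(Request req) {
        if (!req.headers.containsKey("Authorization"))
            return "rejected";                       // inbound security filter
        return routes.entrySet().stream()            // endpoint (routing) filter
                .filter(e -> req.path.startsWith(e.getKey()))
                .map(Map.Entry::getValue)
                .findFirst()
                .orElse("not-found");
    }
}
```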
Application API: Orchestrates calls to the underlying micro‑services, uses gRPC/HTTP2 for efficient communication, and employs Hystrix for resilience. Requests are processed with a mix of synchronous execution and asynchronous I/O.
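A minimal sketch of that mixed synchronous/asynchronous style, using `CompletableFuture` to fan out to two placeholder services and join the results before building a response. The service names and payloads are invented, not Netflix’s actual APIs.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical orchestration: fan out to two dependencies concurrently,
// then join both results to assemble the playback response.
public class PlaybackOrchestration {
    static CompletableFuture<String> callService(String name) {
        // Stand-in for a non-blocking gRPC/HTTP2 call.
        return CompletableFuture.supplyAsync(() -> name + ":ok");
    }

    static String handlePlay(String title) {
        CompletableFuture<String> license = callService("license");
        CompletableFuture<String> steering = callService("steering");
        // Both calls run concurrently; join() waits for each result.
        return title + " -> " + license.join() + "," + steering.join();
    }
}
```

Fanning out concurrently means total latency is bounded by the slowest dependency rather than the sum of all of them.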
Micro‑services: Small, independently deployable services that communicate via REST or gRPC and may have their own data stores and in‑memory caches (e.g., EVCache).
Data Stores: Netflix uses a mix of SQL (MySQL for title management and billing) and NoSQL systems (Cassandra for read‑heavy workloads, Elasticsearch for search, Hadoop for big‑data processing).
Stream Processing Pipeline: Handles trillions of events daily, uses Kafka for event routing, and provides a serverless Platform‑as‑a‑Service (SPaaS) on which engineers build custom stream‑processing applications.
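A simplified, in‑memory stand‑in for the routing idea: producers publish events to named topics and independent consumers (personalization, analytics) receive them. The real pipeline uses Kafka’s durable, partitioned logs rather than in‑process callbacks; this sketch only shows the shape of the decoupling.

```java
import java.util.*;
import java.util.function.Consumer;

// Toy topic-based event router: publishers and subscribers are decoupled
// by topic name, so new consumers can be added without changing producers.
public class EventRouter {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    void publish(String topic, String event) {
        subscribers.getOrDefault(topic, List.of()).forEach(h -> h.accept(event));
    }
}
```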
2.3 Open Connect
Open Connect is Netflix’s global CDN that stores video files on OCAs located inside ISP and IXP networks. OCAs report health and content metrics to a control‑plane service in AWS, which then steers clients to the optimal OCA. The control plane also manages cache‑fill, peer‑fill, and tier‑fill processes to propagate new video files across the OCA network.
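The fill hierarchy can be illustrated with a toy source‑selection function: an OCA prefers fetching a new file from a nearby peer (peer‑fill), then from an OCA in a higher tier (tier‑fill), and only falls back to the origin in S3 (cache‑fill). The names and the two‑set model are hypothetical simplifications.

```java
import java.util.Set;

// Hypothetical sketch of choosing where an OCA fills a new file from.
public class FillSource {
    static String pick(String file, Set<String> peerFiles, Set<String> tierFiles) {
        if (peerFiles.contains(file)) return "peer-fill";   // nearby OCA has it
        if (tierFiles.contains(file)) return "tier-fill";   // higher-tier OCA has it
        return "cache-fill";                                // fetch from S3 origin
    }
}
```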
3 Design Goals
Ensure high availability of streaming services worldwide.
Provide resilience against network failures and system outages.
Minimize streaming latency under diverse network conditions.
Support massive request volumes with horizontal scalability.
3.1 High Availability
Availability is measured by the proportion of successful responses over time. Netflix’s streaming availability depends on both backend services and OCA server health. Load balancers, Hystrix‑protected APIs, caching, and redundant OCA deployments all contribute to high availability.
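Availability measured this way is simply a ratio; a small helper makes the definition concrete. The target value shown is illustrative, not Netflix’s actual SLO.

```java
// Availability as described above: the fraction of requests answered
// successfully over a time window.
public class Availability {
    static double ratio(long successful, long total) {
        if (total == 0) return 1.0;              // no traffic: treat as available
        return (double) successful / total;
    }

    static boolean meetsTarget(long successful, long total, double target) {
        return ratio(successful, total) >= target;   // e.g. target = 0.9999
    }
}
```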
3.2 Low Latency
Latency is driven by how quickly the Playback API can produce a healthy OCA list and how fast the client can connect to the selected OCA. Hystrix timeouts, cache fall‑backs, and client‑side OCA probing keep latency within acceptable bounds.
4 Trade‑offs
Consistency vs. low latency (using EVCache or eventually consistent Cassandra data).
Consistency vs. high availability (accepting slightly stale data to meet latency targets).
Scalability vs. performance (adding more instances may not linearly improve performance, mitigated by AWS Auto Scaling and the Titus container platform).
5 Resilience
Since moving to AWS, Netflix has built self‑healing mechanisms to survive failures such as service‑dependency resolution errors, micro‑service crashes, overload, or OCA connectivity issues. Zuul provides adaptive retries and concurrency limits, while Hystrix isolates failures. Netflix also practices chaos engineering by injecting random faults into production to validate detection and recovery tooling.
6 Scalability
AWS Auto Scaling automatically adds or removes EC2 instances based on load. Netflix runs millions of containers on its open‑source Titus platform, which can span multiple regions. Parallel execution in network event loops, wide‑column stores such as Cassandra, and search stores such as Elasticsearch further enhance scalability.
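The horizontal‑scaling rule can be sketched as a target‑tracking calculation, similar in spirit to AWS Auto Scaling’s target‑tracking policies: grow or shrink the instance count so per‑instance load stays near a target utilization. The capacity and utilization numbers are invented.

```java
// Illustrative target-tracking scaling rule: size the fleet so each
// instance runs at roughly the target utilization, within [min, max].
public class ScalingPolicy {
    static int desiredInstances(double totalLoad, double perInstanceCapacity,
                                double targetUtilization, int min, int max) {
        int needed = (int) Math.ceil(totalLoad / (perInstanceCapacity * targetUtilization));
        return Math.max(min, Math.min(max, needed));
    }
}
```

Clamping to a minimum keeps headroom for sudden spikes; clamping to a maximum bounds cost.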
7 Conclusion
The article presents a comprehensive view of Netflix’s cloud‑based streaming architecture and analyzes how its design meets the goals of high availability, low latency, scalability, and resilience to network or system failures. The architecture, validated in production, demonstrates how deep integration with AWS services and a robust micro‑service ecosystem can support millions of subscribers worldwide.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.