Netflix Cosmos: A Cloud‑Native Platform Combining Microservices, Workflows, and Serverless
Netflix’s Cosmos platform, described publicly in 2021, combines microservice principles with asynchronous workflows and serverless functions to run resource‑intensive media processing. It offers observability, modularity, productivity tools, and a managed delivery pipeline, and it supports services that consume hundreds of thousands of CPUs at a time.
1 Introduction
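Cosmos is a compute platform that combines the best traits of microservices with asynchronous workflows and serverless functions. It targets resource‑intensive algorithms coordinated by workflows that run anywhere from minutes to years, and it supports both high‑throughput services that consume hundreds of thousands of CPUs at a time and latency‑sensitive workloads where a person is waiting for the result.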
2 Background
Netflix’s Media Cloud Engineering team operated a legacy system (Reloaded) that grew from a small, single‑purpose service into a monolith handling many use cases, making feature delivery slow and operations cumbersome.
To address these problems they built Cosmos, a workflow‑driven, media‑centric microservice platform that provides observability, modularity, developer productivity tools, and a fully managed continuous delivery pipeline.
3 Overview
Cosmos services are not pure microservices, but they share many of the same characteristics: a strong contract, isolated data and binaries, and automatic scaling. On top of these, Cosmos services add multi‑step workflows and compute‑intensive, asynchronous serverless functions.
4 Separation of Concerns
Cosmos separates concerns along two axes: API ↔ workflow ↔ serverless functions, and platform ↔ application. The platform API offers media‑specific abstractions while hiding the details of distributed computing from application developers. Each layer is implemented by its own subsystem:
Optimus – API layer mapping external requests to internal business models.
Plato – workflow layer for business‑rule modeling.
Stratum – serverless layer for stateless, compute‑intensive functions.
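The layering above can be sketched in a few lines of Python. This is only an illustration of the separation of concerns, not Cosmos code: `EncodeJob`, `encode_chunk`, `encode_workflow`, and `handle_request` are hypothetical names standing in for an internal business model, a Stratum‑style function, a Plato‑style workflow, and an Optimus‑style API mapping.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EncodeJob:
    """Internal business model produced by the API layer (hypothetical)."""
    source: str
    chunks: int


def encode_chunk(source: str, index: int) -> str:
    """Serverless-style function: stateless, depends only on its inputs."""
    return f"{source}#chunk{index}.encoded"


def encode_workflow(job: EncodeJob) -> List[str]:
    """Workflow layer: decides which functions run and in what order."""
    return [encode_chunk(job.source, i) for i in range(job.chunks)]


def handle_request(payload: dict) -> List[str]:
    """API layer: maps an external request onto the internal model."""
    job = EncodeJob(source=payload["source"], chunks=int(payload.get("chunks", 1)))
    return encode_workflow(job)


result = handle_request({"source": "movie.mov", "chunks": 2})
```

Because each layer only talks to the one below it, the request format, the orchestration logic, and the compute function can change independently.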
5 Cosmos Service Request Flow
A typical request (e.g., video encoding) starts with an API call, fans out to a set of parallel encoding functions, then runs assembly and indexing functions, and completes after several minutes. The Nirvana monitoring UI displays the request as a trace of these steps.
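The fan‑out and fan‑in shape of such a request can be sketched as follows. This is a toy model under stated assumptions: the `split` step and all function bodies are placeholders invented for illustration, and a local thread pool stands in for the platform's distributed function execution.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def split(video: str, n: int) -> List[str]:
    """Cut the source into n chunks (assumed step, not named in the text)."""
    return [f"{video}:{i}" for i in range(n)]


def encode(chunk: str) -> str:
    """Placeholder for a compute-intensive encode of one chunk."""
    return chunk.upper()


def assemble(encoded: List[str]) -> str:
    """Join encoded chunks back into a single asset."""
    return "|".join(encoded)


def index(asset: str) -> Dict[str, object]:
    """Produce a small index record for the assembled asset."""
    return {"asset": asset, "chunks": asset.count("|") + 1}


def run_request(video: str, n: int) -> Dict[str, object]:
    chunks = split(video, n)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map returns results in input order, so assembly stays ordered.
        encoded = list(pool.map(encode, chunks))
    return index(assemble(encoded))


summary = run_request("clip", 3)
```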
6 Service Layering
Cosmos enables modular service decomposition, allowing teams to own APIs and release cycles. High‑level services (e.g., Tapas, Sagan) orchestrate lower‑level services and many Stratum function calls.
7 Workflow Rules
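Plato provides a forward‑chaining rule engine in which developers write rules in Emirax (a Groovy‑based DSL). Each rule has four sections: match (the conditions that trigger the rule), action (the code to run, typically where Stratum functions are invoked), and reaction and error (the code to run when the action succeeds or fails, respectively). Together, the rules orchestrate serverless functions automatically.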
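To make the rule structure concrete, here is a minimal rule engine in Python. It is not Emirax and does not reflect Plato's actual API; the `Rule` class, the `run` loop, and the fact names are assumptions made for illustration. It only shows how match, action, reaction, and error sections can drive a workflow as new facts appear.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Facts = Dict[str, object]


@dataclass
class Rule:
    name: str
    match: Callable[[Facts], bool]               # when may the rule fire?
    action: Callable[[Facts], object]            # work to perform
    reaction: Callable[[Facts, object], None]    # runs if the action succeeds
    error: Callable[[Facts, Exception], None]    # runs if the action raises


def run(rules: List[Rule], facts: Facts) -> List[str]:
    """Fire each matching rule once, repeating until nothing new matches."""
    fired: List[str] = []
    progress = True
    while progress:
        progress = False
        for rule in rules:
            if rule.name in fired or not rule.match(facts):
                continue
            fired.append(rule.name)
            progress = True
            try:
                result = rule.action(facts)
            except Exception as exc:
                rule.error(facts, exc)
            else:
                rule.reaction(facts, result)
    return fired


def record_error(facts: Facts, exc: Exception) -> None:
    facts["failed"] = str(exc)


rules = [
    Rule(
        name="assemble",
        match=lambda f: "encoded" in f and "assembled" not in f,
        action=lambda f: "+".join(f["encoded"]),
        reaction=lambda f, out: f.update(assembled=out),
        error=record_error,
    ),
    Rule(
        name="encode",
        match=lambda f: "source" in f and "encoded" not in f,
        action=lambda f: [f"{f['source']}.part{i}" for i in range(2)],
        reaction=lambda f, out: f.update(encoded=out),
        error=record_error,
    ),
]

facts: Facts = {"source": "movie"}
order = run(rules, facts)
```

The rules are listed in the "wrong" order on purpose: `assemble` only fires after `encode` has added its result to the facts, which is the data‑driven behavior that forward chaining provides.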
8 Latency‑Sensitive Applications
Services like Sagan require low latency. Stratum keeps function execution latency down through resource pools, warm capacity, micro‑batches, and priority scheduling.
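Two of these techniques, priority scheduling and micro‑batching, can be illustrated with a small queue. The source does not describe how Stratum implements them, so the `PriorityBatcher` class and the task names below are purely hypothetical; resource pools and warm capacity are not modeled.

```python
import heapq
import itertools
from typing import List, Tuple


class PriorityBatcher:
    """Hands out small batches of tasks, most urgent first (hypothetical)."""

    def __init__(self, batch_size: int) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority
        self._batch_size = batch_size

    def submit(self, priority: int, task: str) -> None:
        """Lower numbers are more urgent."""
        heapq.heappush(self._heap, (priority, next(self._seq), task))

    def next_batch(self) -> List[str]:
        """Pop up to batch_size tasks in priority order."""
        batch: List[str] = []
        while self._heap and len(batch) < self._batch_size:
            _, _, task = heapq.heappop(self._heap)
            batch.append(task)
        return batch


queue = PriorityBatcher(batch_size=2)
queue.submit(5, "bulk-encode-a")
queue.submit(0, "ui-inspect")
queue.submit(5, "bulk-encode-b")
first = queue.next_batch()
second = queue.next_batch()
```

The interactive task jumps ahead of bulk work submitted earlier, while batching amortizes per‑dispatch overhead across several tasks.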
9 Throughput‑Sensitive Applications
Services like Tapas consume massive CPU hours and are optimized for high throughput over short‑term latency, leveraging opportunistic resources on the Titus container platform.
10 Strangler Fig Migration
The team adopted the Strangler Fig pattern to gradually replace the monolithic Reloaded system with Cosmos, reducing risk by surrounding the old system before fully retiring it.
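The pattern can be sketched as a routing facade that sits in front of the old system and redirects one use case at a time. This is a generic illustration of Strangler Fig, not Netflix's migration code; `StranglerFacade`, the handlers, and the use‑case names are hypothetical.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def legacy_handler(request: str) -> str:
    """Stands in for the monolith."""
    return f"legacy:{request}"


def new_handler(request: str) -> str:
    """Stands in for a replacement service."""
    return f"new:{request}"


class StranglerFacade:
    """Routes each use case to the new system once it has been migrated."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, use_case: str, handler: Handler) -> None:
        self._migrated[use_case] = handler

    def handle(self, use_case: str, request: str) -> str:
        # Anything not yet migrated still falls through to the legacy system.
        handler = self._migrated.get(use_case, self._legacy)
        return handler(request)


facade = StranglerFacade(legacy_handler)
before = facade.handle("encode", "job-1")
facade.migrate("encode", new_handler)
after = facade.handle("encode", "job-1")
untouched = facade.handle("inspect", "job-2")
```

Callers only ever see the facade, so each use case can be moved, verified, and rolled back independently until nothing routes to the legacy handler.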
11 Learnings and Culture
Since 2018, Cosmos has grown to ~40 services. Key lessons include the importance of a platform mindset, strong engineering culture, modularity, observability, and the need for better local development, resilience, and testability.
12 Future Plans
As of 2021, Netflix planned to migrate most of its remaining workloads to Cosmos, improve the programming model, and keep improving usability, resilience, speed, and efficiency.