
Netflix Cosmos: A Cloud‑Native Platform Combining Microservices, Workflows, and Serverless

Netflix’s Cosmos platform, described publicly in 2021 and in use since 2018, combines microservice principles with asynchronous workflows and serverless execution to handle resource‑intensive media processing. It offers observability, modularity, productivity tools, and a managed delivery pipeline for workloads that can consume tens of thousands of CPUs.

Top Architect

1 Introduction

Cosmos is a compute platform that blends the best traits of microservices with asynchronous workflows and serverless execution. It targets resource‑intensive algorithms whose processing can last from minutes to years and can consume tens of thousands of CPUs.

2 Background

Netflix’s Media Cloud Engineering team operated a legacy system (Reloaded) that grew from a small, single‑purpose service into a monolith handling many use cases, making feature delivery slow and operations cumbersome.

To address these problems they built Cosmos, a workflow‑driven, media‑centric microservice platform that provides observability, modularity, developer productivity tools, and a fully managed continuous delivery pipeline.

3 Overview

Cosmos services are not pure microservices but share many characteristics: a strong contract, isolated data and binaries, and automatic scaling. They add multi‑step workflows and compute‑intensive, asynchronous serverless functions.

4 Separation of Concerns

Cosmos separates concerns along two axes: API ↔ workflow ↔ serverless functions, and platform ↔ application. The platform API offers media‑specific abstractions while hiding distributed‑computing details.

Optimus – API layer mapping external requests to internal business models.

Plato – workflow layer for business‑rule modeling.

Stratum – serverless layer for stateless, compute‑intensive functions.
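The three‑way split above can be pictured as three small components with narrow interfaces. The following is a hypothetical Python sketch, not Cosmos code: the classes only play the roles of Optimus, Plato, and Stratum, and `EncodeRequest`, `handle`, `run`, `invoke`, and `build_service` are assumed names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class EncodeRequest:
    """Internal business model built by the API layer (hypothetical shape)."""
    title_id: str
    resolutions: Tuple[str, ...]


class FunctionLayer:
    """Plays the Stratum role: a registry of stateless functions."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self._functions[name] = fn

    def invoke(self, name: str, *args: str) -> str:
        return self._functions[name](*args)


class WorkflowLayer:
    """Plays the Plato role: decides which functions run and in what order."""

    def __init__(self, functions: FunctionLayer) -> None:
        self._functions = functions

    def run(self, request: EncodeRequest) -> List[str]:
        return [
            self._functions.invoke("encode", request.title_id, resolution)
            for resolution in request.resolutions
        ]


class ApiLayer:
    """Plays the Optimus role: maps an external payload to the internal model."""

    def __init__(self, workflow: WorkflowLayer) -> None:
        self._workflow = workflow

    def handle(self, payload: dict) -> List[str]:
        request = EncodeRequest(payload["title"], tuple(payload["resolutions"]))
        return self._workflow.run(request)


def build_service() -> ApiLayer:
    """Wire the three layers together with one toy 'encode' function."""
    functions = FunctionLayer()
    functions.register("encode", lambda title, res: f"{title}@{res}")
    return ApiLayer(WorkflowLayer(functions))
```

Each layer depends only on the one below it, so the API contract, the business rules, and the compute functions can change independently, which is the point of the separation.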

5 Cosmos Service Request Flow

A typical request (e.g., video encoding) starts with an API call, fans out to a set of parallel encoding functions, then runs assembly and indexing functions, and completes after several minutes. The run can be followed in the Nirvana monitoring UI.
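The fan‑out and fan‑in shape of such a request can be sketched in a few lines of Python. This is only an illustration under assumptions: splitting the input into chunks, the function names, and the use of a local thread pool in place of remote serverless functions are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def encode_chunk(chunk: str) -> str:
    """Stand-in for one parallel encoding function (hypothetical)."""
    return f"enc({chunk})"


def assemble(encoded: List[str]) -> str:
    """Stand-in for the assembly function: joins chunks in order."""
    return "+".join(encoded)


def index(stream: str) -> Dict[str, object]:
    """Stand-in for the indexing function: records simple metadata."""
    return {"stream": stream, "length": len(stream)}


def process_request(chunks: List[str]) -> Dict[str, object]:
    # Fan out: encode all chunks concurrently. Executor.map returns
    # results in input order, so assembly sees chunks in sequence.
    with ThreadPoolExecutor(max_workers=4) as pool:
        encoded = list(pool.map(encode_chunk, chunks))
    # Fan in: assemble the encoded chunks, then index the result.
    return index(assemble(encoded))
```

In the real platform these steps run as separate functions over minutes rather than in one process; the sketch only shows the ordering of the stages.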

6 Service Layering

Cosmos enables modular service decomposition, allowing teams to own APIs and release cycles. High‑level services (e.g., Tapas, Sagan) orchestrate lower‑level services and many Stratum function calls.

7 Workflow Rules

Plato provides a forward‑chaining rule engine in which developers write rules in Emirax (a Groovy‑based DSL). Each rule has four sections: match, action, reaction, and error. Together, the rules orchestrate serverless functions automatically.
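To make the four rule sections concrete, here is a minimal rule engine written in Python rather than Emirax. It is a sketch under assumptions: the `Rule` shape, the `run_rules` loop, and the sample rules are hypothetical, and actions run synchronously here, which a real workflow engine would not require.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Facts = Dict[str, Any]


@dataclass
class Rule:
    name: str
    match: Callable[[Facts], bool]              # when should the rule fire?
    action: Callable[[Facts], Any]              # the work to perform
    reaction: Callable[[Facts, Any], None]      # record the result as new facts
    error: Callable[[Facts, Exception], None]   # handle a failed action


def run_rules(rules: List[Rule], facts: Facts) -> List[str]:
    """Fire each matching rule once, repeating until no rule can fire."""
    fired: List[str] = []
    progress = True
    while progress:
        progress = False
        for rule in rules:
            if rule.name in fired or not rule.match(facts):
                continue
            try:
                result = rule.action(facts)
            except Exception as exc:
                rule.error(facts, exc)
            else:
                rule.reaction(facts, result)
            fired.append(rule.name)
            progress = True
    return fired


def record_error(facts: Facts, exc: Exception) -> None:
    facts.setdefault("errors", []).append(str(exc))


# Hypothetical rules, listed out of order on purpose: "encode" can only
# match after "inspect" has added the "inspected" fact.
RULES = [
    Rule(
        name="encode",
        match=lambda f: "inspected" in f,
        action=lambda f: f"encoded:{f['source']}",
        reaction=lambda f, r: f.update({"encoded": r}),
        error=record_error,
    ),
    Rule(
        name="inspect",
        match=lambda f: "source" in f,
        action=lambda f: {"ok": True},
        reaction=lambda f, r: f.update({"inspected": r}),
        error=record_error,
    ),
]
```

Calling `run_rules(RULES, {"source": "movie.mov"})` fires `inspect` first and `encode` second, even though `encode` is listed first. The order comes from the facts each rule produces, not from the order in which the rules are written.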

8 Latency‑Sensitive Applications

Services like Sagan require low latency; Stratum manages latency through resource pools, warm capacity, micro‑batches, and priority scheduling.
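Two of these techniques, priority scheduling and micro‑batches, can be illustrated together. The following Python sketch is an assumption about how such a queue might look, not a description of Stratum internals; `PriorityBatcher`, `submit`, and `next_batch` are hypothetical names.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class PriorityBatcher:
    """Queue that hands out small batches, most urgent tasks first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        # The counter breaks ties so equal-priority tasks stay in
        # submission order and task objects are never compared.
        self._counter = itertools.count()

    def submit(self, task: Any, priority: int) -> None:
        """Lower numbers mean higher priority."""
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def next_batch(self, max_size: int) -> List[Any]:
        """Pop up to max_size tasks to dispatch to one warm worker."""
        batch: List[Any] = []
        while self._heap and len(batch) < max_size:
            _, _, task = heapq.heappop(self._heap)
            batch.append(task)
        return batch
```

Sending several small tasks to one already‑running worker spreads its startup cost across the batch, while the priority order keeps interactive work ahead of bulk work.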

9 Throughput‑Sensitive Applications

Services like Tapas consume massive CPU hours and are optimized for high throughput over short‑term latency, leveraging opportunistic resources on the Titus container platform.

10 Strangler Fig Migration

The team adopted the Strangler Fig pattern to gradually replace the monolithic Reloaded system with Cosmos, reducing risk by surrounding the old system before fully retiring it.
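In code, the Strangler Fig pattern usually appears as a facade that routes each use case to either the old or the new system. The Python sketch below is a generic illustration of the pattern, not Netflix's migration tooling; `StranglerFacade`, `migrate`, and the handler signatures are hypothetical.

```python
from typing import Callable, Set

Handler = Callable[[dict], str]


class StranglerFacade:
    """Routes each use case to the legacy or the replacement system."""

    def __init__(self, legacy: Handler, replacement: Handler) -> None:
        self._legacy = legacy
        self._replacement = replacement
        self._migrated: Set[str] = set()

    def migrate(self, use_case: str) -> None:
        """Mark one use case as served by the replacement from now on."""
        self._migrated.add(use_case)

    def handle(self, use_case: str, request: dict) -> str:
        # Anything not yet migrated still goes to the legacy system.
        if use_case in self._migrated:
            return self._replacement(request)
        return self._legacy(request)
```

Callers only ever talk to the facade, so use cases can move one at a time and the legacy system can be retired once nothing routes to it.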

11 Learnings and Culture

Since 2018, Cosmos has grown to ~40 services. Key lessons include the importance of a platform mindset, strong engineering culture, modularity, observability, and the need for better local development, resilience, and testability.

12 Future Plans

As of 2021, Netflix planned to migrate most workloads to Cosmos, improve the programming model, and keep improving usability, resilience, speed, and efficiency.
