Cloud Computing 14 min read

How Netflix Built the Cosmos Platform to Power Massive Media Workflows

The article explains why Netflix created the Cosmos platform, how it combines microservices, asynchronous workflows, and serverless computing to handle resource‑intensive media processing at scale, and shares the architectural decisions, components, and lessons learned from its development and operation.

ITFLY8 Architecture Home

Introduction

Cosmos is a compute platform that combines the best traits of microservices, asynchronous workflows, and serverless functions. It supports resource‑intensive, long‑running algorithms whose executions last anywhere from minutes to years, and it handles both high‑throughput and latency‑sensitive workloads.

Background

Netflix’s Media Cloud Engineering team operated a media processing system that dates back to the 2007 streaming service. Its third generation, called Reloaded, became difficult to scale and maintain as the team and its use cases grew, and its monolithic architecture hindered productivity.

To address these challenges, Netflix built Cosmos, a media‑centric, workflow‑driven microservice platform with goals of observability, modularity, developer productivity, and automated delivery.

Overview

Cosmos services retain strong contracts and isolated data like microservices but add multi‑step workflows and compute‑heavy, asynchronous serverless functions packaged as Docker images. Requests flow through an API layer (Optimus), a workflow engine (Plato), and a serverless layer (Stratum), scaling across thousands of containers.

Cosmos service

Separation of Concerns

Cosmos separates service logic into API, workflow, and serverless function layers, and it separates application concerns from platform concerns. The platform API exposes media‑specific abstractions while hiding the complexities of distributed computing.

Key subsystems:

Optimus – API layer mapping external requests to internal models.

Plato – Workflow layer for business rule modeling.

Stratum – Serverless layer for stateless, compute‑intensive functions.

All subsystems communicate asynchronously via Timestone, a large‑scale low‑latency priority queue, enabling independent deployment through managed continuous delivery.
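To make the idea of asynchronous, priority-ordered messaging concrete, here is a minimal in-memory sketch. It is not Timestone's API, which this article does not describe. The names `Message`, `ToyPriorityQueue`, `publish`, and `poll`, and the convention that a lower number means higher priority, are all assumptions made for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass(order=True)
class Message:
    priority: int                        # assumption: lower number is served first
    seq: int                             # tie-breaker: FIFO order within one priority
    topic: str = field(compare=False)    # hypothetical routing key
    payload: Any = field(compare=False)


class ToyPriorityQueue:
    """Illustrative stand-in for a priority queue; not Timestone itself."""

    def __init__(self) -> None:
        self._heap: List[Message] = []
        self._counter = itertools.count()

    def publish(self, topic: str, payload: Any, priority: int) -> None:
        # The producer returns immediately; a consumer picks the message up later.
        heapq.heappush(
            self._heap, Message(priority, next(self._counter), topic, payload)
        )

    def poll(self) -> Optional[Message]:
        # Return the most urgent message, or None when the queue is empty.
        return heapq.heappop(self._heap) if self._heap else None
```

In this sketch a producer such as the workflow layer would call `publish` and move on, while a consumer such as the serverless layer would call `poll` whenever it has capacity. That decoupling is what lets each side be deployed on its own schedule.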

Workflow Rules

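Developers define workflows in Emirax, a Groovy‑based DSL. Each rule has four sections: match (the conditions under which the rule fires), action (the work to run, typically invoking Stratum functions), reaction (what to do when the action succeeds, such as recording state changes), and error (what to do when the action fails).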
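The sketch below illustrates the four rule parts named above (match, action, reaction, error) in plain Python. It is not Emirax syntax, which this article does not show. The `Rule` class, the `run_rules` loop, the once-per-rule firing policy, and the example rule names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


@dataclass
class Rule:
    name: str
    match: Callable[[State], bool]              # should this rule fire now?
    action: Callable[[State], Any]              # the work, e.g. calling a function
    reaction: Callable[[State, Any], None]      # record the result on success
    error: Callable[[State, Exception], None]   # handle a failed action


def run_rules(rules: List[Rule], state: State) -> State:
    """Fire each matching rule at most once until no unfired rule matches."""
    fired = set()
    while True:
        rule = next(
            (r for r in rules if r.name not in fired and r.match(state)), None
        )
        if rule is None:
            return state
        fired.add(rule.name)
        try:
            result = rule.action(state)
        except Exception as exc:
            rule.error(state, exc)
        else:
            rule.reaction(state, result)


# Hypothetical two-step workflow: inspect a source file, then encode it.
example_rules = [
    Rule(
        name="inspect",
        match=lambda s: "source" in s,
        action=lambda s: {"duration_s": 120},           # placeholder result
        reaction=lambda s, r: s.update(inspection=r),
        error=lambda s, e: s.update(failed=str(e)),
    ),
    Rule(
        name="encode",
        match=lambda s: "inspection" in s,              # fires only after inspect
        action=lambda s: s["source"] + ".encoded",
        reaction=lambda s, r: s.update(encoded=r),
        error=lambda s, e: s.update(failed=str(e)),
    ),
]
final_state = run_rules(example_rules, {"source": "movie.mov"})
```

The second rule becomes eligible only because the first rule's reaction changed the state. That is the sense in which rules, rather than a fixed sequence of steps, drive the workflow in this sketch.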

Latency‑Sensitive Applications

Services such as Sagan are latency‑sensitive. Stratum manages their latency with resource pools, warm capacity, micro‑batches, and priority scheduling, trading cost against responsiveness.
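Of these techniques, micro-batching is the easiest to show in a few lines. The helper below is a generic sketch of grouping small tasks so that one container invocation can serve several of them; it does not describe how Stratum implements batching. The function name `micro_batches` and its `batch_size` parameter are assumptions.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most batch_size items, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch
```

A larger batch size spreads fixed per-invocation overhead across more tasks but makes the first task in a batch wait for the rest, which is one form of the cost versus responsiveness trade-off mentioned above.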

Throughput‑Sensitive Applications

Services such as Tapas prioritize throughput: they consume millions of CPU hours and care about the volume of tasks completed per day rather than the latency of any single request. Because Stratum, the serverless layer, runs on the Titus container platform, this work can be scheduled opportunistically.

Strangler Fig Migration

To replace the legacy Reloaded system, Netflix adopted the Strangler Fig pattern, gradually extending new functionality around the old system until full migration.
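A common way to apply the Strangler Fig pattern is a routing facade that sends each kind of work to either the old or the new system and moves workloads over one at a time. The sketch below shows that general idea only; it is not Netflix's migration tooling. `StranglerRouter`, `migrate`, `handle`, and the workload names are hypothetical.

```python
from typing import Callable, Set

Handler = Callable[[dict], str]


class StranglerRouter:
    """Route each workload type to the legacy or the new system."""

    def __init__(self, legacy: Handler, modern: Handler) -> None:
        self._legacy = legacy
        self._modern = modern
        self._migrated: Set[str] = set()

    def migrate(self, workload: str) -> None:
        # Flip one workload type over to the new system.
        self._migrated.add(workload)

    def handle(self, workload: str, request: dict) -> str:
        handler = self._modern if workload in self._migrated else self._legacy
        return handler(request)


# Hypothetical stand-ins for the two systems.
router = StranglerRouter(
    legacy=lambda req: "legacy handled " + req["id"],
    modern=lambda req: "new platform handled " + req["id"],
)
before = router.handle("video-encode", {"id": "job-1"})
router.migrate("video-encode")
after = router.handle("video-encode", {"id": "job-2"})
```

Callers keep using the same entry point throughout, so the old system can be retired once every workload type has been migrated.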

Lessons Learned

Key insights include the importance of Netflix’s engineering culture, the power of a microservice + workflow + serverless model, and the need for a platform mindset that balances flexibility for application teams with consistency and reliability from the platform team.

Future Plans

In 2021, Netflix will continue moving workloads to Cosmos, improving the programming model, and enhancing usability, resilience, speed, and efficiency.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
