How Netflix Built the Cosmos Platform to Power Massive Media Workflows
The article explains why Netflix created the Cosmos platform, how it combines microservices, asynchronous workflows, and serverless computing to handle resource‑intensive media processing at scale, and shares the architectural decisions, components, and lessons learned from its development and operation.
Introduction
Cosmos is a compute platform that combines the best traits of microservices, asynchronous workflows, and serverless functions. It targets resource-intensive algorithms coordinated by workflows that can run anywhere from minutes to years, and it handles both high-throughput and latency-sensitive workloads.
Background
Netflix's Media Cloud Engineering team operated a media processing system whose origins date to the 2007 launch of streaming and which evolved into a third-generation platform called Reloaded. As the team and its use cases grew, Reloaded's monolithic architecture became difficult to scale and maintain, and it hindered developer productivity.
To address these challenges, Netflix built Cosmos, a media‑centric, workflow‑driven microservice platform with goals of observability, modularity, developer productivity, and automated delivery.
Overview
Cosmos services retain strong contracts and isolated data like microservices but add multi‑step workflows and compute‑heavy, asynchronous serverless functions packaged as Docker images. Requests flow through an API layer (Optimus), a workflow engine (Plato), and a serverless layer (Stratum), scaling across thousands of containers.
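The request path described above can be illustrated with a small in-process sketch. It is only an illustration under stated assumptions: the names `EncodeRequest`, `api_layer`, `workflow_layer`, and `encode_function` are hypothetical, the layers are called synchronously for brevity (the real subsystems communicate asynchronously), and none of this reflects the actual Optimus, Plato, or Stratum APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EncodeRequest:
    """Hypothetical internal model produced by the API layer."""
    title_id: str
    resolutions: List[str]


def api_layer(external: dict) -> EncodeRequest:
    """API layer: map an external payload to the internal model."""
    return EncodeRequest(
        title_id=str(external["titleId"]),
        resolutions=list(external.get("resolutions", ["1080p"])),
    )


def encode_function(title_id: str, resolution: str) -> str:
    """Serverless layer: a stateless, compute-heavy function (stubbed here)."""
    return f"{title_id}@{resolution}.encoded"


def workflow_layer(request: EncodeRequest) -> List[str]:
    """Workflow layer: decide which steps run; here, one call per resolution."""
    return [encode_function(request.title_id, r) for r in request.resolutions]


def handle(external: dict) -> List[str]:
    """Run one request through all three layers in order."""
    return workflow_layer(api_layer(external))
```

Keeping each layer behind its own function mirrors the idea that each one has a single responsibility and could be changed or deployed without touching the others.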
Separation of Concerns
Cosmos separates logic into API, workflow, and serverless functions, and also separates application concerns from platform concerns. The platform API offers media-specific abstractions to application developers while hiding the complexities of distributed computing.
Key subsystems:
Optimus – API layer mapping external requests to internal models.
Plato – Workflow layer for business rule modeling.
Stratum – Serverless layer for stateless, compute‑intensive functions.
All subsystems communicate asynchronously via Timestone, a large-scale, low-latency priority queue. Each subsystem can be deployed independently through managed continuous delivery.
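To make the idea of priority-ordered messaging between subsystems concrete, here is a minimal in-memory sketch built on Python's standard `heapq` module. The class `PriorityMessageQueue` and its `publish` and `consume` methods are hypothetical names chosen for illustration; Timestone itself is a distributed system whose API is not described in the article.

```python
import heapq
import itertools
from typing import List, Optional, Tuple


class PriorityMessageQueue:
    """Toy priority queue: lower number means higher priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        # Monotonic counter keeps FIFO order among equal priorities.
        self._seq = itertools.count()

    def publish(self, priority: int, message: str) -> None:
        """Enqueue a message; the producer does not wait for a consumer."""
        heapq.heappush(self._heap, (priority, next(self._seq), message))

    def consume(self) -> Optional[str]:
        """Return the most urgent message, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, message = heapq.heappop(self._heap)
        return message
```

Because producers only enqueue and consumers only dequeue, neither side needs the other to be running at the same moment, which is the property that asynchronous communication relies on.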
Workflow Rules
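Developers define workflow rules in Emirax, a Groovy-based DSL. A rule has four sections: match (the conditions under which the rule fires), action (the work to run, typically invoking Stratum functions), reaction (what to do when the action succeeds, such as recording state changes), and error (how to handle failures).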
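The four-part rule structure can be sketched in Python as follows. This is a stand-in, not Emirax syntax (which the article does not show), and it evaluates a single rule rather than implementing a full rule engine. The `Rule` class, the `run_rule` function, and the status strings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Rule:
    """Hypothetical rule with the four sections; each one may be omitted."""
    name: str
    match: Optional[Callable[[dict], bool]] = None
    action: Optional[Callable[[dict], Any]] = None
    reaction: Optional[Callable[[dict, Any], None]] = None
    error: Optional[Callable[[dict, Exception], None]] = None


def run_rule(rule: Rule, state: dict) -> str:
    """Evaluate one rule against the workflow state and report what happened."""
    # match: skip the rule unless its condition holds (no match means always fire).
    if rule.match is not None and not rule.match(state):
        return "skipped"
    try:
        # action: the work itself, e.g. invoking a serverless function.
        result = rule.action(state) if rule.action is not None else None
    except Exception as exc:
        # error: hand the failure to the rule's handler, or re-raise if none.
        if rule.error is None:
            raise
        rule.error(state, exc)
        return "error"
    # reaction: runs only after the action succeeds, e.g. to record new state.
    if rule.reaction is not None:
        rule.reaction(state, result)
    return "ok"
```

A rule that encodes new titles might set `match` to check `state["status"] == "new"`, use `action` to call the encoder, and use `reaction` to store the output and mark the state as done.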
Latency‑Sensitive Applications
Services like Sagan require low latency. Stratum manages latency with resource pools, warm capacity, micro-batches, and priority scheduling, balancing cost against responsiveness.
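Of the techniques listed, micro-batching is the easiest to show in a few lines. The sketch below groups incoming tasks into small fixed-size batches so that one already-running worker can process several tasks per dispatch. The article does not say how Stratum forms its batches, so the size-based policy and the name `micro_batches` are assumptions.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(tasks: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield tasks in lists of at most batch_size, preserving arrival order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for task in tasks:
        batch.append(task)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Flush the final partial batch so no task is left waiting.
    if batch:
        yield batch
```

Smaller batches reduce the time any single task waits before work starts, while larger batches spread startup cost over more tasks; the batch size is the knob that trades responsiveness against cost.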
Throughput‑Sensitive Applications
Services such as Tapas prioritize throughput: they consume millions of CPU hours and care about the volume of tasks completed per day rather than the latency of any individual request. Because Stratum is built on the Titus container platform, these workloads can be scheduled opportunistically onto available resources.
Strangler Fig Migration
To replace the legacy Reloaded system, Netflix adopted the Strangler Fig pattern: new functionality is built around the old system and gradually takes over its responsibilities until the migration is complete.
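The Strangler Fig pattern is commonly implemented as a routing facade that sends each kind of work to either the old or the new system. The sketch below shows that general idea only; `StranglerRouter` and the workload names are hypothetical, and the article does not describe how Netflix actually routed traffic between Reloaded and Cosmos.

```python
from typing import Any, Callable, Set

Handler = Callable[[str, Any], str]


class StranglerRouter:
    """Facade that routes each workload type to the legacy or the new system."""

    def __init__(self, legacy: Handler, modern: Handler) -> None:
        self._legacy = legacy
        self._modern = modern
        self._migrated: Set[str] = set()

    def migrate(self, workload: str) -> None:
        """Mark a workload type as served by the new system from now on."""
        self._migrated.add(workload)

    def handle(self, workload: str, payload: Any) -> str:
        """Dispatch to the new system if migrated, otherwise to the legacy one."""
        handler = self._modern if workload in self._migrated else self._legacy
        return handler(workload, payload)
```

Callers keep using the same `handle` entry point throughout, so workloads can move one at a time and the legacy system can be retired once nothing is routed to it.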
Lessons Learned
Key insights include the importance of Netflix’s engineering culture, the power of a microservice + workflow + serverless model, and the need for a platform mindset that balances flexibility for application teams with consistency and reliability from the platform team.
Future Plans
As of 2021, Netflix planned to keep moving workloads to Cosmos while improving the programming model and enhancing usability, resilience, speed, and efficiency.
