How Netflix’s Maestro Engine Gained a 100× Speed Boost with a New Actor‑Based Architecture
Netflix’s Maestro workflow orchestrator was redesigned with a lightweight, stateful actor model and Java virtual threads, cutting engine overhead from seconds to milliseconds, delivering a hundred‑fold performance increase while preserving scalability, reliability, and strong execution guarantees for massive data and ML pipelines.
Introduction
Maestro is Netflix’s horizontally scalable workflow orchestrator for large‑scale data and machine‑learning pipelines. After more than two years of production use, users reported a ~10 second engine overhead per step, limiting productivity.
Previous Architecture
The original system consisted of three layers:
API & step runtime layer – integrated with services such as Spark and Trino, providing authentication, monitoring and alerting with minimal overhead.
Maestro engine layer – managed workflow and step lifecycles, converted workflow graphs into parallel flow tasks, and acted as a bridge to the internal flow engine. This layer suffered from race conditions and high latency due to weak guarantees in the internal flow engine and distributed job queues.
Internal flow engine layer – based on Netflix OSS Conductor 2.x, required separate database tables and job queues, introducing seconds‑to‑tens‑of‑seconds latency and lacking strong consistency.
Design Options Considered
Option 1 – Build a custom internal flow engine optimized for Maestro’s use cases.
Option 2 – Upgrade Conductor to version 4.0, which resolves many overhead issues.
Option 3 – Replace the engine with Temporal.
Option 2 required porting a removed final‑callback feature, and Option 3 added unnecessary complexity. The team selected Option 1 as the best trade‑off.
Chosen Solution: Custom Flow Engine
The new engine is lightweight, has minimal dependencies, and focuses on two core Maestro functions: state management and flow execution. It replaces external distributed job queues with an internal in‑memory Java BlockingQueue backed by a database‑supported outbox pattern, providing exactly‑once publishing and at‑least‑once delivery guarantees.
Key guarantees:
Only one worker executes a step at any time.
Step state never rolls back.
Steps always reach a terminal state.
Internal flow state eventually matches Maestro workflow state.
External API actions cannot cause race conditions.
Flow Partitioning and Ownership
Maestro groups flows into “flow groups”. Each group has an owner actor that maintains ownership metadata, allowing nodes to claim groups, rebalance after failures, and coordinate via a simple stable‑ID partition function rather than complex consistent hashing.
Actor Model and Virtual Threads
The engine adopts a stateful actor model where each workflow’s tasks are collocated on a single JVM. Java 21 virtual threads implement actors with minimal code, providing lightweight concurrency without risking out‑of‑memory errors. A separate non‑virtual worker thread pool executes user‑provided step logic to avoid deadlocks.
Strong Execution Guarantees
Maestro uses a Generation ID mechanism: each flow or task is associated with a generation ID that must match between the database and the in‑memory actor. Mismatches cause the engine to reject execution, ensuring a single actor processes a flow at any time and that state never regresses.
Reference implementation (path):
maestro-flow/src/main/java/com/netflix/maestro/flow/dao/MaestroFlowDao.javaTesting and Validation
A custom test framework samples real production workflows, caches definitions in S3, and replays them with no‑op placeholders for sub‑workflows. It validates parameter handling, step ordering, and failure detection, generating detailed failure reports via email.
Deployment Strategy
The rollout uses a parallel infrastructure behind an orchestrator gateway API. Feature flags gradually shift traffic to the new engine; if failures occur, workflows can be rolled back by removing them from the new engine’s database and restarting on the legacy stack.
Results
Over 60 000 active workflows (more than one million daily tasks) were migrated with near‑zero user impact. Engine overhead dropped from ~5 seconds to ~50 ms per step, and workflow start overhead fell from 200 ms to 50 ms. This translates to roughly 57 days of saved engine time per day, higher throughput, and a more responsive user experience.
Conclusion
The architectural evolution demonstrates that a stateful actor model combined with Java virtual threads and a simplified dependency graph can deliver a hundred‑fold performance boost while preserving scalability and strong execution guarantees, enabling low‑latency workflow use cases for large‑scale data platforms.
GitHub repository: https://github.com/Netflix/maestro
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
