Backend Development 28 min read

How Netflix’s Maestro Engine Gained a 100× Speed Boost with a New Actor‑Based Architecture

Netflix’s Maestro workflow orchestrator was redesigned with a lightweight, stateful actor model and Java virtual threads, cutting engine overhead from seconds to milliseconds, delivering a hundred‑fold performance increase while preserving scalability, reliability, and strong execution guarantees for massive data and ML pipelines.

DevOps Coach

Oct 31, 2025

How Netflix’s Maestro Engine Gained a 100× Speed Boost with a New Actor‑Based Architecture

Introduction

Maestro is Netflix’s horizontally scalable workflow orchestrator for large‑scale data and machine‑learning pipelines. After more than two years of production use, users reported a ~10 second engine overhead per step, limiting productivity.

Previous Architecture

The original system consisted of three layers:

API & step runtime layer – integrated with services such as Spark and Trino, providing authentication, monitoring and alerting with minimal overhead.

Maestro engine layer – managed workflow and step lifecycles, converted workflow graphs into parallel flow tasks, and acted as a bridge to the internal flow engine. This layer suffered from race conditions and high latency due to weak guarantees in the internal flow engine and distributed job queues.

Internal flow engine layer – based on Netflix OSS Conductor 2.x, required separate database tables and job queues, introducing seconds‑to‑tens‑of‑seconds latency and lacking strong consistency.

Design Options Considered

Option 1 – Build a custom internal flow engine optimized for Maestro’s use cases.

Option 2 – Upgrade Conductor to version 4.0, which resolves many overhead issues.

Option 3 – Replace the engine with Temporal.

Option 2 required porting a removed final‑callback feature, and Option 3 added unnecessary complexity. The team selected Option 1 as the best trade‑off.

Chosen Solution: Custom Flow Engine

The new engine is lightweight, has minimal dependencies, and focuses on two core Maestro functions: state management and flow execution. It replaces external distributed job queues with an internal in‑memory Java BlockingQueue backed by a database‑supported outbox pattern, providing exactly‑once publishing and at‑least‑once delivery guarantees.

Key guarantees:

Only one worker executes a step at any time.

Step state never rolls back.

Steps always reach a terminal state.

Internal flow state eventually matches Maestro workflow state.

External API actions cannot cause race conditions.

Flow Partitioning and Ownership

Maestro groups flows into “flow groups”. Each group has an owner actor that maintains ownership metadata, allowing nodes to claim groups, rebalance after failures, and coordinate via a simple stable‑ID partition function rather than complex consistent hashing.

Actor Model and Virtual Threads

The engine adopts a stateful actor model where each workflow’s tasks are collocated on a single JVM. Java 21 virtual threads implement actors with minimal code, providing lightweight concurrency without risking out‑of‑memory errors. A separate non‑virtual worker thread pool executes user‑provided step logic to avoid deadlocks.

Strong Execution Guarantees

Maestro uses a Generation ID mechanism: each flow or task is associated with a generation ID that must match between the database and the in‑memory actor. Mismatches cause the engine to reject execution, ensuring a single actor processes a flow at any time and that state never regresses.

Reference implementation (path):

maestro-flow/src/main/java/com/netflix/maestro/flow/dao/MaestroFlowDao.java

Testing and Validation

A custom test framework samples real production workflows, caches definitions in S3, and replays them with no‑op placeholders for sub‑workflows. It validates parameter handling, step ordering, and failure detection, generating detailed failure reports via email.

Deployment Strategy

The rollout uses a parallel infrastructure behind an orchestrator gateway API. Feature flags gradually shift traffic to the new engine; if failures occur, workflows can be rolled back by removing them from the new engine’s database and restarting on the legacy stack.

Results

Over 60 000 active workflows (more than one million daily tasks) were migrated with near‑zero user impact. Engine overhead dropped from ~5 seconds to ~50 ms per step, and workflow start overhead fell from 200 ms to 50 ms. This translates to roughly 57 days of saved engine time per day, higher throughput, and a more responsive user experience.

Conclusion

The architectural evolution demonstrates that a stateful actor model combined with Java virtual threads and a simplified dependency graph can deliver a hundred‑fold performance boost while preserving scalability and strong execution guarantees, enabling low‑latency workflow use cases for large‑scale data platforms.

GitHub repository: https://github.com/Netflix/maestro

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Distributed Systems Performance Optimization Workflow Orchestration actor-model Java virtual threads Netflix Maestro

Written by

DevOps Coach

Master DevOps precisely and progressively.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.