
Applying Netflix Conductor for Scalable Workflow Orchestration in a Logistics System

This article describes how a logistics company's platform adopted Netflix Conductor to unify and scale algorithmic task scheduling. It addresses challenges such as fragmented workflows, resource imbalance, and long‑running jobs by applying dynamic DAGs, fork‑join patterns, and robust retry mechanisms, yielding significant performance gains.

Beijing SF i-TECH City Technology Team

1. Background

The logistics platform (ZhiYu system) needed to manage dynamic and static courier scheduling, intelligent routing, and large‑scale daily data processing for order and site metrics. Existing batch jobs often ran for many hours, and failures caused data gaps that impacted downstream indicators.

1.1 Current Situation

Daily ETL tasks process massive order and site data to compute metrics such as order fulfillment rate and timeout rate, which are then used for courier optimization and regional adjustments.

1.2 Pain Points

Inconsistent scheduling management across business lines.

Repeated development of task management, scheduling, and retry logic.

Complex task dependencies requiring extensive coordination.

Severe CPU imbalance and occasional cluster overload.

Limited parallelism due to single‑machine resource constraints.

1.3 Solution Overview

A unified workflow orchestration platform was required, providing visual management, a low barrier to entry, high fault tolerance, and support for fork‑join patterns.

2. Introduction to Conductor

2.1 Selection

Four common workflow engines were compared. Only Conductor supported real‑time scheduling with horizontal scalability; the others were suited for batch big‑data processing.

2.2 Core Concepts

Conductor is a Netflix‑open‑source micro‑service orchestration engine that defines workflows via JSON DSL, executes tasks via workers, and maintains state through a Decider service and a distributed KV‑store queue.

2.3 Execution Model

Workflows are described in JSON; workers pull tasks, execute them, and report status back. The Decider matches the current state with the workflow definition to decide the next step, supporting pause, resume, and restart operations.
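To make the JSON DSL concrete, here is a minimal sketch of a workflow definition with two sequential SIMPLE tasks. The workflow and task names are illustrative, not taken from the ZhiYu system; the `${...}` expressions are Conductor's syntax for wiring one task's output into the next task's input:

```json
{
  "name": "order_metrics_workflow",
  "version": 1,
  "schemaVersion": 2,
  "tasks": [
    {
      "name": "extract_orders",
      "taskReferenceName": "extract_orders_ref",
      "type": "SIMPLE",
      "inputParameters": {
        "date": "${workflow.input.date}"
      }
    },
    {
      "name": "compute_fulfillment_rate",
      "taskReferenceName": "compute_rate_ref",
      "type": "SIMPLE",
      "inputParameters": {
        "orders": "${extract_orders_ref.output.orders}"
      }
    }
  ]
}
```

Workers poll for tasks of type `SIMPLE` by name, execute them, and post the status and output back; the Decider then evaluates this definition to pick the next task.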

2.4 Core Architecture

The engine relies on a DAG‑based state machine (Decider) and a distributed queue service. Task status updates trigger the Decider to transition workflow states.

2.5 Task Management

Conductor handles task retries based on defined retry counts, supports task time‑outs, and records task lifecycles in the queue.
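Retry and timeout behavior is declared per task definition rather than coded in workers. A hypothetical task definition for the metrics computation above might look like this (the field names are Conductor's; the values are illustrative):

```json
{
  "name": "compute_fulfillment_rate",
  "retryCount": 3,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 60,
  "timeoutSeconds": 3600,
  "timeoutPolicy": "RETRY",
  "responseTimeoutSeconds": 600
}
```

With this configuration, a failed or timed‑out execution is rescheduled up to three times with growing delays, so individual workers need no bespoke retry loops.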

3. Integrating Conductor with the ZhiYu System

3.1 Abstract Business Model

Algorithmic models are broken into independent tasks, each registered as a Conductor task and assembled into a workflow. The workflow is launched via API, and workers execute tasks in parallel.

3.2 Workflow Injection

Long‑running model tasks are off‑loaded to dedicated worker containers, enabling horizontal scaling and improved parallelism.

3.3 Encountered Issues

Idempotency for long‑running tasks required Redis‑based deduplication.

Complex fork‑join configurations needed explicit definition in Conductor.

Notification of business systems on workflow completion required implementing the WorkflowStatusListener interface.
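The deduplication idea behind the idempotency fix can be sketched as follows. This is an illustrative simulation: a plain dict stands in for Redis, whereas in production the same pattern would use an atomic Redis `SET key value NX EX <ttl>` against a shared instance; all names are assumptions, not the actual ZhiYu code.

```python
# Dict stands in for Redis; production code would use SET ... NX EX
# so that only one worker can claim a given task delivery.
_seen = {}

def acquire_once(task_id: str) -> bool:
    """Return True only for the first caller with this task_id (SETNX semantics)."""
    if task_id in _seen:
        return False
    _seen[task_id] = "in_progress"
    return True

def execute_task(task_id: str, work) -> str:
    """Run the work at most once per task_id, skipping duplicate deliveries."""
    if not acquire_once(task_id):
        return "skipped (duplicate delivery)"
    work()
    _seen[task_id] = "done"
    return "executed"
```

Because Conductor may redeliver a long‑running task (e.g., after a response timeout), the guard ensures the side effects run only once.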

4. Handling Large‑Scale CT Tasks with Conductor

4.1 Scenario Analysis

Processing nationwide courier, order, and site data demands massive parallelism. Traditional single‑node CT tasks were slow and fragile.

4.2 Business Improvements

Adopting Conductor’s dynamic DAG allowed fan‑out/fan‑in task decomposition, turning a 2‑hour job into a 10‑minute job with higher CPU utilization across nodes.
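In Conductor's DSL, this fan‑out/fan‑in shape is expressed with a `FORK_JOIN_DYNAMIC` task followed by a `JOIN`. The sketch below assumes an upstream planner task (here called `shard_planner_ref`, an illustrative name) that emits the list of shard tasks and their inputs at runtime:

```json
[
  {
    "name": "fan_out_shards",
    "taskReferenceName": "fan_out_ref",
    "type": "FORK_JOIN_DYNAMIC",
    "inputParameters": {
      "dynamicTasks": "${shard_planner_ref.output.dynamicTasks}",
      "dynamicTasksInput": "${shard_planner_ref.output.dynamicTasksInput}"
    },
    "dynamicForkTasksParam": "dynamicTasks",
    "dynamicForkTasksInputParamName": "dynamicTasksInput"
  },
  {
    "name": "join_shards",
    "taskReferenceName": "join_ref",
    "type": "JOIN"
  }
]
```

The number of parallel branches is decided per execution, so a nationwide dataset can be split into as many shards as worker capacity allows, and the `JOIN` completes only when every branch has finished.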

4.3 Implementation Steps

Introduce the Conductor SDK.

Configure service endpoints.

Extend provided base classes (e.g., SimpleSequenceRichShardProcessor) to implement business logic.

Register tasks and workflows via API.
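The worker side of the steps above boils down to a poll‑execute‑report loop. The following is a self‑contained simulation of that loop, not the Conductor SDK: an in‑memory queue stands in for Conductor's task queue, a dict for the status‑update API, and all class and method names are assumptions.

```python
from queue import Queue, Empty

class ShardWorker:
    """Illustrative poll-execute-report loop. A Queue stands in for the
    Conductor task queue; the results dict for the task-status API."""

    def __init__(self, task_queue: Queue, results: dict):
        self.task_queue = task_queue
        self.results = results

    def process(self, shard: dict) -> int:
        # Business-logic hook, analogous to overriding a base processor class.
        return sum(shard["orders"])

    def run(self):
        while True:
            try:
                task = self.task_queue.get_nowait()  # poll for work
            except Empty:
                break  # queue drained
            try:
                output = self.process(task["input"])               # execute
                self.results[task["id"]] = ("COMPLETED", output)   # report
            except Exception as exc:
                self.results[task["id"]] = ("FAILED", str(exc))

q = Queue()
q.put({"id": "shard-1", "input": {"orders": [1, 2, 3]}})
q.put({"id": "shard-2", "input": {"orders": [10, 20]}})
results = {}
ShardWorker(q, results).run()
```

Running several such workers against the same queue is what gives the horizontal scaling described above: each worker claims shards independently and reports back, and the Decider (simulated here only by the results dict) advances the workflow once all shards complete.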

4.4 Business Benefits

Processing time dropped from hours to minutes; single‑node CPU utilization rose to 65%, with a further 4% gain in multi‑node utilization. The system also became more fault tolerant and scalable.

5. Precautions

System complexity increases, requiring deeper understanding of Conductor’s concurrency, pull frequency, and task state management.

Tracing CT tasks across the cluster demands log‑id and trace‑id correlation.

Metadata resides in Redis; loss of Redis data requires re‑registration of workflows on service restart.

6. Future Plans

Complete monitoring of workflow usage, error rates, pull frequencies, and execution times.

Enhance alarm integration to provide richer context for failures.

7. Conclusion

Integrating Conductor unified workflow design across business lines, enabled reusable algorithmic tasks, isolated algorithm execution from platform services, and dramatically shortened CT task latency, allowing early‑day data availability and improving overall system robustness.

Tags: Distributed Systems, Backend Development, Task Scheduling, Logistics, Workflow Orchestration, Conductor
Written by

Beijing SF i-TECH City Technology Team

Official tech channel of Beijing SF i-TECH City. A publishing platform for technology innovation, practical implementation, and frontier tech exploration.
