How Opsflow Revolutionized Youzan's DevOps Workflow Management
This article examines the evolution of Youzan's Opsflow workflow engine, detailing its architecture, components, and how it solved numerous operational challenges such as low customizability, lack of progress visibility, and fragmented approval processes, while outlining its current status and future roadmap.
Background
As Youzan’s scale grew, its DevOps operations became increasingly complex. This created a need for more efficient coordination between development, operations, tools, and processes, so that engineers could be freed from low‑efficiency, high‑intensity manual tasks.
Opsflow, the workflow engine of Youzan’s DevOps platform, evolved over two years from a simple fixed‑order script system to a highly customizable, GUI‑driven, visual, and progress‑aware engine that now supports hundreds of daily workflows such as permission requests, component provisioning, big‑data approvals, release approvals, and CI/CD pipelines.
Problems Before Opsflow
1. Low customizability of processes
2. Participants could not perceive workflow progress
3. Lack of visualisation leading to manual checks and errors
4. Limited front‑end customisation
5. Duplicated approval processes across applications
6. No support for dynamic branching
7. Legacy system could not handle approver leave
8. Insufficient participant type support
9. High cost of onboarding new processes
10. No centralized reporting for operational insight
Opsflow System Design
2.1 Architecture
Opsflow consists of five core modules:
Opsflow‑FSM: manages finite‑state machines (FSM) for each workflow.
Opsflow‑Web: wraps FSM with RESTful APIs, handles authentication, and interacts with other DevOps subsystems.
Opsflow‑Plugins: an extensible plugin system that reacts to FSM events.
Worker: distributed task executor (based on Celery) that processes script nodes.
Monitoring module: tracks task consumption.
2.1.1 Opsflow‑FSM
Each workflow is represented as an FSM. When an administrator creates a new workflow via the GUI, a new FSM instance is stored in RDS. The FSM drives the ticket lifecycle through states such as "ES Administrator Approval" and actions like "Approve", "Reject", or "Close". The FSM advances the ticket to the "End" state upon completion.
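The state/action model described above can be sketched as a small transition table. This is a minimal illustration, not Opsflow's actual code: the `WorkflowFSM` class, its method names, and the "Rejected" and "Closed" target states are assumptions made for the example; only the "ES Administrator Approval" and "End" states and the three actions come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowFSM:
    """A tiny finite-state machine: (state, action) -> next state."""
    initial: str
    transitions: dict = field(default_factory=dict)

    def add_transition(self, source: str, action: str, target: str) -> None:
        self.transitions[(source, action)] = target

    def available_actions(self, state: str) -> list:
        # Actions that are legal from the given state, in insertion order.
        return [a for (s, a) in self.transitions if s == state]

    def advance(self, state: str, action: str) -> str:
        # Reject any action that has no edge leaving the current state.
        if (state, action) not in self.transitions:
            raise ValueError(f"action {action!r} is not allowed in state {state!r}")
        return self.transitions[(state, action)]


# Hypothetical definition of an ES request workflow.
fsm = WorkflowFSM(initial="ES Administrator Approval")
fsm.add_transition("ES Administrator Approval", "Approve", "End")
fsm.add_transition("ES Administrator Approval", "Reject", "Rejected")  # assumed target
fsm.add_transition("ES Administrator Approval", "Close", "Closed")     # assumed target

state = fsm.advance(fsm.initial, "Approve")
print(state)
```

Keeping the transitions as plain data is what would let a GUI create or edit a workflow without code changes: a new workflow is just a new set of rows.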
2.1.2 Opsflow‑Web
Opsflow‑Web exposes the FSM through RESTful APIs, adds permission checks, and renders actionable buttons on the front‑end based on the FSM’s possible transitions. For example, during the "New ES Request" flow, the web layer presents three buttons corresponding to the three possible transitions.
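One way to derive the buttons from the FSM's outgoing transitions, combined with a permission check, is sketched below. The `TRANSITIONS` and `ALLOWED_ROLES` tables, the role names, and the `buttons_for` function are all hypothetical; the article does not describe how permissions are modelled.

```python
# Hypothetical transition table: state -> {action: next state}.
TRANSITIONS = {
    "ES Administrator Approval": {
        "Approve": "End",
        "Reject": "Rejected",
        "Close": "Closed",
    },
}

# Hypothetical permission table: action -> roles allowed to trigger it.
ALLOWED_ROLES = {
    "Approve": {"approver"},
    "Reject": {"approver"},
    "Close": {"approver", "requester"},
}


def buttons_for(state: str, role: str) -> list:
    """Return the actions the given role may trigger from `state`."""
    actions = TRANSITIONS.get(state, {})
    return [a for a in actions if role in ALLOWED_ROLES.get(a, set())]


print(buttons_for("ES Administrator Approval", "approver"))
print(buttons_for("ES Administrator Approval", "requester"))
```

Because the button list is computed from the transition table, the front‑end never needs per‑workflow button logic.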
2.1.3 Opsflow‑Plugins
Plugins receive events emitted by the FSM during state transitions. Simple plugins (e.g., enterprise‑WeChat notifications, task reminders) can be added by implementing a callback interface, enabling rapid feature extensions without bloating the core engine.
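A callback‑style plugin interface of this kind might look like the sketch below. The `TransitionEvent` fields, the `PluginRegistry` class, and the choice to swallow plugin errors so they cannot block a workflow are assumptions for illustration, not the documented Opsflow interface. The notification plugin only records a message instead of calling a real messaging API.

```python
from typing import Callable, NamedTuple


class TransitionEvent(NamedTuple):
    ticket_id: int
    source: str
    action: str
    target: str


Plugin = Callable[[TransitionEvent], None]


class PluginRegistry:
    def __init__(self) -> None:
        self._plugins = []

    def register(self, plugin: Plugin) -> Plugin:
        self._plugins.append(plugin)
        return plugin  # returning it lets register() be used as a decorator

    def emit(self, event: TransitionEvent) -> None:
        for plugin in self._plugins:
            try:
                plugin(event)
            except Exception as exc:  # assumed policy: a failing plugin is logged, not fatal
                print(f"plugin {plugin.__name__} failed: {exc}")


registry = PluginRegistry()
sent = []


@registry.register
def notify(event: TransitionEvent) -> None:
    # Stand-in for an enterprise-WeChat notification; only records the message.
    sent.append(f"ticket {event.ticket_id}: {event.source} -> {event.target}")


registry.emit(TransitionEvent(42, "ES Administrator Approval", "Approve", "End"))
print(sent)
```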
2.1.4 Worker
When a workflow reaches a "script" node, Opsflow‑Web pushes a task to a message queue. Workers consume these tasks, execute the required actions (e.g., provisioning ES resources), and then trigger the FSM to continue. Workers use Celery, allowing horizontal scaling when the queue builds up.
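The hand‑off between the web layer and a worker can be illustrated with the standard library alone. Note that this sketch deliberately swaps Celery and its broker for an in‑process `queue.Queue` and a thread, so that it runs without extra dependencies; the function names, the ticket store, and the task payload shape are hypothetical.

```python
import queue
import threading

task_queue = queue.Queue()
ticket_state = {7: "Provision ES (script)"}  # hypothetical ticket store


def provision_es(ticket_id: int) -> str:
    # Placeholder for the real provisioning script.
    return f"es-cluster-for-ticket-{ticket_id}"


def on_script_done(ticket_id: int) -> None:
    # Stand-in for the callback that tells the FSM to continue.
    ticket_state[ticket_id] = "End"


def worker() -> None:
    while True:
        task = task_queue.get()
        if task is None:  # sentinel: shut the worker down
            break
        provision_es(task["ticket_id"])
        on_script_done(task["ticket_id"])


thread = threading.Thread(target=worker)
thread.start()
task_queue.put({"ticket_id": 7})  # what the web layer would do at a script node
task_queue.put(None)
thread.join()
print(ticket_state[7])
```

With a real broker, adding capacity would mean starting more worker processes against the same queue, which is the horizontal scaling the text refers to.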
2.1.5 Front‑end
The front‑end provides default components such as process diagrams, ticket progress, and detail views. Administrators can configure visibility and order of these components. Custom React components can be loaded dynamically (via react‑loadable) for specific workflows, receiving rich properties that contain all ticket data.
Process Diagram Rendering
Opsflow‑FSM stores the workflow as a Directed Acyclic Graph (DAG). The front‑end renders this graph with the dagre‑d3 library, which implements a rank‑based layout algorithm, to produce readable flow diagrams.
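To give a feel for what "rank‑based" means, the sketch below assigns each node a rank (its layer in the diagram) using simple longest‑path layering. It is a simplified illustration of the idea, not dagre‑d3's implementation, and the node names are hypothetical.

```python
from collections import defaultdict, deque


def assign_ranks(edges: list) -> dict:
    """Longest-path layering: rank(v) = 1 + max(rank of predecessors)."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))

    rank = {n: 0 for n in nodes}
    # Start from nodes with no incoming edges and process in topological order.
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for nxt in successors[node]:
            rank[nxt] = max(rank[nxt], rank[node] + 1)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if visited != len(nodes):
        raise ValueError("graph contains a cycle")
    return rank


ranks = assign_ranks([
    ("Submit", "Leader Approval"),
    ("Leader Approval", "ES Administrator Approval"),
    ("ES Administrator Approval", "Provision"),
    ("Provision", "End"),
    ("Leader Approval", "End"),
])
print(sorted(ranks.items(), key=lambda item: item[1]))
```

Nodes with the same rank are drawn on the same row, which is why every edge points "downward" in the rendered diagram.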
How Opsflow Addresses the Original Problems
Problems 1‑4, 9: A GUI lets administrators configure FSM nodes and edges, spot problems in a flow immediately, and adjust them on the spot. The new front‑end structure addresses low customizability and the lack of progress visibility.
Problem 5: Consolidating most processes onto Opsflow eliminates duplicated effort across applications.
Problem 6: Conditional expressions enable dynamic branching. Example expression:
{row_count} >= 1000000 and not {upload}
During execution, placeholders such as {row_count} and {upload} are replaced with ticket attributes, allowing the FSM to decide the next node based on runtime data.
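A possible way to evaluate such an expression is sketched below: substitute each placeholder with the ticket's value, parse the result, and evaluate only a small whitelist of syntax nodes rather than calling `eval`. The article does not say how Opsflow evaluates expressions, so the function names and the whitelist approach are assumptions.

```python
import ast
import operator
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")

_COMPARE = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
}


def evaluate_condition(expression: str, ticket: dict) -> bool:
    """Substitute {placeholders} with ticket attributes and evaluate the result."""
    # Replace each {name} with the Python literal of the ticket attribute.
    source = PLACEHOLDER.sub(lambda m: repr(ticket[m.group(1)]), expression)
    return bool(_eval(ast.parse(source, mode="eval").body))


def _eval(node):
    # Only literals, and/or/not, unary minus, and comparisons are accepted.
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BoolOp):
        values = [_eval(v) for v in node.values]
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not _eval(node.operand)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    if isinstance(node, ast.Compare):
        left = _eval(node.left)
        for op, comparator in zip(node.ops, node.comparators):
            compare = _COMPARE.get(type(op))
            if compare is None:
                raise ValueError(f"unsupported operator: {type(op).__name__}")
            right = _eval(comparator)
            if not compare(left, right):
                return False
            left = right
        return True
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


rule = "{row_count} >= 1000000 and not {upload}"
print(evaluate_condition(rule, {"row_count": 2_500_000, "upload": False}))
print(evaluate_condition(rule, {"row_count": 2_500_000, "upload": True}))
```

The first call selects the branch for large, non‑upload requests; the second does not. Function calls and attribute access are rejected, which keeps administrator‑written expressions from running arbitrary code.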
Problem 7: Integration with internal OA automatically substitutes approvers with their leave proxies.
Problem 8: Opsflow supports a wide range of participant types, including configurable users, logical AND/OR groups, team leaders, custom scripts (Python), leave proxies, internal app approvals, external system notifications, enterprise‑WeChat alerts, and escalation reminders.
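The logical AND/OR groups and leave proxies mentioned above can be modelled as a small rule tree. This is a sketch under assumptions: the `User`, `AllOf`, and `AnyOf` types, the `proxies` mapping (approver on leave to substitute), and the user names are hypothetical, and the other participant types are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str


@dataclass(frozen=True)
class AllOf:
    members: tuple  # every member rule must be satisfied (logical AND)


@dataclass(frozen=True)
class AnyOf:
    members: tuple  # at least one member rule must be satisfied (logical OR)


def is_satisfied(rule, approved: set, proxies: dict) -> bool:
    """True when the approvals collected so far satisfy the participant rule."""
    if isinstance(rule, User):
        # An approver on leave is replaced by their proxy.
        effective = proxies.get(rule.name, rule.name)
        return effective in approved
    if isinstance(rule, AllOf):
        return all(is_satisfied(m, approved, proxies) for m in rule.members)
    if isinstance(rule, AnyOf):
        return any(is_satisfied(m, approved, proxies) for m in rule.members)
    raise TypeError(f"unknown participant rule: {rule!r}")


# alice must approve, plus either bob or carol; alice is on leave, dave covers.
rule_tree = AllOf((User("alice"), AnyOf((User("bob"), User("carol")))))
on_leave = {"alice": "dave"}

print(is_satisfied(rule_tree, {"dave", "carol"}, on_leave))
print(is_satisfied(rule_tree, {"carol"}, on_leave))
```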
Problem 10: Periodic statistical jobs generate dashboards that give administrators a clear view of each workflow’s operational health.
Current Status
After more than a year of iteration, Opsflow now supports over 90 distinct workflows covering all aspects of DevOps, big‑data platforms, and even the beauty‑industry department. The system has markedly improved usability, functionality, extensibility, and stability.
Roadmap
Future plans include cross‑environment workflow synchronization (e.g., QA ↔︎ Prod), an open management console for developers to create and edit workflows, workflow cloning, a more user‑friendly mobile approval experience, and additional features to further accelerate new process onboarding.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.