
Building an Event-Driven Automated Operations Platform (Whale)

Whale is an event-driven automated operations platform. Developers package platform functions as atomic tasks, users compose those tasks into workflows, and a rule-matching engine triggers the workflows in real time from events arriving at an event center. Execution runs on a StackStorm-based engine that provides fault-tolerant, cross-datacenter orchestration, with AI-enhanced self-healing planned for the future.

Youzan Coder

As companies grow and business complexity increases, the demands on operations automation keep rising. Operations practice typically evolves through four stages: manual operations → automated operations → DevOps → intelligent operations. This article introduces the design and implementation of Whale, an event-driven automated operations platform.

Overall Design: The system works in two phases, an orchestration period and a runtime.

Orchestration Period: Platform developers encapsulate various platform functions as atomic tasks. Users combine atomic tasks to build processing workflows, configure events to be monitored in the event center, set up event-task associations in the rule matching module, and assemble task parameters based on event content.
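The split between developer-provided atomic tasks and user-composed workflows can be sketched as follows. This is a minimal illustration, not Whale's actual code: the `atomic_task` decorator, the `TASKS` registry, the task names, and the context fields are all hypothetical.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry of atomic tasks; names and signatures are illustrative only.
TASKS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {}


def atomic_task(name: str):
    """Register a function as an atomic task under the given name."""
    def decorator(func):
        TASKS[name] = func
        return func
    return decorator


@atomic_task("check_disk")
def check_disk(ctx: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder logic: a real task would query the host.
    return {"over_threshold": ctx["usage_percent"] >= ctx["threshold_percent"]}


@atomic_task("clean_tmp")
def clean_tmp(ctx: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder logic: only "clean" when the previous step flagged the disk.
    return {"cleaned": bool(ctx.get("over_threshold"))}


def run_workflow(steps: List[str], params: Dict[str, Any]) -> Dict[str, Any]:
    """Run tasks in order, merging each task's output into a shared context."""
    ctx = dict(params)
    for step in steps:
        ctx.update(TASKS[step](ctx))
    return ctx


result = run_workflow(
    ["check_disk", "clean_tmp"],
    {"host": "vm-01", "usage_percent": 93, "threshold_percent": 90},
)
print(result["cleaned"])  # True
```

Here a workflow is just an ordered list of task names; in Whale the composition is expressed in the execution engine's workflow definitions rather than in Python.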

Runtime: When an operations object changes state, an event is generated and enters the event center. The rule engine receives the event, produces a matching result, and triggers the corresponding workflow. The workflow's atomic operations then drive further state changes until the target state is reached, at which point feedback is sent to administrators.

Event Center: Acts as the event entry point, responsible for standardizing external event interfaces, validating event legality, providing user configuration capabilities, and using Elasticsearch for event storage and retrieval.
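The standardization and validation step at the event entry point might look roughly like the sketch below. The field names (`source`, `type`, `object_id`, `payload`) are an assumed schema chosen for illustration, and storage in Elasticsearch is deliberately left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict

# Assumed minimal schema; the real event format is not given in the article.
REQUIRED_FIELDS = ("source", "type", "object_id")


@dataclass(frozen=True)
class Event:
    source: str
    type: str
    object_id: str
    payload: Dict[str, Any] = field(default_factory=dict)
    received_at: str = ""


def normalize_event(raw: Dict[str, Any]) -> Event:
    """Validate a raw external event and convert it to the internal shape."""
    missing = [key for key in REQUIRED_FIELDS if not raw.get(key)]
    if missing:
        raise ValueError(f"invalid event, missing fields: {missing}")
    return Event(
        source=str(raw["source"]),
        type=str(raw["type"]).lower(),  # normalize casing across producers
        object_id=str(raw["object_id"]),
        payload=dict(raw.get("payload", {})),
        received_at=datetime.now(timezone.utc).isoformat(),
    )


event = normalize_event(
    {"source": "monitor", "type": "DISK_USAGE_HIGH", "object_id": "vm-01",
     "payload": {"usage_percent": 93}}
)
print(event.type)  # disk_usage_high
```

Rejecting malformed events at the entry point keeps the downstream rule matching simple, since every rule can rely on the same set of fields being present.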

Rule Matching Module: The brain of event-driven operations, consisting of three parts: rule feature matching, task parameter assembly, and task trigger rate limiting. Feature matching supports common comparisons such as equals, not equals, greater than, less than, and regex matching. Parameter assembly uses a DSL to define templates that render task parameters from event content. Rate limiting uses a cache-based token bucket limiter to prevent duplicate task creation.
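The three parts of the module can be sketched together in a few dozen lines. Everything here is an assumption made for illustration: the rule layout, the operator names, the use of Python's `string.Template` in place of Whale's DSL, and an in-process dictionary standing in for the shared cache that holds the token buckets.

```python
import operator
import re
import time
from string import Template
from typing import Any, Callable, Dict, List, Optional

# Feature matching: the comparison operators named in the text.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "lt": operator.lt,
    "regex": lambda value, pattern: re.search(pattern, str(value)) is not None,
}


def matches(event: Dict[str, Any], conditions: List[Dict[str, Any]]) -> bool:
    """A rule matches only if every condition holds for the event."""
    return all(
        c["field"] in event and OPERATORS[c["op"]](event[c["field"]], c["value"])
        for c in conditions
    )


def assemble_params(templates: Dict[str, str], event: Dict[str, Any]) -> Dict[str, str]:
    """Render each parameter template with values taken from the event."""
    return {name: Template(t).substitute(event) for name, t in templates.items()}


class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at a fixed rate."""

    def __init__(self, capacity: int, refill_per_sec: float, now: Optional[float] = None):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


BUCKETS: Dict[str, TokenBucket] = {}  # in-process stand-in for the shared cache


def should_trigger(rule: Dict[str, Any], event: Dict[str, Any],
                   now: Optional[float] = None) -> Optional[Dict[str, str]]:
    """Return rendered task parameters, or None if unmatched or rate limited."""
    if not matches(event, rule["conditions"]):
        return None
    key = f'{rule["name"]}:{event["object_id"]}'
    # Assumed policy: one task per rule and object every 300 seconds.
    bucket = BUCKETS.setdefault(key, TokenBucket(1, 1 / 300, now=now))
    if not bucket.allow(now):
        return None
    return assemble_params(rule["params"], event)


rule = {
    "name": "disk-cleanup",
    "conditions": [
        {"field": "type", "op": "eq", "value": "disk_usage_high"},
        {"field": "usage_percent", "op": "gt", "value": 90},
    ],
    "params": {"host": "$object_id", "message": "disk at ${usage_percent}%"},
}
event = {"type": "disk_usage_high", "object_id": "vm-01", "usage_percent": 93}

print(should_trigger(rule, event, now=0.0))   # {'host': 'vm-01', 'message': 'disk at 93%'}
print(should_trigger(rule, event, now=10.0))  # None (rate limited)
```

Keying the bucket on rule name plus object ID means a burst of identical alerts for one machine creates a single task, while alerts for other machines are unaffected. The `now` parameter exists only to make the example deterministic.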

Execution Engine: Built on StackStorm, which provides a stateful, programmable, fault-tolerant, and extensible distributed task orchestration framework with workflow support. Users can combine atomic tasks easily by writing YAML. A multi-cluster proxy sits in front of the clusters to isolate traffic, supporting cross-datacenter high availability and upgrades that are transparent to users.
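One way a multi-cluster proxy could choose where to send a request is sketched below. The article does not describe the proxy's routing logic, so the `Cluster` fields, the per-group preference lists, and the cluster names are hypothetical, and no StackStorm API is used.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Cluster:
    name: str
    datacenter: str
    healthy: bool = True
    draining: bool = False  # set while the cluster is being upgraded


def pick_cluster(group: str, routes: Dict[str, List[str]],
                 clusters: Dict[str, Cluster]) -> Cluster:
    """Return the first usable cluster in the group's preference list."""
    for name in routes.get(group, []):
        cluster = clusters[name]
        if cluster.healthy and not cluster.draining:
            return cluster
    raise RuntimeError(f"no available cluster for group {group!r}")


clusters = {
    "st2-dc-a": Cluster("st2-dc-a", datacenter="dc-a"),
    "st2-dc-b": Cluster("st2-dc-b", datacenter="dc-b"),
}
# Each traffic group has its own ordered list, which keeps groups isolated.
routes = {"db-ops": ["st2-dc-a", "st2-dc-b"], "vm-ops": ["st2-dc-b", "st2-dc-a"]}

print(pick_cluster("db-ops", routes, clusters).name)  # st2-dc-a

clusters["st2-dc-a"].draining = True  # start upgrading the cluster in dc-a
print(pick_cluster("db-ops", routes, clusters).name)  # st2-dc-b
```

Marking a cluster as draining moves new requests to the next cluster in the list, which is how an upgrade could stay invisible to users in a setup like this.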

Practical Applications: Whale has been used to automate alert handling, including VM disk capacity warnings and dmesg error cleanup, through WeCom integration.

Future Prospects: Remove the need for manual authorization, combine alerts with contingency plans to build fault self-healing products, enable ChatOps through WeCom integration, provide a visual workflow builder, and integrate AI capabilities for intelligent operations.

Tags: System Architecture, DevOps, event-driven, AIOps, StackStorm, workflow orchestration, operations automation