Operations 7 min read

StackStorm‑Based Monitoring Alert Auto‑Remediation Solution

This article introduces a StackStorm‑driven monitoring and alert auto‑remediation architecture that converges alarms, performs root‑cause analysis, and executes self‑healing actions, detailing its components, workflow, configuration examples, and real‑world deployment outcomes.

360 Tech Engineering
360 Tech Engineering
360 Tech Engineering
StackStorm‑Based Monitoring Alert Auto‑Remediation Solution

Background

Fault self‑healing is a hot topic in operations, and many companies invest heavily in developing various components. Properly orchestrating these components and avoiding duplicate development is a pressing challenge.

The 360 operations team built a monitoring‑alert auto‑remediation solution based on StackStorm.

Goal

Converge alarm events and perform self‑healing processing:

Convergence: merge alarms to reduce their number.

Self‑healing: use a predefined rule library for root‑cause analysis and automatically handle alarms when a rule matches.

Vision

Create an enterprise‑grade cloud service with a dedicated medical‑center and self‑healing center.

StackStorm Overview

StackStorm is an open‑source, event‑driven automation platform that can integrate existing workflows, platforms, APIs, and any executable environment. It treats each action as an atomic unit that can be shared across projects.

Supported scenarios include:

Fault diagnosis: capture alerts from monitoring systems like Nagios or Zabbix and combine diagnostic actions.

Automatic execution: e.g., automatically resolve OpenStack compute node hardware failures, orchestrate email notifications, etc.

CI/CD: integrate with Jenkins, AWS, and other tools for continuous deployment and integration.

StackStorm Components

Sensors: listen for external events and trigger execution.

Trigger: the concrete representation of an external event, linking sensors and rules.

Rule: maps triggers to actions or workflows, defining matching criteria.

Action: executable steps such as scripts, API calls, SSH commands, Docker, Salt, Jenkins, etc.

Workflow: an ordered collection of actions.

Pack: a collection of related content, essentially a project.

StackStorm Workflow Explanation

Sensors receive event streams via pull/push, triggers inject data into StackStorm, rules match the data against criteria, and matching rules launch the associated workflow composed of actions.

Pack Structure

Workflow Structure

Sensor Configuration (YAML) and Python Script

Example YAML configuration and corresponding Python script are shown in the original images.

Rule Example

Application Instance

The solution has been deployed to cover 70‑80% of common operational scenarios within the team, automating most daily alerts, saving personnel effort, improving efficiency, and minimizing service downtime.

Future work includes integrating business topology and call‑graph data, leveraging monitoring and log data with AI algorithms for proactive fault prediction and early warning.

Conclusion

By combining existing troubleshooting processes with StackStorm, a fully automated, self‑healing cloud service has been realized without manual intervention.

MonitoringworkflowStackStormoperations automationAlert ConvergenceAuto‑Remediation
360 Tech Engineering
Written by

360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.