Building a Simple Cloud‑Native Alert Platform: Features, Architecture & Roadmap
This article describes the design and implementation of a lightweight cloud‑native alert platform. It covers the platform's core features, planned enhancements, system architecture, and demo screenshots, with practical takeaways for SREs and operations teams facing growing monitoring workloads.
Introduction
Hello, I’m Joker, an operations engineer and cloud‑native enthusiast.
Why a Custom Alert Platform?
As more teams and clusters feed in monitoring data, alerts become chaotic, historical alerts are hard to query, and handling progress is hard to track. Existing SaaS alert systems such as 快猫 and 睿象云 cannot be used because of network restrictions, so we built a simple in‑house alert platform to meet day‑to‑day business needs.
Current Platform Features
Alert grouping: adopts the collaboration‑space concept from 快猫 for grouping alerts.
Configurable notification templates: allows teams to customize templates for different business requirements.
Multiple notification channels: currently supports Enterprise WeChat, with plans to add SMS, email, and phone calls.
Selectable notification strategies: supports single‑channel and multi‑channel strategies to route alerts of different severity to appropriate recipients.
Alert silencing.
Alert claiming.
On‑call scheduling management.
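As a rough illustration of how templates, silencing, and level‑based strategies could fit together, here is a minimal Python sketch. It is not the platform's actual code: the `Alert` fields, the `TEMPLATES` and `STRATEGIES` tables, the severity names, and the channel names `wecom` and `phone` are all assumptions made for the example (phone is only a planned channel).

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class Alert:
    name: str
    severity: str   # assumed levels: "critical", "warning", "info"
    space: str      # collaboration space the alert is grouped into
    summary: str
    labels: dict = field(default_factory=dict)


# Hypothetical per-space templates; teams could override the default.
TEMPLATES = {
    "payments": Template("[$severity] $name in $space: $summary"),
}
DEFAULT_TEMPLATE = Template("[$severity] $name: $summary")

# Hypothetical strategy table keyed by alert level.
# One channel is a single-channel strategy, several a multi-channel one.
STRATEGIES = {
    "critical": ["wecom", "phone"],
    "warning": ["wecom"],
}
DEFAULT_CHANNELS = ["wecom"]


def render(alert):
    """Render the notification text with the space's template."""
    template = TEMPLATES.get(alert.space, DEFAULT_TEMPLATE)
    return template.substitute(
        severity=alert.severity.upper(),
        name=alert.name,
        space=alert.space,
        summary=alert.summary,
    )


def is_silenced(alert, silences):
    """A silence is a non-empty dict of label matchers; all pairs must match."""
    return any(
        s and all(alert.labels.get(k) == v for k, v in s.items())
        for s in silences
    )


def plan_notifications(alert, silences=()):
    """Return (channel, text) pairs to send, or nothing if silenced."""
    if is_silenced(alert, silences):
        return []
    text = render(alert)
    channels = STRATEGIES.get(alert.severity, DEFAULT_CHANNELS)
    return [(channel, text) for channel in channels]
```

Under these assumptions, a critical alert in the `payments` space would be planned for two channels, while the same alert produces no notifications once a silence matching its labels is active.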
Planned Enhancements
Root‑cause inference: provide basic initial diagnosis for each alert to help SREs locate issues quickly.
Automatic remediation: automate handling of alerts that can be resolved without human intervention.
Additional alert sources: beyond Prometheus, integrate Zabbix, Alibaba Cloud, Tencent Cloud, etc.
Advanced dispatch strategies: enable routing based on labels, time windows, and other criteria, not just alert level.
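The advanced dispatch idea is still on the roadmap, so the following is only a sketch of one way label and time‑window routing might look. `DispatchRule`, its fields, and the receiver names are hypothetical; the first matching rule wins, and a window whose start is later than its end is treated as wrapping past midnight.

```python
from dataclasses import dataclass, field
from datetime import datetime, time


@dataclass
class DispatchRule:
    name: str
    receivers: list
    match_labels: dict = field(default_factory=dict)  # every pair must match
    start: time = time.min   # window start, inclusive
    end: time = time.max     # window end, inclusive

    def in_window(self, now):
        t = now.time()
        if self.start <= self.end:
            return self.start <= t <= self.end
        # Window wraps past midnight, e.g. 22:00 to 06:00.
        return t >= self.start or t <= self.end

    def matches(self, labels, now):
        labels_ok = all(
            labels.get(k) == v for k, v in self.match_labels.items()
        )
        return labels_ok and self.in_window(now)


def dispatch(labels, now, rules, default_receivers):
    """Return the receivers of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(labels, now):
            return rule.receivers
    return default_receivers
```

For example, with a night rule for `team=db` covering 22:00 to 06:00 placed ahead of a catch‑all `team=db` rule, an alert at 23:30 would go to the night receivers and one at noon to the daytime group.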
Architecture Overview (V1.0)
The system consists of:
Management console: enables SREs to configure integrations, handle alerts, and query history.
Mobile H5 client: allows users to view, claim, and silence alerts on smartphones.
The management console’s front end and back end are built with the gin-vue-admin framework.
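The article does not show how alerts enter the platform. Assuming Prometheus alerts arrive through an Alertmanager webhook receiver, an integration might first normalize each payload into a common internal shape, which would also make it easier to add other sources later. The sketch below is in Python for brevity, not the Go used by gin-vue-admin. The input keys (`alerts`, `labels`, `annotations`, `status`, `startsAt`, `fingerprint`) follow the Alertmanager webhook format; the output field names and the reliance on `severity` and `summary` being present are assumptions, since those are conventions, not guarantees.

```python
def normalize_alertmanager(payload):
    """Flatten an Alertmanager webhook payload into internal alert events."""
    events = []
    for item in payload.get("alerts", []):
        labels = item.get("labels", {})
        annotations = item.get("annotations", {})
        events.append({
            "source": "prometheus",
            "fingerprint": item.get("fingerprint", ""),
            "status": item.get("status", "firing"),
            "name": labels.get("alertname", "unknown"),
            # "severity" is a common label convention, so fall back if absent.
            "severity": labels.get("severity", "info"),
            "summary": annotations.get("summary", ""),
            "started_at": item.get("startsAt", ""),
            "labels": labels,
        })
    return events
```

Keeping the raw labels on each event leaves room for later features, such as grouping into collaboration spaces or label‑based routing, to work from the same record.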
Demo Screenshots
Dashboard:
Collaboration Space:
Fault List:
Notification Templates:
Notification Channels:
Notification Strategies:
On‑call Scheduling:
Alert details within a collaboration space:
Current notification strategy (by alert level):
Sample alert received by the notification endpoint:
Clicking “Unresolved Alerts” opens the H5 page showing related alert information:
The platform currently implements the core features described above. Some features are still incomplete or missing, and feedback from the community is welcome.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.