How We Rebuilt Our Monitoring System into a Scalable Alert Service

After two months of intensive development, the team launched a new monitoring and alerting platform that turns a legacy monitoring system into a service‑oriented one. It addresses inflexible escalation, noisy alerts, and unclear ownership, and it adds phone alerts, automated escalation, Prometheus integration, and a unified rule engine.

HaoDF Tech Team

Jul 8, 2020

Background and Motivation

The original monitoring setup, built on Zabbix and custom scripts, could not keep pace as the Good Doctor Online platform grew between 2006 and 2020. Frequent false alarms, manual round‑the‑clock on‑call duty, the absence of escalation paths, and high maintenance costs prompted a complete redesign.

Why Not Use Existing Open‑Source Solutions?

Existing tools lacked an automated escalation workflow (system owner → technical lead → CTO/CEO).

Alert routing was static, sending notifications only to a fixed ops group instead of empowering developers and leads.

Open‑source rule engines lacked the abstraction needed for silencing, throttling, and multi‑channel notifications.

The goal was to evolve from a "system" to a "service" that could be extended with custom features.

Key Problems Identified

On‑call duty relied on manual phone calls and SMS, which caused fatigue and delayed responses.

Alert ownership was unclear, leading to missed notifications and prolonged outages.

Alert rules were hard to maintain and required frequent changes by a single team.

No quantitative analysis of which teams generated the most alerts.

Design of the New Monitoring Service (Dolphin)

The team defined a five‑layer monitoring model:

Client monitoring – user behavior, app version, OS, network.

Business layer – login, registration, order, payment metrics.

Application layer – request counts, SQL results, cache hit rate, QPS.

System layer – CPU, memory, disk usage of hosts.

Network layer – gateway traffic, packet loss, connection counts.

Initially the focus is on client, business, and application layers, with plans to extend to system and network layers.

Core Features Implemented

Phone‑call escalation: A multi‑step call chain guarantees that on‑call engineers are awakened even during off‑hours.

Automated escalation: If an incident is not resolved within 15 minutes, the technical lead is notified; after 30 minutes, the CTO receives the alert.

Rule‑to‑team mapping: Abstract concepts such as users, teams, applications, and monitoring items allow dynamic assignment of alerts to responsible owners.

Prometheus‑based data pipeline: Logs are collected into Elasticsearch, then scraped into Prometheus. PromQL queries evaluate predefined rules, and alerts are sent via multiple channels.

Enterprise WeChat robot: Alerts are pushed to technical groups with @mentions, providing a collaborative response channel.

Silence/quiet period: Configurable silence windows prevent alert storms, and the system automatically releases silence after a timeout to avoid forgotten silences.

Unified rule engine: Business teams can create, modify, or delete alert rules directly, reducing dependency on a dedicated ops team.

Historical alert log: All alert events are stored for easy querying and post‑mortem analysis.

Application overview portal: A single entry point links to documentation, Confluence pages, and other resources for each monitored service.

HR/OPS integration: Real‑time sync of employee status ensures alert ownership stays up‑to‑date.
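
As a rough illustration of the automated escalation rule in the list above, the following Python sketch computes who should have been notified at a given moment. The 15‑ and 30‑minute thresholds come from the article; the `Application` and `Incident` classes, their field names, and the `recipients` function are hypothetical and are not taken from Dolphin's code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Thresholds stated in the article: lead after 15 minutes, CTO after 30.
LEAD_AFTER = timedelta(minutes=15)
CTO_AFTER = timedelta(minutes=30)


@dataclass
class Application:
    """Hypothetical ownership record; field names are illustrative."""
    name: str
    owner: str
    tech_lead: str
    cto: str


@dataclass
class Incident:
    """Hypothetical incident record tied to one application."""
    app: Application
    opened_at: datetime
    resolved_at: Optional[datetime] = None


def recipients(incident: Incident, now: datetime) -> List[str]:
    """Return everyone who should have been notified by `now`."""
    if incident.resolved_at is not None and incident.resolved_at <= now:
        return []  # resolved incidents stop escalating
    age = now - incident.opened_at
    people = [incident.app.owner]
    if age >= LEAD_AFTER:
        people.append(incident.app.tech_lead)
    if age >= CTO_AFTER:
        people.append(incident.app.cto)
    return people
```

Keeping the ownership chain on the application record, rather than on each rule, is one way to let an HR sync update a single place when people change roles; whether Dolphin is organized this way is not stated in the article.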

Architecture Overview

The overall architecture diagram (see image) shows data collection, Prometheus storage, rule evaluation, and multi‑channel notification components.

Additional diagrams illustrate rule management, silence handling, and robot integration.
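
To make the rule‑evaluation and notification stages more concrete, here is a minimal Python sketch, not Dolphin's actual code. It assumes the caller has already fetched the JSON body of a Prometheus instant query (`/api/v1/query`) and wants to turn every series above a threshold into a text message for an Enterprise WeChat group robot. The function names, the threshold, and the label names are illustrative assumptions. The payload fields follow the group‑robot text message format as commonly documented, so they should be checked against the official reference before use.

```python
from typing import Any, Dict, List


def firing_series(response: Dict[str, Any], threshold: float) -> List[Dict[str, str]]:
    """Return the label sets of series whose current value exceeds `threshold`.

    `response` is the decoded JSON body of a Prometheus instant query.
    Each vector sample looks like:
    {"metric": {<labels>}, "value": [<timestamp>, "<value as string>"]}
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector")
    return [
        sample["metric"]
        for sample in data["result"]
        if float(sample["value"][1]) > threshold
    ]


def wecom_text_payload(rule_name: str, labels: Dict[str, str],
                       mention_user_ids: List[str]) -> Dict[str, Any]:
    """Build a text message body for a group robot webhook (assumed format)."""
    label_text = ", ".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return {
        "msgtype": "text",
        "text": {
            "content": f"[ALERT] {rule_name} firing for {label_text}",
            # User IDs to @mention in the group chat.
            "mentioned_list": list(mention_user_ids),
        },
    }
```

The sketch deliberately stops before any network call: fetching the query result and posting the payload depend on endpoints and credentials that the article does not describe.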

Challenges After Launch

The initial release lacked silencing for non‑critical alerts, so a configurable quiet period had to be added quickly after launch.

Alert reachability varied across applications; a generic rule set for common middleware (MQ, Redis, MySQL) was introduced to improve coverage.

Further work is needed to enrich alert payloads with root‑cause information, such as Sentry screenshots or system‑portrait pages.
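
A quiet period with automatic release, as described in this article, could be modeled roughly as follows. This is a sketch under assumptions: the `SilenceBook` class, its method names, and the 24‑hour cap are invented for illustration, and the article does not say how Dolphin stores silences or what timeout it uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict

# Assumed upper bound on any silence; the article gives no concrete value.
MAX_SILENCE = timedelta(hours=24)


@dataclass(frozen=True)
class Silence:
    rule_id: str
    starts_at: datetime
    ends_at: datetime


class SilenceBook:
    """In-memory registry of active silences, keyed by rule ID."""

    def __init__(self) -> None:
        self._by_rule: Dict[str, Silence] = {}

    def silence(self, rule_id: str, now: datetime, duration: timedelta) -> Silence:
        """Silence a rule, clamping the duration so it always expires."""
        duration = min(duration, MAX_SILENCE)
        entry = Silence(rule_id, now, now + duration)
        self._by_rule[rule_id] = entry
        return entry

    def is_silenced(self, rule_id: str, now: datetime) -> bool:
        """Check a rule, releasing its silence automatically once it has expired."""
        entry = self._by_rule.get(rule_id)
        if entry is None:
            return False
        if now >= entry.ends_at:
            del self._by_rule[rule_id]  # automatic release
            return False
        return now >= entry.starts_at
```

Clamping the duration at creation time means a silence can never be open‑ended, which is one simple way to avoid forgotten silences.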

Future Plans

Analyze alert trends to detect degrading services.

Integrate with container orchestration (P8s, Kubernetes) for deeper visibility.

Enhance the WeChat robot with interactive capabilities.

Improve high availability and scalability as alert volume grows.

Open‑source the documentation and components where possible.
