
How 360’s Doraemon Transforms Prometheus Alerting with Dynamic Rules and Aggregation

360’s Doraemon monitoring platform extends Prometheus with a rule engine, alert gateway and web UI, offering dynamic rule loading, label‑based routing, aggregation, maintenance groups, and both Docker‑Compose and Kubernetes deployments, all open‑sourced on GitHub.

dbaplus Community

Background and Alertmanager Limitations

In the 360 Search Cloud Platform, all Prometheus instances are containerised and deployed in a federated architecture, collecting metrics from both containers and physical machines. The built‑in Alertmanager has several drawbacks:

Alert rules cannot be loaded dynamically; configuration files must be edited.

Prometheus configuration must be modified to link Alertmanager.

No support for escalation or dynamic on‑call groups.

Label matching cannot be changed without editing files.

Doraemon Architecture

The open‑source Doraemon alerting platform (GitHub: https://github.com/Qihoo360/doraemon) solves the above problems. Its core components are:

Rule Engine: pulls alert rules from the Alert Gateway, distributes them to multiple Prometheus servers for evaluation, and forwards generated alerts back to the Alert Gateway.

Alert Gateway: receives alerts from the Rule Engine and aggregates them according to defined policies before sending.

Web UI: provides operators with interfaces to create and manage alert rules, alert plans, and groups, to confirm alerts, and to view historical alerts.

Doraemon architecture diagram

Key Terminology

Alert rule – same concept as a Prometheus alert rule.

Data source – URL of a Prometheus server where the rule is evaluated.

Alert receiving group – static list of alert recipients.

On‑call group – dynamically fetched list of recipients.

Alert delay – time after a trigger before the alert is sent.

Alert period – interval between successive alerts for the same rule.

Alert plan – collection of one or more alert strategies.

Alert method – internal users receive SMS/phone/internal‑messaging; external users receive HTTP POST webhook.

Alert strategy – combination of delay, period, time window, groups, and method.

Alert confirmation – temporary pause of alerts.

Maintenance group – silences alerts from specified machines during defined windows.

Creating an Alert Plan

An alert plan requires only a name and description. Multiple alert strategies can be added under the plan. Example: route alerts with label idc=beijing to group op1, alerts with idc!=beijing to op2, and send alerts older than 60 minutes to a leader via phone.
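The routing in this example can be sketched in Python as follows. This is an illustration only, not Doraemon's code: the `Strategy` class, its field names, the `select_targets` helper, and the "sms" method used for op1 and op2 are assumptions. The article specifies only the label conditions, the groups, and the 60-minute phone escalation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Strategy:
    """One alert strategy inside a plan (hypothetical field names)."""
    matches: Callable[[Dict[str, str]], bool]  # label filter
    delay_minutes: int                         # alert delay before the strategy applies
    group: str                                 # receiving group
    method: str                                # notification method (assumed values)


def select_targets(plan: List[Strategy], labels: Dict[str, str],
                   firing_minutes: int) -> List[Tuple[str, str]]:
    """Return (group, method) for every strategy whose delay and filter both match."""
    return [
        (s.group, s.method)
        for s in plan
        if firing_minutes >= s.delay_minutes and s.matches(labels)
    ]


# The plan from the example: idc=beijing -> op1, idc!=beijing -> op2,
# anything still firing after 60 minutes -> leader by phone.
plan = [
    Strategy(lambda l: l.get("idc") == "beijing", 0, "op1", "sms"),
    Strategy(lambda l: l.get("idc") != "beijing", 0, "op2", "sms"),
    Strategy(lambda l: True, 60, "leader", "phone"),
]
```

With this sketch, an alert labelled idc=beijing that has fired for 5 minutes goes only to op1, while an alert from any other IDC that has fired for 61 minutes goes to op2 and is also escalated to the leader.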

Alert plan configuration

Adding an Alert Rule

The rule‑creation form mirrors the fields of a Prometheus alerting rule:

expr – metric expression.

Threshold – value that triggers the alert.

for – duration the condition must hold.

Summary – short title.

Description – detailed message.

Data source – target Prometheus server.

Strategy – associated alert plan.

Add alert rule UI

Alert Aggregation

Alerts are aggregated at the rule level. For a rule with a 5‑minute period, all alerts triggered within each 5‑minute window are combined and sent once, reducing noise and simplifying root‑cause analysis.
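A minimal sketch of rule-level aggregation, assuming each alert is a `(rule_id, minute, message)` tuple and that windows are aligned to minute zero. Both the tuple layout and the alignment are assumptions made for illustration; Doraemon may anchor its windows differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate(alerts: Iterable[Tuple[int, int, str]],
              period_minutes: int) -> Dict[Tuple[int, int], List[str]]:
    """Group alerts into one batch per rule per period-sized window."""
    batches = defaultdict(list)
    for rule_id, minute, message in alerts:
        window = minute // period_minutes  # index of the window this alert falls in
        batches[(rule_id, window)].append(message)
    return dict(batches)


alerts = [(1, 0, "a"), (1, 4, "b"), (1, 5, "c"), (2, 3, "d")]
batches = aggregate(alerts, 5)
```

Here the two alerts for rule 1 at minutes 0 and 4 end up in the same batch and would be sent as a single notification, while the alert at minute 5 starts the next window.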

Alert Confirmation

Rule‑level confirmation via a link in the alert message – confirms only alerts that satisfy the current strategy.

Label‑level bulk confirmation – confirms all alerts sharing a specific label (e.g., instance=10.0.0.1:9090).

Rule‑level confirmation
Label‑level confirmation

Maintenance Group

A maintenance group silences alerts from machines under maintenance. Required fields include time window, month, dates, validity period, and a list of machine identifiers (usually IPs without ports).
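The check a maintenance group implies can be sketched as below. The `MaintenanceGroup` class, its field names, and the helpers `host_of` and `is_silenced` are hypothetical; the sketch also assumes IPv4 addresses or hostnames and a time window that does not cross midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Set


@dataclass
class MaintenanceGroup:
    """Hypothetical model of the fields listed above."""
    start: time            # daily window start
    end: time              # daily window end (same day)
    months: Set[int]       # months in which the window applies
    days: Set[int]         # days of the month in which it applies
    valid_until: datetime  # end of the validity period
    machines: Set[str]     # machine identifiers, without ports


def host_of(instance: str) -> str:
    """Strip an optional ':port' suffix from an instance label."""
    return instance.rsplit(":", 1)[0] if ":" in instance else instance


def is_silenced(group: MaintenanceGroup, instance: str, now: datetime) -> bool:
    """True if the alert's instance is under maintenance at time 'now'."""
    return (
        now <= group.valid_until
        and now.month in group.months
        and now.day in group.days
        and group.start <= now.time() <= group.end
        and host_of(instance) in group.machines
    )


group = MaintenanceGroup(time(1, 0), time(3, 0), {6}, {15},
                         datetime(2030, 1, 1), {"10.0.0.1"})
```

Because machines are listed without ports, the port is stripped from the alert's instance label before the lookup, so 10.0.0.1:9100 matches the entry 10.0.0.1.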

Maintenance group UI

Historical Alerts

The “Historical Alerts” page displays all past alerts, including those that recovered before reaching an aggregation point.

Historical alerts view

Quick Deployment

Docker‑Compose (local testing)

Clone the repository from GitHub.

Edit deployments/docker-compose/conf/config.js to replace localhost with the host IP or domain.

Run docker-compose up -d and access the UI at http://<host>:32000.

Kubernetes (production)

Clone the repository from GitHub.

Modify deployments/kubernetes/doraemon.yml to set MySQL credentials and replace the nodeip in the doraemon-ui ConfigMap with a cluster node IP.

Apply the manifest: kubectl apply -f deployments/kubernetes/doraemon.yml. Access the UI at http://<nodeip>:32000.

Alert Recovery Aggregation Scheme

Two counters are maintained per rule:

count – number of aggregation cycles elapsed.

rulecount – per‑rule counter stored in a map keyed by (ruleId, start), where start is the alert delay.

Recovery messages are sent based on three cases:

If count - start >= period, the alert has already been sent, so a recovery must be sent.

If 0 <= count - start < period, a recovery is sent when (rulecount - (count - start)) % period == 0.

If count - start < 0, the alert was never sent, so no recovery is emitted.
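The three cases translate directly into a small decision function. The sketch below is a literal transcription of the conditions as stated, assuming all four values are integers in the same unit (aggregation cycles); the function name `should_send_recovery` is hypothetical and this is not Doraemon's actual code.

```python
def should_send_recovery(count: int, start: int, period: int, rulecount: int) -> bool:
    """Decide whether a recovery message is emitted for a recovered alert."""
    elapsed = count - start
    if elapsed < 0:
        # The alert delay has not passed, so the alert was never sent.
        return False
    if elapsed >= period:
        # At least one full period has passed, so the alert was already sent.
        return True
    # Within the first period: send only when the per-rule counter lines up.
    return (rulecount - elapsed) % period == 0
```

For example, with start = 2 and period = 5, a recovery at count = 10 is always sent, one at count = 1 is never sent, and one at count = 4 is sent only if rulecount - 2 is a multiple of 5.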

Recovery aggregation flow

Tag Matching and Filtering

Users can write filter expressions such as host!=H1 & (idc=SHYC | idc=ZZZC) | port=80. The backend validates the expression by converting it to Reverse Polish Notation (postfix). Invalid expressions return an error.

Example conversion: host!=H1 idc=SHYC idc=ZZZC | & port=80 |

The conversion runs in O(n·log n) time; with additional space it can be reduced to O(n).
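One common way to perform such a conversion is the shunting-yard algorithm. The sketch below is a generic version, not Doraemon's implementation: it assumes that & binds tighter than |, that both are left-associative, and that each operand has the form key=value or key!=value. The name `to_rpn` is hypothetical. It makes a single pass over the tokens with an operator stack and raises `ValueError` for malformed input.

```python
import re
from typing import List

PRECEDENCE = {"|": 1, "&": 2}  # assumed: & binds tighter than |
MATCHER = re.compile(r"[^=!]+!?=[^=!]+")  # key=value or key!=value


def to_rpn(expr: str) -> List[str]:
    """Convert an infix label filter to postfix, validating it along the way."""
    output: List[str] = []
    stack: List[str] = []
    expect_operand = True  # True when the next token must be a matcher or "("
    for tok in re.findall(r"[()&|]|[^\s()&|]+", expr):
        if tok in PRECEDENCE:
            if expect_operand:
                raise ValueError(f"unexpected operator {tok!r}")
            # Pop operators of equal or higher precedence (left-associative).
            while stack and stack[-1] != "(" and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
            expect_operand = True
        elif tok == "(":
            if not expect_operand:
                raise ValueError("unexpected '('")
            stack.append(tok)
        elif tok == ")":
            if expect_operand:
                raise ValueError("unexpected ')'")
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced ')'")
            stack.pop()  # discard the matching "("
        else:
            if not expect_operand or not MATCHER.fullmatch(tok):
                raise ValueError(f"invalid matcher {tok!r}")
            output.append(tok)
            expect_operand = False
    if expect_operand:
        raise ValueError("expression is empty or ends with an operator")
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced '('")
        output.append(op)
    return output


rpn = to_rpn("host!=H1 & (idc=SHYC | idc=ZZZC) | port=80")
```

Joining `rpn` with spaces reproduces the postfix form shown in the example conversion. An expression such as `host!=H1 &` or `(idc=SHYC` raises `ValueError`, which corresponds to the backend rejecting an invalid filter.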

Tag expression parsing

Selected Q&A (Technical Highlights)

Can aggregation be disabled? Not supported; alerts are always aggregated.

Maximum delay introduced by aggregation? With a period of t minutes, the worst‑case delay is t - 1 minutes.

Is whole‑datacenter failure collapse supported? No, alert convergence is not implemented.

Do I need to configure each metric per host? No, a single metric configuration can apply to all hosts.

Is there a UI to edit Prometheus config files? No.

How does Doraemon load dynamic rules? By modifying Prometheus’ rule manager module.

Can I set per‑container memory thresholds? Yes, by using different labels for each rule.

Is automatic fault remediation available? Not currently; custom scripts can be triggered manually.
