How 360’s Doraemon Transforms Prometheus Alerting with Dynamic Rules and Aggregation
360’s Doraemon monitoring platform extends Prometheus with a rule engine, alert gateway and web UI, offering dynamic rule loading, label‑based routing, aggregation, maintenance groups, and both Docker‑Compose and Kubernetes deployments, all open‑sourced on GitHub.
Background and Alertmanager Limitations
In the 360 Search Cloud Platform all Prometheus instances are containerised and deployed in a federated architecture, collecting metrics from both containers and physical machines. The built‑in Alertmanager has several drawbacks:
Alert rules cannot be loaded dynamically; configuration files must be edited.
Prometheus configuration must be modified to link Alertmanager.
No support for escalation or dynamic on‑call groups.
Label matching cannot be changed without editing files.
Doraemon Architecture
The open‑source Doraemon alerting platform (GitHub: https://github.com/Qihoo360/doraemon) solves the above problems. Its core components are:
Rule Engine: pulls alert rules from the Alert Gateway, distributes them to multiple Prometheus servers for evaluation, and forwards generated alerts back to the Alert Gateway.
Alert Gateway: receives alerts from the Rule Engine and aggregates them according to defined policies before sending.
Web UI: lets operators create and manage alert rules, alert plans, and groups, confirm alerts, and view historical alerts.
Key Terminology
Alert rule – same concept as a Prometheus alert rule.
Data source – URL of a Prometheus server where the rule is evaluated.
Alert receiving group – static list of alert recipients.
On‑call group – dynamically fetched list of recipients.
Alert delay – time after a trigger before the alert is sent.
Alert period – interval between successive alerts for the same rule.
Alert plan – collection of one or more alert strategies.
Alert method – internal users receive SMS/phone/internal‑messaging; external users receive HTTP POST webhook.
Alert strategy – combination of delay, period, time window, groups, and method.
Alert confirmation – temporary pause of alerts.
Maintenance group – silences alerts from specified machines during defined windows.
Creating an Alert Plan
An alert plan requires only a name and description. Multiple alert strategies can be added under the plan. Example: route alerts with label idc=beijing to group op1, alerts with idc!=beijing to op2, and send alerts older than 60 minutes to a leader via phone.
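The example plan above can be sketched as a list of strategies, each with a label filter, a delay, a receiving group, and a method. This is an illustrative model only, not Doraemon's actual data structures: the Strategy class, the route function, and the "sms" method for op1 and op2 are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Labels = Dict[str, str]


@dataclass
class Strategy:
    """Hypothetical model of one alert strategy inside a plan."""
    name: str
    matches: Callable[[Labels], bool]  # label filter for this strategy
    delay_minutes: int                 # alert must have been firing at least this long
    group: str                         # receiving group
    method: str                        # delivery method (assumed values)


def route(labels: Labels, firing_minutes: int, plan: List[Strategy]) -> List[Strategy]:
    """Return every strategy in the plan that applies to this alert."""
    return [
        s for s in plan
        if firing_minutes >= s.delay_minutes and s.matches(labels)
    ]


# The plan from the text: idc=beijing -> op1, idc!=beijing -> op2,
# and anything firing for 60 minutes or more also goes to a leader by phone.
plan = [
    Strategy("beijing", lambda l: l.get("idc") == "beijing", 0, "op1", "sms"),
    Strategy("other", lambda l: l.get("idc") != "beijing", 0, "op2", "sms"),
    Strategy("escalate", lambda l: True, 60, "leader", "phone"),
]

print([s.group for s in route({"idc": "shanghai"}, 61, plan)])
```

Because strategies are evaluated independently, a long-running alert reaches both its regular group and the escalation target.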
Adding an Alert Rule
The rule‑creation form mirrors Prometheus alerting rule fields:
expr – metric expression.
Threshold – value that triggers the alert.
for – duration the condition must hold.
Summary – short title.
Description – detailed message.
Data source – target Prometheus server.
Strategy – associated alert plan.
Alert Aggregation
Alerts are aggregated at the rule level. For a rule with a 5‑minute period, all alerts triggered within each 5‑minute window are combined and sent once, reducing noise and simplifying root‑cause analysis.
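A minimal sketch of rule-level aggregation, assuming alerts arrive as (rule_id, fired_at_minute, message) tuples and that windows are aligned to multiples of the period. The tuple layout and the aggregate function are hypothetical; the Alert Gateway's real implementation may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Tuple[int, int, str]  # (rule_id, fired_at_minute, message), assumed shape


def aggregate(alerts: List[Alert], period_minutes: int) -> Dict[Tuple[int, int], List[str]]:
    """Group alert messages by rule and by the period window they fired in."""
    batches: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for rule_id, fired_at, message in alerts:
        window = fired_at // period_minutes  # index of the window this alert falls into
        batches[(rule_id, window)].append(message)
    return dict(batches)


alerts = [(1, 0, "a"), (1, 4, "b"), (1, 5, "c"), (2, 3, "d")]
print(aggregate(alerts, 5))
```

Each key in the result corresponds to one outgoing notification, so the two alerts for rule 1 in the first window are sent together.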
Alert Confirmation
Rule‑level confirmation via a link in the alert message – confirms only alerts that satisfy the current strategy.
Label‑level bulk confirmation – confirms all alerts sharing a specific label (e.g., instance=10.0.0.1:9090).
Maintenance Group
Used to silence alerts from machines under maintenance. Required fields include time window, month, dates, validity period, and a list of machine identifiers (usually IPs without ports).
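The maintenance check can be sketched as follows. The MaintenanceGroup fields mirror the form fields listed above, but their names and types are assumptions, as is the rule that the port is stripped from the instance label before comparison. The sketch handles IPv4 addresses and hostnames only and does not cover windows that cross midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Set


@dataclass
class MaintenanceGroup:
    """Hypothetical model of a maintenance group."""
    start: time            # daily window start
    end: time              # daily window end
    months: Set[int]       # months in which the window applies
    days: Set[int]         # days of the month in which the window applies
    valid_until: datetime  # end of the validity period
    machines: Set[str]     # machine identifiers, IPs without ports


def host_of(instance: str) -> str:
    """Strip a trailing :port from an instance label such as 10.0.0.1:9090."""
    return instance.rsplit(":", 1)[0] if ":" in instance else instance


def is_silenced(group: MaintenanceGroup, instance: str, now: datetime) -> bool:
    """True when the alert's machine is under maintenance at the given time."""
    return (
        now <= group.valid_until
        and now.month in group.months
        and now.day in group.days
        and group.start <= now.time() <= group.end
        and host_of(instance) in group.machines
    )


group = MaintenanceGroup(time(1, 0), time(3, 0), {6}, {15},
                         datetime(2030, 1, 1), {"10.0.0.1"})
print(is_silenced(group, "10.0.0.1:9090", datetime(2025, 6, 15, 2, 0)))
```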
Historical Alerts
The “Historical Alerts” page displays all past alerts, including those that recovered before reaching an aggregation point.
Quick Deployment
Docker‑Compose (local testing)
Clone the repository from GitHub.
Edit deployments/docker-compose/conf/config.js to replace localhost with the host IP or domain.
Run docker-compose up -d and access the UI at http://<host>:32000.
Kubernetes (production)
Clone the repository from GitHub.
Modify deployments/kubernetes/doraemon.yml to set MySQL credentials and replace the nodeip in the doraemon-ui ConfigMap with a cluster node IP.
Apply the manifest: kubectl apply -f deployments/kubernetes/doraemon.yml. Access the UI at http://<nodeip>:32000.
Alert Recovery Aggregation Scheme
Two counters are maintained per rule:
count – number of aggregation cycles elapsed.
rulecount – per‑rule counter stored in a map keyed by (ruleId, start), where start is the alert delay.
Recovery messages are sent based on three cases:
If count‑start >= period, the alert has already been sent; a recovery must be sent.
If 0 <= count‑start < period, a recovery is sent when (rulecount‑(count‑start)) % period == 0.
If count‑start < 0, the alert was never sent, so no recovery is emitted.
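The three cases translate directly into a small decision function. This is a literal transcription of the conditions above, not code taken from Doraemon; the function name and the assumption that all four values are integers in the same unit are mine.

```python
def should_send_recovery(count: int, start: int, rulecount: int, period: int) -> bool:
    """Decide whether a recovery message is emitted, following the three cases."""
    elapsed = count - start
    if elapsed < 0:
        # The delay has not passed, so the alert was never sent.
        return False
    if elapsed >= period:
        # The alert has already gone out at least once.
        return True
    # Within the first period after the delay: send only on a period boundary.
    # Python's % always yields a non-negative result for a positive period.
    return (rulecount - elapsed) % period == 0


print(should_send_recovery(count=4, start=2, rulecount=7, period=5))
```

With count=4, start=2, rulecount=7 and period=5, the elapsed value is 2 and (7 - 2) % 5 is 0, so the second case applies and a recovery is sent.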
Tag Matching and Filtering
Users can write filter expressions such as host!=H1 & (idc=SHYC | idc=ZZZC) | port=80. The backend validates the expression by converting it to Reverse Polish Notation (postfix). Invalid expressions return an error.
Example conversion: host!=H1 idc=SHYC idc=ZZZC | & port=80 |
The conversion runs in O(n·log n) time; with additional space it can be reduced to O(n).
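One common way to do this conversion is the shunting-yard algorithm, sketched below. The source does not say which algorithm Doraemon's backend uses, so treat this as an illustration. It assumes that & binds more tightly than |, which is consistent with the example but not confirmed by it, and that operands have the form key=value or key!=value. The names to_rpn, PRECEDENCE, and OPERAND are hypothetical.

```python
import re
from typing import List

PRECEDENCE = {"|": 1, "&": 2}              # assumed: & binds tighter than |
OPERAND = re.compile(r"[^=!]+!?=[^=!]+")   # key=value or key!=value


def to_rpn(expr: str) -> List[str]:
    """Convert a label filter expression to postfix, raising ValueError if invalid."""
    output: List[str] = []
    stack: List[str] = []
    expect_operand = True  # True when the next token must be an operand or "("
    for tok in re.findall(r"[()&|]|[^\s()&|]+", expr):
        if tok == "(":
            if not expect_operand:
                raise ValueError("unexpected '('")
            stack.append(tok)
        elif tok == ")":
            if expect_operand:
                raise ValueError("unexpected ')'")
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced ')'")
            stack.pop()  # discard the matching "("
        elif tok in PRECEDENCE:
            if expect_operand:
                raise ValueError(f"unexpected operator {tok!r}")
            # Pop operators of equal or higher precedence (left-associative).
            while stack and stack[-1] != "(" and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
            expect_operand = True
        else:
            if not expect_operand or not OPERAND.fullmatch(tok):
                raise ValueError(f"invalid operand {tok!r}")
            output.append(tok)
            expect_operand = False
    if expect_operand:
        raise ValueError("expression is empty or ends with an operator")
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced '('")
        output.append(op)
    return output


print(" ".join(to_rpn("host!=H1 & (idc=SHYC | idc=ZZZC) | port=80")))
```

Each token is pushed and popped at most once, so this sketch is linear in the number of tokens at the cost of an operator stack.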
Selected Q&A (Technical Highlights)
Can aggregation be disabled? Not supported; alerts are always aggregated.
Maximum delay introduced by aggregation? With a period of t minutes, the worst‑case delay is t‑1 minutes.
Is whole‑datacenter failure collapse supported? No, alert convergence is not implemented.
Do I need to configure each metric per host? No, a single metric configuration can apply to all hosts.
Is there a UI to edit Prometheus config files? No.
How does Doraemon load dynamic rules? By modifying Prometheus’ rule manager module.
Can I set per‑container memory thresholds? Yes, by using different labels for each rule.
Is automatic fault remediation available? Not currently; custom scripts can be triggered manually.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
