Inside Prometheus Alerting Rules: How They’re Managed and Executed
This article explains Prometheus' custom Rule system, detailing the structure and components of alerting rules, the rule manager's loading and updating process, group scheduling, evaluation cycles, and the logic for generating, updating, and sending alerts, enabling advanced monitoring extensions.
What is a Rule
Prometheus supports user‑defined Rule configurations. Rules are of two types: Recording Rules, which pre‑compute complex PromQL queries for faster reuse, and Alerting Rules, which define conditions that trigger alerts when evaluated.
This article focuses on the analysis of alerting rules. An alerting rule lets you specify a PromQL expression as the trigger condition; Prometheus periodically evaluates the expression and sends a notification when the condition is met.
What is an Alerting Rule
Alerting is a core feature of Prometheus. Below is a typical alert rule definition:

groups:
- name: example
  rules:
  - alert: HighErrorRate
    # The metric must be > 0.5 for the last 10 minutes.
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency
      description: description info

An alert rule file groups related rules under a group. Each rule consists of:
alert: the rule name.
expr: a PromQL expression that determines when the alert fires.
for: an optional hold duration; the expression must stay true for this long before the alert moves from pending to firing and is sent.
labels: custom labels attached to the alert.
annotations: additional information (e.g., a summary or description) sent to Alertmanager along with the alert.
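
To make the effect of the for field concrete, here is a minimal sketch of the pending/firing decision. The function alertState and its string results are hypothetical names chosen for illustration; only the comparison against the hold duration reflects the behaviour described above.

```go
package main

import (
	"fmt"
	"time"
)

// alertState reports the state of an alert whose expression has been
// continuously true for activeFor, given the rule's `for` value (hold).
// Hypothetical helper: it only illustrates the hold-duration comparison.
func alertState(activeFor, hold time.Duration) string {
	if activeFor < hold {
		return "pending" // condition is true, but not yet for long enough
	}
	return "firing" // hold duration satisfied; the alert may be sent
}

func main() {
	hold := 10 * time.Minute // corresponds to `for: 10m`
	for _, d := range []time.Duration{0, 5 * time.Minute, 10 * time.Minute} {
		fmt.Printf("true for %v -> %s\n", d, alertState(d, hold))
	}
}
```

With the hold set to zero (no for field), the sketch reports firing as soon as the expression is true.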
Rule Manager
The manager loads rule files, parses them into Group objects, and coordinates evaluation. A simplified manager struct:

type Manager struct {
    opts     *ManagerOptions   // external dependencies (storage, notify, etc.)
    groups   map[string]*Group // current rule groups
    mtx      sync.RWMutex      // protects groups
    block    chan struct{}
    done     chan struct{}
    restored bool
    logger   log.Logger
}

Key fields:
opts: holds references to storage, notification modules, etc.
groups: maps a group identifier to its Group instance.
mtx: read‑write lock for concurrent access.
Loading Rule Groups
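When the Prometheus server starts (and again whenever its configuration is reloaded), Manager.Update() is called to load and parse rule files:
Calls Manager.LoadGroups() to obtain a set of Group objects.
Keeps groups whose definition is unchanged, stops old groups that changed or were removed, and starts new ones, launching a goroutine for each started group to evaluate its PromQL queries.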
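
Before the full function, the keep/start/stop decision can be sketched on its own. This is a simplified model, not the Prometheus implementation: the group type, its hash field (standing in for an equality check between group definitions), and the reconcile function are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// group is a hypothetical stand-in for a rule group: a key plus a
// fingerprint of its definition, used here in place of an equality check.
type group struct {
	key  string
	hash string
}

// reconcile compares the running groups with freshly loaded ones and reports
// which keys are kept untouched, which are (re)started, and which are stopped.
func reconcile(old, loaded map[string]group) (kept, started, stopped []string) {
	for k, ng := range loaded {
		if og, ok := old[k]; ok && og.hash == ng.hash {
			kept = append(kept, k) // unchanged: keep the running instance
		} else {
			started = append(started, k) // new or modified: start a fresh one
		}
	}
	for k := range old {
		if _, ok := loaded[k]; !ok {
			stopped = append(stopped, k) // no longer defined: stop it
		}
	}
	sort.Strings(kept)
	sort.Strings(started)
	sort.Strings(stopped)
	return kept, started, stopped
}

func main() {
	old := map[string]group{
		"rules.yml;api": {key: "rules.yml;api", hash: "v1"},
		"rules.yml;db":  {key: "rules.yml;db", hash: "v1"},
		"rules.yml;old": {key: "rules.yml;old", hash: "v1"},
	}
	loaded := map[string]group{
		"rules.yml;api": {key: "rules.yml;api", hash: "v1"},
		"rules.yml;db":  {key: "rules.yml;db", hash: "v2"},
		"rules.yml;new": {key: "rules.yml;new", hash: "v1"},
	}
	kept, started, stopped := reconcile(old, loaded)
	fmt.Println("kept:", kept)
	fmt.Println("started:", started)
	fmt.Println("stopped:", stopped)
}
```

Reusing the running instance for unchanged groups is what lets alert state survive a reload without interruption.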

func (m *Manager) Update(interval time.Duration, files []string, externalLabels labels.Labels, externalURL string) error {
    m.mtx.Lock()
    defer m.mtx.Unlock()

    groups, errs := m.LoadGroups(interval, externalLabels, externalURL, files...)
    if errs != nil {
        for _, e := range errs {
            level.Error(m.logger).Log("msg", "loading groups failed", "err", e)
        }
        return errors.New("error loading rules, previous rule set restored")
    }
    m.restored = true

    var wg sync.WaitGroup
    for _, newg := range groups {
        gn := GroupKey(newg.file, newg.name)
        oldg, ok := m.groups[gn]
        delete(m.groups, gn)

        // An unchanged group keeps running; reuse the old instance.
        if ok && oldg.Equals(newg) {
            groups[gn] = oldg
            continue
        }

        wg.Add(1)
        go func(newg *Group) {
            if ok {
                // Stop the old group and carry its state over to the new one.
                oldg.stop()
                newg.CopyState(oldg)
            }
            wg.Done()
            // Wait until the manager is unblocked before evaluating.
            <-m.block
            newg.run(m.opts.Context)
        }(newg)
    }

    // stop remaining old groups
    wg.Add(len(m.groups))
    for n, oldg := range m.groups {
        go func(n string, g *Group) {
            g.markStale = true
            g.stop()
            // Drop the per-group metric series of the removed group.
            if m := g.metrics; m != nil {
                m.IterationsMissed.DeleteLabelValues(n)
                m.IterationsScheduled.DeleteLabelValues(n)
                m.EvalTotal.DeleteLabelValues(n)
                m.EvalFailures.DeleteLabelValues(n)
                m.GroupInterval.DeleteLabelValues(n)
                m.GroupLastEvalTime.DeleteLabelValues(n)
                m.GroupLastDuration.DeleteLabelValues(n)
                m.GroupRules.DeleteLabelValues(n)
                m.GroupSamples.DeleteLabelValues(n)
            }
            wg.Done()
        }(n, oldg)
    }

    wg.Wait()
    m.groups = groups
    return nil
}

Running a Rule Group
Each Group runs a loop with a ticker based on g.interval (default 1 minute, configurable via global.evaluation_interval). The loop calls g.Eval() to evaluate all rules in the group.
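
The missed-interval bookkeeping inside that loop comes down to integer division on durations. The sketch below isolates it; missedIterations and nextAdvance are hypothetical helper names, and clamping negative results to zero is a simplification of this sketch rather than something taken from the source.

```go
package main

import (
	"fmt"
	"time"
)

// missedIterations returns how many whole evaluation intervals were skipped
// when a tick arrives `elapsed` after the previous evaluation timestamp.
// Negative results are clamped to zero (an assumption of this sketch).
func missedIterations(elapsed, interval time.Duration) int64 {
	missed := int64(elapsed/interval) - 1
	if missed < 0 {
		return 0
	}
	return missed
}

// nextAdvance returns how far the evaluation timestamp moves forward so that
// it stays aligned to the interval grid even after skipped iterations.
func nextAdvance(elapsed, interval time.Duration) time.Duration {
	return time.Duration(missedIterations(elapsed, interval)+1) * interval
}

func main() {
	interval := time.Minute
	elapsed := 3*time.Minute + 30*time.Second // the tick arrived late
	fmt.Println("missed:", missedIterations(elapsed, interval))
	fmt.Println("advance by:", nextAdvance(elapsed, interval))
}
```

Advancing by whole intervals instead of using the wall clock keeps evaluation timestamps evenly spaced, which matters for hold durations and for series written by recording rules.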

func (g *Group) run(ctx context.Context) {
    defer close(g.terminated)

    // Wait for the first aligned evaluation slot, or exit if stopped.
    evalTimestamp := g.EvalTimestamp(time.Now().UnixNano()).Add(g.interval)
    select {
    case <-time.After(time.Until(evalTimestamp)):
    case <-g.done:
        return
    }

    ctx = promql.NewOriginContext(ctx, map[string]interface{}{"ruleGroup": map[string]string{"file": g.File(), "name": g.Name()}})

    // iter runs one evaluation and records its timing.
    iter := func() {
        g.metrics.IterationsScheduled.WithLabelValues(GroupKey(g.file, g.name)).Inc()
        start := time.Now()
        g.Eval(ctx, evalTimestamp)
        g.metrics.IterationDuration.Observe(time.Since(start).Seconds())
        g.setEvaluationTime(time.Since(start))
        g.setLastEvaluation(start)
    }

    tick := time.NewTicker(g.interval)
    defer tick.Stop()

    // initial evaluation
    iter()

    for {
        select {
        case <-g.done:
            return
        case <-tick.C:
            // handle missed intervals
            missed := (time.Since(evalTimestamp) / g.interval) - 1
            if missed > 0 {
                g.metrics.IterationsMissed.WithLabelValues(GroupKey(g.file, g.name)).Add(float64(missed))
                g.metrics.IterationsScheduled.WithLabelValues(GroupKey(g.file, g.name)).Add(float64(missed))
            }
            evalTimestamp = evalTimestamp.Add((missed + 1) * g.interval)
            iter()
        }
    }
}

Evaluating Individual Rules
During Group.Eval(), each rule is evaluated via the provided QueryFunc. For AlertingRule instances, the resulting alerts are sent through the configured NotifyFunc. Recording rules store their results back into the TSDB.

func (g *Group) Eval(ctx context.Context, ts time.Time) {
    var samplesTotal float64
    for i, rule := range g.rules {
        // Stop early if the group has been shut down.
        select {
        case <-g.done:
            return
        default:
        }

        // evaluate rule
        vector, err := rule.Eval(ctx, ts, g.opts.QueryFunc, g.opts.ExternalURL)
        if err != nil {
            rule.SetHealth(HealthBad)
            rule.SetLastError(err)
            g.metrics.EvalFailures.WithLabelValues(GroupKey(g.File(), g.Name())).Inc()
            continue
        }
        samplesTotal += float64(len(vector))

        if ar, ok := rule.(*AlertingRule); ok {
            ar.sendAlerts(ctx, ts, g.opts.ResendDelay, g.interval, g.opts.NotifyFunc)
        }
        // handling of RecordingRule results omitted for brevity
    }

    if g.metrics != nil {
        g.metrics.GroupSamples.WithLabelValues(GroupKey(g.File(), g.Name())).Set(samplesTotal)
    }
    g.cleanupStaleSeries(ctx, ts)
}

AlertingRule Structure and Lifecycle
The AlertingRule struct holds the rule name, expression, hold duration, labels, annotations, and runtime state such as active alerts, evaluation timestamps, and health.

type AlertingRule struct {
    name           string
    vector         parser.Expr   // the alert expression
    holdDuration   time.Duration // the rule's `for` value
    labels         labels.Labels
    annotations    labels.Labels
    externalLabels map[string]string
    restored       bool

    mtx                 sync.Mutex
    evaluationDuration  time.Duration
    evaluationTimestamp time.Time
    health              RuleHealth
    lastError           error
    active              map[uint64]*Alert // active alerts, keyed by label-set hash
    logger              log.Logger
}

During evaluation, the rule hashes each result’s label set to determine whether an alert already exists. New alerts are added to active and existing alerts are updated. Alerts whose series no longer appear in the result are removed right away if they were still pending; otherwise they are marked StateInactive and kept for a retention period, so the resolution can still be sent, before being removed.
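
The hashing step can be illustrated with a reduced model of the active map. Everything here is an assumption made for the sketch: the alert type, the FNV-1a fingerprint (Prometheus uses its own label hashing), and the fact that vanished alerts are deleted immediately rather than passing through an inactive state.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// alert is a hypothetical, reduced alert: a label set and the latest value.
type alert struct {
	labels map[string]string
	value  float64
}

// fingerprint hashes a label set into a map key. FNV-1a over the sorted
// label pairs is an assumption of this sketch.
func fingerprint(lbls map[string]string) uint64 {
	keys := make([]string, 0, len(lbls))
	for k := range lbls {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0xff}) // separator so ("ab","c") differs from ("a","bc")
		h.Write([]byte(lbls[k]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

// syncActive applies one evaluation result to the active map: unseen label
// sets are added, known ones are updated, and missing ones are dropped.
func syncActive(active map[uint64]*alert, result []alert) (added, updated, removed int) {
	seen := make(map[uint64]struct{}, len(result))
	for i := range result {
		fp := fingerprint(result[i].labels)
		seen[fp] = struct{}{}
		if a, ok := active[fp]; ok {
			a.value = result[i].value
			updated++
			continue
		}
		r := result[i]
		active[fp] = &r
		added++
	}
	for fp := range active {
		if _, ok := seen[fp]; !ok {
			delete(active, fp) // simplification: no inactive state or retention
			removed++
		}
	}
	return added, updated, removed
}

func main() {
	active := map[uint64]*alert{}
	fmt.Println(syncActive(active, []alert{
		{labels: map[string]string{"job": "api"}, value: 0.7},
		{labels: map[string]string{"job": "db"}, value: 0.9},
	}))
	fmt.Println(syncActive(active, []alert{
		{labels: map[string]string{"job": "api"}, value: 0.8},
	}))
}
```

Keying by a hash of the label set lets each evaluation match results to existing alerts in constant time, at the cost of having to handle the rare collision in a real implementation.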

func (r *AlertingRule) Eval(ctx context.Context, ts time.Time, query QueryFunc, externalURL *url.URL) (promql.Vector, error) {
    res, err := query(ctx, r.vector.String(), ts)
    if err != nil {
        r.SetHealth(HealthBad)
        r.SetLastError(err)
        return nil, err
    }
    // process result vector, update r.active map, manage state transitions
    // omitted for brevity
    return res, nil
}

Sending Alerts
After evaluation, AlertingRule.sendAlerts iterates over active alerts and sends those that need to be notified based on their state, the configured ResendDelay, and the rule’s evaluation interval.
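
As a worked example of the timing arithmetic in this step, the sketch below computes how long a sent alert stays valid and whether a resend is due. The helpers validFor and resendDue are hypothetical names, and the durations in main are illustrative values, not taken from the source.

```go
package main

import (
	"fmt"
	"time"
)

// validFor returns how long a sent alert is considered valid: four times the
// larger of the resend delay and the evaluation interval.
func validFor(resendDelay, interval time.Duration) time.Duration {
	delta := resendDelay
	if interval > resendDelay {
		delta = interval
	}
	return 4 * delta
}

// resendDue reports whether a firing alert should be sent again, given how
// long ago it was last sent. The comparison is strict.
func resendDue(sinceLastSent, resendDelay time.Duration) bool {
	return sinceLastSent > resendDelay
}

func main() {
	resendDelay := time.Minute

	// Short interval: the resend delay dominates.
	fmt.Println(validFor(resendDelay, 15*time.Second))
	// Long interval: the evaluation interval dominates.
	fmt.Println(validFor(resendDelay, 5*time.Minute))

	fmt.Println(resendDue(30*time.Second, resendDelay))
	fmt.Println(resendDue(90*time.Second, resendDelay))
}
```

The factor of four gives the receiver some slack: an alert stays valid across a few failed or delayed sends before it expires.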

func (r *AlertingRule) sendAlerts(ctx context.Context, ts time.Time, resendDelay, interval time.Duration, notifyFunc NotifyFunc) {
    alerts := []*Alert{}
    r.ForEachActiveAlert(func(alert *Alert) {
        if alert.needsSending(ts, resendDelay) {
            alert.LastSentAt = ts
            // Use the larger of resendDelay and interval to bound validity.
            delta := resendDelay
            if interval > resendDelay {
                delta = interval
            }
            alert.ValidUntil = ts.Add(4 * delta)
            // Send a copy so later state changes do not affect it.
            anew := *alert
            alerts = append(alerts, &anew)
        }
    })
    notifyFunc(ctx, r.vector.String(), alerts...)
}

func (a *Alert) needsSending(ts time.Time, resendDelay time.Duration) bool {
    // Pending alerts are never sent.
    if a.State == StatePending {
        return false
    }
    // The alert was resolved after the last send: send the resolution.
    if a.ResolvedAt.After(a.LastSentAt) {
        return true
    }
    // Otherwise resend only once the resend delay has elapsed.
    return a.LastSentAt.Add(resendDelay).Before(ts)
}

In summary, Prometheus evaluates alerting rules on a fixed interval, maintains active alert state, and dispatches notifications according to hold durations, resend delays, and state transitions. Understanding this flow enables developers to extend Prometheus, for example by loading rule groups dynamically from a database.
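
One possible shape for the database-backed extension mentioned above is to render stored rules into a rule file and pass its path to Manager.Update through the files argument. The sketch below covers only the rendering step. The ruleRow type and renderRuleFile function are hypothetical, rows are assumed to arrive sorted by group, and a real implementation would use a YAML library rather than string formatting.

```go
package main

import (
	"fmt"
	"strings"
)

// ruleRow is a hypothetical database record describing one alerting rule.
type ruleRow struct {
	Group    string
	Alert    string
	Expr     string
	For      string // optional hold duration, e.g. "10m"
	Severity string
}

// renderRuleFile turns rows into rule-file YAML. Consecutive rows with the
// same Group end up in the same group, so rows should be sorted by Group.
func renderRuleFile(rows []ruleRow) string {
	var b strings.Builder
	b.WriteString("groups:\n")
	last := ""
	for i, r := range rows {
		if i == 0 || r.Group != last {
			fmt.Fprintf(&b, "- name: %q\n  rules:\n", r.Group)
			last = r.Group
		}
		fmt.Fprintf(&b, "  - alert: %s\n", r.Alert)
		// Quote the expression so characters such as '{' or ':' are safe.
		fmt.Fprintf(&b, "    expr: %q\n", r.Expr)
		if r.For != "" {
			fmt.Fprintf(&b, "    for: %s\n", r.For)
		}
		fmt.Fprintf(&b, "    labels:\n      severity: %q\n", r.Severity)
	}
	return b.String()
}

func main() {
	rows := []ruleRow{
		{Group: "example", Alert: "HighErrorRate", Expr: `job:request_latency_seconds:mean5m{job="myjob"} > 0.5`, For: "10m", Severity: "page"},
	}
	fmt.Print(renderRuleFile(rows))
}
```

Going through a file keeps the extension outside the manager itself: the existing load, diff, and restart logic in Update is reused unchanged.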
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.