Operations 17 min read

How to Build an Automated Prometheus Inspection System with Go

This article explains how to design and implement an automated inspection platform that leverages Prometheus and Grafana for metric collection, splits inspection tasks, schedules them with cron, generates reports, sends WeChat notifications, and exports results to PDF, all using Go and the gin‑vue‑admin framework.

Ops Development Stories
Ops Development Stories
Ops Development Stories
How to Build an Automated Prometheus Inspection System with Go

Preface

Most companies use Prometheus + Grafana for metric monitoring, which generates a large amount of data in Prometheus. To ease daily operational checks, an automated inspection service based on Prometheus can be built to reduce manual workload.

Idea

The inspection functionality is divided into several modules for flexible management: Data source management: manage multiple Prometheus data sources and optionally add others such as ES. Inspection item management: each inspection item corresponds to a Prometheus rule, enabling reuse across clusters and data sources. Label management: combine Prometheus labels with inspection items for flexible composition. Task orchestration: define and schedule inspection tasks. Execution jobs: configure timed jobs composed of multiple orchestrated tasks. Inspection reports: view and export inspection results. Inspection notifications: send results to an enterprise WeChat group for quick awareness.

Effect

Data source management

(1) Add data source

(2) Data source list

Inspection item management

(1) Add inspection item

(2) Inspection item list

Label management

(1) Add label

(2) Label list

Task orchestration

(1) Create task orchestration

(2) Task list

Execution jobs

(1) Create execution job

(2) Job list

Inspection report

Each inspection generates a report that can be viewed in detail, exported to PDF, and optionally sent to an enterprise WeChat group.

Code Implementation

Most code follows standard CRUD patterns for managing data sources, inspection items, and labels. The core inspection flow is described below.

// CreateCronTask creates a scheduled task
func (inspectionExecutionJobService *InspectionExecutionJobService) CreateCronTask(job *AutoInspection.InspectionExecutionJob) error {
    cronName := fmt.Sprintf("InspectionExecution_%d", job.ID)
    taskName := fmt.Sprintf("InspectionExecution_%d", job.ID)
    if _, found := global.GVA_Timer.FindTask(cronName, taskName); found {
        global.GVA_Timer.Clear(cronName)
    }
    var option []cron.Option
    option = append(option, cron.WithSeconds())
    if _, err := global.GVA_Timer.AddTaskByFunc(cronName, job.CronExpr, func() {
        inspectionExecutionJobService.ExecuteInspectionJob(job)
    }, taskName, option...); err != nil {
        global.GVA_LOG.Error("创建定时任务失败", zap.Error(err), zap.Uint("jobID", job.ID))
        return err
    }
    nextTime := inspectionExecutionJobService.calculateNextRunTime(job.CronExpr)
    job.NextRunTime = &nextTime
    return global.GVA_DB.Model(job).Updates(map[string]interface{}{"next_run_time": job.NextRunTime}).Error
}

When a user creates an Execution Job and enables it, a cron task is scheduled. At execution time the ExecuteInspectionJob function runs:

func (inspectionExecutionJobService *InspectionExecutionJobService) ExecuteInspectionJob(job *AutoInspection.InspectionExecutionJob) {
    inspectionExecutionJobService.updateJobExecutionTime(job)
    jobExecution := inspectionExecutionJobService.createJobExecution(job)
    if jobExecution == nil {
        return
    }
    allResults := inspectionExecutionJobService.executeAllInspectionTasks(job, jobExecution)
    global.GVA_LOG.Info("执行完成", zap.Any("results", allResults))
    inspectionExecutionJobService.updateJobExecutionResult(jobExecution, allResults)
    if *job.IsNotice {
        inspectionExecutionJobService.sendInspectionNotification(job, jobExecution, allResults)
    }
}

The executeAllInspectionTasks function runs each inspection task concurrently using a wait group and mutex, collecting results via channels.

func (inspectionExecutionJobService *InspectionExecutionJobService) executeAllInspectionTasks(job *AutoInspection.InspectionExecutionJob, jobExecution *AutoInspection.JobExecution) []*result.ProductsResult {
    var wg sync.WaitGroup
    var mu sync.Mutex
    allResults := make([]*result.ProductsResult, 0)
    for _, jobID := range job.JobIds {
        wg.Add(1)
        go func(id uint) {
            defer wg.Done()
            result := inspectionExecutionJobService.executeSingleInspectionTask(id, jobExecution)
            if result != nil {
                mu.Lock()
                allResults = append(allResults, result)
                mu.Unlock()
            }
        }(jobID)
    }
    wg.Wait()
    return allResults
}

Each single task ultimately calls ExecuteInspectionTask, which selects the appropriate data‑source type. Currently only prometheus is supported:

func (inspectionExecutionJobService *InspectionExecutionJobService) ExecuteInspectionTask(inspectionJob *AutoInspection.InspectionJob, jobExecution *AutoInspection.JobExecution, resultCh chan *result.ProductsResult) {
    switch inspectionJob.DataSourceType {
    case "prometheus":
        inspectionExecutionJobService.ExecutePrometheusInspectionTask(inspectionJob, jobExecution, resultCh)
    }
}
ExecutePrometheusInspectionTask

builds a list of PrometheusRule objects from inspection items and labels, creates a Product that aggregates these rules, and runs them:

func (inspectionExecutionJobService *InspectionExecutionJobService) ExecutePrometheusInspectionTask(inspectionJob *AutoInspection.InspectionJob, jobExecution *AutoInspection.JobExecution, resultCh chan *result.ProductsResult) {
    // build Prometheus rules
    prometheusRules := make([]*product.PrometheusRule, 0, len(inspectionJob.ItemLabelMaps))
    for _, itemLabelMap := range inspectionJob.ItemLabelMaps {
        // fetch item and label, then construct rule
        // ... omitted for brevity ...
        prometheusRule := &product.PrometheusRule{...}
        prometheusRules = append(prometheusRules, prometheusRule)
    }
    prod := &product.Product{Name: inspectionJob.Name, Rules: product.Rules{Prometheus: prometheusRules}}
    defer func() {
        if r := recover(); r != nil {
            global.GVA_LOG.Error("执行巡检任务发生panic", zap.Any("panic", r), zap.String("jobName", inspectionJob.Name))
            // send failure result
        }
    }()
    if err := prod.Run(resultCh); err != nil {
        global.GVA_LOG.Error("执行巡检任务失败", zap.Error(err), zap.String("jobName", inspectionJob.Name))
        return
    }
    global.GVA_LOG.Info("巡检任务已启动", zap.String("jobName", inspectionJob.Name))
}

The PrometheusRule.Run method queries the Prometheus client, builds a RuleResult, and returns it.

func (r *PrometheusRule) Run() (*result.RuleResult, error) {
    ds, err := datasource.GetByName(r.DataSourceName)
    if err != nil {
        return nil, err
    }
    pds, ok := ds.(*datasource.PrometheusDataSource)
    if !ok {
        return nil, fmt.Errorf("数据源类型错误: %s 不是Prometheus数据源", r.DataSourceName)
    }
    if pds.Client == nil {
        return nil, fmt.Errorf("数据源为空: %s", r.DataSourceName)
    }
    res, err := pds.Run(r.Rule, r.LabelFilter)
    if err != nil {
        return nil, err
    }
    return r.buildRuleResult(res), nil
}

Notification is sent via the sendInspectionNotification function, which formats a markdown or text payload and posts it to an enterprise WeChat webhook.

func (inspectionExecutionJobService *InspectionExecutionJobService) sendInspectionNotification(job *AutoInspection.InspectionExecutionJob, jobExecution *AutoInspection.JobExecution, results []*result.ProductsResult) {
    // aggregate counts and abnormal items
    // build markdown or text content
    // send via sender.Sender
}

PDF export uses wkhtmltopdf to render an HTML template of the report and save the file.

func (jobExecutionService *JobExecutionService) GeneratePDF(jobExecution *AutoInspection.JobExecution) (string, error) {
    pdf, err := wkhtmltopdf.NewPDFGenerator()
    if err != nil {
        return "", err
    }
    // set options, render HTML, add page, create PDF, write to file
    return downloadURL, nil
}
Tip: The project uses the gin-vue-admin framework, so the built‑in timer component is leveraged for scheduling.
cloud nativeGoPrometheusOps AutomationAutomated Inspection
Ops Development Stories
Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.