How to Build an Automated Prometheus Inspection System with Go
This article explains how to design and implement an automated inspection platform in Go, built on the gin-vue-admin framework. The platform collects metrics from Prometheus and Grafana, splits inspection work into reusable tasks, schedules them with cron, generates reports, exports them to PDF, and pushes results to an enterprise WeChat group.
Preface
Most companies use Prometheus + Grafana for metric monitoring, which generates a large amount of data in Prometheus. To ease daily operational checks, an automated inspection service based on Prometheus can be built to reduce manual workload.
Idea
The inspection functionality is divided into several modules for flexible management:
- Data source management: manage multiple Prometheus data sources, with the option to add others such as Elasticsearch.
- Inspection item management: each inspection item corresponds to a Prometheus rule, enabling reuse across clusters and data sources.
- Label management: combine Prometheus labels with inspection items for flexible composition.
- Task orchestration: define and orchestrate inspection tasks.
- Execution jobs: configure timed jobs composed of multiple orchestrated tasks.
- Inspection reports: view and export inspection results.
- Inspection notifications: send results to an enterprise WeChat group for quick awareness.
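One way to picture how these modules compose is as a handful of entities: items carry reusable rules, labels scope them, orchestrated tasks bind both to a data source, and execution jobs bundle tasks under a cron schedule. The structs below are a hypothetical sketch of that shape; the field and type names are illustrative, not the project's actual models.

```go
package main

import "fmt"

// DataSource is one monitored backend, e.g. a Prometheus server.
type DataSource struct {
	ID   uint
	Name string
	Type string // e.g. "prometheus"
	URL  string
}

// InspectionItem holds one reusable rule (a PromQL expression).
type InspectionItem struct {
	ID   uint
	Name string
	Rule string
}

// Label narrows a rule to a cluster, namespace, etc.
type Label struct {
	ID    uint
	Key   string
	Value string
}

// InspectionJob (task orchestration) composes items and labels
// against one data source.
type InspectionJob struct {
	ID           uint
	Name         string
	DataSourceID uint
	ItemIDs      []uint
	LabelIDs     []uint
}

// ExecutionJob bundles orchestrated tasks under one cron schedule.
type ExecutionJob struct {
	ID       uint
	CronExpr string
	JobIDs   []uint
	IsNotice bool
}

func main() {
	job := ExecutionJob{ID: 1, CronExpr: "0 0 9 * * *", JobIDs: []uint{1, 2}, IsNotice: true}
	fmt.Println(job.CronExpr, len(job.JobIDs))
}
```

Because items, labels, and data sources are separate rows joined only at orchestration time, the same rule can be reused across clusters without duplication.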
Effect
Data source management
(1) Add data source
(2) Data source list
Inspection item management
(1) Add inspection item
(2) Inspection item list
Label management
(1) Add label
(2) Label list
Task orchestration
(1) Create task orchestration
(2) Task list
Execution jobs
(1) Create execution job
(2) Job list
Inspection report
Each inspection generates a report that can be viewed in detail, exported to PDF, and optionally sent to an enterprise WeChat group.
Code Implementation
Most code follows standard CRUD patterns for managing data sources, inspection items, and labels. The core inspection flow is described below.
// CreateCronTask creates a scheduled task
func (inspectionExecutionJobService *InspectionExecutionJobService) CreateCronTask(job *AutoInspection.InspectionExecutionJob) error {
	cronName := fmt.Sprintf("InspectionExecution_%d", job.ID)
	taskName := fmt.Sprintf("InspectionExecution_%d", job.ID)
	// Replace any existing task for this job before re-registering it.
	if _, found := global.GVA_Timer.FindTask(cronName, taskName); found {
		global.GVA_Timer.Clear(cronName)
	}
	var option []cron.Option
	option = append(option, cron.WithSeconds())
	if _, err := global.GVA_Timer.AddTaskByFunc(cronName, job.CronExpr, func() {
		inspectionExecutionJobService.ExecuteInspectionJob(job)
	}, taskName, option...); err != nil {
		global.GVA_LOG.Error("failed to create scheduled task", zap.Error(err), zap.Uint("jobID", job.ID))
		return err
	}
	nextTime := inspectionExecutionJobService.calculateNextRunTime(job.CronExpr)
	job.NextRunTime = &nextTime
	return global.GVA_DB.Model(job).Updates(map[string]interface{}{"next_run_time": job.NextRunTime}).Error
}

When a user creates an execution job and enables it, a cron task is scheduled. At execution time, the ExecuteInspectionJob function runs:
func (inspectionExecutionJobService *InspectionExecutionJobService) ExecuteInspectionJob(job *AutoInspection.InspectionExecutionJob) {
	inspectionExecutionJobService.updateJobExecutionTime(job)
	jobExecution := inspectionExecutionJobService.createJobExecution(job)
	if jobExecution == nil {
		return
	}
	allResults := inspectionExecutionJobService.executeAllInspectionTasks(job, jobExecution)
	global.GVA_LOG.Info("execution finished", zap.Any("results", allResults))
	inspectionExecutionJobService.updateJobExecutionResult(jobExecution, allResults)
	if *job.IsNotice {
		inspectionExecutionJobService.sendInspectionNotification(job, jobExecution, allResults)
	}
}

The executeAllInspectionTasks function runs the orchestrated tasks concurrently: a wait group tracks the goroutines while a mutex guards the shared result slice.
func (inspectionExecutionJobService *InspectionExecutionJobService) executeAllInspectionTasks(job *AutoInspection.InspectionExecutionJob, jobExecution *AutoInspection.JobExecution) []*result.ProductsResult {
	var wg sync.WaitGroup
	var mu sync.Mutex
	allResults := make([]*result.ProductsResult, 0)
	for _, jobID := range job.JobIds {
		wg.Add(1)
		go func(id uint) {
			defer wg.Done()
			res := inspectionExecutionJobService.executeSingleInspectionTask(id, jobExecution)
			if res != nil {
				mu.Lock()
				allResults = append(allResults, res)
				mu.Unlock()
			}
		}(jobID)
	}
	wg.Wait()
	return allResults
}

Each single task ultimately calls ExecuteInspectionTask, which dispatches on the data-source type. Currently only prometheus is supported:
func (inspectionExecutionJobService *InspectionExecutionJobService) ExecuteInspectionTask(inspectionJob *AutoInspection.InspectionJob, jobExecution *AutoInspection.JobExecution, resultCh chan *result.ProductsResult) {
	switch inspectionJob.DataSourceType {
	case "prometheus":
		inspectionExecutionJobService.ExecutePrometheusInspectionTask(inspectionJob, jobExecution, resultCh)
	}
}

ExecutePrometheusInspectionTask builds a list of PrometheusRule objects from inspection items and labels, creates a Product that aggregates these rules, and runs them:
func (inspectionExecutionJobService *InspectionExecutionJobService) ExecutePrometheusInspectionTask(inspectionJob *AutoInspection.InspectionJob, jobExecution *AutoInspection.JobExecution, resultCh chan *result.ProductsResult) {
	// build Prometheus rules
	prometheusRules := make([]*product.PrometheusRule, 0, len(inspectionJob.ItemLabelMaps))
	for _, itemLabelMap := range inspectionJob.ItemLabelMaps {
		// fetch item and label, then construct rule
		// ... omitted for brevity ...
		prometheusRule := &product.PrometheusRule{...}
		prometheusRules = append(prometheusRules, prometheusRule)
	}
	prod := &product.Product{Name: inspectionJob.Name, Rules: product.Rules{Prometheus: prometheusRules}}
	defer func() {
		if r := recover(); r != nil {
			global.GVA_LOG.Error("panic while executing inspection task", zap.Any("panic", r), zap.String("jobName", inspectionJob.Name))
			// send failure result
		}
	}()
	if err := prod.Run(resultCh); err != nil {
		global.GVA_LOG.Error("failed to execute inspection task", zap.Error(err), zap.String("jobName", inspectionJob.Name))
		return
	}
	global.GVA_LOG.Info("inspection task started", zap.String("jobName", inspectionJob.Name))
}

The PrometheusRule.Run method looks up the data source, queries the Prometheus client, and builds a RuleResult:
func (r *PrometheusRule) Run() (*result.RuleResult, error) {
	ds, err := datasource.GetByName(r.DataSourceName)
	if err != nil {
		return nil, err
	}
	pds, ok := ds.(*datasource.PrometheusDataSource)
	if !ok {
		return nil, fmt.Errorf("wrong data source type: %s is not a Prometheus data source", r.DataSourceName)
	}
	if pds.Client == nil {
		return nil, fmt.Errorf("data source client is nil: %s", r.DataSourceName)
	}
	res, err := pds.Run(r.Rule, r.LabelFilter)
	if err != nil {
		return nil, err
	}
	return r.buildRuleResult(res), nil
}

Notifications are sent by the sendInspectionNotification function, which formats a markdown or plain-text payload and posts it to an enterprise WeChat webhook:
func (inspectionExecutionJobService *InspectionExecutionJobService) sendInspectionNotification(job *AutoInspection.InspectionExecutionJob, jobExecution *AutoInspection.JobExecution, results []*result.ProductsResult) {
	// aggregate counts and abnormal items
	// build markdown or text content
	// send via sender.Sender
}

PDF export uses wkhtmltopdf to render an HTML template of the report and save the file:
func (jobExecutionService *JobExecutionService) GeneratePDF(jobExecution *AutoInspection.JobExecution) (string, error) {
	pdf, err := wkhtmltopdf.NewPDFGenerator()
	if err != nil {
		return "", err
	}
	// set options, render HTML, add page, create PDF, write to file
	return downloadURL, nil
}

Tip: the project is built on the gin-vue-admin framework, so its built-in timer component is leveraged for scheduling.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.