Custom Prometheus Monitoring Architecture and GitOps Practices at Liulishuo
This article details Liulishuo's customized Prometheus monitoring architecture, including data backup to Aliyun SLS, ECS service discovery, advanced alerting with PagerDuty and Goalert, GitOps-driven config management, cloud resource exporters, SLA monitoring, and future plans for storage and alert pipelines.
Background: Prometheus is a popular open‑source monitoring system that offers easy alert customization and low‑impact metric collection, making it widely adopted across companies.
Liulishuo's Prometheus architecture adds several enhancements:
Storage: All Prometheus data is additionally backed up to Aliyun SLS for redundancy and unified Grafana queries.
Service discovery: An ECS scrape module was developed to monitor the large number of Aliyun ECS instances.
Alerting: PagerDuty, Goalert, and a self‑built notification center handle alerts with scheduling and escalation.
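One way to realize the SLS backup described above is Prometheus `remote_write`. The following is a minimal sketch only: the endpoint URL format and the use of access-key credentials as basic auth are assumptions about an SLS-compatible remote-write endpoint, not the production configuration:

```yaml
remote_write:
  # Placeholder endpoint; consult the Aliyun SLS documentation for the
  # actual remote-write URL of your project and metric store.
  - url: https://<project>.<region>.log.aliyuncs.com/prometheus/<project>/<metricstore>/api/v1/write
    basic_auth:
      username: <access-key-id>
      password: <access-key-secret>
    queue_config:
      max_samples_per_send: 500
```

With a copy of every sample in SLS, Grafana can query both Prometheus and SLS through separate data sources.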
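The ECS scrape module is not open-sourced in the article, but its core idea can be sketched as generating Prometheus `file_sd` target files from the ECS inventory. The `ecsInstance` struct and the node_exporter port 9100 below are illustrative assumptions; a real implementation would populate instances from the Aliyun ECS API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetGroup matches the JSON shape Prometheus file_sd expects.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// ecsInstance is a hypothetical, trimmed-down view of an Aliyun ECS
// instance; a real module would fill it from the DescribeInstances API.
type ecsInstance struct {
	PrivateIP string
	Region    string
}

// fileSDGroups turns ECS instances into file_sd target groups, one group
// per region, scraping node_exporter on port 9100 (an assumed port).
func fileSDGroups(instances []ecsInstance) []targetGroup {
	byRegion := map[string][]string{}
	for _, inst := range instances {
		byRegion[inst.Region] = append(byRegion[inst.Region], inst.PrivateIP+":9100")
	}
	var groups []targetGroup
	for region, targets := range byRegion {
		groups = append(groups, targetGroup{
			Targets: targets,
			Labels:  map[string]string{"region": region},
		})
	}
	return groups
}

func main() {
	groups := fileSDGroups([]ecsInstance{{PrivateIP: "10.0.0.1", Region: "cn-hangzhou"}})
	out, _ := json.Marshal(groups)
	// Written to a file referenced by a file_sd_configs scrape job.
	fmt.Println(string(out))
}
```

Prometheus re-reads `file_sd` files automatically, so new ECS instances start being scraped without a restart.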
GitOps
Unified management of Prometheus config files in a single Git repository, triggering reloads on changes.
Automatic validation of config changes using promtool and a custom prom‑threshold tool. Example CI snippet:

```yaml
stages:
  - test

check_rules:
  stage: test
  image: debian:jessie
  script:
    - find ./ -type f -name '*.yml' | grep -E 'alerting|recording' | grep -v 'threshold' | xargs ./bin/promtool-linux check rules
  tags:
    - docker

check_threshold:
  stage: test
  image: prom-threshold:v0.1.1
  script:
    - find ./ -type d -name threshold | egrep '.*' || exit 0
    - find ./ -type d -name threshold | xargs /prom_threshold -op check -path
  tags:
    - docker
```
Custom alert thresholds are defined via metric exposure and recording rules. Example configuration:

```yaml
ThresholdMetrics:
  - metric: http_4xx_ratio_threshold
    threshold: 0.05
    labels:
      host: ****.com
      path: /oauth/callback
  - metric: http_latency_95_seconds_threshold
    threshold: 20
    labels:
      host: ****.com
      path: /

AlertingRule:
  - name: http/latency.customized
    rules:
      - alert: SlowHTTP
        expr: service:kong_latency:quantile95_rate1m{type="request"} > on (host,path) group_left() (http_latency_95_seconds_threshold * 1000)
```
Self‑developed monitoring plugin "Mercury"
Automatic backup of Alertmanager silences to object storage.
Periodic validation of silence rules to remove invalid entries.
Synchronization of Grafana dashboards between production and test environments.
Automatic generation of Alertmanager configurations for PagerDuty integrations.
Cloud resource monitoring
Custom exporters for services such as MySQL, Kafka, HBase, NAT, OSS, and RDS are registered with Prometheus. Example exporter registration code:

```go
switch collectType {
case "EsExporter":
	es := exporter.EsExporter(newCmsClient())
	prometheus.MustRegister(es)
case "HbaseExporter":
	hbase := exporter.HbaseExporter(newCmsClient())
	prometheus.MustRegister(hbase)
// ... other cases ...
}
```

Alerting rules for cloud resources are also defined, e.g., RDS connection usage alerts.
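An RDS connection-usage alert along these lines might look like the following sketch. The metric name `aliyun_rds_connection_usage` and the 85% threshold are illustrative assumptions, not the production rule:

```yaml
groups:
  - name: cloud/rds
    rules:
      - alert: RDSConnectionUsageHigh
        # aliyun_rds_connection_usage is a hypothetical gauge (percent)
        # exposed by the custom RDS exporter.
        expr: aliyun_rds_connection_usage > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RDS instance {{ $labels.instance }} connection usage above 85%"
```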
Alert escalation and scheduling
Each application has multi‑level escalation policies and on‑call schedules to ensure alerts are handled promptly.
Monitoring extensions
Service SLA monitoring using Prometheus recording rules to compute HTTP SLA from Kong metrics and gRPC SLA from Istio metrics.
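An HTTP SLA recording rule in this spirit could be sketched as below. It assumes Kong's Prometheus plugin exposing a per-status-code request counter (written here as `kong_http_status`) and follows the `service:*` recording-rule naming seen earlier in the article; both names are assumptions about the actual setup:

```yaml
groups:
  - name: sla/http
    rules:
      # Fraction of non-5xx responses over the last 5 minutes, per service.
      - record: service:http_sla:ratio_rate5m
        expr: |
          sum by (service) (rate(kong_http_status{code!~"5.."}[5m]))
            /
          sum by (service) (rate(kong_http_status[5m]))
```

The gRPC SLA from Istio metrics would follow the same pattern, replacing the HTTP status-code filter with a gRPC response-code filter.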
Future outlook
Evaluate cloud‑native storage solutions such as Cortex to replace the current Aliyun SLS backup.
Refine the alert pipeline by routing alerts through Goalert before the internal notification center.
Liulishuo Tech Team
Help everyone become a global citizen!