
Custom Prometheus Monitoring Architecture and GitOps Practices at Liulishuo

This article details Liulishuo's customized Prometheus monitoring architecture, including data backup to Aliyun SLS, ECS service discovery, advanced alerting with PagerDuty and Goalert, GitOps-driven config management, cloud resource exporters, SLA monitoring, and future plans for storage and alert pipelines.

Liulishuo Tech Team

Background: Prometheus is a popular open‑source monitoring system that offers easy alert customization and low‑impact metric collection, making it widely adopted across companies.

Liulishuo's Prometheus architecture adds several enhancements:

Storage: All Prometheus data is additionally backed up to Aliyun SLS for redundancy and unified Grafana queries.
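The article does not show the backup mechanism itself; the standard way to ship every sample to an external store is Prometheus `remote_write`. A minimal sketch, assuming SLS's Prometheus-compatible write endpoint — the project, metricstore, region, and credential placeholders below are illustrative, not Liulishuo's actual values:

```yaml
remote_write:
  - url: "https://<project>.<region>.log.aliyuncs.com/prometheus/<project>/<metricstore>/api/v1/write"
    basic_auth:
      username: <access-key-id>
      password: <access-key-secret>
```

With all samples duplicated into SLS, Grafana can query either backend through a single data-source switch, which is the "unified Grafana queries" benefit mentioned above.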

Service discovery: An ECS scrape module was developed to monitor the large number of Aliyun ECS instances.
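The article doesn't describe the module's interface. One common pattern for a custom discovery component (an assumption here, not confirmed by the article) is to write target files that Prometheus consumes via `file_sd_configs`; the job name and file path are hypothetical:

```yaml
scrape_configs:
  - job_name: ecs-node
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/ecs/*.json
        refresh_interval: 1m
```

The discovery module would periodically call the Aliyun ECS API and rewrite the JSON target files, so new instances are scraped without a Prometheus restart.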

Alerting: PagerDuty, Goalert, and a self‑built notification center handle alerts with scheduling and escalation.

GitOps

Unified management of Prometheus config files in a single Git repository, triggering reloads on changes.

Automatic validation of config changes using promtool and a custom prom‑threshold tool. Example CI snippet:

```yaml
stages:
  - test

check_rules:
  stage: test
  image: debian:jessie
  script:
    - find ./ -type f -name '*.yml' | grep -E 'alerting|recording' | grep -v 'threshold' | xargs ./bin/promtool-linux check rules
  tags:
    - docker

check_threshold:
  stage: test
  image: prom-threshold:v0.1.1
  script:
    - find ./ -type d -name threshold | egrep '.*' || exit 0
    - find ./ -type d -name threshold | xargs /prom_threshold -op check -path
  tags:
    - docker
```

Custom alert thresholds are defined via metric exposure and recording rules. Example configuration:

```yaml
ThresholdMetrics:
  - metric: http_4xx_ratio_threshold
    threshold: 0.05
    labels:
      host: ****.com
      path: /oauth/callback
  - metric: http_latency_95_seconds_threshold
    threshold: 20
    labels:
      host: ****.com
      path: /

AlertingRule:
  - name: http/latency.customized
    rules:
      - alert: SlowHTTP
        expr: service:kong_latency:quantile95_rate1m{type="request"} > on (host,path) group_left() (http_latency_95_seconds_threshold * 1000)
```
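"Metric exposure" here means the thresholds are served as ordinary gauge samples that PromQL can join against, as the `on (host,path) group_left()` expression does. A stdlib-only Go sketch of rendering one such sample in the text exposition format — the `thresholdSample` helper is hypothetical; a real exporter would normally use `prometheus/client_golang` instead of hand-formatting:

```go
package main

import "fmt"

// thresholdSample renders a single threshold metric in the
// Prometheus text exposition format, e.g. a per-route p95
// latency threshold with host/path labels.
func thresholdSample(name, host, path string, v float64) string {
	return fmt.Sprintf("%s{host=%q,path=%q} %g", name, host, path, v)
}

func main() {
	fmt.Println("# TYPE http_latency_95_seconds_threshold gauge")
	fmt.Println(thresholdSample("http_latency_95_seconds_threshold", "****.com", "/", 20))
}
```

Because the threshold is just another time series, per-route overrides need no rule changes: editing the exposed value retunes the alert.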

Self‑developed monitoring plugin "Mercury"

Automatic backup of Alertmanager silences to object storage.

Periodic validation of silence rules to remove invalid entries.

Synchronization of Grafana dashboards between production and test environments.

Automatic generation of Alertmanager configurations for PagerDuty integrations.

Cloud resource monitoring

Custom exporters for services such as MySQL, Kafka, HBase, NAT, OSS, and RDS are registered with Prometheus. Example exporter registration code:

```go
switch collectType {
case "EsExporter":
	es := exporter.EsExporter(newCmsClient())
	prometheus.MustRegister(es)
case "HbaseExporter":
	hbase := exporter.HbaseExporter(newCmsClient())
	prometheus.MustRegister(hbase)
// ... other cases ...
}
```

Alerting rules for cloud resources are also defined, e.g., RDS connection usage alerts.
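As one concrete illustration, an RDS connection-usage rule could take the following shape; the metric name `rds_connection_usage` and the 80% threshold are illustrative, not taken from Liulishuo's actual rules:

```yaml
groups:
  - name: rds
    rules:
      - alert: RDSConnectionUsageHigh
        expr: rds_connection_usage > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RDS {{ $labels.instance }} connection usage above 80% for 5m"
```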

Alert escalation and scheduling

Each application has multi‑level escalation policies and on‑call schedules to ensure alerts are handled promptly.

Monitoring extensions

Service SLA monitoring using Prometheus recording rules to compute HTTP SLA from Kong metrics and gRPC SLA from Istio metrics.
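An HTTP SLA recording rule of that shape might look like the following; the Kong metric name and label set are assumptions, not the team's actual rule:

```yaml
groups:
  - name: sla
    rules:
      - record: service:http_sla:ratio_rate5m
        expr: |
          sum by (service) (rate(kong_http_status{code!~"5.."}[5m]))
            /
          sum by (service) (rate(kong_http_status[5m]))
```

Precomputing the ratio keeps SLA dashboards and long-range availability queries cheap, since Grafana reads one series per service instead of re-aggregating raw status-code counters.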

Future outlook

Evaluate cloud‑native storage solutions such as Cortex to replace the current Aliyun SLS backup.

Refine the alert pipeline by routing alerts through Goalert before the internal notification center.
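Goalert accepts alerts through webhook-style integrations, so the planned routing change could be as small as pointing an Alertmanager receiver at it; the URL and token below are placeholders:

```yaml
route:
  receiver: goalert

receivers:
  - name: goalert
    webhook_configs:
      - url: "https://goalert.example.com/api/v2/generic/incoming?token=<integration-key>"
        send_resolved: true
```

Goalert would then own scheduling and escalation, forwarding to the internal notification center only for delivery.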
