Custom Prometheus Monitoring Architecture and GitOps Practices at Liulishuo
This article details Liulishuo's customized Prometheus monitoring architecture, including data backup to Aliyun SLS, ECS service discovery, advanced alerting with PagerDuty and Goalert, GitOps-driven config management, cloud resource exporters, SLA monitoring, and future plans for storage and alert pipelines.
Background: Prometheus is a popular open‑source monitoring system that offers easy alert customization and low‑impact metric collection, making it widely adopted across companies.
Liulishuo's Prometheus architecture adds several enhancements:
Storage: All Prometheus data is additionally backed up to Aliyun SLS for redundancy and unified Grafana queries.
Service discovery: An ECS scrape module was developed to monitor the large number of Aliyun ECS instances.
Alerting: PagerDuty, Goalert, and a self‑built notification center handle alerts with scheduling and escalation.
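One way to realize the SLS backup described above is Prometheus `remote_write`. The following is a minimal sketch only: the endpoint URL format and the use of access-key credentials as basic auth are assumptions about an SLS-compatible remote-write endpoint, not the production configuration:

```yaml
remote_write:
  # Placeholder endpoint; consult the Aliyun SLS documentation for the
  # actual remote-write URL of your project and metric store.
  - url: https://<project>.<region>.log.aliyuncs.com/prometheus/<project>/<metricstore>/api/v1/write
    basic_auth:
      username: <access-key-id>
      password: <access-key-secret>
    queue_config:
      max_samples_per_send: 500
```

With a copy of every sample in SLS, Grafana can query both Prometheus and SLS through separate data sources.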
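The ECS scrape module is not open-sourced in the article, but its core idea can be sketched as generating Prometheus `file_sd` target files from the ECS inventory. The `ecsInstance` struct and the node_exporter port 9100 below are illustrative assumptions; a real implementation would populate instances from the Aliyun ECS API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetGroup matches the JSON shape Prometheus file_sd expects.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// ecsInstance is a hypothetical, trimmed-down view of an Aliyun ECS
// instance; a real module would fill it from the DescribeInstances API.
type ecsInstance struct {
	PrivateIP string
	Region    string
}

// fileSDGroups turns ECS instances into file_sd target groups, one group
// per region, scraping node_exporter on port 9100 (an assumed port).
func fileSDGroups(instances []ecsInstance) []targetGroup {
	byRegion := map[string][]string{}
	for _, inst := range instances {
		byRegion[inst.Region] = append(byRegion[inst.Region], inst.PrivateIP+":9100")
	}
	var groups []targetGroup
	for region, targets := range byRegion {
		groups = append(groups, targetGroup{
			Targets: targets,
			Labels:  map[string]string{"region": region},
		})
	}
	return groups
}

func main() {
	groups := fileSDGroups([]ecsInstance{{PrivateIP: "10.0.0.1", Region: "cn-hangzhou"}})
	out, _ := json.Marshal(groups)
	// Written to a file referenced by a file_sd_configs scrape job.
	fmt.Println(string(out))
}
```

Prometheus re-reads `file_sd` files automatically, so new ECS instances start being scraped without a restart.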
GitOps
Unified management of Prometheus config files in a single Git repository, triggering reloads on changes.
Automatic validation of config changes using promtool and a custom prom‑threshold tool. Example CI snippet:

```yaml
stages:
  - test

check_rules:
  stage: test
  image: debian:jessie
  script:
    - find ./ -type f -name '*.yml' | grep -E 'alerting|recording' | grep -v 'threshold' | xargs ./bin/promtool-linux check rules
  tags:
    - docker

check_threshold:
  stage: test
  image: prom-threshold:v0.1.1
  script:
    - find ./ -type d -name threshold | egrep '.*' || exit 0
    - find ./ -type d -name threshold | xargs /prom_threshold -op check -path
  tags:
    - docker
```
Custom alert thresholds are defined via metric exposure and recording rules. Example configuration:

```yaml
ThresholdMetrics:
  - metric: http_4xx_ratio_threshold
    threshold: 0.05
    labels:
      host: ****.com
      path: /oauth/callback
  - metric: http_latency_95_seconds_threshold
    threshold: 20
    labels:
      host: ****.com
      path: /

AlertingRule:
  - name: http/latency.customized
    rules:
      - alert: SlowHTTP
        expr: service:kong_latency:quantile95_rate1m{type="request"} > on (host,path) group_left() (http_latency_95_seconds_threshold * 1000)
```
Self‑developed monitoring plugin "Mercury"
Automatic backup of Alertmanager silences to object storage.
Periodic validation of silence rules to remove invalid entries.
Synchronization of Grafana dashboards between production and test environments.
Automatic generation of Alertmanager configurations for PagerDuty integrations.
Cloud resource monitoring
Custom exporters for services such as MySQL, Kafka, HBase, NAT, OSS, and RDS are registered with Prometheus. Example exporter registration code:

```go
switch collectType {
case "EsExporter":
	es := exporter.EsExporter(newCmsClient())
	prometheus.MustRegister(es)
case "HbaseExporter":
	hbase := exporter.HbaseExporter(newCmsClient())
	prometheus.MustRegister(hbase)
// ... other cases ...
}
```

Alerting rules for cloud resources are also defined, e.g., RDS connection usage alerts.
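An RDS connection-usage alert along these lines might look like the following sketch. The metric name `aliyun_rds_connection_usage` and the 85% threshold are illustrative assumptions, not the production rule:

```yaml
groups:
  - name: cloud/rds
    rules:
      - alert: RDSConnectionUsageHigh
        # aliyun_rds_connection_usage is a hypothetical gauge (percent)
        # exposed by the custom RDS exporter.
        expr: aliyun_rds_connection_usage > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RDS instance {{ $labels.instance }} connection usage above 85%"
```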
Alert escalation and scheduling
Each application has multi‑level escalation policies and on‑call schedules to ensure alerts are handled promptly.
Monitoring extensions
Service SLA monitoring using Prometheus recording rules to compute HTTP SLA from Kong metrics and gRPC SLA from Istio metrics.
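An HTTP SLA recording rule in this spirit could be sketched as below. It assumes Kong's Prometheus plugin exposing a per-status-code request counter (written here as `kong_http_status`) and follows the `service:*` recording-rule naming seen earlier in the article; both names are assumptions about the actual setup:

```yaml
groups:
  - name: sla/http
    rules:
      # Fraction of non-5xx responses over the last 5 minutes, per service.
      - record: service:http_sla:ratio_rate5m
        expr: |
          sum by (service) (rate(kong_http_status{code!~"5.."}[5m]))
            /
          sum by (service) (rate(kong_http_status[5m]))
```

The gRPC SLA from Istio metrics would follow the same pattern, replacing the HTTP status-code filter with a gRPC response-code filter.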
Future outlook
Evaluate cloud‑native storage solutions such as Cortex to replace the current Aliyun SLS backup.
Refine the alert pipeline by routing alerts through Goalert before the internal notification center.
Liulishuo Tech Team
Help everyone become a global citizen!