Prometheus and Grafana: A Comprehensive Guide to Monitoring, Alerting, and Visualization
This article introduces Prometheus and Grafana as a powerful monitoring stack, explains their architecture, metric collection, storage options, query language, integration with Grafana for dashboards and alerts, and shares practical deployment patterns and high‑availability solutions.
Monitoring and alerting are the foundation of service stability, performance optimization, and proactive issue prevention; modern systems, especially those built with micro‑service architectures, place great emphasis on these capabilities.
Prometheus and Grafana are often deployed as a pair, much as PHP and MySQL once were. They offer a clear division of responsibilities, ease of use, and high extensibility, which makes them a popular choice in the observability space.
If you are new to metric collection, monitoring, or alerting, Prometheus and Grafana are a safe starting point.
1. Prometheus (https://prometheus.io/)
Prometheus is an open‑source project that combines monitoring (charts), alerting, and a time‑series database (TSDB). It collects metrics by periodically pulling data from instrumented endpoints.
1.1 Architecture and Operation
Exporter
To expose metrics, a service needs an exporter, or a client library that plays the same role inside the process. Languages with long-running processes, such as Go or Java, provide client libraries that expose standard metrics (e.g., JVM stats, request latency, QPS) on an HTTP endpoint. PHP does not run as a resident process, so in-memory state is lost between requests; a common approach is to accumulate metrics in a local store such as Redis and expose them through an endpoint that Prometheus scrapes, or to push metrics to the Pushgateway.
1.2 Data Storage
Prometheus stores scraped metrics in a local TSDB by default, which satisfies most monitoring scenarios. For long-term persistence, high availability, or migration, remote_read and remote_write can connect Prometheus to external storage back-ends such as InfluxDB or Elasticsearch, either natively or through an adapter.
1.3 Metric Types and Sample Data
Metrics are exposed via an HTTP endpoint. A typical sample looks like:
# HELP task_execute_count task execution count
# TYPE task_execute_count counter
task_execute_count{task="test1",instance="host1.huajiao.com",} 10
task_execute_count{task="test1",instance="host2.huajiao.com",} 20
# HELP system_load_average_1m 1‑minute load average
# TYPE system_load_average_1m gauge
system_load_average_1m{application="system-java"} 0.06
# HELP task_consume_all request latency distribution
# TYPE task_consume_all histogram
task_consume_all_bucket{le="10"} 100
task_consume_all_bucket{le="20"} 200
task_consume_all_bucket{le="+Inf"} 400
task_consume_all_sum 10000
task_consume_all_count 400
# HELP go_gc_duration_seconds GC duration
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.326e-05
go_gc_duration_seconds{quantile="0.25"} 3.9552e-05
go_gc_duration_seconds{quantile="0.5"} 4.9175e-05
go_gc_duration_seconds{quantile="0.75"} 6.5348e-05
go_gc_duration_seconds{quantile="1"} 0.000909402
go_gc_duration_seconds_sum 132.156493338
go_gc_duration_seconds_count 2.217437e+06
The four metric types are:
Counter : monotonically increasing values such as request counts; combined with rate(), suitable for QPS.
Gauge : instantaneous values that can go up or down, like CPU load or memory usage.
Histogram : cumulative buckets for distribution analysis, e.g., latency; quantiles are estimated at query time.
Summary : similar to histogram, but quantiles are computed on the client (e.g., 95% of requests under 200 ms).
Counters and gauges describe totals and current system state, while histograms and summaries help analyze data distribution.
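Histogram buckets are cumulative: each bucket counts every observation less than or equal to its le bound, so the +Inf bucket always equals the _count value. The following toy class is a minimal sketch of that bookkeeping, not the implementation of any client library; the class and method names are made up for illustration.

```python
import math


class Histogram:
    """Toy histogram with cumulative buckets, mirroring the exposition format."""

    def __init__(self, upper_bounds):
        # Always end with +Inf so every observation falls in some bucket.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        self.total += value
        self.count += 1
        # Cumulative: an observation counts toward every bucket whose
        # upper bound (le) is greater than or equal to the value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

    def expose(self, name):
        """Return the sample lines for this histogram."""
        lines = []
        for bound, n in zip(self.bounds, self.bucket_counts):
            le = "+Inf" if math.isinf(bound) else format(bound, "g")
            lines.append('%s_bucket{le="%s"} %d' % (name, le, n))
        lines.append("%s_sum %s" % (name, format(self.total, "g")))
        lines.append("%s_count %d" % (name, self.count))
        return lines
```

Observing the values 5, 15, 25, and 8 with bounds 10 and 20 yields bucket counts 2, 3, and 4, with the last one matching _count.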
1.4 PromQL (Prometheus Query Language)
PromQL is used to query metrics. The web UI (Graph) provides an interactive console.
Instant Vector
Queries that return the latest sample, e.g.:
main_api
main_api{instance="host1.huajiao.com", code="200"}
main_api{code=~"5.*"}
Range Vector
Queries over a time range, e.g.:
main_api{code="200"}[1m]
main_api{code="200"}[1m] offset 5m
Prometheus supports arithmetic, comparison, logical (and, or, unless), and aggregation operators (sum, count, topk, etc.).
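To make the selector and aggregation semantics concrete, here is a small in-memory sketch of label matching (= and =~), sum ... by, and topk. The sample data and function names are invented for this example, and the real engine works on time series rather than plain tuples.

```python
import re
from collections import defaultdict

# Each sample is (labels, value); the data is made up for illustration.
SAMPLES = [
    ({"api": "live/start", "code": "200"}, 120.0),
    ({"api": "live/start", "code": "500"}, 3.0),
    ({"api": "live/stop", "code": "200"}, 80.0),
    ({"api": "user/info", "code": "502"}, 1.0),
]


def select(samples, equal=None, regex=None):
    """Keep samples whose labels satisfy all `=` and `=~` matchers."""
    equal = equal or {}
    regex = regex or {}
    out = []
    for labels, value in samples:
        ok_eq = all(labels.get(k, "") == v for k, v in equal.items())
        # PromQL regex matchers are fully anchored, hence fullmatch.
        ok_re = all(re.fullmatch(p, labels.get(k, "")) for k, p in regex.items())
        if ok_eq and ok_re:
            out.append((labels, value))
    return out


def sum_by(samples, label):
    """Rough equivalent of `sum(...) by (label)`."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


def topk(k, totals):
    """Rough equivalent of `topk(k, ...)` over aggregated values."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Because regex matchers are anchored, code=~"5" matches nothing in this data, while code=~"5.*" matches both 500 and 502.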
Example Queries
QPS of an interface:
rate(main_api[5m])
QPS grouped by API label:
sum(rate(main_api{api=~"live/.*"}[1m])) by(api)
Top 30 APIs by QPS:
topk(30, sum(rate(main_api[1m])) by(api))
99th percentile latency:
histogram_quantile(0.99, sum(rate(api_consume_all_bucket[1m])) by (le))
The rate function converts a counter into its per-second average rate of increase over the given time window.
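The arithmetic behind rate and histogram_quantile can be sketched in a few lines. This is a simplified model under stated assumptions: simple_rate handles counter resets but skips the window-edge extrapolation Prometheus applies, and the quantile estimate assumes linear interpolation within a bucket with the lowest bucket starting at 0. The function names are chosen for this example.

```python
import math


def simple_rate(samples):
    """Per-second average increase of a counter over (timestamp, value) samples."""
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter restarted from zero.
        increase += value if value < prev else value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Inside +Inf (or an empty bucket): fall back to the
                # highest bound seen so far.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

The estimate is only as precise as the bucket layout: any quantile that lands in the +Inf bucket is reported as the highest finite bound, which is one reason to choose bucket boundaries around the latencies that matter.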
2. Grafana Integration
Grafana visualizes Prometheus data and provides a richer UI than the native Prometheus graph. Grafana can use Prometheus as a data source and configure alerts.
2.1 Sample Dashboards
QPS chart example:
sum(rate(kong_http_status{job="kong_system"}[1m])) by (service)
Latency distribution (90% and 50%):
histogram_quantile(0.9, sum(rate(kong_latency_total_bucket[5m])) by (le))
histogram_quantile(0.5, sum(rate(kong_latency_total_bucket[5m])) by (le))
2.2 Alerting
Prometheus handles alerting through Alertmanager, a separate component that receives the alerts fired by Prometheus alerting rules, but Grafana's alerting UI is more user-friendly. Alerts can be sent via webhook, among other channels.
Grafana allows defining alert conditions using functions (e.g., max, sum, avg, diff) and time windows.
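Conceptually, such a condition reduces the points inside a time window to a single number and compares it with a threshold. The sketch below models that idea only; it is not Grafana's implementation, the function names are hypothetical, and diff is assumed to mean the newest value minus the oldest value in the window.

```python
REDUCERS = {
    "max": max,
    "sum": sum,
    "avg": lambda xs: sum(xs) / len(xs),
    # Assumed semantics: newest value minus oldest value in the window.
    "diff": lambda xs: xs[-1] - xs[0],
}


def window(points, now, seconds):
    """Values of time-ordered (timestamp, value) points in the last `seconds`."""
    return [v for t, v in points if now - seconds <= t <= now]


def is_firing(points, now, seconds, reducer, threshold):
    """True when the reduced window value exceeds the threshold."""
    values = window(points, now, seconds)
    if not values:
        # Treating "no data" as OK is a choice made for this sketch.
        return False
    return REDUCERS[reducer](values) > threshold
```

For example, a rule such as "avg over the last 2 minutes is above 5" maps to is_firing(points, now, 120, "avg", 5).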
3. Application in Huajiao
3.1 Custom Gateway
Because PHP cannot embed an exporter, Huajiao relies on a push-style gateway. The stock Pushgateway introduces single-point-of-failure and data-loss concerns, so the team built a high-performance gateway with a UDP-based SDK that stores metrics in Redis, supports point-in-time counting, and avoids the drawbacks of the default Pushgateway.
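The article does not describe the SDK's wire format, so the following is only a sketch of what the client side of a UDP-based metrics SDK might look like. The name|labels|value datagram layout, the class name, and the method names are all invented for illustration, and the gateway side (aggregation into Redis) is not shown.

```python
import socket


def encode_metric(name, labels, value):
    """Encode one increment as a compact text datagram.

    The `name|k=v,k=v|value` layout is hypothetical.
    """
    label_part = ",".join("%s=%s" % (k, labels[k]) for k in sorted(labels))
    return ("%s|%s|%s" % (name, label_part, value)).encode("utf-8")


def decode_metric(datagram):
    """Inverse of encode_metric, as a gateway might apply it."""
    name, label_part, value = datagram.decode("utf-8").split("|")
    labels = dict(p.split("=", 1) for p in label_part.split(",")) if label_part else {}
    return name, labels, float(value)


class UdpMetricsClient:
    """Fire-and-forget sender that never blocks or raises into the request path."""

    def __init__(self, host, port):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.setblocking(False)

    def incr(self, name, labels=None, value=1):
        try:
            self.sock.sendto(encode_metric(name, labels or {}, value), self.addr)
        except OSError:
            # Dropping a sample is preferred over failing the request.
            pass
```

The appeal of UDP for short-lived PHP requests is that sending costs almost nothing and cannot stall the request; the trade-off is that individual datagrams may be lost, which a real gateway would have to account for.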
3.2 High‑Availability Solution
Prometheus stores data locally, which risks data loss on failure. Huajiao deploys multiple Prometheus instances and uses TimescaleDB (a PostgreSQL‑based TSDB) for remote storage via the prometheus‑postgresql‑adapter:
docker run -d -p 9201:9201 timescale/prometheus-postgresql-adapter:latest \
  -pg-host=host.to.pgsql \
  -pg-password=pwd2pg
Prometheus configuration:
remote_write:
  - url: "http://127.0.0.1:9201/write"
remote_read:
  - url: "http://127.0.0.1:9201/read"
3.3 Query Optimization
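Complex PromQL queries on large datasets may time out. To avoid this, recording rules are defined in rule files and evaluated periodically (controlled by global.evaluation_interval); each rule stores the result of an expression as a new time series that dashboards can query cheaply.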
rule_files:
  - rule1.yml
  - rule2.yml
Example rule:
groups:
  - name: example
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum(http_inprogress_requests) by (job)
4. Conclusion
Prometheus has gained strong momentum and is increasingly adopted in production environments. This article covered its core features, integration with Grafana, alerting, and practical deployment patterns. Topics such as service discovery, Kubernetes integration, and clustering were omitted.
Compared with older monitoring systems like Nagios, Prometheus offers a more suitable solution for modern production workloads and its out‑of‑the‑box usability makes migration straightforward.
5. References
Official documentation: https://prometheus.io/docs/prometheus/latest/
Prometheus Book: https://yunlzheng.gitbook.io/prometheus-book/
Huajiao Technology
The Huajiao Technology channel shares the latest Huajiao app tech on an irregular basis, offering a learning and exchange platform for tech enthusiasts.
