Building a Prometheus‑Based Microservice Monitoring System: Practices and Lessons from iQIYI
iQIYI built a Prometheus‑based microservice monitoring platform that gathers container, application, and third‑party metrics using Spring Boot Actuator, Micrometer, custom QAE collectors, and file‑based service discovery. Alert rules are written in PromQL, and an Alert‑proxy forwards the resulting alerts to the company's unified notification system.
Microservice architecture is widely adopted by Internet companies. While it brings benefits such as clear service boundaries, independent deployment, and easy scaling, it also introduces new challenges for operations, especially in fault detection and alerting.
This article shares iQIYI's experience of constructing a microservice monitoring platform based on Prometheus. It starts with an overview of the four main monitoring approaches (health checks, logs, tracing, and metrics), then focuses on metric‑based monitoring, which it presents as the best fit for cloud‑native environments.
After evaluating traditional solutions like Zabbix, the team selected Prometheus because of its strong community support, ease of deployment, pull‑model data collection, powerful data model, PromQL query language, rich ecosystem, and high performance.
The monitoring system is tailored to iQIYI’s business characteristics. iQIYI’s video‑content platform runs many microservices on an internal cloud platform (QAE). The monitoring solution therefore emphasizes both internal service metrics and third‑party interface metrics.
Key components of the solution include:
Using Spring Boot Actuator and Micrometer to expose service metrics for Prometheus.
Developing a custom qae-monitor collector to fetch container metrics from QAE’s open API.
Implementing a qae-monitor service that exposes the collected container metrics in a format Prometheus can scrape.
Creating a prom-sd-qae file‑based service discovery tool that periodically pulls container lists from QAE and generates Prometheus‑compatible target files.
Building an Alert‑proxy that forwards Prometheus alerts to iQIYI’s unified alerting platform via Alertmanager webhooks.
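The file‑based discovery step in the list above can be sketched as follows. This is a minimal Python illustration, not iQIYI's implementation: the article does not describe the QAE API, so the container fields (`app`, `ip`, `port`, `state`) and both function names are hypothetical. Only the output shape, a JSON list of objects with `targets` and `labels`, follows the format Prometheus expects for file‑based service discovery.

```python
import json
import os
import tempfile


def build_target_groups(containers):
    """Turn a container list into Prometheus file_sd target groups.

    `containers` is assumed to be a list of dicts with the hypothetical
    keys app, ip, port and state; stopped containers are skipped.
    """
    groups = {}
    for c in containers:
        if c.get("state") != "running":
            continue
        groups.setdefault(c["app"], []).append(f'{c["ip"]}:{c["port"]}')
    # One target group per application, labelled with the app name.
    return [
        {"targets": sorted(targets), "labels": {"app": app}}
        for app, targets in sorted(groups.items())
    ]


def write_targets_file(path, target_groups):
    """Write the target file atomically (temp file, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(target_groups, f, indent=2)
    os.replace(tmp_path, path)
```

A periodic job would call these two functions with the container list fetched from QAE, and Prometheus would pick the file up through a `file_sd_configs` entry whose `files` pattern matches it. Writing to a temporary file and renaming it keeps Prometheus from reading a partially written file.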
The monitoring scope is divided into three layers: container‑environment monitoring, application‑service monitoring, and third‑party‑interface monitoring. Typical metrics collected are CPU/memory usage, JVM statistics, request latency, QPS, and success rates.
For third‑party interfaces, the team chose implicit instrumentation with Micrometer’s OkHttpMetricsEventListener, which records request latency and success rates for outgoing calls without changes to business code.
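The idea behind implicit instrumentation is that measurement lives in the HTTP client layer, so callers do not change. The sketch below is a Python analogue of that idea, not Micrometer's API: `InstrumentedClient`, `CallStats`, and the `send` callable are hypothetical names, and treating a status of 500 or above (or a raised exception) as a failure is an assumption.

```python
import time
from collections import defaultdict


class CallStats:
    """Per-host counters: total calls, failed calls, cumulative latency."""

    def __init__(self):
        self.count = 0
        self.errors = 0
        self.total_seconds = 0.0


class InstrumentedClient:
    """Wraps a `send(host, path) -> status_code` callable and records metrics."""

    def __init__(self, send, clock=time.perf_counter):
        self._send = send
        self._clock = clock
        self.stats = defaultdict(CallStats)

    def request(self, host, path):
        stat = self.stats[host]
        started = self._clock()
        try:
            status = self._send(host, path)
            if status >= 500:
                stat.errors += 1
            return status
        except Exception:
            stat.errors += 1
            raise
        finally:
            # Runs on success and failure, so every call is timed and counted.
            stat.count += 1
            stat.total_seconds += self._clock() - started

    def success_rate(self, host):
        """Fraction of successful calls, or None if the host was never called."""
        stat = self.stats.get(host)
        if stat is None or stat.count == 0:
            return None
        return (stat.count - stat.errors) / stat.count
```

Business code keeps calling `request` as it would call the underlying client, and the counters accumulate on the side, which is the property the listener‑based approach provides.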
Alert rules are defined per service using PromQL, covering response time, success rate, QPS, and third‑party interface health. Alerts are routed through Alertmanager to the Alert‑proxy and finally to various notification channels (email, DingTalk, WeChat, etc.).
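The Alert‑proxy step amounts to translating Alertmanager's webhook payload into whatever the downstream notification platform accepts. The Python sketch below illustrates that translation under stated assumptions: the input fields (`alerts`, `status`, `labels`, `annotations`, `startsAt`) follow Alertmanager's webhook JSON, while the output keys, the `app` and `severity` label names, and the function name are hypothetical, since the article does not describe the unified alerting platform's interface.

```python
import json


def webhook_to_notifications(body):
    """Convert an Alertmanager webhook body (JSON text) into flat messages.

    Returns one dict per alert; the output schema is illustrative only.
    """
    payload = json.loads(body)
    notifications = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        notifications.append(
            {
                "title": f'[{status}] {labels.get("alertname", "unnamed")}',
                "service": labels.get("app", "unknown"),
                "severity": labels.get("severity", "warning"),
                "message": annotations.get("summary", ""),
                "started_at": alert.get("startsAt", ""),
            }
        )
    return notifications
```

A real proxy would sit behind an HTTP endpoint configured as an Alertmanager webhook receiver and would post each resulting message to the notification platform, which then fans out to email or chat channels.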
In conclusion, the article emphasizes that monitoring is an evolving practice that must align with business needs and technology stacks. Continuous improvement—such as alert rule optimization, automated reporting, and intelligent monitoring—will further enhance service reliability and support rapid business growth.
