Building a Prometheus‑Based Microservice Monitoring System: Practices and Lessons from iQIYI

iQIYI built a Prometheus-based microservice monitoring platform. It uses Spring Boot Actuator, Micrometer, custom QAE collectors, file-based service discovery, and an Alert-proxy to gather container, application, and third-party metrics, evaluate PromQL alert rules, and forward alerts to the company's unified notification system. This article summarizes the practices and lessons learned.

iQIYI Technical Product Team

Jul 24, 2020

Microservice architecture is widely adopted by Internet companies. While it brings benefits such as clear service boundaries, independent deployment, and easy scaling, it also introduces new challenges for operations, especially in fault detection and alerting.

This article shares iQIYI's experience of constructing a microservice monitoring platform based on Prometheus. It starts with an overview of the four main monitoring approaches—health checks, logs, tracing, and metrics—then focuses on metric‑based monitoring, which is most suitable for cloud‑native environments.

After evaluating traditional solutions like Zabbix, the team selected Prometheus because of its strong community support, ease of deployment, pull‑model data collection, powerful data model, PromQL query language, rich ecosystem, and high performance.

The monitoring system is tailored to iQIYI’s business characteristics. iQIYI’s video‑content platform runs many microservices on an internal cloud platform (QAE). The monitoring solution therefore emphasizes both internal service metrics and third‑party interface metrics.

Key components of the solution include:

Using Spring Boot Actuator and Micrometer to expose service metrics for Prometheus.

Developing a custom qae-monitor collector to fetch container metrics from QAE’s open API.

Implementing a qae-monitor service that formats the data for Prometheus.

Creating a prom-sd-qae file‑based service discovery tool that periodically pulls container lists from QAE and generates Prometheus‑compatible target files.

Building an Alert‑proxy that forwards Prometheus alerts to iQIYI’s unified alerting platform via Alertmanager webhooks.
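
The file-based service discovery step can be sketched as follows. This is a minimal illustration, not the prom-sd-qae implementation: the container fields ("app", "ip", "port") and the sample data are assumptions, since the article does not describe the QAE API response. Only the output shape, a JSON list of objects with "targets" and "labels", follows the format Prometheus expects for file_sd target files.

```python
import json
import os
import tempfile
from collections import defaultdict


def build_target_groups(containers):
    """Group containers by app into Prometheus file_sd target groups.

    `containers` is a list of dicts with hypothetical keys "app", "ip" and
    "port"; the real QAE API response shape is not given in the article.
    """
    grouped = defaultdict(list)
    for c in containers:
        grouped[c["app"]].append(f'{c["ip"]}:{c["port"]}')
    # file_sd expects a list of {"targets": [...], "labels": {...}} objects.
    return [
        {"targets": sorted(targets), "labels": {"app": app}}
        for app, targets in sorted(grouped.items())
    ]


def write_targets_file(path, target_groups):
    """Write the file via a temp file and rename, so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(target_groups, f, indent=2)
    # mkstemp creates the file as owner-only; make it readable by Prometheus.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)


# Stand-in for a periodic call to the QAE API (hypothetical data).
sample_containers = [
    {"app": "video-api", "ip": "10.0.0.2", "port": 8080},
    {"app": "video-api", "ip": "10.0.0.1", "port": 8080},
    {"app": "search", "ip": "10.0.1.5", "port": 9090},
]
groups = build_target_groups(sample_containers)
```

A scheduler would call these two functions at a fixed interval, and a scrape job would reference the output path under file_sd_configs. Prometheus picks up changes to the file without a restart.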

The monitoring scope is divided into three layers: container‑environment monitoring, application‑service monitoring, and third‑party‑interface monitoring. Typical metrics collected are CPU/memory usage, JVM statistics, request latency, QPS, and success rates.
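
To make the container layer concrete, here is a rough sketch of how a collector might render a CPU gauge in the Prometheus text exposition format. The metric name, label names, and sample value are hypothetical; the article does not list the names qae-monitor actually exposes. The sketch also assumes the help text contains no backslashes or newlines.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge metric family.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and labels, for illustration only.
text = render_gauge(
    "qae_container_cpu_usage_ratio",
    "CPU usage of a container as a fraction of its limit.",
    [({"app": "video-api", "container": "c-1"}, 0.42)],
)
print(text, end="")
```

Serving this text from an HTTP endpoint is enough for Prometheus to scrape it. In practice an official client library would normally handle the formatting.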

For third‑party interfaces, the team chose an implicit instrumentation approach using Micrometer’s OkHttpMetricsEventListener, avoiding code intrusion while capturing request latency and success rates.

Alert rules are defined per service using PromQL, covering response time, success rate, QPS, and third‑party interface health. Alerts are routed through Alertmanager to the Alert‑proxy and finally to various notification channels (email, DingTalk, WeChat, etc.).
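
The translation step inside an Alert-proxy might look like the sketch below. The input follows the JSON structure Alertmanager posts to webhook receivers (a top-level "alerts" list whose entries carry "status", "labels", "annotations", and "startsAt"). The output fields, and the "app" and "severity" labels, are assumptions, because the article does not document the unified alerting platform's API.

```python
def to_notifications(payload):
    """Convert an Alertmanager webhook payload into notification dicts.

    The output keys are hypothetical; adapt them to the target platform.
    """
    notifications = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed")
        notifications.append({
            "title": f"[{status}] {name}",
            "service": labels.get("app", "unknown"),
            "severity": labels.get("severity", "warning"),
            "detail": annotations.get("summary", ""),
            "started_at": alert.get("startsAt", ""),
        })
    return notifications


# Trimmed example of what Alertmanager would POST (values are made up).
sample_payload = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "HighLatency",
                "app": "video-api",
                "severity": "critical",
            },
            "annotations": {"summary": "p99 latency above threshold"},
            "startsAt": "2020-07-24T08:00:00Z",
        }
    ],
}
notes = to_notifications(sample_payload)
```

A real proxy would wrap this function in an HTTP handler that receives the webhook POST and then calls the notification platform for each resulting message.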

In conclusion, the article emphasizes that monitoring is an evolving practice that must align with business needs and technology stacks. Continuous improvement—such as alert rule optimization, automated reporting, and intelligent monitoring—will further enhance service reliability and support rapid business growth.


