Microservice Monitoring Practices at iQIYI: Architecture, Metrics, and Automation
iQIYI's micro-service monitoring combines low-cost automatic instrumentation, declarative method metrics, and data pushed through Pushgateway into a unified multi-dimensional schema. The metrics are visualized centrally in Grafana and managed with standardized alert rules. The experience suggests that simple integration, centralized dashboards, and early-stage governance enable rapid anomaly detection and effective incident response.
As frontline developers, we often face questions such as these: When taking over a new system, how much traffic does each interface receive, and which business units call it? When alerts arrive in bulk, how can we quickly determine the scope of impact? Is a timeout caused by the client, a slow server response, or network fluctuations? Monitoring is essential, but the extra work of integrating it can be daunting.
This article shares iQIYI's experience with micro‑service monitoring, covering the background, evolution, and practical implementations.
Background & Exploration
After more than a year of rapid growth, each engineer on the information-flow team was responsible for more than five micro-services on average. To understand each service's runtime status and quickly detect anomalies, the team experimented with several monitoring solutions, including log monitoring, Hystrix circuit-breaker monitoring, Actuator endpoints, and probing.
At this stage, the team lacked systematic theory and practical experience, so they mainly integrated existing monitoring infrastructure with low‑cost adaptations.
Monitoring Types
Log Monitoring: Uses mature solutions such as ELK or the company's Venus platform. Easy to adopt, but the pipeline is long and alerts are delayed.
Hystrix Monitoring: Provides circuit-breaker metrics via the Hystrix Dashboard. Low cost, but metrics are not persisted by default.
Actuator Monitoring: Spring Boot Actuator exposes health and metric endpoints. Low cost, but multi-instance aggregation and persistence require additional development.
Probing (Synthetic) Monitoring: Periodically tests service availability from the user's perspective. Useful for user-facing services.
Link (Tracing) Monitoring: Tracks request flow across services. Essential for distributed debugging.
Evolution & Practice
The team built a unified monitoring model based on a multi‑dimensional data schema (metric‑dimension‑value). Common metrics include QPS, latency (TP99, TP95, MEAN), error count, and resource utilization. Common dimensions include service_name, data_center (dc), instance, URL, method, and HTTP status.
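As a rough illustration of the metric-dimension-value model, the sketch below stores each sample under a metric name plus a set of dimensions, then filters by any subset of dimensions and computes TP99 with the nearest-rank method. The `MetricStore` class, the `latency_ms` metric name, and the sample values are assumptions made for this example. They do not come from iQIYI's system, which according to the article relies on Prometheus-style tooling rather than an in-memory store.

```python
from collections import defaultdict
from math import ceil


class MetricStore:
    """Hypothetical in-memory store for metric-dimension-value samples."""

    def __init__(self):
        # (metric name, frozenset of (dimension, value) pairs) -> observed values
        self._samples = defaultdict(list)

    def record(self, metric, value, **dimensions):
        """Store one value under a metric name and its full dimension set."""
        key = (metric, frozenset(dimensions.items()))
        self._samples[key].append(value)

    def select(self, metric, **filters):
        """Return all values of `metric` whose dimensions match every filter."""
        wanted = set(filters.items())
        values = []
        for (name, dims), observed in self._samples.items():
            if name == metric and wanted <= dims:
                values.extend(observed)
        return values


def percentile(values, pct):
    """Nearest-rank percentile; TP99 is percentile(values, 99)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


store = MetricStore()
for ms in range(1, 101):  # 100 requests with latencies of 1..100 ms in dc1
    store.record("latency_ms", ms, service_name="feed-api", dc="dc1",
                 instance="10.0.0.1", url="/feed", method="GET", status="200")
# One slow, failed request in a second data center.
store.record("latency_ms", 500, service_name="feed-api", dc="dc2",
             instance="10.0.1.1", url="/feed", method="GET", status="500")

dc1 = store.select("latency_ms", service_name="feed-api", dc="dc1")
all_dcs = store.select("latency_ms", service_name="feed-api")
print(percentile(dc1, 99), percentile(all_dcs, 99), max(all_dcs))
```

Because every sample carries the same dimension names, one query shape works at any granularity: dropping the `dc` filter aggregates across data centers, and adding `url` or `status` narrows the view.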
Key practices:
Automatic Instrumentation & Data Collection: Each service imports an SDK that automatically exposes standard metrics (HTTP, JVM) and registers with Eureka. The monitoring system discovers services via the registry and pulls metrics.
Declarative Method Monitoring: Instead of manual instrumentation, developers annotate methods, and the framework injects timing metrics.
Push Mode Extension: For short-lived tasks, the team uses Prometheus Pushgateway to push metrics, e.g., real-time exception aggregation from Kafka/Flink pipelines.
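The declarative approach can be sketched with a decorator, which plays roughly the role that a method annotation plays in the Java/Spring stack the article describes. This is only an analogue under stated assumptions: the `monitored` decorator, the `METHOD_STATS` registry, and the `rank_items` function are hypothetical names, and the registry is a plain dictionary rather than a real metrics client.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-process registry: metric name -> call statistics.
# Not thread-safe; a real service would use a metrics client library.
METHOD_STATS = defaultdict(
    lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0}
)


def monitored(name=None):
    """Record call count, error count, and elapsed time for a function."""

    def decorate(func):
        metric_name = name or func.__qualname__

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            stats = METHOD_STATS[metric_name]
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                stats["errors"] += 1
                raise  # monitoring must not swallow the caller's exception
            finally:
                # Runs on both success and failure.
                stats["calls"] += 1
                stats["total_seconds"] += time.perf_counter() - start

        return wrapper

    return decorate


@monitored("feed.rank_items")
def rank_items(items):
    """Example business method; contains no timing code of its own."""
    if not items:
        raise ValueError("nothing to rank")
    return sorted(items, reverse=True)


rank_items([3, 1, 2])
try:
    rank_items([])
except ValueError:
    pass

print(METHOD_STATS["feed.rank_items"])
```

The business method stays free of instrumentation code, so adding or removing monitoring is a one-line change at the declaration.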
Centralized Visualization
All metrics are visualized in Grafana using unified dashboards that include dimension filters, JVM/system metrics, traffic distribution, status codes, QPS, latency, and method‑level metrics.
Unified Alerting
Based on the common metrics and dimensions, default alert rules are configured for all metrics, and individual services can override the thresholds. Alerts are sent through a Grafana webhook to Alertmanager, which deduplicates and formats them before delivering them to iQIYI's unified alert platform.
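The idea of default rules with per-service overrides, followed by deduplication, can be sketched as follows. All threshold values, metric names, and service names here are invented for illustration. In the setup described above, rule evaluation happens in Grafana and deduplication in Alertmanager, so this code only mirrors the logic and is not how those tools are configured.

```python
from dataclasses import dataclass

# Hypothetical default thresholds applied to every service.
DEFAULT_THRESHOLDS = {"error_count": 10, "tp99_ms": 1000, "cpu_percent": 90}

# Hypothetical per-service overrides (custom thresholds).
CUSTOM_THRESHOLDS = {"feed-api": {"tp99_ms": 300}}


@dataclass(frozen=True)
class Alert:
    service_name: str
    metric: str
    value: float
    threshold: float


def threshold_for(service_name, metric):
    """Use the service's custom threshold if present, else the default."""
    overrides = CUSTOM_THRESHOLDS.get(service_name, {})
    return overrides.get(metric, DEFAULT_THRESHOLDS[metric])


def evaluate(samples):
    """Turn (service_name, metric, value) samples into deduplicated alerts.

    Samples that breach the same (service, metric) pair collapse into one
    alert that keeps the worst observed value.
    """
    firing = {}
    for service_name, metric, value in samples:
        limit = threshold_for(service_name, metric)
        if value <= limit:
            continue
        key = (service_name, metric)
        current = firing.get(key)
        if current is None or value > current.value:
            firing[key] = Alert(service_name, metric, value, limit)
    return list(firing.values())


samples = [
    ("feed-api", "tp99_ms", 450),       # breaches the custom threshold of 300
    ("feed-api", "tp99_ms", 520),       # same key, worse value
    ("search-api", "tp99_ms", 450),     # within the default threshold of 1000
    ("search-api", "error_count", 25),  # breaches the default threshold of 10
]
alerts = evaluate(samples)
for alert in alerts:
    print(alert)
```

Keeping defaults and overrides in separate tables means a new service is covered by alerting from the start, and teams only write configuration for the thresholds they want to change.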
Summary & Lessons
The practice shows that simple, low‑cost instrumentation, integration with existing infrastructure, and centralized visualization are key to successful monitoring adoption. Important takeaways include:
Simplicity and Effectiveness: Monitoring should be invisible during normal operation and provide all needed metrics during incidents.
Integration & Customization: Leverage mature frameworks (Prometheus, Spring Cloud) and add custom adapters where needed.
Centralized Management: Consolidate dashboards and alerts to reduce fragmentation.
Process Governance: Embed monitoring considerations early in design and development, and provide training to ensure consistent adoption.
Future work will focus on smarter alert rules, lower‑cost instrumentation, AI‑driven anomaly detection, root‑cause analysis, and self‑healing capabilities.
