How iQIYI Built a Full‑Link Automated Monitoring Platform for Microservices

iQIYI’s tech product team built a unified full‑link automated monitoring platform that combines link (trace), metric, and log collection with deep analysis. It improves fault localization, performance insight, and scalability across microservices, and addresses limitations of existing tools such as ELK, Prometheus, and Dapper.

dbaplus Community

Aug 3, 2020

Background

Microservice architectures require a unified observability solution that can trace request contexts across service boundaries, aggregate metrics, collect logs at scale, and provide automated root‑cause analysis. Existing tools each cover only part of this: Zabbix, Graphite, and Prometheus handle basic metrics, while ELK Stack, CAT, and Google Dapper address logs or tracing. None of them offers a cohesive full‑link view or performs well enough for deep analysis.

Platform Overview

The platform is built around four functional modules:

Link collection: captures call‑graph relationships (spans) and service topology.

Metric collection: standardizes time‑series indicators across business lines.

Log collection: pipelines raw logs to both search and batch stores.

Deep analysis: correlates traces, logs, and client‑side error reports for near‑real‑time fault localization.

1. Link Collection

Each request generates a Trace‑id that links a series of Span records into a directed acyclic graph (DAG). A span stores service name, method, parameters, response time, and custom tags. Two integration modes are supported:

Code‑invasive mode – developers add manual instrumentation following the provided guidelines.

Non‑invasive mode – language‑specific agents (Java, Go, Lua) use probe technology, requiring no code changes.
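As a rough illustration of the code‑invasive mode, the sketch below wraps a unit of work in a context manager that records one span and links it to its parent. The names `span`, `FINISHED_SPANS`, and `handle_request` are hypothetical and are not taken from iQIYI's instrumentation guidelines; the in‑memory list stands in for whatever collector a real agent would report to.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical in-memory sink; a real client would ship spans to a collector.
FINISHED_SPANS = []
# Tracks the span that is currently open in this execution context.
_current = ContextVar("current_span", default=None)


@contextmanager
def span(service, method, **tags):
    """Record one span and link it to the enclosing span, if any."""
    parent = _current.get()
    record = {
        # Child spans inherit the trace id; a root span creates a new one.
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "service": service,
        "method": method,
        "tags": tags,
    }
    token = _current.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000.0
        _current.reset(token)
        FINISHED_SPANS.append(record)


def handle_request():
    # Manual instrumentation: the developer marks each unit of work.
    with span("gateway", "GET /play", user_tier="vip"):
        with span("video-service", "get_stream"):
            pass  # business logic would run here
```

A non‑invasive agent would aim to produce the same kind of records, but by hooking framework or runtime entry points instead of asking developers to add the `with` blocks.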

Query paths are indexed to avoid full‑table scans, enabling fast trace‑level retrieval.
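To make the indexing idea concrete, here is a minimal in‑memory sketch: spans are bucketed by trace id when they are written, so reading one trace never touches spans from other traces. `Span`, `TraceStore`, and `call_edges` are illustrative names, and a dictionary stands in for the platform's real storage, which the article does not describe in detail.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float


class TraceStore:
    """Keeps spans bucketed by trace id so a lookup avoids scanning everything."""

    def __init__(self):
        self._by_trace = defaultdict(list)

    def add(self, span):
        self._by_trace[span.trace_id].append(span)

    def get_trace(self, trace_id):
        # Only the spans of the requested trace are read.
        return list(self._by_trace.get(trace_id, []))


def call_edges(spans):
    """Derive caller -> callee service edges from parent/child span links."""
    by_id = {s.span_id: s for s in spans}
    edges = []
    for s in spans:
        parent = by_id.get(s.parent_id)
        if parent is not None:
            edges.append((parent.service, s.service))
    return edges
```

The same parent/child links that reconstruct a single trace also yield the service topology once edges are collected across many traces.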

2. Metric Collection

Metrics such as success‑rate, QPS, latency, and percentile latency are unified across all services. The platform defines a common calculation method (e.g., success‑rate = total successes / total requests, with optional weighted aggregation) and aggregates these metrics onto the trace view. This enables instant bottleneck detection and drives automated scaling decisions.
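The sketch below shows one plausible reading of these definitions. Aggregating success‑rate by summing raw counts weights each instance by its request volume, which differs from averaging per‑instance rates. The nearest‑rank percentile is one common definition chosen here for illustration; the article does not say which percentile method the platform uses.

```python
import math


def success_rate(successes, requests):
    """success-rate = total successes / total requests (None if no traffic)."""
    return successes / requests if requests else None


def aggregate_success_rate(instances):
    """Combine a list of (successes, requests) pairs from several instances.

    Summing the raw counts is a request-weighted average, so a busy
    instance counts for more than an idle one.
    """
    total_ok = sum(ok for ok, _ in instances)
    total = sum(n for _, n in instances)
    return success_rate(total_ok, total)


def percentile(latencies_ms, p):
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, one instance at 99/100 and another at 0/900 aggregate to 9.9% success, whereas a plain mean of the two rates would report a misleading 49.5%.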

3. Log Collection

The log pipeline consists of two stages:

Stage 1 – classic ELK flow: Logstash/Beats → Kafka → Elasticsearch. Provides flexible ingestion but is limited by ES storage and query performance.

Stage 2 – large‑scale augmentation: Kafka → Hadoop batch processing + Spark/Flink near‑real‑time streams. Hot indexes are stored in HBase/TiKV, raw logs remain in HBase, and Elasticsearch holds only the indexes required for interactive queries.
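The Stage 2 split between raw storage and a slim search index could look roughly like the following. The field list in `SEARCH_FIELDS` and the row‑key layout are assumptions made for this example, not details from the article; the point is only that the full payload goes to the raw store while the search document carries a few fields plus a pointer back to the raw row.

```python
import json

# Hypothetical: the only fields worth indexing for interactive search.
SEARCH_FIELDS = ("trace_id", "device_id", "service", "level", "timestamp")


def split_record(record):
    """Return (raw_row, search_doc) for one parsed log record."""
    # Assumed row-key layout; a real schema would be tuned to access patterns.
    row_key = f'{record["trace_id"]}:{record["timestamp"]}'
    # The raw store keeps the complete record.
    raw_row = {"row_key": row_key, "payload": json.dumps(record, sort_keys=True)}
    # The search index keeps only the minimal fields plus the row key.
    search_doc = {k: record[k] for k in SEARCH_FIELDS if k in record}
    search_doc["row_key"] = row_key
    return raw_row, search_doc
```

An interactive query would then hit the small search index first and fetch the full payload by `row_key` only for the matching records.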

Key implementation steps:

Client logs are posted via HTTP to Kafka; backend logs are harvested by Logstash.

Kafka topics are pre‑partitioned based on expected traffic to avoid costly repartitioning in downstream Spark jobs.

Spark streaming consumes Kafka partitions in parallel, writing to the appropriate storage component.

Hot indexes are kept in HBase/TiKV; Elasticsearch stores only the minimal searchable fields.

Elasticsearch tuning: increase index.refresh_interval, set the replica count to 0 (index.number_of_replicas), let Elasticsearch auto‑generate document IDs, adjust mappings, and keep each batch query under 10 k documents.

Hardware upgrades (CPU to newer generations, SSDs) were applied to meet sub‑second latency targets.

Log payloads are serialized with Protocol Buffers (size 1/3–1/10 of XML) and compressed with gzip (≈5× reduction).

Business‑line isolation prevents high‑traffic lines (tens of thousands of QPS) from affecting others. Current throughput reaches hundreds of GB per hour in Elasticsearch and several TB per day in HBase.
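The Elasticsearch tuning step above can be sketched as plain request payloads. `index.refresh_interval` and `index.number_of_replicas` are standard index settings, and a bulk action line without `_id` lets Elasticsearch generate the ID. The `"30s"` value, the helper names, and the exact cap in `QUERY_SIZE_CAP` are assumptions for illustration; the article gives only the general direction of each change.

```python
import json

# Assumed ceiling for one batch query, following the "<10 k documents" guidance.
QUERY_SIZE_CAP = 10_000


def write_heavy_settings(refresh_interval="30s"):
    """Body for an index-settings update favoring ingest throughput."""
    return {
        "index": {
            "refresh_interval": refresh_interval,  # refresh less often
            "number_of_replicas": 0,               # no replica copies to write
        }
    }


def bulk_body(index_name, docs):
    """Build a bulk-API NDJSON body; omitting _id requests auto-generated IDs."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline


def clamp_query_size(requested):
    """Keep a single batch query at or under the assumed cap."""
    return min(requested, QUERY_SIZE_CAP)
```

Running with zero replicas trades durability for write speed, so this sketch fits a setup like the one described, where the raw logs also live in HBase and the index can be rebuilt.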

4. Deep Analysis

Traditional alerts (QPS, RT, success‑rate, P999) reflect service‑level health but cannot pinpoint user‑level failures. The platform correlates client‑side error reports and customer‑service feedback with trace and log data, performing both offline batch aggregation and real‑time streaming aggregation.

Analysis pipeline:

Logs are grouped by device_id (or an enhanced Trace‑id that embeds the device identifier).

A rule engine (EasyRule) applies multi‑dimensional diagnostic rules; when a rule’s threshold is breached, an alert is generated automatically.

Aggregated metrics that require richer slicing are persisted in ClickHouse, enabling fast OLAP queries.
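The grouping and rule steps of this pipeline can be sketched as follows. This is a plain‑Python stand‑in written for illustration; it does not use or mirror the API of the rule engine named above, and `Rule`, `evaluate`, and `error_ratio` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    metric: Callable[[list], float]  # computes a value from one device's logs
    threshold: float                 # an alert fires when the value exceeds this


def group_by_device(logs):
    """Bucket log entries by device_id."""
    groups = defaultdict(list)
    for entry in logs:
        groups[entry["device_id"]].append(entry)
    return groups


def evaluate(logs, rules):
    """Apply every rule to every device's logs and collect breached rules."""
    alerts = []
    for device_id, entries in group_by_device(logs).items():
        for rule in rules:
            value = rule.metric(entries)
            if value > rule.threshold:
                alerts.append(
                    {"device_id": device_id, "rule": rule.name, "value": value}
                )
    return alerts


def error_ratio(entries):
    """Share of a device's log entries that are at ERROR level."""
    return sum(e["level"] == "ERROR" for e in entries) / len(entries)
```

Because each alert carries the `device_id`, it can be joined back to that device's traces and logs, which is what lets the analysis point at a specific user's failure instead of a service‑wide average.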

Overall Architecture

The unified system supports both client‑side and backend tracing, visualizes topology at three levels (business line → service → instance), and overlays real‑time metrics on the trace graph. It can ingest tens of thousands of events per second, store petabytes of compressed logs, and answer queries with sub‑second latency for both real‑time dashboards and offline investigations.

Benefits and Future Work

Standardized metric definitions and unified monitoring across all services.

Rule‑based alerts with automatic root‑cause localization.

Automated log analysis reduces investigation time by >50%.

Cross‑data‑center call detection and capacity‑planning insights.

Planned addition of full‑link load‑testing to enable proactive capacity forecasting for traffic spikes.
