How iQIYI Scaled Real‑Time Log Monitoring for 100M+ Users with Spark, Flink and Druid

After its paid‑member base passed 100 million, iQIYI rebuilt its monitoring stack: it ingests four log types, uses Spark Streaming, Flink and Druid for real‑time analysis, and tunes resource usage. The result improved incident‑investigation efficiency by more than 80 % while handling up to a billion log records per day.

dbaplus Community

Sep 14, 2020

Background and Motivation

In June 2019, iQIYI’s paid‑member base exceeded 100 million. The accompanying growth in machine clusters exposed the limits of the existing monitoring system, and the company needed a faster way to locate problems, reduce mean time to repair (MTTR), and head off customer complaints.

Traditional Monitoring Pain Points

Earlier monitoring relied on shell or Python scripts deployed on individual VMs; once a single VM crossed an error threshold, an alarm was triggered. This approach suffered from:

Fragmented log collection across multiple dimensions.

High latency and coarse‑grained alerts.

Inability to correlate network, application, and front‑end metrics.

Technical Solution Overview

To address these issues the team built a four‑dimensional real‑time log pipeline (access, exception, Nginx, front‑end logs) and selected Spark Streaming and Flink as stream‑processing engines and Druid as the real‑time analytical datastore. The architecture consists of:

Data collection: A custom Venus‑Agent (based on Filebeat) runs on each host and forwards logs to Kafka.

Real‑time processing: Spark Streaming (micro‑batch) and Flink (native stream) parse, filter, aggregate, and write results to Druid.

Storage: Druid stores both raw metrics and derived thresholds for sub‑second OLAP queries.

Offline analysis: Data from Druid and MySQL feed daily/weekly reports and model training.
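
The parse, filter and aggregate step described above can be illustrated with a small, engine‑independent sketch. It is not iQIYI's Spark or Flink job: the JSON schema (`ts`, `service`, `status`) and the output row shape are assumptions made for illustration, and a real job would read from Kafka and write to Druid rather than work on an in‑memory list.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def minute_bucket(epoch_seconds):
    """Truncate a Unix timestamp to the start of its minute (UTC, ISO 8601)."""
    start = epoch_seconds - epoch_seconds % 60
    return datetime.fromtimestamp(start, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:00Z")


def aggregate(raw_messages):
    """Parse JSON log messages, drop malformed ones, count per (minute, service, status)."""
    counts = Counter()
    for raw in raw_messages:
        try:
            event = json.loads(raw)
            key = (minute_bucket(int(event["ts"])), event["service"], int(event["status"]))
        except (ValueError, KeyError, TypeError):
            continue  # filter step: skip records that do not match the assumed schema
        counts[key] += 1
    # Rows shaped like what a time-series store could ingest (hypothetical field names)
    return [
        {"timestamp": minute, "service": service, "status": status, "count": count}
        for (minute, service, status), count in sorted(counts.items())
    ]
```

Pre‑aggregating to one row per minute and dimension combination keeps the volume written to the datastore far below the raw log volume, which is the main reason to aggregate before storage.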

Key Monitoring Features

The platform provides minute‑level alerts for over 400 anomaly types, supporting push channels such as hot‑chat, email and phone. Specific log‑based analyses include:

Nginx logs: Capture network‑layer data (status codes, RT, IP, region) and generate fine‑grained alerts.

Front‑end delivery logs: Measure page load, API latency, and static‑resource timing from the user perspective, enabling ISP or region‑specific optimizations.

Business access logs: Monitor service status codes and extract error codes for rapid root‑cause identification.

Exception logs: Detect runtime exceptions (e.g., ResourceAccessException, NullPointerException) that directly impact user experience.

Network operation data: Use Nginx traffic statistics to guide capacity planning and detect uneven load across data centers.
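
As a rough illustration of the Nginx‑log analysis listed above, the sketch below extracts IP, status code and response time from access‑log lines and raises a flag when the share of 5xx responses in a window exceeds a threshold. The log layout (`ip [time] "request" status request_time`) and the 5 % threshold are assumptions; the article does not specify iQIYI's actual `log_format` or alert rules.

```python
import re

# Assumed line layout: <ip> [<time>] "<request>" <status> <request_time>
LINE_RE = re.compile(
    r'(?P<ip>\S+) \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<rt>\d+(?:\.\d+)?)$'
)


def parse_line(line):
    """Return a dict with ip, status and rt (seconds), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {"ip": match["ip"], "status": int(match["status"]), "rt": float(match["rt"])}


def error_rate_alert(lines, threshold=0.05):
    """Summarise one window of log lines and flag it when the 5xx share exceeds threshold."""
    records = [r for r in map(parse_line, lines) if r is not None]
    if not records:
        return None  # nothing parseable in this window
    errors = sum(1 for r in records if r["status"] >= 500)
    rate = errors / len(records)
    return {"total": len(records), "errors": errors, "rate": rate, "alert": rate > threshold}
```

The same pattern extends to other dimensions the article mentions, such as grouping by region before computing the rate.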

Challenges Encountered

1. Log Standardization

Log formats had to be unified across VMs and QAE containers; roughly 80 % of services were deployed on VMs and 20 % in containers.

2. Collection Performance and Latency

Venus‑Agent runs under cgroup limits (1 CPU, 128 MiB by default). High‑traffic services needed streamlined extraction rules, reduced duplicate logs, and occasional sampling to avoid overwhelming the pipeline.
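
The article does not say how Venus‑Agent implements duplicate reduction or sampling, so the following is only a minimal sketch of two common techniques: dropping consecutive identical lines, and deterministic hash‑based sampling. The function names and the choice of CRC32 as the hash are illustrative assumptions.

```python
import zlib


def dedupe_consecutive(lines):
    """Yield lines, skipping any line identical to the one immediately before it."""
    previous = object()  # sentinel that never equals a real line
    for line in lines:
        if line != previous:
            yield line
        previous = line


def keep_sample(key, rate_percent):
    """Deterministically keep about rate_percent % of keys (0 keeps none, 100 keeps all)."""
    return zlib.crc32(key.encode("utf-8")) % 100 < rate_percent
```

Hashing a stable key such as a request ID, rather than drawing a random number, means every agent makes the same keep or drop decision for the same request, so sampled traces stay complete across hosts.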

3. Resource Cost Optimization

Druid tasks can handle roughly 150 k QPS, and adding tasks does not increase throughput linearly. The team therefore split high‑QPS streams across multiple Kafka topics and Druid datasources, which saved roughly 120 cores.
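
A back‑of‑the‑envelope sketch of the splitting idea follows. It treats the roughly 150 k QPS figure as a per‑task ceiling, which is an interpretation of the article rather than something it states, and the headroom factor, topic naming scheme and hash‑based routing are all hypothetical.

```python
import math
import zlib

TASK_CAPACITY_QPS = 150_000  # approximate figure quoted in the article; assumed per task


def shards_needed(peak_qps, headroom=0.8):
    """Number of topic/datasource pairs so each stays below headroom * capacity."""
    return max(1, math.ceil(peak_qps / (TASK_CAPACITY_QPS * headroom)))


def topic_for(base_topic, key, shard_count):
    """Route a record to one of shard_count topics by hashing a stable key."""
    shard = zlib.crc32(key.encode("utf-8")) % shard_count
    return f"{base_topic}-{shard}"
```

For example, `shards_needed(400_000)` returns 4 under these assumptions, and a producer would then call `topic_for("nginx-log", service_name, 4)` to pick the target topic for each record.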

4. Stream Processing Latency

Kafka partition counts and Druid task numbers were tuned to minimize delay. The latency inherent in Spark Streaming’s micro‑batch model was mitigated by raising spark.streaming.concurrentJobs, which lets more batch jobs run in parallel.
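
The setting can be passed at submit time with `--conf`. The helper below only assembles a `spark-submit` argument list; the value 4 and the application path are placeholders, since the article does not give the value iQIYI used. `spark.streaming.concurrentJobs` is not listed in Spark's public configuration reference and defaults to 1; with higher values, batches may finish out of order, so it suits jobs whose output does not depend on batch ordering.

```python
def spark_submit_args(app_path, concurrent_jobs=4, extra_conf=None):
    """Build a spark-submit argument list that sets spark.streaming.concurrentJobs."""
    conf = {"spark.streaming.concurrentJobs": str(concurrent_jobs)}
    conf.update(extra_conf or {})  # caller-supplied settings override the default above
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args
```

The returned list can be handed to `subprocess.run` on a machine where `spark-submit` is installed.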

Results and Impact

The new platform delivers minute‑level alerts, processes 10⁸–10⁹ logs daily, and improves incident‑investigation efficiency by over 80 %. It has already prevented more than 90 customer‑complaint incidents and generated 4 800+ actionable alerts across 400+ anomaly categories.

Future Directions

Intelligent threshold management using historical data and automated adjustments.

Traffic‑prediction models built on Nginx logs to forecast peak loads.

Enhanced automated root‑cause analysis to reduce false‑positive and missed alerts.
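
For the first item, one common baseline (not necessarily what iQIYI plans to build) is to derive a threshold from recent history as the mean plus a multiple of the standard deviation. The function names and the default multiplier `k=3.0` below are assumptions for illustration.

```python
import statistics


def dynamic_threshold(history, k=3.0):
    """Threshold = mean + k * sample standard deviation of historical values."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    return statistics.mean(history) + k * statistics.stdev(history)


def is_anomalous(value, history, k=3.0):
    """True when value exceeds the threshold derived from history."""
    return value > dynamic_threshold(history, k)
```

Recomputing the threshold on a sliding window lets it follow gradual traffic growth, although strongly seasonal metrics would usually need a per‑hour or per‑weekday baseline instead.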
