
Real-Time Log Monitoring and Alerting System for iQIYI Membership Services

This article describes how iQIYI built a real‑time, multi‑dimensional log monitoring platform using Spark Streaming, Flink, Kafka and Druid to handle billions of logs, improve alerting accuracy, and reduce incident response time. It also outlines planned enhancements toward intelligent monitoring.

DataFunTalk

In June 2019 iQIYI's membership reached 100 million users, prompting rapid growth of its machine clusters and exposing limitations of the existing monitoring system.

The new monitoring solution collects four types of logs (Nginx, application, exception, front‑end) in real time, aggregates them, and provides minute‑level alerts.
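As an illustration of minute‑level aggregation, the sketch below groups log events into one‑minute tumbling windows keyed by a few dimensions. It is plain Python rather than Spark Streaming or Flink code, and the event shape (timestamp, log type, service, status) and the function name `aggregate_per_minute` are assumptions made for the example, not details from the iQIYI system.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (epoch_seconds, log_type, service, status)
Event = Tuple[int, str, str, str]
Key = Tuple[int, str, str, str]


def aggregate_per_minute(events: Iterable[Event]) -> Dict[Key, int]:
    """Count events per (minute_start, log_type, service, status)."""
    counts: Counter = Counter()
    for ts, log_type, service, status in events:
        minute_start = ts - ts % 60  # floor the timestamp to its minute
        counts[(minute_start, log_type, service, status)] += 1
    return dict(counts)


# Illustrative data only: three events in one minute, one in the next.
sample = [
    (1560000000, "nginx", "order", "200"),
    (1560000030, "nginx", "order", "200"),
    (1560000059, "nginx", "order", "500"),
    (1560000060, "nginx", "order", "200"),
]
per_minute = aggregate_per_minute(sample)
```

In a streaming engine the same grouping would be expressed as a windowed aggregation over a Kafka source; the dictionary here stands in for the per‑minute rows that an OLAP store would hold.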

To handle billions of log events, the architecture adopts Spark Streaming and Flink for stream processing and Druid as the real‑time analytical datastore, with Kafka as the transport layer.

Key components include Venus‑Agent for log collection, Kafka for ingestion, real‑time processing pipelines, Druid for storage and OLAP, and offline jobs for long‑term analysis.

The system offers multi‑dimensional monitoring and automated alerting on threshold breaches and sudden spikes, and it supports operational actions such as degradation, traffic switching, and rate limiting.
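A threshold‑and‑spike check of this kind could look roughly like the following sketch. The rule fields (`threshold`, `spike_ratio`, `baseline_minutes`) and the comparison against the mean of the preceding minutes are assumptions chosen for illustration; the article does not specify how iQIYI defines a spike.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class AlertRule:
    threshold: float       # absolute limit for the latest minute
    spike_ratio: float     # latest / baseline mean that counts as a spike
    baseline_minutes: int  # number of preceding minutes used as baseline


def evaluate(rule: AlertRule, series: Sequence[float]) -> List[str]:
    """Return the reasons ("threshold", "spike") the latest point alerts."""
    if not series:
        return []
    latest = series[-1]
    reasons: List[str] = []
    if latest > rule.threshold:
        reasons.append("threshold")
    # Baseline: up to `baseline_minutes` points before the latest one.
    history = list(series[:-1])[-rule.baseline_minutes:]
    if history:
        baseline = sum(history) / len(history)
        if baseline > 0 and latest / baseline >= rule.spike_ratio:
            reasons.append("spike")
    return reasons


# Hypothetical rule: alert above 100 errors/minute or at 3x the recent mean.
rule = AlertRule(threshold=100, spike_ratio=3.0, baseline_minutes=5)
spike_result = evaluate(rule, [10, 12, 11, 9, 10, 45])
threshold_result = evaluate(rule, [90, 95, 98, 99, 97, 120])
quiet_result = evaluate(rule, [10, 12, 11, 9, 10, 11])
```

Keeping the two conditions separate lets a low‑volume service alert on a sudden jump long before it would cross an absolute limit, while a high‑volume service is still caught by the fixed threshold.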

The challenges addressed include log standardization across VM and QAE deployments, collector performance bottlenecks, resource cost optimization, and consumption latency in Spark Streaming and Flink.

Future work focuses on intelligent threshold tuning, traffic‑prediction models, and enhanced automated fault localization.

Tags: Big Data, Flink, Real-time Monitoring, Druid, Spark Streaming, iQIYI, Log Analytics