
Building a Real-Time Computing Platform with Apache Flink at iQIYI: Architecture, Improvements, and Business Cases

iQIYI’s senior data engineer shares the evolution of its big‑data services from Hadoop to a Flink‑based real‑time computing platform, detailing architecture, monitoring improvements, StreamingSQL capabilities, business use cases like recommendation and deep‑learning data generation, and future plans for unified stream‑batch processing.

DataFunTalk

Overview

The presentation outlines iQIYI’s journey in big‑data services from a 20‑node Hadoop cluster in 2012 to a large‑scale real‑time computing platform built on Apache Flink, supporting over 15,000 nodes, 800 jobs, and a daily data flow of roughly 2,500 TB.

Flink Adoption and Platform Architecture

Flink was introduced in 2017 and, together with Spark, forms the core of the StreamingSQL engine. The platform's storage layer uses HDFS, HBase, Kafka, and OSS. Real-time jobs are managed through a web IDE that handles task development, deployment, versioning, and monitoring.

Improvements to Flink

Three‑level monitoring: Job‑level (status, checkpoints), Operator‑level (latency, back‑pressure, flow), and TaskManager‑level (CPU, memory, JVM GC).

State management enhancements: Savepoints are used to recover jobs after restarts and to break long checkpoint dependency chains.
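Job-level monitoring of this kind can be driven by Flink's REST API, which exposes job status under endpoints such as `/jobs/overview`. Below is a minimal sketch of parsing such a response and flagging unhealthy jobs; the payload and job names are invented for illustration, and iQIYI's actual monitoring stack is not described in this detail.

```python
import json

# Payload shaped like Flink's REST endpoint GET /jobs/overview
# (field names follow the open-source Flink REST API; the job data
# here is made up for illustration).
sample = json.loads("""
{"jobs": [
  {"jid": "a1", "name": "venus-etl", "state": "RUNNING"},
  {"jid": "b2", "name": "reco-features", "state": "FAILED"}
]}
""")

def unhealthy_jobs(overview: dict) -> list:
    """Return names of jobs that are not in a healthy RUNNING state."""
    return [j["name"] for j in overview["jobs"] if j["state"] != "RUNNING"]

print(unhealthy_jobs(sample))  # a job-level alert would fire for these
```

Operator- and TaskManager-level metrics (latency, back-pressure, JVM GC) are available through the same REST API and Flink's metric reporters, so the same polling pattern extends to the other two monitoring levels.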

StreamingSQL Features

SQL‑based stream processing, eliminating the need for Scala code.

Support for DDL (stream, temporary, dimension, result tables).

Built‑in and user‑defined functions (UDFs) with an integrated SQL editor.
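To make the DDL support concrete, here is a hypothetical StreamingSQL job with syntax modeled on open-source Flink SQL; iQIYI's actual dialect, table kinds, and connector options may differ, and every name below is invented.

```sql
-- Stream table over Kafka (connector options are illustrative).
CREATE TABLE clicks (
  user_id BIGINT,
  item_id BIGINT,
  ts TIMESTAMP(3)
) WITH ('connector' = 'kafka', 'topic' = 'clicks');

-- Result table written to HBase.
CREATE TABLE click_counts (
  item_id BIGINT,
  cnt BIGINT
) WITH ('connector' = 'hbase');

-- The continuous query: a SQL statement replaces a hand-written Scala job.
INSERT INTO click_counts
SELECT item_id, COUNT(*) AS cnt FROM clicks GROUP BY item_id;
```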

Real‑Time Computing Management Platform

Provides file management, function management, version control, monitoring dashboards, alert subscriptions, resource auditing, and anomaly diagnosis.

Real‑Time Data Processing Platform Evolution

2015‑2016: Venus 1.0 data collection using Flume; limited flexibility.

2017‑2018: Venus 2.0 migrates filtering to Flink with dual‑Kafka, enabling dynamic rule changes.

2019: Venus 3.0 introduces a stream‑data production and distribution platform with configurable operators (Projection, Filter, Split, Union, Window, UDF) and reduces data duplication.
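The configurable operators can be pictured as small composable functions over a stream. The toy sketch below models Projection and Filter as plain Python functions applied to an in-memory list; the real platform wires equivalent operators into Flink jobs, and the event schema here is invented.

```python
# Toy model of Venus 3.0-style configurable operators.

def projection(fields):
    """Keep only the configured fields of a record."""
    return lambda rec: {k: rec[k] for k in fields}

def filter_op(pred):
    """Keep only records matching the configured predicate."""
    return lambda recs: [r for r in recs if pred(r)]

def run_pipeline(records, pred, fields):
    """Filter, then project: two of the configurable operators chained."""
    kept = filter_op(pred)(records)
    return [projection(fields)(r) for r in kept]

events = [
    {"uid": 1, "event": "play", "app": "ios"},
    {"uid": 2, "event": "pause", "app": "android"},
]
out = run_pipeline(events, lambda r: r["event"] == "play", ["uid", "app"])
print(out)  # only the 'play' event survives, trimmed to two fields
```

Because each operator is configuration rather than code, a rule change (say, a new filter predicate) can be deployed without rewriting the job, which is what enables the dynamic rule changes introduced in Venus 2.0.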

Business Cases

Information‑flow recommendation: Real‑time user behavior is ingested into Kafka, processed by Flink, and fed to the recommendation engine, reducing latency from 1 minute to 1‑2 seconds and improving CTR.
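The freshness gain comes from keeping per-user state in the stream job instead of recomputing in batch. A minimal sketch of one such feature, clicks per user in a recent time window, is below; the window length and event shape are illustrative, not iQIYI's actual feature definitions.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10  # illustrative window length

class RecentClicks:
    """Rolling per-user click count, the kind of state a Flink job
    would keep before emitting features to the recommendation engine."""

    def __init__(self):
        self.by_user = defaultdict(deque)  # user_id -> deque of timestamps

    def on_click(self, user_id, ts):
        q = self.by_user[user_id]
        q.append(ts)
        while q and ts - q[0] > WINDOW_SECONDS:
            q.popleft()  # evict events older than the window
        return len(q)    # feature: clicks in the last WINDOW_SECONDS

state = RecentClicks()
print(state.on_click("u1", 1))   # 1
print(state.on_click("u1", 5))   # 2
print(state.on_click("u1", 20))  # old clicks evicted -> 1
```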

Deep‑learning training data generation: Real‑time pipelines produce delta features and join with HBase dimension tables to create training sessions, cutting model update cycles from 6 hours to 1 hour.
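The stream/dimension join above can be sketched in a few lines: a dict stands in for the HBase dimension table, and each streamed delta feature is enriched by lookup before being written out as training data. All field names below are invented for illustration.

```python
# Stand-in for the HBase dimension table: video metadata keyed by video_id.
dimension = {
    "v1": {"category": "drama", "duration": 2400},
}

def enrich(delta_feature: dict) -> dict:
    """Join one streamed delta feature against the dimension table."""
    dim = dimension.get(delta_feature["video_id"], {})
    return {**delta_feature, **dim}

sample = {"user_id": "u1", "video_id": "v1", "watch_pct": 0.8}
print(enrich(sample))
```

In the real pipeline the lookup is an asynchronous HBase read (often cached) inside the Flink job, so enrichment happens continuously rather than once per 6-hour batch.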

Exactly‑Once processing: Exploring Kafka Exactly‑Once Semantics combined with Flink two‑phase commit, accepting a ~20% performance overhead.
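The two-phase-commit pattern behind this can be shown schematically: writes are staged per checkpoint (pre-commit) and only made visible once the checkpoint completes (commit). The sketch below is plain Python with no Kafka or Flink involved; in Flink the same protocol is implemented by the two-phase-commit sink working with Kafka transactions.

```python
class TwoPhaseCommitSink:
    """Schematic two-phase-commit sink: staged writes become visible
    only after the checkpoint that covers them is confirmed."""

    def __init__(self):
        self.pending = []    # records staged in the open "transaction"
        self.committed = []  # records visible downstream

    def write(self, record):
        self.pending.append(record)

    def pre_commit(self):
        """Checkpoint barrier arrives: flush staged records, don't expose them."""
        staged, self.pending = self.pending, []
        return staged

    def commit(self, staged):
        """Checkpoint confirmed complete: make the staged records visible."""
        self.committed.extend(staged)

sink = TwoPhaseCommitSink()
sink.write("a")
sink.write("b")
txn = sink.pre_commit()   # checkpoint barrier
sink.commit(txn)          # checkpoint confirmed -> records become visible
print(sink.committed)     # ['a', 'b']
```

Holding records until commit is where the cited ~20% overhead comes from: downstream visibility is delayed by up to one checkpoint interval, and the sink does extra bookkeeping per checkpoint.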

Challenges and Future Plans

Unified stream‑batch processing.

Further SQL‑ification via StreamingSQL.

Exploration of Flink‑based machine learning.

Improved resource utilization and dynamic scaling.

Deployment of Flink on Kubernetes.

Tags: big data, Flink, Apache Flink, data platform, Real-Time Computing, iQIYI, StreamingSQL
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
