Building RAP: iQIYI’s Real‑Time Big Data Analytics Platform with Druid, Spark & Flink
This article describes iQIYI's RAP platform: its real-time analytics requirements, the architectural evolution from RAP 1.x to 2.x, the core design steps, and the integration of Druid, Spark, Flink, and KIS. It also presents business use cases such as membership monitoring, recommendation evaluation, and smart-TV alerting.
Real‑Time Analysis Requirements
Since 2010, iQIYI's data volume has grown beyond the capabilities of its Hive + MySQL OLAP warehouse, creating the need for a platform that delivers minute-level latency, high-throughput queries, and flexible multidimensional analysis. Building such analytics by hand exposed four pain points:
Choosing the right OLAP engine was difficult because many options existed, each with trade‑offs.
Development cost was high: users had to write Spark or Flink jobs and build front‑end reports.
Data latency was poor, ranging from tens of minutes to days.
Maintenance was cumbersome; any change in data sources required reworking the entire pipeline.
RAP (Realtime Analysis Platform) was created to address these pain points. It provides a web wizard through which users configure data ingestion, processing, aggregation, reporting, and alerting in a few clicks.
Architecture Evolution
RAP 1.x
OLAP Selection – After evaluating several engines, Druid was chosen as the underlying OLAP store because it is open‑source, optimized for time‑series data, and offers sub‑second query latency.
Product Design – RAP abstracts the real‑time analytics workflow into five steps: data ingestion, data processing, aggregation, report configuration, and real‑time alerting. Each step is guided by a web wizard.
Data Ingestion – Users select from four source types (user data, service logs, monitoring data, other Kafka sources) via a dropdown; no cluster IPs are required.
Data Processing – RAP translates user‑defined rules into a proprietary StreamingSQL, generating Spark Streaming jobs automatically. Built‑in functions include IP‑to‑province/city/operator conversion.
Aggregation – Users define dimensions and measures (count, distinct count, sum, etc.) through a UI; RAP automatically generates optimized Druid queries and tunes parameters such as task.partition, windowPeriod, and queryGranularity.
Report Configuration – Based on the OLAP model, RAP creates Druid queries, renders visual reports, and suggests appropriate query granularity.
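As a rough illustration of how an OLAP model of dimensions and measures could be turned into a Druid query, the sketch below builds a Druid native groupBy query as a Python dict. The function name, the measure kinds ("count", "sum", "distinct_count"), and the data source and column names are hypothetical and are not taken from RAP. The sketch also assumes the distinct-count column was ingested as a hyperUnique metric.

```python
import json

# Hypothetical UI measure kinds mapped to Druid native aggregator types.
AGGREGATOR_TYPES = {
    "count": "count",              # counts Druid rows (after any ingestion-time rollup)
    "sum": "longSum",
    "distinct_count": "hyperUnique",  # assumes a hyperUnique metric column exists
}


def build_groupby_query(data_source, dimensions, measures, interval, granularity="minute"):
    """Build a Druid native groupBy query from a simple OLAP model."""
    aggregations = []
    for measure in measures:
        agg = {"type": AGGREGATOR_TYPES[measure["kind"]], "name": measure["name"]}
        if measure["kind"] != "count":
            # Every aggregator except "count" reads from a metric column.
            agg["fieldName"] = measure["field"]
        aggregations.append(agg)
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": granularity,
        "dimensions": list(dimensions),
        "aggregations": aggregations,
        "intervals": [interval],
    }


if __name__ == "__main__":
    query = build_groupby_query(
        data_source="playback_errors",  # hypothetical data source
        dimensions=["city", "client_version"],
        measures=[
            {"kind": "count", "name": "events"},
            {"kind": "distinct_count", "name": "devices", "field": "device_hll"},
        ],
        interval="2024-01-01T00:00:00Z/2024-01-01T01:00:00Z",
    )
    print(json.dumps(query, indent=2))
```

The resulting JSON could be posted to a Druid Broker's native query endpoint. A coarser `granularity` such as "hour" or "day" reduces the number of result rows for long time ranges.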
Real‑Time Alerting – Thresholds, YoY/MoM comparisons, and delayed evaluation windows can be configured to reduce false alarms.
These steps reduce end‑to‑end latency from days to about 30 minutes.
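The alerting step can be pictured with a minimal sketch like the one below. It is an assumption-laden illustration, not RAP's implementation: the `AlertRule` fields and function names are hypothetical. The "delayed evaluation window" is interpreted here as skipping the newest data points, which may still be incomplete. The YoY/MoM comparison is modeled as a relative change against a baseline series taken from an earlier period.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AlertRule:
    # Fire when the value exceeds this absolute threshold (None disables the check).
    threshold: Optional[float] = None
    # Fire when |value - baseline| / |baseline| exceeds this ratio (None disables the check).
    max_change_ratio: Optional[float] = None


def breaches(rule: AlertRule, value: float, baseline: Optional[float]) -> bool:
    """Return True if a single data point violates the rule."""
    if rule.threshold is not None and value > rule.threshold:
        return True
    if rule.max_change_ratio is not None and baseline:
        # Baseline is the same metric from an earlier period (e.g. the same minute last week).
        return abs(value - baseline) / abs(baseline) > rule.max_change_ratio
    return False


def should_alert(
    rule: AlertRule,
    values: Sequence[float],
    baselines: Sequence[Optional[float]],
    delay_points: int = 0,
) -> bool:
    """Evaluate the newest point that is at least `delay_points` old.

    Both sequences are ordered oldest to newest and aligned by index.
    """
    idx = len(values) - 1 - delay_points
    if idx < 0:
        return False  # not enough settled data yet
    return breaches(rule, values[idx], baselines[idx])
```

With `delay_points=1`, a partially ingested latest minute cannot trigger an alert on its own, which is one way delayed evaluation can reduce false alarms.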
RAP 2.x Enhancements
Feedback on RAP 1.x revealed three problems: data loss during task restarts, limited Kafka version support, and dependence on the deprecated Tranquility ingestion path.
Kafka Indexing Service (KIS) Integration – KIS provides exactly-once ingestion from Kafka to Druid. A Supervisor process on the Overlord node manages task lifecycles. RAP 2.x configures KIS automatically when the Kafka source version is 0.10.x or later; otherwise, it routes processed data through a shared Kafka cluster.
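The version-based routing rule above can be sketched as follows. This is an illustration under stated assumptions rather than RAP's code: the function names and the `shared-kafka:9092` broker address are hypothetical, and the sketch assumes that in the fallback case KIS consumes from the shared cluster. The returned supervisor fragment is abridged to a few `ioConfig` fields (the `dataSchema` is omitted), and the exact spec layout varies between Druid versions.

```python
def parse_version(version: str) -> tuple:
    """Turn '0.10.2' or '0.10.x' into a comparable tuple of the numeric parts."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def plan_ingestion(topic, source_brokers, kafka_version, shared_brokers="shared-kafka:9092"):
    """Decide where KIS should consume from, based on the source Kafka version."""
    supports_kis = parse_version(kafka_version)[:2] >= (0, 10)
    # Older source clusters: assume processed data is written to a shared,
    # KIS-compatible cluster and consumed from there (hypothetical address).
    brokers = source_brokers if supports_kis else shared_brokers
    return {
        "via_shared_cluster": not supports_kis,
        "supervisor": {
            "type": "kafka",
            "ioConfig": {
                "topic": topic,
                "consumerProperties": {"bootstrap.servers": brokers},
                "taskCount": 1,
                "taskDuration": "PT1H",
            },
        },
    }


if __name__ == "__main__":
    plan = plan_ingestion("play_log", "source-kafka:9092", "0.8.2")
    print(plan["via_shared_cluster"])
    print(plan["supervisor"]["ioConfig"]["consumerProperties"])
```

Comparing only the first two version components keeps placeholders such as "0.10.x" usable, since the non-numeric part is dropped during parsing.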
Flink Compute Engine – Adding Flink as a compute engine brings processing latency down to the order of seconds while maintaining exactly-once semantics.
Enhanced Diagnostics – RAP 2.x introduces monitoring for stream task latency, real‑time ingestion, and error sampling, allowing operators to pinpoint bottlenecks and data quality issues quickly.
Business Applications
Membership Log Monitoring – RAP processes hundreds of billions of log events daily, delivers minute-level alerts, and generates more than 700 real-time reports, improving fault-resolution speed by 80%.
Recommendation Algorithm Monitoring – RAP tracks clicks, views, and watch time to evaluate bucketed recommendation algorithms, so the team can switch algorithms within 30 minutes to improve user experience.
Smart‑TV Real‑Time Alerting – Multidimensional analysis of playback errors (client version, server IP, city, etc.) supports a 5‑minute alert loop and root‑cause tracing.
Future Directions
RAP will continue to deepen monitoring of the analysis pipeline, add finer-grained diagnostics, improve resource utilization, and strengthen exactly-once guarantees, with the goal of delivering richer analytics such as retention analysis and intelligent analysis.
dbaplus Community
