How Opal Turns iQIYI’s ML Workflow into a Unified AI Platform
Opal is iQIYI's end‑to‑end machine‑learning platform that integrates feature production, sample construction, model training, and deployment with big‑data services, addressing duplicated engineering effort, limited data‑processing capability, and fragmented pipelines to boost efficiency across recommendation, advertising, and risk‑control scenarios.
Overview
Opal is a machine‑learning platform built by iQIYI’s big‑data team, offering a one‑stop solution that spans feature production, sample construction, model training, and deployment. It combines BigData and AI services, providing a low barrier to entry and high performance for the recommendation, advertising, and risk‑control business lines.
Background and Problem Statement
As machine‑learning algorithms become central to many business scenarios, teams face heavy engineering overhead unrelated to core algorithms, such as environment setup, framework compatibility, task scheduling, and integration with big‑data systems. This leads to duplicated development, limited data‑processing capability, fragmented pipelines, and missing tools for critical steps like metric collection.
Design Goals
Opal aims to deliver a one‑stop, BigData + AI development platform that streamlines all stages—from raw logs to feature validation, sample construction, model training, A/B testing, deployment, and online feature access—thereby accelerating the creation of intelligent applications.
Core Features
Feature Management: Hosted feature service supporting offline/online production, quality checks, sample construction, and unified access.
Model Management: Integrated workflow scheduling, resource management, support for single‑node and distributed training, heterogeneous resources (CPU/GPU), multiple frameworks (TensorFlow, PyTorch), deployment, A/B testing, and cross‑cluster data I/O.
Efficiency Tools: Cloud‑based JupyterLab environment, one‑click environment deployment, ad‑hoc big‑data analysis, and pre‑installed common Python packages.
Feature Production
Users build DAGs by dragging operators to load raw logs, transform data, and write features to various storage formats. Real‑time syntax validation shows operator output fields, reducing debugging time. The platform also provides alerting, version management, and re‑run capabilities.
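To make the flow concrete, here is a minimal PySpark sketch of the kind of work a feature‑production DAG performs: load raw logs, transform them, and write out a feature group. The table paths, column names, and aggregations are illustrative assumptions, not Opal's actual operators or schema.

```python
# Minimal PySpark sketch of what a feature-production DAG could execute:
# load raw logs -> transform -> write a feature group. Paths, columns, and
# aggregations are hypothetical, not Opal's real operators or schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature_production_demo").getOrCreate()

# "Load raw logs" operator: read one day of playback logs.
logs = spark.read.parquet("hdfs:///warehouse/raw_logs/playback/dt=2023-06-01")

# "Transform" operator: aggregate per-user watch statistics into features.
user_features = (
    logs.groupBy("user_id")
        .agg(
            F.count("*").alias("play_cnt_1d"),
            F.sum("watch_seconds").alias("watch_time_1d"),
            F.countDistinct("video_id").alias("distinct_videos_1d"),
        )
)

# "Write features" operator: persist the feature group for downstream tasks.
user_features.write.mode("overwrite").parquet(
    "hdfs:///warehouse/features/user_play_stats/dt=2023-06-01"
)
```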
Feature Validation
Opal integrates Amazon’s open‑source Deequ framework for automated feature validation, offering declarative APIs, large‑scale Spark‑based checks, and visual result dashboards. Opal extends Deequ to bind validation tasks to feature groups, supports multi‑cluster submission, adds complex feature type checks, provides a zero‑code web UI, and visualizes results with charts and trend comparisons.
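Opal's validation is configured through its zero‑code web UI, but the underlying checks are Deequ‑style declarative constraints evaluated on Spark. The sketch below shows what such checks look like when written directly against PyDeequ, Deequ's Python wrapper; the feature path and the specific constraints are assumptions for illustration.

```python
# Declarative data-quality checks with PyDeequ, the Python wrapper around Deequ.
# The feature path and the specific constraints are illustrative assumptions.
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

features = spark.read.parquet("hdfs:///warehouse/features/user_play_stats/dt=2023-06-01")

check = Check(spark, CheckLevel.Error, "user feature group checks")
result = (VerificationSuite(spark)
          .onData(features)
          .addCheck(check
                    .isComplete("user_id")          # no null keys
                    .isUnique("user_id")            # one row per user
                    .isNonNegative("watch_time_1d"))
          .run())

# Per-constraint pass/fail results, which a platform can chart and trend over time.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```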
Sample Construction
Samples are treated as special feature groups linked to user and item IDs, enabling traceability from model back to upstream feature tasks and facilitating feature‑model association checks.
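In practice this amounts to joining label events against the relevant feature groups on the shared keys. A hedged PySpark sketch, with hypothetical table and column names:

```python
# Sample construction sketch: join label events with feature groups on the
# user_id / item_id keys. Table paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample_construction_demo").getOrCreate()

labels = spark.read.parquet("hdfs:///warehouse/labels/click_events/dt=2023-06-01")
user_feats = spark.read.parquet("hdfs:///warehouse/features/user_play_stats/dt=2023-06-01")
item_feats = spark.read.parquet("hdfs:///warehouse/features/item_stats/dt=2023-06-01")

# The resulting sample set is itself a feature group keyed by (user_id, item_id),
# which preserves lineage back to the upstream feature tasks.
samples = (labels
           .join(user_feats, on="user_id", how="left")
           .join(item_feats, on="item_id", how="left"))

samples.write.mode("overwrite").parquet(
    "hdfs:///warehouse/samples/ctr_train/dt=2023-06-01"
)
```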
Model Training
Training integrates iQIYI’s QBFS for cross‑cluster HDFS routing and Kerberos authentication, and uses the Gear workflow engine for job scheduling, monitoring, and alerts. Distributed training is enabled via two approaches (a worker‑side sketch follows the list):
Adapted open‑source TonY (TensorFlow on YARN) to run ML jobs on Hadoop, integrated with iQIYI’s scheduler and resource manager.
Integration with the in‑house Jarvis platform for GPU‑accelerated training and unified management of training, feature production, and downstream tasks.
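TonY's role is to allocate YARN containers and set the TF_CONFIG cluster specification for each task; the training script itself only needs a standard distribution strategy. Below is a minimal worker‑side sketch; the model, data, and hyperparameters are placeholders, not Opal's actual training code.

```python
# Worker-side sketch of a distributed TensorFlow job of the kind TonY can launch
# on YARN. TonY supplies TF_CONFIG per task; the script uses a standard strategy.
# The model and data below are placeholders to keep the sketch self-contained.
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])

# In a real job the samples would be read from HDFS (e.g. the output of the
# sample-construction step); random data stands in for them here.
x = np.random.rand(1024, 32).astype("float32")
y = (np.random.rand(1024, 1) > 0.5).astype("float32")

model.fit(x, y, batch_size=128, epochs=2)
```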
Visualization and Workflow Construction
Opal offers a visual experiment builder where users drag‑and‑drop nodes, configure parameters, and launch experiments, simplifying workflow creation compared to command‑line approaches.
Model Deployment
Deployment is handled via integration with the Jarvis platform; Opal does not provide native deployment capabilities but orchestrates the process through Jarvis.
Online Feature Serving
Features are materialized into online KV stores via feature views, with a unified SDK for retrieval, supporting real‑time or batch updates and custom post‑processing expressions.
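Opal's serving SDK is not publicly documented, so the following is a purely hypothetical client sketch meant only to illustrate the call pattern described above: look up a feature view in an online KV store by entity ID and return the requested fields. All class and method names are invented.

```python
# Hypothetical feature-retrieval client; Opal's real SDK, class names, and
# parameters are not public, so everything here is illustrative only.
from typing import Dict, List

class FeatureClient:
    """Toy stand-in for a feature-serving SDK backed by an online KV store."""

    def __init__(self, kv_store: Dict[str, Dict[str, float]]):
        self._kv = kv_store  # a real client would talk to Redis/HBase/etc.

    def get_features(self, view: str, entity_id: str, fields: List[str]) -> Dict[str, float]:
        row = self._kv.get(f"{view}:{entity_id}", {})
        return {f: row.get(f, 0.0) for f in fields}

# Usage: a ranking service fetching a user's features at request time.
client = FeatureClient({"user_play_stats:u_42": {"play_cnt_1d": 7.0, "watch_time_1d": 5400.0}})
print(client.get_features("user_play_stats", "u_42", ["play_cnt_1d", "watch_time_1d"]))
```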
Efficiency Tools
JupyterLab is customized to provide workspace auto‑save/recovery, cross‑cluster Hadoop client access, and pre‑installed data‑science libraries (TensorFlow, Pandas, Scikit‑learn, etc.). Users can submit PySpark ad‑hoc jobs directly from notebooks.
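An ad‑hoc analysis from a notebook cell might look like the following; the session setup is simplified and the table name is a stand‑in.

```python
# Ad-hoc PySpark analysis as it might be run from a notebook cell in the hosted
# JupyterLab environment. The table name is a stand-in, not a real Opal table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc_notebook_query").getOrCreate()

daily_plays = spark.sql("""
    SELECT dt,
           COUNT(*) AS plays,
           COUNT(DISTINCT user_id) AS active_users
    FROM warehouse.playback_logs
    WHERE dt >= '2023-06-01'
    GROUP BY dt
    ORDER BY dt
""")
daily_plays.show()
```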
Business Practice Cases
Advertising Team: Replaced a fragmented offline evaluation pipeline with Opal, achieving a 3× speedup in feature evaluation and reducing the evaluation cycle from 5 days to 1.5 days, eliminating backlog and boosting ad revenue.
Risk‑Control Team: Consolidated batch and real‑time feature services, cutting latency from 30 ms to 10 ms (a 3× reduction) and reducing real‑time feature data accumulation by 88%, leading to a 5.5% increase in risk‑detection accuracy.
Recommendation Platform: Unified feature production and validation, improving feature‑generation speed by 40–100% and reducing validation effort by 20–50% while improving overall feature quality.
Future Roadmap
Real‑time feature production and quality validation.
Performance optimization of online feature services via caching and encoding improvements.
Online learning for faster model updates.
Privacy‑preserving computation using federated learning and secure multi‑party computation.