How Opal Turns iQIYI’s ML Workflow into a Unified AI Platform
Opal is iQIYI's end‑to‑end machine‑learning platform that integrates feature production, sample construction, model training, and deployment with big‑data services, addressing duplicated engineering effort, limited data‑processing capability, and fragmented pipelines to boost efficiency across recommendation, advertising, and risk‑control scenarios.
Overview
Opal is a machine‑learning platform built by iQIYI’s big‑data team, offering a one‑stop solution that spans feature production, sample construction, model training, and deployment. It combines BigData and AI services, providing a low barrier to entry and high performance for the recommendation, advertising, and risk‑control business lines.
Background and Problem Statement
As machine‑learning algorithms become central to many business scenarios, teams face heavy engineering overhead unrelated to core algorithms, such as environment setup, framework compatibility, task scheduling, and integration with big‑data systems. This leads to duplicated development, limited data‑processing capability, fragmented pipelines, and missing tools for critical steps like metric collection.
Design Goals
Opal aims to deliver a one‑stop, BigData + AI development platform that streamlines all stages—from raw logs to feature validation, sample construction, model training, A/B testing, deployment, and online feature access—thereby accelerating the creation of intelligent applications.
Core Features
Feature Management: Hosted feature service supporting offline/online production, quality checks, sample construction, and unified access.
Model Management: Integrated workflow scheduling, resource management, support for single‑node and distributed training, heterogeneous resources (CPU/GPU), multiple frameworks (TensorFlow, PyTorch), deployment, A/B testing, and cross‑cluster data I/O.
Efficiency Tools: Cloud‑based JupyterLab environment, one‑click environment deployment, ad‑hoc big‑data analysis, and pre‑installed common Python packages.
Feature Production
Users build DAGs by dragging operators to load raw logs, transform data, and write features to various storage formats. Real‑time syntax validation shows operator output fields, reducing debugging time. The platform also provides alerting, version management, and re‑run capabilities.
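To make the flow concrete, here is a minimal PySpark sketch of the kind of work a feature‑production DAG performs: load raw logs, transform them, and write out a feature group. The table paths, column names, and aggregations are illustrative assumptions, not Opal's actual operators or schema.

```python
# Minimal PySpark sketch of what a feature-production DAG could execute:
# load raw logs -> transform -> write a feature group. Paths, columns, and
# aggregations are hypothetical, not Opal's real operators or schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature_production_demo").getOrCreate()

# "Load raw logs" operator: read one day of playback logs.
logs = spark.read.parquet("hdfs:///warehouse/raw_logs/playback/dt=2023-06-01")

# "Transform" operator: aggregate per-user watch statistics into features.
user_features = (
    logs.groupBy("user_id")
        .agg(
            F.count("*").alias("play_cnt_1d"),
            F.sum("watch_seconds").alias("watch_time_1d"),
            F.countDistinct("video_id").alias("distinct_videos_1d"),
        )
)

# "Write features" operator: persist the feature group for downstream tasks.
user_features.write.mode("overwrite").parquet(
    "hdfs:///warehouse/features/user_play_stats/dt=2023-06-01"
)
```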
Feature Validation
Opal integrates Amazon’s open‑source Deequ framework for automated feature validation, offering declarative APIs, large‑scale Spark‑based checks, and visual result dashboards. Opal extends Deequ to bind validation tasks to feature groups, supports multi‑cluster submission, adds complex feature type checks, provides a zero‑code web UI, and visualizes results with charts and trend comparisons.
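Opal's validation is configured through its zero‑code web UI, but the underlying checks are Deequ‑style declarative constraints evaluated on Spark. The sketch below shows what such checks look like when written directly against PyDeequ, Deequ's Python wrapper; the feature path and the specific constraints are assumptions for illustration.

```python
# Declarative data-quality checks with PyDeequ, the Python wrapper around Deequ.
# The feature path and the specific constraints are illustrative assumptions.
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

features = spark.read.parquet("hdfs:///warehouse/features/user_play_stats/dt=2023-06-01")

check = Check(spark, CheckLevel.Error, "user feature group checks")
result = (VerificationSuite(spark)
          .onData(features)
          .addCheck(check
                    .isComplete("user_id")          # no null keys
                    .isUnique("user_id")            # one row per user
                    .isNonNegative("watch_time_1d"))
          .run())

# Per-constraint pass/fail results, which a platform can chart and trend over time.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```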
Sample Construction
Samples are treated as special feature groups linked to user and item IDs, enabling traceability from model back to upstream feature tasks and facilitating feature‑model association checks.
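In practice this amounts to joining label events against the relevant feature groups on the shared keys. A hedged PySpark sketch, with hypothetical table and column names:

```python
# Sample construction sketch: join label events with feature groups on the
# user_id / item_id keys. Table paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample_construction_demo").getOrCreate()

labels = spark.read.parquet("hdfs:///warehouse/labels/click_events/dt=2023-06-01")
user_feats = spark.read.parquet("hdfs:///warehouse/features/user_play_stats/dt=2023-06-01")
item_feats = spark.read.parquet("hdfs:///warehouse/features/item_stats/dt=2023-06-01")

# The resulting sample set is itself a feature group keyed by (user_id, item_id),
# which preserves lineage back to the upstream feature tasks.
samples = (labels
           .join(user_feats, on="user_id", how="left")
           .join(item_feats, on="item_id", how="left"))

samples.write.mode("overwrite").parquet(
    "hdfs:///warehouse/samples/ctr_train/dt=2023-06-01"
)
```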
Model Training
Training integrates iQIYI’s QBFS for cross‑cluster HDFS routing and Kerberos authentication, and uses the Gear workflow engine for job scheduling, monitoring, and alerts. Distributed training is enabled via two approaches (a worker‑side sketch follows the list):
Adapted open‑source TonY (TensorFlow on YARN) to run ML jobs on Hadoop, integrated with iQIYI’s scheduler and resource manager.
Integration with the in‑house Jarvis platform for GPU‑accelerated training and unified management of training, feature production, and downstream tasks.
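TonY's role is to allocate YARN containers and set the TF_CONFIG cluster specification for each task; the training script itself only needs a standard distribution strategy. Below is a minimal worker‑side sketch; the model, data, and hyperparameters are placeholders, not Opal's actual training code.

```python
# Worker-side sketch of a distributed TensorFlow job of the kind TonY can launch
# on YARN. TonY supplies TF_CONFIG per task; the script uses a standard strategy.
# The model and data below are placeholders to keep the sketch self-contained.
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])

# In a real job the samples would be read from HDFS (e.g. the output of the
# sample-construction step); random data stands in for them here.
x = np.random.rand(1024, 32).astype("float32")
y = (np.random.rand(1024, 1) > 0.5).astype("float32")

model.fit(x, y, batch_size=128, epochs=2)
```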
Visualization and Workflow Construction
Opal offers a visual experiment builder where users drag‑and‑drop nodes, configure parameters, and launch experiments, simplifying workflow creation compared to command‑line approaches.
Model Deployment
Deployment is handled via integration with the Jarvis platform; Opal does not provide native deployment capabilities but orchestrates the process through Jarvis.
Online Feature Serving
Features are materialized into online KV stores via feature views, with a unified SDK for retrieval, supporting real‑time or batch updates and custom post‑processing expressions.
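Opal's serving SDK is not publicly documented, so the following is a purely hypothetical client sketch meant only to illustrate the call pattern described above: look up a feature view in an online KV store by entity ID and return the requested fields. All class and method names are invented.

```python
# Hypothetical feature-retrieval client; Opal's real SDK, class names, and
# parameters are not public, so everything here is illustrative only.
from typing import Dict, List

class FeatureClient:
    """Toy stand-in for a feature-serving SDK backed by an online KV store."""

    def __init__(self, kv_store: Dict[str, Dict[str, float]]):
        self._kv = kv_store  # a real client would talk to Redis/HBase/etc.

    def get_features(self, view: str, entity_id: str, fields: List[str]) -> Dict[str, float]:
        row = self._kv.get(f"{view}:{entity_id}", {})
        return {f: row.get(f, 0.0) for f in fields}

# Usage: a ranking service fetching a user's features at request time.
client = FeatureClient({"user_play_stats:u_42": {"play_cnt_1d": 7.0, "watch_time_1d": 5400.0}})
print(client.get_features("user_play_stats", "u_42", ["play_cnt_1d", "watch_time_1d"]))
```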
Efficiency Tools
JupyterLab is customized to provide workspace auto‑save/recovery, cross‑cluster Hadoop client access, and pre‑installed data‑science libraries (TensorFlow, Pandas, Scikit‑learn, etc.). Users can submit PySpark ad‑hoc jobs directly from notebooks.
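An ad‑hoc analysis from a notebook cell might look like the following; the session setup is simplified and the table name is a stand‑in.

```python
# Ad-hoc PySpark analysis as it might be run from a notebook cell in the hosted
# JupyterLab environment. The table name is a stand-in, not a real Opal table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc_notebook_query").getOrCreate()

daily_plays = spark.sql("""
    SELECT dt,
           COUNT(*) AS plays,
           COUNT(DISTINCT user_id) AS active_users
    FROM warehouse.playback_logs
    WHERE dt >= '2023-06-01'
    GROUP BY dt
    ORDER BY dt
""")
daily_plays.show()
```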
Business Practice Cases
Advertising Team: Replaced a fragmented offline evaluation pipeline with Opal, achieving a 3× speedup in feature evaluation and reducing the evaluation cycle from 5 days to 1.5 days, eliminating backlog and boosting ad revenue.
Risk‑Control Team: Consolidated batch and real‑time feature services, cutting latency from 30 ms to 10 ms (a 3× reduction) and reducing real‑time feature data accumulation by 88%, leading to a 5.5% increase in risk‑detection accuracy.
Recommendation Platform: Unified feature production and validation, improving feature‑generation speed by 40–100% and reducing validation effort by 20–50% while improving overall feature quality.
Future Roadmap
Real‑time feature production and quality validation.
Performance optimization of online feature services via caching and encoding improvements.
Online learning for faster model updates.
Privacy‑preserving computation using federated learning and secure multi‑party computation.