iQIYI Deep Semantic Representation Learning Framework: Design, Challenges, and Applications
iQIYI’s deep semantic representation learning framework integrates multimodal content, knowledge graphs, and user behavior through layered data, feature, strategy, and application components. It employs early, late, and hybrid fusion with Transformers, GCNs, and other deep models to deliver high‑quality embeddings that boost recommendation, search, and streaming performance across dozens of business scenarios.
The article introduces iQIYI's deep semantic representation learning framework, which was designed based on both academic research and industrial practice to serve a variety of business scenarios such as recommendation, search, and live streaming.
Background : The concept of distributed representation dates back to J.R. Firth (1957) and was formalized by Hinton (1986). Word2vec (2013) popularized word embeddings, leading to a surge of embedding techniques across text, image, video, and audio domains. Embeddings transform high‑dimensional sparse one‑hot vectors into low‑dimensional dense semantic vectors, enabling similarity measurement and serving as features for downstream models.
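The core mechanic described above can be sketched in a few lines: each token indexes a row of a dense matrix instead of a one-hot vector, and similarity between items reduces to a vector operation such as cosine. The vocabulary, dimension, and random initialization below are illustrative assumptions, not iQIYI's actual setup (a trained model would place related tokens close together, which random vectors will not).

```python
import numpy as np

# Toy vocabulary: each token maps to one row of a dense embedding matrix.
rng = np.random.default_rng(0)
vocab = {"movie": 0, "film": 1, "banana": 2}
emb = rng.normal(size=(len(vocab), 8))  # 8-dim dense vectors replace 3-dim one-hots

def cosine(u, v):
    # Cosine similarity: the standard distance measure between embeddings.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# With random (untrained) vectors this value is arbitrary; after training,
# semantically related tokens like "movie"/"film" would score near 1.
sim = cosine(emb[vocab["movie"]], emb[vocab["film"]])
```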
Motivation : Traditional shallow embedding models cannot incorporate rich side information (e.g., multimodal content, knowledge graphs) and suffer from cold‑start and poor generalization. iQIYI therefore proposes a deep semantic learning framework that fuses side information with powerful deep models (Transformer, GCN, etc.) to produce high‑quality, generalizable embeddings.
Challenges :
Entity diversity: multiple entity types (text, image, video, user, community, query) and heterogeneous relationships.
Rich side information: multimodal content and meta attributes require effective extraction and fusion.
Varied business scenarios: recall, ranking, deduplication, diversity control in recommendation; semantic recall, relevance matching, clustering in search.
Framework Overview : The system consists of four layers: Data Layer, Feature Layer, Strategy Layer, and Application Layer.
Data Layer : Collects user behavior logs to construct sequences and graphs for training data.
Feature Layer : Extracts and fuses multimodal features (text, image, audio, video). Text features are obtained via pretrained language models (e.g., BERT, ALBERT) and enhanced with topic modeling (LDA) and word‑level aggregation (WME, CPTW). Image features use ImageNet‑based classifiers (EfficientNet) and self‑supervised learning (Selfie). Audio features use VGGish on audio waveforms, while video features are represented by aggregated key‑frame embeddings.
Fusion Strategies : Includes early fusion (concatenation before learning), late fusion (individual feature learning then merging), and hybrid fusion (combination of both) to capture cross‑modal interactions.
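The three fusion strategies can be illustrated with a minimal numpy sketch. The feature dimensions, projection matrices, and averaging rule below are illustrative assumptions; the article does not specify iQIYI's exact fusion operators.

```python
import numpy as np

rng = np.random.default_rng(0)
text_feat  = rng.normal(size=64)   # stand-in for a sentence embedding
image_feat = rng.normal(size=128)  # stand-in for a pooled CNN feature

# Early fusion: concatenate raw features before any joint model sees them.
early = np.concatenate([text_feat, image_feat])   # shape (192,)

# Late fusion: learn a per-modality projection into a shared space, then merge.
W_t = rng.normal(size=(64, 32))
W_i = rng.normal(size=(128, 32))
late = (text_feat @ W_t + image_feat @ W_i) / 2   # shape (32,)

# Hybrid fusion: combine both views so downstream layers can mix them further.
W_e = rng.normal(size=(192, 32))
hybrid = np.concatenate([early @ W_e, late])      # shape (64,)
```

In practice the projections would be learned end to end; the point here is only where the merge happens relative to feature learning.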
Deep Semantic Models :
Content‑based models: use only node content (metadata, multimodal signals) with supervised training; examples include ImageNet‑based image embeddings and task‑specific embeddings for video tags.
Matching‑based models: combine content and user behavior via Siamese or multi‑tower architectures (e.g., DSSM, CDML) with late or hybrid fusion.
Sequence‑based models: replace shallow sequential networks with Transformers (SASRec, BERT4Rec, XLNet) to capture long‑range dependencies in user behavior.
Graph‑based models: incorporate graph structure and side information using Graph Neural Networks (GCN, PinSAGE, ClusterGCN) or scalable methods like ProNE; heterogeneous graph embeddings (HINE, HGT, GATNE‑I) handle multiple node/edge types.
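The matching-based two-tower pattern above (as in DSSM) can be sketched as two independent encoders that meet only in a shared output space, where relevance is a dot product. The tower sizes, activation, and random weights are illustrative assumptions, not the production architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, W1, W2):
    # One tiny MLP "tower": linear -> tanh -> linear, then L2-normalize,
    # so the final dot product is a cosine similarity.
    h = np.tanh(x @ W1) @ W2
    return h / np.linalg.norm(h)

# Query and item towers have different input dims but share the 32-dim output space.
Wq1, Wq2 = rng.normal(size=(100, 64)), rng.normal(size=(64, 32))
Wi1, Wi2 = rng.normal(size=(300, 64)), rng.normal(size=(64, 32))

query_vec = tower(rng.normal(size=100), Wq1, Wq2)
item_vec  = tower(rng.normal(size=300), Wi1, Wi2)

score = float(query_vec @ item_vec)  # relevance score in [-1, 1]
```

Because the towers are independent, item embeddings can be precomputed offline and served via nearest-neighbor search, which is what makes this family attractive for recall.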
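For the graph-based models, a single GCN propagation step (in the Kipf & Welling formulation) shows how neighbor structure and node features combine. The toy 4-node graph and random weights below are illustrative assumptions.

```python
import numpy as np

# Toy 4-node undirected graph as an adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)

# GCN layer: H' = relu(D^{-1/2} (A + I) D^{-1/2} H W)
A_hat = A + np.eye(4)                               # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
norm_adj = D_inv_sqrt @ A_hat @ D_inv_sqrt          # symmetric normalization

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))   # node features (e.g., fused multimodal vectors)
W = rng.normal(size=(8, 16))  # learnable layer weights
H_next = np.maximum(norm_adj @ H @ W, 0.0)          # shape (4, 16)
```

Each output row is a node embedding that already mixes in its neighbors' features; stacking such layers widens the receptive field, and samplers like PinSAGE or ClusterGCN make the same computation scale to industrial graphs.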
Experimental Results : The framework has been deployed in 15 iQIYI business lines, covering 7 scenarios (recall, ranking, deduplication, diversity, semantic matching, clustering). A/B tests show an increase of over 5 minutes in average user watch time for short and mini videos and a >6% lift in search semantic relevance accuracy compared to baseline single‑feature models.
Future Directions :
Develop video‑level pretrained semantic models (e.g., UniViLM) using large‑scale video captioning data.
Integrate knowledge‑graph priors into text and video representations (e.g., KEPLER, KGCN) to enhance semantic quality.
Expand the framework to additional iQIYI services such as intelligent content creation and other multimedia scenarios.
References : The article lists 35 citations covering foundational works on word embeddings, Transformer, GCN, DSSM, various multimodal pooling methods, and recent advances in graph representation learning.
iQIYI Technical Product Team