iQIYI Deep Semantic Representation Learning Framework: Design, Challenges, and Applications
iQIYI’s deep semantic representation learning framework integrates multimodal content, knowledge graphs, and user behavior through layered data, feature, strategy, and application components. It employs early, late, and hybrid fusion with Transformers, GCNs, and other deep models to deliver high‑quality embeddings that boost recommendation, search, and streaming performance across dozens of business scenarios.
The article introduces iQIYI's deep semantic representation learning framework, which was designed based on both academic research and industrial practice to serve a variety of business scenarios such as recommendation, search, and live streaming.
Background : The concept of distributed representation dates back to J.R. Firth (1957) and was formalized by Hinton (1986). Word2vec (2013) popularized word embeddings, leading to a surge of embedding techniques across text, image, video, and audio domains. Embeddings transform high‑dimensional sparse one‑hot vectors into low‑dimensional dense semantic vectors, enabling similarity measurement and serving as features for downstream models.
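The core mechanic described above can be sketched in a few lines: each token indexes a row of a dense matrix instead of a one-hot vector, and similarity between items reduces to a vector operation such as cosine. The vocabulary, dimension, and random initialization below are illustrative assumptions, not iQIYI's actual setup (a trained model would place related tokens close together, which random vectors will not).

```python
import numpy as np

# Toy vocabulary: each token maps to one row of a dense embedding matrix.
rng = np.random.default_rng(0)
vocab = {"movie": 0, "film": 1, "banana": 2}
emb = rng.normal(size=(len(vocab), 8))  # 8-dim dense vectors replace 3-dim one-hots

def cosine(u, v):
    # Cosine similarity: the standard distance measure between embeddings.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# With random (untrained) vectors this value is arbitrary; after training,
# semantically related tokens like "movie"/"film" would score near 1.
sim = cosine(emb[vocab["movie"]], emb[vocab["film"]])
```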
Motivation : Traditional shallow embedding models cannot incorporate rich side information (e.g., multimodal content, knowledge graphs) and suffer from cold‑start and poor generalization. iQIYI therefore proposes a deep semantic learning framework that fuses side information with powerful deep models (Transformer, GCN, etc.) to produce high‑quality, generalizable embeddings.
Challenges :
Entity diversity: multiple entity types (text, image, video, user, community, query) and heterogeneous relationships.
Rich side information: multimodal content and meta attributes require effective extraction and fusion.
Varied business scenarios: recall, ranking, deduplication, diversity control in recommendation; semantic recall, relevance matching, clustering in search.
Framework Overview : The system consists of four layers: Data Layer, Feature Layer, Strategy Layer, and Application Layer.
Data Layer : Collects user behavior logs to construct sequences and graphs for training data.
Feature Layer : Extracts and fuses multimodal features (text, image, audio, video). Text features are obtained via pretrained language models (e.g., BERT, ALBERT) and enhanced with topic modeling (LDA) and word‑level aggregation (WME, CPTW). Image features use ImageNet‑based classifiers (EfficientNet) and self‑supervised learning (Selfie). Audio features use VGGish on audio waveforms, while video features are represented by aggregated key‑frame embeddings.
Fusion Strategies : Includes early fusion (concatenation before learning), late fusion (individual feature learning then merging), and hybrid fusion (combination of both) to capture cross‑modal interactions.
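The three fusion strategies can be illustrated with a minimal numpy sketch. The feature dimensions, projection matrices, and averaging rule below are illustrative assumptions; the article does not specify iQIYI's exact fusion operators.

```python
import numpy as np

rng = np.random.default_rng(0)
text_feat  = rng.normal(size=64)   # stand-in for a sentence embedding
image_feat = rng.normal(size=128)  # stand-in for a pooled CNN feature

# Early fusion: concatenate raw features before any joint model sees them.
early = np.concatenate([text_feat, image_feat])   # shape (192,)

# Late fusion: learn a per-modality projection into a shared space, then merge.
W_t = rng.normal(size=(64, 32))
W_i = rng.normal(size=(128, 32))
late = (text_feat @ W_t + image_feat @ W_i) / 2   # shape (32,)

# Hybrid fusion: combine both views so downstream layers can mix them further.
W_e = rng.normal(size=(192, 32))
hybrid = np.concatenate([early @ W_e, late])      # shape (64,)
```

In practice the projections would be learned end to end; the point here is only where the merge happens relative to feature learning.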
Deep Semantic Models :
Content‑based models: use only node content (metadata, multimodal signals) with supervised training; examples include ImageNet‑based image embeddings and task‑specific embeddings for video tags.
Matching‑based models: combine content and user behavior via Siamese or multi‑tower architectures (e.g., DSSM, CDML) with late or hybrid fusion.
Sequence‑based models: replace shallow sequential networks with Transformers (SASRec, BERT4Rec, XLNet) to capture long‑range dependencies in user behavior.
Graph‑based models: incorporate graph structure and side information using Graph Neural Networks (GCN, PinSAGE, ClusterGCN) or scalable methods like ProNE; heterogeneous graph embeddings (HINE, HGT, GATNE‑I) handle multiple node/edge types.
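The matching-based two-tower pattern above (as in DSSM) can be sketched as two independent encoders that meet only in a shared output space, where relevance is a dot product. The tower sizes, activation, and random weights are illustrative assumptions, not the production architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, W1, W2):
    # One tiny MLP "tower": linear -> tanh -> linear, then L2-normalize,
    # so the final dot product is a cosine similarity.
    h = np.tanh(x @ W1) @ W2
    return h / np.linalg.norm(h)

# Query and item towers have different input dims but share the 32-dim output space.
Wq1, Wq2 = rng.normal(size=(100, 64)), rng.normal(size=(64, 32))
Wi1, Wi2 = rng.normal(size=(300, 64)), rng.normal(size=(64, 32))

query_vec = tower(rng.normal(size=100), Wq1, Wq2)
item_vec  = tower(rng.normal(size=300), Wi1, Wi2)

score = float(query_vec @ item_vec)  # relevance score in [-1, 1]
```

Because the towers are independent, item embeddings can be precomputed offline and served via nearest-neighbor search, which is what makes this family attractive for recall.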
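For the graph-based models, a single GCN propagation step (in the Kipf & Welling formulation) shows how neighbor structure and node features combine. The toy 4-node graph and random weights below are illustrative assumptions.

```python
import numpy as np

# Toy 4-node undirected graph as an adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)

# GCN layer: H' = relu(D^{-1/2} (A + I) D^{-1/2} H W)
A_hat = A + np.eye(4)                               # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
norm_adj = D_inv_sqrt @ A_hat @ D_inv_sqrt          # symmetric normalization

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))   # node features (e.g., fused multimodal vectors)
W = rng.normal(size=(8, 16))  # learnable layer weights
H_next = np.maximum(norm_adj @ H @ W, 0.0)          # shape (4, 16)
```

Each output row is a node embedding that already mixes in its neighbors' features; stacking such layers widens the receptive field, and samplers like PinSAGE or ClusterGCN make the same computation scale to industrial graphs.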
Experimental Results : The framework has been deployed in 15 iQIYI business lines, covering 7 scenarios (recall, ranking, deduplication, diversity, semantic matching, clustering). A/B tests show an increase of over 5 minutes in average user watch time for short and mini videos and a >6% lift in search semantic relevance accuracy compared to baseline single‑feature models.
Future Directions :
Develop video‑level pretrained semantic models (e.g., UniViLM) using large‑scale video captioning data.
Integrate knowledge‑graph priors into text and video representations (e.g., KEPLER, KGCN) to enhance semantic quality.
Expand the framework to additional iQIYI services such as intelligent content creation and other multimedia scenarios.
References : The article lists 35 citations covering foundational works on word embeddings, Transformer, GCN, DSSM, various multimodal pooling methods, and recent advances in graph representation learning.
iQIYI Technical Product Team