
Large-Scale Hierarchical Classification Algorithm for iQIYI Short Videos

iQIYI’s large‑scale hierarchical classification system combines multimodal text and image embeddings, low‑rank multimodal fusion, and a dense hierarchical multilabel network with cascade‑style weighting to assign accurate type tags to short videos, boosting production efficiency and personalized recommendation diversity.

iQIYI Technical Product Team

In recent years, short‑video platforms have grown explosively, generating massive amounts of user‑generated content (UGC) that puts huge pressure on production systems. One of the key challenges is to assign accurate type tags to each video quickly, which is essential for intelligent distribution. iQIYI’s large‑scale hierarchical classification provides such "type tags" and supports its short‑video recommendation pipeline.

Example: a short video was automatically classified as "Game‑Theme‑Role‑Playing", matching the manual label. The model can even disambiguate IPs like "Marvel" or "Spider‑Man", which are ambiguous between film and game domains, thanks to extensive training data.

Type tags are widely used inside iQIYI. In the short‑video production workflow they improve generation, admission, review and annotation efficiency; in personalized recommendation they become a core feature for diversity control, user profiling, recall, and ranking.

Technical challenges

The classification taxonomy is a manually designed hierarchy with 3–5 levels, nearly 800 leaf categories, and complex mutual‑exclusion and co‑occurrence constraints. Moreover, short videos contain multimodal information (title, description, cover image, video, audio) and contextual signals (source program, uploader), all of which can aid classification.
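To make the hierarchy constraint concrete, one way to model it is to have every leaf tag imply its full ancestor path, so a leaf label expands into a multilabel target. A minimal Python sketch (the tag names and the `TAXONOMY` table are hypothetical examples, not iQIYI's actual taxonomy):

```python
# Hypothetical sketch: each leaf tag maps to its ancestor chain in a
# 3-5 level taxonomy, so predicting a leaf implies its parent labels.
TAXONOMY = {
    "Game-Theme-Role-Playing": ["Game", "Game-Theme"],  # illustrative entries
    "Film-Superhero": ["Film"],
}

def full_label_set(leaf):
    """Expand a leaf tag into the multilabel target (leaf plus ancestors)."""
    return set(TAXONOMY[leaf]) | {leaf}
```

Mutual-exclusion and co-occurrence rules can then be checked against these expanded label sets rather than against isolated leaves.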

Solution Overview

The system consists of two main modules: Feature Representation and Hierarchical Classification. Feature representation extracts multimodal embeddings, which are then fed into a hierarchical classifier.

Feature Representation – Text

Four common text modeling approaches are discussed:

BOW (Bag‑of‑Words) – simple and fast; competitive when enriched with engineered features.

CNN – captures local n‑gram patterns via convolution and max‑pooling.

RNN (GRU/LSTM) – models sequential dependencies and long‑range context.

Attention – highlights important tokens and can be combined with CNN/RNN.

Our final text encoder combines BOW with a CNN‑plus‑Attention pipeline to balance efficiency and accuracy.
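As a rough illustration of how these pieces fit together, here is a minimal NumPy sketch of a CNN-plus-attention encoder whose output is concatenated with a BOW-style averaged embedding. All weight shapes and names are illustrative placeholders, not the production encoder:

```python
import numpy as np

def text_encoder(token_embs, conv_w, attn_v, window=3):
    """CNN over n-gram windows + attention pooling, concatenated with a
    BOW-style mean embedding. Assumes len(text) >= window.
    token_embs: (T, d) token embeddings
    conv_w:     (window*d, h) convolution filter weights
    attn_v:     (h,) attention scoring vector
    """
    T, d = token_embs.shape
    # CNN part: project each sliding n-gram window to a hidden vector
    grams = np.stack([token_embs[i:i + window].reshape(-1)
                      for i in range(T - window + 1)])   # (T-w+1, w*d)
    h = np.tanh(grams @ conv_w)                          # (T-w+1, h)
    # Attention part: softmax-weight the n-gram vectors and sum them
    scores = h @ attn_v
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    cnn_vec = alpha @ h                                  # (h,)
    bow_vec = token_embs.mean(axis=0)                    # (d,) BOW-style average
    return np.concatenate([bow_vec, cnn_vec])            # (d + h,)
```

The BOW branch keeps the encoder cheap and robust, while the attended CNN branch captures the discriminative n-grams.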

Feature Representation – Image

Cover images provide complementary semantic cues. Three common fusion strategies are evaluated:

Feature extraction from a pre‑trained ImageNet model.

Fine‑tuning the pre‑trained model on the type‑tag task (chosen approach).

End‑to‑end multimodal model fusion.

We fine‑tuned an Xception network, which achieved ~3% higher accuracy than ResNet‑50 with comparable latency.
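The difference between strategies 1 and 2 is simply which parameters receive gradients. The toy NumPy sketch below substitutes a single linear layer for the real Xception backbone purely to show that distinction; every name and shape here is an illustrative assumption, not the actual training code:

```python
import numpy as np

# Toy stand-in: a linear "backbone" plays the role of the pretrained
# Xception; "head" is the new type-tag classifier layer.
rng = np.random.default_rng(0)
backbone = rng.normal(size=(32, 16))   # "pretrained" weights
head = np.zeros((16, 5))               # freshly initialized classifier head

def sgd_step(x, grad_out, lr=0.1, fine_tune=False):
    """One SGD step. Feature extraction (fine_tune=False) updates only
    the head; fine-tuning also backpropagates into the backbone."""
    global backbone, head
    feats = np.maximum(x @ backbone, 0.0)        # ReLU backbone features
    g_feats = (head @ grad_out) * (feats > 0)    # gradient through the ReLU
    head -= lr * np.outer(feats, grad_out)
    if fine_tune:
        backbone -= lr * np.outer(x, g_feats)
```

Fine-tuning lets the backbone adapt its visual features to the type-tag label space, which is where the reported gain over frozen ImageNet features comes from.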

Feature Fusion

After obtaining vector embeddings for each modality, we experimented with three fusion methods and selected Low‑rank Multimodal Fusion (LMF) for its superior performance:

Concatenation – simple baseline.

CentralNet – multi‑task constrained fusion (≈+1% over concatenation).

LMF – approximates full outer‑product via low‑rank factorization (≈+0.2% over CentralNet).
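Following the LMF formulation of Liu et al. [7], the full outer-product fusion tensor is approximated by rank-r modality-specific factors, so fusion cost grows linearly rather than exponentially in the number of modalities. A minimal NumPy sketch (shapes and variable names are illustrative):

```python
import numpy as np

def lmf(mods, factors):
    """Low-rank multimodal fusion: for each rank component, project every
    modality and take the elementwise product, then sum over components.
    mods:    list of M modality vectors z_m (each with a trailing 1 appended,
             as in the LMF paper, so unimodal terms are retained)
    factors: list of M arrays, each of shape (r, d_h, d_m)
    """
    r, d_h, _ = factors[0].shape
    h = np.zeros(d_h)
    for i in range(r):
        prod = np.ones(d_h)
        for z, W in zip(mods, factors):
            prod *= W[i] @ z          # elementwise product across modalities
        h += prod
    return h                          # fused representation, shape (d_h,)
```

Because each modality enters the product exactly once, the fused vector is multilinear in the modality inputs, mirroring the outer-product tensor it approximates.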

Hierarchical Classification

Four typical industry approaches are reviewed:

A tree of independent models (the "pinball" approach) – each node trains its own classifier, and a video is routed top‑down through the tree.

Cascade strategy – lower‑level predictions feed higher‑level models.

Regularization constraints – enforce similarity between parent‑child parameters.

Multi‑task learning – share parameters across all hierarchy levels.

Our proprietary solution, DHMCN (Dense Hierarchical Multilabel Classification Network), evolves through three versions:

V1 – multi‑task model learning global and leaf representations separately.

V2 – DenseNet‑style dense connections between hierarchy levels to improve convergence.

V3 – cascade‑style weighting where the first‑level prediction guides leaf‑level confidence.

The network jointly optimizes global, local, and fused probabilities using cross‑entropy loss against ground‑truth labels.
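A toy sketch of the V3 cascade weighting and the multilabel cross-entropy objective described above. The specific gating form `(1 - w)·p_leaf + w·p_parent·p_leaf` and all names are assumptions for illustration, not the exact DHMCN equations:

```python
import numpy as np

def cascade_fuse(p_level1, p_leaf, leaf_parent, w=0.5):
    """V3-style cascade weighting (illustrative): each leaf's confidence is
    partly gated by its first-level ancestor's predicted probability.
    leaf_parent[j] = index of leaf j's first-level ancestor."""
    gate = p_level1[leaf_parent]                 # parent confidence per leaf
    return (1 - w) * p_leaf + w * gate * p_leaf

def multilabel_ce(p, y, eps=1e-9):
    """Binary cross-entropy averaged over tags, applied to the global,
    local, and fused probability vectors against ground-truth labels."""
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```

With a confident first-level prediction (say, "Game"), leaves under that branch keep most of their score, while leaves under a low-probability branch are damped, which is the cascade effect V3 adds on top of the dense connections.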

Future Work

Incorporate video and audio features for very short clips.

Leverage user search and recommendation session data for low‑resource categories.

Exploit relational signals between videos (same series, same uploader, etc.).

References

[1] S. Dumais and H. Chen. Hierarchical classification of web content. SIGIR, 2000.
[2] P. N. Bennett and N. Nguyen. Refined experts: improving classification in large taxonomies. SIGIR, 2009.
[3] T. Xiong and P. Manggala. Hierarchical Classification with Hierarchical Attention Networks. KDD, 2018.
[4] S. Gopal and Y. Yang. Recursive regularization for large‑scale classification with hierarchical and graphical dependencies. KDD, 2013.
[5] J. Wehrmann et al. Hierarchical multi‑label classification networks. ICML, 2018.
[6] V. Vielzeuf et al. CentralNet: a multilayer approach for multimodal fusion. ECCV Workshops, 2018.
[7] Z. Liu et al. Efficient low‑rank multimodal fusion with modality‑specific factors. ACL, 2018.
