Multimodal Short Video Content Tagging Techniques and Applications at iQIYI
This article surveys iQIYI's multimodal short‑video content‑tagging pipeline: extraction‑ and generation‑based tagging methods, the challenges of an open‑world tag set, model evolution from rule‑based systems to Transformer generators, visual‑text fusion techniques, applications such as recommendation, search, and clustering, and directions for future work.
Natural Language Processing (NLP) is a key branch of artificial intelligence that enables machines to understand human language. In the context of short‑video platforms, content‑tag technology is an essential means for content understanding. This article introduces multimodal short‑video content‑tag techniques and their practical deployment at iQIYI.
The article is organized into five parts: (1) What is a content tag, (2) Methods for extracting content tags, (3) Challenges of multimodal short‑video tagging, (4) The evolution of the underlying models, and (5) Main application scenarios.
A content tag is a representation of a piece of media (text, image‑text, or short video) expressed by keywords or phrases generated from the content itself. It differs from a type tag, which is a pre‑defined classification scheme.
Content tags are used for three typical scenarios: personalized recommendation, search (matching user queries with tags), and clustering/classification (using tags as features to improve clustering or classification performance).
Tag extraction methods are divided into two major categories: extraction‑based and generation‑based. Extraction‑based methods select keywords or phrases that appear in the source text, while generation‑based methods produce tags even if they do not appear in the original content.
Extraction‑based approaches include supervised and unsupervised techniques. Supervised methods first generate candidate words (e.g., via frequency‑based labeling) and then rank them using classifiers. Unsupervised methods rely on word‑frequency statistics such as TF‑IDF or graph‑based algorithms such as TextRank and its variants (ExpandRank, CiteTextRank, PositionRank). Joint learning can also be employed to avoid error accumulation between candidate generation and ranking.
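To make the unsupervised graph-based idea concrete, here is a minimal pure‑Python sketch of a TextRank‑style keyword extractor (not iQIYI's implementation): candidate words become graph nodes, edges come from co‑occurrence within a sliding window, and a PageRank‑style iteration scores the nodes.

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iters=50, top_k=3):
    """Score candidate words with a TextRank-style co-occurrence graph."""
    # Build an undirected co-occurrence graph over a sliding window.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    # Iterate the PageRank-style update; well-connected words gain score.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

tokens = "video tag model learns tag semantics from video title text".split()
print(textrank_keywords(tokens))
```

Repeated, well-connected words such as "video" and "tag" rank highest; production variants add part-of-speech filtering and phrase merging.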
Generation‑based approaches use seq2seq frameworks to map the input text to a sequence of tags, allowing the model to generate tags that never appear in the original text. Reinforcement‑learning variants have also been explored (e.g., an ACL 2019 approach with an adaptive reward): while the model has generated fewer tags than the ground truth, recall is used as the reward to encourage producing more tags; once it has generated enough, F1 is used so that precision is rewarded as well.
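The adaptive reward described above can be sketched in a few lines (a simplified illustration of the idea, not the paper's exact formulation, which also handles duplicates and stemming):

```python
def adaptive_reward(predicted, gold):
    """Adaptive reward for tag generation: recall while under-generating,
    F1 once the model has produced enough tags."""
    pred, gold = set(predicted), set(gold)
    hits = len(pred & gold)
    recall = hits / len(gold) if gold else 0.0
    if len(pred) < len(gold):
        # Too few tags: reward recall only, pushing the model to emit more.
        return recall
    # Enough tags: switch to F1 so precision matters too.
    precision = hits / len(pred) if pred else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(adaptive_reward(["cat"], ["cat", "dog"]))               # → 0.5 (recall)
print(adaptive_reward(["cat", "dog", "fox"], ["cat", "dog"])) # → 0.8 (F1)
```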
Short‑video tagging faces several difficulties: (1) the tag set is open‑ended and can contain millions of items, (2) annotation standards are inconsistent (inter‑annotator agreement ≈ 22 %), (3) a large proportion of tags are abstract "absent" tags that do not appear verbatim in the title or text, and (4) multimodal information (cover image, video frames) is required to resolve ambiguities.
The model evolution at iQIYI progressed from simple word‑weight + threshold rules, to CRF models, attention‑based extractors, and finally Transformer‑based generators, which provide stronger semantic abstraction capabilities.
To incorporate visual information, transfer learning is used. Pre‑trained image models (ResNet, Inception‑v3, Xception) are fine‑tuned on a high‑frequency abstract‑tag classification task. The Xception model is selected, and the penultimate 2048‑dimensional vector is taken as the image representation.
Three fusion strategies are explored: (1) concatenating the image vector as a token to the text input, (2) using the image vector to initialize encoder hidden states, and (3) merging the image vector with the encoder output to serve as the decoder’s initial state.
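The three fusion points can be illustrated with toy NumPy tensors (a shape-level sketch only; the dimensions, the mean-pool "encoder", and the additive merge are placeholders for the real projection layers and encoder):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy hidden size (the real image vec is 2048-dim)
T = 5                                   # number of title tokens

text_tokens = rng.normal(size=(T, d))   # token embeddings of the title
img_vec = rng.normal(size=(d,))         # image vector, projected to the model dim

# (1) Prepend the image vector as an extra "token" of the text input.
fused_input = np.vstack([img_vec[None, :], text_tokens])   # shape (T+1, d)

# (2) Use the image vector to initialize the encoder hidden state.
enc_h0 = img_vec.copy()                                    # shape (d,)

# Toy "encoder": mean-pool the tokens (stand-in for an RNN/Transformer).
enc_out = text_tokens.mean(axis=0)                         # shape (d,)

# (3) Merge the image vector with the encoder output to form the
#     decoder's initial state (a learned layer in practice; a sum here).
dec_h0 = enc_out + img_vec                                 # shape (d,)

print(fused_input.shape, enc_h0.shape, dec_h0.shape)
```

Strategy (1) lets attention weigh the image against every text token, while (2) and (3) inject the image as global context at the sequence boundaries.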
BERT vectors are integrated in a similar fashion: the title is encoded by BERT, the second‑last layer’s vector is extracted, and then combined with the visual vectors using the same three fusion points, which enhances the model’s ability to generate abstract tags.
For full multimodal fusion, key frames are sampled from the video, transformed into vectors by Xception, and then combined with cover‑image vectors, text vectors, and BERT vectors. The combined representation is fed to a generative model to produce tags.
Fusion can be performed at three levels: early data‑level concatenation, late decision‑level aggregation (e.g., averaging scores), and model‑layer fusion (e.g., HybridFusion and enhanced scaled‑dot‑product attention). These techniques substantially improve tag generation quality.
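As a concrete reference for the model-layer option, here is plain scaled-dot-product attention in NumPy, with a decoder query attending over vectors from three modalities; the "enhanced" variant mentioned above builds on this form (the modality setup and dimensions here are illustrative assumptions):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Standard scaled dot-product attention: softmax(Q·K^T / sqrt(d))·V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over the keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(1)
# One decoder query attending over three modality vectors
# (title text, cover image, key frame), projected to a shared dim.
modalities = rng.normal(size=(3, 8))
query = rng.normal(size=(1, 8))
context, attn = scaled_dot_product_attention(query, modalities, modalities)
print(context.shape, attn.sum())   # (1, 8) context; attention weights sum to 1
```

The attention weights show how much each modality contributes to a given decoding step, which is what makes model-layer fusion more flexible than fixed early concatenation or late score averaging.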
Application examples at iQIYI include: (1) personalized recommendation by matching user interest tags with content tags, (2) improving search relevance and query expansion through tag‑based matching, (3) measuring term tightness via co‑occurrence in tags, (4) query recommendation, (5) linking short clips to their long‑video sources, (6) IP association (games, products, literature) via entity tags, and (7) event aggregation through tag‑based recall.
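A minimal sketch of the recommendation use case, matching user interest tags against content tags (Jaccard overlap is one simple choice; the data and scoring are illustrative, not iQIYI's production ranker):

```python
def jaccard(a, b):
    """Jaccard similarity between two tag sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_by_tags(user_tags, candidates):
    """Rank candidate videos by tag overlap with the user's interest tags."""
    return sorted(candidates,
                  key=lambda item: jaccard(user_tags, item["tags"]),
                  reverse=True)

user = ["basketball", "NBA", "highlights"]
videos = [
    {"id": "v1", "tags": ["cooking", "recipe"]},
    {"id": "v2", "tags": ["NBA", "highlights", "dunk"]},
    {"id": "v3", "tags": ["basketball", "tutorial"]},
]
print([v["id"] for v in rank_by_tags(user, videos)])  # → ['v2', 'v3', 'v1']
```

The same overlap scoring generalizes to search (query terms vs. tags) and to term-tightness estimation via tag co-occurrence.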
Future work aims to enhance annotation quality, incorporate richer audio‑visual signals, and further boost model precision for short‑video tagging.
iQIYI Technical Product Team