Multimodal Video Tagging: Challenges and a Two‑Stage Recall‑Ranking Solution
Short‑video platforms pose a massive multimodal tagging problem: a huge long‑tail tag vocabulary, sparse annotations, and modalities whose contributions vary from video to video. The authors address it with a two‑stage recall‑ranking system that first retrieves candidate tags via text, visual, audio, and classification cues, then refines them with contrastive learning and extensive hard‑negative sampling, reaching 0.884 tag accuracy in a real‑world news video recommender.
With the rise of short‑video platforms, massive numbers of videos are uploaded daily. Efficient intelligent distribution of these videos relies heavily on accurate video tagging, which helps match content to users' interests and improves watch time and click‑through rate.
Challenges of Video Tagging
Video content is multimodal, containing text, visual, and audio information. The main technical difficulties are:
Large tag vocabulary: Hundreds of thousands of tags with long‑tail distribution make traditional multi‑label classification insufficient.
Low tag coverage: Many relevant tags are missing from the annotation, leading to ambiguous supervision.
Heterogeneous multimodal understanding: Different modalities contribute unequally to each video, requiring comprehensive vector representations.
Technical Solution
A two‑stage recall‑ranking pipeline is proposed (see Figure 1). The recall stage performs coarse‑grained matching to retrieve a broad set of candidate tags, while the ranking stage refines the list with fine‑grained relevance scoring.
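The two‑stage flow can be sketched as below; `recall_candidates`, `rank_candidates`, and the score fields are hypothetical stand‑ins for illustration, not the authors' actual interfaces:

```python
def recall_candidates(video, tag_pool, top_k=200):
    """Coarse-grained stage: retrieve a broad candidate set. The single
    'coarse_score' here stands in for the paper's union of text/visual/audio/
    multimodal similarity and multi-label classification recalls."""
    scored = sorted(tag_pool, key=lambda t: -t["coarse_score"])
    return scored[:top_k]

def rank_candidates(video, candidates, top_n=5):
    """Fine-grained stage: re-score only the recalled tags and keep the best."""
    ranked = sorted(candidates, key=lambda t: -t["fine_score"])
    return [t["name"] for t in ranked[:top_n]]

def tag_video(video, tag_pool):
    # Ranking only ever sees what recall surfaced, which keeps the
    # expensive fine-grained model off the full tag vocabulary.
    return rank_candidates(video, recall_candidates(video, tag_pool))
```

The design choice this illustrates: the cheap recall stage bounds the candidate set so the expensive ranking model never scores the full vocabulary.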
Recall Stage
The recall stage aims to retrieve as many relevant tags as possible. Five recall methods are employed:
Text semantic similarity – Sentence‑BERT embeddings are used to find similar video titles/descriptions.
Video semantic similarity – frames are extracted, CLIP features are mean‑pooled to obtain video embeddings.
Audio semantic similarity – VGGish extracts audio embeddings from background music.
Multi‑label classification – a multimodal multi‑label classifier (Figure 2) predicts tags directly; the model incorporates tag embeddings and a video‑tag similarity branch.
Multimodal semantic similarity – fused text, visual, and audio embeddings are used for similarity search.
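Methods 2, 3, and 5 share the same retrieval pattern: pool per‑segment features into one vector, then run a cosine‑similarity search. A minimal NumPy sketch, assuming precomputed frame features (e.g. from CLIP) and tag embeddings; all names are illustrative:

```python
import numpy as np

def video_embedding(frame_feats: np.ndarray) -> np.ndarray:
    """Mean-pool per-frame features into a single L2-normalized video vector."""
    v = frame_feats.mean(axis=0)
    return v / np.linalg.norm(v)

def recall_by_similarity(video_vec, tag_vecs, tag_names, top_k=3):
    """Cosine-similarity search: normalize tag embeddings, score by dot
    product against the video vector, and return the top_k tag names."""
    tag_vecs = tag_vecs / np.linalg.norm(tag_vecs, axis=1, keepdims=True)
    scores = tag_vecs @ video_vec
    order = np.argsort(-scores)[:top_k]
    return [tag_names[i] for i in order]
```

The same two functions cover the visual, audio, and fused multimodal recalls; only the source of the input features changes.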
The recall performance of each method and the overall recall are reported in Table 1, showing that multimodal similarity and multi‑label classification achieve the highest individual recall, while the combined approach reaches an overall recall of 0.874.
Ranking Stage
The ranking stage focuses on precisely ordering the recalled tags. The model (Figure 3) uses contrastive learning to pull relevant tags closer to the video embedding and push irrelevant tags away. Video and audio features are aggregated with NextVLAD, and lightweight Tiny‑BERT extracts textual features for fast online inference.
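The contrastive objective can be sketched as an InfoNCE‑style loss. This is a simplified stand‑in that takes precomputed vectors, whereas the paper's model produces them with NextVLAD and Tiny‑BERT; vectors are assumed L2‑normalized so dot products are cosine similarities:

```python
import numpy as np

def contrastive_loss(video_vec, pos_tag, neg_tags, temperature=0.1):
    """InfoNCE-style loss: maximize similarity between the video vector and
    the positive tag embedding relative to the sampled negative tags."""
    logits = np.concatenate([[video_vec @ pos_tag], neg_tags @ video_vec]) / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # positive sits at index 0
```

Minimizing this loss pulls the positive tag toward the video embedding and pushes the negatives away, which is exactly the separation the ranking stage needs.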
Negative‑sample construction is critical. Experiments compare three strategies:
Simple negative samples (negative_num=10) – yields decent separation but struggles with fine‑grained distinctions.
Increased negative samples (negative_num=400) – improves overall ranking accuracy.
Adding hard negatives via a “2‑hop” method – further refines the ranking, especially for closely related tags.
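The article does not spell out the "2‑hop" construction; one plausible reading, sketched here as an assumption, is tag → co‑occurring videos → those videos' other tags, so that semantically close but unannotated tags become hard negatives:

```python
def two_hop_hard_negatives(video_tags, tag_to_videos, video_to_tags):
    """Hypothetical 2-hop mining: from each positive tag, hop to the videos
    carrying it, then to those videos' other tags. Tags reached this way are
    semantically close to the positives, so the ones not annotated on this
    video make hard negatives for fine-grained ranking."""
    hard = set()
    for tag in video_tags:
        for vid in tag_to_videos.get(tag, []):       # hop 1: tag -> videos
            hard.update(video_to_tags.get(vid, []))  # hop 2: video -> tags
    return hard - set(video_tags)                    # drop true positives
```

Under this reading, a basketball video would pick up negatives like other ball‑sport tags rather than random unrelated ones.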
Ranking results under the three settings, illustrated on a basketball‑related example video, are summarized in Table 2; combining a large negative pool with 2‑hop hard negatives produces the most accurate top‑5 predictions.
Effect Evaluation
The proposed pipeline has been deployed in NetEase News' video recommendation system. Table 3 compares three configurations: a pure multimodal classification model, a full‑ranking model (no recall stage), and the proposed multi‑stage recall‑ranking model. The latter achieves the highest tag accuracy (0.884) with a moderate average number of tags per video (3.12).
Conclusion and Outlook
This work presents a multimodal video multi‑tag modeling approach that leverages a recall‑ranking framework to improve tag prediction accuracy. Deployed in a real‑world news video scenario, it yields notable business impact. Future directions include incorporating knowledge graphs to further enhance model generalization.
References
[1] Reimers, N. and Gurevych, I., “Sentence‑BERT: Sentence Embeddings Using Siamese BERT‑Networks,” arXiv:1908.10084, 2019.
[2] Radford, A. et al., “Learning Transferable Visual Models from Natural Language Supervision,” ICML, 2021.
[3] Hershey, S. et al., “CNN Architectures for Large‑Scale Audio Classification,” ICASSP, 2017.
[4] Lin, R., Xiao, J., and Fan, J., “NeXtVLAD: An Efficient Neural Network to Aggregate Frame‑Level Features for Large‑Scale Video Classification,” YouTube‑8M Workshop, 2018.
[5] Lan, Z. et al., “ALBERT: A Lite BERT for Self‑Supervised Learning of Language Representations,” arXiv:1909.11942, 2019.
NetEase Media Technology Team