How Ant’s Multimodal Team Boosted Video‑Text Retrieval by 24% and Cut Copyright Search Costs 85%
This article presents Ant Group's multimodal research on video retrieval, detailing a large Chinese video‑text pre‑training dataset, three techniques that raise video‑text semantic search performance by up to 24.5%, and an end‑to‑end video‑video copyright detection system that reduces storage by 85% and speeds up inference 18‑fold.
Overview
The Ant Group multimodal cognition team shares a year of research on video multimodal retrieval, focusing on two problems: improving video‑text semantic search and enabling efficient video‑video same‑source search.
Video‑Text Semantic Retrieval
Semantic retrieval aims to find videos whose content matches a query text, even when the exact words do not appear in the video description. Typical scenarios include Alipay search and security monitoring.
Three optimization methods were introduced:
Video‑Text Pre‑training – Using a large unsupervised video‑text pair dataset (CNVid‑3.5M) to align modalities before fine‑tuning. On the MSRVTT benchmark, this raised the summed recall (R@sum) by 24.5%.
Hard‑Sample Mining – Two strategies were explored: a hard‑sample curriculum (HSCL) that manually adjusts sample weights as training progresses, and a self‑adaptive scheme that lets the model focus on hard examples on its own. Experiments on CNVid‑3.5M and an English COCO+VG+CC dataset showed R@sum improvements of about 5%–8%.
Fine‑Grained Modeling – Generated partially masked video‑text pairs to create a partial order (triplet) loss, encouraging the model to distinguish subtle semantic differences. This contributed an additional 2.8% gain in R@sum, and combined with hard‑sample methods achieved up to 4.4% improvement.
The team also released the first publicly available Chinese video‑text pre‑training dataset (CNVid‑3.5M) and a cross‑modal model called SNP‑S3, which enhances video‑text interaction through two components: Mask Significant Semantic Modeling (MSSM) and Local‑Vision‑Word Matching (LVWM).
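Pre‑training of this kind typically aligns the two modalities with a symmetric contrastive objective over in‑batch video‑text pairs before any task‑specific fine‑tuning. The snippet below is a minimal, illustrative sketch of such an InfoNCE‑style alignment loss; it does not reproduce SNP‑S3's MSSM or LVWM components, and the embeddings are assumed to come from placeholder video and text encoders.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.05):
    """Symmetric InfoNCE loss over in-batch video-text pairs.

    video_emb, text_emb: (B, D) embeddings from placeholder encoders;
    the i-th video and the i-th text are assumed to be a matched pair.
    """
    # L2-normalize so dot products are cosine similarities
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the positive pairs
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # video-to-text and text-to-video cross-entropy, averaged
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```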
Hard‑Sample Mining Details
Training data are categorized as good, hard, or noisy samples. Good samples accelerate convergence, hard samples improve performance after the model has learned basic patterns, and noisy samples are discarded. The curriculum learning pipeline first emphasizes good samples, then gradually increases the weight of hard samples using a dynamically adjusted loss weight.
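One way to realize such a schedule is to scale each sample's loss by a weight that depends on how hard the sample currently looks and on training progress. The sketch below is a hypothetical illustration of this idea only; the thresholds and the weighting rule are assumptions, not the team's exact formulation.

```python
import torch

def curriculum_weighted_loss(per_sample_loss, progress, noise_threshold=8.0):
    """Re-weight per-sample losses for a good/hard/noisy curriculum.

    per_sample_loss: (B,) unreduced losses; higher usually means harder.
    progress: float in [0, 1], fraction of training completed.
    Samples above `noise_threshold` are treated as noise and dropped.
    Thresholds and ramp values here are illustrative placeholders.
    """
    with torch.no_grad():
        weights = torch.ones_like(per_sample_loss)
        hard = per_sample_loss > per_sample_loss.median()
        noisy = per_sample_loss > noise_threshold

        # ramp hard-sample weight from 0.3 up to 1.5 as training progresses
        weights[hard] = 0.3 + 1.2 * progress
        weights[noisy] = 0.0  # discard suspected noise entirely
    return (weights * per_sample_loss).sum() / weights.sum().clamp_min(1e-6)
```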
Two concrete methods were implemented:
HSCL – manually defined curriculum based on training stage.
Self‑adaptive hard‑sample mining – using Dual‑Modal Attention Enhancement (DMAE) to enlarge the pool of hard negatives and Negative‑NCE (NegNCE) to force the model to separate hard negatives from positives.
DMAE selects important tokens (nouns, verbs, adjectives with low frequency) on both text and visual sides, while NegNCE adds an auxiliary loss for negatives whose similarity exceeds that of positives. Together they yielded a 3%–6% increase in R@sum.
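A plausible reading of the NegNCE term is an auxiliary margin penalty applied to any in‑batch negative whose similarity to the anchor exceeds that of the matched positive. The sketch below illustrates that behaviour under those assumptions; it is not the published formulation.

```python
import torch
import torch.nn.functional as F

def neg_nce_auxiliary_loss(video_emb, text_emb, margin=0.0, temperature=0.05):
    """Penalize hard negatives that score higher than the matched positive.

    video_emb, text_emb: (B, D); row i of each is a matched pair.
    Returns a hinge-style penalty on negatives whose similarity exceeds
    the positive similarity by more than `margin` (illustrative only).
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.t() / temperature                  # (B, B) similarities
    pos = sim.diagonal().unsqueeze(1)              # (B, 1) positive scores

    # mask out the diagonal so only negatives contribute to the penalty
    neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    violation = F.relu(sim - pos + margin)         # >0 only for hard negatives
    return violation[neg_mask].mean()
```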
Fine‑Grained Modeling Details
To create different semantic granularity, the team generated partial‑order pairs by masking the most important tokens (determined by part‑of‑speech and TF‑IDF‑like importance) and by dropping key video frames. The resulting triplet loss enforces that the original pair has higher similarity than the masked version. Experiments showed a 2.8% boost on MSRVTT and a combined 4.4% boost when paired with DMAE.
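Under that construction, the loss only has to enforce an ordering: the original video‑text pair must score higher than the pair whose key words or key frames were masked. A minimal margin formulation of such a partial‑order constraint might look like the sketch below (how tokens and frames are selected for masking is omitted, and the margin value is a placeholder).

```python
import torch.nn.functional as F

def partial_order_triplet_loss(sim_full, sim_masked, margin=0.2):
    """Enforce sim(original pair) > sim(masked pair) + margin.

    sim_full:   (B,) similarity of each original video-text pair.
    sim_masked: (B,) similarity of the same pair after masking the most
                important words or dropping key video frames.
    """
    return F.relu(sim_masked - sim_full + margin).mean()
```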
Video‑Video Same‑Source Retrieval
The goal is efficient copyright detection. The proposed end‑to‑end segment‑matching framework reduces storage by 85% and speeds up retrieval 18×, while improving F1 by 2.78% compared with traditional uniform‑frame methods.
Key challenges include complex transformation types (geometric, optical, temporal) and massive data volume. The system extracts key frames, computes frame‑level features, and stores them in a feature library. During query, key frames are extracted, matched against the library, and a fine‑grained ranking determines infringement.
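At serving time, the frame‑matching stage is essentially a nearest‑neighbour lookup over the key‑frame feature library. A minimal sketch with FAISS (which the Q&A below describes as similar to the internal Qianxun platform) could look like the following; the index type, feature dimensionality, and function names are assumptions for illustration.

```python
import numpy as np
import faiss  # stand-in for the internal Qianxun vector store

def build_frame_index(library_feats: np.ndarray) -> faiss.IndexFlatIP:
    """Index L2-normalized key-frame features for inner-product search."""
    feats = library_feats.astype("float32")
    faiss.normalize_L2(feats)
    index = faiss.IndexFlatIP(feats.shape[1])
    index.add(feats)
    return index

def match_query_frames(index, query_feats: np.ndarray, topk: int = 10):
    """Return (scores, ids) of the top-k library frames per query key frame."""
    q = query_feats.astype("float32")
    faiss.normalize_L2(q)
    return index.search(q, topk)
```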
Two core modules:
SPD (Segment‑Pattern Detection) – Constructs a similarity matrix between query and candidate key‑frame features, treats high‑similarity contiguous regions as patterns, and applies a YOLO‑style detector to locate infringing segments. This replaces costly dynamic‑programming approaches and yields an 18× speedup (a toy sketch of the similarity‑matrix input follows below).
SKE (Self‑Supervised Key‑frame Extraction) – Generates a probability mask for key frames, combines it with a uniform‑frame mask, and jointly trains the mask and SPD module end‑to‑end, allowing gradients to flow to the key‑frame extractor.
The joint training reduces storage by 85% and improves detection F1 by 2.78%.
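The input to SPD is simply the frame‑to‑frame similarity matrix between a query's key frames and a candidate's key frames; the detector then localizes high‑similarity "patterns" in that matrix much as objects are localized in an image. The sketch below builds that matrix and extracts a crude diagonal run as a stand‑in for the learned detector; thresholds and helper names are hypothetical.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(query_feats, cand_feats):
    """(Tq, D) x (Tc, D) key-frame features -> (Tq, Tc) cosine-similarity map."""
    q = F.normalize(query_feats, dim=-1)
    c = F.normalize(cand_feats, dim=-1)
    return q @ c.t()

def naive_segment_from_matrix(sim, threshold=0.8):
    """Crude stand-in for SPD: find the longest run of above-threshold matches
    along the main diagonal and return its (start, end) frame span.
    The real module detects arbitrary high-similarity regions with a
    YOLO-style detector; this is only a toy baseline."""
    diag = torch.diagonal(sim)
    best, cur, best_span = 0, 0, None
    for i, s in enumerate(diag.tolist()):
        if s >= threshold:
            cur += 1
            if cur > best:
                best, best_span = cur, (i - cur + 1, i)
        else:
            cur = 0
    return best_span  # None if no matching segment was found
```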
Q&A Highlights
Key‑frame extraction can either be trained as a segmentation model with labeled data, or trained end‑to‑end with the downstream task to avoid labeling.
The key‑frame model is not yet open‑source but will be released internally.
For recommendation scenarios, video‑text embeddings can be added to recall and ranking features.
Large‑scale feature vectors are stored in a vector database (internal “Qianxun” platform, similar to FAISS).
Hard‑sample mining and fine‑grained modeling are applicable to both video‑text and video‑video tasks.
The research demonstrates substantial gains in both semantic retrieval accuracy and copyright detection efficiency, and the team invites further collaboration.