
Alignment-Uniformity Representation Learning for Zero-shot Video Classification (AURL)

The AURL framework, presented by Pu Shi, introduces alignment-uniformity aware representation learning for zero-shot video classification. It achieves relative top-1 accuracy gains of up to 28.1% on UCF101 and HMDB51 over prior methods, and has already improved business metrics in Tencent's advertising, search, and video-channel recommendation systems.

Tencent Cloud Developer

Apr 27, 2022

On April 28, 2022 (14:30-16:00), a live broadcast titled “Broadcom Technology Practical Zero-shot Video Classification | CVPR2022” will be held. The speaker is Pu Shi, a researcher at the Tencent TaiChi Machine-Learning Platform.

Video classification is widely used in Tencent's advertising, search, and recommendation services. Because classification taxonomies change frequently, models must adapt quickly. The zero-shot video classification solution presented here (AURL) addresses high labeling costs and long iteration cycles by recognizing new categories without additional labeled data or retraining. The paper was accepted at CVPR 2022.

The talk consists of two parts. (1) An introduction to AURL (Alignment-Uniformity aware Representation Learning), the method accepted at CVPR 2022. AURL trains the visual and semantic networks jointly, end to end, while constraining the alignment and uniformity of both seen classes and synthetically generated unseen classes, which improves generalization to unseen categories. (2) A case study of AURL's deployment in Tencent Advertising, WeChat Search, and WeChat Channels, where it has yielded significant improvements in business metrics.

Speaker background: Pu Shi received a Ph.D. from Beijing University of Posts and Telecommunications in 2020, has 7 years of experience in video content understanding, has published multiple papers at NeurIPS, CVPR, and TIP, and currently leads the multi-modal video classification project at the Broadcom Content Understanding team.

Technical overview: Zero-shot video classification aims to recognize categories that have no training examples by transferring knowledge from labeled data of seen categories. The two main challenges are the semantic gap between visual and textual features and the domain shift that arises when test categories differ from training ones. Existing methods mainly focus on alignment; the recent MUFI and ER approaches mitigate domain shift by using extra data. AURL proposes an end-to-end framework that combines a supervised contrastive loss (an alignment term plus a uniformity term) with a class generator that synthesizes unseen-class features via interpolation and extrapolation of visual centers and semantic embeddings.
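The interpolation and extrapolation idea behind the class generator can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the pairing of two seen classes, and the use of a single mixing coefficient `lam` shared by the visual and semantic spaces are all assumptions made for illustration. It is written in plain Python so it runs without dependencies; a real system would use a tensor library.

```python
def mix(a, b, lam):
    """Return a + lam * (b - a).

    0 < lam < 1 interpolates between a and b; lam outside [0, 1] extrapolates.
    """
    return [x + lam * (y - x) for x, y in zip(a, b)]


def synthesize_class(visual_centers, semantic_embs, i, j, lam):
    """Build one synthetic (visual, semantic) pair from seen classes i and j.

    Assumption of this sketch: the same lam is used in both spaces so the
    synthetic visual center and synthetic semantic embedding stay paired.
    """
    return (
        mix(visual_centers[i], visual_centers[j], lam),
        mix(semantic_embs[i], semantic_embs[j], lam),
    )


# Toy 2-D values for two seen classes (hypothetical, not from the paper).
visual_centers = [[1.0, 0.0], [0.0, 1.0]]
semantic_embs = [[2.0, 0.0], [0.0, 2.0]]

interp_v, interp_s = synthesize_class(visual_centers, semantic_embs, 0, 1, lam=0.5)
extrap_v, extrap_s = synthesize_class(visual_centers, semantic_embs, 0, 1, lam=1.5)

print(interp_v, interp_s)  # [0.5, 0.5] [1.0, 1.0]
print(extrap_v, extrap_s)  # [-0.5, 1.5] [-1.0, 3.0]
```

Synthetic pairs like these can then be treated as extra classes during training, so the loss also shapes regions of the embedding space that no seen class occupies.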

Network architecture: an R(2+1)D backbone extracts visual features, and a three-layer MLP video projector maps them to visual embeddings. Class names are encoded with Word2Vec and passed through a word projector (FC + 3-layer MLP) to obtain semantic embeddings. The loss includes a SoftPlus-based alignment component and a LogSumExp-based uniformity component. The class generator creates synthetic visual and semantic features, and inference uses nearest-neighbor retrieval.
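As a rough illustration of how SoftPlus and LogSumExp can appear together in a contrastive objective, the sketch below uses a standard identity: the softmax cross-entropy of one positive against several negatives equals `softplus(logsumexp(negatives) - positive)`. Whether AURL's loss takes exactly this form is not stated here, so treat it as an assumption. The temperature `tau`, the function names, and the toy embeddings are likewise hypothetical. The `predict` function shows nearest-neighbor inference using cosine similarity, which is also an assumed choice of similarity measure.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def contrastive_loss(pos_sim, neg_sims, tau=0.1):
    """Loss for one video given similarity to its own class and to other classes.

    The LogSumExp over negatives pushes non-matching classes away
    (uniformity-like); the SoftPlus compares that against the matching
    class (alignment-like).
    """
    return softplus(logsumexp([n / tau for n in neg_sims]) - pos_sim / tau)


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict(video_emb, class_embs):
    """Nearest-neighbor inference: return the most similar class name."""
    return max(class_embs, key=lambda name: cosine(video_emb, class_embs[name]))


# Check the identity against the direct cross-entropy form (toy similarities).
pos, negs, tau = 0.8, [0.1, -0.3], 0.1
direct = -math.log(
    math.exp(pos / tau) / (math.exp(pos / tau) + sum(math.exp(n / tau) for n in negs))
)
loss = contrastive_loss(pos, negs, tau)
print(abs(loss - direct) < 1e-9)  # True

# Hypothetical semantic embeddings for two unseen classes.
class_embs = {"archery": [1.0, 0.0], "bowling": [0.0, 1.0]}
print(predict([0.9, 0.2], class_embs))  # archery
```

Because inference is only a similarity lookup against class-name embeddings, adding a new category means adding one more entry to `class_embs`, with no retraining.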

Evaluation metrics: “Closeness” (average intra‑class distance) measures alignment, while “Dispersion” (minimum inter‑class distance) measures uniformity. Smaller Closeness and larger Dispersion indicate better representation.
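The two metrics can be sketched as follows. This is a minimal reading of the definitions above, assuming Euclidean distance and class centers computed as the mean of each class's features; the paper's exact distance measure and choice of centers may differ, and the toy data is hypothetical.

```python
import math
from itertools import combinations


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def closeness(features_by_class):
    """Average distance from each feature to its own class center (smaller is better)."""
    dists = []
    for feats in features_by_class.values():
        center = mean_vector(feats)
        dists.extend(math.dist(f, center) for f in feats)
    return sum(dists) / len(dists)


def dispersion(features_by_class):
    """Minimum distance between any two class centers (larger is better)."""
    centers = [mean_vector(feats) for feats in features_by_class.values()]
    return min(math.dist(a, b) for a, b in combinations(centers, 2))


# Toy 2-D features for three classes (hypothetical values).
features = {
    "a": [[0.0, 0.0], [2.0, 0.0]],    # center (1, 0)
    "b": [[10.0, 0.0], [10.0, 2.0]],  # center (10, 1)
    "c": [[0.0, 10.0], [0.0, 12.0]],  # center (0, 11)
}

print(closeness(features))   # 1.0
print(dispersion(features))  # distance between centers a and b, sqrt(82)
```

Closeness averages over all samples, while Dispersion is set by the single closest pair of classes, so one pair of confusable categories is enough to pull it down.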

Experiments: AURL is trained on Kinetics‑700 and evaluated on UCF101 and HMDB51. Ablation studies demonstrate that each module improves Closeness, Dispersion, and Top‑1 accuracy. Compared with state‑of‑the‑art methods, AURL achieves relative Top‑1 accuracy gains of 28.1 % on UCF101 and 27.0 % on HMDB51, establishing a new benchmark.

Conclusion: The AURL framework successfully learns representations that are both aligned and uniformly distributed, setting a new state‑of‑the‑art for zero‑shot video classification. Its deployment in Tencent’s advertising, search, and video‑channel recommendation systems has already produced notable business impact.
