
Face Quality‑Driven Feature Denoising and Fusion for iQIYI‑VID‑2019 Video Person Recognition

The seefun team leveraged face detection scores and quality metrics to denoise and weight‑fuse facial features during training and testing, using a three‑layer MLP with Swish activation and dropout, and achieved a 0.8983 mAP (fourth place) on the iQIYI‑VID‑2019 video person‑recognition challenge.

iQIYI Technical Product Team

The seefun team, formed by a graduate student from Shanghai Jiao Tong University, participated in the 2019 ACM MM & iQIYI Multimodal Video Person Recognition Challenge, achieving a 0.8983 mAP score and ranking fourth. The source code has been released on GitHub.

The iQIYI‑VID‑2019 dataset is the largest multimodal video person‑recognition dataset, containing over 211,000 video clips and 10,034 person categories. Each frame provides a 512‑dimensional face feature vector, detection score, and quality metric, as well as similar features for head, body, and audio. The dataset is split 40% train, 30% validation, 30% test, and the evaluation metric is mAP over the 10,034 classes.

Key challenges include a large open‑set of target classes, noisy face detections, massive data size that cannot be loaded into memory at once, and substantial feature noise from detection errors and low‑quality faces.

Proposed solution:

1. Training‑time feature denoising and augmentation: Faces are filtered by detection‑score and quality thresholds; the surviving face features are then weighted and fused (randomly sampling one to five faces per video) to create richer, cleaner training vectors. This reduces noise, concentrates class distributions, and speeds up convergence, yielding about a 2% mAP gain.
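This denoise‑then‑fuse augmentation can be sketched as follows. The thresholds, quality‑proportional weights, and the `fuse_training_faces` helper are illustrative assumptions, not the team's exact implementation:

```python
import numpy as np

def fuse_training_faces(feats, det_scores, qualities,
                        det_thresh=0.9, qual_thresh=20.0,
                        max_faces=5, rng=None):
    """Filter frame-level face features by detection score and quality,
    then randomly sample 1-5 surviving faces and quality-weight-fuse
    them into one training vector (thresholds are illustrative)."""
    rng = rng or np.random.default_rng()
    keep = (det_scores >= det_thresh) & (qualities >= qual_thresh)
    feats, qualities = feats[keep], qualities[keep]
    if len(feats) == 0:
        return None  # drop videos with no usable face
    k = rng.integers(1, min(max_faces, len(feats)) + 1)
    idx = rng.choice(len(feats), size=k, replace=False)
    w = qualities[idx] / qualities[idx].sum()  # quality-proportional weights
    return (feats[idx] * w[:, None]).sum(axis=0)
```

Because a different random subset is fused on every epoch, each video yields many distinct yet clean training vectors, which is where the augmentation effect comes from.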

2. Test‑time video face feature fusion: All face vectors in a video are combined into a single video‑level representation using a quality‑based weighting function W(x). The fused vector is then classified into 10,035 categories (10,034 persons + 1 “other”). Compared with per‑frame voting, this reduces inference time by a factor of several hundred and improves accuracy by about 1%.
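A minimal sketch of this video‑level fusion is shown below. The article does not give the exact form of W(x), so a clipped‑linear ramp over the quality score is assumed here purely for illustration:

```python
import numpy as np

def quality_weight(q, low=10.0, high=40.0):
    """Hypothetical W(x): clip quality into [low, high] and rescale to
    [0, 1]. The article's exact W(x) is not specified."""
    return np.clip((np.asarray(q) - low) / (high - low), 0.0, 1.0)

def fuse_video(feats, qualities, eps=1e-8):
    """Fuse all frame-level face features of one video into a single
    L2-normalized video-level vector using W(x) as weights."""
    w = quality_weight(qualities)
    v = (np.asarray(feats) * w[:, None]).sum(axis=0) / (w.sum() + eps)
    return v / (np.linalg.norm(v) + eps)
```

Since the classifier now runs once per video instead of once per frame, the speed‑up grows with the number of frames per clip.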

3. Classification model: Multiple three‑layer fully‑connected MLPs were explored. The final model uses Dropout for regularization and the Swish activation function. A single‑fold model achieved 0.8811 mAP; a five‑fold ensemble reached 0.8955 mAP; and a planned five‑model ensemble (partially completed) achieved the final 0.8983 mAP.
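The classifier head can be sketched as a forward pass in NumPy. The hidden sizes (1024) are assumptions; only the 512‑d input and 10,035‑way output are given in the article. Dropout is active only during training, so it is omitted from this inference‑time sketch:

```python
import numpy as np

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

class MLPClassifier:
    """Inference-time sketch of a three-layer MLP head
    (512 -> 1024 -> 1024 -> 10035) with Swish on the hidden layers.
    Hidden widths are illustrative assumptions."""
    def __init__(self, dims=(512, 1024, 1024, 10035), rng=None):
        rng = rng or np.random.default_rng(0)
        # He-style initialization for each (weight, bias) pair
        self.layers = [(rng.normal(0.0, np.sqrt(2.0 / i), size=(i, o)),
                        np.zeros(o))
                       for i, o in zip(dims[:-1], dims[1:])]

    def forward(self, x):
        for k, (w, b) in enumerate(self.layers):
            x = x @ w + b
            if k < len(self.layers) - 1:  # no activation on the logit layer
                x = swish(x)
        return x
```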

Additional training tricks include mixed focal loss and softmax loss, learning‑rate warm‑up, cosine annealing, and staged loss weighting to accelerate convergence and improve ranking consistency.
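The warm‑up plus cosine‑annealing schedule mentioned above can be expressed as a small step‑to‑learning‑rate function. The base rate and warm‑up length here are illustrative values, not figures from the article:

```python
import math

def lr_at(step, total_steps, base_lr=1e-3, warmup_steps=500):
    """Linear warm-up for the first warmup_steps, then cosine annealing
    down to zero over the remaining steps (parameter values are
    illustrative, not from the article)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

Warm‑up avoids unstable early updates from large random logits, while cosine annealing lets the final epochs refine the ranking with small steps.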

The overall pipeline demonstrates that leveraging face quality metrics for both denoising during training and weighted fusion during testing can substantially improve performance on large‑scale multimodal video person‑recognition tasks while keeping computational cost low.

computer vision · MLP · feature fusion · face quality weighting · iQIYI-VID-2019 · multimodal video recognition