
My Experience and Methods in the iQIYI Multimodal Person Recognition Challenge

In the iQIYI Multimodal Person Recognition Challenge, I leveraged the provided facial features, weighted face‑quality averaging, DBSCAN‑based noise clustering and a dynamic extra noise class within an iterative KNN‑to‑neural‑network training pipeline, ultimately reaching the top‑5 and open‑sourcing the full workflow on GitHub.

iQIYI Technical Product Team

Last year I participated for the first time in the iQIYI Multimodal Person Recognition Challenge. My research focuses on machine learning and computer vision, so I decided to join the competition alone. The competition was intense and reaching the top 5 was difficult.

Because my computing resources were limited and the official dataset already provided extracted facial features, my initial strategy was to use these face features for person identification. That left handling the noisy data as the main lever for improving model performance.

I applied the DBSCAN clustering algorithm and added an extra dynamic noise class during training, which enhanced the model's robustness to noisy samples. The whole workflow is illustrated in the original paper, and the code has been open‑sourced on GitHub: https://github.com/luckycallor/IQIYI_VID_5th .

The challenges were twofold: how to represent a person in a video, and how to improve the model's ability to recognize noise. For the first problem, I represented each person by the average of their face features weighted by face‑quality scores, which also speeds up computation. For the second problem, I introduced additional noise categories: DBSCAN clusters the noisy data, the resulting clusters form the first part of the extra classes, and the noise left unclustered ("other noise") forms the second part.
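A minimal sketch of this clustering step using scikit-learn's DBSCAN; the toy features, `eps`, and `min_samples` here are illustrative, not the values used in the competition:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy stand-in for extracted face features of the noise samples.
rng = np.random.default_rng(0)
noise_features = rng.normal(size=(200, 32)).astype(np.float32)

# Cluster the noise; samples DBSCAN cannot assign to any cluster get label -1.
labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(noise_features)

clustered = noise_features[labels != -1]    # first part of the extra classes
unclustered = noise_features[labels == -1]  # "other noise", relabeled during training

print(len(clustered), len(unclustered))
```

The key property used downstream is DBSCAN's `-1` label: it cleanly separates noise that forms groups from noise that does not.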

Training procedure:

1. Initialize the "other noise" samples (those not belonging to any cluster) by randomly assigning each to one of the 4,934 extra noise categories.

2. Train the model.

3. After each epoch, run the current model on the "other noise" data to obtain a predicted label l for each sample.

4. If l < 4,934, update the sample's label to l + 8,652; otherwise keep the label unchanged.
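The relabeling rule in the last two steps can be written as a small helper; the constants come from the procedure above, while the function name and signature are my own framing (in practice the predicted label would come from the current model):

```python
N_NOISE_CLASSES = 4934   # number of extra noise categories
LABEL_OFFSET = 8652      # offset separating original labels from noise labels

def update_other_noise_label(predicted: int, current: int) -> int:
    """After each epoch: if the model predicts l < 4,934 for an
    'other noise' sample, move its label to l + 8,652; otherwise keep it."""
    if predicted < N_NOISE_CLASSES:
        return predicted + LABEL_OFFSET
    return current

print(update_other_noise_label(10, 9000))    # -> 8662
print(update_other_noise_label(5000, 9000))  # -> 9000
```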

I iterated the model over four stages: first a k‑Nearest Neighbors baseline on the face features, then a neural network, then training with the added noise data, and finally an ensemble of models.
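The first stage can be sketched by combining the quality-weighted pooling described above with a k‑NN baseline. Everything here is toy data under assumed settings: `pool_video`, the 16‑dimensional features, and k=3 are illustrative, not the competition configuration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def pool_video(frame_features: np.ndarray, quality: np.ndarray) -> np.ndarray:
    """Represent a video person as the face-quality-weighted average of frame features."""
    w = quality / quality.sum()
    return w @ frame_features

rng = np.random.default_rng(0)

def make_video(center: np.ndarray) -> np.ndarray:
    """Simulate one video: noisy frames around an identity center, with quality scores."""
    feats = center + 0.1 * rng.normal(size=(10, 16))
    qual = rng.uniform(0.1, 1.0, size=10)  # per-frame face-quality scores
    return pool_video(feats, qual)

# Two toy identities, ten pooled videos each.
centers = [np.zeros(16), np.ones(16)]
X = np.stack([make_video(centers[i % 2]) for i in range(20)])
y = np.array([i % 2 for i in range(20)])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([make_video(centers[1])]))
```

Pooling each video into a single vector before classification is what makes this baseline cheap: the classifier only ever sees one embedding per video, not per frame.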

Although my final ranking dropped from third to fifth place, the experience highlighted the importance of patience, continuous model tuning, and teamwork. Simple methods (e.g., KNN on face features) are useful for establishing a baseline, after which targeted improvements can be made. Extracting new features from the raw video, by contrast, requires larger models and GPU resources.

The iQIYI‑VID dataset is a large‑scale video‑based person recognition dataset with rich multimodal information (scenes, audio, actions) and accurate annotations. The 2019 version adds about 5,000 short‑video IDs and expands to 10,000 celebrity identities, 200 hours of video, and 200,000 clips, making it a challenging benchmark for future research.

My background: Master’s student at Nanjing University, research interests include machine learning, computer vision, GANs, and face recognition. I placed in the top 5 of the 2018 iQIYI challenge. More details are available on my personal blog https://luckycallor.xyz and GitHub https://github.com/luckycallor .

Tags: computer vision, multimodal, machine learning, iQIYI, person recognition, DBSCAN, noise handling