How iQIYI’s Multi‑Interest Recall Transforms Video Recommendation

This article analyzes the evolution of iQIYI’s multi‑interest recall techniques, from clustering‑based PinnerSage to MOE and single‑activation models, and shows how extracting multiple user interests improves recall diversity, mitigates filter bubbles, and boosts key performance metrics in short‑video recommendation.

iQIYI Technical Product Team

Technical Background

In video recommendation, the recall stage is the first step of the funnel: it filters billions of candidates down to a manageable set for ranking. Traditional recall produces a single user embedding, which limits the upper bound of ranking performance. Multi‑interest recall extracts several user interest vectors, enabling a “thousand users, many faces” paradigm.

Clustering Multi‑Interest Recall (PinnerSage)

The method reuses existing video embeddings (e.g., node2vec, item2vec) and applies hierarchical clustering to a user's watched videos. The process consists of two steps:

Clustering – each video starts as its own cluster; the two clusters whose merge causes the smallest increase in intra‑cluster variance are merged iteratively until a variance threshold is reached.

Representative selection – instead of averaging embeddings, the algorithm selects the video embedding that minimizes the sum of distances to all other embeddings in the cluster. These representatives become the user’s interest vectors and are used for ANN search.

This approach avoids repeated ANN queries and reduces information loss compared with naive pooling.
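The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not iQIYI's implementation: the function names, the toy 2‑D embeddings and the threshold value are assumptions, and the merge cost is the standard Ward increase in within‑cluster sum of squares.

```python
import numpy as np

def ward_cluster(embs, threshold):
    """Greedy Ward merging. Stops when the cheapest merge would raise the
    within-cluster sum of squares by more than `threshold` (assumed value)."""
    clusters = [[i] for i in range(len(embs))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = embs[clusters[a]].mean(axis=0)
                cb = embs[clusters[b]].mean(axis=0)
                na, nb = len(clusters[a]), len(clusters[b])
                # Ward cost: variance increase caused by merging a and b.
                cost = na * nb / (na + nb) * float(np.sum((ca - cb) ** 2))
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def medoid(embs, members):
    """Index of the member minimizing the summed distance to the others."""
    pts = embs[members]
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return members[int(np.argmin(dists.sum(axis=1)))]

# Toy watch history: three videos near the origin, two near (5, 5).
watched = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
clusters = ward_cluster(watched, threshold=1.0)
interest_ids = [medoid(watched, c) for c in clusters]
interest_vectors = watched[interest_ids]  # one ANN query vector per interest
```

The pairwise search is cubic in the history length, which is fine for a sketch but would need a proper linkage routine for long histories.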

Mixture‑of‑Experts (MOE) Multi‑Interest Recall

The classic two‑tower recall model is extended by replacing the user tower with a Mixture‑of‑Experts (MOE) module that outputs multiple interest vectors. User‑side inputs include sequences of video IDs, uploader IDs and tag IDs; each feature is embedded and average‑pooled, then concatenated and fed to the MOE. The video tower remains a single tower that extracts a video embedding.
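A rough forward pass for such a user tower might look like the sketch below. The article does not say how the MOE module maps experts to interest vectors, so the gating scheme here (one softmax gate per interest, mixing shared experts) is an assumption, as are all sizes, the random weights and the `user_tower` name.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_EXPERTS, N_INTERESTS = 8, 4, 3            # hypothetical sizes
FEATURES = ("video", "uploader", "tag")

# Hypothetical embedding tables, one per ID feature.
tables = {name: rng.normal(size=(100, DIM)) for name in FEATURES}
# Each expert is a single linear layer; each gate scores the experts.
W_expert = rng.normal(size=(N_EXPERTS, len(FEATURES) * DIM, DIM)) * 0.1
W_gate = rng.normal(size=(N_INTERESTS, len(FEATURES) * DIM, N_EXPERTS)) * 0.1

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def user_tower(history):
    """history maps feature name -> list of IDs; returns (N_INTERESTS, DIM)."""
    # Embed each ID sequence and average-pool it, then concatenate.
    pooled = [tables[name][np.asarray(history[name])].mean(axis=0) for name in FEATURES]
    x = np.concatenate(pooled)                                          # (3*DIM,)
    expert_out = np.maximum(np.einsum("i,eio->eo", x, W_expert), 0.0)   # (E, DIM)
    gates = softmax(np.einsum("i,kie->ke", x, W_gate), axis=-1)         # (K, E)
    interests = gates @ expert_out                                      # (K, DIM)
    # L2-normalize so dot products with video embeddings act as cosine scores.
    return interests / (np.linalg.norm(interests, axis=1, keepdims=True) + 1e-12)

interests = user_tower({"video": [1, 2, 3], "uploader": [7, 7, 9], "tag": [4, 5]})
```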

Loss computation uses batch‑wise negative sampling (other samples in the same batch serve as negatives) and focal loss to emphasize hard negatives, which is essential because the negative space contains millions of items.
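One way to combine the two ideas is to apply the focal weight to the softmax probability of the positive item within the batch. The exact formulation used in production is not given in the article, so the form below, the `gamma` default and the function name are assumptions; for simplicity the sketch uses one user vector per sample.

```python
import numpy as np

def in_batch_focal_loss(user_vecs, item_vecs, gamma=2.0):
    """user_vecs, item_vecs: (B, D). Row i of each forms a positive pair;
    every other item in the batch serves as a negative for user i."""
    logits = user_vecs @ item_vecs.T                       # (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_pos = np.diag(log_probs)                         # positives on the diagonal
    p_pos = np.exp(log_p_pos)
    # (1 - p)^gamma shrinks the loss of easy samples, so hard ones dominate.
    return float(np.mean(-((1.0 - p_pos) ** gamma) * log_p_pos))

users = 5.0 * np.eye(3)
items = np.eye(3)
aligned = in_batch_focal_loss(users, items)
misaligned = in_batch_focal_loss(users, np.roll(items, 1, axis=0))
```

With `gamma=0` the function reduces to ordinary in‑batch softmax cross‑entropy, which makes it easy to compare the two settings.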

Online deployment showed a 0.64% increase in overall CTR, a 28% lift in CTR for the recall source, and a 45% increase in average watch time.

Single‑Activation Transformer‑Based Multi‑Interest Recall

Inspired by Alibaba’s MIND, the first version used a capsule network to capture multiple interests. Because the capsule architecture was computationally heavy, it was replaced by a transformer that preserves sequence order while being efficient.

Workflow:

Sample a user’s video‑ID sequence {V1,…,VN}; the (N+1)‑th video is the target.

Embed each video ID to obtain E={E1,…,EN} and feed the sequence into a transformer encoder, which outputs K interest vectors M={M1,…,MK}.

Select the interest vector with the highest similarity to the target embedding and compute a sampled‑softmax loss; only this “activated” vector contributes to gradient updates.

During inference, all K interest vectors are used independently for ANN retrieval.
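The selection and retrieval steps of this workflow can be sketched as follows, taking the K interest vectors as given. The transformer encoder is omitted, the sampled softmax leaves out any sampling‑probability correction, and a brute‑force dot product stands in for the ANN index; function names and toy vectors are illustrative assumptions.

```python
import numpy as np

def single_activation_loss(interests, target, negatives):
    """interests: (K, D); target: (D,); negatives: (N, D) sampled item embeddings.
    Returns the index of the activated interest and its softmax loss."""
    k = int(np.argmax(interests @ target))     # hard selection of one interest
    active = interests[k]
    # Positive logit first, then one logit per sampled negative.
    logits = np.concatenate(([active @ target], negatives @ active))
    logits = logits - logits.max()
    loss = -(logits[0] - np.log(np.exp(logits).sum()))
    return k, float(loss)

def retrieve(interests, item_embs, top_n):
    """Brute-force stand-in for ANN: one independent query per interest."""
    scores = interests @ item_embs.T           # (K, num_items)
    return [np.argsort(-row)[:top_n].tolist() for row in scores]

interests = np.array([[1.0, 0.0], [0.0, 1.0]])
k, loss = single_activation_loss(
    interests,
    target=np.array([0.0, 2.0]),
    negatives=np.array([[1.0, 0.0], [-1.0, 0.0]]),
)
results = retrieve(interests, np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]), top_n=2)
```

In an autodiff framework the argmax is non‑differentiable, so only the selected vector receives gradient, which matches the single‑activation behavior described above.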

Key improvements:

Disagreement regularization – additional regularization terms (e.g., orthogonality or cosine‑distance penalties) are applied to the set of interest vectors to force diversity and reduce redundancy.

Dynamic capacity – an activation‑record table logs which interest vectors are used during training. At inference time, vectors with low activation counts are pruned, allowing the number of active interests to adapt to each user’s behavior diversity.

Multimodal features – uploader ID and tag embeddings are incorporated. Tag embeddings are pooled across all tags of a video before being concatenated with video and uploader embeddings. The loss still samples negatives only on video‑ID embeddings, keeping the ANN index focused on video IDs.
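The first two improvements can be illustrated with a small sketch. The article names only the general families of penalties, so the mean pairwise cosine similarity used here is one assumed choice; likewise the `ActivationTable` class, its per‑user granularity and the `min_share` pruning threshold are hypothetical.

```python
import numpy as np

def disagreement_penalty(interests):
    """Mean pairwise cosine similarity between distinct interest vectors.
    interests: (K, D) with K >= 2. Adding this to the loss pushes vectors apart."""
    unit = interests / (np.linalg.norm(interests, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T
    off_diag = sim[~np.eye(len(interests), dtype=bool)]
    return float(off_diag.mean())

class ActivationTable:
    """Counts how often each interest slot was the activated one in training."""

    def __init__(self, num_interests):
        self.counts = np.zeros(num_interests, dtype=np.int64)

    def record(self, k):
        self.counts[k] += 1

    def active_slots(self, min_share=0.1):
        """Slots kept at inference: those with at least `min_share` of activations."""
        total = int(self.counts.sum())
        if total == 0:
            return list(range(len(self.counts)))
        return [k for k, c in enumerate(self.counts) if c / total >= min_share]

table = ActivationTable(num_interests=3)
for k in [0] * 8 + [1] * 2:       # slot 2 is never activated for this user
    table.record(k)
kept = table.active_slots()
```

A user with narrow behavior ends up with fewer retained slots and therefore fewer ANN queries at inference time.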

Summary and Outlook

The multi‑interest recall pipeline has progressed from simple clustering (PinnerSage) to MOE‑based expert towers and finally to a transformer‑driven single‑activation architecture with regularization, dynamic capacity, and multimodal inputs. Production results show overall CTR gains of ~2%, watch‑time improvements of ~1.5%, and a 4% increase in video diversity.

Future work includes:

Enriching behavior sequences with search, subscription and other interaction signals to capture latent interests.

Incorporating negative feedback (dislikes, negative comments, unfollows) into the training objective.

Integrating static user profile features to better align recall with downstream ranking objectives.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
