How iQIYI’s Multi‑Interest Recall Transforms Video Recommendation
This article analyzes iQIYI’s evolution of multi‑interest recall techniques—from clustering‑based PinnerSage to MOE and single‑activation models—showing how extracting multiple user interests improves recall diversity, mitigates filter bubbles, and boosts key performance metrics in short‑video recommendation.
Technical Background
In video recommendation, the recall stage is the first funnel, filtering billions of candidates down to a manageable set for ranking. Traditional recall produces a single user embedding, which cannot represent several distinct interests at once. Because ranking can only score what recall returns, this limits the upper bound of ranking performance. Multi‑interest recall extracts several user interest vectors, enabling a “thousand users, many faces” paradigm.
Clustering Multi‑Interest Recall (PinnerSage)
The method reuses existing video embeddings (e.g., node2vec, item2vec) and applies hierarchical clustering to a user's watched videos. The process consists of two steps:
Clustering – each video starts as its own cluster; the two clusters whose merge causes the smallest increase in intra‑cluster variance are merged iteratively until a variance threshold is reached.
Representative selection – instead of averaging embeddings, the algorithm selects the video embedding that minimizes the sum of distances to all other embeddings in the cluster. These representatives become the user’s interest vectors and are used for ANN search.
This approach avoids repeated ANN queries and reduces information loss compared with naive pooling.
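The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not iQIYI's or Pinterest's implementation: the function names, the toy 2‑D embeddings, and the stopping rule (stop when the cheapest merge would raise the within‑cluster sum of squares by more than `threshold`) are assumptions made for the example.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def merge_cost(c1, c2):
    """Increase in within-cluster sum of squares if c1 and c2 are merged."""
    n1, n2 = len(c1), len(c2)
    return n1 * n2 / (n1 + n2) * sq_dist(centroid(c1), centroid(c2))

def cluster_videos(embeddings, threshold):
    """Agglomerative clustering: every video starts as its own cluster."""
    clusters = [[e] for e in embeddings]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        if cost > threshold:  # assumed stopping rule
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def medoid(cluster):
    """Member embedding with the smallest total distance to the others."""
    return min(
        cluster,
        key=lambda c: sum(math.sqrt(sq_dist(c, o)) for o in cluster),
    )

def user_interest_vectors(embeddings, threshold):
    """One representative (medoid) per cluster of watched videos."""
    return [medoid(c) for c in cluster_videos(embeddings, threshold)]

# Toy history: three videos near the origin, two near (5, 5).
watched = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0], [5.1, 5.0]]
interests = user_interest_vectors(watched, threshold=1.0)
print(interests)
```

Because each representative is an actual video embedding rather than an average, it always corresponds to a real item in the embedding space.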
Mixture‑of‑Experts (MOE) Multi‑Interest Recall
The classic two‑tower recall model is extended by replacing the user tower with a Mixture‑of‑Experts (MOE) module that outputs multiple interest vectors. User‑side inputs include sequences of video IDs, uploader IDs and tag IDs; each feature is embedded and average‑pooled, then concatenated and fed to the MOE. The video tower remains a single tower that extracts a video embedding.
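As a rough sketch of the user tower's data flow, the code below average‑pools each feature sequence, concatenates the pooled vectors, and passes the result through several experts. The article does not describe the internals of the MOE module, so treating each expert as a single linear layer that emits one interest vector is purely an assumption, as are all function names and dimensions.

```python
import random

def average_pool(vectors):
    """Average a sequence of embeddings into one vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def linear(weights, x):
    """Matrix-vector product; weights has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def make_expert(in_dim, out_dim, rng):
    """Hypothetical expert: one randomly initialised linear layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]

def user_tower(video_seq, uploader_seq, tag_seq, experts):
    """Pool each feature, concatenate, and emit one vector per expert."""
    pooled = (average_pool(video_seq)
              + average_pool(uploader_seq)
              + average_pool(tag_seq))
    return [linear(w, pooled) for w in experts]

rng = random.Random(0)
experts = [make_expert(in_dim=6, out_dim=4, rng=rng) for _ in range(3)]
interests = user_tower(
    video_seq=[[1.0, 0.0], [0.0, 1.0]],
    uploader_seq=[[0.5, 0.5]],
    tag_seq=[[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]],
    experts=experts,
)
print(len(interests), len(interests[0]))
```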
The loss uses in‑batch negative sampling, in which the videos paired with other samples in the same batch serve as negatives, combined with focal loss to emphasize hard negatives. This is essential because the negative space contains millions of items.
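A minimal sketch of this loss, under stated assumptions: each sample contributes one user vector and one positive video vector, scores are dot products, and the focal term is applied to the softmax probability of the positive. The article does not say which focal‑loss variant is used or how the multiple interest vectors are reduced to a score, so the formulation and the name `in_batch_focal_loss` are illustrative only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_focal_loss(user_vecs, video_vecs, gamma=2.0):
    """Row i's positive is video i; every other video in the batch is a negative."""
    losses = []
    for i, u in enumerate(user_vecs):
        logits = [dot(u, v) for v in video_vecs]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        p_pos = exps[i] / sum(exps)
        # (1 - p)^gamma shrinks the loss of easy, well-classified samples.
        losses.append(-((1.0 - p_pos) ** gamma) * math.log(p_pos))
    return sum(losses) / len(losses)

videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_users = [[5.0, 0.0], [0.0, 5.0]]     # each user matches its own video
misaligned_users = [[0.0, 5.0], [5.0, 0.0]]  # each user matches the negative
print(in_batch_focal_loss(aligned_users, videos))
print(in_batch_focal_loss(misaligned_users, videos))
```

With `gamma=0.0` the expression reduces to ordinary softmax cross‑entropy, which makes the effect of the focal term easy to compare.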
Online deployment showed a 0.64 % increase in overall CTR, a 28 % lift in CTR for the recall source, and a 45 % increase in average watch time.
Single‑Activation Transformer‑Based Multi‑Interest Recall
Inspired by Alibaba’s MIND, the first version used a capsule network to capture multiple interests. Because the capsule architecture was computationally heavy, it was replaced by a transformer that preserves sequence order while being efficient.
Workflow:
Sample a user’s video‑ID sequence {V1,…,VN}; the (N+1)‑th video is the target.
Embed each video ID to obtain E={E1,…,EN} and feed the sequence into a transformer encoder, which outputs K interest vectors M={M1,…,MK}.
Select the interest vector with the highest similarity to the target embedding and compute a sampled‑softmax loss; only this “activated” vector contributes to gradient updates.
During inference, all K interest vectors are used independently for ANN retrieval.
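The single‑activation step in the workflow above can be illustrated as follows. This sketch assumes dot‑product similarity and a plain softmax over the target plus a handful of sampled negatives, without the sampling‑probability correction that production sampled‑softmax implementations usually add. The transformer encoder is omitted: `interests` stands in for its K output vectors, and the function name is hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def single_activation_loss(interests, target, negatives):
    """Pick the interest closest to the target and score only that one.

    Returns (index of the activated interest, loss). In a real training
    loop only the activated vector would receive gradients from this loss.
    """
    k = max(range(len(interests)), key=lambda i: dot(interests[i], target))
    active = interests[k]
    # Position 0 holds the positive (target) logit, followed by the negatives.
    logits = [dot(active, target)] + [dot(active, n) for n in negatives]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return k, log_z - logits[0]

interests = [[1.0, 0.0], [0.0, 1.0]]   # K = 2 interest vectors
target = [0.0, 2.0]                    # embedding of the (N+1)-th video
negatives = [[1.0, 0.0], [-1.0, 0.0]]  # sampled negative video embeddings
activated, loss = single_activation_loss(interests, target, negatives)
print(activated, loss)
```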
Key improvements:
Disagreement regularization – additional regularization terms (e.g., orthogonality or cosine‑distance penalties) are applied to the set of interest vectors to force diversity and reduce redundancy.
Dynamic capacity – an activation‑record table logs which interest vectors are used during training. At inference time, vectors with low activation counts are pruned, allowing the number of active interests to adapt to each user’s behavior diversity.
Multimodal features – uploader ID and tag embeddings are incorporated. Tag embeddings are pooled across all tags of a video before being concatenated with video and uploader embeddings. The loss still samples negatives only on video‑ID embeddings, keeping the ANN index focused on video IDs.
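The first two improvements lend themselves to a small illustration. The sketch below uses mean pairwise cosine similarity as one possible disagreement penalty and a simple count threshold for pruning. The article does not specify the exact regularizer, the layout of the activation‑record table, or the pruning threshold, so `disagreement_penalty`, `prune_interests`, and `min_count` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b, eps=1e-12):
    """Cosine similarity, guarded against zero-length vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def disagreement_penalty(interests):
    """Mean pairwise cosine similarity; adding it to the loss pushes vectors apart."""
    pairs = [
        (i, j)
        for i in range(len(interests))
        for j in range(i + 1, len(interests))
    ]
    if not pairs:
        return 0.0
    return sum(cosine(interests[i], interests[j]) for i, j in pairs) / len(pairs)

def prune_interests(interests, activation_counts, min_count):
    """Keep only the interest vectors activated at least min_count times."""
    return [v for v, c in zip(interests, activation_counts) if c >= min_count]

diverse = [[1.0, 0.0], [0.0, 1.0]]
redundant = [[1.0, 1.0], [1.0, 1.0]]
print(disagreement_penalty(diverse), disagreement_penalty(redundant))

# Assumed per-user activation counts recorded during training.
user_interests = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
counts = [10, 0, 3]
print(prune_interests(user_interests, counts, min_count=3))
```

A user with narrow viewing habits would activate few vectors during training and therefore issue fewer ANN queries at inference time, while a user with broad habits keeps more of the K vectors.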
Summary and Outlook
The multi‑interest recall pipeline has progressed from simple clustering (PinnerSage) to MOE‑based expert towers and finally to a transformer‑driven single‑activation architecture with regularization, dynamic capacity, and multimodal inputs. Production results show overall CTR gains of ~2 %, watch‑time improvements of ~1.5 %, and a 4 % increase in video diversity.
Future work includes:
Enriching behavior sequences with search, subscription and other interaction signals to capture latent interests.
Incorporating negative feedback (dislikes, negative comments, unfollows) into the training objective.
Integrating static user profile features to better align recall with downstream ranking objectives.
