Tubi's Recall Exploration: Embedding‑Based Candidate Generation for Scalable Video Recommendations

This article details Tubi's multi‑stage recommendation system, focusing on the recall phase and describing how popularity metrics, embedding averaging, per‑video nearest‑neighbors, hierarchical clustering, real‑time ranking, and context‑aware sampling are combined to efficiently generate personalized video candidates at scale.

Bitu Technology

Tubi, an ad‑supported video‑on‑demand service, offers a large library of movies, TV shows, sports, and live channels and provides personalized recommendations to millions of monthly active users worldwide.

Tubi runs over 70 machine‑learning models to build its homepage, with the recall stage serving as a lightweight filter that reduces the candidate pool before expensive ranking models.

Most modern recommender systems use a two‑stage process: a recall step narrows millions of items to a manageable set, and a ranking step then orders that set.

Recall is crucial for Tubi because a smaller candidate set reduces latency, compute, and storage costs while improving user engagement.

Initially, Tubi ranked the entire catalog for each user offline, but as the user base and content grew, this became computationally prohibitive, prompting a shift toward recall techniques.

Early recall relied on simple popularity metrics (country, language, genre, external ratings), which work well for new users but offer no personalization for existing users.
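
A popularity‑based recall of this kind can be sketched as a filter followed by a sort. This is only an illustration: the field names (`countries`, `language`, `views`) and the toy catalog are assumptions, not Tubi's schema.

```python
def popularity_recall(catalog, country, language, k):
    """Return the ids of the k most-viewed titles for a (country, language) segment."""
    # Keep only titles that are available in the country and match the language.
    eligible = [
        video for video in catalog
        if country in video["countries"] and video["language"] == language
    ]
    # Most-viewed first; every user in the same segment gets the same list.
    eligible.sort(key=lambda video: video["views"], reverse=True)
    return [video["id"] for video in eligible[:k]]


# Hypothetical catalog used only for illustration.
catalog = [
    {"id": "a", "countries": {"US", "CA"}, "language": "en", "views": 900},
    {"id": "b", "countries": {"US"}, "language": "es", "views": 1200},
    {"id": "c", "countries": {"US"}, "language": "en", "views": 1500},
    {"id": "d", "countries": {"MX"}, "language": "es", "views": 3000},
]

top_us_en = popularity_recall(catalog, "US", "en", k=2)
```

Because the result depends only on the segment, it needs no user history, which is why it suits new users and also why it cannot personalize.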

Embedding‑Based Recall

Collaborative filtering via matrix factorization provides user and item vectors whose inner product predicts a preference score, serving as a basic recall method.
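
As a minimal sketch of inner‑product recall, assume the factorization has already been trained and has produced one vector per user and per item. The vectors, ids, and function names below are made up for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def recall_by_inner_product(user_vec, item_vecs, k):
    """Return the k item ids with the highest predicted preference score."""
    ranked = sorted(
        item_vecs,
        key=lambda item_id: dot(user_vec, item_vecs[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 3-dimensional factors for one user and three items.
user_vec = [1.0, 0.0, 0.5]
item_vecs = {
    "comedy_1": [0.9, 0.1, 0.0],  # score 0.9
    "drama_1": [0.0, 1.0, 0.0],   # score 0.0
    "comedy_2": [0.8, 0.0, 0.6],  # score 1.1
}

mf_candidates = recall_by_inner_product(user_vec, item_vecs, k=2)
```

A production system would replace the exhaustive sort with an approximate nearest‑neighbor index; the scoring rule stays the same.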

Challenges include the storage cost of user vectors and the variable quality of those vectors, leading Tubi to favor item vectors and to build each user's representation from the items in their watch history.

Embedding techniques (Doc2Vec, BERT, etc.) are used to create dense representations of metadata from partners such as IMDb, Gracenote, Rotten Tomatoes, and Wikipedia.

Version 1: Averaged Embeddings – User watch‑history embeddings are averaged to form a single user embedding, enabling fast nearest‑neighbor search but potentially losing nuanced preferences.
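
The averaging step and its main weakness can be shown with a toy example. The 2‑dimensional embeddings and title ids are invented for illustration, and cosine similarity is an assumed choice of metric.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def recall_from_average(history, item_vecs, k):
    """Average the watched items' embeddings, then rank unwatched items against it."""
    user_vec = mean_vector([item_vecs[item_id] for item_id in history])
    unwatched = [item_id for item_id in item_vecs if item_id not in history]
    unwatched.sort(key=lambda item_id: cosine(user_vec, item_vecs[item_id]), reverse=True)
    return user_vec, unwatched[:k]


# Hypothetical embeddings: axis 0 ~ "horror", axis 1 ~ "kids".
item_vecs = {
    "horror_1": [1.0, 0.0],
    "horror_2": [0.9, 0.1],
    "kids_1": [0.0, 1.0],
    "kids_2": [0.1, 0.9],
    "romcom_1": [0.7, 0.7],
}

user_vec, top = recall_from_average(["horror_1", "kids_1"], item_vecs, k=1)
```

Here a user who watched one horror title and one kids title is averaged to a point between the two tastes, so the closest unwatched item is `romcom_1` rather than more horror or more kids content. That is the loss of nuance described above.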

Version 2: Per‑Video Nearest Neighbors – Compute nearest neighbors for each video in the watch history, preserving individual video signals at the cost of heavier offline computation.
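
One way to picture this approach is a neighbor table computed offline for every video and a cheap merge at request time. The sketch below uses brute‑force cosine similarity on invented embeddings; the function names and the merge rule (keep each candidate's best score) are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def build_neighbor_table(item_vecs, n):
    """Offline step: for each video, store its n most similar other videos."""
    table = {}
    for video_id, vec in item_vecs.items():
        scored = [
            (cosine(vec, other_vec), other_id)
            for other_id, other_vec in item_vecs.items()
            if other_id != video_id
        ]
        scored.sort(reverse=True)
        table[video_id] = scored[:n]
    return table


def recall_per_video(history, table, k):
    """Request-time step: merge neighbors of watched videos, keeping each one's best score."""
    best = {}
    for video_id in history:
        for score, neighbor_id in table[video_id]:
            if neighbor_id not in history:
                best[neighbor_id] = max(score, best.get(neighbor_id, 0.0))
    return sorted(best, key=best.get, reverse=True)[:k]


# Hypothetical embeddings: axis 0 ~ "horror", axis 1 ~ "kids".
item_vecs = {
    "horror_1": [1.0, 0.0],
    "horror_2": [0.9, 0.1],
    "kids_1": [0.0, 1.0],
    "kids_2": [0.1, 0.9],
    "romcom_1": [0.7, 0.7],
}

neighbor_table = build_neighbor_table(item_vecs, n=2)
per_video_candidates = recall_per_video(["horror_1", "kids_1"], neighbor_table, k=2)
```

For the same horror‑plus‑kids history, the candidates are now `horror_2` and `kids_2`, so both interests survive. The price is the table itself, which grows with the catalog and must be recomputed offline as content changes.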

Version 3: Multimodal User Preferences – Hierarchical clustering of a user's watch history yields cluster centroids that act as compact user representations, reducing storage and improving robustness to outliers.
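
The idea can be sketched with a small agglomerative clustering routine. The source does not say which linkage or stopping rule Tubi uses; this version merges the two clusters with the closest centroids until the gap exceeds an assumed `max_dist` cutoff, and the embeddings are invented.

```python
import math


def centroid(points):
    """Element-wise mean of a cluster's vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def cluster_history(vectors, max_dist):
    """Bottom-up clustering: repeatedly merge the two clusters with the closest centroids."""
    clusters = [[vec] for vec in vectors]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_dist:
            break  # remaining clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical watch history: two horror-like and two kids-like embeddings.
watched = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

clusters = cluster_history(watched, max_dist=0.5)
centroids = [centroid(cluster) for cluster in clusters]
```

Four watched videos collapse into two centroids, roughly `[0.95, 0.05]` and `[0.05, 0.95]`, each of which can be used as a separate query vector. Storage per user then scales with the number of interests rather than with the length of the history.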

Version 4: Real‑Time Recall – Transition from batch to real‑time candidate generation using FAISS and HNSW, dramatically cutting unnecessary compute and storage.

Version 5: Context‑Aware Exploration & Sampling – After clustering, compute cluster importance based on size, recent watch time, etc., to guide exploration and sampling for context‑relevant recommendations.
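
A hedged sketch of importance‑weighted sampling follows. The source names cluster size and recent watch time as inputs but gives no formula, so the linear blend, the equal default weights, and the field names `size` and `recent_minutes` are all assumptions.

```python
import random


def cluster_importance(clusters, size_weight=0.5, recency_weight=0.5):
    """Blend each cluster's share of history size and of recent watch time into a probability."""
    total_size = sum(c["size"] for c in clusters) or 1
    total_recent = sum(c["recent_minutes"] for c in clusters) or 1
    raw = [
        size_weight * c["size"] / total_size
        + recency_weight * c["recent_minutes"] / total_recent
        for c in clusters
    ]
    total = sum(raw) or 1
    return [value / total for value in raw]


def sample_clusters(clusters, weights, n_slots, seed=0):
    """Pick which cluster fills each candidate slot, in proportion to importance."""
    rng = random.Random(seed)  # seeded only to make the illustration repeatable
    return [c["name"] for c in rng.choices(clusters, weights=weights, k=n_slots)]


# Hypothetical per-user clusters.
user_clusters = [
    {"name": "horror", "size": 6, "recent_minutes": 30},
    {"name": "kids", "size": 3, "recent_minutes": 90},
    {"name": "documentary", "size": 1, "recent_minutes": 0},
]

weights = cluster_importance(user_clusters)
slots = sample_clusters(user_clusters, weights, n_slots=10)
```

With these numbers the weights come out near 0.425, 0.525, and 0.05: the kids cluster outranks the larger horror cluster because of recent viewing, while the small documentary cluster keeps a nonzero chance of being explored.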

Version 6: Future Challenges – Real‑time ranking enables adaptive clustering, richer user feedback signals, and dynamic importance weighting, paving the way for further innovations.

Author: Jaya Kawale (VP of Engineering, Machine Learning, Tubi). Translator: Honghong Zhao. Proofreader: Shengwu Yang.

machine learning, personalization, recall, embedding, recommendation systems, video streaming
Written by

Bitu Technology

Bitu Technology is the registered company of Tubi's China team. We are engineers passionate about leveraging advanced technology to improve lives, and we hope to use this channel to connect and advance together.
