Tubi's Recall Exploration: Embedding‑Based Candidate Generation for Scalable Video Recommendations
This article details Tubi's multi‑stage recommendation system, focusing on the recall phase and describing how popularity metrics, embedding averaging, per‑video nearest‑neighbors, hierarchical clustering, real‑time ranking, and context‑aware sampling are combined to efficiently generate personalized video candidates at scale.
Tubi, an advertising‑supported video‑on‑demand service, offers a large library of movies, TV shows, sports, and live channels, and serves personalized recommendations to millions of monthly active users worldwide.
Tubi runs over 70 machine‑learning models to build its homepage, with the recall stage serving as a lightweight filter that reduces the candidate pool before expensive ranking models.
Most modern recommender systems use a two‑stage process: a recall step narrows millions of items to a manageable set, followed by a final ranking step.
Recall is crucial for Tubi because it shrinks the candidate set that later stages must score, which reduces latency, compute, and storage costs while still surfacing content that keeps users engaged.
Initially, Tubi ranked the entire catalog for each user offline, but as the user base and content grew, this became computationally prohibitive, prompting a shift toward recall techniques.
Early recall relied on simple popularity signals (popularity by country, language, and genre, plus external ratings), which work well for new users but offer no personalization for existing users.
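As a minimal sketch of segment-level popularity recall, the snippet below counts views per country and returns the most-viewed titles. The view log, its field layout, and the function name `popular_in_segment` are hypothetical and only illustrate the idea; they are not Tubi's implementation.

```python
from collections import Counter

# Hypothetical view log of (country, video_id) pairs; the layout is illustrative only.
views = [
    ("US", "v1"), ("US", "v2"), ("US", "v1"),
    ("MX", "v3"), ("MX", "v3"), ("MX", "v1"),
]

def popular_in_segment(views, country, k):
    """Return the k most-viewed video ids within one country segment."""
    counts = Counter(video for c, video in views if c == country)
    return [video for video, _ in counts.most_common(k)]

top_us = popular_in_segment(views, "US", 2)
top_mx = popular_in_segment(views, "MX", 1)
```

Every user in the same segment receives the same list, which is why this approach suits cold-start users but cannot personalize.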
Embedding‑Based Recall
Collaborative filtering via matrix factorization learns one vector per user and one per item; the inner product of a user vector and an item vector predicts that user's preference score for the item, which makes it a basic recall method.
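The inner-product scoring described above can be sketched as follows. The vectors here are random placeholders standing in for learned factors, and the sizes and the helper name `recall_top_k` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 4, 50, 8

# Random stand-ins for factors that matrix factorization would learn.
user_vecs = rng.normal(size=(n_users, dim))
item_vecs = rng.normal(size=(n_items, dim))

def recall_top_k(user_vec, item_vecs, k):
    """Score every item by inner product and return the k best item ids, best first."""
    scores = item_vecs @ user_vec
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k
    return top[np.argsort(-scores[top])]        # order the top-k by score

candidates = recall_top_k(user_vecs[0], item_vecs, k=5)
```

Note that this requires storing one vector per user, which is the storage cost mentioned next.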
Challenges include the storage cost of user vectors and variable quality of those vectors, leading Tubi to favor item vectors and generate user representations from watch history.
Embedding techniques (Doc2Vec, BERT, etc.) are used to create dense representations of metadata from sources such as IMDb, Gracenote, Rotten Tomatoes, and Wikipedia.
Version 1: Averaged Embeddings – User watch‑history embeddings are averaged to form a single user embedding, enabling fast nearest‑neighbor search but potentially losing nuanced preferences.
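A minimal sketch of the averaging approach, assuming unit-length item embeddings and cosine similarity (the source does not state the similarity measure). The data is random and the function name `recall_by_average` is hypothetical.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length so that dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(1)
item_vecs = normalize(rng.normal(size=(100, 16)))  # placeholder item embeddings
watch_history = [3, 17, 42]                        # ids of videos the user watched

def recall_by_average(history, item_vecs, k):
    """Average the watched-item embeddings and return the k nearest unwatched items."""
    user_vec = normalize(item_vecs[history].mean(axis=0))
    scores = item_vecs @ user_vec
    scores[history] = -np.inf                      # do not recommend watched items
    return np.argsort(-scores)[:k]

candidates = recall_by_average(watch_history, item_vecs, k=10)
```

Because all watched items collapse into one vector, a user who watches both horror and children's shows ends up with a query vector that may sit close to neither.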
Version 2: Per‑Video Nearest Neighbors – Compute nearest neighbors for each video in the watch history, preserving individual video signals at the cost of heavier offline computation.
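The per-video variant can be sketched as below, again assuming unit-length embeddings and cosine similarity. Merging the neighbor lists by keeping each candidate's best similarity is one plausible choice, not something the source specifies.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(1)
item_vecs = normalize(rng.normal(size=(100, 16)))  # placeholder item embeddings
watch_history = [3, 17, 42]

def per_video_neighbors(history, item_vecs, k_per_video):
    """Find the nearest neighbors of each watched video, then merge the lists."""
    sims = item_vecs[history] @ item_vecs.T        # shape: (len(history), n_items)
    sims[:, history] = -np.inf                     # exclude watched items
    neighbors = np.argsort(-sims, axis=1)[:, :k_per_video]
    best = {}                                      # candidate id -> best similarity seen
    for row, ids in enumerate(neighbors):
        for j in ids:
            best[int(j)] = max(best.get(int(j), -np.inf), float(sims[row, j]))
    return sorted(best, key=best.get, reverse=True)

candidates = per_video_neighbors(watch_history, item_vecs, k_per_video=5)
```

The work grows with the length of the watch history, which is the heavier computation the text refers to.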
Version 3: Multimodal User Preferences – Hierarchical clustering of a user's watch history yields cluster centroids that act as compact user representations, reducing storage and improving robustness to outliers.
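To illustrate the clustering idea, here is a small agglomerative sketch written in NumPy only. It merges the two closest clusters (by cosine distance between centroids) until none are closer than a threshold. The linkage rule, the `max_dist` threshold, and the synthetic two-taste history are all assumptions; the source does not say which hierarchical method or stopping rule Tubi uses.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cluster_history(vecs, max_dist=0.5):
    """Agglomerative clustering on centroids; returns member lists and unit centroids."""
    clusters = [[i] for i in range(len(vecs))]
    while len(clusters) > 1:
        cents = normalize(np.array([vecs[c].mean(axis=0) for c in clusters]))
        dist = 1.0 - cents @ cents.T               # cosine distance between centroids
        np.fill_diagonal(dist, np.inf)
        a, b = np.unravel_index(np.argmin(dist), dist.shape)
        if dist[a, b] > max_dist:
            break                                  # remaining clusters are too far apart
        a, b = int(min(a, b)), int(max(a, b))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    centroids = normalize(np.array([vecs[c].mean(axis=0) for c in clusters]))
    return clusters, centroids

# Synthetic history with two distinct tastes: three videos near each of two directions.
rng = np.random.default_rng(2)
taste_a, taste_b = np.eye(8)[0], np.eye(8)[1]
history_vecs = normalize(np.vstack([
    taste_a + 0.05 * rng.normal(size=(3, 8)),
    taste_b + 0.05 * rng.normal(size=(3, 8)),
]))

clusters, centroids = cluster_history(history_vecs)
```

Each centroid can then serve as its own nearest-neighbor query, so the user is represented by a handful of vectors rather than one average or one vector per watched video.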
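Version 4: Real‑Time Recall – Transition from batch to real‑time candidate generation using approximate nearest‑neighbor search (the FAISS library and HNSW graph indexes), so candidates are computed only when a user actually requests them, dramatically cutting unnecessary compute and storage.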
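A hedged sketch of request-time lookup with an HNSW index. It uses `faiss.IndexHNSWFlat` when the `faiss` package is installed and otherwise falls back to exact brute-force search so the example still runs; the index parameters, data, and helper names are illustrative assumptions rather than Tubi's configuration.

```python
import numpy as np

try:
    import faiss  # optional dependency; a brute-force fallback is used if it is missing
except ImportError:
    faiss = None

def build_index(item_vecs, m=32):
    """Build an HNSW index over unit-length item vectors (or keep the matrix as a fallback)."""
    item_vecs = np.ascontiguousarray(item_vecs, dtype="float32")
    if faiss is None:
        return item_vecs
    # Default L2 metric; on unit vectors this ranks items the same way as cosine similarity.
    index = faiss.IndexHNSWFlat(item_vecs.shape[1], m)
    index.add(item_vecs)
    return index

def search(index, queries, k):
    """Return the ids of the k nearest items for each query vector."""
    queries = np.ascontiguousarray(queries, dtype="float32")
    if faiss is None:
        scores = queries @ index.T
        return np.argsort(-scores, axis=1)[:, :k]
    _, ids = index.search(queries, k)
    return ids

rng = np.random.default_rng(4)
items = rng.normal(size=(500, 16))
items /= np.linalg.norm(items, axis=1, keepdims=True)
index = build_index(items)          # built once, offline
ids = search(index, items[:2], k=5) # executed per request
```

The index is built once and shared; only the query runs per request, so nothing has to be precomputed or stored for users who never show up.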
Version 5: Context‑Aware Exploration & Sampling – After clustering, compute cluster importance based on size, recent watch time, etc., to guide exploration and sampling for context‑relevant recommendations.
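One way to turn cluster size and recency into sampling weights is sketched below. The exponential half-life decay, the multiplicative combination, and the multinomial slot allocation are illustrative assumptions; the source only says that importance depends on factors such as size and recent watch time.

```python
import numpy as np

def cluster_importance(sizes, days_since_last_watch, half_life_days=7.0):
    """Weight each cluster by its size times an exponential recency decay, normalized to 1."""
    sizes = np.asarray(sizes, dtype=float)
    days = np.asarray(days_since_last_watch, dtype=float)
    recency = 0.5 ** (days / half_life_days)
    raw = sizes * recency
    return raw / raw.sum()

def allocate_slots(weights, n_slots, rng):
    """Randomly split the candidate budget across clusters in proportion to importance."""
    return rng.multinomial(n_slots, weights)

# Three clusters: a large but stale one, a medium recent one, and a single video watched today.
weights = cluster_importance(sizes=[10, 3, 1], days_since_last_watch=[30, 1, 0])
slots = allocate_slots(weights, n_slots=20, rng=np.random.default_rng(3))
```

Sampling rather than always picking the top cluster leaves some budget for smaller or older interests, which is what makes the recall stage exploratory.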
Version 6: Future Challenges – Real‑time ranking enables adaptive clustering, richer user feedback signals, and dynamic importance weighting, paving the way for further innovations.
Authors: Jaya Kawale (VP of Engineering, Machine Learning, Tubi), translator Honghong Zhao, proofreader Shengwu Yang.
Bitu Technology
Bitu Technology is the registered company of Tubi's China team. We are engineers passionate about leveraging advanced technology to improve lives, and we hope to use this channel to connect and advance together.