Disney’s M5 Model: Multi‑Modal, Multi‑Interest, Multi‑Scenario Boost for Streaming Recommendations

Disney’s Content Discovery team introduces M5, a multi‑modal, multi‑interest, multi‑scenario recall model for VOD and live‑streaming recommendations. By leveraging rich metadata, user behavior, and contextual features, M5 outperforms baseline methods with significant hit‑ratio gains across Hulu and Disney+.

Hulu Beijing

With the growth of network infrastructure, platforms such as YouTube, Facebook, TikTok, and Netflix serve massive daily traffic. Subscription‑based streaming services (e.g., Netflix, Hulu, Disney+) provide video‑on‑demand (VOD) and live content, relying on page‑level recommendation systems to personalize content and drive user satisfaction and business metrics.

Modern recommendation pipelines follow a multi‑stage cascade: recall → coarse ranking → fine ranking → re‑ranking. The recall stage, which retrieves a small set of candidates from a large pool, caps overall quality: items missed at recall cannot be recovered by any downstream stage.

Early recall methods used collaborative filtering. With deep learning, most industrial systems now adopt dual‑tower models that generate user and item embeddings and compute preferences via inner products. Online inference typically employs approximate nearest‑neighbor search over the item embeddings, with the user embedding as the query.
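The dual‑tower retrieval step can be sketched in a few lines. This is a minimal illustration, not Disney's implementation: the embeddings are random stand‑ins, and `retrieve_top_n` does exact brute‑force scoring rather than the approximate search a production system would use.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 64
NUM_ITEMS = 1000

# Precomputed item-tower embeddings (random stand-ins here).
item_embs = rng.standard_normal((NUM_ITEMS, EMB_DIM)).astype(np.float32)

def retrieve_top_n(user_emb: np.ndarray, n: int = 10) -> np.ndarray:
    """Score every item by inner product and return the top-n item indices."""
    scores = item_embs @ user_emb                  # (NUM_ITEMS,)
    top = np.argpartition(-scores, n)[:n]          # unordered top-n
    return top[np.argsort(-scores[top])]           # sorted by descending score

user_emb = rng.standard_normal(EMB_DIM).astype(np.float32)
candidates = retrieve_top_n(user_emb, n=10)
```

In production, the brute‑force `item_embs @ user_emb` would typically be replaced by an ANN index, although, as noted below, Disney's limited inventory makes exact scoring feasible.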

Challenges Specific to Disney Streaming

Rich heterogeneous metadata (ids, genres, series, brands, actors, visual and textual features) that existing methods do not fully exploit.

Multi‑interest users: subscribers may watch VOD series, movies, and live events, exhibiting diverse interests at both coarse and fine granularity.

Multi‑scenario platform: recommendations must serve different subscription packages, regions, and matrix‑style page layouts, requiring scenario‑aware modeling.

Limited video inventory: Disney offers a curated set of high‑quality originals, making exact brute‑force scoring feasible without approximate retrieval.

To address these, we propose M5 (Multi‑Modal Multi‑Interest Multi‑Scenario Matching), a recall model designed for Disney’s streaming services.

Problem Formulation

Given a user u and a candidate set, the recall stage should return the top N videos that best match the user’s interests, i.e., the N items with the highest predicted preference scores f(u, i).
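Written out (with notation we introduce here, not taken from the source), the recall objective is:

```latex
\mathrm{TopN}(u) \;=\; \operatorname*{arg\,max}_{S \subseteq \mathcal{I},\; |S| = N} \;\sum_{i \in S} f(u, i)
```

where \(\mathcal{I}\) is the candidate pool. Because the score is pointwise, this reduces to sorting items by \(f(u, i)\) and keeping the N highest.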

Feature Types

User features: age, gender, statistical counts of watched categories.

Behavior features: high‑frequency implicit watch actions and low‑frequency explicit likes/dislikes, aggregated at the episode level with de‑duplication and enriched with fine‑grained descriptors.

Context features: device, hour, date, and recency of the last action.

Target features: same as behavior item features.

Model Architecture

The overall M5 architecture follows a dual‑channel dual‑tower design. At the bottom, a multimodal embedding layer encodes each episode id into two embeddings: an ID‑based embedding and a content‑graph (CG) embedding derived from a pre‑trained graph that incorporates visual, textual, and categorical metadata.

Both user and item embeddings are computed in parallel for the ID and CG channels. User embeddings are generated by a multi‑interest extraction layer and a multi‑scenario fusion layer; item embeddings are looked up from ID and CG embedding tables. A dynamic weighting layer then merges the multimodal predictions into a unified recall score.
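The dual‑channel forward pass described above can be sketched as follows. This is a hedged simplification: `m5_score` and its arguments are hypothetical names, and the channel weights are passed in directly rather than produced by the dynamic weighting layer covered later.

```python
import numpy as np

def m5_score(user_id_emb, user_cg_emb, item_id_emb, item_cg_emb, channel_weights):
    """Sketch of the dual-channel score: per-channel inner products
    merged by per-user channel weights (w_id, w_cg)."""
    s_id = float(user_id_emb @ item_id_emb)   # ID-channel score
    s_cg = float(user_cg_emb @ item_cg_emb)   # content-graph-channel score
    w_id, w_cg = channel_weights
    return w_id * s_id + w_cg * s_cg

# Toy 2-d example: s_id = 2.0, s_cg = 3.0, equal weights → 2.5.
u_id = np.array([1.0, 0.0]); u_cg = np.array([0.0, 1.0])
i_id = np.array([2.0, 0.0]); i_cg = np.array([0.0, 3.0])
score = m5_score(u_id, u_cg, i_id, i_cg, (0.5, 0.5))
```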

Multimodal Embedding

ID embedding is randomly initialized or incrementally updated, while CG embedding is initialized from a pre‑trained content graph that connects IDs with tag, actor, visual, and textual nodes. Visual and textual representations come from pre‑trained models (e.g., ResNet, BERT), and the graph node embeddings are trained with Word2Vec and GraphSAGE; newer methods were explored but did not yield further gains.

Shared Embedding

All non‑episode features (e.g., user demographics, context) use standard shared embeddings identical to conventional approaches.

Multi‑Interest Extraction

User behavior sequences are first processed by a multi‑layer bidirectional self‑attention transformer to capture diverse interests. For the CG channel, only the Subsidiary‑Intensity (SIN) module is used to aggregate embeddings, preserving the pre‑trained graph semantics.

Subsidiary‑Intensity (SIN) Module

SIN, inspired by DIN’s local activation unit, applies point‑wise attention to score each behavior. It cross‑features episode embeddings with auxiliary features, feeds the result into an MLP with an exponential activation to ensure non‑negative scores, and initializes the final layer to output 1 for all behaviors, allowing the network to learn importance gradually.
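A minimal numpy sketch of the SIN idea follows. The function name, shapes, and the single hidden layer are our assumptions; the two properties from the text are preserved: the exponential activation keeps every score positive, and zero‑initializing the final layer makes every behavior start with score exp(0) = 1, i.e., a plain average.

```python
import numpy as np

def sin_attention(behavior_embs, aux_feats, w1, b1, w2, b2):
    """Point-wise attention over a user's behavior sequence (sketch).

    behavior_embs: (T, D) episode embeddings
    aux_feats:     (T, A) auxiliary descriptors per behavior
    Each behavior is crossed with its auxiliary features, passed through a
    small MLP, and squashed with exp(.) so scores stay non-negative.
    """
    x = np.concatenate([behavior_embs, aux_feats], axis=-1)  # (T, D+A)
    h = np.maximum(x @ w1 + b1, 0.0)                         # ReLU hidden layer
    scores = np.exp(h @ w2 + b2).squeeze(-1)                 # (T,), all > 0
    # Weighted aggregation of behaviors into one user interest vector.
    return (scores[:, None] * behavior_embs).sum(axis=0) / scores.sum()

rng = np.random.default_rng(1)
T, D, A, H = 5, 8, 3, 16
behav = rng.standard_normal((T, D))
aux = rng.standard_normal((T, A))
w1 = 0.1 * rng.standard_normal((D + A, H)); b1 = np.zeros(H)
w2 = np.zeros((H, 1)); b2 = np.zeros(1)   # zero init → every score is exp(0)=1
user_vec = sin_attention(behav, aux, w1, b1, w2, b2)
```

With the zero‑initialized final layer, `user_vec` is exactly the mean of the behavior embeddings; training then moves the scores away from 1 as importance is learned.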

Multi‑Scenario Fusion

We employ a Sparse‑Mixture‑of‑Experts (SMoE) extension of MMoE to capture scenario‑specific patterns. Scenario IDs are added as features and combined with other inputs via inner products. Expert networks learn shared knowledge, while a disagreement loss encourages diversity among experts.
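The sparse routing at the heart of SMoE can be sketched as below. This is a generic top‑k mixture‑of‑experts, not Disney's exact layer: the expert count, gate form, and the assumption that scenario‑ID features are already part of the input vector are ours, and the disagreement loss is omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def smoe_layer(x, expert_ws, gate_w, k=2):
    """Sparse mixture of experts: route the input to the top-k experts only.

    x:         (D,) input, assumed to already include scenario-ID features
    expert_ws: list of (D, D_out) expert weight matrices
    gate_w:    (D, num_experts) gating weights
    """
    logits = x @ gate_w                        # (num_experts,)
    top = np.argsort(-logits)[:k]              # indices of the top-k experts
    gates = softmax(logits[top])               # renormalize over selected experts
    # Only the selected experts run; the rest contribute nothing.
    return sum(g * (x @ expert_ws[i]) for g, i in zip(gates, top))

rng = np.random.default_rng(2)
x = rng.standard_normal(8)
experts = [rng.standard_normal((8, 6)) for _ in range(4)]
gate_w = rng.standard_normal((8, 4))
out = smoe_layer(x, experts, gate_w, k=2)
```

In M5 the experts learn shared knowledge across scenarios while the disagreement loss (not shown) pushes their representations apart so they specialize.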

Dynamic Weighting

For each user–video pair, M5 computes separate scores from the ID and CG channels and learns weights based on user features to produce a final weighted score, enabling smooth and accurate retrieval even for large candidate sets.
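A plausible form of the weighting layer is a small gate over user features whose softmax output blends the two channel scores; the exact parameterization below is our assumption, not the paper's.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_weights(user_feats, w, b):
    """Map user features to (w_id, w_cg); softmax keeps the two channel
    weights positive and summing to 1."""
    return softmax(user_feats @ w + b)         # shape (2,)

def final_score(s_id, s_cg, weights):
    """Blend the ID-channel and CG-channel scores into one recall score."""
    return weights[0] * s_id + weights[1] * s_cg

rng = np.random.default_rng(3)
user_feats = rng.standard_normal(16)
w = rng.standard_normal((16, 2)); b = np.zeros(2)
wts = dynamic_weights(user_feats, w, b)
score = final_score(1.5, -0.5, wts)
```

Because the weights are a function of the user rather than a global constant, users who lean on content semantics (e.g., cold‑start) can get more CG weight while heavy watchers lean on the ID channel.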

Experiments

We trained and evaluated M5 on one month of data from Hulu and Disney+. Offline experiments using Hit Ratio (HR) show that M5 outperforms several strong baselines by more than 10% on Hulu and 5% on Disney+ across all scenarios. Ablation studies confirm that multimodal, multi‑interest, and multi‑scenario components each contribute orthogonal improvements.

Online A/B tests on Hulu and Disney+ using Hours per Visitor (HPV) as the metric demonstrate that M5 significantly surpasses production baselines (a feature‑rich YouTube‑style DNN for Hulu and a variational auto‑encoder model for Disney+), leading to full‑scale deployment and continued business growth.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

deep learning, streaming, recommendation systems, user modeling, multi-modal, M5 model
Written by Hulu Beijing

Follow Hulu's official WeChat account for the latest company updates and recruitment information.