How MUSE Revives Long-Tail User Behaviors with Multimodal Search for Lifelong Interest Modeling

MUSE introduces a multimodal search-based framework that reorganizes up to 100 K dormant user actions into a unified visual-semantic interest graph, enabling CTR models to leverage ultra-long behavior sequences with a +12.6% online CTR lift.


Recommendation systems often suffer from "short-term amnesia" because historical user actions are truncated or stored as isolated ID codes, leaving valuable long-tail signals unused. To address this, Alimama and Wuhan University propose MUSE (Multimodal Search-based framework), a lifelong user-interest modeling architecture that integrates visual and textual semantics.

Problem Background

In large-scale search advertising, existing two-stage pipelines (SIM/TWIN) rely solely on ID embeddings. As behavior histories grow to hundreds of thousands of events, ID-only retrieval and modeling face two major issues: (1) long-tail or expired items have weak embeddings, causing poor retrieval; (2) ID-based attention captures only co-occurrence and misses semantic similarity, so new but visually similar items are ignored.

Key Insight

Experiments show that a simple multimodal cosine similarity is sufficient for the General Search Unit (GSU), while the Exact Search Unit (ESU) benefits greatly from rich multimodal modeling. High‑quality multimodal embeddings dramatically improve ESU performance, whereas GSU can remain lightweight.

MUSE Architecture

The framework consists of three steps:

Pre-train multimodal embeddings with SCL (search-click contrastive learning), which aligns image semantics with user search-and-purchase behavior (a minimal loss sketch follows this list).

GSU performs a lightweight cosine similarity search over the SCL embeddings to retrieve the top‑K most relevant historical actions from up to 100 K events.

ESU enhances modeling with two parallel paths: SimTier builds a histogram of similarity tiers from the GSU scores, summarizing semantic interest; SA‑TA (Semantic‑Aware Target Attention) fuses ID‑based attention scores with multimodal similarity and their interaction, producing a final attention weight.
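The article does not spell out the SCL objective beyond aligning image semantics with search-click behavior, so the following is only a minimal sketch assuming an in-batch, InfoNCE-style contrastive loss over matched (item image, search context) pairs; `scl_loss`, `img_emb`, and `ctx_emb` are illustrative names, not the paper's.

```python
import torch
import torch.nn.functional as F

def scl_loss(img_emb: torch.Tensor, ctx_emb: torch.Tensor, tau: float = 0.07):
    """Hypothetical SCL objective.
    img_emb: (B, d) image-side embeddings of clicked items.
    ctx_emb: (B, d) embeddings of the search context that led to each click.
    Row i of each tensor is a positive pair; other rows act as in-batch negatives.
    """
    img = F.normalize(img_emb, dim=-1)
    ctx = F.normalize(ctx_emb, dim=-1)
    logits = img @ ctx.t() / tau                       # (B, B) similarity matrix
    labels = torch.arange(img.size(0), device=img.device)
    # Symmetric InfoNCE: each image should retrieve its context, and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```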

SimTier discretizes similarity scores into N bins, counts actions per bin, and creates a compact semantic‑interest vector. SA‑TA computes:

α_Fusion = γ₁·α_ID + γ₂·R + γ₃·(α_ID ⊙ R)

where α_ID is the original DIN target-attention score, R is the multimodal similarity, and the γᵢ are learnable scalars. The concatenation of the SimTier and SA-TA outputs becomes the lifelong interest representation fed to the CTR MLP.
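To make the two paths concrete, here is a hedged PyTorch sketch of the GSU top-K step and both ESU paths. Tensor shapes, the bin range, and the final softmax normalization are assumptions; the text specifies only the fusion formula and the histogram construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gsu_topk(target: torch.Tensor, history: torch.Tensor, k: int):
    """GSU: lightweight cosine top-K over the full history.
    target: (B, d) SCL embedding of the candidate ad/item.
    history: (B, L, d) SCL embeddings of retained historical actions."""
    sims = torch.einsum("bd,bld->bl",
                        F.normalize(target, dim=-1),
                        F.normalize(history, dim=-1))    # cosine similarities (B, L)
    return sims.topk(k, dim=-1)                          # (values, indices), each (B, K)

class SimTier(nn.Module):
    """Discretize the K retrieved similarity scores into N tiers and count
    actions per tier, yielding a compact semantic-interest histogram."""
    def __init__(self, n_bins: int = 10):
        super().__init__()
        # Interior bin edges over the cosine range [-1, 1] (an assumption).
        self.register_buffer("edges", torch.linspace(-1.0, 1.0, n_bins + 1)[1:-1])
        self.n_bins = n_bins

    def forward(self, sims: torch.Tensor) -> torch.Tensor:      # sims: (B, K)
        bins = torch.bucketize(sims, self.edges)                 # (B, K) tier ids
        return F.one_hot(bins, self.n_bins).float().mean(dim=1)  # (B, N) normalized counts

class SATA(nn.Module):
    """SA-TA: fuse ID attention with multimodal similarity R as in
    alpha_fusion = g1*alpha_id + g2*R + g3*(alpha_id * R)."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(3))                 # learnable scalars

    def forward(self, alpha_id, r, values):   # (B, K), (B, K), (B, K, d)
        g1, g2, g3 = self.gamma
        alpha = g1 * alpha_id + g2 * r + g3 * (alpha_id * r)     # fused weights (B, K)
        alpha = F.softmax(alpha, dim=-1)      # normalization is an assumption
        return torch.einsum("bk,bkd->bd", alpha, values)         # pooled interest (B, d)
```

In this sketch, the SimTier histogram and the SA-TA pooled vector would simply be concatenated per user and passed to the CTR MLP, matching the description above.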

Engineering Deployment

To keep latency low despite 100 K‑length sequences, MUSE splits GSU from the ranking path and pre‑fetches embeddings asynchronously during the matching stage, caching them in GPU memory. The ranking service then performs only the cheap cosine Top‑K selection and the lightweight ESU computations, adding negligible overhead.
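As a rough illustration of that decoupling, the sketch below warms an embedding cache asynchronously during the matching stage so the ranking stage's lookup is usually a hit. `EmbeddingCache`, the store interface, and the thread-pool transport are all hypothetical; the production system caches in GPU memory inside the serving stack.

```python
from concurrent.futures import Future, ThreadPoolExecutor

class EmbeddingCache:
    """Hypothetical decoupled prefetch: the matching stage warms the cache so
    the ranking stage's lookup rarely blocks on I/O."""
    def __init__(self, store, workers: int = 8):
        self._store = store              # e.g. a remote KV: user_id -> (L, d) matrix
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending: dict = {}         # user_id -> in-flight fetch

    def prefetch(self, user_id: str) -> None:
        """Fire-and-forget, called during the matching stage."""
        if user_id not in self._pending:
            self._pending[user_id] = self._pool.submit(self._store.get, user_id)

    def get(self, user_id: str):
        """Called from ranking; blocks only on a cold miss."""
        fut: Future = self._pending.pop(user_id, None)
        if fut is None:
            return self._store.get(user_id)   # miss: synchronous fallback
        return fut.result()                   # usually already resolved
```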

Results

Online A/B testing on Alimama's display ads shows +12.6% CTR, +5.1% RPM, and +11.4% ROI when extending behavior length from 5 K to 100 K and adding the multimodal GSU/ESU. Offline ablations confirm that longer sequences yield larger gains and that ESU performance is highly sensitive to embedding quality (SCL > I2I > OpenCLIP).

Open Dataset

The team also releases Taobao-MM, the first public dataset pairing long behavior sequences with high-quality multimodal embeddings (128-dim SCL vectors). The open version provides up to 1 K actions per user, 1 B samples, ~9 M users, and ~35 M items.

Practical Takeaways

Prioritize learning strong item‑level multimodal embeddings before adding complexity to GSU.

Introduce multimodal signals in ESU via SimTier histograms and SA‑TA fusion for substantial gains.

Address I/O bottlenecks by decoupling GSU as an asynchronous service and caching embeddings close to the ranking engine.

Overall, MUSE demonstrates that reorganizing ultra‑long user histories with multimodal semantics can transform dormant logs into valuable predictive signals while maintaining production‑grade latency.

Tags: recommendation, CTR, multimodal, lifelong interest modeling, Taobao-MM
Written by Alimama Tech, the official Alimama tech channel.
