
Multimodal Recommendation Algorithms and System Architecture at JD.com

This article presents JD.com’s multimodal recommendation system architecture, covering content understanding, multimodal ranking and recall models, practical deployment pipelines, and future research directions such as large‑model integration and supply‑side generation, all illustrated with detailed diagrams and Q&A.

JD Retail Technology

The presentation, delivered by Tang Ye at DataFunSummit 2024, introduces JD.com's content understanding capabilities and the overall multimodal recommendation workflow, which spans both efficiency-focused standard pipelines (recall, coarse ranking, fine ranking, and multi-material mixing) and ToB mechanism pipelines for deterministic, explainable distribution.

Content understanding is divided into three parts: classification tags (topic, interest, genre, style), quality tags (cover, title, video, audio), and semantic representations (title, key‑frame, cover embeddings). Text encoders use Transformer architectures, while image encoders employ EfficientNet.
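The three-part split above can be captured as a simple record. This is a hypothetical sketch; the field names and example values are illustrative, not JD.com's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ContentProfile:
    """Illustrative container mirroring the three content-understanding outputs."""
    classification_tags: dict = field(default_factory=dict)  # topic, interest, genre, style
    quality_tags: dict = field(default_factory=dict)         # cover, title, video, audio scores
    embeddings: dict = field(default_factory=dict)           # title / key-frame / cover vectors

# Example item: tags plus a title embedding (e.g. from a Transformer text encoder).
profile = ContentProfile(
    classification_tags={"topic": "digital", "style": "review"},
    quality_tags={"cover": 0.92, "title": 0.81},
    embeddings={"title": [0.1, 0.4, -0.2]},
)
```

In practice each field would be produced by a separate model (classifiers for tags, encoders for embeddings) and joined on the item ID.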

For multimodal ranking, an MMoE backbone adapts to various domains and material modalities, with sub-modules handling comprehensive feature usage, cross-domain interest modeling (CDR), and challenges such as cold-start and timeliness in live-stream scenarios.
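For readers unfamiliar with the MMoE (Multi-gate Mixture-of-Experts) pattern, a minimal forward pass looks like the following. This is a generic sketch of the architecture, not JD.com's implementation; all shapes and weight matrices are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mmoe_forward(x, expert_ws, gate_ws):
    """One MMoE forward pass: shared experts, one softmax gate per task.

    x: (d,) input features; expert_ws: list of (d, h) expert weight matrices;
    gate_ws: list of (d, n_experts) gate matrices, one per task.
    Returns one (h,) representation per task.
    """
    experts = np.stack([x @ W for W in expert_ws])  # (n_experts, h)
    outputs = []
    for Wg in gate_ws:
        gate = softmax(x @ Wg)                      # (n_experts,) mixing weights
        outputs.append(gate @ experts)              # task-specific expert mixture
    return outputs
```

Each task (e.g. click vs. conversion, or each material modality) gets its own gate, so the shared experts can specialize without separate towers per task.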

The GMF framework is introduced, consisting of a DSN residual component that extracts redundant multimodal details from ID embeddings and a personalization module (MIN) that models user preferences across text, ID, and video modalities using softmax‑scaled probabilities.
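The softmax-scaled modality preference in MIN can be sketched as follows. This is an assumed form based on the description above (score each modality against the user, softmax the scores, fuse); the actual MIN module may differ.

```python
import numpy as np

def modality_preference(user_vec, modality_embs):
    """Score each modality embedding (text / ID / video) against the user
    vector, convert scores to softmax probabilities, and return both the
    per-modality preference weights and a preference-weighted fusion."""
    mods = list(modality_embs)
    E = np.stack([modality_embs[m] for m in mods])  # (n_modalities, d)
    scores = E @ user_vec
    scores = scores - scores.max()                  # numerical stability
    w = np.exp(scores) / np.exp(scores).sum()       # softmax-scaled probabilities
    fused = w @ E                                   # user-specific modality mix
    return dict(zip(mods, w)), fused
```

A user who mostly engages with video content would get a high weight on the video embedding, so the fused representation leans toward that modality.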

Multimodal recall combines online and offline stages. Online models build multimodal interest recall progressively from cold‑start to rich interactions, while offline models address user and material cold‑start via tag‑based labeling and interaction mining. CLIP‑based embeddings are used for similarity‑driven retrieval, with a 1:4 positive‑to‑negative sample ratio for training.
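The 1:4 positive-to-negative sampling can be illustrated with a few lines of Python. The item IDs and candidate pool are hypothetical; only the ratio comes from the source.

```python
import random

def build_training_pairs(clicked, candidate_pool, neg_ratio=4, seed=0):
    """For each positive (clicked) item, sample `neg_ratio` negatives from
    the un-clicked pool, yielding (item, label) pairs at a 1:neg_ratio ratio."""
    rng = random.Random(seed)
    negatives = [i for i in candidate_pool if i not in clicked]
    pairs = []
    for pos in clicked:
        pairs.append((pos, 1))
        pairs.extend((n, 0) for n in rng.sample(negatives, neg_ratio))
    return pairs
```

These labeled pairs would then feed a similarity objective over the CLIP-based embeddings, pushing positives closer to the user representation than the sampled negatives.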

The system’s infrastructure includes offline training that generates embeddings and vector vocabularies, which are loaded into an online index for vector‑based and vocabulary‑based retrieval. User behavior sequences and side information are packaged and sent to the index, producing candidate sets that are re‑ranked by the downstream ranking module.
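The vector-based retrieval step reduces to a nearest-neighbor lookup over the loaded index. Below is a toy brute-force stand-in (a production system would use an ANN library such as Faiss); the function name and data are illustrative.

```python
import numpy as np

def topk_cosine(query, index_vecs, item_ids, k=3):
    """Cosine similarity between the packaged user embedding and every
    indexed item vector; returns the top-k candidate IDs and their scores
    for downstream re-ranking."""
    q = query / np.linalg.norm(query)
    V = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = V @ q
    order = np.argsort(-sims)[:k]
    return [item_ids[i] for i in order], sims[order]
```

The returned candidate set is exactly what the text describes: a shortlist produced by the index, handed to the ranking module for re-ranking.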

Future directions focus on three areas: (1) enhancing content understanding with large‑model fine‑tuning, knowledge graphs, and RAG‑based LLM solutions; (2) end‑to‑end multimodal encoding to reduce semantic loss while managing model size and computational cost; (3) leveraging generative models to alleviate supply‑side scarcity by generating videos and aggregated materials.

The Q&A section addresses challenges such as sparse conversion labels in video recommendation and mixed‑material ranking strategies, emphasizing list‑wise optimization, preservation of business queue order, and multi‑stage sequence generation and selection.

Tags: recommendation, AI, ranking, recall, multimodal, content understanding, JD.com
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
