Multimodal Algorithms for Content Understanding and Distribution in JD E‑commerce

This article presents JD's multimodal content‑understanding framework, detailing its five‑M business characteristics, the architecture of multimodal recall and ranking models, the GMF and MIN modules for semantic alignment and personalization, and future research directions involving large language models and end‑to‑end multimodal encoding.

DataFunSummit

Aug 22, 2024

Introduction

Tang Ye, Technical Director of the Content Algorithm Team in JD Retail Search & Recommendation, presents how multimodal algorithms are applied to JD e‑commerce content. The talk has three parts: an overview of JD's content‑understanding capabilities, multimodal applications in video distribution (recall and ranking), and future research directions.

JD Content Understanding Capability

JD's recommendation business exhibits five M's: multi‑source (search and recommendation), multi‑material (video, live, shop, aggregated content), multi‑position (homepage, inner pages, detail pages), multi‑behavior (click, watch, share, comment, add‑to‑cart, favorite), and multimodal (image, text, video). The content‑understanding pipeline produces classification tags, quality tags, and semantic representations, using a Transformer for text and EfficientNet for images.

Multimodal Ranking

The ranking backbone is based on MMoE (Multi‑gate Mixture‑of‑Experts), with three sub‑modules: comprehensive feature usage, cross‑domain content interest modeling (CDR), and handling of challenges such as cold start and timeliness. For semantic alignment, ID embeddings are projected into the text and visual spaces, which exposes the gap between what the ID embedding encodes and the actual modality semantics. Personalization is handled by the MIN module, which predicts the user's interest probability for each of the text, ID, and video modalities and fuses these probabilities with attention‑pooled multimodal features.
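To make the MMoE backbone concrete, here is a minimal sketch of its core idea: several shared experts whose outputs are mixed by a separate softmax gate per task. The function names, the two example tasks (click and watch), and the hard‑coded gate logits are illustrative assumptions, not details from the talk; in a real model the experts and gates are learned layers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mmoe_mix(expert_outputs, gate_logits_per_task):
    """Mix shared expert outputs with one softmax gate per task.

    expert_outputs: list of E vectors, each of dimension D
    gate_logits_per_task: list of T lists, each holding E gate logits
    returns: list of T task-specific vectors of dimension D
    """
    dim = len(expert_outputs[0])
    mixed = []
    for logits in gate_logits_per_task:
        weights = softmax(logits)
        vec = [
            sum(w * expert[d] for w, expert in zip(weights, expert_outputs))
            for d in range(dim)
        ]
        mixed.append(vec)
    return mixed

# Hypothetical example: two experts, two tasks (click and watch).
experts = [[1.0, 0.0], [0.0, 1.0]]
click_repr, watch_repr = mmoe_mix(experts, [[0.0, 0.0], [10.0, -10.0]])
```

With equal gate logits the click task averages both experts, while the watch task's gate puts almost all of its weight on the first expert. Each task's tower would then consume its own mixed representation.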

GMF Framework

GMF consists of two parts: a DSN residual component, which extracts the multimodal details that the ID embeddings do not already capture, and a personalization component (MIN), which models user preferences across modalities. The DSN uses a CGGN adversarial network to project ID embeddings into the modality spaces, then computes an auto‑difference to isolate the fine‑grained details.
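The two GMF components can be illustrated with a deliberately simplified sketch. It assumes the residual is the modality embedding minus the ID embedding projected into that modality's space, and it swaps the adversarial projection described in the talk for a fixed linear matrix; the interest probabilities, which MIN would predict, are hard‑coded here. All names and values are hypothetical.

```python
def project(matrix, vec):
    """Apply a linear projection: each row of matrix dotted with vec."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def residual(modality_emb, id_emb, projection):
    """Modality embedding minus the ID embedding projected into that space.

    Assumption: a plain linear projection stands in for the adversarially
    trained projection described in the talk.
    """
    projected = project(projection, id_emb)
    return [m - p for m, p in zip(modality_emb, projected)]

def fuse(features_by_modality, interest_probs):
    """Sum the modality features, each weighted by the user's interest probability."""
    dim = len(next(iter(features_by_modality.values())))
    fused = [0.0] * dim
    for name, feat in features_by_modality.items():
        for d in range(dim):
            fused[d] += interest_probs[name] * feat[d]
    return fused

# Hypothetical item: a 2-d ID embedding plus text and video embeddings.
identity = [[1.0, 0.0], [0.0, 1.0]]
id_emb = [1.0, 2.0]
residuals = {
    "text": residual([1.5, 2.0], id_emb, identity),   # [0.5, 0.0]
    "video": residual([1.0, 3.0], id_emb, identity),  # [0.0, 1.0]
}
# Hypothetical per-user interest probabilities (predicted by MIN in the talk).
fused = fuse(residuals, {"text": 0.25, "video": 0.75})
```

A user who leans toward video gets a fused vector dominated by the video residual, which is how the personalization component can shift modality weights from user to user.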

Multimodal Recall

Recall is split into offline and online stages. Offline, the model builds multimodal embeddings with CLIP, followed by interaction layers and an MLP that predicts label probabilities. Positive samples come from users' SKU‑video and video‑video click pairs; negatives are sampled globally at a 1:4 ratio. Online, inference selects seed SKUs from the user's recent clicks and retrieves candidate videos via vector similarity and index lookup.
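The sample construction described above can be sketched as follows. The sketch assumes the 1:4 ratio means four negatives per positive pair and that "global" sampling means drawing uniformly from the whole video pool while skipping known positives; the function name and the IDs are hypothetical.

```python
import random

def build_training_pairs(positive_pairs, all_video_ids, negatives_per_positive=4, seed=0):
    """Return (anchor, video, label) triples with globally sampled negatives.

    positive_pairs: (anchor, video) click pairs; the anchor is a SKU or a video
    all_video_ids: the global pool that negatives are drawn from
    """
    rng = random.Random(seed)
    clicked = set(positive_pairs)
    samples = []
    for anchor, video in positive_pairs:
        samples.append((anchor, video, 1))
        # Exclude known positives and the anchor itself from the negative pool.
        pool = [v for v in all_video_ids if (anchor, v) not in clicked and v != anchor]
        if not pool:
            continue
        for _ in range(negatives_per_positive):
            samples.append((anchor, rng.choice(pool), 0))
    return samples

# Hypothetical click log: one SKU-video pair and one video-video pair.
positives = [("sku_1", "vid_a"), ("vid_a", "vid_b")]
videos = ["vid_a", "vid_b", "vid_c", "vid_d", "vid_e"]
samples = build_training_pairs(positives, videos)
```

Uniform global sampling is the simplest choice; production systems often weight negatives by popularity or mix in harder negatives, but the source does not say which variant is used.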

Infrastructure and Training

Training uses log‑derived samples to learn embeddings, which are stored in vector indexes (V2V, etc.) for online retrieval. At inference time, the user ID, behavior sequence, and side information are packaged and sent to the index module for vector and term‑based recall, followed by a top‑k ranking stage.
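As a rough illustration of the vector side of this flow, the sketch below scores every indexed video by its best cosine similarity to any seed embedding and keeps the top k. It is a brute‑force stand‑in for a real vector index, and the function names, embeddings, and IDs are assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recall_videos(seed_embeddings, video_index, k=2):
    """Return the ids of the k videos most similar to any seed embedding.

    seed_embeddings: embeddings of items the user recently clicked
    video_index: dict mapping video id -> embedding (stand-in for a vector index)
    """
    if not seed_embeddings:
        return []
    scored = []
    for vid, emb in video_index.items():
        best = max(cosine(seed, emb) for seed in seed_embeddings)
        scored.append((best, vid))
    # Highest similarity first; ties are broken by id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [vid for _, vid in scored[:k]]

# Hypothetical seed and index contents.
seeds = [[1.0, 0.0]]
index = {"vid_a": [1.0, 0.0], "vid_b": [0.0, 1.0], "vid_c": [1.0, 1.0]}
top = recall_videos(seeds, index, k=2)
```

A production system would replace the linear scan with an approximate nearest‑neighbor index and merge these results with the term‑based recall channel before the top‑k ranking stage.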

Future Directions

Three main avenues are explored: (1) content understanding with large language models fine‑tuned for e‑commerce, combined with RAG‑based knowledge retrieval, to improve multimodal tag quality; (2) end‑to‑end multimodal encoding to reduce semantic loss between recall and ranking, possibly via a two‑stage approach with a universal dynamic multimodal encoder; (3) generative supply, using large models to create video and aggregated material to alleviate content scarcity.

Q&A

Q1 discusses the sparsity of conversion labels in video recommendation and strategies to shorten the user funnel. Q2 explains the mixed‑ranking pipeline, which preserves order across heterogeneous material queues and optimizes list‑wise efficiency based on domain‑specific objectives.
