Generative Large‑Model Architecture for JD Advertising: Practices, Challenges, and Optimization
JD’s advertising platform replaces rule‑based recall with a generative large‑model pipeline that unifies e‑commerce knowledge, multimodal user intent, and semantic IDs across recall, coarse‑ranking, fine‑ranking, and creative optimization. Through quantization, parallelism, caching, and joint generative‑discriminative inference, the system meets sub‑100 ms latency and sub‑¥1‑per‑million‑token cost targets, delivering double‑digit performance gains and paving the way for domain‑specific foundation models.
Overview
In JD’s advertising platform, the retrieval (recall) stage is critical. Traditional rule‑based recall lacks flexibility and fails to capture diverse user needs, while large generative models offer new opportunities but introduce training cost and privacy challenges.
Key Points of the Talk
At the AICon Global AI Development & Application Conference, JD’s Algorithm Director Zhang Zehua presented “JD Advertising Large‑Model Application Architecture Practice”, sharing solutions and lessons for applying large models in advertising.
Generative Retrieval System
The system integrates world knowledge, JD’s e‑commerce data, multimodal product understanding, and user‑intent recognition, coupled with efficient model training and inference pipelines. By quantizing product semantics into discrete IDs, decoding candidates generatively for recall, and optimizing inference performance, recall efficiency is significantly improved.
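Generative recall typically decodes semantic‑ID sequences constrained to valid catalog items, so the model can only emit IDs that exist. The trie‑constrained greedy decode below is a common pattern from the generative‑retrieval literature, sketched with toy scores in place of a real model; the catalog, SKU names, and scoring function are all illustrative assumptions.

```python
# Valid semantic IDs in a toy catalog: each item is a short sequence of codes.
CATALOG_IDS = {(1, 4): "sku-A", (1, 7): "sku-B", (3, 2): "sku-C"}

def valid_next(prefix):
    """Codes that extend `prefix` toward some valid catalog ID (a flat trie)."""
    return {ids[len(prefix)] for ids in CATALOG_IDS
            if len(ids) > len(prefix) and ids[:len(prefix)] == tuple(prefix)}

def toy_score(prefix, code):
    """Stand-in for model logits: here, smaller codes score higher (illustrative only)."""
    return -code

def constrained_greedy_decode(length=2):
    """Greedily pick the highest-scoring code among valid continuations."""
    prefix = []
    for _ in range(length):
        options = valid_next(prefix)
        prefix.append(max(options, key=lambda c: toy_score(prefix, c)))
    return CATALOG_IDS[tuple(prefix)]

best = constrained_greedy_decode()
```

In production, `valid_next` would be backed by a real prefix trie over billions of IDs, and `toy_score` replaced by the decoder’s logits at each step.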
Three‑Stage Pipeline
From a classic advertising workflow, generative techniques are applied in three stages:
Recall & coarse‑ranking – an information‑retrieval problem in which candidate items are generated from massive data rather than looked up by rules.
Fine‑ranking – CTR/CVR models filter and rank candidates.
Information‑completion – multimodal understanding and re‑ranking (creative optimization) refine top results.
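The three stages above can be sketched as a minimal pipeline. The function names, scoring fields, and catalog below are illustrative assumptions, not JD’s actual implementation; in particular, the recall stand‑in filters a list, whereas the real system would decode candidates generatively.

```python
from typing import List, Dict

def recall(query: str, catalog: List[Dict], k: int = 100) -> List[Dict]:
    """Stage 1 (toy stand-in): produce a broad candidate set.
    A generative system would decode semantic IDs instead of filtering."""
    return [item for item in catalog if query in item["title"]][:k]

def fine_rank(candidates: List[Dict], k: int = 10) -> List[Dict]:
    """Stage 2 (toy stand-in): order candidates by a pCTR-like score."""
    return sorted(candidates, key=lambda it: it["pctr"], reverse=True)[:k]

def rerank(ranked: List[Dict]) -> List[Dict]:
    """Stage 3 (toy stand-in): creative optimization / re-ranking,
    here just attaching a chosen creative to each item."""
    return [{**it, "creative": f"ad-for-{it['sku']}"} for it in ranked]

catalog = [
    {"sku": "A1", "title": "running shoes", "pctr": 0.031},
    {"sku": "B2", "title": "trail running shoes", "pctr": 0.052},
    {"sku": "C3", "title": "dress shoes", "pctr": 0.012},
]
results = rerank(fine_rank(recall("running", catalog)))
```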
Data Representation
Semantic ID is introduced as a unified representation for user behavior and e‑commerce knowledge, enabling the model to understand both structured (product attributes) and unstructured (user‑generated images, comments) data.
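One common way to produce such semantic IDs is residual quantization of an item embedding against a stack of codebooks (as in RQ‑VAE‑style approaches); the talk summary does not specify JD’s exact method, and the random codebooks below are purely for illustration, where learned ones would be used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, LEVELS, CODEBOOK_SIZE = 8, 3, 16

# One codebook per quantization level (random here; learned in practice).
codebooks = rng.normal(size=(LEVELS, CODEBOOK_SIZE, DIM))

def semantic_id(embedding: np.ndarray) -> tuple:
    """Quantize an item embedding into a tuple of discrete codes, one per
    level, by greedily matching the residual to the nearest codeword."""
    residual = embedding.copy()
    codes = []
    for level in range(LEVELS):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - codebooks[level][idx]
    return tuple(codes)

item_embedding = rng.normal(size=DIM)
sid = semantic_id(item_embedding)
```

The resulting code tuple acts as a compact token sequence the generative model can read and emit, putting structured and unstructured signals in one vocabulary.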
Engineering Challenges
Two major challenges dominate industrial deployment:
Low latency: inference must stay below ~100 ms, otherwise the result is discarded.
High throughput & cost control: a million‑token inference should cost less than ¥1, otherwise large‑model deployment is not viable.
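A back‑of‑envelope check makes the cost constraint concrete: given a GPU’s hourly price, the ¥1‑per‑million‑token target fixes the minimum sustained throughput. The GPU price below is an assumed placeholder, not a figure from the talk.

```python
# Assumed placeholder: one inference GPU costs ¥20 per hour (illustrative only).
gpu_cost_per_hour = 20.0
target_cost_per_million_tokens = 1.0  # ¥1 per million tokens, from the talk

# Tokens the GPU must serve per hour to stay at or under the target cost.
required_tokens_per_hour = (gpu_cost_per_hour / target_cost_per_million_tokens) * 1_000_000
required_tokens_per_second = required_tokens_per_hour / 3600
```

Under these assumed numbers a single GPU would need to sustain roughly 5,500 tokens per second, which is why the batching, quantization, and caching work below is load‑bearing rather than optional.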
Optimization Layers
Optimization is tackled at three levels:
Single‑node: quantization (FP8 and lower‑bit formats), tensor parallelism, optimized attention kernels (FlashAttention, PagedAttention), and dynamic batching under latency constraints.
Distributed: software‑hardware co‑design, KV‑cache pooling, model‑graph partitioning, and multi‑level caching across CPU RAM and GPU HBM.
Full‑link: edge pre‑computation plus near‑line and offline computation that move work off the latency‑critical path.
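As a toy illustration of the quantization idea at the single‑node level (FP8 kernels need hardware support, so this sketch uses symmetric int8 instead), weights are scaled into an 8‑bit range and dequantized on use; the numbers and shapes are arbitrary.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: map [-max|w|, max|w|] to [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from int8 codes and the scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(42)
weights = rng.normal(scale=0.1, size=(4, 4)).astype(np.float32)
q, s = quantize_int8(weights)
recovered = dequantize(q, s)
max_err = float(np.abs(weights - recovered).max())
```

The same scale‑and‑round principle underlies production FP8/int8 paths, where the payoff is smaller weights in HBM and faster matrix multiplies, at the cost of a bounded rounding error (at most half a quantization step here).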
Joint Generative & Discriminative Inference
JD rewrites the generative inference flow in TensorFlow, integrates it with traditional sparse CTR/CVR models, and shares hidden states between the two, achieving a unified inference graph that avoids HBM bottlenecks.
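The hidden‑state sharing idea can be sketched as a single forward pass feeding two heads. This is a simplified numpy illustration, not JD’s TensorFlow graph; the trunk, head names, and random weights are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
HIDDEN, VOCAB = 16, 32

# Shared trunk plus two heads: generative (next-semantic-ID logits) and
# discriminative (a pCTR-style score). Weights are random placeholders.
W_trunk = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
W_gen = rng.normal(size=(HIDDEN, VOCAB)) * 0.1
W_ctr = rng.normal(size=(HIDDEN, 1)) * 0.1

def unified_forward(features: np.ndarray):
    """Compute the shared hidden state once, then feed both heads, so the
    trunk runs a single time and its activations live in memory once."""
    hidden = np.tanh(features @ W_trunk)             # shared hidden state
    gen_logits = hidden @ W_gen                      # generative head
    pctr = 1.0 / (1.0 + np.exp(-(hidden @ W_ctr)))   # discriminative head
    return gen_logits, pctr

x = rng.normal(size=(1, HIDDEN))
logits, pctr = unified_forward(x)
```

Fusing both heads into one graph is what lets the platform avoid duplicating trunk activations in HBM and running two separate inference passes per request.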
Results & Outlook
The generative approach has been applied across recall, coarse‑ranking, fine‑ranking, creative bidding, and re‑ranking, delivering double‑digit performance gains. Future directions include domain‑specific foundation models for e‑commerce, deeper fusion of generative and discriminative models, and continued co‑design of algorithms and systems.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.