How JD.com Leverages Domestic NPU Chips to Power Large‑Scale AI Models
This article details JD.com's challenges and solutions for deploying domestic NPU chips across heterogeneous GPU‑NPU clusters, covering architecture, scheduling, high‑performance training and inference engines, real‑world case studies, and future plans to scale AI workloads securely and efficiently.
1. Introduction
With the widespread adoption of large models, AI compute has become a focus of competition. JD.com's businesses are built on massive data and need a solid compute foundation; without one, its algorithms cannot realize their full potential.
US export restrictions on high‑end AI chips raise concerns about compute security. Chinese industry associations have urged caution in purchasing US chips and promoted domestic alternatives.
Deploying domestic NPU chips in JD's business scenarios faces three main challenges: hardware architecture differences, an immature software ecosystem, and diverse, complex business requirements.
2. Challenges
2.1 Hardware architecture differences
JD’s compute clusters were historically GPU‑centric. Domestic NPU architectures differ significantly, requiring enhanced compatibility and flexible scheduling to fully utilize heterogeneous chips.
2.2 Software ecosystem immaturity
Open‑source training and inference frameworks lack native support for domestic NPU, leading to high migration costs for precision verification and performance tuning.
2.3 Diverse business scenarios
JD retail presents varied model selections and performance demands, necessitating a unified solution that can be applied across many use cases.
3. AI Engine Technology Based on Domestic Chips
3.1 Overall Architecture
The engine stack is organized in three layers, detailed in the following subsections: a heterogeneous GPU‑NPU scheduling system at the resource layer, a high‑performance training engine, and a high‑performance inference engine on top.
3.2 Heterogeneous GPU‑NPU Scheduling System
The platform builds a thousand‑card cluster with RDMA interconnect, offering unified quota allocation and flexible scheduling for both GPU and domestic NPU. Key features include the following (a minimal scheduling sketch follows the list):
NUMA‑aware and network‑topology‑aware scheduling to maximize resource efficiency.
Resource fragmentation minimization using Gang, BinPack, and node reservation strategies.
Configurable priority eviction to guarantee quota and ensure fairness.
Shared resource queues that provide guaranteed (MIN) and elastic (MAX) capacities, improving overall utilization.
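To make the strategies above concrete, here is a minimal, illustrative sketch of the three placement rules. It is not JD's production scheduler; the names (Node, Queue, gang_admit, binpack_score, can_schedule) are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cards: int  # idle GPU or NPU cards on this node

@dataclass
class Queue:
    name: str
    quota_min: int   # guaranteed capacity (MIN)
    quota_max: int   # elastic ceiling (MAX)
    used: int = 0

def gang_admit(worker_needs: list[int], nodes: list[Node]) -> bool:
    """Gang scheduling: admit a distributed job only if every worker can be
    placed at once, so a half-placed job never holds cards while waiting."""
    free = sorted((n.free_cards for n in nodes), reverse=True)
    for need in sorted(worker_needs, reverse=True):
        for i, f in enumerate(free):
            if f >= need:
                free[i] = f - need
                break
        else:
            return False  # some worker cannot be placed: admit nothing
    return True

def binpack_score(node: Node, need: int, node_capacity: int) -> float:
    """BinPack: score nodes so the fullest feasible node wins, packing jobs
    tightly and keeping whole nodes free for large multi-card jobs."""
    if node.free_cards < need:
        return -1.0  # infeasible
    return 1.0 - (node.free_cards - need) / node_capacity

def can_schedule(q: Queue, need: int, idle_elsewhere: int) -> bool:
    """Shared queues: usage below MIN is guaranteed; between MIN and MAX the
    queue may borrow idle capacity, and borrowed pods are evicted first
    when the owning queue reclaims its guarantee."""
    if q.used + need <= q.quota_min:
        return True
    return q.used + need <= q.quota_max and idle_elsewhere >= need
```

The intuition: gang admission prevents deadlock between large distributed jobs, BinPack keeps whole nodes free for them, and MIN/MAX queues let idle guaranteed capacity be borrowed without compromising the owner's ability to reclaim it.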
3.3 High‑Performance Training Engine
The training engine supports over 40 mainstream foundation models (LLM, multimodal, text‑to‑image, etc.) with a zero‑cost, seamless switch between GPU and NPU via a unified API; a device‑selection sketch follows the list below. Optimizations include model parallelism, pipeline parallelism, low‑precision communication, and compute‑communication fusion, achieving up to 60% MFU on hundred‑card clusters and near‑linear scaling for trillion‑parameter models.
Coverage of 30+ LLM and 10+ multimodal bases.
Full training workflow support, including data, training modes, labeling, evaluation, and 20+ task types.
Deep soft‑hardware co‑optimization (e.g., Triton compilation, CANN fusion) for operators such as flash attention, rotary embedding, and npu_matmul_add_fp32, reaching 60% MFU.
High‑availability features like token pre‑cache and minute‑level asynchronous checkpointing, reducing startup time and storage by over 90% and improving training efficiency by 15%.
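As a sketch of the "seamless switch" claim above: with PyTorch plus Huawei's Ascend adapter (torch_npu), device selection can be isolated in one helper so the same training code runs on either chip. Exact availability checks vary by torch_npu version; treat this as an assumption‑laden illustration, not JD's actual unified API:

```python
import torch

def pick_device() -> torch.device:
    # Prefer an Ascend NPU when the torch_npu adapter is installed; otherwise
    # fall back to CUDA, then CPU. (torch_npu registers the "npu" device
    # type on import; the is_available() check is version-dependent.)
    try:
        import torch_npu  # noqa: F401
        if torch.npu.is_available():
            return torch.device("npu")
    except ImportError:
        pass
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device = pick_device()
model = torch.nn.Linear(4096, 4096).to(device)  # identical model code either way
x = torch.randn(8, 4096, device=device)
model(x).sum().backward()                        # identical training step
```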
3.4 High‑Performance Inference Engine
The inference engine provides MaaS "out‑of‑the‑box" deployment for domestic NPU, compatible with the OpenAI and Triton APIs and supporting 20+ industry‑standard LLMs. Performance gains of roughly 20% over open‑source frameworks come from model quantization (W8A8 SmoothQuant, W4A16 AWQ) and compiler optimizations (GE graph, ATB operators). A client‑side usage sketch follows the list below.
Unified deployment with streaming inference.
Support for Baichuan, ChatGLM, Qwen, Llama, and other major models.
Optimized operators (Paged Attention, Flash Attention, Sub_Mul_Concat) and cache techniques (KV cache, Prefix cache) to accelerate inference.
Visual monitoring and alerting for throughput, failure rate, and latency.
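Because the serving layer is OpenAI‑compatible, an existing client needs only a base‑URL change to target the NPU deployment. A hedged sketch using the official openai Python client; the endpoint URL and model name are placeholders, not real values:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://npu-inference.example.internal/v1",  # hypothetical endpoint
    api_key="EMPTY",                                      # gateway may not require one
)

# Streaming inference, exactly as against any OpenAI-compatible server.
stream = client.chat.completions.create(
    model="qwen2-7b-instruct",  # whichever model the engine serves
    messages=[{"role": "user", "content": "How do I check my listed item count?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```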
4. Deployment Scenarios
Case 1: Video Tag Generation
Using a Qwen2‑VL multimodal model on NPU, JD achieves comparable token output and latency to GPU for video tag cloud generation.
Case 2: Logistics Large Model
Fine‑tuning Qwen2‑7B on NPU for address parsing and classification yields accuracy (≈91%) matching GPU results; the model is now used in sorting and POI classification tasks.
#Input_1
青海省西宁市城北区三其村。可以发圆通吗 谢谢。 (Sanqi Village, Chengbei District, Xining City, Qinghai Province. Can this be shipped via YTO Express? Thanks.)
#Output‑NPU (domestic NPU)
青海省_1,西宁市_3A,城北区_3A,三其村_4B, _5A-1,可以发圆通吗 谢谢_UNK,
#Output‑GPU
青海省_1,西宁市_3A,城北区_3A,三其村_4B, _5A-1,可以发圆通吗 谢谢 _UNK
Case 3: Merchant Smart Assistant
Fine‑tuned Qwen1.5‑7B on NPU provides comparable tool‑selection results to GPU, with 96% agreement on downstream task routing.
#Input_1
上架宝贝数怎么看? (How do I view the number of listed items?)
#Output‑NPU
{... "tool_name":"business_expert", "query":"如何查看已上架的商品数量?" ...}
#Output‑GPU
{... "tool_name":"business_expert", "query":"如何查看已上架的商品数量?" ...}5. Value and Impact
The domestic‑chip AI engine has been deployed in more than ten JD retail scenarios, reducing reliance on foreign chips and enhancing compute security. It boosts efficiency in search, advertising, intelligent customer service, and data analysis, and feeds operational experience back into the domestic chip ecosystem.
JD has co‑built the Openmind community with Huawei Ascend, showcased solutions at industry forums, and received multiple internal awards for the project.
6. Future Plans
By 2025 JD aims to build a ten‑thousand‑card cluster with mixed GPU‑NPU scheduling, expand chip support, and implement intelligent resource prediction and dynamic scaling. Continued collaboration with domestic chip vendors will drive ecosystem growth, while further optimizations target LLM and CTR workloads for both training and inference.