
JD Retail End-to-End AI Engine Compatible with GPU and Domestic NPU: Architecture, Optimization, and Applications

JD Retail’s Nine‑Number Algorithm Platform delivers an end‑to‑end AI engine that unifies GPU and domestic NPU resources across a thousand‑card cluster, offering zero‑cost model migration, optimized training and inference pipelines, support for over 40 LLM and multimodal models, and proven business‑level performance that reduces dependence on overseas chips.

JD Retail Technology

In recent years, the rapid rise of domestic AI chips in China has created a critical need for model adaptation, performance optimization, and practical deployment on these chips. JD Retail's Nine‑Number Algorithm Platform (九数算法中台) addresses this need with an end‑to‑end AI engine, compatible with both GPU and domestic NPU, that spans the hardware cluster layer, the algorithmic engines, and multi‑scenario applications.

The platform builds a thousand‑card‑scale cluster on high‑performance networking and offers the same scheduling capabilities for domestic NPU as for GPU. A unified, API‑driven training and inference engine supports mainstream models with zero‑cost migration from GPU to NPU, while MFU optimization, model quantization, and compilation techniques significantly boost engine performance.
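
For illustration, device‑agnostic user code enabled by this kind of abstraction might look like the sketch below. It assumes Huawei Ascend's torch_npu adapter as the NPU backend, a common choice for domestic NPUs; the article does not show the platform's actual API.

import torch

try:
    import torch_npu  # Ascend adapter; registers the "npu" device with PyTorch
    HAS_NPU = torch.npu.is_available()
except ImportError:
    HAS_NPU = False

def pick_device() -> torch.device:
    """Prefer a domestic NPU, fall back to GPU, then CPU."""
    if HAS_NPU:
        return torch.device("npu")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_device()
model = torch.nn.Linear(1024, 1024).to(device)  # identical call on every backend
x = torch.randn(8, 1024, device=device)
y = model(x)  # the forward pass needs no backend-specific changes
print(f"ran on {device}, output shape {tuple(y.shape)}")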

Challenges

Hardware architecture differences: JD's underlying compute clusters were historically GPU‑centric. Domestic NPU architectures differ markedly, requiring enhanced compatibility and flexible scheduling to fully exploit heterogeneous chip resources.

Immature software ecosystem: Open‑source frameworks lack direct support for domestic NPU, leading to high migration costs for precision verification and performance tuning.

Diverse and complex business scenarios: JD Retail's varied workloads demand a unified solution that can be flexibly adapted across many use cases.

Overall Technical Architecture

The platform provides a unified GPU/NPU scheduling system, resource‑aware queue mechanisms, and a high‑performance training engine that supports over 40 LLM and multimodal base models. It also offers a high‑performance inference engine with MaaS‑style one‑click deployment, supporting 20+ SOTA models and delivering ~20% speedup over open‑source frameworks.

GPU and NPU Heterogeneous Mixed Scheduling

Key features include:

Thousand‑card cluster with comprehensive visual monitoring, health checks, and automatic fault isolation.

Scheduling optimizations such as NUMA‑aware and network‑topology‑aware placement, resource fragmentation minimization, and configurable priority preemption.

Resource queue mechanisms that guarantee each queue a minimum allocation while letting idle capacity be shared across queues, maximizing NPU utilization (a toy sketch of this policy follows the list).
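
As a toy illustration of that queue semantics (not JD's production scheduler), a two‑phase allocation could work as below; all queue names and card counts are invented.

from dataclasses import dataclass

@dataclass
class Queue:
    name: str
    guaranteed: int  # cards this queue can always claim
    demand: int      # cards its pending jobs currently request
    allocated: int = 0

def allocate(queues, total_cards):
    # Phase 1: satisfy each queue up to its guaranteed minimum.
    for q in queues:
        q.allocated = min(q.demand, q.guaranteed)
    spare = total_cards - sum(q.allocated for q in queues)
    # Phase 2: lend leftover (idle) cards to queues that still have demand.
    for q in sorted(queues, key=lambda q: q.demand - q.allocated, reverse=True):
        extra = min(q.demand - q.allocated, spare)
        q.allocated += extra
        spare -= extra

queues = [Queue("search-rec", guaranteed=400, demand=250),
          Queue("llm-training", guaranteed=400, demand=700),
          Queue("ad-creative", guaranteed=200, demand=300)]
allocate(queues, total_cards=1000)
for q in queues:
    print(f"{q.name}: {q.allocated} cards")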

High‑Performance Training Engine

The engine abstracts APIs so that NPU and GPU users experience zero‑cost, seamless switching. It supports model parallelism, sequence parallelism, low‑precision communication, and compute‑communication fusion, achieving up to 60% MFU on hundred‑card scales and near‑linear scaling for trillion‑parameter models.
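
For context, MFU (model FLOPs utilization) is the ratio of the FLOPs a training job actually achieves to the cluster's theoretical peak. A back‑of‑the‑envelope check using the standard ~6 × parameters × tokens/s estimate for transformer training; all numbers below are hypothetical, and peak TFLOPs depends on the specific GPU/NPU and dtype.

def mfu(params: float, tokens_per_sec: float,
        num_cards: int, peak_tflops_per_card: float) -> float:
    achieved_flops = 6.0 * params * tokens_per_sec  # forward + backward estimate
    peak_flops = num_cards * peak_tflops_per_card * 1e12
    return achieved_flops / peak_flops

# e.g. a 7B model on 128 cards, each with a (hypothetical) 300 TFLOPs peak:
print(f"MFU = {mfu(7e9, 3.5e5, 128, 300):.1%}")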

Supported model families include LLMs (Qwen, Llama, Baichuan, etc.), multimodal models (SD1.5, SDXL), and many others in a broad compatibility matrix.

High‑Performance Inference Engine

The inference engine provides MaaS‑style one‑click deployment with OpenAI‑compatible APIs, supports model quantization (W8A8 SmoothQuant, W4A16 AWQ), and includes optimizations such as GE graph compilation, ATB high‑performance operators, and KV‑cache techniques. A visual monitoring and alert system tracks throughput, failure rates, and latency.
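
Because deployed models expose OpenAI‑compatible APIs, a stock OpenAI client can talk to them unchanged. A minimal sketch, where the endpoint URL and model name are placeholders, not actual platform values:

from openai import OpenAI

client = OpenAI(
    base_url="http://your-maas-endpoint/v1",  # hypothetical deployment URL
    api_key="EMPTY",                          # self-hosted gateways often ignore this
)

resp = client.chat.completions.create(
    model="qwen2-7b-instruct",  # whichever model was deployed
    messages=[{"role": "user", "content": "Summarize this product review..."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)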

Case Studies

Case 1: Video Content Tag‑Cloud Generation – Using Qwen2‑VL on domestic NPU, the system processes multimodal video data to generate keyword tags with performance comparable to GPU.

Case 2: Logistics Large Model – Fine‑tuning Qwen2‑7B for address parsing and classification on NPU achieves 91.03% accuracy, matching GPU results and supporting large‑scale deployment.

Case 3: Merchant‑Side Intelligent Assistant – Fine‑tuned Qwen1.5‑7B on NPU delivers similar tool‑selection outcomes as GPU, with 96% agreement on downstream task routing.

A sample comparison from the logistics address‑parsing model (Case 2):

#Input_1
青海省西宁市城北区三其村。可以发圆通吗 谢谢。
(Sanqi Village, Chengbei District, Xining City, Qinghai Province. Can this be shipped via YTO Express? Thanks.)
#Output-NPU (domestic NPU)
青海省_1,西宁市_3A,城北区_3A,三其村_4B, _5A-1,可以发圆通吗 谢谢_UNK,
#Output-GPU
青海省_1,西宁市_3A,城北区_3A,三其村_4B, _5A-1,可以发圆通吗 谢谢 _UNK
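
One hypothetical way to quantify such NPU/GPU consistency is to compare the parsed segments pairwise, ignoring whitespace‑only differences; this is an illustration, not the team's evaluation code.

def parse(raw: str) -> list[str]:
    # Split on commas, drop empty tokens, ignore intra-token whitespace.
    return ["".join(tok.split()) for tok in raw.split(",") if tok.strip()]

npu = parse("青海省_1,西宁市_3A,城北区_3A,三其村_4B, _5A-1,可以发圆通吗 谢谢_UNK,")
gpu = parse("青海省_1,西宁市_3A,城北区_3A,三其村_4B, _5A-1,可以发圆通吗 谢谢 _UNK")
matches = sum(a == b for a, b in zip(npu, gpu))
total = max(len(npu), len(gpu))
print(f"segment agreement: {matches}/{total} = {matches / total:.0%}")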

Application Value

The engine reduces reliance on overseas chips, keeping the stack secure and controllable. It has been applied to search and recommendation, ad creative generation, intelligent customer service, and automated data analysis, delivering tangible business impact and feeding practical requirements back into the domestic chip ecosystem.

Future Plans

By 2025, JD Retail aims to build a ten‑thousand‑card cluster with mixed GPU/NPU scheduling, expand chip type support, and continue optimizing scheduling strategies (resource pools, predictive scaling). Further collaboration with domestic chip vendors will deepen AI‑driven digital transformation and contribute to open‑source ecosystems.

Tags: model optimization, AI, GPU, distributed training, NPU, inference
Written by JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
