How to Seamlessly Migrate AI Workloads from Nvidia GPUs to Domestic Accelerators

This article explains why migrating AI applications from Nvidia GPUs to domestic accelerators is urgent, outlines the technical challenges, and introduces JoyScale’s zero‑perception migration stack, which enables end‑to‑end hardware, software, and model adaptation for reliable, high‑performance AI deployment.

JD Tech Talk

Urgency and Necessity of Migration

In today's digital era, AI technology is reshaping life and work, but its rapid adoption depends on powerful compute resources, with GPUs being the core hardware. Geopolitical tensions and increasing competition make reliance on a single GPU supplier untenable, prompting a shift to domestic GPUs to ensure the security and sustainability of China's AI industry.

International Challenges

Recent U.S. restrictions on high‑tech exports, especially bans on advanced AI chips, have severely impacted China's AI sector. On December 3, 2024, four major Chinese semiconductor associations issued a statement urging firms to prudently purchase U.S. chips and expand cooperation with other regions, highlighting the urgency of achieving autonomous AI chip capabilities.

Need for Technological Autonomy

Dependence on imported chips carries supply risks, potential technology lock‑outs, and security threats. The rise of domestic AI chips offers new options; migrating AI workloads to domestic GPUs can reduce foreign reliance, ensure autonomous control, and protect national information security.

Domestic Market Potential

China's AI market is vast, spanning smart security, autonomous driving, medical imaging, and fintech. Domestic GPUs are continuously improving and now possess the performance needed to replace imported chips, meeting diverse market demands while providing a broad runway for domestic chip development.

What Makes Migration Difficult?

The core pain point is the lack of an end‑to‑end migration toolchain and solution built for domestic GPUs, which would allow algorithm engineers to switch compute resources without code changes.

JoyScale “Zero‑Perception” Migration Stack

JD Cloud’s JoyScale heterogeneous compute management platform, refined through extensive internal and external deployments, has completed migration for over 40 mainstream models and offers a full‑stack solution with four guiding principles:

Zero Intrusion: No changes to algorithm code; migration is achieved by backend switching.

Verifiable: Each step is benchmarked against a GPU baseline, with quantifiable and rollback‑able errors.

Scalable: New chips are added via a plug‑in approach while the core framework stays unchanged.

Full‑Link: Covers training, fine‑tuning, inference, and online monitoring end‑to‑end.

System Architecture

(Figure: JoyScale system architecture diagram)

Migration Solution

Hardware Adaptation

Accelerator Scheduling Adaptation: Develop scheduling plug‑ins for domestic GPU interconnects; e.g., Ascend 910B’s HCCS architecture requires a pod’s cards to reside within the same HCCS ring.
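As an illustrative sketch of such a placement constraint (the ring size, card numbering, and helper names are assumptions, not the actual plug‑in API):

```python
# Hypothetical HCCS-ring placement check; a real scheduling plug-in would
# query the device topology rather than assume a fixed ring layout.

def hccs_ring_of(card_id: int, ring_size: int = 8) -> int:
    """Cards 0..7 share ring 0, cards 8..15 share ring 1, and so on."""
    return card_id // ring_size

def can_place(requested_cards) -> bool:
    """A pod's cards must all sit inside one HCCS ring."""
    return len({hccs_ring_of(c) for c in requested_cards}) == 1
```

A scheduler filter built on a check like this rejects card assignments that straddle rings, so collective communication stays on the fast interconnect.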

Operator Support Analysis: Use tools like PyTorch Profiler to extract GPU operators, compare them with the API list supported by domestic GPUs, and develop adapters for unsupported operators.
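The gap analysis itself reduces to a set difference; a minimal sketch, with illustrative operator names standing in for a real profiler trace and vendor support list:

```python
# Diff the operators a model actually invokes (e.g., exported from a
# PyTorch Profiler trace) against the accelerator vendor's supported list.
# The operator names below are illustrative.

def unsupported_ops(profiled_ops, supported_ops):
    """Return the profiled operators that the target accelerator lacks."""
    return sorted(set(profiled_ops) - set(supported_ops))

profiled  = ["aten::matmul", "aten::softmax",
             "aten::scaled_dot_product_attention"]
supported = ["aten::matmul", "aten::softmax"]

# Each gap needs an adapter: decompose into supported ops,
# fall back to CPU, or write a custom kernel.
gaps = unsupported_ops(profiled, supported)
```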

Performance Tuning: Profile each operator’s execution time, optimize slow operators by aligning data, converting to contiguous memory, and leveraging vendor APIs for operator fusion or sub‑graph submission.
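Data alignment in particular is a simple rounding computation; a toy helper, assuming a 16‑element preferred block size (the actual alignment is accelerator‑specific):

```python
# Toy alignment helper: round a tensor dimension up to the accelerator's
# preferred block size so kernels take the fast path. The padded elements
# are wasted storage, traded for vectorized execution.

def pad_to_alignment(n: int, align: int = 16) -> int:
    return ((n + align - 1) // align) * align
```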

Software Adaptation

Program Migration: Replace CUDA calls with domestic‑accelerator equivalents, e.g., torch.cuda.xxx() → torch.npu.xxx().

Framework Optimization: Provide a unified API layer so that NPU and GPU users can switch training backends without code changes or cost.
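One common way to build such a layer is a backend registry: algorithm code calls a neutral interface, and configuration decides which runtime it resolves to. A minimal sketch, with stub dictionaries standing in for the real torch.cuda / torch.npu runtimes:

```python
# Minimal sketch of a unified backend layer. The stubs below stand in for
# torch.cuda / torch.npu; only the config string changes when switching
# hardware, matching the "zero intrusion" principle.

_BACKENDS = {}

def register_backend(name, impl):
    _BACKENDS[name] = impl

def get_backend(name):
    return _BACKENDS[name]

# Stub runtimes (illustrative; real adapters would wrap vendor APIs):
register_backend("cuda", {"device_count": lambda: 8})
register_backend("npu",  {"device_count": lambda: 16})

# Algorithm code never names the vendor directly:
backend = get_backend("npu")
n_devices = backend["device_count"]()
```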

Model Adaptation

Model Quantization: Reduce model compute and storage requirements to improve runtime efficiency on domestic GPUs.
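To make the saving concrete, here is a toy symmetric int8 quantizer; production flows would use the vendor quantization toolchain, not this sketch:

```python
# Toy symmetric int8 quantization: fp32 values are mapped to [-127, 127]
# via a single scale, cutting storage to a quarter and enabling int8
# compute, at the cost of a bounded rounding error.

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.0, 0.25]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)   # close to the originals, 1/4 the bytes
```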

Soft‑Hard Co‑Optimization: Use Triton compilation, CANN fusion, and other techniques to finely tune hot operators (e.g., flash attention, rotary_embedding, npu_matmul_add_fp32), and implement attention slicing, dynamic input stitching, full‑graph dispatch, and independent scheduling. Together these optimizations achieve up to 60% MFU on hundred‑card clusters and near‑linear scaling for trillion‑parameter models.
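Attention slicing is the easiest of these to illustrate: queries are processed in chunks so the full score matrix is never materialized at once. A pure‑Python sketch with deliberately tiny shapes (real implementations run fused on‑device kernels):

```python
# Attention slicing sketch: chunking over queries changes peak memory,
# not the result, because each query row's output is independent.
import math

def _softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q_rows, k_rows, v_rows):
    out = []
    for q in q_rows:
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_rows]
        w = _softmax(scores)
        out.append([sum(wi * v[d] for wi, v in zip(w, v_rows))
                    for d in range(len(v_rows[0]))])
    return out

def sliced_attention(q_rows, k_rows, v_rows, chunk=2):
    out = []
    for i in range(0, len(q_rows), chunk):  # one chunk of queries at a time
        out.extend(attention(q_rows[i:i + chunk], k_rows, v_rows))
    return out
```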

Inference Optimization

Apply GE graph compilation and ATB high‑performance operators to deeply optimize operations such as Paged Attention, Flash Attention, and Sub_Mul_Concat, supporting W8A8 SmoothQuant and W4A16 AWQ quantization to reduce compute and memory traffic.
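The idea behind the W8A8 SmoothQuant path can be sketched numerically: a per‑channel scale s shifts activation outliers into the weights, exploiting the identity (x / s) · (s · W) = x · W so both sides quantize well to 8 bits. The alpha value and toy matrices below are illustrative:

```python
# SmoothQuant-style smoothing sketch: per-input-channel scales rebalance
# activation and weight magnitudes without changing the matmul result.

def smooth_scales(act_absmax, w_absmax, alpha=0.5):
    return [(a ** alpha) / (w ** (1 - alpha))
            for a, w in zip(act_absmax, w_absmax)]

x = [8.0, 0.5]                  # one activation row; channel 0 is an outlier
W = [[1.0, 2.0], [3.0, 4.0]]    # W[j][k]: input channel j, output channel k
s = smooth_scales([8.0, 0.5], [2.0, 4.0])

x_s = [xi / si for xi, si in zip(x, s)]              # smoothed activations
W_s = [[si * w for w in row] for si, row in zip(s, W)]  # absorbed into weights

y      = [sum(x[j]   * W[j][k]   for j in range(2)) for k in range(2)]
y_same = [sum(x_s[j] * W_s[j][k] for j in range(2)) for k in range(2)]
```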

Deploy dual‑backend hot‑standby model services, gradually rolling out domestic compute from 5% to 100% traffic with automatic rollback to Nvidia GPUs if failure rate exceeds 0.1%.
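The ramp‑and‑rollback policy is simple to state in code; a sketch with illustrative step sizes, keeping only the 0.1% threshold from the text:

```python
# Canary ramp for the dual-backend deployment: advance the domestic
# backend's traffic share stepwise, and roll back to the GPU backend
# the moment the failure rate exceeds the threshold.

def next_traffic_share(current, failure_rate, threshold=0.001,
                       steps=(0.05, 0.25, 0.50, 1.00)):
    if failure_rate > threshold:
        return 0.0          # automatic rollback to the Nvidia backend
    for s in steps:
        if s > current:
            return s        # advance to the next ramp stage
    return current          # already serving 100% of traffic
```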

Unified Scheduling and Monitoring

Develop a cloud‑native, ten‑thousand‑card heterogeneous scheduling system that auto‑detects CPU NUMA and network topology to place tasks on optimal compute and network resources, and that employs gang scheduling and resource pooling to maximize cluster utilization.
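A toy version of topology‑aware gang placement (the node map and best‑fit policy are illustrative, not the platform’s actual algorithm): the whole job lands on one topology node with enough free accelerators, or it waits.

```python
# Gang placement sketch: all-or-nothing placement on a single node,
# preferring the tightest fit to reduce fragmentation.

def gang_place(cards_needed, free_by_node):
    """Return the node with the least spare capacity that still fits
    the whole gang, or None if the job must wait."""
    fitting = [(free, node) for node, free in free_by_node.items()
               if free >= cards_needed]
    return min(fitting)[1] if fitting else None
```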

Provide visual monitoring of GPU/NPU utilization, memory usage, and AI service metrics such as throughput, failure rate, latency, and token counts.

Typical Deployment Scenarios

Retail: Multimodal models analyze product videos and extract tags. Migration from Nvidia GPUs to domestic NPUs yields comparable accuracy and similar response latency.

Intelligent Customer Service: Large‑model agents fine‑tuned on domestic compute produce analysis results similar to Nvidia‑based models, with 96% of issues routed to the same downstream processing paths.

Logistics: Address parsing models fine‑tuned on domestic accelerators achieve 91.03% accuracy versus 91.08% on Nvidia GPUs; AI pre‑sorting is deployed in multiple provinces, handling over 30,000 abnormal addresses daily.

Conclusion

Migrating AI applications from Nvidia GPUs to domestic GPUs is not optional but essential for the security and sustainable growth of China's AI industry. The earlier the migration, the longer the window of opportunity. JD Cloud’s JoyScale offers a mature, complete migration stack that lowers cost, boosts efficiency, and lets customers focus on algorithm innovation while jointly advancing domestic compute capabilities.

Tags: Model Optimization, heterogeneous computing, AI migration, domestic GPUs, JoyScale
Written by JD Tech Talk, the official JD Tech public account delivering best practices and technology innovation.
