Eight Chinese AI Chips Achieve Day‑Zero DeepSeek‑V4 Compatibility
The article explains how eight domestic AI chip makers (Huawei Ascend, Cambricon, HaiGuang, Moore Threads, Kunlun, Pingtouge, Muxi, and Tianshu) simultaneously completed end‑to‑end compatibility, performance tuning, and stability verification for DeepSeek‑V4 on release day. It details each vendor’s technical path, the shared ecosystem breakthroughs, and the broader impact on the AI industry.
Background and Definition of Day0 Adaptation
Day0 adaptation means completing end‑to‑end compatibility, performance optimization, and stability verification for a large model on the same day the model is officially released, so developers can use the model out of the box without waiting for a separate adaptation window.
Three years ago, Chinese AI chips typically gained support for a new model 3–6 months after it ran on NVIDIA hardware, with performance losses typically exceeding 30%. The simultaneous Day0 adaptation by eight vendors compresses this lag to zero days and reduces the performance loss to near zero.
Technical Deep‑Dive of Each Vendor
1. Huawei Ascend
Ascend provides the broadest coverage, supporting both the Pro (1.6 T parameters) and Flash (284 B parameters) versions of DeepSeek‑V4 across the A2, A3, and 950 series. Key optimizations include fused kernels combined with multi‑stream parallelism to eliminate the memory‑bandwidth bottleneck of MoE attention, and native FP8/MXFP4 support that cuts VRAM usage by over 50% and doubles compute throughput. On 950 super‑nodes, inference latency is 20 ms or lower (20 ms for Pro, 10 ms for Flash), and training reference implementations have also been released.
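To put the FP8/MXFP4 memory saving in perspective, here is a back‑of‑the‑envelope calculation of weight storage alone. The parameter counts are the article's figures. The BF16 baseline and the MXFP4 layout (32 four‑bit elements sharing one 8‑bit scale, as in the OCP Microscaling format) are assumptions, since the article does not say which baseline the "over 50%" refers to or how the weights are actually laid out.

```python
# Rough weight-footprint arithmetic. Parameter counts come from the article;
# the formats and their bit widths are assumptions for illustration.
PARAMS = {"Pro": 1.6e12, "Flash": 284e9}

# Effective bits per weight. MXFP4 assumes the OCP MX layout:
# 32 four-bit elements share one 8-bit scale, i.e. 4 + 8/32 = 4.25 bits.
BITS_PER_WEIGHT = {"BF16": 16.0, "FP8": 8.0, "MXFP4": 4.0 + 8.0 / 32.0}


def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10**9 bytes); ignores activations and KV cache."""
    return num_params * bits_per_weight / 8.0 / 1e9


def footprint_table() -> dict:
    """Map each model variant to its weight footprint per storage format."""
    return {
        model: {
            fmt: weight_gigabytes(count, bits)
            for fmt, bits in BITS_PER_WEIGHT.items()
        }
        for model, count in PARAMS.items()
    }


if __name__ == "__main__":
    for model, row in footprint_table().items():
        for fmt, gigabytes in row.items():
            print(f"{model:5s} {fmt:5s} {gigabytes:8.1f} GB")
```

Under these assumptions FP8 alone halves the BF16 footprint, and any layers stored in MXFP4 push the saving past 50%, which is consistent with the figure quoted above. Real deployments also need memory for activations and the KV cache, which this sketch leaves out.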
2. Cambricon
Cambricon continues its rapid adaptation tradition using its self‑developed NeuWare stack and integrates with the vLLM inference framework. The company open‑sourced the adaptation code on GitHub. It built custom Torch‑MLU‑Ops kernels for new modules such as Compressor and mHC, and hand‑wrote sparse‑Attention and GroupGemm kernels in BangC to fully exploit MLU hardware.
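For readers unfamiliar with the term, a grouped GEMM multiplies several differently sized row blocks by different weight matrices in one operation, which is the shape of work an MoE layer produces once tokens are sorted by expert. The pure‑Python reference below only illustrates those semantics. It is not Cambricon's BangC kernel, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain (rows x k) @ (k x cols) product."""
    cols = len(b[0])
    return [
        [sum(x * b[k][j] for k, x in enumerate(row)) for j in range(cols)]
        for row in a
    ]


def grouped_gemm(tokens: Matrix, group_sizes: List[int],
                 weights: List[Matrix]) -> Matrix:
    """Multiply each contiguous row group by its own weight matrix.

    tokens must already be sorted by expert: the first group_sizes[0] rows
    belong to expert 0, the next group_sizes[1] rows to expert 1, and so on.
    """
    if sum(group_sizes) != len(tokens):
        raise ValueError("group sizes must cover every token row")
    if len(group_sizes) != len(weights):
        raise ValueError("need one weight matrix per group")
    out: Matrix = []
    start = 0
    for size, weight in zip(group_sizes, weights):
        if size:  # an expert that received no tokens contributes no rows
            out.extend(matmul(tokens[start:start + size], weight))
        start += size
    return out


if __name__ == "__main__":
    tokens = [[1, 2], [3, 4], [5, 6]]
    identity = [[1, 0], [0, 1]]
    swap = [[0, 1], [1, 0]]
    # Expert 0 gets two rows, expert 1 gets none, expert 2 gets one row.
    print(grouped_gemm(tokens, [2, 0, 1], [identity, identity, swap]))
```

A hardware kernel performs the same computation in a single launch so that small or empty groups do not leave compute units idle, which is presumably why vendors hand‑write it.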
3. HaiGuang (DCU)
HaiGuang follows a "model release → chip adaptation → industry deployment" closed loop, focusing on "ready‑to‑use" solutions for commercial workloads. After baseline adaptation, it performed deep tuning of inter‑card interconnect and VRAM scheduling for DeepSeek‑V4’s MoE architecture, ensuring low latency and stable high‑concurrency performance.
4. Moore Threads, Kunlun, Pingtouge
Moore Threads leverages its flagship training‑inference card, the MTT S5000, together with FlagOS, targeting FP8 inference and long‑context memory reuse. Kunlun optimizes the compiler for MoE expert routing, redesigns tensor‑parallel scheduling, and supports clusters of cards with 32 GB or 64 GB of memory. Pingtouge (Zhenwu) implements an independent tensor‑parallel strategy and optimizes o‑group cross‑card communication to lower long‑sequence inference overhead.
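As context for the expert‑routing work mentioned above, the sketch below shows a generic softmax top‑k router of the kind used in many MoE models. The article does not describe DeepSeek‑V4's actual router, so the gating function, the renormalization step, and all names here are assumptions for illustration only.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k most probable experts and renormalize their weights.

    Returns (expert_index, gate_weight) pairs, highest weight first.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


if __name__ == "__main__":
    # One token, four experts, two experts activated per token.
    for expert, weight in route_top_k([0.0, 2.0, 1.0, -1.0], k=2):
        print(f"expert {expert}: gate weight {weight:.4f}")
```

Because each token activates a different subset of experts, the per‑expert workload is irregular and data dependent, which is what makes routing a target for compiler and scheduling optimization.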
5. Muxi & Tianshu
Muxi partners with FlagOS to enable fast model migration and performance tuning, while Tianshu focuses on operator compatibility and precision alignment, keeping output error within 5% on domestic chips.
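A precision‑alignment check of this kind can be sketched as a comparison between a reference output and the output produced on the target chip. The article does not say which error metric the 5% figure refers to, so the relative L2 error used below is an assumption, and the function names are hypothetical.

```python
import math
from typing import Sequence


def relative_l2_error(reference: Sequence[float],
                      candidate: Sequence[float]) -> float:
    """Return ||reference - candidate|| / ||reference|| (Euclidean norm)."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    diff = math.sqrt(sum((r - c) ** 2 for r, c in zip(reference, candidate)))
    norm = math.sqrt(sum(r * r for r in reference))
    if norm == 0.0:
        # An all-zero reference only matches an all-zero candidate.
        return 0.0 if diff == 0.0 else math.inf
    return diff / norm


def is_aligned(reference: Sequence[float], candidate: Sequence[float],
               tolerance: float = 0.05) -> bool:
    """True when the candidate stays within the tolerance (default 5%)."""
    return relative_l2_error(reference, candidate) <= tolerance


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0, 4.0]           # e.g. logits from a baseline run
    candidate = [1.01, 1.98, 3.03, 3.96]       # same logits from the new backend
    print(relative_l2_error(reference, candidate))
    print(is_aligned(reference, candidate))
```

In practice such a check would run per operator and again on end‑to‑end model outputs, since small per‑operator errors can compound across layers.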
Common Breakthroughs Enabling Day0
The eight vendors rely on three shared technological advances provided by the FlagOS 2.0 stack from the Zhiyuan Research Institute:
FlagGems end‑to‑end operator replacement: a self‑developed operator library that removes dependence on CUDA and NVIDIA‑specific libraries, allowing "write once, run on any chip".
Mixed‑precision conversion: a lossless pathway from FP4 + FP8 to FP8/BF16, enabling models originally built for high‑end NVIDIA GPUs to run stably on domestic chips with 32 GB/64 GB VRAM.
Tensor‑parallel strategy reconstruction: a new algorithm that partitions the o‑group structure of DeepSeek‑V4, breaking the traditional 8‑card limit and supporting larger multi‑node deployments.
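To see why an FP4‑to‑BF16 conversion can be lossless, consider how a block‑scaled FP4 value is decoded. The sketch below assumes the OCP Microscaling (MX) layout: E2M1 elements with a shared power‑of‑two E8M0 scale. The article does not confirm that FlagOS or DeepSeek‑V4 use exactly this layout, and the function names are hypothetical.

```python
import struct
from typing import List

# Magnitudes of the eight non-negative E2M1 (FP4) codes; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    if not 0 <= code <= 15:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def decode_mx_block(scale_byte: int, codes: List[int]) -> List[float]:
    """Decode a block of FP4 codes that share one E8M0 scale.

    The scale is 2**(scale_byte - 127); 255 is reserved for NaN.
    Real MX blocks hold 32 elements; any length is accepted here.
    """
    if not 0 <= scale_byte <= 254:
        raise ValueError("scale must be 0..254 (255 encodes NaN)")
    scale = 2.0 ** (scale_byte - 127)
    return [decode_e2m1(c) * scale for c in codes]


def is_exact_in_bf16(x: float) -> bool:
    """True if x is exactly representable in BF16.

    BF16 is the top 16 bits of float32, so x must survive a float32
    round trip and leave the low 16 bits of that encoding at zero.
    """
    try:
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
    except OverflowError:
        return False
    same_in_fp32 = struct.unpack(">f", struct.pack(">I", bits))[0] == x
    return same_in_fp32 and (bits & 0xFFFF) == 0


if __name__ == "__main__":
    values = decode_mx_block(128, [2, 10, 7])  # scale = 2**1
    print(values)
    print(all(is_exact_in_bf16(v) for v in values))
```

Each E2M1 value carries at most two significant bits, and multiplying by a power of two only shifts the exponent, so the decoded result fits in BF16's 8‑bit significand whenever the scale stays inside BF16's exponent range. Only scales at the extreme ends of the E8M0 range can overflow or underflow.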
Industry Implications
For developers, domestic chips have shifted from a "fallback" option to the "first choice"—offering out‑of‑the‑box experience, near‑NVIDIA performance, and lower cost, thereby lowering the barrier to large‑model development and deployment.
For the AI industry, the "national model + national chip" closed loop is now fully formed, providing end‑to‑end autonomous control from model design to inference deployment, and supporting large‑scale AI applications.
Globally, the dominance of NVIDIA’s CUDA ecosystem is being challenged; Chinese chips, with zero‑day adaptation, extreme performance, and open‑source ecosystems, are gaining a foothold in the worldwide AI compute competition.
Personal Reflections
The author argues that the core competitiveness of AI chips lies not in raw compute alone but in the tight co‑design of hardware and software ecosystems. The Day0 success demonstrates a shift from hardware‑centric benchmarking to ecosystem‑centric collaboration, with FlagOS acting as the glue.
Nevertheless, the author cautions that Day0 adaptation is only a starting point; challenges remain in ultra‑large‑scale training and extreme‑performance scenarios, and the ecosystem still needs continuous refinement.
Overall, the collective achievement marks a pivotal moment for China’s AI industry, moving from passive catching‑up to proactive leadership, and paving the way for "national model + national chip" to become mainstream globally.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
