Cambricon Achieves Day‑0 Native Support for DeepSeek‑V4, Uniting Two Chinese AI Leaders
Cambricon leveraged its NeuWare stack and the vLLM framework to deliver Day‑0 native support for DeepSeek‑V4‑flash (285 B) and DeepSeek‑V4‑pro (1.6 T), open‑sourcing the adaptation and showcasing both rapid model migration and extreme performance optimization across the software and hardware layers.
Today Cambricon announced that, using the vLLM inference framework, it completed Day‑0 native adaptation for the 285‑billion‑parameter DeepSeek‑V4‑flash and the 1.6‑trillion‑parameter DeepSeek‑V4‑pro models, and has open‑sourced the code on GitHub.
This follows Cambricon’s earlier first‑to‑market adaptation of DeepSeek‑V3.2. The collaboration reflects a close partnership built on Cambricon’s self‑developed NeuWare software ecosystem and chip‑design expertise, and marks a milestone for hardware‑software integration in Chinese AI.
The adaptation emphasizes two dimensions: rapid model migration and extreme performance optimization, demonstrating Cambricon’s core technical capabilities.
On the software side, the NeuWare stack fully embraces the open‑source community, providing native support for major AI frameworks such as PyTorch, vLLM and Diffusers, enabling new models to be quickly ported to Cambricon platforms.
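In practice, this kind of port is typically a device move rather than a rewrite. Below is a minimal sketch, assuming Cambricon's publicly documented torch_mlu PyTorch backend plugin, which registers an "mlu" device so existing PyTorch code runs unchanged; the plugin name and availability check are assumptions based on that documentation, not details from this article.

```python
# Minimal sketch: running an existing PyTorch model on a Cambricon MLU.
# Assumption: the torch_mlu backend plugin is installed and registers an
# "mlu" device with PyTorch, so model code stays unchanged.
import torch
import torch_mlu  # Cambricon's PyTorch backend plugin (assumed installed)

model = torch.nn.Linear(4096, 4096)

# Porting mirrors the familiar CUDA workflow: pick a device, move tensors.
device = "mlu" if torch.mlu.is_available() else "cpu"
model = model.to(device)

x = torch.randn(8, 4096, device=device)
with torch.no_grad():
    y = model(x)
print(y.shape)  # torch.Size([8, 4096])
```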
In the domestic software ecosystem, Cambricon collaborates deeply with the FlagOS ecosystem to decouple models from specific chip architectures, further lowering migration costs.
For operator development, Cambricon leverages Triton’s strong community compatibility and ease of use to accelerate custom operator adaptation, shortening the functional integration cycle.
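To illustrate why Triton shortens this cycle, here is a standard Triton kernel of the kind used for quick custom‑operator bring‑up. The kernel itself is generic (an element‑wise add, not one of Cambricon's operators); the article's claim is that the same Python‑level kernel can be retargeted to any backend that ships a Triton compiler, which Cambricon's is assumed to do.

```python
# Generic Triton custom operator: element-wise addition over 1-D tensors.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized tile of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements  # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```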
Cambricon also introduced the CNAgent code‑generation agent, which automates operator generation and model migration, speeding up the end‑to‑end workflow.
On the hardware side, Cambricon chips natively support mainstream low‑precision data formats, allowing functional adaptation and precision verification to proceed without extra format conversion, and enabling stable Day‑0 operation through tight software‑hardware co‑design.
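A precision‑verification pass typically round‑trips weights through the low‑precision format and measures the error against the full‑precision reference. The sketch below uses PyTorch's FP8 e4m3 dtype purely as an example; the article does not name the exact formats Cambricon's hardware supports.

```python
# Sketch of a precision-verification step: quantize weights to a
# low-precision format, dequantize, and compare against the reference.
# FP8 e4m3 is an illustrative format choice, not one named in the article.
import torch

w = torch.randn(1024, 1024)
w_fp8 = w.to(torch.float8_e4m3fn)   # quantize to FP8
w_back = w_fp8.to(torch.float32)    # dequantize for comparison

rel_err = ((w - w_back).abs().mean() / w.abs().mean()).item()
print(f"mean relative error: {rel_err:.4f}")
```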
To unlock the full potential of DeepSeek‑V4, Cambricon built a high‑performance fused‑operator library called Torch‑MLU‑Ops, accelerating modules such as Compressor and mHC, and used the BangC high‑performance language to write optimized kernels for sparse/compressed attention and GroupGemm.
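The article does not spell out GroupGemm's exact semantics, but under the common mixture‑of‑experts interpretation it is one matmul per expert group, each with its own token count, fused into a single launch. The pure‑PyTorch loop below is only a functional reference for what such a fused BangC kernel would have to compute, not the optimized implementation.

```python
# Functional reference for a grouped GEMM (GroupGemm) under the usual
# MoE interpretation: one matmul per expert, with uneven token counts.
# A fused kernel would execute these in a single launch to avoid
# per-group launch overhead; this loop only defines the semantics.
import torch

def group_gemm_ref(xs: list[torch.Tensor], ws: list[torch.Tensor]) -> list[torch.Tensor]:
    return [x @ w for x, w in zip(xs, ws)]

xs = [torch.randn(n, 256) for n in (7, 19, 4)]   # uneven tokens per expert
ws = [torch.randn(256, 512) for _ in xs]          # one weight matrix per expert
outs = group_gemm_ref(xs, ws)
print([o.shape for o in outs])
```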
Within the vLLM framework, Cambricon added comprehensive support for 5‑dimensional parallelism (TP/PP/SP/DP/EP), communication‑compute overlap, low‑precision quantization, and PD‑separated (prefill/decode‑disaggregated) deployment. Scheduling‑ and deployment‑policy optimizations maximize token throughput under latency constraints, markedly improving end‑to‑end inference efficiency.
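For a sense of how these parallelism degrees surface to users, here is a hedged sketch using vLLM's standard offline API. The model id and parallel sizes are placeholders, and it is an assumption that Cambricon's vllm-mlu fork accepts the same arguments as upstream vLLM.

```python
# Sketch: serving a model with vLLM's tensor and pipeline parallelism.
# Assumptions: placeholder model id; Cambricon's vllm-mlu fork accepts
# the standard upstream vLLM arguments shown here.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-flash",  # placeholder model id
    tensor_parallel_size=8,                 # TP degree (illustrative)
    pipeline_parallel_size=2,               # PP degree (illustrative)
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain Day-0 model adaptation."], params)
print(outputs[0].outputs[0].text)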
Cambricon further exploited hardware features: MLU memory‑access and sorting acceleration speed up sparse attention and indexer structures; high interconnect bandwidth and low communication latency reduce the communication share in both Prefill and Decode workloads, maximizing distributed inference utilization.
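The "indexer" pattern behind sparse attention is score‑then‑select: rank cached keys per query and gather only the top‑k before attention, which is exactly the sort/select and memory‑access work the article says MLU hardware accelerates. The sketch below shows that pattern in plain PyTorch; shapes and k are illustrative, and the article gives no details of DeepSeek‑V4's actual indexer.

```python
# Sketch of the sparse-attention indexer pattern: score cached keys,
# keep the top-k per query, and gather them before attention.
# Shapes and k are illustrative assumptions.
import torch

q = torch.randn(4096)               # one query vector
keys = torch.randn(32768, 4096)     # cached keys
scores = keys @ q                   # lightweight relevance scores

k = 2048
topk = torch.topk(scores, k)        # the sort/select step hardware accelerates
selected = keys[topk.indices]       # the gather step memory access dominates
print(selected.shape)               # torch.Size([2048, 4096])
```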
The integrated hardware‑software design continuously reduces compute cost and pushes the performance limits of large‑model deployment. Cambricon says it will continue to deepen its large‑model co‑design ecosystem, providing developers and customers with faster, cheaper, and more efficient deployment solutions.
GitHub project: https://github.com/Cambricon/vllm-mlu
