
Kunlun Chip XPU Architecture, Software Stack, and Programming Model Overview

Kunlun Chip’s XPU‑R architecture combines high‑performance SDNN and Cluster compute units, 512 GB/s GDDR6 memory, high‑speed chip‑to‑chip interconnect, and a PCIe 4.0 host interface. An LLVM‑based software stack, a CUDA‑like programming model, and tight PaddlePaddle integration enable efficient AI training and inference with significant cost and performance gains.


Kunlun Chip Technology was invited to Baidu Technical Salon #99 for the "Intelligent Chip" session, where four experts presented the one‑year achievements of Kunlun Chip, highlighting its role as a domestic AI computing foundation.

1. Kunlun Chip Hardware Architecture

The latest generation Kunlun Chip XPU‑R architecture consists of four main parts: compute, storage, interconnect, and interfaces. The compute part includes SDNN (software‑defined neural network engine) for tensor operations and Cluster units for general‑purpose computation. Storage comprises high‑bandwidth GDDR6 (512 GB/s) and on‑chip shared memory. Interconnect provides high‑speed chip‑to‑chip links for large‑scale distributed training, and the interface supports PCIe 4.0 (compatible with PCIe 3.0).

The XPU‑R design merges SDNN and Cluster, delivering both high performance and flexibility. It features 8 Cluster units and 6 SDNN units, offering up to 128 TFLOPS@FP16. On‑chip L3 SRAM (64 MB) and external GDDR6 serve as hierarchical memory, with L3 SRAM delivering lower latency and higher bandwidth than external memory.
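From the two peak figures above (128 TFLOPS at FP16 and 512 GB/s of GDDR6 bandwidth), a quick roofline calculation shows how much arithmetic a kernel must do per byte of external‑memory traffic before compute, rather than memory, becomes the bottleneck. The sketch below is illustrative arithmetic based only on those published numbers:

```python
# Roofline break-even point for XPU-R, from the published peak figures.
PEAK_FLOPS = 128e12   # 128 TFLOPS @ FP16
PEAK_BYTES = 512e9    # 512 GB/s GDDR6 bandwidth

# A kernel is compute-bound once it performs more FLOPs per byte of
# external-memory traffic than this ratio; below it, GDDR6 bandwidth is
# the limiter -- which is what the 64 MB on-chip L3 SRAM helps avoid.
break_even = PEAK_FLOPS / PEAK_BYTES
print(break_even)  # 250.0 FLOPs per byte
```

This is why the hierarchical memory matters: operators with low arithmetic intensity only approach peak throughput when their working set stays in L3 SRAM instead of round‑tripping through GDDR6.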

2. Kunlun Chip Software Stack

The software stack includes a runtime environment, development kit, high‑performance acceleration libraries, communication library, and graph compilation library. The runtime provides low‑level drivers, user‑mode APIs, and tools such as monitors, debuggers, and profilers, supporting multi‑stream, SR‑IOV virtualization, and event synchronization on x86‑64 and Arm64 platforms.
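The payoff of multi‑stream support is that data transfers can overlap kernel execution. The toy timing model below illustrates the idea; the durations, batch count, and two‑stream setup are assumptions for illustration, not XPU measurements:

```python
# Toy model: N batches, each needing a host->device copy then a kernel.
# Serial: one stream runs copy and compute back to back.
# Pipelined: copies on one stream overlap compute on another.
copy_ms, compute_ms, batches = 4.0, 6.0, 8

serial = batches * (copy_ms + compute_ms)

# With two streams, after the first copy every later copy hides under
# the previous batch's kernel (since copy_ms < compute_ms here).
pipelined = copy_ms + batches * compute_ms

print(serial, pipelined)  # 80.0 vs 52.0 ms
```

Events provide the synchronization needed to make such pipelines safe: a kernel on the compute stream waits on the event recorded by the copy that produced its input.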

The simulator offers cycle‑accurate modeling of compute and storage units, along with debugging and profiling tools, enabling seamless switching between hardware and simulation.

The development kit is an LLVM‑based toolchain with a custom Clang front‑end and XPU back‑end, supporting AOT and JIT compilation, device/host separation, and includes compiler‑rt libraries, assembler, linker, and debugging utilities.

The high‑performance DNN library provides multi‑threaded, multi‑stream APIs for common operators (e.g., matrix multiplication, convolution, pooling, activation). The communication library implements broadcast, reduce, data compression, topology detection, and fault tolerance. The graph compilation library, built on TVM, offers C++/Python interfaces for model import from PaddlePaddle, TensorFlow, PyTorch, etc., and performs graph‑level optimizations before deployment.
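One representative graph‑level optimization is operator fusion, where cheap elementwise operators are merged into the kernel of their producer to avoid extra memory round trips. The sketch below uses a deliberately simplified IR (a plain list of op names), not the actual data structures of TVM or the Kunlun graph compiler:

```python
# Toy fusion pass: merge elementwise ops into their producer kernel.
FUSABLE = {"relu", "bias_add"}

def fuse(ops):
    fused = []
    for op in ops:
        if op in FUSABLE and fused:
            fused[-1] = fused[-1] + "+" + op  # merge into producer kernel
        else:
            fused.append(op)
    return fused

graph = ["conv2d", "bias_add", "relu", "pool", "matmul", "relu"]
print(fuse(graph))  # ['conv2d+bias_add+relu', 'pool', 'matmul+relu']
```

Each fused group becomes a single device kernel, so intermediate results stay in on‑chip memory instead of being written back to GDDR6 between operators.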

3. Programming Model

The Kunlun XPU programming model follows a CUDA‑like paradigm with kernels, host‑side launch parameters, and explicit memory transfers between host and device via PCIe. It supports events and streams for overlapping computation and data movement. Each Cluster contains 64 cores, each with 8 KB of local memory plus 256 KB of memory shared across the Cluster, enabling SIMD instructions and specialized operations.
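In a CUDA‑like model, each core derives a global work index from its cluster ID and core ID, exactly as CUDA flattens `blockIdx` and `threadIdx`. The sketch below shows that indexing arithmetic using the core counts stated above; `cluster_id` and `core_id` are stand‑ins for whatever intrinsics the XPU toolchain actually exposes:

```python
CORES_PER_CLUSTER = 64  # per the architecture description above
NUM_CLUSTERS = 8

def global_index(cluster_id, core_id):
    # Same flattening CUDA does with blockIdx.x * blockDim.x + threadIdx.x.
    return cluster_id * CORES_PER_CLUSTER + core_id

# Core 5 of cluster 3 handles flat element 197; all 512 cores together
# cover indices 0..511 per grid step, then stride by 512 for larger work.
print(global_index(3, 5), NUM_CLUSTERS * CORES_PER_CLUSTER)  # 197 512
```

A kernel over N elements then loops with a stride of 512 (the total core count), the familiar grid‑stride pattern from CUDA.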

Memory hierarchy: Local Memory → Shared Memory → L3 SRAM → Global Memory (GDDR6). Developers write device kernels in an extended C++ syntax, using constructs such as xpu_malloc for memory allocation and triple‑angle‑bracket launch syntax.
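The capacities in that hierarchy determine how work must be tiled. The arithmetic below uses the sizes stated in this article (8 KB local, 256 KB shared, 64 MB L3) to check where a sample FP16 tile can live; the 128×128 tile size is purely illustrative:

```python
# FP16 element capacity at each level of the hierarchy described above.
FP16 = 2  # bytes per element
levels = {
    "local (per core)":     8 * 1024,
    "shared (per cluster)": 256 * 1024,
    "L3 SRAM (chip)":       64 * 1024 * 1024,
}
capacity = {name: size // FP16 for name, size in levels.items()}

# A 128x128 FP16 tile (16384 elements, 32 KB) overflows a core's local
# memory but fits in cluster-shared memory and easily in L3 SRAM.
tile = 128 * 128
print({name: tile <= cap for name, cap in capacity.items()})
```

In practice this means a large matrix multiply is blocked twice: L3‑sized panels staged from GDDR6, then cluster‑sized tiles staged into shared memory, with each core streaming sub‑tiles through its 8 KB local memory.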

4. Integration with AI Frameworks

Kunlun Chip AI accelerators are integrated with major deep‑learning frameworks, especially Baidu PaddlePaddle. Since 2018, Kunlun Chip support has progressed from Paddle Lite inference to full training support in PaddlePaddle 2.0 and later versions. Integration involves minimal code changes (often a single line) to switch the backend, enabling mixed CPU‑XPU inference, graph optimizations, and custom operator generation.
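The "single line" backend switch can be pictured as a device‑selection call that redirects execution through a registry of backends. The snippet below is a generic dispatch sketch with made‑up names (`set_device`, `BACKENDS`, `run`), not PaddlePaddle's actual API:

```python
# Generic sketch of one-line backend switching via a device registry.
# All names here are illustrative, not PaddlePaddle's real interface.
BACKENDS = {}
_current = {"dev": "cpu"}

def register(name):
    def deco(fn):
        BACKENDS[name] = fn
        return fn
    return deco

def set_device(name):  # the "one line" a user changes
    _current["dev"] = name

@register("cpu")
def run_cpu(x):
    return [v * 2 for v in x]

@register("xpu")
def run_xpu(x):
    return [v * 2 for v in x]  # same math, different hardware underneath

def run(x):
    return BACKENDS[_current["dev"]](x)

set_device("xpu")
print(run([1, 2, 3]))  # [2, 4, 6] -- computed by the "xpu" backend
```

Because the model code calls only the framework's operators, none of it changes when the device string does; that is what makes mixed CPU‑XPU deployment practical.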

Distributed training and large‑scale deployment are also supported, with the ecosystem covering PaddleCV, PaddleNLP, and other PaddlePaddle components.

5. Case Study

A real‑world industrial quality‑inspection case demonstrates replacing manual visual inspection and GPU‑based PyTorch pipelines with a Kunlun‑Paddle solution, achieving a 65 % cost reduction and 9 % performance improvement.

6. Summary

Kunlun XPU delivers high performance and energy efficiency for diverse AI workloads, offering a flexible and easy‑to‑use programming model, extensive software tools, and end‑to‑end ecosystem integration across hardware, OS, and AI frameworks. Over 20,000 chips have been deployed, built entirely on domestically developed technology.

Tags: deep learning, Hardware Architecture, PaddlePaddle, AI chip, Programming Model, software stack, XPU
Written by

Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
