How Alibaba Cloud Powers AI with Cutting‑Edge Heterogeneous Compute
This article explains how Alibaba Cloud builds high‑performance AI infrastructure. It combines hardware such as Shenlong servers, GPUs, FPGAs, and NPUs with RDMA‑based interconnects, and layers virtualization, FPGA‑as‑a‑Service, AIACC, and resource‑pooling technologies on top to deliver scalable, cost‑effective AI services.
Computing Technology
On Alibaba Cloud's Shenlong hardware platform, the virtualization architecture has been upgraded to provide lightweight VMs that deliver near‑bare‑metal performance. A Shenlong server features hardware‑direct I/O, implements storage and network virtualization on MOC cards, and runs a trimmed‑down Linux OS with a lightweight hypervisor.
GPU
The GPU remains the most mature and widely used AI accelerator. Alibaba has performed extensive compiler optimizations for GPUs, offers cloud GPU products that keep pace with the latest GPU generations, and leads in GPU virtualization, hot upgrade, and SR‑IOV‑based GPU migration.
Training chips continue to push performance through FP32 scaling, Tensor Core reduced‑precision arithmetic, NVLink high‑speed links, and GPUDirect RDMA. Alibaba Cloud also researches programmable topology splitting on NVSwitch for multi‑GPU communication.
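One widely used precision technique on Tensor Cores is mixed precision: inputs are multiplied in FP16 while the running sum is kept in FP32. The sketch below only emulates that idea in plain Python so the effect is visible; it is not GPU code, and the helper names (to_fp16, to_fp32, dot_fp16_only, dot_mixed) are made up for this illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_fp16_only(a, b):
    """Dot product where products AND the accumulator are rounded to FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + prod)
    return acc

def dot_mixed(a, b):
    """Dot product with FP16 inputs but an FP32 accumulator (mixed precision)."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(x) * to_fp16(y)   # product of two FP16 values fits in FP32
        acc = to_fp32(acc + prod)
    return acc

ones = [1.0] * 4096
fp16_result = dot_fp16_only(ones, ones)
mixed_result = dot_mixed(ones, ones)
print(fp16_result, mixed_result)
```

With 4096 ones, the FP16 accumulator stops growing once the sum reaches 2048, because 2049 is not representable in half precision, while the FP32 accumulator reaches the exact answer. This is why keeping the accumulator wider than the inputs preserves accuracy at low cost.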
FPGA
FPGAs provide ASIC‑level performance with high flexibility but traditionally suffer from long development cycles and a steep learning curve. Alibaba Cloud and AIS co‑developed the industry's first dual‑chip Xilinx FPGA board, achieving independent innovation at the board and HDK levels.
Shuntian Platform: FPGA as a Service (FaaS)
FaaS provides a unified hardware platform and middleware in the cloud, dramatically lowering accelerator development and deployment costs. Third‑party ISV IP can be offered as a service without users needing to understand the underlying hardware. The platform supplies two development kits, HDK and SDK.
AliDNN
AliDNN is a full‑stack deep‑learning acceleration engine built on FPGA, comprising a custom instruction set, accelerator, SDK, and compiler. It allows TensorFlow, Caffe and other frameworks to call the engine directly, compiling models into accelerator instructions for high‑throughput, low‑latency inference.
NPU
AliNPU (including the “HanGuang 800”) is optimized for CNN workloads while also supporting RNN models, delivering a cost‑performance ratio that surpasses competing AI inference chips.
Interconnect Technology
RDMA
RDMA is a high‑performance network technology that reduces data transfer time and CPU overhead, crucial for AI, scientific computing, and distributed storage. Alibaba’s HAIL network architecture, combined with custom switches, powers the world’s largest RDMA deployment across dozens of data centers, supporting services such as ESSD, SCC, PAI, and POLARDB.
EXSPARCL Communication Library
EXSPARCL (Extremely Scalable and high Performance Alibaba Group Communication Library) offers generic collective communication, compatible with NVIDIA NCCL, and is optimized for large‑scale AI clusters with multi‑NIC support, topology‑aware routing, and congestion‑free algorithms.
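To make "collective communication" concrete, here is a minimal single-process simulation of ring all-reduce, a classic algorithm for the sum all-reduce operation that NCCL-style libraries expose. It is a teaching sketch only: the article does not say which algorithms EXSPARCL uses, and the function name ring_allreduce and the list-based "workers" are assumptions made for this example.

```python
from typing import List

def ring_allreduce(buffers: List[List[int]]) -> List[List[int]]:
    """Simulate a sum all-reduce over n workers connected in a ring.

    buffers[r] is worker r's local data. Every buffer must have the same
    length, and that length must be divisible by the number of workers.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert all(len(b) == size for b in buffers), "buffers must match in length"
    assert size % n == 0, "length must be divisible by the worker count"
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c: int) -> range:
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, worker r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] += v

    # Now worker r holds the fully reduced chunk (r + 1) % n.
    # Phase 2, all-gather: pass the finished chunks around the ring.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] = v

    return data

result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Each worker sends and receives only one chunk per step, so per-worker traffic stays roughly constant as the ring grows. That property is what makes this family of algorithms attractive on large clusters.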
AIACC (AI Acceleration Tool)
AIACC provides a unified framework that accelerates TensorFlow, PyTorch, MXNet, and Caffe across VPC and RDMA networks, delivering training speedups of up to 10× without requiring code changes.
Heterogeneous GPU Container Support (cGPU)
The cGPU solution provides transparent GPU sharing and isolation for containers. Its main properties are:
Adapts to open‑source Kubernetes and NVIDIA Docker standards
Transparent to users; no need to recompile or replace CUDA libraries
Provides stable low‑level operations for NVIDIA devices
Supports both memory and compute isolation
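As a rough illustration of what memory and compute isolation mean when several containers share one device, the toy model below gives each container a hard memory cap and a compute weight. It does not reflect how cGPU is implemented; the SharedGPU class, its methods, and the MiB figures are all hypothetical.

```python
class SharedGPU:
    """Toy model of one GPU shared by several containers (hypothetical API)."""

    def __init__(self, total_mem_mib: int):
        self.total_mem_mib = total_mem_mib
        self.quota = {}   # container name -> memory cap in MiB
        self.used = {}    # container name -> MiB currently allocated
        self.weight = {}  # container name -> relative compute weight

    def add_container(self, name: str, mem_quota_mib: int, compute_weight: int) -> None:
        # Memory isolation starts at admission: quotas may not oversubscribe the device.
        if sum(self.quota.values()) + mem_quota_mib > self.total_mem_mib:
            raise ValueError("memory quotas would exceed device capacity")
        self.quota[name] = mem_quota_mib
        self.used[name] = 0
        self.weight[name] = compute_weight

    def alloc(self, name: str, mib: int) -> bool:
        """Grant an allocation only if the container stays within its own cap."""
        if self.used[name] + mib > self.quota[name]:
            return False
        self.used[name] += mib
        return True

    def compute_share(self, name: str) -> float:
        """Fraction of compute time the container gets when all are busy."""
        return self.weight[name] / sum(self.weight.values())

gpu = SharedGPU(total_mem_mib=16384)
gpu.add_container("train", mem_quota_mib=8192, compute_weight=3)
gpu.add_container("infer", mem_quota_mib=4096, compute_weight=1)
print(gpu.alloc("train", 8000), gpu.alloc("train", 500), gpu.compute_share("train"))
```

In this model one container exhausting its quota cannot take memory from another, and compute is divided by weight rather than first come, first served.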
Software Pooling (EAIS.EI)
EAIS decouples CPU cores from heterogeneous accelerators via software pooling, allowing CPU‑only ECS instances to dynamically attach or detach GPUs, FPGAs, NPUs, etc., communicating over encrypted gRPC.
HARP: Runtime Layer Pooling
HARP virtualizes accelerator resources, initially focusing on GPUs, to enable transparent, dynamic allocation of local or remote accelerators across containers, VMs, or bare metal, improving utilization and supporting future expansion to NPUs and other chips.
Transparent to upper‑layer applications; works on physical machines, containers, and VMs
Supports both local and remote accelerators with no performance loss locally
Lightweight API for fine‑grained memory and compute control
Hides hardware details and automates configuration
Provides profiling and trace‑replay capabilities
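The local-or-remote allocation described above can be sketched as a simple placement policy: prefer an accelerator on the requesting host and fall back to a remote one only when no local device has enough free memory. This is an assumption-laden sketch, not HARP's scheduler; the Accelerator dataclass, the place function, and the host names are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Accelerator:
    name: str
    host: str
    free_mem_mib: int

def place(request_host: str, mem_mib: int, pool: List[Accelerator]) -> Optional[Accelerator]:
    """Pick an accelerator for a request, preferring devices on the same host.

    Returns None when nothing in the pool can satisfy the request.
    """
    candidates = [a for a in pool if a.free_mem_mib >= mem_mib]
    if not candidates:
        return None
    # Sort key: local devices first (False < True), then the tightest fit,
    # which leaves larger devices free for larger requests.
    best = min(candidates, key=lambda a: (a.host != request_host, a.free_mem_mib))
    best.free_mem_mib -= mem_mib
    return best

pool = [
    Accelerator("gpu0", "hostA", 4096),
    Accelerator("gpu1", "hostB", 16384),
]
first = place("hostA", 2048, pool)    # fits on the local device
second = place("hostA", 8192, pool)   # too large locally, goes remote
third = place("hostA", 32768, pool)   # cannot be satisfied anywhere
print(first.name, second.name, third)
```

The caller receives a device handle either way, which is the sense in which the choice between local and remote stays hidden from the application.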
Hardware Layer Pooling
Hardware pooling combines general‑purpose CPUs with various accelerators via high‑speed interconnects, enabling flexible accelerator composition, higher reliability, and better SLA guarantees.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
