Alibaba's FPGA-Based Ultra‑Low Latency, High‑Throughput Machine Learning Processor
Alibaba unveiled an FPGA-based machine-learning accelerator that achieves sub-millisecond inference latency and thousands of frames per second of throughput, demonstrating how tightly integrated hardware-software optimization can deliver real-time AI performance beyond conventional GPU and ASIC solutions.
When selecting a machine-learning processor, online-service developers must balance raw compute power against inference latency; evaluating throughput (frames per second, FPS) at a given latency gives a more realistic picture of production performance than peak throughput alone.
At the Hot Chips 30 conference, Alibaba presented its research on ultra-low-latency, high-throughput ML processors and exchanged ideas with experts from leading internet and chip companies.
By combining a tightly coupled hardware-software design, low-precision and sparsity techniques, and FPGA architecture optimizations, Alibaba's processor completes ResNet-18 inference on an image in only 0.174 ms and sustains 5,747 FPS (at batch size 1, 1 / 0.174 ms works out to roughly 5,747 inferences per second), enabling real-time AI experiences.
Among the alternatives, GPUs must shrink batch sizes to reach low latency, sacrificing throughput; ASICs have long development cycles and lag behind newly introduced operators; FPGAs offer programmable, customizable hardware that can achieve both low latency and high throughput.
The FPGA architecture features highly efficient instruction scheduling (convolution efficiency above 90%), support for low-precision data types, and CSR (compressed sparse row) compression for sparse parameters, yielding industry-leading performance; the CSR idea is sketched below.
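To make the CSR point concrete, here is a minimal sketch of how compressed sparse row storage shrinks a pruned weight matrix. The 512x512 shape, the roughly 90% sparsity level, and the use of SciPy are illustrative assumptions, not details from Alibaba's design.

```python
# A minimal sketch of CSR (compressed sparse row) storage for pruned
# weights. The 512x512 shape and ~90% sparsity are illustrative
# assumptions, not figures from the article.
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
dense = rng.standard_normal((512, 512)).astype(np.float32)
dense[rng.random(dense.shape) < 0.9] = 0.0  # prune ~90% of the weights

sparse = csr_matrix(dense)  # stores nonzeros + column indices + row pointers
dense_bytes = dense.nbytes
csr_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(f"dense: {dense_bytes / 1024:.0f} KiB")
print(f"CSR:   {csr_bytes / 1024:.0f} KiB ({csr_bytes / dense_bytes:.0%} of dense)")
```

On a matrix this sparse, CSR keeps only the nonzero values plus two small index arrays, cutting storage and memory traffic roughly in proportion to the sparsity.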
On the algorithm side, Alibaba introduced a low-precision training pipeline (regular training, then pruning, weight quantization, and feature-map quantization) that preserves high accuracy for models such as ResNet-18 on ImageNet; a sketch of the weight-quantization stage appears below.
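As an illustration of the weight-quantization stage, here is a minimal sketch of symmetric per-tensor int8 quantization. The scheme, bit width, and function names are assumptions for exposition; the article does not specify Alibaba's exact quantizer.

```python
# A minimal sketch of symmetric per-tensor int8 weight quantization,
# one stage of the pipeline (training -> pruning -> weight quantization
# -> feature-map quantization). Scheme and names are assumptions.
import numpy as np

def quantize_symmetric(w: np.ndarray, num_bits: int = 8):
    """Map float weights to signed integers with one per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1                     # 127 for int8
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)  # guard all-zero tensors
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal(1024).astype(np.float32)
q, s = quantize_symmetric(w)
print(f"scale={s:.4f}, max abs error={np.abs(w - dequantize(q, s)).max():.4f}")
```

In the full pipeline described in the article, pruning runs before this step, and feature maps are quantized separately after the weights.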
Empirical results show the Alibaba FPGA processor delivering 0.174 ms latency and 5,747 FPS, whereas a mainstream data-center GPU bottoms out at 1.29 ms latency with only 769 FPS, and its latency rises to 29.98 ms at peak throughput, underscoring the FPGA processor's advantage on the FPS-at-latency metric.
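These figures admit a quick sanity check if throughput is modeled as batch size divided by per-batch latency; the batch-size-1 assumption below is back-computed from the numbers, not stated in the article.

```python
# Throughput (FPS) = batch_size / per-batch latency. Latencies are the
# article's own figures; batch size 1 is a back-computed assumption.
def throughput_fps(batch_size: int, latency_s: float) -> float:
    return batch_size / latency_s

print(f"FPGA at 0.174 ms: {throughput_fps(1, 0.174e-3):,.0f} FPS")  # ~5,747
print(f"GPU  at 1.29 ms:  {throughput_fps(1, 1.29e-3):,.0f} FPS")   # ~775 vs. 769 reported
```

The FPGA number matches batch size 1 almost exactly, while the GPU's reported 769 FPS is consistent with its 1.29 ms floor; pushing the GPU toward peak throughput requires larger batches, which is why its latency climbs to 29.98 ms.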
For agile development, the processor is built around a domain-specific instruction set; when a model changes, the compiler generates new instructions and loads them onto the device, shrinking upgrade cycles from the months a hardware respin would take to effectively real time. A toy sketch of this flow follows.
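The following is a purely hypothetical sketch of that compile-and-load flow; the opcodes, fields, and helper names are invented for illustration and are not Alibaba's actual instruction set.

```python
# A hypothetical sketch of the agile-deployment idea: a compiler turns a
# model description into a stream of domain-specific instructions that
# the accelerator loads at run time, so a model change needs no hardware
# respin. Opcodes and fields are invented, not Alibaba's actual ISA.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Instr:
    opcode: str   # e.g. "CONV", "POOL", "ADD"
    params: dict  # layer shape, stride, precision, etc.

def compile_model(layers: list[dict]) -> list[Instr]:
    """Translate a layer list into an instruction stream."""
    return [Instr(l["op"], {k: v for k, v in l.items() if k != "op"})
            for l in layers]

residual_block = [
    {"op": "CONV", "kernel": 3, "stride": 1, "out_channels": 64},
    {"op": "CONV", "kernel": 3, "stride": 1, "out_channels": 64},
    {"op": "ADD"},  # residual connection
]
program = compile_model(residual_block)
for instr in program:  # in deployment, this stream would be serialized
    print(instr)       # and loaded onto the accelerator
```

The point of the design is that only this instruction stream changes when the model does; the FPGA bitstream and hardware stay fixed.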
Alibaba’s technology team demonstrates that co‑optimizing FPGA architecture, algorithms, and compilers can simultaneously improve performance, model accuracy, and flexibility, reflecting the company’s ongoing commitment to infrastructure innovation.