Alibaba's FPGA-Based Ultra‑Low Latency, High‑Throughput Machine Learning Processor
Alibaba unveiled an FPGA-based machine-learning accelerator that achieves sub-millisecond inference latency and thousands of frames per second of throughput, demonstrating how tightly integrated hardware-software optimization can deliver real-time AI performance beyond conventional GPU and ASIC solutions.
When selecting a machine-learning processor, online-service developers must balance raw compute power against inference latency; evaluating throughput (frames per second, FPS) at a given latency gives a more realistic picture of production performance than peak throughput alone.
At the Hot Chips 30 conference, Alibaba presented its research on ultra-low-latency, high-throughput ML processors and exchanged ideas with experts from leading internet and chip companies.
By combining a tightly coupled hardware-software design, low-precision and sparsity techniques, and FPGA architecture optimizations, Alibaba's processor completes ResNet-18 inference on an image in only 0.174 ms and sustains 5,747 FPS (at batch size 1, 1 / 0.174 ms works out to roughly 5,747 inferences per second), enabling real-time AI experiences.
Among the alternatives, GPUs must shrink batch sizes to reach low latency, sacrificing throughput; ASICs have long development cycles and lag behind newly introduced operators; FPGAs offer programmable, customizable hardware that can achieve both low latency and high throughput.
The FPGA architecture features highly efficient instruction scheduling (convolution efficiency above 90%), support for low-precision data types, and CSR (compressed sparse row) compression for sparse parameters, yielding industry-leading performance; the CSR idea is sketched below.
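To make the CSR point concrete, here is a minimal sketch of how compressed sparse row storage shrinks a pruned weight matrix. The 512x512 shape, the roughly 90% sparsity level, and the use of SciPy are illustrative assumptions, not details from Alibaba's design.

```python
# A minimal sketch of CSR (compressed sparse row) storage for pruned
# weights. The 512x512 shape and ~90% sparsity are illustrative
# assumptions, not figures from the article.
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
dense = rng.standard_normal((512, 512)).astype(np.float32)
dense[rng.random(dense.shape) < 0.9] = 0.0  # prune ~90% of the weights

sparse = csr_matrix(dense)  # stores nonzeros + column indices + row pointers
dense_bytes = dense.nbytes
csr_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(f"dense: {dense_bytes / 1024:.0f} KiB")
print(f"CSR:   {csr_bytes / 1024:.0f} KiB ({csr_bytes / dense_bytes:.0%} of dense)")
```

On a matrix this sparse, CSR keeps only the nonzero values plus two small index arrays, cutting storage and memory traffic roughly in proportion to the sparsity.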
On the algorithm side, Alibaba introduced a low-precision training pipeline (regular training, then pruning, weight quantization, and feature-map quantization) that preserves high accuracy for models such as ResNet-18 on ImageNet; a sketch of the weight-quantization stage appears below.
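As an illustration of the weight-quantization stage, here is a minimal sketch of symmetric per-tensor int8 quantization. The scheme, bit width, and function names are assumptions for exposition; the article does not specify Alibaba's exact quantizer.

```python
# A minimal sketch of symmetric per-tensor int8 weight quantization,
# one stage of the pipeline (training -> pruning -> weight quantization
# -> feature-map quantization). Scheme and names are assumptions.
import numpy as np

def quantize_symmetric(w: np.ndarray, num_bits: int = 8):
    """Map float weights to signed integers with one per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1                     # 127 for int8
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)  # guard all-zero tensors
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal(1024).astype(np.float32)
q, s = quantize_symmetric(w)
print(f"scale={s:.4f}, max abs error={np.abs(w - dequantize(q, s)).max():.4f}")
```

In the full pipeline described in the article, pruning runs before this step, and feature maps are quantized separately after the weights.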
Empirical results show the Alibaba FPGA processor delivering 0.174 ms latency and 5,747 FPS, whereas a mainstream data-center GPU bottoms out at 1.29 ms latency with only 769 FPS, and its latency rises to 29.98 ms at peak throughput, underscoring the FPGA processor's advantage on the FPS-at-latency metric.
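These figures admit a quick sanity check if throughput is modeled as batch size divided by per-batch latency; the batch-size-1 assumption below is back-computed from the numbers, not stated in the article.

```python
# Throughput (FPS) = batch_size / per-batch latency. Latencies are the
# article's own figures; batch size 1 is a back-computed assumption.
def throughput_fps(batch_size: int, latency_s: float) -> float:
    return batch_size / latency_s

print(f"FPGA at 0.174 ms: {throughput_fps(1, 0.174e-3):,.0f} FPS")  # ~5,747
print(f"GPU  at 1.29 ms:  {throughput_fps(1, 1.29e-3):,.0f} FPS")   # ~775 vs. 769 reported
```

The FPGA number matches batch size 1 almost exactly, while the GPU's reported 769 FPS is consistent with its 1.29 ms floor; pushing the GPU toward peak throughput requires larger batches, which is why its latency climbs to 29.98 ms.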
For agile development, the processor is built around a domain-specific instruction set; when a model changes, the compiler generates new instructions and loads them onto the device, shrinking upgrade cycles from the months a hardware respin would take to effectively real time. A toy sketch of this flow follows.
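The following is a purely hypothetical sketch of that compile-and-load flow; the opcodes, fields, and helper names are invented for illustration and are not Alibaba's actual instruction set.

```python
# A hypothetical sketch of the agile-deployment idea: a compiler turns a
# model description into a stream of domain-specific instructions that
# the accelerator loads at run time, so a model change needs no hardware
# respin. Opcodes and fields are invented, not Alibaba's actual ISA.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Instr:
    opcode: str   # e.g. "CONV", "POOL", "ADD"
    params: dict  # layer shape, stride, precision, etc.

def compile_model(layers: list[dict]) -> list[Instr]:
    """Translate a layer list into an instruction stream."""
    return [Instr(l["op"], {k: v for k, v in l.items() if k != "op"})
            for l in layers]

residual_block = [
    {"op": "CONV", "kernel": 3, "stride": 1, "out_channels": 64},
    {"op": "CONV", "kernel": 3, "stride": 1, "out_channels": 64},
    {"op": "ADD"},  # residual connection
]
program = compile_model(residual_block)
for instr in program:  # in deployment, this stream would be serialized
    print(instr)       # and loaded onto the accelerator
```

The point of the design is that only this instruction stream changes when the model does; the FPGA bitstream and hardware stay fixed.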
Alibaba’s technology team demonstrates that co‑optimizing FPGA architecture, algorithms, and compilers can simultaneously improve performance, model accuracy, and flexibility, reflecting the company’s ongoing commitment to infrastructure innovation.