How DFX Achieves Low-Latency Multi-FPGA Acceleration for Transformer Text Generation
The article reviews the DFX system—a multi‑FPGA server that uses model‑parallelism and a ring‑topology interconnect to accelerate GPT‑2 text generation, showing 3.78× higher throughput, 3.99× better energy efficiency, and 8.21× greater cost‑effectiveness compared with a four‑GPU V100 baseline.
Motivation
AI inference can run on CPUs, GPUs, FPGAs, or ASICs. FPGAs offer fast, efficient, hardware‑programmable acceleration, enabling task‑specific optimizations while retaining flexibility. The paper “DFX: A Low‑latency Multi‑FPGA Appliance for Accelerating Transformer‑based Text Generation” proposes a multi‑FPGA accelerator for GPT‑2 text generation.
Architecture
DFX is a server composed of two CPUs and a homogeneous cluster of four Xilinx Alveo U280 FPGA cards. Each FPGA hosts a compute core, giving four cores in total. The FPGAs connect to the host via a PCIe Gen 3 x16 subsystem; inter‑FPGA communication uses QSFP transceivers in a ring topology, each FPGA providing two QSFP ports.
Intra‑layer model parallelism splits the multi‑head‑attention weight matrix by heads and the fully‑connected layer weight matrix by columns across the four FPGAs. Each FPGA stores its partition locally, performs the same operations, and produces a sub‑vector of the output. Sub‑vectors circulate around the ring for synchronization; after synchronization each core holds the complete vector and proceeds to the next layer.
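The column‑split matmul plus ring synchronization described above amounts to a column‑parallel layer followed by an all‑gather around the ring. A minimal NumPy sketch, assuming 4 devices and an illustrative hidden size (variable names are hypothetical, not DFX's interfaces):

```python
import numpy as np

NUM_DEVICES = 4
HIDDEN = 8

rng = np.random.default_rng(0)
x = rng.standard_normal(HIDDEN)            # input vector, replicated on every device
W = rng.standard_normal((HIDDEN, HIDDEN))  # full FC weight matrix

# Split the weight matrix by columns: each device stores one partition locally.
partitions = np.split(W, NUM_DEVICES, axis=1)

# Each device performs the same matmul on its partition -> a sub-vector.
sub_vectors = [x @ Wi for Wi in partitions]

# Ring synchronization: in each of N-1 steps, every device forwards the
# sub-vector it most recently received to its neighbor (an all-gather).
held = [{d: sub_vectors[d]} for d in range(NUM_DEVICES)]
sending = list(range(NUM_DEVICES))  # shard index each device sends next
for _ in range(NUM_DEVICES - 1):
    received = [sending[(d - 1) % NUM_DEVICES] for d in range(NUM_DEVICES)]
    for d, idx in enumerate(received):
        held[d][idx] = sub_vectors[idx]
    sending = received

# After N-1 ring steps every device holds all shards; concatenating them in
# index order reconstructs the complete output vector.
full = np.concatenate([held[0][i] for i in range(NUM_DEVICES)])
assert np.allclose(full, x @ W)
```

The ring needs only N−1 neighbor‑to‑neighbor transfers per synchronization, which is why a two‑port QSFP ring suffices instead of all‑to‑all links.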
The compute core comprises three units:
Matrix processing unit: matrix multiplication and masked matrix multiplication.
Vector processing unit: softmax, layer‑normalization, and residual connections.
DMA unit: maximizes HBM bandwidth for different parameter types (weights, biases, keys, values, etc.).
A lightweight router reduces the overhead of data synchronization after each parallel matrix‑multiplication step.
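To make the division of labor concrete, here is an illustrative NumPy sketch of the two core operations these units implement for decoder attention: masked matrix multiplication (matrix processing unit) and a numerically stable softmax (vector processing unit). This models the math only, not DFX's hardware datapath; all sizes and names are assumptions.

```python
import numpy as np

SEQ, D_HEAD = 5, 16  # illustrative sequence length and per-head dimension
rng = np.random.default_rng(1)
Q = rng.standard_normal((SEQ, D_HEAD))
K = rng.standard_normal((SEQ, D_HEAD))

# Matrix processing unit: masked matrix multiplication.
# Causal masking blocks attention to future token positions.
scores = Q @ K.T / np.sqrt(D_HEAD)
future = np.triu(np.ones((SEQ, SEQ), dtype=bool), k=1)
scores[future] = -np.inf

# Vector processing unit: row-wise softmax (max-subtracted for stability).
m = scores.max(axis=1, keepdims=True)
e = np.exp(scores - m)
probs = e / e.sum(axis=1, keepdims=True)

# Each row sums to 1 and attends only to current and past tokens.
assert np.allclose(probs.sum(axis=1), 1.0)
assert np.allclose(probs[future], 0.0)
```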
Experimental Evaluation
Evaluation used an Intel Xeon Gold 6226R CPU with four U280 cards. The server motherboard provides 20 PCIe Gen3 x16 slots, allowing scaling by adding more FPGA cards or duplicating the cluster.
Each U280 operated at 200 MHz kernel frequency and 410 MHz memory‑interface frequency, achieving resource utilizations of 39.93 % LUT, 42.52 % FF, 59.13 % BRAM, 10.83 % URAM, and 39.15 % DSP. The design is written in SystemVerilog, synthesized with Xilinx Vivado, and uses Xilinx Vitis 2020.2 for host‑FPGA communication.
The baseline consists of four Nvidia Tesla V100 32 GB HBM GPUs in the same server.
Results
With four FPGA cards, DFX achieves an average 3.78× higher throughput and 3.99× better energy efficiency than the V100 baseline. Performance scales linearly with the number of FPGAs, increasing by approximately 1.5× for each additional FPGA. Cost‑effectiveness is measured at 8.21× that of the GPU solution.
Reference: Hong, Seongmin, et al. “DFX: A Low‑latency Multi‑FPGA Appliance for Accelerating Transformer‑based Text Generation.” 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2022.
Paper: https://arxiv.org/pdf/2209.10797.pdf
Slide: https://hc34.hotchips.org/assets/program/posters/HC2022.KAIST.SeongminHong.v03.pdf
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Network Intelligence Research Center (NIRC)
NIRC is based at the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
