
Survey of Bandwidth Optimization Techniques in AI Accelerators

This article reviews various architectural strategies—including streaming processing, on‑chip memory optimization, bit‑width compression, sparsity techniques, on‑chip models with chip‑level interconnects, and emerging technologies such as binary networks, memristors, and HBM—to alleviate bandwidth bottlenecks in FPGA/ASIC/TPU AI accelerators.

Tencent Architect

Yu Xiaoyu, Ph.D., senior researcher at Tencent TEG Architecture Platform, focuses on heterogeneous deep‑learning computation, FPGA cloud, and high‑speed visual perception architecture design and optimization.

1. Overview

Whether built on FPGA or ASIC, and whether running CNN, LSTM, or MLP workloads on embedded devices or in the cloud (e.g., the TPU), first-generation AI acceleration platforms all share a core challenge: insufficient memory bandwidth leaves abundant compute resources underutilized.

The literature addresses this issue from several angles:

A. Streaming processing and data reuse

B. On‑chip storage and its optimization

C. Bit‑width compression

D. Sparse optimization

E. On‑chip models and chip‑level interconnects

F. Emerging technologies: binary networks, memristors, and HBM

2. Comparative Techniques and Evolution

2.1 Streaming Processing and Data Reuse

Streaming processing replaces the traditional write‑back‑then‑read pattern with instruction‑level parallel pipelines, allowing each processing element (PE) to receive data directly from its predecessor. This reduces memory‑bandwidth dependence, as illustrated by the contrast between data‑parallel and streaming architectures (Figure 2.1).

In streaming designs, a one‑dimensional systolic array (Figure 2.2, left) lets each PE read its input from memory only once and write back only after the final PE, dramatically lowering bandwidth requirements. Two‑dimensional systolic arrays, as used in the Google TPU (Figure 2.2, right), support matrix‑matrix and vector‑matrix multiplications with data flowing in from the top and left edges of the cell array.
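The dataflow can be sketched as a minimal functional model in Python. This is not the TPU's implementation, just an illustration of the weight-stationary idea: weights stay resident in the PE grid, each activation is read from memory once, and partial sums accumulate through the array rather than bouncing through DRAM. The function name and structure are ours; real arrays skew inputs so rows overlap in time, which this sketch omits.

```python
import numpy as np

def systolic_matmul(A, B):
    """Functional sketch of a weight-stationary systolic array.
    B is preloaded into the PE grid; each row of A streams in from the
    left edge exactly once, partial sums accumulate downward through the
    K dimension, and a finished row of C exits at the bottom edge."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    for m in range(M):                 # row of A streamed from the left
        psum = np.zeros(N)             # partial sums entering the top edge
        for k in range(K):             # PE row k holds resident weights B[k, :]
            psum += A[m, k] * B[k, :]  # each PE: one MAC, no DRAM access
        C[m, :] = psum                 # row exits at the bottom edge
    return C
```

Note that every element of A and B is touched exactly once per output pass; only the final results return to memory, which is the bandwidth saving the text describes.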

While streaming improves data reuse, it introduces challenges such as data re‑ordering and scaling inefficiencies when the array size exceeds the workload dimensions.

Similar concepts appear in Cambricon’s DianNao series, where multi‑layer PE arrays and fine‑grained compute trees (Figure 2.4) increase utilization and control power consumption.

2.2 On‑Chip Storage and Its Optimization

Off‑chip DRAM offers large capacity but limited bandwidth and high access energy (Figure 2.6). Two primary remedies are larger on‑chip caches and distributing storage near the compute units.

1) Expanding on‑chip caches enables higher data reuse; when matrices fit entirely in cache, each element is loaded once, drastically reducing DRAM traffic, as demonstrated in many AI‑ASIC papers from ISSCC 2016.

2) Near‑storage places multiple small memories close to each PE, increasing aggregate bandwidth (Figure 2.7). Fine‑grained designs further embed private storage within each compute unit (Figure 2.8), while hierarchical schemes such as DaDianNao’s three‑level memory (Figure 2.9) provide central, ring‑distributed, and I/O buffers, allowing full model placement on chip.
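The reuse benefit of on-chip caching in point 1) can be made concrete with a back-of-the-envelope traffic model. The sketch below is our own rough estimate, not from the article: it ignores traffic for C itself and assumes the tile size divides the matrix dimensions.

```python
def matmul_dram_traffic(M, K, N, tile):
    """Rough off-chip word counts for C = A @ B (C traffic ignored).
    Without reuse, every output element refetches a row of A and a
    column of B.  With tile x tile output blocks held on chip, A is
    streamed once per column-tile of B and B once per row-tile of A,
    so traffic drops by roughly a factor of the tile size."""
    no_reuse = M * N * K * 2                           # A row + B column per output
    tiled = M * K * (N // tile) + K * N * (M // tile)  # each operand reused tile times
    return no_reuse, tiled
```

For 1024x1024 matrices with 128x128 tiles, the model predicts about a 128x reduction in DRAM traffic, which is why cache capacity translates so directly into effective bandwidth.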

2.3 Bit‑Width Compression

Reducing operand precision from 32‑bit to 16‑bit, 8‑bit, or even 1‑bit cuts memory traffic and area. For example, a 16‑bit multiplier occupies only 1/5 of the area of a 32‑bit unit, enabling five‑fold multiplier density. Quantization schemes (linear, logarithmic, non‑linear) and dynamic bit‑width adjustment (Figure 2.10) mitigate accuracy loss, while INT8 support in GPUs and FPGAs yields up to 4× performance gains (Figure 2.11).
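Of the quantization schemes mentioned, the linear (uniform) one is the simplest to sketch. The following is a minimal symmetric per-tensor quantizer of our own construction, not the article's method: one scale factor maps floats onto signed 8-bit integers, halving or quartering memory traffic relative to FP32 at the cost of bounded rounding error.

```python
import numpy as np

def quantize_linear(x, num_bits=8):
    """Symmetric linear quantization: map float tensor x onto signed
    integers in [-(2^(b-1)-1), 2^(b-1)-1] with a single scale factor."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(np.abs(x).max(), 1e-12) / qmax  # guard against all-zero input
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is at most scale / 2 per element."""
    return q.astype(np.float32) * scale
```

Dynamic bit-width adjustment (Figure 2.10) amounts to recomputing the scale, and possibly `num_bits`, per layer or per tensor instead of fixing them globally.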

2.4 Sparse Optimization

Sparse models arise from inherently sparse algorithms (e.g., NLP) or from pruning dense networks. FPGA and ASIC implementations (Figures 2.12‑2.13) filter out zero values, achieving multiple‑fold speed‑up and power‑efficiency improvements.
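The zero-filtering idea can be illustrated with a compressed sparse row (CSR) layout, a common software analogue of what the hardware in Figures 2.12-2.13 does: store and process only the nonzero weights. This sketch is ours and is not tied to any specific accelerator in the survey.

```python
import numpy as np

def dense_to_csr(A, eps=0.0):
    """Compress a pruned weight matrix: keep only entries with |value| > eps."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, v in enumerate(row):
            if abs(v) > eps:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx, dtype=np.int64), np.array(row_ptr)

def csr_matvec(values, col_idx, row_ptr, x):
    """Matrix-vector product that touches only the nonzeros, so zero
    weights cost neither memory bandwidth nor MAC operations."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        s, e = row_ptr[i], row_ptr[i + 1]
        y[i] = values[s:e] @ x[col_idx[s:e]]
    return y
```

At 80% sparsity this moves roughly one fifth of the dense weight traffic, which is the source of the multiple-fold gains the text cites; the index arrays add some overhead, so the break-even point depends on how sparse the model actually is.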

2.5 On‑Chip Models and Chip‑Level Interconnect

Storing all weights on chip and using high‑bandwidth interconnects (e.g., HT2.0 with 6.4 GB/s × 4 channels) eliminates DRAM accesses, as demonstrated by DaDianNao’s 36 MB cache and massive PE array, and by multi‑chip FPGA deployments for LSTM inference.

2.6 Emerging Technologies: Binary Networks, Memristors, and HBM

Binary networks convert weights and activations to 1‑bit, enabling logic‑only implementations on FPGA/ASIC. Memristor‑based in‑memory computing performs multiply‑accumulate directly in crossbars (Figure 2.15). 3‑D stacking technologies such as HBM and HMC provide 10‑12× bandwidth over DDR4, with GPUs (P100, V100) and TPUs already integrating HBM2.
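The "logic-only" claim for binary networks rests on a standard identity: with weights and activations constrained to +/-1, a dot product reduces to XNOR plus popcount, with no multipliers at all. A small sketch of that identity, with hypothetical helper names of our choosing:

```python
def pack_bits(v):
    """Encode a +/-1 vector as an integer: bit i is set iff v[i] == +1."""
    bits = 0
    for i, s in enumerate(v):
        if s == 1:
            bits |= 1 << i
    return bits

def binary_dot(w_bits, a_bits, n):
    """XNOR-popcount dot product of two +/-1 vectors of length n.
    Agreeing signs contribute +1 and disagreeing signs -1, so
    dot = matches - mismatches = 2 * popcount(XNOR) - n."""
    xnor = ~(w_bits ^ a_bits) & ((1 << n) - 1)  # bit is 1 where signs agree
    return 2 * bin(xnor).count("1") - n
```

On an FPGA the XNOR and popcount map directly onto LUTs, which is why binary networks sidestep both the multiplier budget and most of the weight bandwidth.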

3. Conclusion

The surveyed techniques collectively address the bandwidth bottleneck in AI accelerators. Combining on‑chip storage, data reuse, quantization, and sparsity yields substantial performance gains, but each method introduces trade‑offs in accuracy, design complexity, and hardware resources, necessitating close collaboration between algorithm and hardware teams.
