Design and Performance of a General‑Purpose FPGA CNN Accelerator for Real‑Time AI Services

This article presents a comprehensive overview of a universal FPGA‑based CNN accelerator, detailing its motivation, flexible architecture, compiler workflow, memory and compute unit designs, and performance comparisons that demonstrate significant latency and cost advantages over CPU and GPU solutions for real‑time AI inference.

Tencent Architect

Author: Derick Wang (王玉伟), who received his master's degree from Huazhong University of Science and Technology in 2014, focuses on data‑center FPGA heterogeneous computing and has led successful acceleration projects in deep learning, high‑performance networking, and big data.

Introduction: General‑purpose FPGA CNN acceleration can dramatically shorten development cycles, support rapid iteration of deep‑learning algorithms, and deliver GPU‑level compute performance with orders‑of‑magnitude lower latency, making it well suited to the most demanding real‑time AI services.

With explosive growth of internet users and data volumes, data‑center compute demand has surged, outpacing traditional CPU capabilities. Heterogeneous computing, especially CPU+GPU and CPU+FPGA platforms, is seen as the key solution, attracting massive industry investment and maturing programming standards.

Why FPGA? Major players like Microsoft already deploy large numbers of FPGAs for AI inference. FPGA offers:

Flexibility: programmable to adapt to rapidly evolving ML algorithms, supporting DNN, CNN, LSTM, MLP, reinforcement learning, decision trees, arbitrary precision, model compression, sparsity, etc.

Performance: orders‑of‑magnitude lower latency and higher performance‑per‑watt compared with GPU/CPU.

Scale: high‑speed transceivers allow FPGA boards to be interconnected directly, and Intel's integrated CPU+FPGA architecture supports deployment at data‑center scale.

However, FPGA development traditionally suffers from long cycles and high entry barriers due to HDL coding. Custom acceleration for a single model can take months, creating tension between algorithm iteration and hardware deployment.

To address this, a universal CNN accelerator was designed. By using a compiler‑generated instruction set, the accelerator can switch models within one to two weeks, dramatically reducing development time.

How it works (Architecture): The workflow converts models trained in Caffe, TensorFlow, or MXNet into optimized instruction streams via a compiler. Image data and model weights are pre‑processed and compressed, then transferred over PCIe to the FPGA. The accelerator executes the instruction buffer, completing a full inference for one image. Each functional module operates independently, with data dependencies and execution order encoded in the instructions.
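The compiler-to-accelerator contract described above can be sketched in a few lines. This is a hypothetical toy model, not the real instruction format: it only illustrates the idea that a trained network is lowered into a flat buffer of module-level instructions that the host streams to the card.

```python
# Toy sketch of a compiler-generated instruction buffer. The module
# names, Instr fields, and compile_model() are illustrative assumptions;
# the real instruction set is not published in this article.
from dataclasses import dataclass

@dataclass
class Instr:
    module: str   # functional module to dispatch to, e.g. "LOAD", "CONV"
    args: tuple   # module-specific operands (addresses, shapes, ...)

def compile_model(layers):
    """Lower an ordered layer list into a flat instruction buffer."""
    buf = [Instr("LOAD", ("ddr_input",))]      # fetch image + weights
    for layer in layers:
        buf.append(Instr(layer.upper(), ()))   # one op per layer here
    buf.append(Instr("STORE", ("ddr_output",)))  # write result back
    return buf

buf = compile_model(["conv", "pool"])
```

On the real hardware each functional module consumes its instructions independently, with dependencies encoded in the operands rather than implied by buffer order.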

The compiler focuses on maximizing MAC/DSP efficiency and minimizing memory accesses.

Case Study – GoogLeNet V1: The Inception module combines 1×1, 3×3, 5×5 convolutions and pooling, increasing network width and scale adaptability. The design analyzes data‑dependency to expose pipelining and parallelism, allowing concurrent execution of independent branches and overlapping memory transfers.
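The dependency analysis above can be illustrated with a small scheduling sketch. Assuming a typical Inception module's branch structure (the op names below are illustrative), ops are grouped into dependency levels; ops in the same level have no mutual dependencies, so independent branches can execute concurrently.

```python
# Hypothetical sketch: group an Inception module's ops into dependency
# levels. Ops in the same level are independent and can be issued to
# separate hardware units in parallel.
from collections import defaultdict

deps = {
    "input": [],
    "conv1x1_a": ["input"], "conv1x1_b": ["input"],
    "conv1x1_c": ["input"], "pool3x3": ["input"],
    "conv3x3": ["conv1x1_b"], "conv5x5": ["conv1x1_c"],
    "conv1x1_d": ["pool3x3"],
    "concat": ["conv1x1_a", "conv3x3", "conv5x5", "conv1x1_d"],
}

def levels(deps):
    """Return ops grouped by depth in the dependency graph."""
    depth = {}
    def d(n):
        if n not in depth:
            depth[n] = 0 if not deps[n] else 1 + max(d(p) for p in deps[n])
        return depth[n]
    for n in deps:
        d(n)
    out = defaultdict(list)
    for n, k in depth.items():
        out[k].append(n)
    return [sorted(out[k]) for k in sorted(out)]
```

Here all four branch entry points land in the same level and run concurrently, while the concat waits for the deepest branch, matching the pipelining opportunity the text describes.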

Model Optimization: Two aspects are considered: structural optimizations for higher parallelism and dynamic‑precision fixed‑point quantization (int16) that retains near‑float accuracy without retraining.
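The dynamic-precision idea can be made concrete with a minimal sketch: per layer, choose the number of fractional bits so the largest weight magnitude still fits in a signed 16-bit word, then quantize against that scale. The helper names and the single-tensor scope are assumptions for illustration; the article does not publish the exact quantization procedure.

```python
# Minimal sketch of dynamic-precision int16 fixed-point quantization:
# each layer gets its own fractional-bit count based on its value range.
import math

def choose_frac_bits(values, total_bits=16):
    """Fractional bits so max |value| fits in total_bits (1 sign bit)."""
    max_abs = max(abs(v) for v in values)
    int_bits = max(0, math.ceil(math.log2(max_abs))) if max_abs > 0 else 0
    return total_bits - 1 - int_bits

def quantize(values, frac_bits, total_bits=16):
    scale = 2 ** frac_bits
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return [max(lo, min(hi, round(v * scale))) for v in values]

def dequantize(q, frac_bits):
    return [x / (2 ** frac_bits) for x in q]

# A layer whose largest weight is 3.1 needs 2 integer bits,
# leaving 13 fractional bits of precision.
w = [0.75, -1.5, 0.002, 3.1]
fb = choose_frac_bits(w)
w_back = dequantize(quantize(w, fb), fb)
```

Because the scale adapts to each layer's dynamic range, the round-trip error stays within half an LSB of that layer's scale, which is why accuracy stays near the float baseline without retraining.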

Memory Architecture: To reduce DDR traffic, a ping‑pong input/output buffer scheme with inner‑copy and cross‑copy operations is employed. For models that cannot fully reside on‑chip, slice and part partitioning splits feature maps, enabling DDR accesses to be overlapped with computation.
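The buffer-rotation part of that scheme can be sketched as follows. On the FPGA the prefetch and the compute genuinely run in parallel; this sequential sketch (with illustrative names) only shows how the two buffers alternate so that each slice's DDR transfer is issued before the previous slice finishes computing.

```python
# Structural sketch of ping-pong buffering over feature-map slices:
# while the compute unit consumes one buffer, the next slice is
# fetched from DDR into the other. Names are illustrative.
def run_pipelined(slices, fetch, compute):
    """fetch(s) loads slice s from DDR; compute(data) processes it."""
    buffers = [None, None]
    results = []
    buffers[0] = fetch(slices[0])            # prime buffer 0
    for i in range(len(slices)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(slices):
            buffers[nxt] = fetch(slices[i + 1])  # prefetch into idle buffer
        results.append(compute(buffers[cur]))    # consume current buffer
    return results

out = run_pipelined([1, 2, 3], lambda s: s * 10, lambda d: d + 1)
```

With the fetch hidden behind the compute, DDR latency disappears from the critical path as long as a slice's transfer time does not exceed its compute time.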

Compute Unit Design: Implemented on a Xilinx KU115 (a two‑die stacked device) with 4096 MAC DSP cores running at 500 MHz, for a theoretical peak of roughly 4 tera‑operations per second on int16 data (4096 MACs × 500 MHz × 2 ops per MAC). Two PE groups each contain four 32×16 MAC arrays, focusing on data reuse to lower bandwidth and power consumption.

Application Scenarios & Performance Comparison: For real‑time AI services such as ad recommendation, voice recognition, and video monitoring, FPGA offers superior latency and power efficiency. In a GoogLeNet V1 benchmark, a single KU115 accelerator cuts per‑image latency from 250 ms to 4 ms compared with a dual 6‑core CPU server, delivering a 16× throughput speedup and reducing total cost of ownership by 90 %. FPGA inference throughput slightly exceeds the Nvidia P4 GPU while achieving an order‑of‑magnitude latency improvement.

Development Cycle & Usability: The universal accelerator supports rapid deployment of classic models (GoogLeNet, VGG, ResNet, MobileNet) within a day via the compiler. Custom operators may require one to two weeks. An easy‑to‑use SDK abstracts the acceleration process, allowing business logic to invoke simple API calls and switch models in seconds.
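From the business side, the SDK interaction might look like the following. This is a purely hypothetical sketch; the class and method names are invented for illustration, since the article does not publish the SDK's actual API.

```python
# Hypothetical sketch of the SDK usage pattern described above.
# FpgaCnnAccelerator and its methods are illustrative names only.
class FpgaCnnAccelerator:
    def __init__(self):
        self._model = None

    def load_model(self, instr_path, weights_path):
        # In the real flow this would upload the compiler-generated
        # instruction buffer and quantized weights over PCIe; switching
        # models is just another load_model call.
        self._model = (instr_path, weights_path)

    def infer(self, image):
        # DMA the pre-processed image in, execute the instruction
        # buffer, DMA the result back.
        if self._model is None:
            raise RuntimeError("call load_model first")
        return {"model": self._model[0], "input": image}

acc = FpgaCnnAccelerator()
acc.load_model("googlenet_v1.instr", "googlenet_v1.weights")
result = acc.infer("image_0.jpg")
```

The point of the abstraction is that business code never touches the FPGA toolchain: compiling a new model produces a new instruction/weight pair, and swapping it in is a single call.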

Conclusion: FPGA‑based general‑purpose CNN acceleration shortens development cycles, provides GPU‑comparable performance with far lower latency, and forms the backbone of real‑time AI services. Ongoing work extends the platform to RNN/DNN workloads, and the first public‑cloud FPGA server was launched on Tencent Cloud in early 2017, with plans to broaden AI acceleration capabilities.

Tags: Performance, Compiler, AI Inference, Hardware Acceleration, FPGA, CNN Acceleration