Why General‑Purpose CPUs Are Inefficient for Deep Learning: Heterogeneous Computing and AI Processor Design
The article analyzes the limitations of general‑purpose CPUs for deep‑learning workloads, explains how semiconductor scaling and memory‑bandwidth constraints drive the shift toward specialized heterogeneous processors such as GPUs, FPGAs, and ASICs, and discusses the design trade‑offs of embedded versus cloud AI accelerators.
Yu Xiaoyu, Ph.D., senior researcher at Tencent TEG Architecture Platform, focuses on deep‑learning heterogeneous computing, FPGA cloud, and high‑speed visual perception architecture design and optimization.
Overview – General‑Purpose CPUs Are Inefficient
CPUs have long been the indispensable core of computing, but their dominance in compute platforms is waning for two main reasons: intrinsic constraints and shifting workload demands.
Intrinsic constraints include semiconductor process limits and memory‑bandwidth bottlenecks. After the 7 nm node, Moore’s law is fading, preventing CPUs from gaining performance through higher transistor density while keeping power constant. To achieve higher performance and lower power, designers reduce generality, creating GPUs and custom ASICs that excel at specific tasks.
Moreover, CPU cores must stream large volumes of data, but off‑chip DDR memory offers limited bandwidth and high latency. On‑chip caches mitigate this only partially, and at a steep cost: in a typical CPU the arithmetic units occupy only a small fraction of the die, with most silicon devoted to caches and control logic. Backward‑compatibility constraints further hinder architectural evolution.
Demand shift is driven by two emerging compute‑intensive scenarios: cloud‑scale big‑data analytics and deep learning. In deep‑learning workloads (e.g., CNNs), model depth and accuracy growth demand massive compute density and data reuse, which CPUs cannot deliver efficiently.
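To make the compute-density claim concrete, here is a minimal sketch of the multiply‑accumulate (MAC) count of a single convolution layer; the layer shape is illustrative, not taken from the article:

```python
def conv2d_macs(h_out, w_out, c_in, c_out, k):
    """MACs for one conv layer: each output pixel in each output
    channel requires a k x k x c_in dot product."""
    return h_out * w_out * c_out * (k * k * c_in)

# Illustrative mid-network layer: 56x56 output, 64 -> 64 channels, 3x3 kernel
macs = conv2d_macs(56, 56, 64, 64, 3)
print(f"{macs / 1e6:.1f} M MACs for one layer")
```

Stacking dozens of such layers, and repeating the pass for every input, is what pushes deep models past what a CPU's handful of wide cores can sustain.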
Figure 1.1 shows the trend of deeper, more accurate deep‑learning models and the corresponding rise in compute requirements.
Because CPUs lack the specialized parallelism and bandwidth needed for deep‑learning inference and training, heterogeneous computing becomes essential.
A classic comparison positions CPUs, GPUs, FPGAs, and ASICs on a plane of programmability/flexibility versus development difficulty, degree of customization, compute efficiency, and power consumption.
Figure 1.2 illustrates the criteria for selecting a compute platform.
CPUs offer maximal flexibility at the cost of compute efficiency. GPUs focus on graphics and massive data parallelism, employing thousands of cores and large distributed caches, but their high‑bandwidth memory (HBM) still consumes significant power compared to FPGAs and ASICs. ASICs achieve the highest efficiency and lowest power for specific applications, yet their design, verification, and manufacturing costs are prohibitive for rapidly evolving deep‑learning models.
Consequently, many AI processor designs aim for a domain‑specific processor—an FPGA/ASIC hybrid that can efficiently run a class of models (e.g., CNNs) while retaining some programmability.
Section 2 – Embedded vs. Cloud AI Processors
AI processors have proliferated into two major categories: cloud‑side and edge‑side solutions. A publicly maintained list (https://basicmi.github.io/Deep-Learning-Processor-List/) catalogs these designs.
Figure 1.1 – Deep‑learning processor solution list.
Figure 1.3 depicts the evolution and design goals of AI processors.
Early AI processors targeted embedded front ends with modest model sizes. As models deepened, memory‑bandwidth (I/O) bottlenecks emerged, prompting solutions such as larger on‑chip caches, optimized scheduling, and low‑precision quantization. In the cloud era, massive parallelism intensified I/O pressure further, leading to the adoption of stacked‑memory technologies such as HBM and HMC, which dramatically increase near‑chip storage density and alleviate the I/O limit.
Two development stages are identified:
Stage 1 – Solving I/O bandwidth constraints (early AI processors, first‑generation TPUs, FPGA solutions, Cambricon ASICs).
Stage 2 – Solving compute scalability (leveraging HBM/HMC to enable full‑model on‑chip execution and massive parallelism).
2.2 Bandwidth Bottleneck
In Stage 1, adding parallel compute units quickly outruns memory. A single multiply‑accumulate (MAC) unit running at 500 MHz with 32‑bit operands already demands roughly 4 GB/s of memory bandwidth; a high‑end FPGA (Xilinx KU115) with 5,520 DSPs would therefore require ~22 TB/s, far exceeding the ~19.2 GB/s of a DDR4‑2400 DIMM. Solutions include higher cache reuse, shared caches, model simplification, low‑bit quantization, and sparsity.
Figure 1.4 – Bandwidth calculation for a single MAC unit.
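The figure's arithmetic can be reproduced in a few lines; the operand width and DDR4 number follow the text, while the helper name is our own:

```python
def mac_bandwidth_gbs(freq_hz, bytes_per_operand, operands_per_mac=2):
    """Worst-case bandwidth if every MAC fetches both operands from memory."""
    return freq_hz * operands_per_mac * bytes_per_operand / 1e9

per_mac = mac_bandwidth_gbs(500e6, 4)  # 32-bit operands -> 4.0 GB/s per MAC
total = per_mac * 5520                 # KU115's 5520 DSPs -> 22,080 GB/s (~22 TB/s)
ddr4 = 19.2                            # one DDR4-2400 DIMM, in GB/s
print(per_mac, total, total / ddr4)    # the gap is roughly 1,000x
```

Cache reuse and low‑bit quantization attack both factors: reuse cuts the operands that actually reach DRAM, and int8 quantization cuts `bytes_per_operand` by 4×.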
2.3 Compute Scaling
On‑chip caches, while providing high bandwidth, occupy roughly one third to two thirds of chip area, limiting the ability to scale compute proportionally. Figure 1.5 illustrates the cache area ratios in Google's first‑generation TPU (37 %) and Cambricon's DianNao ASIC (66.7 %).
Figure 1.5 – On‑chip cache proportion in the TPU and the DianNao ASIC.
Stacked memory technologies such as HBM expand storage density from megabytes to gigabytes and boost bandwidth >50×, removing I/O as the primary bottleneck and enabling full‑model placement on chip. However, HBM’s high‑cost process limits its use to large internet and semiconductor companies.
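A rough sense of why gigabyte‑scale stacked memory changes the picture; the ResNet‑50 parameter count and the SRAM/HBM capacities are commonly cited figures, not taken from the article:

```python
def model_mb(n_params, bytes_per_param):
    """Model weight footprint in megabytes."""
    return n_params * bytes_per_param / 1e6

resnet50_params = 25.6e6                # approximate ResNet-50 parameter count
fp32_mb = model_mb(resnet50_params, 4)  # ~102 MB in fp32
int8_mb = model_mb(resnet50_params, 1)  # ~26 MB after int8 quantization
# On-chip SRAM tops out at tens of MB (TPU v1: ~28 MB), so only quantized
# models fit; HBM stacks offer several GB, enough for whole fp32 models.
print(fp32_mb, int8_mb)
```

Once the entire model lives in HBM next to the compute die, off‑chip DDR traffic stops being the limiting factor, which is exactly the Stage 2 shift the article describes.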
Figure 1.6 – HBM and vertical chip‑stacking technology.
Future designs will focus on efficient compute architectures, scalable compute scales, and distributed computing capabilities to handle massive data and frequent interactions during training.
The discussion will be split into two follow‑up articles: (1) “Heterogeneous Acceleration for Deep Learning – Part 2: Diverse Solutions under Bandwidth Constraints” and (2) “Industrial Scale Compute Power Release”.
Tencent Architect