How GPUs Power AI: From Graphics to GPGPU Explained
This article explores how GPUs evolved from graphics accelerators to general‑purpose processors for AI, detailing the CPU‑GPU heterogeneous architecture, the CUDA programming workflow, compilation into fat binaries, kernel launch mechanics, hardware components, and the differences between SIMD and SIMT models, with performance comparisons and code examples.
Background
GPUs were originally designed for graphics rendering and later evolved into general‑purpose computing devices (GPGPU), first with programmable shaders (2001) and then with CUDA (2006), becoming essential for AI workloads.
Graphics Rendering to GPGPU
Graphics tasks are massively parallel operations over millions of pixels, a natural fit for parallel execution. NVIDIA introduced programmable shaders with the GeForce 3, enabling developers to write programs that run on the GPU itself.
CPU/GPU Heterogeneous Architecture
The CPU controls the system and issues commands to the GPU over PCIe, using MMIO for small transfers and DMA for large ones. The CPU also manages a unified virtual address space and multiple memory channels.
A Simple Application
The example adds two 1‑billion‑element float arrays, first on the CPU and then on the GPU, measuring execution time for each.
#include <iostream>
#include <math.h>
#include <chrono>

void add(int n, float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = x[i] + y[i];
    }
}

int main() {
    int N = 1 << 30;
    float *x = new float[N];
    float *y = new float[N];
    // Initialize inputs (values are illustrative).
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    // Run the CPU add and measure its wall-clock time.
    auto start = std::chrono::steady_clock::now();
    add(N, x, y);
    auto end = std::chrono::steady_clock::now();
    std::cout << "CPU add: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms" << std::endl;
    delete[] x;
    delete[] y;
    return 0;
}
The GPU version uses cudaMalloc, cudaMemcpy, a kernel launch, and cudaEvent timing:
#define CUDA_CHECK(call) \
do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA Error in %s at line %d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } \
} while (0)
__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        y[index] = x[index] + y[index];
    }
}

int main() {
    int N = 1 << 30;
    size_t bytes = N * sizeof(float);
    float *h_x = new float[N];
    float *h_y = new float[N];
    float *d_x, *d_y;
    CUDA_CHECK(cudaMalloc(&d_x, bytes));
    CUDA_CHECK(cudaMalloc(&d_y, bytes));
    // Initialize h_x/h_y, then copy the inputs to the device.
    CUDA_CHECK(cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice));
    add<<<(N + 255) / 256, 256>>>(N, d_x, d_y);  // kernel time measured with cudaEvents
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_x)); CUDA_CHECK(cudaFree(d_y));
    delete[] h_x; delete[] h_y;
    return 0;
}
Performance Results
CPU add takes ~3740 ms, total program time ~21 s (memory allocation dominates). GPU kernel executes in ~48 ms, total program time ~19 s, showing ~75× speed‑up for the compute kernel but similar overall runtime due to data‑transfer overhead.
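The CPU baseline above is straightforward to measure with std::chrono. A minimal host-side sketch (function names are illustrative, and the array size is kept small here so it runs quickly):

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Element-wise add, same as the CPU version in the article.
void add(int n, float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = x[i] + y[i];
    }
}

// Times one call of add() over n elements; returns elapsed milliseconds.
double time_add_ms(int n) {
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    auto t0 = std::chrono::steady_clock::now();
    add(n, x.data(), y.data());
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Scaling this harness up to the 1 << 30 elements used in the article reproduces the multi-second CPU baseline; the point of the comparison is that allocation and transfer, not arithmetic, dominate end-to-end time.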
Compilation – Fat Binary
nvcc produces host code (compiled by GCC/MSVC) and device code. Device code is emitted as PTX (portable) and SASS (architecture‑specific). Both are packaged into a “fat binary” containing multiple code versions for different GPU architectures.
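A fat binary is requested explicitly with repeated -gencode flags. A sketch of such a build line (the source file name is hypothetical; the architectures chosen are just examples):

```shell
# Embed SASS for two real architectures plus PTX for the newer one,
# so older GPUs run native code and future GPUs can JIT-compile the PTX.
nvcc add.cu -o add \
    -gencode arch=compute_70,code=sm_70 \
    -gencode arch=compute_80,code=sm_80 \
    -gencode arch=compute_80,code=compute_80
```

The code=sm_XX entries embed architecture-specific SASS, while code=compute_80 embeds PTX that the driver can JIT-compile on GPUs newer than any listed.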
Program Loading – Cubin Loading
When a kernel is first called, the CUDA runtime loads the appropriate SASS or JIT‑compiled PTX into GPU memory, creates a CUDA context, maps virtual addresses, and prepares the command buffer.
Program Execution – Kernel Launch
CPU writes launch parameters into a pinned‑memory command buffer, rings a doorbell register, and the GPU DMA engine fetches the commands. The GPU front‑end decodes the kernel launch, distributes thread blocks to SMs, which further split them into warps of 32 threads. Warps are scheduled on the SM’s warp scheduler.
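The number of blocks the front-end distributes is fixed at launch time by the host's grid calculation. A sketch of the usual ceiling-division formula, in plain C++ (the function name is illustrative):

```cpp
#include <cassert>

// Ceiling division: the smallest block count such that
// blocks * threadsPerBlock covers all n elements.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

With n = 1 << 30 and 256 threads per block this yields 4,194,304 blocks, each of which the SM further splits into 8 warps of 32 threads.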
GPU Hardware Architecture
A modern NVIDIA GPU consists of GPCs (Graphics Processing Clusters), each containing TPCs (Texture Processing Clusters), which contain SMs (Streaming Multiprocessors). An SM houses CUDA cores, Tensor cores, registers, shared memory, L1 cache, and a warp scheduler.
Programming Model vs Hardware Execution Model
CUDA exposes a Grid → Thread‑Block → Thread hierarchy. Grids and blocks can be 1‑D, 2‑D, or 3‑D, matching data layout. The kernel computes a global index with blockIdx.x * blockDim.x + threadIdx.x, enabling each thread to process a distinct element.
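The same global-index arithmetic can be checked on the host. A small sketch that enumerates a 1-D grid the way the hardware numbers threads (blockIdx/blockDim/threadIdx are modeled here as plain ints, not the CUDA built-ins):

```cpp
#include <cassert>
#include <vector>

// Host-side model of the CUDA index computation
//   index = blockIdx.x * blockDim.x + threadIdx.x
// Enumerates every (block, thread) pair and records its global index.
std::vector<int> globalIndices(int gridDim, int blockDim) {
    std::vector<int> out;
    for (int block = 0; block < gridDim; ++block) {
        for (int thread = 0; thread < blockDim; ++thread) {
            out.push_back(block * blockDim + thread);
        }
    }
    return out;
}
```

Every value from 0 to gridDim * blockDim - 1 appears exactly once, which is why each thread can safely process a distinct array element with no coordination.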
SIMD vs SIMT
CPU SIMD executes a single instruction on multiple data lanes. CUDA’s SIMT model lets many threads execute the same instruction simultaneously; the hardware still uses SIMD units (warps) but the programmer writes thread‑level code. Divergence occurs when threads in a warp follow different branches, causing serial execution of each path.
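The cost of divergence can be modeled on the host: the warp executes each taken branch path in turn, with lanes masked off on the path they did not take. A sketch in plain C++ that counts the passes the hardware would serialize (the function name and model are illustrative):

```cpp
#include <cassert>
#include <vector>

// Model: a warp of lanes, each holding a branch predicate. The hardware
// runs the 'then' path with the true lanes active, then the 'else' path
// with the false lanes active. Returns how many passes over the branch
// body that costs: 1 if all lanes agree, 2 if the warp diverges.
int divergencePasses(const std::vector<bool> &takesBranch) {
    bool anyTrue = false, anyFalse = false;
    for (bool t : takesBranch) {
        if (t) anyTrue = true;
        else anyFalse = true;
    }
    return (anyTrue && anyFalse) ? 2 : 1;
}
```

For example, a branch like if (index % 2 == 0) inside a warp splits its 32 lanes into two groups, forcing both paths to execute and roughly halving throughput for that section.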
Summary
Understanding the GPU's evolution from graphics to GPGPU, the CUDA programming workflow, the hardware architecture, and the execution model is essential for building high‑performance AI infrastructure. Grasping latency hiding, warp scheduling, and the memory hierarchy enables effective optimization.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.