Why GPUs Power Large‑Model Inference: From Graphics to GPGPU
This article explains how modern GPUs evolved from graphics rendering to general-purpose computing, details the heterogeneous CPU-GPU architecture, walks through a CUDA demo that adds two billion-element arrays, compares CPU and GPU performance, and describes the compilation, loading, and execution pipeline of CUDA kernels.
GPU: From Graphics Rendering to GPGPU
The GPU was originally designed to accelerate graphics rendering, a massively parallel task spread across millions of pixels.
Birth of Programmability (2001)
NVIDIA released the GeForce 3, introducing programmable shaders and allowing developers to write software for the GPU's many parallel processing units.
Academic Exploration
Researchers realized the GPU's many cores could perform scientific calculations, leading to the concept of GPGPU, though it required expertise in both graphics and scientific computing.
NVIDIA's Decision
NVIDIA embraced GPGPU, releasing the GeForce 8800 GTX (G80 architecture) in 2006, unifying compute units into a large flexible parallel core array.
In 2007 NVIDIA launched CUDA, a C‑like programming model that lets developers write code for thousands of GPU cores without dealing with graphics APIs.
CUDA has become a key part of the deep‑learning ecosystem.
CPU/GPU Heterogeneous Computing Architecture
The CPU is the system's master, assigning work to the GPU over PCIe using two communication modes: MMIO for small control commands and DMA for large data transfers.
MMIO (memory‑mapped I/O) lets the CPU directly read/write small amounts of data mapped into its virtual address space.
DMA (direct memory access) lets the device transfer large data blocks efficiently.
A Simple Application
A demo adds two arrays of 2³⁰ (~1 billion) floats, initializing one array to 1.0 and the other to 2.0, then computing y[i] = x[i] + y[i].
CPU Implementation
#include <iostream>
#include <chrono>
#include <cmath>

void add(int n, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

int main() {
    int N = 1 << 30;
    float *x = new float[N];
    float *y = new float[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    auto start = std::chrono::high_resolution_clock::now();
    add(N, x, y);
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    std::cout << "CPU 'add' execution time: " << duration.count() << " ms\n";

    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = std::fmax(maxError, std::fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    delete[] x; delete[] y;
    return 0;
}
Performance
Core add function time: 3740 ms
Total program time (real): 21.4 s (mostly memory allocation and initialization)
GPU Implementation
The GPU version follows four steps: allocate memory on host and device, copy data host‑to‑device, launch the kernel, copy results back.
#include <iostream>
#include <cmath>
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call) do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA Error in %s at line %d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } \
} while (0)

__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) y[index] = x[index] + y[index];
}

int main() {
    int N = 1 << 30;
    size_t bytes = N * sizeof(float);

    float *h_x = new float[N];
    float *h_y = new float[N];
    float *d_x, *d_y;
    CUDA_CHECK(cudaMalloc(&d_x, bytes));
    CUDA_CHECK(cudaMalloc(&d_y, bytes));

    for (int i = 0; i < N; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }
    CUDA_CHECK(cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));
    CUDA_CHECK(cudaEventRecord(start));

    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, d_x, d_y);
    CUDA_CHECK(cudaGetLastError()); // catch kernel-launch errors

    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    float milliseconds = 0;
    CUDA_CHECK(cudaEventElapsedTime(&milliseconds, start, stop));
    std::cout << "GPU Kernel 'add' execution time: " << milliseconds << " ms\n";
    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));

    CUDA_CHECK(cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost));
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = std::fmax(maxError, std::fabs(h_y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    delete[] h_x; delete[] h_y;
    CUDA_CHECK(cudaFree(d_x));
    CUDA_CHECK(cudaFree(d_y));
    return 0;
}
Performance
Kernel execution time: 48.7 ms
Total program time (real): 19.4 s
The GPU kernel is roughly 75× faster than the CPU loop (48.7 ms vs 3740 ms), yet total program time is nearly identical because it is dominated by host-side allocation, initialization, and PCIe transfers.
Compilation – Fat Binary
nvcc compiles both host code (via GCC/MSVC) and device code. Device code is emitted as SASS (architecture‑specific machine code) and PTX (portable intermediate code). Both are packaged into a fat binary.
Program Loading
The OS loads the executable, the CPU starts host code, and the first CUDA call initializes the runtime library and creates a CUDA context.
First Kernel Launch
The driver matches the GPU’s architecture, selects matching SASS or JIT‑compiles PTX, caches the result, and loads the cubin into GPU memory.
Program Execution – Kernel Launch
CPU writes commands into a pinned‑memory ring buffer, rings a doorbell via MMIO, and the GPU DMA‑copies the buffer and executes commands asynchronously.
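This handshake can be sketched as a toy command ring in plain C++ (a conceptual model with invented names, not actual driver code): the CPU appends commands and "rings the doorbell" by publishing the new write pointer; the GPU drains everything between the read and write pointers.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Toy model of a command ring buffer shared between CPU and GPU.
struct CommandRing {
    std::vector<std::uint32_t> slots;
    std::size_t rd = 0, wr = 0;                   // read/write pointers
    explicit CommandRing(std::size_t n) : slots(n) {}

    // CPU side: enqueue a command and publish the new write pointer
    // (the real doorbell is an MMIO write the GPU watches).
    bool push(std::uint32_t cmd) {
        if ((wr + 1) % slots.size() == rd) return false;  // ring full
        slots[wr] = cmd;
        wr = (wr + 1) % slots.size();
        return true;
    }

    // GPU side: consume all pending commands (fetched via DMA in hardware).
    std::size_t drain() {
        std::size_t n = 0;
        while (rd != wr) { rd = (rd + 1) % slots.size(); ++n; }
        return n;
    }
};
```

The key property the sketch captures is asynchrony: the CPU returns as soon as the command is in the ring, and the GPU consumes work at its own pace.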
CPU Side – cudaMalloc
cudaMalloc synchronously requests GPU memory; the driver allocates physical VRAM, creates a virtual address mapping, and returns a device pointer.
CPU Side – cudaMemcpy / cudaMemset
These operations are submitted through the command buffer and may be synchronous or asynchronous.
CPU Side – Kernel Call
The driver packages kernel launch parameters (grid, block, shared memory, argument pointers) into a command, writes it to the ring buffer, and rings the doorbell.
GPU Side – Command Processing
The GPU fetches commands via DMA, decodes them, distributes thread blocks to SMs, splits blocks into warps, schedules warps, and executes the SASS instructions.
Resource Release
When a thread block finishes, its SM resources are freed; when all blocks finish, the kernel launch completes.
GPU Hardware Architecture
A modern NVIDIA GPU consists of multiple GPCs (Graphics Processing Clusters), each containing several TPCs (Texture Processing Clusters), which in turn contain SMs (Streaming Multiprocessors).
SM Components
CUDA cores (FP32/INT32 ALUs)
Tensor cores (dedicated matrix multiply-accumulate units)
Register file, shared memory, L1 cache
Warp scheduler
Memory hierarchy (fast to slow): registers → L1 cache → L2 cache → HBM → host DRAM.
Programming Model vs Hardware Execution Model
CUDA’s programming model maps data to a Grid → Thread Block → Thread hierarchy. The hardware executes Grid → SM → Warp → Core.
Each thread computes a global index (int index = blockIdx.x * blockDim.x + threadIdx.x), so consecutive threads touch consecutive elements, enabling coalesced memory accesses.
SIMD vs SIMT
CPU SIMD executes the same instruction on multiple data lanes in lockstep. CUDA uses SIMT: a warp issues a single instruction to 32 threads, and divergent control flow is handled via active masks (and, since Volta, per-thread program counters).
Warp Divergence
In pre-Volta GPUs a warp shared one program counter, so the two sides of a divergent branch were executed serially under complementary active masks. Volta and later GPUs give each thread its own program counter and stack, enabling independent thread scheduling; separately, SMs hide latency by switching among ready warps whenever one stalls.
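Pre-Volta divergence handling can be mimicked on the host: execute each side of the branch in a separate pass under a complementary active mask. A conceptual C++ sketch (names invented here; real hardware applies masks and reconverges at the SASS level):

```cpp
#include <array>
#include <cstdint>

constexpr int kWarpSize = 32;

// Run one divergent "if (even) x *= 2; else x += 1;" the way a pre-Volta
// warp does: two serial passes, each with only some lanes active.
// Returns the number of issue passes consumed.
int run_divergent(std::array<int, kWarpSize>& data) {
    std::uint32_t taken = 0;
    for (int i = 0; i < kWarpSize; ++i)
        if (data[i] % 2 == 0) taken |= (1u << i);   // lanes taking the branch

    int passes = 0;

    ++passes;                                       // pass 1: "if" lanes active
    for (int i = 0; i < kWarpSize; ++i)
        if (taken & (1u << i)) data[i] *= 2;

    ++passes;                                       // pass 2: "else" lanes active
    for (int i = 0; i < kWarpSize; ++i)
        if (!(taken & (1u << i))) data[i] += 1;

    return passes;
}
```

Both passes consume issue slots even though each lane only does useful work in one of them, which is why divergence within a warp halves (or worse) effective throughput.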
Synchronization
__syncthreads() synchronizes all threads in a block; __syncwarp() synchronizes the 32 threads of a warp.
Summary
GPU evolved from graphics to GPGPU.
CPU‑GPU collaboration uses command buffers and doorbells.
CUDA lifecycle: compilation, loading, context creation, kernel launch.
CUDA programming model (Grid‑Block‑Thread) maps to hardware (GPU‑SM‑Warp‑Core).
SIMT hides SIMD complexity; warp‑level and thread‑level latency hiding with synchronization primitives.
Future topics include streams, unified memory, MPS, and performance optimization.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.