How GPUs Power AI: From Graphics to GPGPU Explained
This article explores how GPUs evolved from graphics accelerators to general‑purpose processors for AI, detailing the CPU‑GPU heterogeneous architecture, the CUDA programming workflow, compilation into fat binaries, kernel launch mechanics, hardware components, and the differences between SIMD and SIMT models, with performance comparisons and code examples.
Background
GPUs were originally designed for graphics rendering and later evolved into general‑purpose computing devices (GPGPU), first with programmable shaders (2001) and then with CUDA (2006), becoming essential for AI workloads.
Graphics Rendering to GPGPU
Graphics tasks are massively parallel operations over millions of pixels, a natural fit for parallel execution. NVIDIA introduced programmable shaders with the GeForce 3, enabling developers to write programs that run on the GPU itself.
CPU/GPU Heterogeneous Architecture
The CPU controls the system and issues commands to the GPU over PCIe, using MMIO for small transfers and DMA for large ones. The CPU also manages a unified virtual address space and multiple memory channels.
A Simple Application
The example adds two 1‑billion‑element float arrays, first on the CPU and then on the GPU, measuring execution time for each.
#include <iostream>
#include <math.h>
#include <chrono>

void add(int n, float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = x[i] + y[i];
    }
}

int main() {
    int N = 1 << 30;
    float *x = new float[N];
    float *y = new float[N];
    // Initialize inputs (values are illustrative).
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    // Run the CPU add and measure its wall-clock time.
    auto start = std::chrono::steady_clock::now();
    add(N, x, y);
    auto end = std::chrono::steady_clock::now();
    std::cout << "CPU add: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms" << std::endl;
    delete[] x;
    delete[] y;
    return 0;
}
The GPU version uses cudaMalloc, cudaMemcpy, a kernel launch, and cudaEvent timing:
#define CUDA_CHECK(call) \
do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA Error in %s at line %d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } \
} while (0)
__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        y[index] = x[index] + y[index];
    }
}

int main() {
    int N = 1 << 30;
    size_t bytes = N * sizeof(float);
    float *h_x = new float[N];
    float *h_y = new float[N];
    float *d_x, *d_y;
    CUDA_CHECK(cudaMalloc(&d_x, bytes));
    CUDA_CHECK(cudaMalloc(&d_y, bytes));
    // Initialize h_x/h_y, then copy the inputs to the device.
    CUDA_CHECK(cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice));
    add<<<(N + 255) / 256, 256>>>(N, d_x, d_y);  // kernel time measured with cudaEvents
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_x)); CUDA_CHECK(cudaFree(d_y));
    delete[] h_x; delete[] h_y;
    return 0;
}
Performance Results
CPU add takes ~3740 ms, total program time ~21 s (memory allocation dominates). GPU kernel executes in ~48 ms, total program time ~19 s, showing ~75× speed‑up for the compute kernel but similar overall runtime due to data‑transfer overhead.
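The CPU baseline above is straightforward to measure with std::chrono. A minimal host-side sketch (function names are illustrative, and the array size is kept small here so it runs quickly):

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Element-wise add, same as the CPU version in the article.
void add(int n, float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = x[i] + y[i];
    }
}

// Times one call of add() over n elements; returns elapsed milliseconds.
double time_add_ms(int n) {
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    auto t0 = std::chrono::steady_clock::now();
    add(n, x.data(), y.data());
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Scaling this harness up to the 1 << 30 elements used in the article reproduces the multi-second CPU baseline; the point of the comparison is that allocation and transfer, not arithmetic, dominate end-to-end time.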
Compilation – Fat Binary
nvcc produces host code (compiled by GCC/MSVC) and device code. Device code is emitted as PTX (portable) and SASS (architecture‑specific). Both are packaged into a “fat binary” containing multiple code versions for different GPU architectures.
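A fat binary is requested explicitly with repeated -gencode flags. A sketch of such a build line (the source file name is hypothetical; the architectures chosen are just examples):

```shell
# Embed SASS for two real architectures plus PTX for the newer one,
# so older GPUs run native code and future GPUs can JIT-compile the PTX.
nvcc add.cu -o add \
    -gencode arch=compute_70,code=sm_70 \
    -gencode arch=compute_80,code=sm_80 \
    -gencode arch=compute_80,code=compute_80
```

The code=sm_XX entries embed architecture-specific SASS, while code=compute_80 embeds PTX that the driver can JIT-compile on GPUs newer than any listed.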
Program Loading – Cubin Loading
When a kernel is first called, the CUDA runtime loads the appropriate SASS or JIT‑compiled PTX into GPU memory, creates a CUDA context, maps virtual addresses, and prepares the command buffer.
Program Execution – Kernel Launch
CPU writes launch parameters into a pinned‑memory command buffer, rings a doorbell register, and the GPU DMA engine fetches the commands. The GPU front‑end decodes the kernel launch, distributes thread blocks to SMs, which further split them into warps of 32 threads. Warps are scheduled on the SM’s warp scheduler.
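The number of blocks the front-end distributes is fixed at launch time by the host's grid calculation. A sketch of the usual ceiling-division formula, in plain C++ (the function name is illustrative):

```cpp
#include <cassert>

// Ceiling division: the smallest block count such that
// blocks * threadsPerBlock covers all n elements.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

With n = 1 << 30 and 256 threads per block this yields 4,194,304 blocks, each of which the SM further splits into 8 warps of 32 threads.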
GPU Hardware Architecture
A modern NVIDIA GPU consists of GPCs (Graphics Processing Clusters), each containing TPCs (Texture Processing Clusters), which contain SMs (Streaming Multiprocessors). An SM houses CUDA cores, Tensor cores, registers, shared memory, L1 cache, and a warp scheduler.
Programming Model vs Hardware Execution Model
CUDA exposes a Grid → Thread‑Block → Thread hierarchy. Grids and blocks can be 1‑D, 2‑D, or 3‑D, matching data layout. The kernel computes a global index with blockIdx.x * blockDim.x + threadIdx.x, enabling each thread to process a distinct element.
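The same global-index arithmetic can be checked on the host. A small sketch that enumerates a 1-D grid the way the hardware numbers threads (blockIdx/blockDim/threadIdx are modeled here as plain ints, not the CUDA built-ins):

```cpp
#include <cassert>
#include <vector>

// Host-side model of the CUDA index computation
//   index = blockIdx.x * blockDim.x + threadIdx.x
// Enumerates every (block, thread) pair and records its global index.
std::vector<int> globalIndices(int gridDim, int blockDim) {
    std::vector<int> out;
    for (int block = 0; block < gridDim; ++block) {
        for (int thread = 0; thread < blockDim; ++thread) {
            out.push_back(block * blockDim + thread);
        }
    }
    return out;
}
```

Every value from 0 to gridDim * blockDim - 1 appears exactly once, which is why each thread can safely process a distinct array element with no coordination.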
SIMD vs SIMT
CPU SIMD executes a single instruction on multiple data lanes. CUDA’s SIMT model lets many threads execute the same instruction simultaneously; the hardware still uses SIMD units (warps) but the programmer writes thread‑level code. Divergence occurs when threads in a warp follow different branches, causing serial execution of each path.
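The cost of divergence can be modeled on the host: the warp executes each taken branch path in turn, with lanes masked off on the path they did not take. A sketch in plain C++ that counts the passes the hardware would serialize (the function name and model are illustrative):

```cpp
#include <cassert>
#include <vector>

// Model: a warp of lanes, each holding a branch predicate. The hardware
// runs the 'then' path with the true lanes active, then the 'else' path
// with the false lanes active. Returns how many passes over the branch
// body that costs: 1 if all lanes agree, 2 if the warp diverges.
int divergencePasses(const std::vector<bool> &takesBranch) {
    bool anyTrue = false, anyFalse = false;
    for (bool t : takesBranch) {
        if (t) anyTrue = true;
        else anyFalse = true;
    }
    return (anyTrue && anyFalse) ? 2 : 1;
}
```

For example, a branch like if (index % 2 == 0) inside a warp splits its 32 lanes into two groups, forcing both paths to execute and roughly halving throughput for that section.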
Summary
Understanding the GPU's evolution from graphics to GPGPU, the CUDA programming workflow, the hardware architecture, and the execution model is essential for building high‑performance AI infrastructure. Grasping latency hiding, warp scheduling, and the memory hierarchy enables effective optimization.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.