Why GPUs Power Large‑Model Inference: From Graphics to GPGPU

This article explains how modern GPUs evolved from graphics rendering to general‑purpose computing, details the CPU‑GPU heterogeneous architecture, walks through a CUDA demo that adds two arrays of a billion floats each, compares CPU and GPU performance, and describes the compilation, loading, and execution pipeline of CUDA kernels.


The GPU: From Graphics Rendering to GPGPU

The GPU was originally designed to accelerate graphics rendering, a massively parallel task spread across millions of pixels.

Birth of Programmability (2001)

NVIDIA released the GeForce 3, which introduced programmable shaders and allowed developers to write software targeting the GPU's many parallel processing units.

Academic Exploration

Researchers realized the GPU's many cores could also perform scientific calculations, giving rise to GPGPU (general‑purpose computing on GPUs), though early GPGPU work required expertise in both graphics APIs and scientific computing.

NVIDIA's Decision

NVIDIA embraced GPGPU, releasing the GeForce 8800 GTX (G80 architecture) in 2006, which unified its compute units into one large, flexible array of parallel cores.

In 2007 NVIDIA launched CUDA, a C‑like programming model that lets developers write code for thousands of GPU cores without dealing with graphics APIs.

CUDA has become a key part of the deep‑learning ecosystem.

CPU/GPU Heterogeneous Computing Architecture

The CPU is the system's master, assigning work to the GPU via PCIe. Communication uses two modes: MMIO for small commands and DMA for large data transfers.

MMIO (memory‑mapped I/O) lets the CPU directly read/write small amounts of data mapped into its virtual address space.

DMA (direct memory access) lets the device transfer large data blocks efficiently.

A Simple Application

The demo adds two arrays of 2³⁰ (about one billion) floats each: one array is initialized to 1.0, the other to 2.0, and each element is updated as y[i] = x[i] + y[i].

CPU Implementation

#include <iostream>
#include <chrono>
#include <cmath>   // std::fmax, std::fabs

// Sequentially add the elements of x into y.
void add(int n, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

int main() {
    int N = 1 << 30;   // 2^30 elements (~1 billion)
    float *x = new float[N];
    float *y = new float[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    auto start = std::chrono::high_resolution_clock::now();
    add(N, x, y);
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    std::cout << "CPU 'add' execution time: " << duration.count() << " ms\n";

    // Every element should be exactly 3.0f; report the worst deviation.
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = std::fmax(maxError, std::fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    delete[] x; delete[] y;
    return 0;
}

Performance

Core add function time: 3740 ms

Total program time (real): 21.4 s (mostly memory allocation and initialization)

GPU Implementation

The GPU version follows four steps: allocate memory on the host and the device, copy the input data host‑to‑device, launch the kernel, and copy the results back device‑to‑host.

#include <iostream>
#include <cstdio>    // fprintf
#include <cstdlib>   // exit, EXIT_FAILURE
#include <cmath>     // std::fmax, std::fabs

// Abort with a readable message if a CUDA API call fails.
#define CUDA_CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA Error in %s at line %d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } \
} while (0)

// Kernel: each thread adds one pair of elements.
__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) y[index] = x[index] + y[index];
}

int main() {
    int N = 1 << 30;                    // 2^30 elements (~1 billion)
    size_t bytes = N * sizeof(float);   // 4 GiB per array

    // Step 1: allocate host and device memory.
    float *h_x = new float[N];
    float *h_y = new float[N];
    float *d_x, *d_y;
    CUDA_CHECK(cudaMalloc(&d_x, bytes));
    CUDA_CHECK(cudaMalloc(&d_y, bytes));

    for (int i = 0; i < N; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    // Step 2: copy input data host-to-device.
    CUDA_CHECK(cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice));

    // Step 3: launch the kernel, timing it with CUDA events.
    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));
    CUDA_CHECK(cudaEventRecord(start));
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;   // round up to cover all N
    add<<<numBlocks, blockSize>>>(N, d_x, d_y);
    CUDA_CHECK(cudaGetLastError());   // catch launch-configuration errors
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    float milliseconds = 0;
    CUDA_CHECK(cudaEventElapsedTime(&milliseconds, start, stop));
    std::cout << "GPU Kernel 'add' execution time: " << milliseconds << " ms\n";
    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));

    // Step 4: copy the result back device-to-host and verify.
    CUDA_CHECK(cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost));
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = std::fmax(maxError, std::fabs(h_y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    delete[] h_x; delete[] h_y;
    CUDA_CHECK(cudaFree(d_x));
    CUDA_CHECK(cudaFree(d_y));
    return 0;
}

Performance

Kernel execution time: 48.7 ms

Total program time (real): 19.4 s

The GPU kernel is about 75× faster than the CPU loop, but overall runtime is similar because host‑side allocation and initialization plus PCIe transfers dominate both programs.

Compilation – Fat Binary

nvcc splits the source: host code is handed to the host compiler (GCC/MSVC), while device code is compiled to SASS (architecture‑specific machine code) and PTX (a portable intermediate representation). Both are embedded alongside the host binary in a single fat binary.
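
To make this concrete, here is a hedged example of a typical build command: the -gencode flags embed SASS for two specific architectures plus PTX for forward compatibility (the file name and architecture numbers are illustrative, not from the original demo):

# Embed SASS for sm_80 (Ampere) and sm_90 (Hopper), plus compute_80 PTX
# so newer, unknown GPUs can still JIT-compile the kernel at load time.
nvcc vector_add.cu -o vector_add \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -gencode arch=compute_80,code=compute_80

The cuobjdump tool shipped with the CUDA toolkit can list the SASS and PTX sections embedded in the resulting fat binary.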

Program Loading

The OS loads the executable and the CPU begins executing host code; the first CUDA API call initializes the runtime library and creates a CUDA context.
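
Because initialization is lazy, the first CUDA call absorbs the context‑creation cost. A minimal sketch of forcing it eagerly (cudaFree(0) is a common no‑op idiom for this; assumes device 0 exists):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Only the runtime library is loaded so far; no context exists yet.
    cudaSetDevice(0);   // choose device 0 (the default)
    cudaFree(0);        // first runtime call: triggers context creation
    printf("CUDA context initialized.\n");
    return 0;
}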

First Kernel Launch

The driver inspects the installed GPU's architecture, selects matching SASS from the fat binary if present, otherwise JIT‑compiles the PTX and caches the result on disk, then loads the resulting cubin into GPU memory.

Program Execution – Kernel Launch

The CPU writes commands into a pinned‑memory ring buffer and rings a doorbell register via MMIO; the GPU then fetches the buffered commands via DMA and executes them asynchronously.
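
This queue‑and‑doorbell design is why kernel launches are asynchronous on the host side. A small sketch that makes the asynchrony visible (the spin kernel and its iteration count are illustrative):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin() {
    // Busy-loop long enough that the GPU is still working when the
    // host-side launch call has already returned.
    for (volatile int i = 0; i < 100000000; i++) {}
}

int main() {
    cudaFree(0);  // create the context up front so it doesn't skew timing
    auto t0 = std::chrono::steady_clock::now();
    spin<<<1, 1>>>();              // returns once the command is queued
    auto t1 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();       // blocks until the GPU drains the queue
    auto t2 = std::chrono::steady_clock::now();
    using ms = std::chrono::duration<double, std::milli>;
    printf("launch returned after %.3f ms, kernel done after %.3f ms\n",
           ms(t1 - t0).count(), ms(t2 - t0).count());
    return 0;
}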

CPU Side – cudaMalloc

cudaMalloc synchronously requests GPU memory; the driver allocates physical VRAM, creates a virtual address mapping, and returns a device pointer.

CPU Side – cudaMemcpy / cudaMemset

These operations are submitted through the command buffer; depending on the API variant used, they complete synchronously or asynchronously with respect to the host.
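
For example, plain cudaMemcpy blocks the host until the copy completes, while cudaMemcpyAsync on a stream returns immediately; the host buffer should be pinned (page‑locked) so the DMA engine can transfer it directly. A hedged sketch:

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;    // 1 MiB (illustrative size)
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);   // pinned host memory, DMA-friendly
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Queued on the stream; the host thread continues immediately.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

    // ... the CPU is free to do other work here ...

    cudaStreamSynchronize(stream);   // wait for the copy to complete
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}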

CPU Side – Kernel Call

The driver packages kernel launch parameters (grid, block, shared memory, argument pointers) into a command, writes it to the ring buffer, and rings the doorbell.
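
The launch syntax exposes those parameters directly. A sketch reusing the add kernel and buffers from the demo above (the sizes are illustrative):

// add<<<grid, block, dynamicSharedBytes, stream>>>(args...);
dim3 block(256);                          // threads per block
dim3 grid((N + block.x - 1) / block.x);   // blocks covering all N elements
size_t shmemBytes = 0;                    // dynamic shared memory per block
cudaStream_t stream = 0;                  // default stream
add<<<grid, block, shmemBytes, stream>>>(N, d_x, d_y);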

GPU Side – Command Processing

The GPU fetches commands via DMA, decodes them, distributes thread blocks to SMs, splits blocks into warps, schedules warps, and executes the SASS instructions.

Resource Release

When a thread block finishes, its SM resources are freed; when all blocks finish, the kernel launch completes.

GPU Hardware Architecture

A modern NVIDIA GPU consists of multiple GPCs (Graphics Processing Clusters), each containing several TPCs (Texture Processing Clusters), which in turn contain SMs (Streaming Multiprocessors).

SM Components

CUDA cores (FP32/INT32 ALUs)

Tensor cores (fused multiply‑accumulate units specialized for matrix math)

Register file, shared memory, L1 cache

Warp scheduler

Memory hierarchy (fast to slow): registers → L1 cache → L2 cache → HBM → host DRAM.
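
The hierarchy shows up directly in kernel code: loads from HBM go through L2/L1, __shared__ arrays live in the SM's shared memory, and scalar locals sit in registers. A hedged sketch (assumes blockDim.x == 256 and a grid that exactly covers the data):

__global__ void reverseWithinBlock(float *data) {
    __shared__ float tile[256];     // per-block shared memory on the SM
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];              // global-memory (HBM) load, the slow tier
    tile[threadIdx.x] = v;          // stage into fast on-chip shared memory
    __syncthreads();                // wait until the whole tile is loaded
    // Scalars like v and i live in registers, the fastest tier.
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}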

Programming Model vs Hardware Execution Model

CUDA's programming model organizes work into a Grid → Thread Block → Thread hierarchy; the hardware executes it as Grid → SM → Warp → Core.

Each thread computes a global index with int index = blockIdx.x * blockDim.x + threadIdx.x; adjacent threads thus access adjacent elements, enabling coalesced memory accesses.
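
When N exceeds the number of launched threads, a common variant is the grid‑stride loop, which lets a fixed‑size grid cover arrays of any length while keeping accesses coalesced:

__global__ void add(int n, float *x, float *y) {
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;   // total threads in the grid
    // Each thread processes elements index, index+stride, index+2*stride, ...
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}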

SIMD vs SIMT

CPU SIMD executes the same instruction on multiple data lanes. CUDA uses SIMT: a warp issues a single instruction, but each thread has its own program counter, allowing divergent control flow via active masks.

Warp Divergence

In pre‑Volta GPUs, all threads of a warp shared a single program counter, so divergent branches executed serially. Volta and later architectures give each thread an independent program counter and active mask, and the warp scheduler hides latency by switching to warps that are ready to run.
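
A minimal illustration of divergence (the kernel is a sketch): even‑ and odd‑indexed threads of the same warp take different branches, so the warp issues both paths with complementary active masks:

__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] *= 2.0f;        // even lanes active, odd lanes masked off
    else
        data[i] += 1.0f;        // odd lanes active, even lanes masked off
}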

Synchronization

__syncthreads() synchronizes all threads in a block; __syncwarp() synchronizes the 32 threads of a warp.
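
A classic place where both appear is a block‑level reduction: __syncthreads() guards the phases where different warps communicate through shared memory, and once the active threads fit in a single warp, the cheaper __syncwarp() suffices. A hedged sketch assuming blockDim.x == 256:

__global__ void blockSum(const float *in, float *out) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                      // all loads visible block-wide

    // Tree reduction across warps: block-wide barrier after each step.
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // The remaining work fits in one warp: a warp-level barrier is enough.
    for (int s = 32; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncwarp();                     // synchronize the 32 warp lanes
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial sum per block
}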

Summary

GPUs evolved from graphics rendering to general‑purpose computing (GPGPU).

CPU‑GPU collaboration uses command buffers and doorbells.

CUDA lifecycle: compilation, loading, context creation, kernel launch.

CUDA programming model (Grid‑Block‑Thread) maps to hardware (GPU‑SM‑Warp‑Core).

SIMT hides SIMD‑style complexity from the programmer; the hardware hides latency at the warp and thread level, coordinated through synchronization primitives.

Future topics include streams, unified memory, MPS, and performance optimization.

