What Powers Modern Graphics? A Deep Dive into GPU History and Architecture
This article traces the evolution of GPUs from early graphics chips to modern parallel processors, explains their internal pipeline, compares CPU and GPU architectures, and introduces key acceleration frameworks like CUDA and OpenCL for general‑purpose computing.
1. Origin of GPUs
A GPU (Graphics Processing Unit) is a processor found in embedded systems, mobile phones, personal computers, workstations and game consoles. Its highly parallel architecture makes it far more efficient than a general-purpose CPU for data-parallel algorithms.
ATI was founded in August 1985 and released its first ASIC-based graphics chip and card later that year. In April 1992 it launched the Mach32, a graphics card with hardware acceleration, and by April 1998 it was recognized as a market leader. The term “GPU” had not yet been adopted at that point: ATI called its chips VPUs and took up the GPU name only after it was acquired by AMD.
In 1999 NVIDIA introduced the GeForce 256 and, with it, the term GPU. The chip took over work that the CPU had previously performed, especially in 3D graphics, and so reduced the graphics card's reliance on the CPU. Its key technologies included hardware Transform & Lighting (T&L), texture mapping, bump mapping and dual-texture 256-bit rendering; hardware T&L in particular became a hallmark of GPUs.
2. How GPUs Work
2.1 Graphics pipeline
Vertex processing: reads vertex data, computes shape and position, and builds the 3D skeleton. Implemented by the Vertex Shader in DirectX 8/9 GPUs.
Rasterization: converts geometric primitives into pixel fragments, mapping vectors to screen pixels.
Texture mapping: applies image data to polygon surfaces via the Texture Mapping Unit (TMU) to produce realistic visuals.
Pixel processing: executes per-pixel calculations using the Pixel Shader and writes the final colour through the Render Output Unit (ROP) to the frame buffer.
2.2 CPU versus GPU
CPUs are optimized for low-latency serial execution of general-purpose instruction streams (x86, for example) and offer limited parallelism and memory bandwidth for multimedia workloads. SIMD extensions such as Intel SSE add some data parallelism, but they cannot match the thousands of small cores that a GPU dedicates to data-parallel tasks.
Architecturally, CPUs contain complex control logic, large caches and a few powerful cores for diverse tasks. GPUs consist mainly of many stream processors, simple control units and high‑throughput memory controllers, enabling massive floating‑point throughput.
3. GPU Acceleration Technologies
3.1 CUDA
In 2006 NVIDIA introduced CUDA (Compute Unified Device Architecture), a general-purpose parallel computing platform that lets developers program GPUs in a C-based language. CUDA comprises an instruction set architecture (ISA), the parallel compute engine inside the GPU, and a set of libraries such as cuFFT (fast Fourier transforms) and cuBLAS (basic linear algebra).
A CUDA program consists of host code (executed on the CPU) and device code (executed on the GPU). The runtime API provides functions for memory allocation, data transfer, kernel launch and synchronization. Compilation is performed with nvcc.
Supported languages include C, C++, Fortran and, via wrappers, Python, Java, MATLAB and others. A minimal kernel definition with a host-side skeleton looks like this:
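#include <cuda_runtime.h>

// Each thread adds one pair of elements; idx is its global index.
__global__ void add(const int *a, const int *b, int *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {  // guard threads that fall past the end of the arrays
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    // Host side: allocate device memory (cudaMalloc), copy the inputs
    // (cudaMemcpy), launch add<<<blocks, threads>>>(...), then copy the
    // result back and release the buffers (cudaFree).
    return 0;
}
The CUDA driver acts as a hardware abstraction layer between applications and the GPU, which is intended to let existing programs keep running on later NVIDIA GPU generations.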
3.2 OpenCL
OpenCL (Open Computing Language) is an open, royalty‑free standard maintained by the Khronos Group for heterogeneous computing across CPUs, GPUs, DSPs and FPGAs. It defines a C‑like kernel language and an API for device discovery, context creation, command‑queue management and synchronization.
Unlike CUDA, which runs only on NVIDIA GPUs, OpenCL targets any parallel processor, providing portability across vendors. An OpenCL program also separates host and kernel code; the host uses the OpenCL API to compile kernels at runtime and enqueue them on one or more devices.
Typical OpenCL workflow:
Query platforms and devices with clGetPlatformIDs and clGetDeviceIDs.
Create a context (clCreateContext) and a command queue (clCreateCommandQueue, or clCreateCommandQueueWithProperties from OpenCL 2.0 onward).
Write the kernel source, load it with clCreateProgramWithSource, compile it with clBuildProgram, and create a kernel object (clCreateKernel).
Allocate buffers (clCreateBuffer), transfer input data (clEnqueueWriteBuffer), set the kernel arguments (clSetKernelArg), launch the kernel (clEnqueueNDRangeKernel), and read the results back (clEnqueueReadBuffer).
OpenCL supports both task‑parallel and data‑parallel models, extending GPU usage beyond graphics to scientific computing, image processing, machine learning and other general‑purpose workloads.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.