What Powers Modern Graphics? A Deep Dive into GPU History and Architecture

This article traces the evolution of GPUs from early graphics chips to modern parallel processors, explains their internal pipeline, compares CPU and GPU architectures, and introduces key acceleration frameworks like CUDA and OpenCL for general‑purpose computing.

Architects' Tech Alliance

1. Origin of GPUs

GPU stands for Graphics Processing Unit. GPUs are used in embedded systems, mobile phones, personal computers, workstations and game consoles, and their highly parallel architecture makes them far more efficient than general-purpose CPUs for data-parallel algorithms.

ATI was founded in August 1985 and released its first ASIC-based graphics chip and card later that year. In April 1992 it launched the Mach32, a graphics card with hardware acceleration, and by April 1998 it was recognized as a market leader. The term “GPU” had not yet been adopted at that point: ATI called its chips VPUs (visual processing units) and took up the GPU name only after AMD acquired the company.

In 1999 NVIDIA introduced the GeForce 256 and popularized the term GPU by marketing the chip under that name. The GPU reduced reliance on the CPU by taking over work the CPU had formerly performed, especially in 3D graphics. Key technologies included hardware Transform & Lighting (T&L), texture mapping, bump mapping and dual-texture 256-bit rendering; hardware T&L became a hallmark of GPUs.

2. How GPUs Work

2.1 Graphics pipeline

Vertex processing: reads vertex data, computes the shape and position of each object, and builds the 3D skeleton. In DirectX 8/9 GPUs this stage is implemented by the Vertex Shader.

Rasterization: converts geometric primitives into pixel fragments, mapping vector geometry onto the screen's pixel grid.

Texture mapping: applies image data to polygon surfaces through the Texture Mapping Unit (TMU) to produce realistic visuals.

Pixel processing: runs per-pixel calculations in the Pixel Shader, then writes the final colour to the frame buffer through the Render Output Unit (ROP).
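
The four stages can be traced in software for a single triangle. The following is a minimal C++ sketch, not how any real GPU or driver is implemented: the 8x8 frame buffer, the checkerboard "texture" and all names (`process_vertex`, `edge`, `sample_texture`, `draw_triangle`) are assumptions made for illustration, and the vertex stage is reduced to a viewport mapping where real hardware applies full matrix transforms.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };
struct Vertex { Vec2 pos; Vec2 uv; };  // position plus texture coordinate

constexpr int W = 8, H = 8;            // tiny frame buffer, for illustration only

// Stage 1, vertex processing: map [-1, 1] coordinates to pixel coordinates.
Vertex process_vertex(const Vertex& v) {
    return Vertex{ Vec2{ (v.pos.x + 1.0f) * 0.5f * W,
                         (v.pos.y + 1.0f) * 0.5f * H }, v.uv };
}

// Edge function: twice the signed area of the triangle (a, b, p).
float edge(Vec2 a, Vec2 b, Vec2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Stage 3, texture mapping: sample a procedural 2x2 checkerboard.
int sample_texture(Vec2 uv) {
    int u = uv.x < 0.5f ? 0 : 1;
    int v = uv.y < 0.5f ? 0 : 1;
    return (u ^ v) ? 255 : 64;
}

// Stages 2 and 4: rasterize the triangle into fragments, shade each one and
// write it to the frame buffer. Returns the number of pixels written.
int draw_triangle(const std::array<Vertex, 3>& in, std::vector<int>& frame) {
    assert(frame.size() == static_cast<std::size_t>(W * H));
    std::array<Vertex, 3> v;
    for (int i = 0; i < 3; ++i) v[i] = process_vertex(in[i]);
    const float area = edge(v[0].pos, v[1].pos, v[2].pos);
    if (area == 0.0f) return 0;                       // degenerate triangle
    int written = 0;
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            const Vec2 p{ x + 0.5f, y + 0.5f };       // pixel centre
            // Barycentric weights; all non-negative means p is inside.
            const float w0 = edge(v[1].pos, v[2].pos, p) / area;
            const float w1 = edge(v[2].pos, v[0].pos, p) / area;
            const float w2 = edge(v[0].pos, v[1].pos, p) / area;
            if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f) continue;
            // Interpolate the texture coordinate across the triangle.
            const Vec2 uv{ w0 * v[0].uv.x + w1 * v[1].uv.x + w2 * v[2].uv.x,
                           w0 * v[0].uv.y + w1 * v[1].uv.y + w2 * v[2].uv.y };
            frame[y * W + x] = sample_texture(uv);    // pixel stage + ROP write
            ++written;
        }
    }
    return written;
}
```

Every iteration of the two pixel loops is independent of the others, which is why hardware can evaluate many fragments at once instead of walking them one by one as this sketch does.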

CPUs are optimized for low-latency serial execution of a single instruction stream and offer limited parallelism and memory bandwidth for multimedia workloads. SIMD extensions such as Intel SSE improve data parallelism, but they cannot match the thousands of small cores that a GPU dedicates to data-parallel tasks.

Architecturally, CPUs contain complex control logic, large caches and a few powerful cores for diverse tasks. GPUs consist mainly of many stream processors, simple control units and high‑throughput memory controllers, enabling massive floating‑point throughput.

Figure: CPU vs GPU architecture
Figure: Serial execution diagram
Figure: Parallel execution diagram
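
The difference between serial and parallel execution can be put into rough numbers. The C++ sketch below is an idealized model, not a benchmark: it assumes hardware with `lanes` execution units finishes `lanes` independent elements per step and ignores memory bandwidth, latency and scheduling. The names `scale_add` and `run_in_steps` are hypothetical. A lane count of 1 stands for a scalar loop, 4 for a 128-bit SSE register holding four 32-bit floats, and larger values for GPU-style widths.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise operation: each result depends only on its own inputs,
// so any number of elements can be computed at the same time.
inline float scale_add(float a, float b) { return 2.0f * a + b; }

// Processes the arrays in groups of `lanes` elements and returns how many
// steps were needed under the idealized one-group-per-step assumption.
std::size_t run_in_steps(const std::vector<float>& a,
                         const std::vector<float>& b,
                         std::vector<float>& out,
                         std::size_t lanes) {
    assert(lanes > 0 && a.size() == b.size());
    out.resize(a.size());
    std::size_t steps = 0;
    for (std::size_t base = 0; base < a.size(); base += lanes, ++steps) {
        // On parallel hardware this inner loop would run simultaneously;
        // here it is simulated one lane at a time.
        for (std::size_t lane = 0; lane < lanes && base + lane < a.size(); ++lane)
            out[base + lane] = scale_add(a[base + lane], b[base + lane]);
    }
    return steps;
}
```

For 1000 elements the model needs 1000 steps with one lane, 250 with four and 4 with 256, while the results stay identical. That independence between elements is what data-parallel hardware exploits.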

3. GPU Acceleration Technologies

3.1 CUDA

In 2006 NVIDIA introduced CUDA (Compute Unified Device Architecture), a general-purpose parallel computing platform that lets developers write GPU code in a C-based language. CUDA defines a virtual instruction set (PTX), a parallel execution engine, and a set of libraries such as cuFFT (fast Fourier transforms) and cuBLAS (basic linear algebra subprograms).

A CUDA program consists of host code (executed on the CPU) and device code (executed on the GPU). The runtime API provides functions for memory allocation, data transfer, kernel launch and synchronization. Compilation is performed with nvcc.

Supported languages include C, C++, Fortran and, via wrappers, Python, Java, MATLAB and others. A minimal kernel launch looks like this:

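#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one element pair; the bounds check protects the final block.
__global__ void add(const int *a, const int *b, int *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}

int main() {
    const int n = 1024;
    int ha[n], hb[n], hc[n], *da, *db, *dc;
    for (int i = 0; i < n; ++i) { ha[i] = i; hb[i] = 2 * i; }
    cudaMalloc(&da, sizeof(ha)); cudaMalloc(&db, sizeof(hb)); cudaMalloc(&dc, sizeof(hc));
    cudaMemcpy(da, ha, sizeof(ha), cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(db, hb, sizeof(hb), cudaMemcpyHostToDevice);
    add<<<(n + 255) / 256, 256>>>(da, db, dc, n);            // 256 threads per block
    cudaMemcpy(hc, dc, sizeof(hc), cudaMemcpyDeviceToHost);  // device -> host, after the kernel
    printf("c[10] = %d\n", hc[10]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}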
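
The index expression `threadIdx.x + blockIdx.x * blockDim.x` used in the add kernel can be traced on the CPU. The sketch below is plain C++, not CUDA: nested loops stand in for the blocks and threads that the hardware would run concurrently, and the names `blocks_for`, `add_thread` and `launch_add` are hypothetical. It also shows why a bounds check matters when the array length is not a multiple of the block size.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed so that blocks * block_dim >= n (ceiling division).
int blocks_for(int n, int block_dim) {
    assert(n >= 0 && block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

// CPU stand-in for the kernel body: one call plays the role of one GPU thread.
void add_thread(int block_idx, int block_dim, int thread_idx,
                const std::vector<int>& a, const std::vector<int>& b,
                std::vector<int>& c) {
    const int idx = thread_idx + block_idx * block_dim;  // same formula as the kernel
    if (idx < static_cast<int>(c.size()))                // surplus threads do nothing
        c[idx] = a[idx] + b[idx];
}

// Emulates a one-dimensional launch and returns the number of threads started.
int launch_add(const std::vector<int>& a, const std::vector<int>& b,
               std::vector<int>& c, int block_dim) {
    assert(a.size() == c.size() && b.size() == c.size());
    const int grid_dim = blocks_for(static_cast<int>(c.size()), block_dim);
    for (int block = 0; block < grid_dim; ++block)
        for (int thread = 0; thread < block_dim; ++thread)
            add_thread(block, block_dim, thread, a, b, c);
    return grid_dim * block_dim;
}
```

With 1000 elements and 256 threads per block, four blocks are required and 1024 threads start; the last 24 fall outside the array and must be filtered out by the index check.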

The CUDA driver acts as a hardware abstraction layer: it compiles PTX for the installed GPU at load time, so code built today can still run on future generations of NVIDIA GPUs.

Figure: CUDA processing flow

3.2 OpenCL

OpenCL (Open Computing Language) is an open, royalty‑free standard maintained by the Khronos Group for heterogeneous computing across CPUs, GPUs, DSPs and FPGAs. It defines a C‑like kernel language and an API for device discovery, context creation, command‑queue management and synchronization.

Unlike CUDA, which runs only on NVIDIA GPUs, OpenCL can target any device for which a vendor ships a conforming implementation, which makes it portable across vendors. An OpenCL program likewise separates host code from kernel code; the host uses the OpenCL API to compile kernels at run time and enqueue them on one or more devices.

Typical OpenCL workflow:

Query platforms and devices with clGetPlatformIDs and clGetDeviceIDs.

Create a context (clCreateContext) and a command queue (clCreateCommandQueue, or clCreateCommandQueueWithProperties from OpenCL 2.0 onward).

Write the kernel source, wrap it in a program object (clCreateProgramWithSource), compile it with clBuildProgram, and create a kernel object (clCreateKernel).

Allocate buffers (clCreateBuffer), bind them to the kernel (clSetKernelArg), transfer input data (clEnqueueWriteBuffer), launch the kernel (clEnqueueNDRangeKernel), and read the results back (clEnqueueReadBuffer).
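
Two details of this workflow are easy to show without a device. The C++ sketch below holds an OpenCL C kernel as a string, which is the form the host passes to clCreateProgramWithSource before clBuildProgram compiles it, and computes a global work size for clEnqueueNDRangeKernel. It makes no OpenCL calls itself. The kernel name `vec_add` and the helper `padded_global_size` are assumptions for illustration; the rounding reflects the OpenCL 1.x rule that the global size must be a multiple of an explicitly supplied local size.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// OpenCL C source for a vector add. The host hands this text to the
// OpenCL runtime, which compiles it for the selected device at run time.
const std::string kVecAddSource = R"CLC(
__kernel void vec_add(__global const int* a,
                      __global const int* b,
                      __global int* c,
                      const int n) {
    int i = get_global_id(0);        /* one work-item per element */
    if (i < n) c[i] = a[i] + b[i];   /* skip padding work-items */
}
)CLC";

// Rounds n up to a whole number of work-groups of `local_size` work-items.
std::size_t padded_global_size(std::size_t n, std::size_t local_size) {
    assert(local_size > 0);
    return (n + local_size - 1) / local_size * local_size;
}
```

For 1000 elements and a local size of 64 the padded global size is 1024, so the `i < n` guard in the kernel is what keeps the 24 extra work-items from writing past the end of the buffers.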

OpenCL supports both task‑parallel and data‑parallel models, extending GPU usage beyond graphics to scientific computing, image processing, machine learning and other general‑purpose workloads.
