GPU Overview: History, Architecture, Processing Workflow, and Acceleration Technologies (CUDA & OpenCL)
This article provides a comprehensive overview of GPUs, covering their history, architecture, processing workflow, and acceleration technologies such as CUDA and OpenCL, while comparing GPU and CPU designs and offering resources for further study.
GPU stands for Graphics Processing Unit. GPUs are widely used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs excel at image and graphics processing because their highly parallel architecture gives them an advantage over general‑purpose CPUs on large‑scale parallel algorithms.
As data volumes grow and applications such as manufacturing simulation and autonomous driving increasingly rely on GPU acceleration, demand for computational acceleration is expanding across many industries. Building on the author's earlier summary of GPU basics, the following sections provide a detailed overview.
1. Origin of the GPU
In August 1985, ATI was founded; later that year it released its first graphics chip, and then its first graphics card, built with ASIC technology. In April 1992 ATI launched the Mach32 graphics card with integrated acceleration. Although ATI referred to its chips as VPUs (Visual Processing Units) for many years, the term GPU was adopted after AMD acquired ATI.
NVIDIA introduced the concept of the GPU in 1999 with the GeForce 256 graphics processor. The GPU reduced reliance on the CPU by taking over many tasks the CPU had previously handled, especially in 3D graphics processing. Key technologies of the GeForce 256 included hardware transform and lighting (T&L), cube environment mapping, texture compression, bump mapping, and a 256‑bit rendering engine, with hardware T&L in particular becoming a hallmark of the GPU.
2. Working Principle
2.1 GPU Workflow Overview
In simplified form, the GPU graphics pipeline performs the following stages:
Vertex Processing: The GPU reads vertex data that describes the appearance of a 3D object, determines its shape and spatial relationships, and builds the object's skeleton. In GPUs supporting DirectX 8/9, this is implemented by a hardware Vertex Shader.
Rasterization: The generated geometry is converted into pixel data using algorithms that map vectors to discrete screen pixels.
Texture Mapping: The Texture Mapping Unit (TMU) applies images to the surfaces of polygons, creating realistic visual effects.
Pixel Processing: As triangles are rasterized, a Pixel Shader computes each pixel's final attributes, and the Raster Operations Processor (ROP) writes the completed frame to video memory.
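The stages above can be sketched as a toy software pipeline. This is only a conceptual model, not how GPU hardware is organized: the triangle, screen size, transform, and flat "shading" rule are all made-up examples.

```python
# Toy software model of the pipeline stages above:
# vertex processing -> rasterization -> pixel (fragment) processing.
WIDTH, HEIGHT = 8, 8

def vertex_stage(vertices, dx, dy):
    """Vertex processing: transform each vertex (here, a simple translation)."""
    return [(x + dx, y + dy) for (x, y) in vertices]

def edge(a, b, p):
    """Signed-area edge function used for the point-in-triangle test."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(tri):
    """Rasterization: map the triangle onto discrete pixel centers."""
    pixels = []
    for y in range(HEIGHT):
        for x in range(WIDTH):
            p = (x + 0.5, y + 0.5)
            w0 = edge(tri[1], tri[2], p)
            w1 = edge(tri[2], tri[0], p)
            w2 = edge(tri[0], tri[1], p)
            # The pixel is covered if it lies on one side of all three edges.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                pixels.append((x, y))
    return pixels

def pixel_stage(pixels):
    """Pixel processing: compute a final attribute for each covered pixel."""
    return {(x, y): 255 for (x, y) in pixels}  # flat white "shading"

tri = vertex_stage([(0, 0), (6, 0), (0, 6)], 1, 1)
frame = pixel_stage(rasterize(tri))
print(len(frame), "pixels covered")
```

On real hardware, texture mapping would sit between rasterization and pixel shading, and each stage would run across many primitives and pixels in parallel rather than in nested Python loops.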
Before GPUs, CPUs handled most computation, including multimedia processing, but their largely serial execution model limited data‑intensive parallel workloads. Because a CPU devotes much of its die to cache and control logic rather than arithmetic units, it is less suited to high‑throughput parallel tasks.
GPUs, by contrast, consist of thousands of small, efficient cores designed for massive parallelism, allowing them to process millions of pixels simultaneously. This architectural difference gives GPUs far higher floating‑point throughput than CPUs.
Figure 2‑1 CPU and GPU Architecture
Figure 2‑2 Serial Computation Diagram
Figure 2‑3 Parallel Computation Diagram
Serial computation runs on a single processor, executing one instruction at a time; parallel computation splits a problem into independent pieces and distributes them across multiple processors, which execute simultaneously and complete the algorithm significantly faster.
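The serial/parallel contrast can be demonstrated with Python's standard library. The workload here (summing squares over a range) and the four-way chunking are illustrative choices; the point is that the chunks are independent, so they can run on separate processors at once.

```python
# Serial vs. parallel execution of the same workload: summing squares.
from concurrent.futures import ProcessPoolExecutor

def sum_squares(lo, hi):
    """One independent unit of work: sum of i*i for i in [lo, hi)."""
    return sum(i * i for i in range(lo, hi))

def serial(n):
    # Serial: a single processor walks the whole range, one step at a time.
    return sum_squares(0, n)

def parallel(n, workers=4):
    # Parallel: the range is split into independent chunks that can
    # execute simultaneously on separate processors.
    step = n // workers
    bounds = [(k * step, n if k == workers - 1 else (k + 1) * step)
              for k in range(workers)]
    los, his = zip(*bounds)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_squares, los, his))

if __name__ == "__main__":
    n = 10_000
    assert serial(n) == parallel(n)
    print("serial and parallel results match")
```

A GPU takes the same idea much further: instead of four worker processes, it schedules thousands of threads over its cores, one per data element where possible.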
3. GPU Acceleration Technologies
3.1 CUDA
In 2006 NVIDIA introduced CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model that lets developers write C‑based programs that run on NVIDIA GPUs. CUDA provides a dedicated instruction set architecture, a parallel execution engine, and libraries such as cuFFT and cuBLAS for common high‑performance tasks.
The runtime environment offers APIs for memory management, device access, and kernel launching. A CUDA program consists of host code (running on the CPU) and device code (kernels running on the GPU). The driver layer abstracts differences between GPU generations, so compiled CUDA programs can run across successive NVIDIA architectures.
Developers can use CUDA with C, C++, Fortran, as well as other languages via bindings (Python, Java, MATLAB, etc.), and can also leverage OpenCL, DirectCompute, OpenGL Compute Shaders, or C++ AMP for heterogeneous computing.
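The heart of the CUDA model is that each data element is handled by a lightweight thread identified by block and thread indices. The following pure-Python sketch imitates only that indexing scheme; in real CUDA C the kernel would be a `__global__` function launched with an `add<<<grid_dim, block_dim>>>(...)` syntax, and all threads would run concurrently on the GPU rather than in a loop.

```python
# Pure-Python imitation of CUDA's data-parallel indexing model.
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """'Device code': each thread computes exactly one output element,
    using the same index arithmetic a CUDA kernel would:
    i = blockIdx.x * blockDim.x + threadIdx.x."""
    i = block_idx * block_dim + thread_idx
    if i < len(out):          # guard against out-of-range threads
        out[i] = a[i] + b[i]

def launch(kernel, grid_dim, block_dim, *args):
    """'Host code': this serial loop stands in for the GPU running
    every thread of every block simultaneously."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim   # enough blocks to cover n elements
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
print(out)
```

The bounds check `i < len(out)` mirrors a standard CUDA idiom: the grid is rounded up to whole blocks, so the last block may contain threads with nothing to do.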
3.2 OpenCL
OpenCL (Open Computing Language) is an open, cross‑platform framework for writing parallel programs that can run on CPUs, GPUs, DSPs, FPGAs, and other processors. Unlike CUDA, which is limited to NVIDIA hardware, OpenCL targets any parallel device, providing a unified programming model and API.
OpenCL programs consist of kernel code that executes on the device and host code that controls the platform. The standard is maintained by the Khronos Group and enables developers to write portable, high‑performance code across heterogeneous systems.
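The kernel/host split can be illustrated as follows. The kernel string is ordinary OpenCL C; the Python "host" below is only a stand-in simulation of the NDRange execution (a real host would create a context and command queue through an OpenCL runtime, build the kernel source, and enqueue it over the NDRange).

```python
# Device side: OpenCL C source, compiled at runtime by the host.
# This kernel doubles every element; each work-item handles one index.
KERNEL_SOURCE = """
__kernel void double_elements(__global const float *in,
                              __global float *out) {
    size_t gid = get_global_id(0);   /* which work-item am I? */
    out[gid] = 2.0f * in[gid];
}
"""

# Host-side stand-in: simulate the NDRange by applying the kernel's
# logic once per work-item. On a real device these run concurrently.
def simulate_ndrange(global_size, data):
    out = [0.0] * global_size
    for gid in range(global_size):
        out[gid] = 2.0 * data[gid]
    return out

data = [1.0, 2.5, -3.0, 4.0]
result = simulate_ndrange(len(data), data)
print(result)
```

Note the terminology shift from CUDA: OpenCL speaks of work-items and work-groups rather than threads and blocks, and `get_global_id(0)` plays the role of the `blockIdx * blockDim + threadIdx` computation.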
Source: Intelligent Computing Chip World