GPU Overview: History, Architecture, Processing Workflow, and Acceleration Technologies (CUDA & OpenCL)
This article provides a comprehensive overview of GPUs, covering their history, architecture, processing workflow, and acceleration technologies such as CUDA and OpenCL, while comparing GPU and CPU designs and offering resources for further study.
GPU stands for Graphics Processing Unit. GPUs are widely used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs excel at image and graphics processing because their highly parallel architecture gives them an advantage over general‑purpose CPUs on large‑scale parallel algorithms.
As data volumes grow and applications such as manufacturing simulation and autonomous driving increasingly rely on GPU acceleration, demand for computational acceleration is expanding across many industries. Building on the author's earlier summary of GPU basics, the following sections provide a detailed overview.
1. Origin of the GPU
In August 1985, ATI was founded; later that year it released its first graphics chip, and then its first graphics card, built with ASIC technology. In April 1992 ATI launched the Mach32 graphics card with integrated acceleration. Although ATI referred to its chips as VPUs (Visual Processing Units) for many years, the term GPU was adopted after AMD acquired ATI.
NVIDIA introduced the concept of the GPU in 1999 with the GeForce 256 graphics processor. The GPU reduced reliance on the CPU by taking over many tasks the CPU had previously handled, especially in 3D graphics processing. Key technologies of the GeForce 256 included hardware transform and lighting (T&L), cube environment mapping, texture compression, bump mapping, and a 256‑bit rendering engine, with hardware T&L in particular becoming a hallmark of the GPU.
2. Working Principle
2.1 GPU Workflow Overview
In simplified form, the GPU graphics pipeline performs the following stages:
Vertex Processing: The GPU reads vertex data that describes the appearance of a 3D object, determines its shape and spatial relationships, and builds the object's skeleton. In GPUs supporting DirectX 8/9, this is implemented by a hardware Vertex Shader.
Rasterization: The generated geometry is converted into pixel data using algorithms that map vectors to discrete screen pixels.
Texture Mapping: The Texture Mapping Unit (TMU) applies images to the surfaces of polygons, creating realistic visual effects.
Pixel Processing: As triangles are rasterized, a Pixel Shader computes each pixel's final attributes, and the Raster Operations Processor (ROP) writes the completed frame to video memory.
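The stages above can be sketched as a toy software pipeline. This is only a conceptual model, not how GPU hardware is organized: the triangle, screen size, transform, and flat "shading" rule are all made-up examples.

```python
# Toy software model of the pipeline stages above:
# vertex processing -> rasterization -> pixel (fragment) processing.
WIDTH, HEIGHT = 8, 8

def vertex_stage(vertices, dx, dy):
    """Vertex processing: transform each vertex (here, a simple translation)."""
    return [(x + dx, y + dy) for (x, y) in vertices]

def edge(a, b, p):
    """Signed-area edge function used for the point-in-triangle test."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(tri):
    """Rasterization: map the triangle onto discrete pixel centers."""
    pixels = []
    for y in range(HEIGHT):
        for x in range(WIDTH):
            p = (x + 0.5, y + 0.5)
            w0 = edge(tri[1], tri[2], p)
            w1 = edge(tri[2], tri[0], p)
            w2 = edge(tri[0], tri[1], p)
            # The pixel is covered if it lies on one side of all three edges.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                pixels.append((x, y))
    return pixels

def pixel_stage(pixels):
    """Pixel processing: compute a final attribute for each covered pixel."""
    return {(x, y): 255 for (x, y) in pixels}  # flat white "shading"

tri = vertex_stage([(0, 0), (6, 0), (0, 6)], 1, 1)
frame = pixel_stage(rasterize(tri))
print(len(frame), "pixels covered")
```

On real hardware, texture mapping would sit between rasterization and pixel shading, and each stage would run across many primitives and pixels in parallel rather than in nested Python loops.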
Before GPUs, CPUs handled most computation, including multimedia processing, but their largely serial execution model limited data‑intensive parallel workloads. Because a CPU devotes much of its die to cache and control logic rather than arithmetic units, it is less suited to high‑throughput parallel tasks.
GPUs, by contrast, consist of thousands of small, efficient cores designed for massive parallelism, allowing them to process millions of pixels simultaneously. This architectural difference gives GPUs far higher floating‑point throughput than CPUs.
Figure 2‑1 CPU and GPU Architecture
Figure 2‑2 Serial Computation Diagram
Figure 2‑3 Parallel Computation Diagram
Serial computation runs on a single processor, executing one instruction at a time; parallel computation splits a problem into independent pieces and distributes them across multiple processors, which execute simultaneously and complete the algorithm significantly faster.
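The serial/parallel contrast can be demonstrated with Python's standard library. The workload here (summing squares over a range) and the four-way chunking are illustrative choices; the point is that the chunks are independent, so they can run on separate processors at once.

```python
# Serial vs. parallel execution of the same workload: summing squares.
from concurrent.futures import ProcessPoolExecutor

def sum_squares(lo, hi):
    """One independent unit of work: sum of i*i for i in [lo, hi)."""
    return sum(i * i for i in range(lo, hi))

def serial(n):
    # Serial: a single processor walks the whole range, one step at a time.
    return sum_squares(0, n)

def parallel(n, workers=4):
    # Parallel: the range is split into independent chunks that can
    # execute simultaneously on separate processors.
    step = n // workers
    bounds = [(k * step, n if k == workers - 1 else (k + 1) * step)
              for k in range(workers)]
    los, his = zip(*bounds)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_squares, los, his))

if __name__ == "__main__":
    n = 10_000
    assert serial(n) == parallel(n)
    print("serial and parallel results match")
```

A GPU takes the same idea much further: instead of four worker processes, it schedules thousands of threads over its cores, one per data element where possible.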
3. GPU Acceleration Technologies
3.1 CUDA
In 2006 NVIDIA introduced CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model that lets developers write C‑based programs that run on NVIDIA GPUs. CUDA provides a dedicated instruction set architecture, a parallel execution engine, and libraries such as cuFFT and cuBLAS for common high‑performance tasks.
The runtime environment offers APIs for memory management, device access, and kernel launching. A CUDA program consists of host code (running on the CPU) and device code (kernels running on the GPU). The driver layer abstracts differences between GPU generations, so compiled CUDA programs can run across successive NVIDIA architectures.
Developers can use CUDA with C, C++, Fortran, as well as other languages via bindings (Python, Java, MATLAB, etc.), and can also leverage OpenCL, DirectCompute, OpenGL Compute Shaders, or C++ AMP for heterogeneous computing.
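The heart of the CUDA model is that each data element is handled by a lightweight thread identified by block and thread indices. The following pure-Python sketch imitates only that indexing scheme; in real CUDA C the kernel would be a `__global__` function launched with an `add<<<grid_dim, block_dim>>>(...)` syntax, and all threads would run concurrently on the GPU rather than in a loop.

```python
# Pure-Python imitation of CUDA's data-parallel indexing model.
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """'Device code': each thread computes exactly one output element,
    using the same index arithmetic a CUDA kernel would:
    i = blockIdx.x * blockDim.x + threadIdx.x."""
    i = block_idx * block_dim + thread_idx
    if i < len(out):          # guard against out-of-range threads
        out[i] = a[i] + b[i]

def launch(kernel, grid_dim, block_dim, *args):
    """'Host code': this serial loop stands in for the GPU running
    every thread of every block simultaneously."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim   # enough blocks to cover n elements
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
print(out)
```

The bounds check `i < len(out)` mirrors a standard CUDA idiom: the grid is rounded up to whole blocks, so the last block may contain threads with nothing to do.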
3.2 OpenCL
OpenCL (Open Computing Language) is an open, cross‑platform framework for writing parallel programs that can run on CPUs, GPUs, DSPs, FPGAs, and other processors. Unlike CUDA, which is limited to NVIDIA hardware, OpenCL targets any parallel device, providing a unified programming model and API.
OpenCL programs consist of kernel code that executes on the device and host code that controls the platform. The standard is maintained by the Khronos Group and enables developers to write portable, high‑performance code across heterogeneous systems.
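The kernel/host split can be illustrated as follows. The kernel string is ordinary OpenCL C; the Python "host" below is only a stand-in simulation of the NDRange execution (a real host would create a context and command queue through an OpenCL runtime, build the kernel source, and enqueue it over the NDRange).

```python
# Device side: OpenCL C source, compiled at runtime by the host.
# This kernel doubles every element; each work-item handles one index.
KERNEL_SOURCE = """
__kernel void double_elements(__global const float *in,
                              __global float *out) {
    size_t gid = get_global_id(0);   /* which work-item am I? */
    out[gid] = 2.0f * in[gid];
}
"""

# Host-side stand-in: simulate the NDRange by applying the kernel's
# logic once per work-item. On a real device these run concurrently.
def simulate_ndrange(global_size, data):
    out = [0.0] * global_size
    for gid in range(global_size):
        out[gid] = 2.0 * data[gid]
    return out

data = [1.0, 2.5, -3.0, 4.0]
result = simulate_ndrange(len(data), data)
print(result)
```

Note the terminology shift from CUDA: OpenCL speaks of work-items and work-groups rather than threads and blocks, and `get_global_id(0)` plays the role of the `blockIdx * blockDim + threadIdx` computation.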
Source: Intelligent Computing Chip World