Understanding GPU Architecture and Its Evolution
This article explains the historical development of graphics processing units, their internal structure, rendering pipeline, and how GPUs shifted graphics workloads from CPUs to specialized parallel hardware, highlighting key concepts such as vertex shaders, pixel shaders, SIMD architectures, and performance growth.
Before GPUs existed, the CPU and graphics card worked in a "master‑servant" relationship: the card was merely a brush that followed the CPU's commands to shade, texture, render, and output images.
Early 3D acceleration cards still relied heavily on the CPU for coordinate processing and lighting, limiting frame rates and visual smoothness.
In August 1999 NVIDIA released the GeForce 256 and introduced the term GPU, a processor capable of handling almost all graphics‑related calculations that were previously the CPU’s domain.
Modern GPUs perform vertex setup, lighting, pixel operations, and more, essentially acting as a collection of hardware‑implemented graphics functions.
A typical GPU contains a 2D Engine, 3D Engine, Video‑Processing Engine, FSAA Engine, and memory‑management units; the 3D Engine is the core of contemporary graphics cards.
The rendering pipeline proceeds through vertex processing (vertex shader), setup (triangle assembly), rasterization, texture mapping, pixel processing (pixel shader), and final output via the ROP to the frame buffer.
Vertex processing: reads vertex data, builds the 3D skeleton, and is performed by the hardware vertex shader.
Rasterization: converts geometric primitives into pixel fragments.
Texture mapping: applies image data to polygon surfaces to create realistic visuals.
Pixel processing: computes final pixel colors using pixel shaders.
Final output: the ROP writes the completed frame to video memory.
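The five stages above can be sketched as a tiny software pipeline. Everything here is an illustrative assumption for clarity (the function names, the hard‑coded triangle, the bounding‑box "rasterizer"), not any real GPU's API:

```python
# Minimal software model of the fixed pipeline stages described above.
# All names (vertex_shader, rasterize, pixel_shader) are illustrative.

def vertex_shader(v, scale=100):
    # Vertex processing: transform a 3D position into 2D screen space.
    x, y, z = v
    return (int(x * scale), int(y * scale))

def rasterize(tri):
    # Rasterization: convert the geometric primitive into pixel fragments
    # (here simplified to the pixels of its bounding box).
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    for x in range(min(xs), max(xs) + 1):
        for y in range(min(ys), max(ys) + 1):
            yield (x, y)

def pixel_shader(frag):
    # Pixel processing: compute a final color for each fragment.
    x, y = frag
    return (x % 256, y % 256, 128)  # a simple position-based gradient

def render(triangle, frame_buffer):
    screen_tri = [vertex_shader(v) for v in triangle]   # vertex stage
    for frag in rasterize(screen_tri):                  # raster stage
        frame_buffer[frag] = pixel_shader(frag)         # pixel stage + ROP write

frame_buffer = {}
render([(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)], frame_buffer)
print(len(frame_buffer))  # → 121 fragments written to the "frame buffer"
```

On real hardware each stage runs on dedicated parallel units and fragments are processed thousands at a time; the sequential loop here only shows the order of the stages.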
Traditional GPUs use a SIMD (single instruction, multiple data) architecture, in which an ALU executes the same instruction on multiple data streams simultaneously; specialized 4D ALUs handle the 4‑component vectors common in graphics efficiently.
When an instruction operates on fewer than four components (1D, 2D, or 3D), some lanes sit idle and efficiency drops, so techniques like co‑issue (pairing 1D+3D or 2D+2D operations in a single ALU slot) were introduced to improve utilization.
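A toy model of the 4D SIMD and co‑issue ideas, with made‑up helper names chosen for this sketch (real hardware does this in fixed‑function silicon, not Python):

```python
def simd4_add(a, b):
    # A 4D ALU applies one instruction (here: add) to all four
    # vector components (x, y, z, w) in a single step.
    return tuple(x + y for x, y in zip(a, b))

def co_issue_1d_3d(op1d, a, op3d, b, c):
    # Co-issue: pack a 1D operation and a 3D operation into the
    # same 4D ALU slot, so sub-4D work does not leave lanes idle.
    return (op1d(a),) + tuple(op3d(x, y) for x, y in zip(b, c))

pos = (1.0, 2.0, 3.0, 1.0)
delta = (0.5, 0.5, 0.5, 0.0)
print(simd4_add(pos, delta))  # → (1.5, 2.5, 3.5, 1.0)

# One slot computes a scalar alpha (1D) and an RGB sum (3D) together.
result = co_issue_1d_3d(abs, -0.8, lambda x, y: x + y,
                        (0.1, 0.2, 0.3), (0.1, 0.1, 0.1))
print(result)  # → (0.8, 0.2, 0.30000000000000004, 0.4)
```

The point of co‑issue is visible in the second call: without it, the 1D `abs` would occupy a full 4D slot and waste three of the four lanes.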
Compared to CPUs, GPUs allocate most transistors to parallel arithmetic units rather than control logic or cache, giving them a massive advantage in floating‑point performance, with GPU performance roughly doubling every six months.
Architects' Tech Alliance