Understanding GPU Architecture and Its Evolution
This article explains the historical development of graphics processing units, their internal structure, rendering pipeline, and how GPUs shifted graphics workloads from CPUs to specialized parallel hardware, highlighting key concepts such as vertex shaders, pixel shaders, SIMD architectures, and performance growth.
Before GPUs existed, the CPU and graphics card worked in a "master‑servant" relationship: the card was merely a brush that followed the CPU's commands to shade, texture, render, and output images.
Early 3D acceleration cards still relied heavily on the CPU for coordinate processing and lighting, limiting frame rates and visual smoothness.
In August 1999 NVIDIA released the GeForce 256 and introduced the term GPU, a processor capable of handling almost all graphics‑related calculations that were previously the CPU’s domain.
Modern GPUs perform vertex setup, lighting, pixel operations, and more, essentially acting as a collection of hardware‑implemented graphics functions.
A typical GPU contains a 2D Engine, 3D Engine, Video‑Processing Engine, FSAA Engine, and memory‑management units; the 3D Engine is the core of contemporary graphics cards.
The rendering pipeline proceeds through vertex processing (vertex shader), setup (triangle assembly), rasterization, texture mapping, pixel processing (pixel shader), and final output via the ROP to the frame buffer.
Vertex processing: reads vertex data, builds the 3D skeleton, and is performed by the hardware vertex shader.
Rasterization: converts geometric primitives into pixel fragments.
Texture mapping: applies image data to polygon surfaces to create realistic visuals.
Pixel processing: computes final pixel colors using pixel shaders.
Final output: the ROP writes the completed frame to video memory.
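The five stages above can be sketched as a tiny software pipeline. Everything here is an illustrative assumption for clarity (the function names, the hard‑coded triangle, the bounding‑box "rasterizer"), not any real GPU's API:

```python
# Minimal software model of the fixed pipeline stages described above.
# All names (vertex_shader, rasterize, pixel_shader) are illustrative.

def vertex_shader(v, scale=100):
    # Vertex processing: transform a 3D position into 2D screen space.
    x, y, z = v
    return (int(x * scale), int(y * scale))

def rasterize(tri):
    # Rasterization: convert the geometric primitive into pixel fragments
    # (here simplified to the pixels of its bounding box).
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    for x in range(min(xs), max(xs) + 1):
        for y in range(min(ys), max(ys) + 1):
            yield (x, y)

def pixel_shader(frag):
    # Pixel processing: compute a final color for each fragment.
    x, y = frag
    return (x % 256, y % 256, 128)  # a simple position-based gradient

def render(triangle, frame_buffer):
    screen_tri = [vertex_shader(v) for v in triangle]   # vertex stage
    for frag in rasterize(screen_tri):                  # raster stage
        frame_buffer[frag] = pixel_shader(frag)         # pixel stage + ROP write

frame_buffer = {}
render([(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)], frame_buffer)
print(len(frame_buffer))  # → 121 fragments written to the "frame buffer"
```

On real hardware each stage runs on dedicated parallel units and fragments are processed thousands at a time; the sequential loop here only shows the order of the stages.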
Traditional GPUs use a SIMD (single instruction, multiple data) architecture, in which an ALU executes the same instruction on multiple data streams simultaneously; specialized 4D ALUs handle the 4‑component vectors common in graphics efficiently.
When an instruction operates on fewer than four components (1D, 2D, or 3D), some lanes sit idle and efficiency drops, so techniques like co‑issue (pairing 1D+3D or 2D+2D operations in a single ALU slot) were introduced to improve utilization.
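A toy model of the 4D SIMD and co‑issue ideas, with made‑up helper names chosen for this sketch (real hardware does this in fixed‑function silicon, not Python):

```python
def simd4_add(a, b):
    # A 4D ALU applies one instruction (here: add) to all four
    # vector components (x, y, z, w) in a single step.
    return tuple(x + y for x, y in zip(a, b))

def co_issue_1d_3d(op1d, a, op3d, b, c):
    # Co-issue: pack a 1D operation and a 3D operation into the
    # same 4D ALU slot, so sub-4D work does not leave lanes idle.
    return (op1d(a),) + tuple(op3d(x, y) for x, y in zip(b, c))

pos = (1.0, 2.0, 3.0, 1.0)
delta = (0.5, 0.5, 0.5, 0.0)
print(simd4_add(pos, delta))  # → (1.5, 2.5, 3.5, 1.0)

# One slot computes a scalar alpha (1D) and an RGB sum (3D) together.
result = co_issue_1d_3d(abs, -0.8, lambda x, y: x + y,
                        (0.1, 0.2, 0.3), (0.1, 0.1, 0.1))
print(result)  # → (0.8, 0.2, 0.30000000000000004, 0.4)
```

The point of co‑issue is visible in the second call: without it, the 1D `abs` would occupy a full 4D slot and waste three of the four lanes.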
Compared to CPUs, GPUs allocate most transistors to parallel arithmetic units rather than control logic or cache, giving them a massive advantage in floating‑point performance, with GPU performance roughly doubling every six months.
Architects' Tech Alliance