Decoding Chip Concepts: CPU, GPU, NPU, APU, SoC, HBM & Chiplet (2026)
This article breaks down the core chip concepts—CPU, GPU, NPU, APU, SoC, HBM and Chiplet—explaining their functions, key characteristics, historical evolution, and how they relate to each other, and provides a 2026 mainstream‑chip comparison and selection guide.
CPU – General‑purpose processor
Full name: Central Processing Unit.
Key characteristics
Core count: typically 4–64 on desktops and workstations; current server CPUs go well beyond 100.
Strengths: logic, branch prediction, task scheduling.
Weakness: limited large‑scale parallelism.
Representative products: Intel Core i9, AMD Ryzen 9, Apple M5.
Architectures: x86 (Intel/AMD), ARM (Apple/Qualcomm/MediaTek).
Internal components
ALU (arithmetic logic unit) – arithmetic and logic operations.
CU (control unit) – instruction decode and control flow.
Cache hierarchy (L1/L2/L3) – fast, progressively larger storage.
Registers – fastest storage directly attached to execution units.
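The payoff of the hierarchy listed above can be estimated with the standard average-access-time calculation. This is a minimal sketch; the latencies and hit rates are illustrative assumptions, not measurements of any CPU named in this article.

```python
def average_access_time(levels, memory_latency_ns):
    """Expected latency of one access through a cache hierarchy.

    levels: list of (latency_ns, hit_rate) tuples, ordered from L1 outward.
    An access that misses every level also pays memory_latency_ns.
    """
    expected = 0.0
    reach_probability = 1.0  # chance that an access gets as far as this level
    for latency_ns, hit_rate in levels:
        expected += reach_probability * latency_ns
        reach_probability *= (1.0 - hit_rate)
    return expected + reach_probability * memory_latency_ns


# Hypothetical numbers: (latency in ns, hit rate) for L1, L2, L3.
hierarchy = [(1.0, 0.95), (4.0, 0.80), (15.0, 0.50)]
with_cache = average_access_time(hierarchy, memory_latency_ns=80.0)
without_cache = average_access_time([], memory_latency_ns=80.0)
print(f"with caches: {with_cache:.2f} ns, DRAM only: {without_cache:.2f} ns")
```

Under these assumed numbers most accesses are served by L1, so the average stays close to the L1 latency even though main memory is far slower.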
Why a CPU cannot replace a GPU – CPUs follow a “fast‑and‑smart” design: each core handles complex logic but the core count is limited, analogous to a professor solving difficult problems for one student at a time. GPUs follow a “many‑and‑fast” design: thousands of simple cores work in parallel, analogous to many elementary students each doing simple addition simultaneously.
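The professor-versus-students analogy can be made quantitative with Amdahl's law, which bounds the speedup of a job when only part of it can run in parallel. The sketch below is a simplified model: the parallel fractions are hypothetical, and it ignores the fact that each GPU core is individually slower than a CPU core.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: speedup when only part of a job can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical workloads: one dominated by serial logic, one almost fully parallel.
for label, fraction in [("branch-heavy logic", 0.50), ("matrix multiply", 0.999)]:
    few = amdahl_speedup(fraction, workers=16)       # CPU-like core count
    many = amdahl_speedup(fraction, workers=10_000)  # GPU-like core count
    print(f"{label}: 16 workers -> {few:.1f}x, 10,000 workers -> {many:.1f}x")
```

For the branch-heavy case, thousands of extra workers barely help, while the nearly parallel case keeps scaling. That is the reason the two designs coexist rather than one replacing the other.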
GPU – Parallel‑compute engine
Full name: Graphics Processing Unit.
Key characteristics
Core count ranges from thousands to tens of thousands (CUDA cores / shaders).
Strengths: massive parallel tasks such as matrix multiplication and graphics rendering.
Weakness: complex logical branching.
Representative products: NVIDIA RTX 5090, RTX Spark, Apple M5 GPU.
Key metrics: CUDA core count, memory capacity/bandwidth, TFLOPS.
GPU evolution
Pre‑1999: Fixed‑function rasterization pipeline.
2001: Programmable shaders (GeForce 3) enable custom rendering.
2006: CUDA released, enabling general‑purpose (GPGPU) computing.
2012: AlexNet uses GPU for training, sparking the AI boom.
2017: Volta introduces Tensor Cores for matrix acceleration.
2018+: RTX series adds real‑time ray tracing and DLSS.
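2025: Blackwell architecture (announced in 2024, broadly shipping in 2025) adds FP4 precision, which halves the bits per value relative to FP8 and so roughly doubles peak AI inference throughput.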
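To give a feel for how coarse a 4-bit float is, the sketch below rounds values to the nearest FP4 number. It assumes the E2M1 layout (1 sign, 2 exponent, 1 mantissa bit) and a single scale factor for the whole block; real hardware formats use finer-grained block scaling, so treat this as an illustration rather than a description of Blackwell's exact scheme.

```python
# Magnitudes representable in the E2M1 variant of FP4 (assumed layout).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_fp4(values):
    """Round each value to the nearest FP4 value after scaling the block to fit.

    Returns (scale, quantized), where each value is approximately scale * quantized.
    """
    peak = max(abs(v) for v in values)
    scale = peak / FP4_MAGNITUDES[-1] if peak > 0 else 1.0
    quantized = []
    for v in values:
        magnitude = min(FP4_MAGNITUDES, key=lambda m: abs(abs(v) / scale - m))
        quantized.append(magnitude if v >= 0 else -magnitude)
    return scale, quantized


weights = [6.0, -3.0, 0.4, 1.2]  # hypothetical weights
scale, q = quantize_fp4(weights)
print(scale, q)
```

Only eight magnitudes exist per sign, so small values such as 0.4 and 1.2 snap to neighbours on the grid. The gain is that each weight occupies half the space of an FP8 value.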
GPU core units
CUDA cores – general parallel compute units (NVIDIA terminology).
Tensor Cores – matrix‑multiply‑accumulate units optimized for AI training and inference.
RT Cores – ray‑tracing acceleration.
Shaders – generic term for programmable compute units; AMD markets them as stream processors.
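The matrix-multiply-accumulate operation that Tensor Cores accelerate is D = A x B + C on a small tile. The pure-Python sketch below shows only the arithmetic; it says nothing about how the hardware schedules the work, and the 2x2 inputs are made up for illustration.

```python
def matrix_multiply_accumulate(a, b, c):
    """Return D = A x B + C for small matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [row[:] for row in c]  # start from a copy of the accumulator C
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                d[i][j] += a[i][k] * b[k][j]
    return d


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 0], [0, 1]]
print(matrix_multiply_accumulate(a, b, c))  # [[20, 22], [43, 51]]
```

Every output element is an independent sum of products, which is why this operation maps so well onto thousands of parallel units.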
NPU – AI‑specific inference accelerator
Full name: Neural Processing Unit.
Key characteristics
Specialized for neural‑network inference (convolution, matrix multiply, activation).
Power consumption ranges from milliwatts to a few watts on mobile devices, far below a discrete GPU.
Flexibility limited to AI ops; cannot perform general computation.
Representative products: Apple Neural Engine, Qualcomm Hexagon NPU, Intel NPU.
Typical scenarios: phone face‑unlock, AI photo processing, local PC Copilot.
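The operations an NPU specializes in reduce to multiply-accumulate followed by an activation. The sketch below shows one neuron, assuming 8-bit integer inputs and weights; INT8 is common on NPUs but is an assumption here, not something the article states.

```python
def int8_neuron(inputs, weights, bias):
    """One quantized neuron: int8 multiply-accumulate, then ReLU."""
    for v in list(inputs) + list(weights):
        if not -128 <= v <= 127:
            raise ValueError("inputs and weights must fit in int8")
    accumulator = bias  # hardware typically keeps this sum in a wider integer
    for x, w in zip(inputs, weights):
        accumulator += x * w
    return max(0, accumulator)  # ReLU activation


print(int8_neuron([10, -20, 30], [3, 2, 1], bias=5))  # 25
```

Because the data path is this narrow and fixed, an NPU can run it at very low power, at the price of not handling general-purpose code.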
Comparison with GPU (each row reads NPU vs. GPU)
Positioning: AI inference‑only vs. general parallel compute + AI.
Flexibility: Low (only neural nets) vs. high (programmable).
Energy efficiency: Very high vs. medium.
Performance ceiling: Medium vs. very high.
Typical devices: Phones, thin laptops vs. desktops, servers, data‑center rigs.
AI training support: No vs. yes.
APU – AMD’s integrated CPU + GPU solution
Full name: Accelerated Processing Unit.
Key points
Defined as CPU + GPU integrated in a single package.
Introduced by AMD in 2011 to reduce cost and power while meeting everyday workloads.
Branding evolved from the original AMD APUs to Ryzen APUs, and the parts were eventually merged into the mainstream Ryzen product line.
Because modern CPUs from Intel and AMD now embed GPUs, the term “APU” has largely faded.
SoC – System on Chip
Full name: System on Chip.
Key characteristics
Integrates CPU, GPU, NPU, ISP, memory controller, network baseband, and other modules on a single chip (traditionally one silicon die).
Advantages: low power, small form factor, ultra‑low communication latency.
Disadvantages: difficult to upgrade, because every component is part of the same chip and the chip itself is usually soldered to the board.
Modern SoC example (Apple M5 Pro)
CPU cores – performance + efficiency clusters.
GPU cores – graphics + parallel compute.
Neural Engine – AI inference (NPU).
ISP – image‑signal processing for cameras.
Media Engine – video encode/decode acceleration.
Memory controller – unified memory management.
Secure Enclave – secure storage/encryption.
Thunderbolt/NIC – external I/O.
Representative SoC products
Apple M5 series (MacBook, iPad, Mac Studio).
Qualcomm Snapdragon X Elite (Windows‑on‑ARM laptops).
NVIDIA RTX Spark – NVIDIA SoC for PCs with CPU + GPU + 128 GB unified memory (NVIDIA has shipped Tegra SoCs for embedded and mobile devices since 2008).
MediaTek Dimensity 9400 (Android flagship).
Huawei Kirin 9020 (smartphone flagship).
HBM – High‑Bandwidth Memory
Full name: High Bandwidth Memory.
Core innovation – 3‑D stacking of DRAM dies connected by through‑silicon vias (TSV).
Bandwidth
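HBM3E: roughly 1.2 TB/s and 24–36 GB per stack; an accelerator with several stacks reaches 4–8 TB/s in aggregate.
HBM4: roughly 2 TB/s per stack thanks to a 2048‑bit interface, with capacity higher than HBM3E.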
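Peak memory bandwidth follows directly from interface width and per-pin data rate. The sketch below assumes a 1024-bit interface at 9.6 Gb/s per pin for one HBM3E stack. Both figures are assumptions for illustration; multi-TB/s totals come from placing several stacks beside one GPU.

```python
def peak_bandwidth_gb_s(bus_width_bits, gbit_per_s_per_pin):
    """Peak bandwidth in GB/s = interface width x per-pin data rate / 8 bits per byte."""
    return bus_width_bits * gbit_per_s_per_pin / 8


# Assumed figures for one HBM3E stack: 1024-bit interface, 9.6 Gb/s per pin.
per_stack = peak_bandwidth_gb_s(1024, 9.6)
print(f"one stack: {per_stack:.1f} GB/s")
print(f"four stacks: {4 * per_stack / 1000:.2f} TB/s")
```

The very wide interface is what 3-D stacking buys: per-pin speeds are modest compared with GDDR, but there are far more pins.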
Power and cost
More power‑efficient than GDDR.
Very high manufacturing cost.
Main users – AI/HPC GPUs (e.g., NVIDIA H100, B200) and high‑end graphics cards.
HBM vs. traditional memory
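GDDR7: 1–2 TB/s per graphics card, 16–32 GB, targeted at gaming graphics.
LPDDR5X: roughly 0.1 TB/s in a typical phone or laptop configuration (wider memory buses reach several times that), 16–128 GB, used in phones and thin laptops.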
Why AI chips need HBM – Large‑model training and inference are often limited by memory bandwidth rather than compute; HBM provides a “wide conveyor belt” so the GPU’s compute capacity is not starved of data.
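A back-of-the-envelope bound makes the bandwidth limit concrete: when generating one token requires reading every weight once, the token rate cannot exceed bandwidth divided by model size. The model size and bandwidth figures below are hypothetical, and the estimate ignores batching, caches, and compute time.

```python
def max_tokens_per_second(parameters, bytes_per_parameter, bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    bytes_per_token = parameters * bytes_per_parameter  # every weight read once per token
    return bandwidth_bytes_per_s / bytes_per_token


# Hypothetical 70-billion-parameter model stored at 1 byte per parameter.
for name, tb_per_s in [("HBM accelerator, 8 TB/s aggregate", 8.0),
                       ("LPDDR5X device, 0.1 TB/s", 0.1)]:
    rate = max_tokens_per_second(70e9, 1, tb_per_s * 1e12)
    print(f"{name}: at most {rate:.1f} tokens/s")
```

Under these assumptions the gap is about 80x regardless of how much compute either chip has, which is the "wide conveyor belt" argument in numbers.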
HBM stacking principle
Vertically stack multiple DRAM dies.
Connect layers via TSV.
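Place the stack next to the GPU die on a silicon interposer, attached with micro‑bumps.
Typical configuration: 4–8 layers of 2 GB each (8–16 GB total) for HBM2‑class stacks; HBM3E stacks use 8–12 layers of 3 GB each (24–36 GB total).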
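The capacity arithmetic behind a stack is a simple product. The sketch below reproduces the 4- and 8-layer cases and then shows how several stacks add up on one accelerator; the eight-stack, 24 GB-per-stack configuration is an assumed example chosen to match a 192 GB total.

```python
def stack_capacity_gb(layers, gb_per_die):
    """Capacity of one HBM stack: number of DRAM dies times capacity per die."""
    return layers * gb_per_die


for layers in (4, 8):
    print(f"{layers} layers x 2 GB = {stack_capacity_gb(layers, 2)} GB")

# Several stacks sit beside one GPU; 8 stacks of 24 GB is an assumed example.
stacks, gb_per_stack = 8, 24
print(f"{stacks} stacks x {gb_per_stack} GB = {stacks * gb_per_stack} GB per accelerator")
```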
Additional related concepts
TPU – Tensor Processing Unit
Vendor: Google (custom).
Specialty: AI training + inference, optimized for TensorFlow and JAX workloads compiled through XLA; not a general‑purpose processor.
Representative: Google TPU v5p (used for Gemini large‑model training).
LPU – Language Processing Unit
Vendor: Groq.
Specialty: LLM inference with extremely low latency.
Features: keeps model weights in on‑chip SRAM instead of HBM, enabling very fast inference; SRAM capacity per chip is small, so a large model has to be spread across many chips, which makes deployments costly.
FPGA – Field‑Programmable Gate Array
Programmable hardware circuits, high flexibility.
Typical use cases: prototype verification, low‑latency trading, network acceleration.
Representatives: AMD Xilinx, Intel Altera.
DSP – Digital Signal Processor
Specialty: audio/video codec and communication signal processing.
Typical scenarios: phone call noise reduction, Bluetooth audio, 5G baseband.
Current status: functionality largely integrated into SoCs, no longer a separate component.
Unified Memory
Concept: CPU and GPU share the same physical memory pool.
Benefit: avoids copying data between CPU memory and GPU memory, which cuts latency and memory duplication.
Representatives: Apple Silicon (128 GB or more, depending on the chip tier), NVIDIA RTX Spark (128 GB).
Contrast: traditional systems have separate DDR for CPU and dedicated graphics memory, requiring PCIe transfers.
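The cost of the PCIe transfer in a traditional system can be estimated by dividing buffer size by link bandwidth. The sketch assumes roughly 64 GB/s in one direction for a PCIe 5.0 x16 link and ignores latency and protocol overhead, so real transfers are somewhat slower.

```python
def copy_time_ms(size_gb, link_gb_per_s):
    """Time to move a buffer across a link, ignoring latency and protocol overhead."""
    return size_gb / link_gb_per_s * 1000


# Assumption: a PCIe 5.0 x16 link moves roughly 64 GB/s in one direction.
PCIE_GB_PER_S = 64
for size_gb in (1, 16):
    print(f"{size_gb} GB over PCIe: ~{copy_time_ms(size_gb, PCIE_GB_PER_S):.0f} ms")
```

With unified memory the CPU and GPU address the same physical pool, so this transfer step drops out entirely.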
Chiplet
Concept: large chips are divided into multiple smaller dies that are manufactured separately and then packaged together.
Advantages: higher yield, lower cost, flexible composition.
Representatives: AMD Ryzen (CCD + IOD), Intel Meteor Lake.
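Vs SoC: a traditional SoC is a single monolithic die, while a chiplet design assembles several smaller dies in one package. The two are not mutually exclusive: a product can be sold as an SoC and still be built from chiplets, as with Intel Meteor Lake above.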
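The yield advantage of chiplets can be illustrated with the simple Poisson yield model, in which the chance of a defect-free die falls exponentially with its area. The defect density and die sizes below are illustrative assumptions, and the sketch ignores packaging cost and assembly losses.

```python
import math


def die_yield(area_cm2, defects_per_cm2):
    """Poisson yield model: probability that a die has zero defects."""
    return math.exp(-area_cm2 * defects_per_cm2)


DEFECT_DENSITY = 0.2  # defects per cm^2, an illustrative assumption

monolithic = die_yield(6.0, DEFECT_DENSITY)  # one 600 mm^2 die
chiplet = die_yield(1.5, DEFECT_DENSITY)     # one of four 150 mm^2 dies
print(f"monolithic die yield: {monolithic:.0%}")
print(f"single chiplet yield: {chiplet:.0%}")
# Silicon spent per good product, in cm^2, when bad dies are discarded individually.
print(f"monolithic: {6.0 / monolithic:.1f} cm^2, chiplets: {4 * 1.5 / chiplet:.1f} cm^2")
```

A defect ruins only the small die it lands on, so far less silicon is thrown away for the same total area.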
2026 mainstream chip comparison (selected highlights)
Apple M5 Pro – Laptop/desktop, ARM 12‑core CPU, 20‑core GPU, 18‑core NPU, 128 GB unified memory, SoC‑based.
Apple M5 Ultra – Desktop, ARM 32‑core CPU, 80‑core GPU, 64‑core NPU, 512 GB unified memory, SoC‑based.
NVIDIA RTX Spark – Laptop/mini‑PC, ARM 20‑core Grace CPU, 6144 CUDA cores, built‑in NPU, 128 GB unified memory, SoC‑based.
NVIDIA RTX 5090 – Desktop GPU, 21760 CUDA cores, 32 GB GDDR7, discrete GPU (not an SoC).
NVIDIA B200 – Data‑center accelerator, Blackwell GPU, 192 GB HBM3e, no CPU.
Snapdragon X Elite – Notebook, ARM 12‑core CPU, Adreno GPU, 45 TOPS NPU, 64 GB LPDDR5X, SoC‑based.
Intel Core Ultra 9 – Notebook, x86 24‑core CPU, Arc GPU, 48 TOPS NPU, 64 GB LPDDR5X, SoC‑based.
AMD Ryzen 9 9950X – Desktop, x86 16‑core CPU, Radeon GPU, 128 GB DDR5, built with chiplet architecture (not a SoC).
Key finding: the 2026 trend is clear. SoC + large unified memory + integrated NPU is becoming the mainstream architecture across Apple, NVIDIA, Qualcomm, Intel and others.
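For selection, a quick first filter is whether a model's weights fit in a chip's memory. The sketch below uses the memory capacities from the comparison above; the 70-billion-parameter model and the 80% usable-memory headroom are hypothetical, and real deployments also need room for activations, caches, and the operating system.

```python
# Memory capacities in GB as listed in the comparison above.
MEMORY_GB = {
    "Apple M5 Pro": 128,
    "Apple M5 Ultra": 512,
    "NVIDIA RTX Spark": 128,
    "NVIDIA RTX 5090": 32,
    "NVIDIA B200": 192,
    "Snapdragon X Elite": 64,
}


def weights_gb(parameters_billion, bits_per_parameter):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return parameters_billion * bits_per_parameter / 8


def chips_that_fit(parameters_billion, bits_per_parameter, headroom=0.8):
    """Chips whose memory holds the weights within a usable fraction (assumed 80%)."""
    need = weights_gb(parameters_billion, bits_per_parameter)
    return [name for name, gb in MEMORY_GB.items() if need <= gb * headroom]


print(weights_gb(70, 4))       # hypothetical 70B model at 4 bits per weight
print(chips_that_fit(70, 4))
print(chips_that_fit(70, 16))  # the same model at 16 bits per weight
```

Lower precision widens the set of chips that can hold a given model, which is one reason large unified memory and low-bit formats are advancing together.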
Lao Guo's Learning Space
AI learning, discussion, and hands‑on practice with self‑reflection
