
What Do GPU Core Specs Really Mean? A Deep Dive into Modern GPU Performance

This article provides a comprehensive analysis of GPU core parameters—including compute units, memory systems, floating‑point performance, power consumption, and manufacturing process—while comparing leading international and domestic GPU products to help readers choose the right accelerator for AI, HPC, or graphics workloads.

Architects' Tech Alliance

May 20, 2025

GPU Core Parameter System Analysis

Basic Compute Units

CUDA cores / Stream Processors are the fundamental compute units that execute graphics and general-purpose tasks. NVIDIA calls them CUDA cores; AMD calls them stream processors. More cores mean more parallel processing capability, but per-core efficiency varies across architectures and generations, so core counts are only directly comparable within the same architecture.

Tensor cores are specialized units that NVIDIA introduced with the Volta architecture to accelerate the matrix multiplications at the heart of deep learning. They enable high-throughput AI training and inference. On A100, for example, the TF32 format lets FP32 workloads run on Tensor cores at several times standard FP32 throughput with minimal precision loss.
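To make the "minimal precision loss" of TF32 concrete, the sketch below approximates the format in plain Python. TF32 keeps float32's 8-bit exponent but only 10 of its 23 mantissa bits. The helper names are hypothetical, and simple truncation is an assumption made for illustration; real hardware may round differently.

```python
import struct


def truncate_to_tf32(x: float) -> float:
    """Approximate TF32 by clearing the low 13 mantissa bits of a float32.

    Assumption: plain truncation, used here only to illustrate the format.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Keep sign (1 bit), exponent (8 bits) and the top 10 mantissa bits.
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def relative_error(x: float) -> float:
    """Relative error introduced by the truncation above."""
    return abs(truncate_to_tf32(x) - x) / abs(x)


for value in (1.0, 3.14159, 1234.5678):
    print(value, truncate_to_tf32(value), f"{relative_error(value):.2e}")
```

With truncation, the relative error stays below 2^-10 (about 0.1%), which is the sense in which the loss is small for many deep-learning workloads.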

RT cores handle real-time ray-tracing calculations and are found mainly in consumer and workstation GPUs. Compute-focused data-center GPUs such as A100 and H100 omit RT cores to concentrate on general compute performance.

Memory System

Memory capacity determines how much data (model weights, activations, and batches) a GPU can hold at once. Large models such as DeepSeek LLM-67B benefit from high-capacity GPUs (e.g., A100 with 80 GB HBM2e, compared with 24 GB GDDR6X on RTX 4090).
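A rough way to see why capacity matters is to estimate the memory needed just for model weights. The sketch below assumes a 67-billion-parameter model and counts weights only; activations, KV cache, and optimizer state would add to this. The function names and the bytes-per-parameter table are illustrative assumptions.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus_for_weights(num_params: float, precision: str, gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(num_params, precision) / gpu_memory_gb)


params = 67e9  # assumed parameter count for a 67B model
print(weight_memory_gb(params, "fp16"))          # 134.0
print(min_gpus_for_weights(params, "fp16", 80))  # 2 (80 GB cards)
print(min_gpus_for_weights(params, "fp16", 24))  # 6 (24 GB cards)
```

Even before accounting for activations, FP16 weights for a model of this size do not fit on a single 80 GB card.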

Memory bandwidth measures the data transfer rate between GPU and memory, expressed in GB/s. High bandwidth is crucial for AI training; HBM2e on A100 reaches ~2 TB/s, while GDDR6X on RTX 4090 offers ~1 TB/s.
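Peak bandwidth follows from the memory bus width and the per-pin data rate. As a minimal sketch, the ~1 TB/s figure for RTX 4090 can be reproduced from its published 384-bit bus and 21 Gbps GDDR6X data rate; those two inputs are not stated in this article and are treated here as assumptions.

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, data_rate_gbps_per_pin: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    Each pin transfers data_rate_gbps_per_pin gigabits per second;
    dividing by 8 converts bits to bytes.
    """
    return bus_width_bits * data_rate_gbps_per_pin / 8


# Assumed RTX 4090 memory configuration: 384-bit bus, 21 Gbps per pin.
print(peak_bandwidth_gb_per_s(384, 21.0))  # 1008.0
```

The result is a theoretical ceiling; achieved bandwidth in real workloads is lower.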

Memory type influences bandwidth and energy efficiency. Common types include:

GDDR6/GDDR6X – widely used in consumer GPUs (e.g., RTX 4090).

HBM2e/HBM3/HBM3e – high-bandwidth, high-cost solutions for data-center GPUs (e.g., HBM2e on A100, HBM3e on Blackwell Ultra).

GDDR7 – the newest generation, used in the RTX 50 series.

Compute Performance Metrics

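Floating-point performance (TFLOPS) is the primary indicator of raw compute power and is quoted separately for each precision (FP64, FP32, FP16). RTX 4090 delivers ~82.58 TFLOPS FP32 on its CUDA cores, while A100 reaches 312 TFLOPS FP16 on its Tensor cores; figures quoted at different precisions are not directly comparable.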
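Peak FP32 throughput is usually derived from the core count and clock speed, counting a fused multiply-add as two floating-point operations per core per cycle. The sketch below reproduces the RTX 4090 figure under that convention. The 2.52 GHz boost clock is an assumption taken from the card's published specifications, not from this article.

```python
def peak_fp32_tflops(cores: int, boost_clock_ghz: float, flops_per_core_per_cycle: int = 2) -> float:
    """Theoretical peak FP32 throughput in TFLOPS.

    cores * GHz gives billions of core-cycles per second; each cycle is
    assumed to retire one fused multiply-add (2 FLOPs). Dividing by 1000
    converts GFLOPS to TFLOPS.
    """
    return cores * boost_clock_ghz * flops_per_core_per_cycle / 1000


# RTX 4090: 16,384 CUDA cores at an assumed 2.52 GHz boost clock.
print(f"{peak_fp32_tflops(16_384, 2.52):.2f}")  # 82.58
```

This is a ceiling that assumes every core issues an FMA on every cycle, which real workloads rarely sustain.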

Power (TDP), or thermal design power, indicates how much heat the cooling system must dissipate under sustained load, which drives cooling requirements and deployment density. High-performance GPUs like RTX 4090 draw up to 450 W, whereas inference-optimized cards such as NVIDIA T4 operate at only 70 W.
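The link between TDP and deployment density can be sketched with simple arithmetic. The 3,000 W budget below is a hypothetical figure chosen for illustration, and the calculation ignores host CPUs, fans, and power-supply losses.

```python
def cards_within_budget(power_budget_w: int, tdp_w: int) -> int:
    """How many cards fit in a power budget, counting GPU TDP only."""
    return power_budget_w // tdp_w


BUDGET_W = 3000  # hypothetical GPU power budget for one enclosure

print(cards_within_budget(BUDGET_W, 450))  # 6  (450 W cards such as RTX 4090)
print(cards_within_budget(BUDGET_W, 70))   # 42 (70 W cards such as T4)
```

In practice, slot count and airflow limit density long before a budget like this is exhausted by 70 W cards, but the ratio shows why low-TDP parts suit dense inference deployments.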

Manufacturing process (nm) impacts transistor density and efficiency. Moving from 7 nm (A100, Ampere) to a 4 nm-class process (Blackwell) provides higher performance per watt.

International Mainstream GPU Product Parameters

Data‑Center GPUs

NVIDIA A100 (Ampere) remains a benchmark for AI training with 80 GB HBM2e, 2 TB/s bandwidth, and 312 TFLOPS FP16. It supports NVLink (600 GB/s) for multi‑GPU scaling.

NVIDIA H100 (Hopper) upgrades to 80 GB HBM3 and 3.35 TB/s bandwidth, introduces the Transformer Engine for automatic FP8/FP16 switching, and offers 2–3× speedup over A100 for large‑scale models.

Blackwell series (2025) includes B100/B200/B300 variants; the flagship B300 features 288 GB HBM3e and up to 15 PFLOPS of low-precision compute (NVIDIA quotes this peak for FP4 rather than FP8), delivering ~3.75× the performance of H100.
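Taking the figures quoted above at face value, a few ratios show where each generation gains. This is a minimal sketch: the dictionary simply restates the article's numbers, and the implied H100 baseline is back-calculated from the ~3.75× claim rather than taken from a datasheet.

```python
# Figures as quoted in this article.
SPECS = {
    "A100": {"memory_gb": 80, "bandwidth_tb_s": 2.0},
    "H100": {"memory_gb": 80, "bandwidth_tb_s": 3.35},
    "B300": {"memory_gb": 288, "peak_pflops": 15.0},
}

# Memory bandwidth gain from A100 to H100.
bandwidth_gain = SPECS["H100"]["bandwidth_tb_s"] / SPECS["A100"]["bandwidth_tb_s"]

# Memory capacity gain from H100 to B300.
capacity_gain = SPECS["B300"]["memory_gb"] / SPECS["H100"]["memory_gb"]

# H100 peak implied by "15 PFLOPS is ~3.75x H100".
implied_h100_pflops = SPECS["B300"]["peak_pflops"] / 3.75

print(f"H100 vs A100 bandwidth: {bandwidth_gain:.3f}x")        # 1.675x
print(f"B300 vs H100 capacity:  {capacity_gain:.1f}x")          # 3.6x
print(f"Implied H100 baseline:  {implied_h100_pflops:.1f} PFLOPS")  # 4.0 PFLOPS
```

The bandwidth gain alone (about 1.7×) is smaller than the 2–3× speedup cited for H100, which suggests that much of the improvement comes from lower-precision formats such as FP8 rather than from memory speed.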

NVIDIA T4 is an inference‑focused, low‑power GPU (70 W, 16 GB GDDR6) suitable for dense 1U deployments, offering 130 TOPS INT8 performance.

Consumer & Workstation GPUs

RTX 4090 (Ada Lovelace) provides 16,384 CUDA cores, 24 GB GDDR6X, and 82.58 TFLOPS FP32. While primarily a gaming card, its compute power makes it viable for AI inference and medium-scale model fine-tuning.

RTX 6000 Ada (professional) offers 18,176 CUDA cores, 48 GB ECC GDDR6, 91.1 TFLOPS FP32, and enhanced driver stability for AI workloads. Unlike its Ampere predecessor, it does not support NVLink, so multi-GPU setups communicate over PCIe.

RTX 4000 Ada targets entry-level AI and workstation use with 20 GB of ECC memory, a 130 W TDP, and the same Ada Lovelace architecture as the RTX 6000 Ada.

Architecture Evolution

From Ampere to Hopper to Blackwell, NVIDIA's data-center GPU architectures have steadily improved performance, bandwidth, and energy efficiency. Successive generations have added lower-precision formats (FP8 with Hopper, FP4 with Blackwell) and faster interconnects (e.g., PCIe 5.0 and NVLink 4.0 on Hopper).

Domestic GPU Product Parameters and Performance Comparison

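Huawei Ascend 910B (Da Vinci architecture, 7 nm) delivers 376 TFLOPS FP16 at a 350 W TDP, with a quoted 400 GB/s of HBM bandwidth. Its peak FP16 compute is comparable to NVIDIA A100's 312 TFLOPS, although the quoted bandwidth is well below A100's ~2 TB/s.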

Cambricon MLU590 (MLUv02) offers 314 TFLOPS FP16, 80 GB memory, 2 TB/s bandwidth, and flexible clustering for cloud and edge deployments.

HaiGuang (Hygon) K100 series includes AI and standard variants; the AI version provides 196 TFLOPS FP16, 64 GB memory, and 896 GB/s bandwidth, with ROCm compatibility to ease migration of CUDA code.

TianShu ZhiXin (Iluvatar CoreX) offers two lines: TianGai 100 (training, 147 TFLOPS FP16/BF16) and ZhiKai 100 (inference, 200 TFLOPS, 150 W). Both are positioned as compatible with the CUDA ecosystem.

Performance Parameter Comparison

Compute performance: Among the domestic parts, Ascend 910B leads with 376 TFLOPS FP16, followed by MLU590 (314 TFLOPS). Both peak figures are in the same range as A100's 312 TFLOPS FP16, so domestic GPUs approach flagship international levels in specific scenarios.

Memory system: MLU590's 80 GB of HBM and 2 TB/s bandwidth match A100's 80 GB HBM2e configuration, while the K100 series pairs 64 GB of memory with 896 GB/s bandwidth for memory-intensive tasks.
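One way to relate the compute and memory figures is to divide peak FLOPS by memory bandwidth. The result is roughly how many floating-point operations a kernel must perform per byte fetched before compute, rather than memory, becomes the limit. The sketch below uses the FP16 and bandwidth figures quoted in this article; the function name is hypothetical and the interpretation is a simplification.

```python
# (peak FP16 TFLOPS, memory bandwidth in GB/s) as quoted in this article.
CHIPS = {
    "A100": (312, 2000),
    "MLU590": (314, 2000),
    "K100 AI": (196, 896),
}


def flops_per_byte(tflops: float, bandwidth_gb_s: float) -> float:
    """Peak FLOPs available per byte of memory traffic."""
    return (tflops * 1e12) / (bandwidth_gb_s * 1e9)


for name, (tflops, bandwidth) in CHIPS.items():
    print(f"{name}: {flops_per_byte(tflops, bandwidth):.2f} FLOPs per byte")
# A100: 156.00, MLU590: 157.00, K100 AI: 218.75
```

A higher ratio means the chip has less bandwidth relative to its compute, so memory-bound workloads will sit further from the peak TFLOPS figure.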

Power efficiency: Domestic solutions range from high-power (350 W) designs focused on peak performance to low-power (150 W) inference-optimized chips such as ZhiKai 100, which offer competitive performance per watt.
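Performance per watt can be estimated directly from the quoted peak throughput and TDP. The sketch below uses the article's figures for Ascend 910B and ZhiKai 100. Note that the article does not state the precision of ZhiKai 100's 200 TFLOPS figure, so the two results are not strictly comparable.

```python
def tflops_per_watt(peak_tflops: float, tdp_w: float) -> float:
    """Peak throughput divided by thermal design power."""
    return peak_tflops / tdp_w


ascend_910b = tflops_per_watt(376, 350)  # FP16 figure
zhikai_100 = tflops_per_watt(200, 150)   # precision not stated in the article

print(f"Ascend 910B: {ascend_910b:.3f} TFLOPS/W")  # 1.074
print(f"ZhiKai 100:  {zhikai_100:.3f} TFLOPS/W")   # 1.333
```

Both values are peak-over-TDP ratios; measured efficiency depends on how close a workload gets to peak and on actual power draw.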

Understanding these core parameters enables informed decisions when matching GPUs to workloads—AI training favors high memory capacity and bandwidth, inference benefits from energy‑efficient designs, and graphics rendering relies on strong CUDA and RT core capabilities.
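As a closing illustration, the matching process can be sketched as a filter over a small catalog. The catalog restates memory and bandwidth figures quoted in this article (RTX 4090's ~1 TB/s is entered as 1000 GB/s); the class, function, and thresholds are hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    memory_gb: int
    bandwidth_gb_s: int


# Figures as quoted in this article.
CATALOG = [
    Gpu("A100", 80, 2000),
    Gpu("H100", 80, 3350),
    Gpu("RTX 4090", 24, 1000),
    Gpu("MLU590", 80, 2000),
    Gpu("K100 AI", 64, 896),
]


def candidates(min_memory_gb: int, min_bandwidth_gb_s: int) -> list[str]:
    """Names of GPUs meeting both minimum requirements, in catalog order."""
    return [
        gpu.name
        for gpu in CATALOG
        if gpu.memory_gb >= min_memory_gb and gpu.bandwidth_gb_s >= min_bandwidth_gb_s
    ]


# Hypothetical training requirement: at least 64 GB and 1.5 TB/s.
print(candidates(64, 1500))  # ['A100', 'H100', 'MLU590']
```

A real selection would also weigh power, cost, interconnect, and software ecosystem, which this filter leaves out.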
