Which NVIDIA GPU Wins for AI? Deep Dive into RTX & A‑Series Performance and Power

This article compares major NVIDIA GPUs, including the RTX 4090, RTX 4090 D, RTX 3090, A10, A40, A100, and H100, by memory size, memory bandwidth, Tensor-core BF16/FP16/FP32 throughput, standard FP16/FP32 throughput, power draw, and release date. It also explains how these specifications affect AI workload efficiency.

Architects' Tech Alliance

GPU Technical Comparison for Deep‑Learning and HPC

This summary presents a side‑by‑side technical comparison of NVIDIA GPUs commonly used for deep‑learning and high‑performance computing. For each model the following specifications are listed:

Memory capacity

Memory bandwidth

Tensor-core performance for BF16, FP16 and FP32 (TFLOPS); on these architectures the Tensor "FP32" figure is the TF32 rate

Standard FP16/FP32 performance (TFLOPS)

Maximum power consumption (W)

Release date

GPU Model            | Memory | Bandwidth | Tensor BF16/FP16/FP32 (TFLOPS) | FP16/FP32 (TFLOPS) | Power (W) | Release Date
---------------------|--------|-----------|--------------------------------|--------------------|-----------|-------------
NVIDIA GeForce RTX 4090          | 24 GB | 1008 GB/s | 165.2 / 165.2 / 82.58 | 82.58 / 82.58 | 450 | Sep 2022
NVIDIA GeForce RTX 4090 D        | 24 GB | 1008 GB/s | ~156 / 156 / 78 | 73.54 / 73.54 | 425 | Dec 2023
NVIDIA GeForce RTX 3090          | 24 GB | 936.2 GB/s | 71 / 71 / 35.58 | 35.58 / 35.58 | 350 | Sep 2020
NVIDIA A10                       | 24 GB | 600.2 GB/s | 125 / 125 / 62.5 | 23.44 / 31.2 | 150 | Feb 2022
NVIDIA A40 PCIe                  | 48 GB | 695.8 GB/s | 149.7 / 149.7 / 74.8 | 37.42 / 37.42 | 300 | Oct 2020
NVIDIA A100 PCIe                 | 80 GB | 1935 GB/s | 312 / 312 / 156 | 77.97 / 19.49 | 300 | Jun 2021
NVIDIA A100 SXM4                 | 80 GB | 2039 GB/s | – | 77.97 / 19.49 | 400 | Nov 2020
NVIDIA A800 PCIe                 | 80 GB | 2039 GB/s | 312 / 312 / 156 | 77.97 / 19.49 | 250 | Nov 2022
NVIDIA A800 SXM4                 | 80 GB | 2039 GB/s | – | 77.97 / 19.49 | 400 | Aug 2022
NVIDIA L20                       | 48 GB | 864 GB/s | 119.5 / 119.5 / 59.8 | 59.35 / 59.35 | 275 | Nov 2023
NVIDIA L40                       | 48 GB | 864 GB/s | 181.05 / 181.05 / 90.5 | 90.52 / 90.52 | 300 | Oct 2022
NVIDIA H100 SXM5                 | 80 GB | 3350 GB/s | 1979 / 1979 / 989 | 267.6 / 66.91 | 700 | Mar 2023
NVIDIA H100 PCIe                 | 80 GB | 2040 GB/s | 1513 / 1513 / 756 | 204.9 / 51.22 | 350 | Mar 2023
NVIDIA H100 NVL                  | 80 GB | 2040 GB/s | 3958 / 3958 / 1979 | 204.9 / 51.22 | 350 | Mar 2023

What is TFLOPS?

TFLOPS stands for “tera floating-point operations per second”, where “tera” denotes 10¹², so 1 TFLOPS is one trillion floating-point operations per second. It expresses the theoretical peak floating-point throughput of a GPU at a given precision (BF16, FP16, FP32). A higher value means more raw compute is available at that precision, not that every workload will reach it.
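
To make the unit concrete, the sketch below estimates a lower bound on the time a single matrix multiplication would take if a GPU sustained its peak rate. It assumes the usual count of 2·m·n·k floating-point operations for an (m × k) by (k × n) product. The function names are illustrative and do not come from the original article, and real kernels will run slower than this bound.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * n * k


def min_seconds_at_peak(flops: float, peak_tflops: float) -> float:
    """Lower bound on runtime, assuming the GPU sustains its peak TFLOPS (it rarely does)."""
    return flops / (peak_tflops * 1e12)


# Example: an 8192^3 matmul at 312 TFLOPS, the A100 PCIe Tensor BF16 figure from the table.
flops = matmul_flops(8192, 8192, 8192)
bound_ms = min_seconds_at_peak(flops, 312.0) * 1e3
print(f"{flops / 1e12:.2f} TFLOP of work, at least {bound_ms:.2f} ms at peak")
```

The same arithmetic applies to any row of the table: divide the work in FLOPs by the peak rate for the precision you intend to use.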

Power Consumption and Interface Considerations

Power envelope: SXM form factors (e.g., H100 SXM5) can draw up to 700 W, requiring robust power delivery and cooling. PCIe variants are limited to ~350 W, making them suitable for servers with tighter power budgets.

Performance gap: SXM cards typically provide higher raw compute and memory bandwidth (up to 3.35 TB/s on the H100 SXM5) and support NVLink with up to 900 GB/s of bidirectional bandwidth, which benefits multi-GPU scaling.

PCIe limitations: PCIe GPUs have lower memory bandwidth (around 2 TB/s for the H100 PCIe) and lack the full NVLink bandwidth. In the table above, the H100 PCIe peaks at 1513 Tensor BF16 TFLOPS versus 1979 for the H100 SXM5, roughly a quarter lower.

Interface compatibility: Both SXM and PCIe versions can be linked in multi‑GPU setups, but they differ in power draw, thermal requirements, and motherboard compatibility.
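
One common way to reason about how peak compute and memory bandwidth interact is the roofline model. It is not part of the original comparison and is used here only as an illustration. The sketch below computes the arithmetic intensity (FLOPs per byte moved) above which a kernel stops being limited by memory bandwidth, using the A100 PCIe row of the table as input. The function names are hypothetical.

```python
def ridge_point(peak_tflops: float, bandwidth_gb_s: float) -> float:
    """FLOPs per byte above which a kernel is compute-bound rather than memory-bound."""
    return (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)


def attainable_tflops(intensity: float, peak_tflops: float, bandwidth_gb_s: float) -> float:
    """Roofline estimate: the lower of peak compute and (intensity x bandwidth)."""
    memory_bound_tflops = intensity * bandwidth_gb_s * 1e9 / 1e12
    return min(peak_tflops, memory_bound_tflops)


# A100 PCIe from the table: 312 Tensor BF16 TFLOPS, 1935 GB/s.
a100_ridge = ridge_point(312.0, 1935.0)
print(f"A100 PCIe ridge point: {a100_ridge:.1f} FLOPs/byte")

# A kernel doing only 10 FLOPs per byte is capped by bandwidth, far below peak.
print(f"Attainable at 10 FLOPs/byte: {attainable_tflops(10.0, 312.0, 1935.0):.2f} TFLOPS")
```

Under this model, a card with higher bandwidth helps low-intensity kernels even when its peak TFLOPS is unchanged, which is one reason the bandwidth column matters alongside the compute columns.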

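When selecting a GPU, weigh the computational demands of your AI models against the available power and cooling infrastructure and the need for high-speed inter-GPU communication (NVLink). For workloads dominated by Tensor operations (BF16/FP16/FP32), the TFLOPS figures above give a first-order indication of relative training speed. They are theoretical peaks, and sustained throughput also depends on memory bandwidth and interconnect. Note also that the H100 Tensor figures in the table correspond to NVIDIA's sparsity-enabled peaks, while the A100 figures are dense, so the two are not directly comparable.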
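
The selection criteria above can be sketched as a small filter-and-rank helper. This is a minimal illustration, not a recommendation engine. The `Gpu` class and `shortlist` function are hypothetical names, the rows are a subset copied from the table, and ranking by Tensor BF16 TFLOPS per watt is only one of several reasonable choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    memory_gb: int
    tensor_bf16_tflops: float
    power_w: int


# Subset of rows copied from the comparison table.
GPUS = [
    Gpu("RTX 4090", 24, 165.2, 450),
    Gpu("A10", 24, 125.0, 150),
    Gpu("A40 PCIe", 48, 149.7, 300),
    Gpu("A100 PCIe", 80, 312.0, 300),
    Gpu("L40", 48, 181.05, 300),
]


def shortlist(gpus, min_memory_gb: int, max_power_w: int):
    """Keep GPUs that fit the memory and power constraints, best TFLOPS-per-watt first."""
    fits = [g for g in gpus if g.memory_gb >= min_memory_gb and g.power_w <= max_power_w]
    return sorted(fits, key=lambda g: g.tensor_bf16_tflops / g.power_w, reverse=True)


# Example: at least 48 GB of memory within a 300 W per-card budget.
for gpu in shortlist(GPUS, min_memory_gb=48, max_power_w=300):
    print(f"{gpu.name}: {gpu.tensor_bf16_tflops / gpu.power_w:.2f} TFLOPS/W")
```

Interconnect needs such as NVLink are not modeled here. In practice they would be an additional filter on form factor.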

Reference: https://juejin.cn/post/7428197475964272690
