Analysis of Advanced High‑Performance Processors for Exascale Computing: Fujitsu A64FX, NVIDIA H100, AMD MI250X, and Intel Ponte Vecchio
This article examines four leading exascale‑grade high‑performance processors (Fujitsu A64FX, NVIDIA H100, AMD MI250X, and Intel Ponte Vecchio), detailing their core architectures, compute resources, memory hierarchies, specialized accelerators, process technologies, performance metrics, and trends to inform future domestic processor development.
The commercial high‑performance computing processor market is dominated by NVIDIA, AMD, and Intel. For exascale (E‑level) computing, AMD’s Instinct MI250X achieves a double‑precision peak of 95.7 TFLOPS, while the latest NVIDIA and Intel processors also reach several tens of TFLOPS.
1. Fujitsu A64FX
Released in 2018 for Japan’s Post‑K (later Fugaku) supercomputer, the A64FX is the processor behind a system that integrates 158,976 A64FX nodes, delivering a peak performance of 0.537 EFLOPS and a measured Linpack result of 0.442 EFLOPS (≈82 % efficiency). Each chip is organized into four CMGs (Core Memory Groups); each CMG contains 13 homogeneous cores (12 compute, 1 auxiliary) and 8 GB of HBM2, for an aggregate memory bandwidth of 1024 GB/s per chip. The chip uses TSMC 7 nm and CoWoS packaging, contains 8.786 billion transistors, runs at 2.2 GHz, peaks at 3.379 TFLOPS in double precision, and consumes 200 W.
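The A64FX figures can be cross-checked with a little arithmetic. The sketch below is a rough consistency check, not vendor data: it assumes four CMGs per chip, 32 double-precision FLOPs per core per cycle (two 512-bit FMA pipelines), and that the 158,976 system-level count refers to A64FX chips. Under those assumptions the per-chip peak, the system peak, and the Linpack efficiency all land on the numbers quoted above.

```python
# Back-of-the-envelope check of the A64FX and Fugaku peak figures.
CMGS_PER_CHIP = 4            # assumption: four CMGs per chip
COMPUTE_CORES_PER_CMG = 12   # from the text; the auxiliary core is excluded
CLOCK_HZ = 2.2e9             # from the text
FP64_FLOPS_PER_CYCLE = 32    # assumption: 2 FMA pipes x 8 FP64 lanes x 2 ops
CHIPS_IN_SYSTEM = 158_976    # assumption: the system-level count refers to chips


def chip_peak_tflops() -> float:
    """Peak FP64 throughput of one chip, in TFLOPS."""
    cores = CMGS_PER_CHIP * COMPUTE_CORES_PER_CMG
    return cores * CLOCK_HZ * FP64_FLOPS_PER_CYCLE / 1e12


def system_peak_eflops() -> float:
    """Peak FP64 throughput of the whole system, in EFLOPS."""
    return CHIPS_IN_SYSTEM * chip_peak_tflops() / 1e6


def linpack_efficiency(rmax_eflops: float, rpeak_eflops: float) -> float:
    """Ratio of measured Linpack performance to theoretical peak."""
    return rmax_eflops / rpeak_eflops


if __name__ == "__main__":
    print(f"chip peak:   {chip_peak_tflops():.3f} TFLOPS")
    print(f"system peak: {system_peak_eflops():.3f} EFLOPS")
    print(f"efficiency:  {linpack_efficiency(0.442, 0.537):.1%}")
```

The fact that 158,976 multiplied by the per-chip peak reproduces 0.537 EFLOPS is what suggests the count is chips rather than cores.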
2. NVIDIA H100
The H100 GPU, launched in March 2022, is built on the Hopper architecture, which extends the Ampere design. It adds fourth‑generation Tensor Cores, new DPX instructions for dynamic programming, a higher SM count, thread block clusters, a TMA engine for asynchronous data transfer, a custom Transformer Engine, and updated HBM3, PCIe 5.0, and NVLink interfaces. The H100 SXM5 configuration enables 132 SMs (8 GPCs), 16,896 CUDA cores, 528 Tensor Cores, 50 MB of L2 cache, and five 16 GB HBM3 modules (80 GB total) with 3 TB/s of memory bandwidth. It is fabricated on TSMC 4N with about 80 billion transistors, delivering a double‑precision peak of 60 TFLOPS at 1.776 GHz with a 700 W TDP.
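The H100 resource counts follow from per-SM figures. The sketch below is a minimal consistency check that assumes 128 FP32 CUDA cores and 4 Tensor Cores per SM; those per-SM values are not stated in the text and are labeled as assumptions in the code.

```python
# Consistency check of the H100 per-chip resource counts.
SM_COUNT = 132                # from the text
CUDA_CORES_PER_SM = 128       # assumption: FP32 cores per Hopper SM
TENSOR_CORES_PER_SM = 4       # assumption: Tensor Cores per Hopper SM
HBM3_STACKS = 5               # from the text
GB_PER_STACK = 16             # from the text


def h100_totals() -> dict:
    """Derive chip-level totals from the per-SM and per-stack figures."""
    return {
        "cuda_cores": SM_COUNT * CUDA_CORES_PER_SM,
        "tensor_cores": SM_COUNT * TENSOR_CORES_PER_SM,
        "hbm_gb": HBM3_STACKS * GB_PER_STACK,
    }


if __name__ == "__main__":
    for name, value in h100_totals().items():
        print(f"{name}: {value}")
```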
3. AMD MI250X
AMD split its GPU line into RDNA (graphics) and CDNA (compute); the MI250X is the flagship CDNA 2 processor, released in November 2021 and used in the Frontier exascale system. The chip integrates two MI200 GCDs (Graphics Compute Dies) via Infinity Fabric. Each GCD contains four Compute Engines with 27 or 28 Compute Units each, for 110 CUs per GCD and 220 CUs in total. The package also provides 16 MB of L2 cache, eight 16 GB HBM2E modules (128 GB, 3.2 TB/s bandwidth), and up to six Infinity Fabric link/PCIe 4.0 interfaces. Fabricated on TSMC N6, it contains 58.2 billion transistors, runs at up to 1.7 GHz, peaks at 95.7 TFLOPS (double‑precision), and draws 560 W.
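The 95.7 TFLOPS figure can be reproduced from the CU count and clock. The sketch below assumes 256 FP64 FLOPs per clock per CU for matrix operations and 128 for vector operations; these per-CU rates are not given in the text and should be read as assumptions. Under them, the quoted peak corresponds to the matrix rate, while the vector rate is about half.

```python
# Rough derivation of the MI250X double-precision peak.
TOTAL_CUS = 220                      # from the text (two GCDs)
CLOCK_HZ = 1.7e9                     # from the text (peak clock)
FP64_MATRIX_FLOPS_PER_CLK_CU = 256   # assumption: matrix-core rate per CU
FP64_VECTOR_FLOPS_PER_CLK_CU = 128   # assumption: vector-unit rate per CU


def peak_tflops(flops_per_clk_per_cu: int) -> float:
    """Peak throughput in TFLOPS for a given per-CU rate."""
    return TOTAL_CUS * CLOCK_HZ * flops_per_clk_per_cu / 1e12


if __name__ == "__main__":
    print(f"FP64 matrix peak: {peak_tflops(FP64_MATRIX_FLOPS_PER_CLK_CU):.1f} TFLOPS")
    print(f"FP64 vector peak: {peak_tflops(FP64_VECTOR_FLOPS_PER_CLK_CU):.1f} TFLOPS")
```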
4. Intel Ponte Vecchio
Intel’s Ponte Vecchio, announced in August 2021 and shipping in early 2023, targets the Aurora exascale system. It uses the Xe‑HPC architecture with two stacks, eight slices in total each containing 16 Xe cores (128 cores overall), 144 MB of shared L2 cache, eight HBM2E modules (>5 TB/s bandwidth), 16 Xe Link ports (>2 TB/s), and PCIe 5.0. Implemented with a mix of TSMC 5 nm, TSMC 7 nm, and Intel 7 processes, the chip comprises over 100 billion transistors, runs at 1.373 GHz, and exceeds 45 TFLOPS peak performance.
5. Summary
All four processors are built on 7 nm or more advanced processes, feature high transistor densities, and employ advanced 2.5D/3D packaging to integrate high‑bandwidth HBM memory. A64FX runs at 2.2 GHz, while H100, MI250X, and Ponte Vecchio operate at roughly 1.8 GHz or lower. Except for A64FX (below 10 TFLOPS), the others exceed 45 TFLOPS, with MI250X reaching nearly 100 TFLOPS. Power consumption of H100 and MI250X exceeds 500 W; Ponte Vecchio likely does as well. These architectures illustrate current trends in exascale processor design and provide reference points for domestic high‑performance processor development.
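One way to put the summary figures side by side is to derive energy efficiency (GFLOPS per watt) and memory balance (bytes of bandwidth per FLOP) from the numbers quoted in the sections above. The sketch below is illustrative only: the `Chip` class and its field names are hypothetical, the A64FX bandwidth is treated as a chip-level aggregate, and Ponte Vecchio is left out because the text gives only lower bounds for its performance and no power figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Hypothetical record holding the per-chip figures quoted in the article."""
    name: str
    peak_tflops: float   # peak double-precision throughput
    mem_bw_gbs: float    # memory bandwidth in GB/s
    power_w: float       # power draw in watts

    def gflops_per_watt(self) -> float:
        return self.peak_tflops * 1000 / self.power_w

    def bytes_per_flop(self) -> float:
        return self.mem_bw_gbs / (self.peak_tflops * 1000)


CHIPS = [
    Chip("A64FX", 3.379, 1024, 200),   # bandwidth assumed to be per chip
    Chip("H100", 60.0, 3000, 700),
    Chip("MI250X", 95.7, 3200, 560),
]

if __name__ == "__main__":
    for chip in CHIPS:
        print(f"{chip.name:7s} {chip.gflops_per_watt():7.1f} GFLOPS/W "
              f"{chip.bytes_per_flop():.3f} B/FLOP")
```

On these figures the GPUs are far more energy-efficient at peak, while A64FX offers several times more memory bandwidth per FLOP, which is consistent with its high Linpack efficiency.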
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.