
Inside Los Alamos’ Venado Supercomputer: Architecture, Performance, and HPC Trends

The Venado supercomputer at Los Alamos combines Nvidia Grace CPUs, Hopper GPUs, HPE Slingshot interconnects, and high memory bandwidth. Its Grace CPUs alone provide a 15.6‑petaflop FP64 peak, and the design illustrates the shifting balance between CPU and GPU workloads in modern high‑performance computing.

Architects' Tech Alliance

Jul 23, 2024


Background and Motivation

Los Alamos National Laboratory, founded in 1943 as part of the Manhattan Project, has long used increasingly powerful supercomputers for nuclear weapons stewardship and scientific simulation. Recent installations include the 1.47‑billion‑dollar Trinity system (2015), a Cray machine built on Intel Xeon and Xeon Phi processors with the Aries interconnect, and its successor Crossroads (2023), an HPE system built on Intel Xeon CPUs and the Slingshot network.

Venado System Overview

The newly commissioned Venado is an experimental machine, built on a limited budget, for hardware and software research. It is named after a peak in New Mexico's Sangre de Cristo range. Hewlett Packard Enterprise is the primary integrator, and the design deliberately avoids Nvidia's NVLink‑based pods, in which GPUs share memory over NVLink.

CPU/GPU Architecture and Interconnect Choice

Venado uses a mixed architecture: about 80 % of its FLOPs are expected to run on Nvidia Hopper GPUs and 20 % on Nvidia Grace CPUs. That split is unusual; in typical GPU‑accelerated HPC systems the GPUs deliver 95‑98 % of the FLOPs. The choice reflects workloads dominated by sparse, irregular memory accesses, which benefit more from high memory bandwidth per core than from raw compute throughput.

The system uses HPE Slingshot 11 at 200 Gb/s per port, a less expensive alternative to Nvidia Quantum‑2 InfiniBand at 400 Gb/s. The same Slingshot network also connects a Lustre parallel storage cluster, and the lab plans to evaluate DeltaFS and other file systems.
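
Link rates are quoted in gigabits per second, while memory bandwidth elsewhere in this article is in gigabytes per second. The short sketch below converts the two port speeds to GB/s and estimates an idealized transfer time; the 1 GB message size is an arbitrary illustration, and the model deliberately ignores latency, protocol overhead, and congestion.

```python
def gbps_to_gbytes_per_s(gbits_per_s: float) -> float:
    """Convert a link rate in Gb/s to GB/s (8 bits per byte)."""
    return gbits_per_s / 8.0


def ideal_transfer_seconds(message_gb: float, link_gbits_per_s: float) -> float:
    """Bandwidth-only transfer time: no latency, headers, or contention."""
    return message_gb / gbps_to_gbytes_per_s(link_gbits_per_s)


SLINGSHOT_11_GBPS = 200.0  # per port, as quoted for Venado
QUANTUM_2_GBPS = 400.0     # NDR InfiniBand, as quoted for comparison

if __name__ == "__main__":
    message_gb = 1.0  # hypothetical message size, for illustration only
    for name, rate in [("Slingshot 11", SLINGSHOT_11_GBPS),
                       ("Quantum-2", QUANTUM_2_GBPS)]:
        print(f"{name}: {gbps_to_gbytes_per_s(rate):.0f} GB/s, "
              f"{ideal_transfer_seconds(message_gb, rate) * 1000:.0f} ms per GB")
```

For scale, 25 GB/s per Slingshot port is less than a twentieth of the 546 GB/s a single Grace CPU draws from its local memory.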

Grace‑CPU Details

Each Grace CPU has 16 LPDDR5X memory controllers, delivering 546 GB/s of bandwidth and 512 GB of memory per CPU (the commercial version ships with 480 GB and 500 GB/s). In a Grace‑Grace superchip, two Grace CPUs are linked by a 900 GB/s NVLink chip‑to‑chip (C2C) connection; in a Grace‑Hopper superchip, the same link joins a Grace CPU to a Hopper GPU with 80‑96 GB of HBM3 (or 141 GB of HBM3E).
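
The memory‑bandwidth‑per‑core argument can be made concrete with the Grace figures above. This sketch assumes 72 Arm Neoverse V2 cores per Grace CPU (Nvidia's published core count, not a number from this article) and simply divides the two quoted capacity/bandwidth pairs by that core count.

```python
from dataclasses import dataclass

GRACE_CORES = 72  # assumption: Arm Neoverse V2 cores per Grace CPU


@dataclass
class GraceConfig:
    name: str
    memory_gb: float
    bandwidth_gb_s: float

    def bandwidth_per_core(self) -> float:
        """Peak memory bandwidth split evenly across all cores, in GB/s."""
        return self.bandwidth_gb_s / GRACE_CORES

    def memory_per_core(self) -> float:
        """Memory capacity per core, in GB."""
        return self.memory_gb / GRACE_CORES


CONFIGS = [
    GraceConfig("512 GB / 546 GB/s (as quoted)", memory_gb=512, bandwidth_gb_s=546),
    GraceConfig("480 GB / 500 GB/s (commercial)", memory_gb=480, bandwidth_gb_s=500),
]

if __name__ == "__main__":
    for cfg in CONFIGS:
        print(f"{cfg.name}: {cfg.bandwidth_per_core():.2f} GB/s per core, "
              f"{cfg.memory_per_core():.2f} GB per core")
```

Either way, each core gets roughly 7 GB/s of peak memory bandwidth when all cores stream at once; that per‑core figure, rather than the CPU's raw FLOP rate, is what the design rationale emphasizes.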

Performance Estimates

A rough 80/20 FLOP split would call for about 3,125 Grace‑Hopper nodes and 1,500 Grace‑Grace nodes; the final configuration has 2,560 Grace‑Hopper nodes and 920 Grace‑Grace nodes. Together these contain 316,800 Grace cores with a peak FP64 performance of 15.62 petaflops, 2 PB of LPDDR5X main memory, and 2.1 PB/s of aggregate CPU memory bandwidth.
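
The 316,800‑core figure follows from the final node counts if each Grace‑Hopper node carries one Grace CPU, each Grace‑Grace node carries two, and every Grace CPU has 72 cores. Those per‑node counts reflect Nvidia's superchip designs and are assumptions here, not figures from the article. The sketch also divides the quoted 15.62‑petaflop peak by the core count to show the implied per‑core FP64 rate.

```python
CORES_PER_GRACE = 72          # assumption: Neoverse V2 cores per Grace CPU
GRACE_PER_GH_NODE = 1         # assumption: Grace-Hopper = 1 CPU + 1 GPU
GRACE_PER_GG_NODE = 2         # assumption: Grace-Grace = 2 CPUs

GH_NODES = 2_560              # final configuration, from the article
GG_NODES = 920                # final configuration, from the article
CPU_PEAK_FP64_PFLOPS = 15.62  # quoted CPU-side FP64 peak


def grace_cpu_count(gh_nodes: int, gg_nodes: int) -> int:
    """Total Grace CPUs across both node types."""
    return gh_nodes * GRACE_PER_GH_NODE + gg_nodes * GRACE_PER_GG_NODE


def grace_core_count(gh_nodes: int, gg_nodes: int) -> int:
    """Total Grace cores across both node types."""
    return grace_cpu_count(gh_nodes, gg_nodes) * CORES_PER_GRACE


if __name__ == "__main__":
    cpus = grace_cpu_count(GH_NODES, GG_NODES)
    cores = grace_core_count(GH_NODES, GG_NODES)
    gflops_per_core = CPU_PEAK_FP64_PFLOPS * 1e6 / cores  # 1 PF = 1e6 GF
    print(f"Grace CPUs: {cpus:,}")    # 4,400
    print(f"Grace cores: {cores:,}")  # 316,800
    print(f"Implied FP64 peak per core: {gflops_per_core:.1f} GFLOPS")
```

About 49 GFLOPS per core is consistent with a Neoverse V2 core issuing 16 FP64 FLOPs per cycle at roughly 3 GHz, but that is an inference, not a figure from the article.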

On the GPU side, the 2,560 Hopper GPUs deliver 85.76 petaflops of FP64 from their vector units and 171.52 petaflops from their tensor cores; counting tensor cores, the GPUs supply about 92 % of the system's FP64 peak. Each Hopper GPU carries 96 GB of HBM3, giving the cluster about 240 TB of GPU memory and 9.75 PB/s of aggregate bandwidth. Roughly 81 % of the system's total memory bandwidth sits on the GPUs and 19 % on the Grace CPUs.
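
The GPU‑side totals can be cross‑checked against per‑device figures. This sketch divides the quoted aggregates by 2,560 GPUs and recomputes the FLOP and bandwidth shares; it uses only numbers stated in this article, and because the 2.1 PB/s CPU bandwidth is itself a rounded total, the shares are approximate.

```python
N_GPUS = 2_560
GPU_HBM_GB = 96

GPU_VECTOR_PF = 85.76   # FP64 peak, vector units
GPU_TENSOR_PF = 171.52  # FP64 peak, tensor cores
CPU_PF = 15.62          # FP64 peak, all Grace cores

GPU_BW_PB_S = 9.75      # aggregate HBM bandwidth
CPU_BW_PB_S = 2.1       # aggregate CPU memory bandwidth (rounded)


def share(part: float, other: float) -> float:
    """Fraction of the combined total contributed by `part`."""
    return part / (part + other)


if __name__ == "__main__":
    print(f"Per GPU: {GPU_VECTOR_PF * 1000 / N_GPUS:.1f} TF vector, "
          f"{GPU_TENSOR_PF * 1000 / N_GPUS:.1f} TF tensor, "
          f"{GPU_BW_PB_S * 1000 / N_GPUS:.2f} TB/s HBM")
    hbm_gb = N_GPUS * GPU_HBM_GB
    print(f"HBM capacity: {hbm_gb:,} GB, i.e. {hbm_gb / 1024:.0f} TB in binary units")
    print(f"GPU share of FP64 peak (vector): {share(GPU_VECTOR_PF, CPU_PF):.1%}")
    print(f"GPU share of FP64 peak (tensor): {share(GPU_TENSOR_PF, CPU_PF):.1%}")
    print(f"GPU share of memory bandwidth: {share(GPU_BW_PB_S, CPU_BW_PB_S):.1%}")
```

Dividing back gives 33.5 and 67 TFLOPS per GPU, which are H100‑class peak FP64 ratings, so both GPU totals are peak figures. With the rounded bandwidth totals, the GPU share of memory bandwidth comes out near 82 %, close to the roughly four‑fifths quoted above.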

Implications for HPC Workloads

The Venado design highlights the importance of memory bandwidth per core and cost‑effective interconnects for workloads dominated by sparse, irregular data access patterns, such as large‑scale multiphysics simulations that can require months of runtime on half a machine. By balancing CPU memory bandwidth with GPU compute power, Venado aims to improve performance‑per‑dollar for these demanding applications.
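
A simple roofline model shows why memory bandwidth, rather than peak FLOPs, limits sparse and irregular codes: attainable throughput is the minimum of the compute peak and arithmetic intensity (FLOPs per byte moved) times memory bandwidth. The sketch applies it to the system‑wide CPU and GPU totals quoted earlier, using vector FP64 peaks only. The intensity of 0.2 FLOP/byte is a hypothetical value for a bandwidth‑bound sparse kernel, not a measured Venado figure.

```python
from typing import NamedTuple


class Partition(NamedTuple):
    name: str
    peak_pflops: float  # FP64 vector peak, PFLOP/s
    bw_pb_s: float      # aggregate memory bandwidth, PB/s

    def machine_balance(self) -> float:
        """FLOPs the partition can issue per byte of memory traffic."""
        return self.peak_pflops / self.bw_pb_s

    def attainable(self, intensity: float) -> float:
        """Roofline bound in PFLOP/s for a kernel with `intensity` FLOP/byte."""
        return min(self.peak_pflops, intensity * self.bw_pb_s)


PARTITIONS = [
    Partition("Grace CPUs", peak_pflops=15.62, bw_pb_s=2.1),
    Partition("Hopper GPUs", peak_pflops=85.76, bw_pb_s=9.75),
]

SPARSE_INTENSITY = 0.2  # hypothetical FLOP/byte for a sparse, irregular kernel

if __name__ == "__main__":
    for p in PARTITIONS:
        bound = p.attainable(SPARSE_INTENSITY)
        print(f"{p.name}: balance {p.machine_balance():.1f} FLOP/byte, "
              f"sparse bound {bound:.2f} PF ({bound / p.peak_pflops:.0%} of peak)")
```

At such low intensity both partitions run at a few percent of peak and delivered throughput tracks bandwidth, not FLOPs. The Grace partition has the lower machine balance (about 7.4 versus 8.8 FLOP/byte), so it loses a slightly smaller fraction of its peak, in line with the article's emphasis on memory bandwidth per core.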

Current Status

According to system leader Gary Grider, Venado has been installed and is operational, with acceptance testing expected within two months and a full suite of applications slated for deployment by July.

Source: Semiconductor Industry Observation



