Performance Evaluation of Transformer Models on the Inspur NF5488A5 GPU Server
This article presents a detailed benchmark of four Transformer models of varying sizes trained on the high‑end Inspur NF5488A5 GPU server, compares its NVSwitch‑based interconnect with a PCIe‑based system, and analyzes the impact of model scale, tensor parallelism, and hardware bandwidth on training efficiency.
Transformer architectures have become central to deep learning, especially in natural language processing, with models such as BERT, GPT, and large-scale variants like GPT-3 and the Chinese model Yuan 1.0 (源1.0) reaching hundreds of billions of parameters.
The testbed is an industry-leading GPU server, the Inspur NF5488A5, equipped with dual AMD EPYC 7742 CPUs and eight NVIDIA A100 SXM4 GPUs (40 GB each), providing 320 GB of total GPU memory, 5 PFLOPS of FP16 compute, and 16.312 TB/s of aggregate memory bandwidth, making it well suited to Transformer training.
Four GPT‑2‑style Transformer models (A, B, C, D) with hidden sizes from 1920 to 4096, attention heads from 15 to 32, and parameter counts from 1.16 B to 86.7 B were built using the Megatron‑LM framework with activation checkpointing and a sequence length of 1024. Batch size was fixed at 16 for all tests.
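The parameter counts above follow from the hidden size and layer count in the usual GPT-2-style way. The article does not list layer counts, so the 24-layer figure below is an illustrative assumption that happens to reproduce model A's 1.16 B total; the estimator itself is the standard decoder-only accounting, not the author's exact configuration:

```python
# Hedged sketch: estimate GPT-style parameter counts from (layers, hidden size).
# Layer count, vocabulary size, and sequence length here are assumptions,
# not values stated in the article.

def gpt_param_count(n_layers, hidden, vocab=50257, seq_len=1024):
    """Approximate parameters of a GPT-2-style decoder.

    Per layer: 12*h^2 weights (QKV + attention output + the 4h MLP up/down
    projections) plus ~13*h for biases and LayerNorms; token and position
    embeddings add (V + S)*h.
    """
    per_layer = 12 * hidden**2 + 13 * hidden
    embeddings = (vocab + seq_len) * hidden
    return n_layers * per_layer + embeddings

# With an assumed 24 layers at hidden size 1920 (model A), the estimate
# lands on the article's 1.16 B figure:
print(f"{gpt_param_count(24, 1920) / 1e9:.2f} B")  # → 1.16 B
```

The same function applied to hidden size 4096 shows why the largest model needs many more layers to reach 86.7 B: each layer at h = 4096 contributes only about 0.2 B parameters.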
Performance results on the NF5488A5 show iteration times ranging from 1108 ms (model A, single GPU) to 1385 ms (model D, 8-GPU tensor parallelism). Estimated full-training durations for 3000 billion tokens increase from 19.4 days (model A) to 196.4 days (model D). Measured per-GPU throughput reaches up to 142 TFLOPS, about 45.5 % of the A100's 312 TFLOPS theoretical FP16 peak.
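The throughput figure can be sanity-checked from the iteration time alone. A common accounting (used in the Megatron-LM literature) charges ~6·N FLOPs per token for forward plus backward, and ~8·N when activation checkpointing recomputes the forward pass, as it does here. The sketch below applies that assumption to model A's single-GPU numbers; the 8·N factor is our modeling choice, not a figure from the article:

```python
# Hedged back-of-envelope check of the per-GPU throughput figure.
# Assumes ~8*N FLOPs per token: ~6*N for forward+backward, plus ~2*N
# because activation checkpointing re-runs the forward pass.

params      = 1.16e9     # model A parameter count
tokens_iter = 16 * 1024  # batch size 16 x sequence length 1024
iter_time_s = 1.108      # 1108 ms per iteration (model A, single GPU)

achieved_tflops = 8 * params * tokens_iter / iter_time_s / 1e12
utilization = 142 / 312  # article's 142 TFLOPS vs. the A100 FP16 peak

print(f"{achieved_tflops:.0f} TFLOPS")          # → 137 TFLOPS, near the reported 142
print(f"{utilization:.1%} of 312 TFLOPS peak")  # → 45.5% of 312 TFLOPS peak
```

That the simple 8·N estimate (≈137 TFLOPS) lands close to the measured 142 TFLOPS suggests the reported number is a compute-throughput figure that includes the checkpointing recompute.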
A comparative test using a PCIe 4.0‑connected GPU server (without NVSwitch) demonstrated that the NF5488A5 delivers at least a 4× speedup across all model sizes, highlighting the advantage of NVSwitch’s high‑bandwidth interconnect for tensor‑parallel training.
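The 4× gap is consistent with how much activation traffic tensor parallelism generates. In Megatron-LM's scheme, each Transformer layer all-reduces the full activation tensor twice in the forward pass and twice in the backward pass, so interconnect bandwidth sits directly on the critical path. The sketch below estimates that traffic using the article's batch size, sequence length, and model D's hidden size; the 8-way parallel degree matches the article, while the ring all-reduce cost model (each GPU moves 2·(p−1)/p bytes per byte reduced) is the standard NCCL-style assumption:

```python
# Hedged estimate of per-layer tensor-parallel communication volume.
# Batch, sequence, and hidden sizes come from the article (model D);
# the 4-all-reduces-per-layer count and ring cost factor are the
# standard Megatron-LM / NCCL accounting, assumed here.

batch, seq, hidden = 16, 1024, 4096
tp_degree  = 8                  # 8-way tensor parallelism
bytes_fp16 = 2

activation = batch * seq * hidden * bytes_fp16  # one FP16 activation tensor
per_layer  = 4 * activation                     # 2 forward + 2 backward all-reduces
ring       = 2 * (tp_degree - 1) / tp_degree    # ring all-reduce traffic factor

gib = per_layer * ring / 2**30
print(f"{gib:.2f} GiB moved per GPU, per layer, per iteration")
```

Multiplied across tens or hundreds of layers, every iteration pushes tens of GiB per GPU through the fabric, which is why NVSwitch's all-to-all bandwidth pulls so far ahead of a PCIe 4.0 topology.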
The study concludes that a single NF5488A5 can meet the training demands of hundred‑billion‑parameter Transformers, though distributed training across multiple servers is still required for larger models. The high‑bandwidth NVSwitch and InfiniBand networking substantially improve communication efficiency, making the platform a strong choice for large‑scale AI workloads.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.