How Shanghai Jiao Tong University Built China’s First Campus‑Scale ARM HPC Cluster with Huawei Kunpeng

This article details Shanghai Jiao Tong University's design and deployment of the nation’s first campus‑level high‑performance computing cluster based on Huawei Kunpeng 920 ARM processors, covering background, user challenges, unified storage, network topology, containerized software delivery, and performance validation with LAMMPS and GATK.

Architects' Tech Alliance

Background

China has achieved notable milestones in high‑performance computing (HPC), with Tianhe‑2 and Sunway TaihuLight topping the TOP500 list and winning Gordon Bell Prizes. The 14th Five‑Year Plan emphasizes domestic substitution and requires next‑generation exascale (E‑class) supercomputers to use domestically developed processors.

Motivation and Challenges

Despite these advances, Chinese universities lacked campus‑level platforms built on domestic processors due to three main issues:

Significant workflow differences between ARM‑based platforms and existing x86 clusters, leading to user unfamiliarity.

Most mainstream scientific software is compiled for x86, requiring recompilation and adaptation for ARM.

Many applications have not been performance‑tuned for ARM, resulting in uncertain execution speed.

Implemented Solutions

Mounted a unified parallel file system and deployed a common job scheduler across the heterogeneous compute resources, providing a consistent user experience.

Leveraged containers to rapidly deploy ARM‑optimized HPC applications, offering pre‑compiled software as modules and images.

Performed correctness verification and performance tuning for the pre‑compiled applications.
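
To make the container-based delivery concrete, here is a minimal sketch of how a pre-built image might be invoked from Python. The image path, the bind path, and the `singularity_exec` helper are hypothetical and are not taken from the SJTU deployment; only the `singularity exec` and `--bind` syntax and the LAMMPS `-in` option are standard.

```python
import shlex


def singularity_exec(image, program, args=(), binds=()):
    """Build a `singularity exec` command as an argument list."""
    cmd = ["singularity", "exec"]
    for path in binds:
        # Make a host directory (for example the shared file system) visible inside the container.
        cmd += ["--bind", path]
    cmd.append(image)
    cmd.append(program)
    cmd += list(args)
    return cmd


# Hypothetical image location and input file, for illustration only.
cmd = singularity_exec(
    image="/lustre/share/images/lammps-arm.sif",
    program="lmp",
    args=["-in", "in.lj"],
    binds=["/lustre"],
)
print(shlex.join(cmd))
```

Returning an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.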

System Design

Network Topology

The ARM cluster connects to an InfiniBand (IB) switch fabric, which is bridged via LNet routers to the existing Omni‑Path network of the x86 cluster, enabling seamless data sharing across heterogeneous resources.

The IB fabric consists of five 40‑port switches and three router nodes arranged in a fat‑tree topology. It delivers up to 10 Tb/s of aggregate bandwidth between compute nodes and 11 Tb/s between the access and core layers, so any pair of nodes can communicate at 100 Gb/s.

Shared File System

The campus platform uses the Lustre parallel file system, providing a POSIX‑compatible, highly available, and scalable storage layer for all clusters.

Job Scheduler

SLURM, an open‑source, fault‑tolerant scheduler, manages resource allocation, job launch, monitoring, and queue arbitration across the heterogeneous environment.
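
As an illustration of what a consistent scheduler interface can look like from the user's side, the sketch below generates a batch script from one template, with the target partition as the main thing that changes between architectures. The partition names, core counts, and the `sbatch_script` helper are assumptions made for this example; the `#SBATCH` options shown are standard SLURM directives.

```python
def sbatch_script(job_name, partition, nodes, tasks_per_node, command):
    """Return the text of a minimal SLURM batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        "",
        # srun launches one task per allocated slot.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical partition names and per-node core counts.
arm_script = sbatch_script("lj-bench", "arm128c", 2, 128, "lmp -in in.lj")
x86_script = sbatch_script("lj-bench", "cpu", 2, 40, "lmp -in in.lj")
print(arm_script)
```

The generated text could be written to a file and submitted with `sbatch`; the same input files are reachable from either partition because both mount the shared file system.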

Mounting Lustre on the ARM Cluster

Mounting involves two steps:

Compile and install the Lustre client (v2.12.4) on a customized CentOS 7.6 ARM system, ensuring kernel and IB driver compatibility.

Configure LNet routing: assign the ARM nodes their own LNet network identifier, then set up matching routes on the storage servers, ARM nodes, and router nodes so that the Omni‑Path and InfiniBand networks are interconnected.

After these steps, the ARM cluster successfully accesses the shared Lustre file system, forming a unified data foundation.
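
The routing step can be sketched as a small generator of `lnetctl` commands for each role. This is an illustration under stated assumptions, not the university's actual configuration: the network labels `o2ib0` and `o2ib1`, the interface names, and all addresses are hypothetical, and the helper functions are invented for this example. The `lnetctl` subcommands used (`lnet configure`, `net add`, `route add`, `set routing`) are the standard ones.

```python
OPA_NET = "o2ib0"  # hypothetical label for the Omni-Path side (storage, x86)
IB_NET = "o2ib1"   # hypothetical label for the InfiniBand side (ARM nodes)


def endpoint_commands(local_net, interface, remote_net, gateway_nids):
    """Commands for a client or server that reaches remote_net through routers."""
    cmds = [
        "lnetctl lnet configure",
        f"lnetctl net add --net {local_net} --if {interface}",
    ]
    for nid in gateway_nids:
        cmds.append(f"lnetctl route add --net {remote_net} --gateway {nid}")
    return cmds


def router_commands(nets):
    """Commands for a router node attached to every (net, interface) pair in nets."""
    cmds = ["lnetctl lnet configure"]
    for net, interface in nets:
        cmds.append(f"lnetctl net add --net {net} --if {interface}")
    cmds.append("lnetctl set routing 1")  # enable forwarding between the networks
    return cmds


# Three routers, as in the article; all addresses below are made up.
routers_on_ib = [f"10.1.0.{i}@{IB_NET}" for i in (1, 2, 3)]
routers_on_opa = [f"10.0.0.{i}@{OPA_NET}" for i in (1, 2, 3)]

arm_cmds = endpoint_commands(IB_NET, "ib0", OPA_NET, routers_on_ib)
storage_cmds = endpoint_commands(OPA_NET, "ib0", IB_NET, routers_on_opa)
router_cmds = router_commands([(OPA_NET, "ib0"), (IB_NET, "ib1")])

for line in arm_cmds:
    print(line)
```

Each side names the routers by their address on its own network, which is why the ARM nodes and the storage servers receive different gateway lists for the same three machines.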

Performance Tuning and Validation

Two representative workloads—LAMMPS (molecular dynamics) and GATK (genomic analysis)—were selected for evaluation because they accounted for 35 % of CPU usage on the university’s x86 cluster in 2020.

LAMMPS Results

Using the EAM and LJ benchmarks (864 k atoms, 5 000 steps, NVE ensemble), an ARM node achieved roughly twice the performance of a standard Intel Xeon node running without Intel‑specific acceleration, and the ARM cluster kept a 1.5× advantage at 16 nodes. With the LAMMPS USER‑INTEL acceleration package enabled on x86, the ARM cluster reached about 60 % of the accelerated x86 performance.
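
Ratios like these are typically derived from wall-clock times for the same benchmark input. The sketch below shows the two calculations involved, relative performance and parallel efficiency; the timings passed in are placeholders chosen for easy arithmetic and are not measurements from this study.

```python
def relative_performance(t_reference, t_candidate):
    """How many times faster the candidate is than the reference (same workload)."""
    if t_reference <= 0 or t_candidate <= 0:
        raise ValueError("wall-clock times must be positive")
    return t_reference / t_candidate


def parallel_efficiency(t_one_node, t_n_nodes, n_nodes):
    """Fraction of ideal strong scaling achieved on n_nodes."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    if t_one_node <= 0 or t_n_nodes <= 0:
        raise ValueError("wall-clock times must be positive")
    return t_one_node / (n_nodes * t_n_nodes)


# Placeholder timings in seconds, NOT results from the article.
ratio = relative_performance(t_reference=100.0, t_candidate=80.0)
efficiency = parallel_efficiency(t_one_node=100.0, t_n_nodes=8.0, n_nodes=16)
print(f"relative performance: {ratio:.2f}x, efficiency at 16 nodes: {efficiency:.0%}")
```

Comparing the single-node ratio with the 16-node ratio separates per-node compute speed from scaling behaviour over the interconnect.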

GATK Results

Testing GATK 4.2 on the Broad Institute workflow showed a noticeable slowdown on ARM because the HaplotypeCaller module cannot use Intel’s Genomics Kernel Library (GKL) acceleration there. Other tools (MarkDuplicates, BQSR) ran at between 50 % and 70 % of the x86 baseline.
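
The HaplotypeCaller gap comes down to which PairHMM implementation is available on the host. As a hedged sketch, a wrapper could make that choice explicit with GATK's `--pair-hmm-implementation` argument; the article does not say the team did this, and the file names and helper functions below are hypothetical. GATK's default setting already falls back to a slower implementation when the AVX kernels cannot be loaded, so the explicit flag mainly documents the behaviour.

```python
import platform


def pair_hmm_implementation(machine):
    """Pick a PairHMM setting for the given CPU architecture string."""
    # The AVX-accelerated kernels ship with Intel GKL and need an x86-64 CPU.
    if machine in ("x86_64", "AMD64"):
        return "FASTEST_AVAILABLE"
    # On ARM (aarch64) only the pure-Java implementation can be used.
    return "LOGLESS_CACHING"


def haplotype_caller_cmd(reference, bam, out_vcf, machine=None):
    """Build a `gatk HaplotypeCaller` command as an argument list."""
    machine = machine or platform.machine()
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-O", out_vcf,
        "--pair-hmm-implementation", pair_hmm_implementation(machine),
    ]


# Hypothetical input and output file names.
arm_cmd = haplotype_caller_cmd("ref.fasta", "sample.bam", "sample.vcf.gz", machine="aarch64")
print(" ".join(arm_cmd))
```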

Conclusions

The new network topology enables the ARM cluster to share the same Lustre file system with existing x86 and GPU clusters, providing transparent data access. Containerization (Singularity) delivered over 30 pre‑compiled HPC applications, and performance tuning brought key workloads to 60 %–70 % of the best x86 results. The ARM cluster entered a trial phase in summer 2021, achieving over 70 % average monthly utilization.
