Building High‑Performance RoCE v2 and InfiniBand Networks in a Cloud‑Native Environment for Large‑Model Training
This article explains how to build high‑performance RoCE v2 and InfiniBand networks in a cloud‑native Kubernetes environment, covering the underlying technologies, required components, and configuration steps, along with performance test results that demonstrate significant communication speedups for large‑scale AI model training.
1. Introduction
Since the release of ChatGPT at the end of 2022, a wave of large‑model research has swept the industry, and alongside powerful AI chips, high‑performance networking has become a key factor in distributed training.
2. High‑Performance Network Overview
Traditional TCP/IP networks suffer from protocol‑stack latency and high CPU load, while RDMA‑based solutions such as RoCE (RDMA over Converged Ethernet) and InfiniBand provide low‑latency, high‑bandwidth, low‑CPU‑consumption communication.
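To make the bandwidth gap concrete, here is a rough back‑of‑the‑envelope estimate of per‑step gradient synchronization time. The model size, NIC speeds, and rank count are illustrative assumptions, not measurements from this article:

```python
# Illustrative estimate of per-step gradient all-reduce time.
# All numbers below are assumptions for the sketch, not measurements.

def allreduce_time_s(payload_bytes: float, link_gbps: float, n_ranks: int) -> float:
    """Ring all-reduce moves 2*(n-1)/n of the payload over each link."""
    traffic = payload_bytes * 2 * (n_ranks - 1) / n_ranks
    return traffic / (link_gbps * 1e9 / 8)  # Gb/s -> bytes/s

grads = 7e9 * 2                          # 7B parameters in fp16 ~= 14 GB of gradients
tcp = allreduce_time_s(grads, 25, 8)     # 25 Gb/s Ethernet
rdma = allreduce_time_s(grads, 200, 8)   # 200 Gb/s RDMA NIC
print(f"TCP: {tcp:.2f}s  RDMA: {rdma:.2f}s  speedup: {tcp / rdma:.0f}x")
# -> TCP: 7.84s  RDMA: 0.98s  speedup: 8x
```

The estimate ignores protocol‑stack latency and CPU copies, which further penalize TCP in practice; raw link bandwidth alone already dominates step time at these scales.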
Two main schemes are used in industry: RoCE v2 and InfiniBand. Both have been deployed inside 360 Group to support large‑model projects like 360 智脑 and 360 智绘.
3. Building a RoCE v2 Network in a Cloud‑Native Environment
The cluster uses six NICs per host: two bonded Ethernet NICs for the management plane and four Mellanox NICs (mlx5) for the data plane. Cilium maintains the management network, while Multus CNI, macvlan, and whereabouts provide a second data‑plane network for pods.
Key components include NVIDIA’s network‑operator, which deploys the RDMA shared device plugin with a configuration such as the following:
rdmaSharedDevicePlugin:
  deploy: true
  image: k8s-rdma-shared-dev-plugin
  repository: ghcr.io/mellanox
  version: sha-fe7f371c7e1b8315bf900f71cd25cfc1251dc775
  useCdi: false
  resources:
    - resourcePrefix: nvidia.com
      resourceName: mlx5_0
      rdmaHcaMax: 100
      vendors: [15b3]
      ifNames: [lan2]
    - resourcePrefix: nvidia.com
      resourceName: mlx5_1
      rdmaHcaMax: 100
      vendors: [15b3]
      ifNames: [lan3]
    - resourcePrefix: nvidia.com
      resourceName: mlx5_2
      rdmaHcaMax: 100
      vendors: [15b3]
      ifNames: [lan4]
    - resourcePrefix: nvidia.com
      resourceName: mlx5_3
      rdmaHcaMax: 100
      vendors: [15b3]
      ifNames: [lan5]

A MacvlanNetwork object is created for each data‑plane NIC to allocate IP pools via the whereabouts IPAM plugin:
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-net-ipam-lan2
spec:
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.4.0/22",
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info",
      "gateway": "192.168.4.1"
    }
  master: lan2
  mode: bridge
  mtu: 1500
  networkNamespace: prod
---
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-net-ipam-lan3
spec:
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.8.0/22",
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info",
      "gateway": "192.168.8.1"
    }
  master: lan3
  mode: bridge
  mtu: 1500
  networkNamespace: prod
---
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-net-ipam-lan4
spec:
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.12.0/22",
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info",
      "gateway": "192.168.12.1"
    }
  master: lan4
  mode: bridge
  mtu: 1500
  networkNamespace: prod
---
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-net-ipam-lan5
spec:
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.16.0/22",
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info",
      "gateway": "192.168.16.1"
    }
  master: lan5
  mode: bridge
  mtu: 1500
  networkNamespace: prod

A sample Volcano job requests the four mlx5 resources and sets NCCL environment variables to enable RoCE communication:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: rdma-test
  namespace: prod
spec:
  maxRetry: 3
  minAvailable: 1
  plugins:
    pytorch:
      - '--master=master'
      - '--worker=worker'
      - '--port=23456'
  policies:
    - action: RestartJob
      event: PodEvicted
  queue: default
  schedulerName: volcano
  tasks:
    - maxRetry: 3
      minAvailable: 1
      name: master
      replicas: 1
      template:
        metadata:
          annotations:
            # Attach the four macvlan data-plane networks to the pod
            k8s.v1.cni.cncf.io/networks: rdma-net-ipam-lan2,rdma-net-ipam-lan3,rdma-net-ipam-lan4,rdma-net-ipam-lan5
        spec:
          containers:
            - command:
                - /bin/bash
                - '-c'
                - sleep 1440h
              env:
                - name: NCCL_DEBUG
                  value: INFO
                - name: NCCL_IB_DISABLE      # 0 keeps the IB/RoCE transport enabled
                  value: '0'
                - name: NCCL_NET_GDR_READ    # allow GPUDirect RDMA reads from GPU memory
                  value: '1'
                - name: NCCL_IB_HCA          # match all mlx5 HCAs
                  value: mlx5
                - name: NCCL_IB_GID_INDEX    # GID table entry that maps to RoCE v2
                  value: '5'
                - name: NCCL_SOCKET_IFNAME   # interface for bootstrap/out-of-band traffic
                  value: eth0
              image: torch
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: '8'
                  nvidia.com/mlx5_0: '1'
                  nvidia.com/mlx5_1: '1'
                  nvidia.com/mlx5_2: '1'
                  nvidia.com/mlx5_3: '1'
                requests:
                  nvidia.com/gpu: '8'
                  nvidia.com/mlx5_0: '1'
                  nvidia.com/mlx5_1: '1'
                  nvidia.com/mlx5_2: '1'
                  nvidia.com/mlx5_3: '1'
          schedulerName: volcano

4. Building an InfiniBand Network in a Cloud‑Native Environment
InfiniBand requires dedicated IB switches and NVIDIA UFM for fabric management. No second network plane or MacvlanNetwork objects are needed, and the RoCE‑specific annotations and environment variables are omitted from job specs.
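For comparison, the container spec of an IB job can be noticeably simpler. The fragment below is a minimal sketch, not the article’s actual configuration: the RDMA resource name in particular is an assumption and depends on how the device plugin is set up in your cluster.

```yaml
# Hypothetical IB variant of the container spec:
# no Multus network annotation, no RoCE GID index.
env:
  - name: NCCL_DEBUG
    value: INFO
  - name: NCCL_IB_DISABLE    # 0 keeps the IB transport enabled
    value: '0'
  - name: NCCL_IB_HCA        # match all mlx5 HCAs
    value: mlx5
resources:
  limits:
    nvidia.com/gpu: '8'
    rdma/hca: '1'            # assumed resource name; cluster-specific
```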
5. Performance Evaluation
All‑reduce performance tests, launched as MPI‑based distributed training jobs from the 360 AI development platform, show that both RoCE v2 and IB achieve far higher bandwidth than traditional Ethernet, confirming their suitability for training trillion‑parameter models.
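When reading all‑reduce benchmark results, note that tools in the style of nccl-tests report both algorithm bandwidth and bus bandwidth; the latter normalizes for the extra traffic the collective generates. A small sketch of the standard conversion (the payload size, rank count, and timing below are illustrative, not figures from these tests):

```python
def bus_bandwidth_gbps(bytes_reduced: float, time_s: float, n_ranks: int) -> float:
    """nccl-tests convention for all-reduce: busbw = algbw * 2*(n-1)/n."""
    algbw = bytes_reduced / time_s                 # algorithm bandwidth, bytes/s
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks    # per-link traffic actually moved
    return busbw * 8 / 1e9                         # bytes/s -> Gb/s

# e.g. a 1 GiB all-reduce across 16 ranks finishing in 50 ms (illustrative)
print(f"{bus_bandwidth_gbps(2**30, 0.05, 16):.1f} Gb/s")
# -> 322.1 Gb/s
```

Bus bandwidth is the number to compare against the NIC line rate: a result close to the link speed indicates the fabric, not the collective algorithm, is the bottleneck.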
All capabilities are integrated into the 360 AI platform, allowing users to create GPU tasks that automatically leverage the high‑performance networks.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.