
How 360 Cloud Platform Implements GPU Passthrough and Docker+MIG for AI Workloads

This article details 360 Cloud Platform's practical implementation of GPU passthrough and Docker‑MIG solutions, covering the underlying principles, host and OpenStack configuration steps, verification methods, and future directions for full GPU virtualization.

360 Zhihui Cloud Developer

Background

AI large‑model training is a strategic priority for 360, and GPUs are a critical resource. Directly assigning whole physical machines (each with eight GPUs) leads to waste, so fine‑grained, isolated GPU allocation is required. KVM VMs and containers naturally provide resource partitioning and isolation, making them suitable for delivering GPU resources.

Solution Research

Solution Verification and Deployment

GPU Passthrough Solution

Principle Analysis

GPU passthrough offers good compatibility, low performance overhead, and no extra licensing fees, which is why major cloud providers adopt it.

IOMMU

The IOMMU maps Guest Physical Addresses (GPAs) to Host Physical Addresses (HPAs) through its own page tables, so the guest never sees or programs real host physical addresses.

The mapping rule: when a device issues a DMA request, the request carries the device's Source Identifier (bus, device, function). The IOMMU uses this identifier to locate the corresponding Context Entry in the Context Table, obtains the page-table base address from it, and walks that page table to translate the request's guest physical address into a host physical address.

IOMMU functions:

Establishes GPA‑to‑HPA mapping, isolating physical addresses from the guest.

Allows non-contiguous host physical memory to back a contiguous guest address range, removing the requirement that DMA buffers be physically contiguous.
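Both functions can be illustrated with a toy model: a source ID selects a context entry, the context entry selects a page table, and the page table maps guest pages to scattered host pages. This is a sketch only; the BDF 3e:00.0 and all page numbers are made-up examples, and real IOMMU page tables are multi-level rather than flat.

```shell
#!/usr/bin/env bash
declare -A ctx      # source ID (bus:dev.fn) -> name of its page table
declare -A pt_gpu   # the GPU's page table: GPA page -> HPA page
ctx["3e:00.0"]="pt_gpu"

# A contiguous guest range (pages 0..2) backed by scattered host pages:
pt_gpu[0]=7; pt_gpu[1]=42; pt_gpu[2]=13

translate() {       # translate <source-id> <gpa-page> -> host page
  local table=${ctx[$1]:-}
  if [ "$table" != "pt_gpu" ]; then
    echo "DMA fault: no context entry for $1"; return 1
  fi
  echo "host page ${pt_gpu[$2]}"
}

translate "3e:00.0" 1    # the GPU's DMA to guest page 1 lands on host page 42
```

A device whose source ID has no context entry faults instead of reaching host memory, which is exactly the isolation property passthrough relies on.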

VFIO Driver

Virtual Function I/O (VFIO) builds on VT-d/AMD-Vi DMA and interrupt remapping to deliver near-native I/O performance while keeping all DMA confined by the IOMMU. It lets userspace processes such as QEMU access hardware directly, even without privileged rights.

VFIO creates an independent IOMMU page table per domain for DMA isolation and uses interrupt remapping to achieve interrupt isolation and direct delivery.

GPU Passthrough Implementation Steps

1. Power on the host; the BIOS enumerates PCI devices and assigns their BAR address space.

2. The kernel, booted with IOMMU enabled, assigns the device to an IOMMU group.

3. Load the vfio‑pci driver and bind it to the GPU.

4. Start QEMU VM and pass the GPU device through.

5. QEMU presents the GPU in the guest's PCI topology, and the guest driver initializes the device.

6. Device memory accesses (DMA) are translated through the IOMMU page table (GPA → HPA); accesses to special configuration registers trap to QEMU via VM exits and are emulated.
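Steps 2-3 can also be performed at runtime instead of via kernel parameters, by rebinding the device through sysfs. The sketch below only prints the commands (a dry run); on a real host you would run the printed lines as root. The BDF 0000:3e:00.0 is a placeholder; take the real one from lspci -nn.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the sysfs commands that rebind one GPU to vfio-pci.
BDF="0000:3e:00.0"   # placeholder PCI address
cmds=$(cat <<EOF
modprobe vfio-pci
echo vfio-pci > /sys/bus/pci/devices/$BDF/driver_override
echo $BDF > /sys/bus/pci/devices/$BDF/driver/unbind
echo $BDF > /sys/bus/pci/drivers_probe
EOF
)
printf '%s\n' "$cmds"
```

Writing the driver name to driver_override pins the device to vfio-pci on the next probe, so unbinding from the current driver and re-probing completes the handover without a reboot.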

Host Configuration Adjustments

Ubuntu

Enable IOMMU and bind the GPU to vfio‑pci:

<code>vim /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on vfio-pci.ids=10de:xxxx quiet"</code>

Update GRUB:

<code>update-grub</code>

Dragonfly (Longxin) OS

Enable IOMMU:

<code>grubby --update-kernel="/boot/vmlinuz-`uname -r`" --args="intel_iommu=on"</code>

Configure GPU driver:

<code># cat /etc/modules-load.d/openstack-gpu.conf
vfio_pci</code>
<code># cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:xxxx</code>

Blacklist Conflicting Kernel Modules

<code>vim /etc/modprobe.d/blacklist.conf
blacklist nouveau
options nouveau modeset=0
blacklist xhci_hcd
blacklist nvidia
blacklist nvidia_modeset
blacklist nvidia_drm
blacklist snd_hda_intel
blacklist nvidiafb
blacklist ast
blacklist drm_kms_helper
blacklist drm_vram_helper
blacklist ttm
blacklist drm</code>

Verification

Check IOMMU:

<code>dmesg | grep -i iommu</code>

Verify GPU bound to vfio driver:

<code>lspci -nn | grep NVIDIA
lspci -s 3e:00.0 -k | grep driver</code>
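The driver check can be scripted. In this sketch the lspci output is inlined as a sample so the parsing is visible (device ID 10de:xxxx as in the configuration above); on a real host, replace the sample with the output of lspci -s 3e:00.0 -k.

```shell
#!/usr/bin/env bash
# Sample 'lspci -s 3e:00.0 -k' output; substitute the real command's output.
sample='3e:00.0 3D controller: NVIDIA Corporation Device 10de:xxxx (rev a1)
	Kernel driver in use: vfio-pci
	Kernel modules: nouveau'
if grep -q "Kernel driver in use: vfio-pci" <<<"$sample"; then
  status="OK: GPU is bound to vfio-pci"
else
  status="WARN: GPU is NOT bound to vfio-pci"
fi
echo "$status"
```

Note that "Kernel modules" lists drivers that could claim the device; only "Kernel driver in use" shows the active binding, which is why the check matches that line.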

OpenStack Adjustments

nova‑api configuration

<code>[pci]
alias = {"vendor_id":"10de", "product_id":"xxx", "device_type":"type-PCI", "name":"nvidia-xxx"}</code>

nova‑compute configuration

<code>[pci]
passthrough_whitelist = [{"vendor_id":"10de", "product_id":"xxx"}]</code>

Create trait and flavor

<code>openstack trait create CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3</code>
<code>openstack resource provider trait set --trait CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3 2a5769e2-78fa-4a15-8f36-9b82407c4b56</code>
<code># Example flavor creation
openstack flavor create --vcpus 10 --ram 102400 --ephemeral 800 --property trait:CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3='required' --property pci_passthrough:alias='nvidia-xxx:1' v.xxxgn3i-1x.c10g100-1i
openstack flavor create --vcpus 20 --ram 204800 --ephemeral 1600 --property trait:CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3='required' --property pci_passthrough:alias='nvidia-xxx:2' v.xxxgn3i-2x.c20g200-2i
openstack flavor create --vcpus 40 --ram 409600 --ephemeral 3200 --property trait:CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3='required' --property pci_passthrough:alias='nvidia-xxx:4' v.xxxgn3i-4x.c40g400-4i
openstack flavor create --vcpus 80 --ram 819200 --ephemeral 6500 --property trait:CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3='required' --property pci_passthrough:alias='nvidia-xxx:8' v.xxxgn3i-8x.c80g800-8i</code>
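The four flavors above follow one pattern: vCPUs, RAM, and ephemeral disk scale linearly with GPU count (10 vCPU, 100 GB RAM, and 800 GB disk per GPU), except that the 8-GPU flavor uses 6500 GB ephemeral rather than the linear 6400. The loop below just prints equivalent commands as a dry run; nothing is created.

```shell
#!/usr/bin/env bash
# Dry run: regenerate the flavor-create commands from the per-GPU pattern.
flavors=$(
  for n in 1 2 4 8; do
    eph=$((800 * n)); [ "$n" -eq 8 ] && eph=6500   # 8-GPU flavor is the exception
    echo "openstack flavor create --vcpus $((10 * n)) --ram $((102400 * n))" \
         "--ephemeral $eph" \
         "--property trait:CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3='required'" \
         "--property pci_passthrough:alias='nvidia-xxx:$n'" \
         "v.xxxgn3i-${n}x.c$((10 * n))g$((100 * n))-${n}i"
  done
)
printf '%s\n' "$flavors"
```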

Boot VMs with the new flavor:

<code># Example nova boot commands
nova boot --availability-zone nova:hpctur02.aitc.xxx.xxx.net --flavor gpu-flavor --security-groups f9f068b4-f247-4e29-a21d-fc98da18e99f --nic net-id=f0dd296e-dee9-422b-b538-3560fbe145f9 --block-device id=edc2a95c-6e73-4bf8-8891-4bb19d23ca94,source=image,dest=volume,bus=scsi,type=disk,size=50,bootindex=0,shutdown=remove,volume_type=sata test_gpu_08</code>

Docker+MIG Solution

Implementation Steps

Create MIG Instance

Enable MIG on GPU 0:

<code>nvidia-smi -i 0 -mig 1</code>

Check MIG mode:

<code>nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current --format=csv</code>

List MIG profiles:

<code>nvidia-smi mig -lgip</code>

Create MIG instances (here four GPU instances of profile ID 14; the -C flag also creates a compute instance inside each):

<code>nvidia-smi mig -cgi 14,14,14,14 -C</code>

Verify created instances:

<code>nvidia-smi -L</code>
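The MIG UUIDs in this listing are what the Docker step below consumes via NVIDIA_VISIBLE_DEVICES. In this sketch, sample nvidia-smi -L output is inlined so the extraction is visible; the GPU UUID and the second MIG UUID are made-up examples, while the first MIG UUID matches the Docker example in this article. On a real host, pipe nvidia-smi -L into the same grep.

```shell
#!/usr/bin/env bash
# Sample 'nvidia-smi -L' output (illustrative values); extract the MIG UUIDs.
sample='GPU 0: NVIDIA A100 (UUID: GPU-5c89852c-d268-c3f3-1b07-005d5ae1dc3f)
  MIG 2g.10gb     Device  0: (UUID: MIG-a75f333a-f233-5ba4-b4c5-65598abb8f33)
  MIG 2g.10gb     Device  1: (UUID: MIG-3065ad0e-2b43-5e36-b767-5cd4baf2409e)'
uuids=$(grep -o 'MIG-[0-9a-f-]*' <<<"$sample")
printf '%s\n' "$uuids"
```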

Install NVIDIA Driver

<code>yum install dkms
curl -SL https://foxi.buduanwang.vip/pan/foxi/Virtualization/vGPU/NVIDIA-Linux-x86_64-510.85.03-vgpu-kvm.run -O
sh NVIDIA-Linux-x86_64-510.85.03-vgpu-kvm.run --dkms</code>

Install nvidia‑container‑toolkit

<code># vim /etc/yum.repos.d/nvidia-container-toolkit.repo
[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
enabled=1
gpgcheck=0

[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/rpm/$basearch
enabled=0
gpgcheck=0

# Install
yum install -y nvidia-container-toolkit</code>

Configure Docker Runtime

<code>nvidia-ctk runtime configure --runtime=docker
systemctl restart docker</code>

Pull CUDA Image

<code>docker pull harbor.qihoo.net/nvidia/cuda:11.4.1-base-centos8</code>

Run Container and View MIG Devices

<code>sudo docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=MIG-a75f333a-f233-5ba4-b4c5-65598abb8f33 harbor.qihoo.net/nvidia/cuda:11.4.1-base-centos8 nvidia-smi</code>

Summary and Outlook

360 Cloud Platform has successfully deployed the GPU passthrough and Docker+MIG solutions, enabling fine-grained, isolated GPU allocation for AI workloads. Future work will explore vGPU solutions to provide a complete GPU virtualization offering.

Tags: Docker, GPU virtualization, OpenStack, GPU passthrough, MIG
Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
