How HBox Boosts GPU Utilization with Multi‑Pool and NUMA‑Aware Scheduling
The HBox scheduling platform tackles large‑scale AI cluster challenges by introducing a three‑pool resource model, priority‑based preemptive scheduling, network‑topology and NUMA‑aware dispatch, and GPU virtualization techniques like MIG and vGPU, dramatically improving GPU utilization, SLA guarantees, and overall cluster efficiency.
As AI model training scales from thousands to tens of thousands of GPUs, the primary bottleneck shifts from insufficient GPU count to inefficient compute scheduling and resource utilization. In production, GPU utilization often stays below 60%, high‑value jobs are delayed by lower‑priority tasks, NCCL communication suffers from topology effects, and NUMA‑aware scheduling is limited to single nodes.
To address these issues, the 360AI development platform built HBox, a high‑performance, stable, and easy‑to‑use scheduling system that supports clusters of ten thousand GPUs and beyond.
HBox Scheduling Platform Overview
HBox's core capabilities include:
Compute pooling
SLA‑guaranteed dispatch
Network‑topology‑aware scheduling
GPU virtualization integration
NUMA‑affinity scheduling
Support for domestic chips
Automatic fault detection
By classifying resources into three pools (public, pooled, and exclusive), HBox matches workload requirements (e.g., latency‑sensitive inference, batch training, testing) with the appropriate level of isolation and cost efficiency. The three‑pool model raises average GPU utilization from 30‑60% to 70‑90% while improving SLA stability and reducing operational costs.
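As a rough illustration of the matching rule (the article does not show HBox's actual API; the workload labels, mapping, and function below are hypothetical):

```python
# Illustrative sketch of the three-pool matching rule described above.
# Pool names follow the article; the workload classes and the specific
# workload-to-pool assignments are assumptions, not HBox's real API.

def select_pool(workload: str) -> str:
    """Map a workload class to one of the three resource pools."""
    if workload == "latency_sensitive_inference":
        return "exclusive"   # strict isolation for SLA-critical serving
    if workload == "batch_training":
        return "pooled"      # shared quota, subject to priority rules
    if workload == "testing":
        return "public"      # best-effort capacity, lowest cost
    raise ValueError(f"unknown workload class: {workload}")

print(select_pool("batch_training"))  # -> pooled
```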
Priority‑Based Preemptive Scheduling
Each department gets an independent queue, and tasks are assigned one of three priority levels:
High priority: can preempt lower‑priority tasks and cannot be preempted.
Medium priority: resources are guaranteed; does not participate in preemption.
Low priority: may be preempted.
In practice, critical business jobs bypass queues, development notebooks reuse fragmented GPU capacity, and SLA predictability improves markedly.
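A minimal sketch of this preemption rule, assuming a single GPU-count dimension (the Job class and function names are illustrative, not HBox's real data model):

```python
# Sketch of the three-level preemption rule: only HIGH jobs preempt,
# and only LOW jobs may be evicted; MEDIUM jobs' resources are guaranteed.
from dataclasses import dataclass
from typing import Optional

HIGH, MEDIUM, LOW = 3, 2, 1

@dataclass
class Job:
    name: str
    priority: int
    gpus: int

def try_admit(pending: Job, running: list[Job],
              capacity: int) -> Optional[list[Job]]:
    """Return the new running set if `pending` is admitted, else None (it queues)."""
    free = capacity - sum(j.gpus for j in running)
    if pending.gpus <= free:
        return running + [pending]          # fits without preemption
    if pending.priority == HIGH:
        survivors = list(running)
        # Evict LOW jobs (largest first, so fewer jobs are disturbed)
        # until the pending job fits.
        for victim in sorted((j for j in running if j.priority == LOW),
                             key=lambda j: -j.gpus):
            survivors.remove(victim)
            free += victim.gpus
            if pending.gpus <= free:
                return survivors + [pending]
    return None

running = [Job("etl", LOW, 4), Job("serve", MEDIUM, 2)]
print(try_admit(Job("train", HIGH, 6), running, capacity=8))
# -> etl (LOW) is evicted; serve (MEDIUM) is untouched; train is admitted.
```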
Network‑Topology‑Aware Scheduling
HBox adds a network‑topology detector that uses NVIDIA UFM to collect InfiniBand switch and port information and build a global communication tree, plus a scheduler extension that prefers placing pods of the same job on the optimal network path. Three policies are offered:
none – no topology awareness.
bestEffort – try to allocate nodes on the best communication path; schedule anyway if that is not possible.
singleSwitch – all pods must reside on the same switch; otherwise the job is rejected.
Real‑world tests show a 20% reduction in NCCL latency and significantly higher scheduling stability.
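The bestEffort and singleSwitch behaviors can be sketched as switch-locality placement over the detector's node-to-leaf-switch map (the map, names, and function below are hypothetical stand-ins):

```python
# Sketch of switch-locality placement under the three policies. The
# node -> leaf-switch map would come from the UFM-based topology detector;
# the data here is invented for illustration.
from collections import Counter

NODE_TO_SWITCH = {"n1": "sw-a", "n2": "sw-a", "n3": "sw-b", "n4": "sw-b"}

def place(job_pods: int, free_nodes: list[str], policy: str) -> list[str]:
    """Pick nodes for a job's pods, preferring a single leaf switch."""
    if policy == "none":
        return free_nodes[:job_pods]
    # Prefer nodes on the switch with the most free capacity, so a job
    # concentrates under as few leaf switches as possible.
    by_switch = Counter(NODE_TO_SWITCH[n] for n in free_nodes)
    ordered = sorted(free_nodes,
                     key=lambda n: (-by_switch[NODE_TO_SWITCH[n]],
                                    NODE_TO_SWITCH[n]))
    chosen = ordered[:job_pods]
    if policy == "singleSwitch":
        # Hard constraint: reject the job rather than span switches.
        if len(chosen) < job_pods or \
           len({NODE_TO_SWITCH[n] for n in chosen}) != 1:
            raise RuntimeError("job rejected: cannot fit on one switch")
    return chosen

print(place(2, ["n1", "n3", "n2", "n4"], "bestEffort"))  # -> ['n1', 'n2']
```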
GPU Virtualization Strategies
HBox integrates NVIDIA MIG and HAMi vGPU to provide layered GPU sharing:
Time‑Slicing: container‑level time sharing with no hardware changes; weak isolation and possible out‑of‑memory failures.
MPS: process‑level sharing with per‑process memory limits and better performance.
MIG: hardware‑level partitioning into up to seven independent instances, offering strong isolation and predictable performance.
vGPU (HAMi): software‑virtualized GPUs with fine‑grained slicing (down to 1% of compute and per‑MB memory), strong isolation, and broad hardware compatibility.
Recommended usage:
Notebook development – HAMi.
Lightweight inference – HAMi.
Strict‑SLA inference – MIG.
Large‑scale training – exclusive GPU.
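For the HAMi cases, a pod would request a slice through HAMi's extended resource names. A minimal sketch, written as a Python dict standing in for the Kubernetes pod spec (the resource keys follow HAMi's public documentation; the numbers are illustrative):

```python
# Sketch of a HAMi vGPU request for a notebook or lightweight-inference
# pod, as a Python dict standing in for the pod spec's resources block.
# nvidia.com/gpumem and nvidia.com/gpucores are HAMi's documented
# extended resources; the specific values are assumptions.
notebook_pod_resources = {
    "limits": {
        "nvidia.com/gpu": 1,        # one virtual GPU slice
        "nvidia.com/gpumem": 4096,  # 4 GiB of device memory, in MB
        "nvidia.com/gpucores": 30,  # 30% of the GPU's compute
    }
}
```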
NUMA‑Aware Scheduling
For workloads that move large volumes of data between GPU and host memory, splitting a pod's CPU, memory, and GPU allocations across NUMA nodes degrades performance. HBox currently supports NUMA‑aware scheduling on a single node by configuring kubelet policies (CPU manager static, topology manager best‑effort) and enforcing Guaranteed‑QoS pod specifications. Future work extends NUMA awareness to the cluster level, scoring nodes based on NUMA locality during scheduling.
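A sketch of that single-node setup, with Python dicts standing in for the kubelet configuration and the Guaranteed-QoS pod spec (the field names are real kubelet/Kubernetes settings; the values are illustrative):

```python
# Sketch of the single-node NUMA setup described above. With
# cpuManagerPolicy=static and a Guaranteed-QoS pod (integer CPUs,
# requests == limits), the kubelet topology manager can align CPU,
# memory, and GPU allocations to the same NUMA node.
kubelet_config = {
    "cpuManagerPolicy": "static",
    "topologyManagerPolicy": "best-effort",
}

# Guaranteed QoS requires requests to equal limits for every resource.
pod_resources = {
    "requests": {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": 1},
    "limits":   {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": 1},
}
```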
Flexible GPU‑CPU Ratio Scheduling
Traditional AI clusters allocate a fixed CPU share per GPU, leaving many CPU cores idle. HBox plans to enable flexible GPU‑CPU pairing so that idle CPU cycles on GPU nodes can run data‑preprocessing or other CPU‑heavy tasks, improving overall cluster utilization and shortening job turnaround times.
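A back-of-the-envelope sketch of the idea: compute the CPU headroom left by the fixed per-GPU reservation and offer it to CPU-only jobs (the Node shape and the numbers are hypothetical):

```python
# Sketch of the planned flexible GPU-CPU pairing: admit CPU-only work
# (e.g., data preprocessing) onto the CPU headroom of a GPU node.
from dataclasses import dataclass

@dataclass
class Node:
    cpus_total: int
    cpus_reserved_per_gpu: int  # the traditional fixed CPU share per GPU
    gpus_used: int

def spare_cpus(node: Node) -> int:
    """CPU cores not claimed by the fixed per-GPU reservation."""
    return node.cpus_total - node.gpus_used * node.cpus_reserved_per_gpu

node = Node(cpus_total=128, cpus_reserved_per_gpu=12, gpus_used=8)
print(spare_cpus(node))  # -> 32 cores available for CPU-heavy side jobs
```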
Support for Domestic Chips
HBox also supports Huawei Ascend chips (910B, 310P) by exposing HCCS topology to the scheduler and using the open‑source ascend‑for‑volcano plugin for affinity‑aware placement.
Stability and Fault‑Detection Framework
HBox implements a comprehensive monitoring system (qihoo‑smi) that watches GPU health, NVLink status, Mellanox NIC health, kernel modules, and K8s control‑plane connectivity. Detected faults trigger automatic node cordon, alerts, and self‑healing actions such as module reloads or service restarts. Alert data is stored in Elasticsearch for post‑mortem analysis.
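qihoo-smi's internals are not published; as a stand-in, the detect-cordon-alert loop might look like the following, using nvidia-smi's exit code as the health probe and kubectl cordon as the isolation step:

```python
# Hypothetical sketch of a detect -> cordon -> alert loop, standing in
# for qihoo-smi. Only the nvidia-smi and kubectl invocations are real.
import subprocess

def gpu_healthy() -> bool:
    """Treat a non-zero nvidia-smi exit code as a GPU/driver fault."""
    try:
        return subprocess.run(["nvidia-smi"],
                              capture_output=True).returncode == 0
    except FileNotFoundError:
        return False  # missing driver tooling also counts as a fault

def cordon(node: str) -> None:
    """Mark the node unschedulable so no new pods land on it."""
    subprocess.run(["kubectl", "cordon", node], check=True)

def check_node(node: str) -> None:
    if not gpu_healthy():
        cordon(node)
        # Stand-in for shipping the alert record to Elasticsearch.
        print(f"ALERT: GPU fault on {node}; node cordoned")
```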
Through the combination of three‑pool resource isolation, priority preemption, network‑topology and NUMA awareness, MIG/vGPU virtualization, and flexible GPU‑CPU scheduling, HBox delivers a balanced solution that maximizes resource utilization, ensures SLA compliance, and provides a stable foundation for large‑scale AI training and inference workloads.