How EffectiveGPU Cuts GPU Costs with Fine‑Grained Partitioning and Volcano Scheduling
This article details how SF Tech's EffectiveGPU (EGPU) platform redesigns GPU resource management on Kubernetes, introducing fine‑grained memory and compute partitioning, priority‑based scheduling, Volcano integration, and monitoring pipelines to dramatically improve utilization and reduce hardware costs for AI workloads.
Traditional GPU Usage on Kubernetes and Its Limitations
Deploying native GPUs in a Kubernetes cluster normally requires four steps:
1. Install the appropriate GPU driver on each node.
2. Install nvidia-docker2 and set the Docker default runtime to nvidia so containers can access the GPU.
3. Deploy the nvidia-device-plugin, which registers GPU devices with the kubelet.
4. Submit pods that request nvidia.com/gpu; the scheduler places them on nodes where the plugin reports free GPUs.
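For reference, a minimal pod spec under this model looks like the following (the image tag is illustrative); note that nvidia.com/gpu can only be requested in whole units:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04  # illustrative image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # whole cards only; fractional requests are rejected
```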
Many inference workloads (e.g., TTS, translation, Stable Diffusion, Rerank, Embedding) only use a fraction of a GPU, leading to low utilization. The Kubernetes device‑plugin model only supports whole‑card allocation, so idle compute and memory are wasted.
NVIDIA provides three official GPU‑sharing mechanisms:
Time‑slicing – time‑multiplexed execution with no memory or fault isolation between sharers (configured through the device plugin; see the sketch after this list).
MPS (Multi‑Process Service) – shares compute cores across processes but still lacks robust per‑client monitoring and fault isolation.
MIG (Multi‑Instance GPU) – splits an Ampere‑or‑newer GPU into up to 7 compute instances and 8 memory slices; it requires A100‑class hardware and cannot oversubscribe memory.
All three have drawbacks: no memory oversubscription, limited hardware support, and insufficient isolation/monitoring.
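For comparison, time‑slicing on the stock NVIDIA device plugin is enabled through its config file. A minimal sketch, following the upstream plugin's documented schema, that advertises each physical GPU as four schedulable replicas:

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # each physical GPU is advertised as 4 schedulable devices
```

All four replicas share the card with no memory or fault isolation between them, which is precisely the gap EGPU targets.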
EffectiveGPU (EGPU) Overview
EffectiveGPU is a custom GPU‑partitioning solution that adds fine‑grained sharing, isolation, and oversubscription on any NVIDIA GPU, including pre‑Ampere cards such as the V100 and T4 that MIG does not support. Its key capabilities are:
Device sharing & isolation : Allocate compute and memory fractions per pod without modifying the application.
Resource‑efficiency optimization : Scheduler‑driven policies (bin‑pack, spread) maximize utilization.
Seamless compatibility : Works as a drop‑in replacement for the native device‑plugin; existing workloads need no code changes.
Intelligent scheduling : Custom scheduler and Volcano integration provide per‑model, per‑tenant control.
Elastic oversubscription : Both compute (time‑slice quotas) and GPU memory can be oversubscribed; priority queues guarantee QoS for high‑priority jobs.
EGPU Architecture and Core Components
EGPU consists of five components that cooperate to expose virtual GPU resources to the Kubernetes control plane:
egpu‑core: Intercepts CUDA Runtime API calls and forwards them to the driver after applying partitioning, memory swapping, and priority enforcement.
egpu‑device‑plugin: Discovers physical GPUs, reports a configurable number of egpu devices to the API server, and injects required environment variables into pods.
egpu scheduler: A custom scheduler (or a scheduler extension) that places egpu pods onto nodes based on bin‑pack or spread strategies at both the node and GPU‑card levels.
egpu‑webhook: Mutating admission webhook that detects egpu resource requests in a pod spec and sets the pod's spec.schedulerName to the EGPU scheduler (a registration sketch follows this list).
volcano‑egpu‑device‑plugin: Extends the device‑plugin for the Volcano batch scheduler, exposing memory partitioning and priority fields to Volcano queues.
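As a minimal sketch of how such a webhook could be registered with the API server (the names, namespace, and path below are assumptions, not EGPU's published manifests):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: egpu-webhook              # hypothetical name
webhooks:
  - name: pods.egpu-webhook.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore         # don't block pod creation if the webhook is down
    clientConfig:
      service:
        name: egpu-webhook        # hypothetical Service fronting the webhook
        namespace: egpu-system    # hypothetical namespace
        path: /mutate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
        operations: ["CREATE"]
```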
The workflow is:
1. The egpu‑device‑plugin registers virtual devices with the API server.
2. A pod that requests egpu resources triggers the egpu‑webhook, which points the pod at the EGPU scheduler.
3. The chosen scheduler (either the custom EGPU scheduler or Volcano) selects a node that matches the requested GPU model and resource fractions.
Resource Specification and Scheduling Semantics
After EGPU is installed, node status shows the reported egpu devices. Pods can request the following fields in their resource spec (a combined sketch follows this list):
Number of egpu units.
Absolute memory per unit (e.g., 4000Mi).
Memory as a percentage of the physical GPU (e.g., 50% of a 32 GiB card = 16 GiB).
Compute share as a percentage (e.g., 30% of the GPU’s SM capacity).
Priority level (high or low). High‑priority pods pre‑empt low‑priority ones; the latter are paused until resources become available.
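Putting these fields together, a pod spec could look like the sketch below. The resource names, annotation keys, and values are illustrative assumptions; EGPU's actual identifiers may differ. The egpu‑webhook would then fill in spec.schedulerName automatically.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tts-inference
  annotations:
    egpu.example.com/priority: "high"       # hypothetical key: pre-empts low-priority pods
    egpu.example.com/compute-limit: "weak"  # hypothetical key: may borrow idle compute
spec:
  containers:
    - name: tts
      image: registry.example.com/tts:latest
      resources:
        limits:
          egpu.example.com/egpu: 1        # number of egpu units
          egpu.example.com/memory: 4000   # MiB of GPU memory per unit
          egpu.example.com/compute: 30    # % of the card's SM capacity
```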
Compute limits are of two types:
Strong limit: A pod cannot exceed its allocated percentage even if the GPU has idle capacity.
Weak limit: When other pods are idle, a pod may use any free compute up to 100 %; when multiple pods are active, free compute is divided in proportion to the requested percentages (e.g., two active pods requesting 30 % and 60 % receive compute in a 1:2 ratio).
Integration with Volcano Scheduler
When using Volcano, the volcano‑egpu‑device‑plugin reports GPU devices together with their model names (e.g., A100, H20). This enables:
Model‑aware node selection so that a pod requesting A100 is scheduled only to nodes that expose that model.
Queue‑based tenant isolation: each tenant can be assigned a Volcano queue with its own quota of egpu resources.
Clear visibility of which GPU model each workload runs on, eliminating the need for separate node‑affinity rules.
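A per‑tenant quota could then be expressed as a Volcano Queue. The sketch below uses Volcano's real Queue API but carries over the hypothetical egpu resource names from the pod example above:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: tenant-a
spec:
  weight: 1
  capability:                       # cap on what tenant-a's pods may consume in total
    egpu.example.com/egpu: 8
    egpu.example.com/memory: 64000  # MiB across the tenant's pods
```

Jobs submitted to this queue are constrained by the tenant's quota, while the model names reported by the plugin steer each pod onto matching hardware without extra node‑affinity rules.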
Monitoring and Metrics Collection
EGPU exposes metrics at two layers:
Scheduler layer: Per‑GPU‑card allocation of memory and compute, plus per‑pod usage statistics.
Device‑plugin layer: Real‑time utilization (memory and SM usage) compatible with dcgm-exporter, providing both node‑level and pod‑level metrics.
These metrics are scraped by Prometheus, visualized in Grafana, and forwarded via remote_write to a prometheus‑kafka‑adapter. Downstream, a Flink job writes the data into Hive for long‑term reporting and cost‑optimization analysis.
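The Prometheus side of that pipeline is ordinary configuration. A minimal sketch, where the adapter URL and service‑discovery details are assumptions about the deployment:

```yaml
scrape_configs:
  - job_name: dcgm-exporter          # node- and pod-level GPU metrics
    kubernetes_sd_configs:
      - role: endpoints
remote_write:
  - url: http://prometheus-kafka-adapter.monitoring.svc:8080/receive
```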
Cost‑Reduction Case Studies
Training + inference co‑location: A low‑priority training job shares a GPU with a high‑priority inference job. During daytime inference spikes, the inference pod pre‑empts the training pod; at night the training pod consumes the idle compute.
Partitioning low‑utilization jobs: A job with a 40 GiB memory quota but low actual usage is partitioned based on its observed maximum memory, freeing the remainder for other services.
Memory oversubscription for time‑shifted workloads: Two services with complementary active periods (day vs. night) are placed on the same GPU. When one service is idle, its data is swapped to system memory, allowing the active service to use the full GPU memory with minimal performance impact.
Key Q&A Highlights
Dual‑dimensional oversubscription: Compute is time‑sliced according to quota limits; memory oversubscription relies on a unified memory manager that swaps data between GPU memory and host memory.
Volcano advantages: Direct GPU‑model awareness, tenant‑level queue isolation, and elimination of extra node‑selection policies.
Impact on model accuracy and performance: Accuracy is unchanged. Performance may degrade when multiple oversubscribed models are simultaneously active, because swapped data incurs slower host‑memory access; with staggered workloads the impact is minimal.
Heterogeneous hardware support: Separate device‑plugins are provided for each hardware type (GPU, NPU, etc.), allowing the scheduler to treat them uniformly.
Scheduling overhead: The underlying scheduler remains the standard kube‑scheduler with an added EGPU plugin, so overhead is comparable to native scheduling (see the configuration sketch after this list).
Conflict resolution: Priority levels ensure that high‑priority inference tasks pre‑empt low‑priority training tasks; when the high‑priority task finishes, the scheduler swaps the necessary data back into GPU memory and resumes the low‑priority job.
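For reference, enabling an out‑of‑tree plugin in kube‑scheduler is a standard configuration change, assuming the plugin is compiled into the scheduler binary via the scheduling framework. A sketch with a hypothetical EGPU plugin name:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: egpu-scheduler   # matches the name injected by egpu-webhook
    plugins:
      filter:
        enabled:
          - name: EGPU              # hypothetical plugin: GPU model/fraction fit
      score:
        enabled:
          - name: EGPU              # hypothetical plugin: bin-pack or spread scoring
```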