
Applying NVIDIA MPS to Boost GPU Utilization for Recommendation Inference

This article explains why traditional CPU inference and naïve GPU usage are inefficient for recommendation workloads, introduces NVIDIA Multi‑Process Service (MPS) technology, describes VIVO's custom Rust‑based inference engine and deployment strategies, and presents performance and cost benefits along with practical deployment considerations.


Why we chose MPS

In recommendation scenarios, CPU inference cannot meet throughput requirements, while naïve GPU usage suffers from low utilization: individual requests carry small compute loads that do not fill the GPU on their own.

What is MPS?

MPS (Multi‑Process Service) allows multiple processes to share a single GPU, each with its own context, enabling concurrent inference without dynamic batching or model conversion. It works transparently at the CUDA driver level, so applications need no code changes.
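To illustrate that transparency, a launcher only needs to point each worker process at the MPS control daemon's pipe directory before spawning it; everything else happens inside the CUDA driver. This is a minimal sketch, assuming the daemon uses the CUDA defaults (`/tmp/nvidia-mps` for pipes); the worker binary name is hypothetical:

```rust
use std::process::Command;

// Spawn one inference worker that attaches to a running MPS control
// daemon. MPS is transparent to CUDA code: the worker discovers the
// daemon through these environment variables and otherwise runs as a
// normal CUDA process.
fn spawn_mps_worker(worker_bin: &str) -> std::io::Result<std::process::Child> {
    Command::new(worker_bin)
        // Directory where nvidia-cuda-mps-control creates its named pipes
        // (the CUDA default is /tmp/nvidia-mps).
        .env("CUDA_MPS_PIPE_DIRECTORY", "/tmp/nvidia-mps")
        // Where MPS client/daemon logs are written.
        .env("CUDA_MPS_LOG_DIRECTORY", "/var/log/nvidia-mps")
        .spawn()
}
```

Several such workers launched this way submit kernels to the same physical GPU concurrently, which is what fills the device when any single request is too small to do so.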

How we use MPS

VIVO built a custom inference engine in Rust that runs multiple worker processes under a guardian process. One process handles model management (loading models from HDFS), while the others serve inference over gRPC and expose metrics to Prometheus. The engine uses TensorFlow's native GPU backend, avoiding costly graph splitting.
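The guardian pattern above can be sketched as a small supervision loop: the parent launches workers and relaunches any that exit. This is a simplified illustration, not VIVO's implementation; the loop is bounded by `max_launches` so the sketch terminates, whereas a real guardian would run indefinitely and add health checks:

```rust
use std::process::Command;
use std::{thread, time::Duration};

// Repeatedly launch a worker process, relaunching it each time it
// exits, up to `max_launches` times. Returns how many times the
// worker was launched.
fn supervise(worker: &str, args: &[&str], max_launches: u32) -> u32 {
    let mut launches = 0;
    while launches < max_launches {
        // Block until this worker instance exits.
        let status = Command::new(worker)
            .args(args)
            .status()
            .expect("failed to launch worker");
        launches += 1;
        if !status.success() {
            // Back off briefly before respawning a crashed worker.
            thread::sleep(Duration::from_millis(100));
        }
    }
    launches
}
```

Because each worker is a separate OS process with its own CUDA context under MPS, a crash in one worker does not take down its siblings; the guardian simply restarts it.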

Runtime environment

The solution runs on a hybrid infrastructure of bare‑metal servers and Kubernetes. On bare metal, each GPU gets its own MPS control process. In Kubernetes, two strategies are used: sharing the host's MPS control process across pods (simpler, but weakens isolation between containers) or running a dedicated MPS control process per pod (preserves container isolation).
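The first strategy amounts to mounting the host daemon's pipe directory into the pod. The fragment below is an illustrative sketch, assuming the host daemon publishes its pipes at the CUDA default `/tmp/nvidia-mps`; the pod name and image are hypothetical:

```yaml
# Pod workers attach to the host's MPS control daemon by mounting its
# pipe directory via hostPath (this is what reduces isolation: all pods
# on the node talk to the same daemon).
apiVersion: v1
kind: Pod
metadata:
  name: recsys-inference        # hypothetical
spec:
  containers:
    - name: inference
      image: example.com/recsys-inference:latest   # hypothetical
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: CUDA_MPS_PIPE_DIRECTORY
          value: /tmp/nvidia-mps
      volumeMounts:
        - name: mps-pipe
          mountPath: /tmp/nvidia-mps
  volumes:
    - name: mps-pipe
      hostPath:
        path: /tmp/nvidia-mps
```

The per‑pod strategy instead starts `nvidia-cuda-mps-control` inside each pod, trading a little GPU sharing efficiency for the container isolation boundary.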

Performance results

Tests with batch size 2000 show the MPS‑based GPU solution achieving 218 QPS at low latency, outperforming both a TensorRT‑based split‑graph approach and a CPU‑only baseline. Cost analysis indicates a 75% reduction versus CPU inference and a 14% reduction versus the TensorRT approach.

Key considerations

MPS requires Volta‑or‑newer GPUs and recent CUDA drivers, and may require custom GPU kernels for unsupported TensorFlow ops. It is best suited to models that are not too large, leave the GPU underutilized, and cannot easily be converted to TensorRT.

Suitable scenarios

The technology fits workloads where model size is moderate, GPU utilization is low, and graph conversion is difficult, as in many recommendation systems.

In summary, MPS provides a practical, high‑performance, and cost‑effective way to scale GPU inference for recommendation services without sacrificing model compatibility.

Tags: performance optimization, Rust, Kubernetes, GPU inference, TensorFlow, recommendation systems, MPS
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
