Nimble: A Lightweight Parallel GPU Scheduler Boosting Deep Learning Performance

The article analyzes how Nimble reduces GPU scheduling overhead and enables parallel execution through ahead‑of‑time scheduling and automatic multi‑stream assignment, achieving up to 22.3× inference speedup over PyTorch and significantly improving GPU utilization for deep learning workloads.

Deep learning frameworks run models on GPUs to accelerate inference and training, automatically launching the GPU tasks (kernels and memory operations) that each operator requires. A framework first converts the neural network into an operator graph, then, at runtime, resolves each operator into concrete GPU tasks based on the input tensor shapes and submits them to the GPU, a process known as GPU task scheduling.
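For concreteness, here is a minimal PyTorch sketch (not from the paper) of the operator-graph view a scheduler works from; the TwoLayerNet model and its layer sizes are made up for illustration.

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    """Toy model, used only to illustrate the operator-graph view."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TwoLayerNet().eval()
example = torch.randn(1, 128)

# Tracing records the operators executed for this input shape,
# yielding the static operator graph that a scheduler works from.
traced = torch.jit.trace(model, example)
print(traced.graph)  # aten::linear, aten::relu, aten::linear, ...
```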

Existing frameworks suffer from high scheduling overhead, leaving GPUs idle for a large portion of the runtime. Experiments show TensorFlow leaves GPUs idle up to 71% of the time and PyTorch up to 91%. A custom C++ program that reuses the same GPU tasks but eliminates most scheduling work runs 2.37× faster than PyTorch, confirming that scheduling overhead is the main source of GPU idle time.

GPUs can execute thousands of threads in parallel, but current frameworks typically submit all kernels to a single GPU stream, which serializes them and limits parallelism. Two approaches can raise utilization: increasing parallelism within each kernel, which is bounded by how much parallelism an individual kernel exposes, or running independent kernels concurrently across several streams, which can exploit the GPU's full capacity. The latter offers better utilization but is hard to implement because existing runtimes are designed around a single-stream execution model.
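The following is a minimal PyTorch sketch of the multi-stream idea, assuming a CUDA device; the two independent branches and their shapes are illustrative, not taken from Nimble.

```python
import torch

assert torch.cuda.is_available()
device = torch.device("cuda")

x = torch.randn(1024, 1024, device=device)
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

# Both side streams must wait until x is ready on the default stream.
s1.wait_stream(torch.cuda.current_stream())
s2.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(s1):   # branch 1: independent matmul
    a = x @ x
with torch.cuda.stream(s2):   # branch 2: can overlap with branch 1
    b = torch.relu(x).sum(dim=0)

# Join: the default stream waits for both branches before using a and b.
torch.cuda.current_stream().wait_stream(s1)
torch.cuda.current_stream().wait_stream(s2)
result = a.sum() + b.sum()
torch.cuda.synchronize()
```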

To address these inefficiencies, Nimble is introduced as a deep‑learning execution engine with three design goals: minimal scheduling overhead, parallel execution, and no need to redesign the framework runtime.

Ahead‑of‑Time (AoT) scheduling: Nimble performs scheduling only once before runtime. When an input arrives, it skips the scheduling phase and directly submits GPU tasks using a pre‑computed schedule. This is analogous to loop‑invariant code motion and dramatically shortens the runtime loop. For static neural networks, the set of GPU tasks is fixed; Nimble records an execution trace during an initialization run, storing task plans and memory allocations. At inference time, it replays this trace, eliminating per‑input scheduling work.
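Nimble implements its own trace capture and replay inside PyTorch; as a rough analogue of the same record-once, replay-many pattern, here is a sketch using PyTorch's later CUDA Graphs API (chosen here for illustration, not the mechanism the paper shipped).

```python
import torch

model = torch.nn.Linear(128, 128).cuda().eval()
static_in = torch.randn(32, 128, device="cuda")

# Warm-up on a side stream, as the CUDA Graphs API requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Record the full sequence of GPU tasks once (the "AoT" step).
g = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(g):
    static_out = model(static_in)

# At inference time, copy the new input into the captured buffer and
# replay the trace; no per-operator scheduling work is repeated.
new_input = torch.randn(32, 128, device="cuda")
static_in.copy_(new_input)
g.replay()
print(static_out.shape)
```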

Automatic multi‑stream execution: Nimble assigns GPU tasks to multiple streams and inserts synchronization operators to preserve dependencies. It computes a stream‑allocation mapping from graph nodes to GPU streams, aiming to (1) maximize logical concurrency and (2) minimize synchronization points. The rewritten graph includes stream markers and barrier operators, so that at runtime each GPU task runs on its assigned stream without extra scheduling logic.
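As a simplified illustration of such a stream‑allocation mapping (not the paper's exact algorithm), the sketch below greedily lets each node continue on a free predecessor's stream, opens a new stream otherwise, and flags every cross-stream edge as a synchronization point.

```python
from collections import defaultdict

def assign_streams(nodes, edges):
    """Greedy stream assignment over an operator DAG.

    nodes: node ids in topological order.
    edges: (src, dst) dependency pairs.
    Returns (stream_of, sync_edges): a node->stream map plus the
    cross-stream edges that need explicit synchronization.
    """
    preds = defaultdict(list)
    for s, d in edges:
        preds[d].append(s)

    stream_of, next_stream = {}, 0
    continued = set()  # predecessors whose stream a successor already took
    for n in nodes:
        inherited = None
        for p in preds[n]:
            if p not in continued:       # p's stream is free to continue on
                inherited = stream_of[p]
                continued.add(p)
                break
        if inherited is None:            # source node, or all parents taken
            inherited = next_stream
            next_stream += 1
        stream_of[n] = inherited

    # Any edge that crosses streams needs a sync (e.g., a CUDA event).
    sync_edges = [(s, d) for s, d in edges if stream_of[s] != stream_of[d]]
    return stream_of, sync_edges

# Example: a diamond-shaped graph a -> {b, c} -> d
streams, syncs = assign_streams(
    ["a", "b", "c", "d"],
    [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")],
)
print(streams)  # {'a': 0, 'b': 0, 'c': 1, 'd': 0}
print(syncs)    # [('a', 'c'), ('c', 'd')] -- cross-stream dependencies
```

On the diamond graph, the two independent branches land on different streams and can run concurrently, at the cost of two synchronization points at the fork and join.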

Evaluation: Nimble was implemented on PyTorch and evaluated on an NVIDIA V100 GPU. For inference, Nimble achieved up to 22.3× speedup over vanilla PyTorch and 2.8× over TensorRT. Multi‑stream execution provided up to 15× logical concurrency, surpassing manual stream allocation and synchronization. For training, the impact of scheduling overhead is smaller on large‑batch workloads, so the speedup is limited; however, on small‑scale training such as CIFAR‑10, Nimble still delivers noticeable gains.

In summary, while deep‑learning frameworks abstract GPU complexity and have driven many advances, their runtime scheduling introduces significant overhead. Nimble captures the core DL computation, applies ahead‑of‑time scheduling, and automatically parallelizes tasks across multiple streams, thereby reducing overhead and improving GPU utilization without requiring changes to existing frameworks.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Deep Learning · GPU scheduling · multi-stream · parallel execution · performance acceleration · ahead-of-time
Written by

Code DAO

We deliver AI algorithm tutorials and the latest news, curated by a team of researchers from Peking University, Shanghai Jiao Tong University, Central South University, and leading AI companies such as Huawei, Kuaishou, and SenseTime. Join us in the AI alchemy—making life better!
