Microsecond-Scale GPU Preemption Enables Concurrent Real-Time DNN Inference
REEF introduces a reset‑based preemption mechanism and dynamic kernel padding to achieve microsecond‑scale GPU kernel preemption, enabling concurrent real‑time and best‑effort DNN inference with only 2 % added latency for real‑time tasks while boosting overall throughput by up to 7.7×, as demonstrated on the DISB benchmark.
Background
In modern intelligent systems, multiple DNN models run concurrently on a single GPU to serve different functions. Real‑time tasks such as obstacle detection in autonomous driving demand strict, predictable latency, while best‑effort tasks like fatigue monitoring have looser latency requirements.
Existing GPU scheduling approaches either leave GPU resources under‑utilized or incur high latency for real‑time tasks.
The paper proposes REEF, a GPU‑accelerated DNN inference service that achieves microsecond‑scale kernel preemption and controllable concurrent execution through a reset‑based preemption scheme and dynamic kernel management.
Current GPU Scheduling Strategies
Sequential execution: Non‑preemptive; a real‑time task must wait for the currently running best‑effort task to finish, leading to high latency and low throughput.
Block‑level preemption: Preempts only at thread‑block boundaries, so preemption latency depends on the remaining execution time of the in‑flight blocks; only one task's kernels occupy the GPU at a time, and best‑effort tasks can be starved.
Multiple GPU streams: Executes tasks on separate streams, but real‑time tasks may suffer longer execution due to resource contention.
Characteristics of DNN Inference on GPUs
Idempotence: Most kernels are stateless algebraic operations that produce the same output for identical inputs, so killing and restarting a kernel does not affect correctness (a minimal sketch follows this list).
Massive kernels: A single inference launches a large number of short kernels; saving and restoring the context of each one at preemption would incur prohibitive overhead.
Latency predictability: Kernel execution times are highly predictable because their control flow and workload do not depend on input values.
Varied parallelism: Parallelism changes with input size, causing fluctuating compute‑unit demand.
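The idempotence observation is worth a concrete illustration. Below is a minimal CUDA sketch with illustrative kernels (not taken from REEF): the first kernel reads only its inputs and overwrites its outputs, so killing it midway and relaunching it later yields exactly the same result; the second accumulates in place and would not be safe to kill and restart.

```cuda
// Illustrative only: idempotent vs. non-idempotent kernels (not REEF code).
__global__ void scale_idempotent(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.5f * in[i];   // output depends only on the input:
                                        // re-running after a kill is harmless
}

__global__ void accumulate_not_idempotent(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;          // in-place update: re-running after a
                                        // partial execution changes the result
}
```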
REEF Design
Overview
REEF operates in an offline phase (compiling and loading models, profiling kernel resource and time requirements) and an online phase with two modes: normal mode (no real‑time task) and real‑time mode (real‑time task preempts best‑effort task).
The scheduling example shows that when a real‑time task arrives, REEF immediately switches to real‑time mode, aborts the current best‑effort task using reset‑based preemption, and runs both tasks in parallel via Dynamic Kernel Padding. After the real‑time task finishes, the system returns to normal mode.
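To make the mode switch concrete, here is a hypothetical host-side sketch of the control flow just described. All names (Task, preempt_best_effort, launch_with_padding, and so on) are illustrative stand-ins, not REEF's actual API; the real logic lives inside the modified GPU runtime.

```cuda
#include <deque>

struct Task { int id; };                          // stand-in for an inference request

struct OnlineScheduler {
    std::deque<Task> real_time;                   // real-time queue
    std::deque<Task> best_effort;                 // best-effort queue

    void step() {
        if (!real_time.empty()) {                 // real-time mode
            Task rt = real_time.front(); real_time.pop_front();
            preempt_best_effort();                // reset HQs, DQs, CUs (next section)
            launch_with_padding(rt, best_effort); // co-run RT + padded BE kernels
            restore_best_effort();                // relaunch the killed BE kernels
        } else if (!best_effort.empty()) {        // normal mode
            Task be = best_effort.front(); best_effort.pop_front();
            launch(be);
        }
    }

    // Stubs; the following sections sketch what they would do.
    void preempt_best_effort() {}
    void launch_with_padding(const Task&, std::deque<Task>&) {}
    void restore_best_effort() {}
    void launch(const Task&) {}
};
```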
Reset‑Based Preemption
Leveraging kernel idempotence, REEF kills the currently executing best‑effort kernels outright and simply relaunches them later, with no need to save or restore execution context.
Preemption must reset kernels in three locations: Host Queues (HQs), Device Queues (DQs), and Compute Units (CUs).
Reset HQs: The GPU runtime maintains a linked list (HQ) for each stream; REEF dequeues and frees all kernels in the HQ.
Reset DQs: Instead of evicting kernels from the DQ (which could cause data races), REEF adds a preemption‑flag check at kernel entry; when the flag is set, the kernel exits immediately after entering the CU (see the sketch after this list).
Reset CUs: By modifying the driver's kernel‑killing function, REEF can kill a kernel running on a CU while leaving the task's state (memory and arguments) intact, so the killed kernel can be relaunched later.
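A minimal CUDA sketch of the preemption‑flag idea behind the DQ reset follows. It is an illustration under simplifying assumptions, not REEF's implementation: REEF injects the check into generated kernels and flips the flag from inside the modified AMD GPU runtime, whereas here a host‑pinned, device‑visible flag stands in for that path.

```cuda
#include <cuda_runtime.h>

// Generated best-effort kernels begin with a flag check: once the flag is set,
// every kernel still sitting in the device queue exits right after it is
// dispatched to a CU, so the queue drains in microseconds.
__global__ void be_kernel(const volatile int* preempt_flag,
                          const float* in, float* out, int n) {
    if (*preempt_flag) return;                      // preempted: bail out at once
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];              // normal best-effort work
}

int main() {
    int* flag;                                      // pinned, device-visible flag, so
    cudaHostAlloc((void**)&flag, sizeof(int),       // raising it bypasses the queues
                  cudaHostAllocMapped);
    *flag = 0;

    float *in, *out;
    cudaMalloc(&in, 1024 * sizeof(float));
    cudaMalloc(&out, 1024 * sizeof(float));

    for (int k = 0; k < 64; ++k)                    // enqueue a burst of BE kernels
        be_kernel<<<4, 256>>>(flag, in, out, 1024);

    *flag = 1;                                      // "reset" the DQ: queued kernels
    cudaDeviceSynchronize();                        // now exit as soon as they start
    cudaFree(in); cudaFree(out); cudaFreeHost(flag);
    return 0;
}
```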
Dynamic Kernel Padding
REEF first allocates sufficient resources to the real‑time task, then assigns remaining resources to best‑effort tasks.
Instead of fusing kernels at compile time, REEF merges them at runtime: candidate kernels are compiled as device functions and invoked from a proxy kernel through function pointers filled in at launch time.
During the offline phase, REEF profiles each kernel's execution time; online, the scheduler pads a real‑time kernel only with best‑effort kernels whose profiled duration is shorter than that of the co‑running real‑time kernel, so padding does not delay the real‑time task. To keep overhead low, REEF uses global function pointers and pre‑built proxy kernels, reducing the pointer‑dispatch cost to about 1 %.
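Below is a minimal CUDA sketch of the dynamic‑kernel‑padding idea under stated assumptions: the kernel names, the two‑entry function table, and the block split are all illustrative, and REEF itself generates this plumbing automatically for AMD GPUs via ROCm/HIP rather than hand‑writing CUDA. Each candidate kernel is compiled as a device function with a uniform signature; a proxy kernel gives the first blocks to the real‑time kernel and pads the leftover blocks with a best‑effort kernel, dispatched through a global device‑side function‑pointer table.

```cuda
// Illustrative dynamic kernel padding (hypothetical; not REEF's actual code).
typedef void (*PaddedFn)(const float*, float*, int, int);  // in, out, n, logical block

__device__ void rt_scale(const float* in, float* out, int n, int block) {
    int i = block * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];               // stand-in real-time kernel
}

__device__ void be_relu(const float* in, float* out, int n, int block) {
    int i = block * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] > 0.f ? in[i] : 0.f;  // stand-in best-effort kernel
}

// Global device-side function-pointer table; the host passes only small indices,
// which keeps the per-launch dispatch cost low.
__device__ PaddedFn g_table[2] = { rt_scale, be_relu };

__global__ void proxy_kernel(int rt_blocks,
                             const float* rt_in, float* rt_out, int rt_n,
                             const float* be_in, float* be_out, int be_n) {
    if (blockIdx.x < rt_blocks) {
        g_table[0](rt_in, rt_out, rt_n, blockIdx.x);             // real-time share
    } else {
        // Leftover blocks are padded with a best-effort kernel whose profiled
        // duration is shorter than the real-time kernel's.
        g_table[1](be_in, be_out, be_n, blockIdx.x - rt_blocks); // padding share
    }
}
```

A launch such as proxy_kernel<<<rt_blocks + be_blocks, 256>>>(rt_blocks, ...) then occupies all compute units while the real‑time kernel keeps the share it needs.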
Evaluation
The authors introduce a real‑time DNN inference benchmark (DISB) and compare REEF against three baselines:
RT‑Only: executes only real‑time tasks, providing the best end‑to‑end latency.
SEQ: non‑preemptive sequential scheduling that prioritises real‑time tasks at submission.
GPUStreams: concurrent execution using multiple GPU streams.
Results show that REEF adds only 2 % latency to real‑time tasks compared with RT‑Only, while increasing overall throughput by 1.14× to 7.7×.
Network Intelligence Research Center (NIRC)
NIRC is based at the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains (intelligent cloud networking, natural language processing, computer vision, and machine learning systems), dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing to the rapid advancement of China's network technology.
