RecFlow Breaks DLRM Inference Bottleneck with Fine-Grained GPU Parallelism
RecFlow, a new inference engine from Beijing University of Posts and Telecommunications and Meituan, tackles the resource mismatch of DLRM models by coordinating embedding and DNN operators at the intra‑SM level and introducing interference‑aware adaptive scheduling and incremental batching, achieving up to 9.34× higher throughput than native PyTorch on an NVIDIA RTX 3090.
In modern recommendation services, deep learning recommendation models (DLRM) combine memory‑bound embedding lookups with compute‑bound dense neural network (DNN) layers. Existing inference systems execute these two operator types serially, causing GPU resources to be under‑utilized because only one resource type (compute or memory bandwidth) is fully used at any moment.
RecFlow observes that embedding and DNN operators exhibit opposite resource usage patterns within a streaming multiprocessor (SM): DNN kernels occupy compute cycles while leaving memory bandwidth idle, whereas embedding kernels heavily stress memory bandwidth with little compute. By interleaving lightweight embedding I/O into the idle periods of DNN execution, RecFlow achieves fine‑grained intra‑SM resource coordination.
Experimental profiling shows that this coordination raises HBM bandwidth utilization from 36 % to 82 % without noticeably degrading DNN compute density, effectively “inserting” embedding work into otherwise idle compute slots.
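The intuition behind this coordination can be captured with a back‑of‑the‑envelope model: a compute‑bound DNN kernel leaves the memory bus mostly idle, so a memory‑bound embedding kernel can run "inside" it nearly for free. The sketch below is illustrative only; the durations are invented, not measurements from the paper.

```python
# Toy model of serial vs. overlapped execution of a compute-bound DNN
# kernel and a memory-bound embedding kernel on one SM. All numbers are
# illustrative, not measurements from the paper.

def run_batch(dnn_ms, emb_ms, overlapped):
    """Return (total_time_ms, bandwidth_utilization) for one batch.

    dnn_ms: DNN kernel duration (compute-bound, memory bus mostly idle)
    emb_ms: embedding kernel duration (saturates memory bandwidth)
    """
    if overlapped:
        # Embedding I/O is slotted into the DNN's idle bandwidth,
        # so the batch finishes in the max of the two durations.
        total = max(dnn_ms, emb_ms)
    else:
        # Serial execution: only one resource type is busy at a time.
        total = dnn_ms + emb_ms
    # Fraction of the batch's lifetime the memory bus is actually busy.
    return total, emb_ms / total

serial = run_batch(dnn_ms=6.0, emb_ms=4.0, overlapped=False)
overlap = run_batch(dnn_ms=6.0, emb_ms=4.0, overlapped=True)
print(serial)   # bus busy 40% of the time over a 10 ms batch
print(overlap)  # same work finishes in 6 ms with ~67% bus utilization
```

The same arithmetic explains why overlap raises measured HBM utilization without slowing the DNN: total time shrinks while the embedding traffic stays constant.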
Because batch size fluctuates at runtime, the interference pattern between operators also changes. RecFlow therefore adopts an interference‑aware adaptive parallelism strategy that consists of:
Multi‑dimensional offline profiling to build a knowledge base of latency and throughput gains for different batch‑size and operator‑mix configurations.
Runtime dynamic scheme selection, where the scheduler picks the least‑interfering configuration from the pre‑computed set based on current load characteristics.
Fine‑grained tail‑effect optimization that injects plug‑in embedding kernels or rebalances task boundaries during the later stages of DNN execution, when thread‑level wave quantization would otherwise leave resources idle.
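The runtime selection step can be sketched as a lookup over the offline‑profiled knowledge base: given the current batch size, pick the configuration with the best profiled gain. The table entries, bucket sizes, and scheme names below are hypothetical placeholders, not RecFlow's actual profiling data.

```python
# Sketch of interference-aware scheme selection. The knowledge base maps
# (batch-size bucket, parallelism scheme) to a profiled throughput gain
# over serial execution; all entries here are invented for illustration.

KNOWLEDGE_BASE = {
    (64,  "serial"):        1.00,
    (64,  "intra_sm"):      1.35,
    (256, "serial"):        1.00,
    (256, "intra_sm"):      1.80,
    (256, "intra_sm_tail"): 1.95,  # with tail-effect plug-in kernels
}

def select_scheme(batch_size):
    """Pick the least-interfering (highest-gain) scheme for this load."""
    # Round the live batch size up to the nearest profiled bucket.
    bucket = min((b for b, _ in KNOWLEDGE_BASE if b >= batch_size),
                 default=max(b for b, _ in KNOWLEDGE_BASE))
    candidates = {s: g for (b, s), g in KNOWLEDGE_BASE.items() if b == bucket}
    return max(candidates, key=candidates.get)

print(select_scheme(48))   # small batch  -> "intra_sm"
print(select_scheme(200))  # larger batch -> "intra_sm_tail"
```

Keeping selection to a table lookup matters here: the decision sits on the critical path of every batch, so it must cost far less than the interference it avoids.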
Beyond operator‑level coordination, DLRM inference suffers from a strict data dependency: the top‑DNN stage must wait for all embedding lookups to finish. RecFlow introduces an incremental batching pipeline that overlaps these stages across batch boundaries:
Pre‑overlap: while the current batch’s top‑DNN runs, the scheduler proactively pulls newly arrived requests and launches their embedding lookups as micro‑batches.
Segmented pipeline: the top‑DNN is split into sub‑modules, allowing embedding work from new requests to interleave with different DNN phases.
Zero‑queue latency: requests enter the processing flow immediately upon arrival, eliminating the need to wait for a full batch to be assembled.
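The three mechanisms above compose into a single interleaved schedule: the top‑DNN of batch *i* is cut into segments, and embedding micro‑batches for batch *i+1* slot between them. The toy trace generator below illustrates that ordering; segment counts and naming are hypothetical, not RecFlow's actual pipeline structure.

```python
# Toy trace of RecFlow-style incremental batching: while batch i's
# segmented top-DNN runs, embedding lookups for batch i+1 launch as
# micro-batches instead of waiting for the DNN to finish. The segment
# count and event names are illustrative.

def pipeline(num_batches, segments=3):
    """Return an event trace interleaving DNN segments of batch i with
    embedding micro-batches of batch i+1 (pre-overlap)."""
    trace = ["emb[0]"]  # first batch's lookups start as soon as it arrives
    for i in range(num_batches):
        for seg in range(segments):
            trace.append(f"dnn[{i}].seg{seg}")
            # Pre-overlap: pull the next batch's embeddings during the
            # current batch's DNN phases.
            if i + 1 < num_batches:
                trace.append(f"emb[{i+1}].micro{seg}")
    return trace

trace = pipeline(num_batches=2)
print(trace[:5])
# ['emb[0]', 'dnn[0].seg0', 'emb[1].micro0', 'dnn[0].seg1', 'emb[1].micro1']
```

By the time batch 0's last DNN segment retires, batch 1's embeddings are already resident, so the data dependency between the two stages never stalls the pipeline.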
Performance evaluation on an NVIDIA RTX 3090 with real‑world production workloads shows that RecFlow’s end‑to‑end throughput exceeds the state‑of‑the‑art DLRM inference system RecFlex by 1.13× and outperforms native PyTorch by 9.34×. The system also reduces tail latency under high load while maintaining the same service‑level objectives.
RecFlow demonstrates that deep hardware‑aware analysis and fine‑grained scheduling can substantially alleviate resource conflicts in large‑scale recommendation inference, offering a practical path for accelerating AI workloads in production.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Network Intelligence Research Center (NIRC)
NIRC is based at the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—and is dedicated to solving real‑world problems, building top‑tier systems, publishing high‑impact papers, and advancing China's network technology.