Building a Mixed OR+ML Inference Framework with TritonServer: Architecture, Challenges, and Solutions

The article describes how a large‑scale dispatch system was re‑engineered with NVIDIA TritonServer to unify GPU‑accelerated operations‑research kernels and deep‑learning models, detailing a three‑stage architecture (in‑process, cross‑process, cross‑node), the performance, stability and memory challenges addressed, and future plans for heterogeneous GPU scaling.

Meituan Technology Team

This article presents the design and implementation of a mixed Operations Research (OR) and Machine Learning (ML) inference framework for a large‑scale dispatch system. The framework combines GPU‑accelerated CUDA kernels for OR algorithms (e.g., route planning) with deep‑learning models such as the ETR arrival‑time‑estimation model, addressing performance, stability, and scalability challenges that arise when moving from CPU‑based remote inference to GPU‑local inference.

Background: The dispatch system must assign orders to couriers under strict latency constraints. OR algorithms and ML models dominate compute time, accounting for over 60% of the workload. Deploying these workloads on remote CPUs would require tens of thousands of servers, leading to high operational cost and maintenance difficulty.

Problems:

Performance: Lack of unified task scheduling leads to fragmented workloads, imbalanced GPU utilization, and frequent context switches, causing noticeable latency increases.

Stability: GPU tasks can trigger CUDA exceptions (e.g., address overflow, ECC errors) that require a full process restart, resulting in 10–15 minutes of downtime.

Scalability: Each model or algorithm pre‑allocates 3–5 GB of GPU memory; at that rate a 24 GB GPU can host only roughly five to eight such workloads, so memory becomes the bottleneck as algorithm complexity grows.

Solution Approach: After evaluating TFServing, TorchServe, and TritonServer, the team selected NVIDIA’s open‑source TritonServer for its broad model support and extensibility. The solution includes:

Using TritonCore’s C‑API to integrate TensorRT Backend for deep‑learning models (e.g., ETR).

Developing a custom OR Backend to incorporate hand‑written CUDA kernels for route‑planning algorithms (a skeleton of the backend entry point follows this list).

Designing a three‑phase architecture evolution: in‑process calls, cross‑process calls, and cross‑node calls.
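The OR Backend hooks into Triton through the backend C API. Below is a minimal sketch of the execute entry point such a backend must export; `LaunchRoutePlanningKernel` is a hypothetical stand‑in for the hand‑written route‑planning kernels, and input/output tensor handling is elided.

```cpp
// Sketch of a custom Triton backend's execute entry point (backend C API).
// LaunchRoutePlanningKernel is a hypothetical stand-in for the hand-written
// CUDA route-planning kernels; tensor I/O plumbing is omitted for brevity.
#include <cstdint>
#include <cuda_runtime.h>
#include "triton/core/tritonbackend.h"

extern void LaunchRoutePlanningKernel(const float* d_input, float* d_output,
                                      size_t n, cudaStream_t stream);

extern "C" {

// Triton calls this once per batch of requests scheduled to this model
// instance. A full backend reads input tensors via the TRITONBACKEND_Input*
// functions, launches the CUDA kernel, and emits responses.
TRITONSERVER_Error* TRITONBACKEND_ModelInstanceExecute(
    TRITONBACKEND_ModelInstance* instance, TRITONBACKEND_Request** requests,
    const uint32_t request_count) {
  for (uint32_t r = 0; r < request_count; ++r) {
    TRITONBACKEND_Request* request = requests[r];
    // ... extract inputs, call LaunchRoutePlanningKernel, build response ...
    TRITONBACKEND_RequestRelease(request, TRITONSERVER_REQUEST_RELEASE_ALL);
  }
  return nullptr;  // nullptr signals success to TritonCore.
}

}  // extern "C"
```

Because this backend shares TritonCore’s scheduler with the TensorRT backend, OR kernels and DL models are queued and batched by one runtime instead of competing for the GPU ad hoc.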

Architecture Evolution:

In‑process Invocation: Integrated TritonCore into the Java‑based dispatch process via JNI, adding the OR Backend, monitoring via Meituan’s Raptor platform, and fixing memory leaks with Valgrind and NVIDIA support.
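As an illustration of this stage, the sketch below shows what a JNI entry point booting TritonCore through the C API might look like inside the Java process. The Java class name is hypothetical, and error checks on the option calls are elided.

```cpp
// Sketch of the in-process integration: a JNI-exported function that starts
// TritonCore inside the Java dispatch process via the C API. The class name
// com.meituan.dispatch.TritonBridge is illustrative, not the production one.
#include <jni.h>
#include "triton/core/tritonserver.h"

static TRITONSERVER_Server* g_server = nullptr;

extern "C" JNIEXPORT jboolean JNICALL
Java_com_meituan_dispatch_TritonBridge_nativeInit(JNIEnv* env, jclass,
                                                  jstring repo_path) {
  const char* repo = env->GetStringUTFChars(repo_path, nullptr);

  TRITONSERVER_ServerOptions* options = nullptr;
  TRITONSERVER_ServerOptionsNew(&options);
  TRITONSERVER_ServerOptionsSetModelRepositoryPath(options, repo);

  // In-process server: TensorRT models and the OR Backend load into the
  // caller's address space, so inference involves no network hop at all.
  TRITONSERVER_Error* err = TRITONSERVER_ServerNew(&g_server, options);
  TRITONSERVER_ServerOptionsDelete(options);
  env->ReleaseStringUTFChars(repo_path, repo);

  if (err != nullptr) {
    TRITONSERVER_ErrorDelete(err);
    return JNI_FALSE;
  }
  return JNI_TRUE;
}
```

The trade‑off that motivated the next stage is visible here: the server lives inside the business JVM, so any CUDA fault takes the whole dispatch process down with it.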

Cross‑process Invocation: Replaced JNI with gRPC + shared memory (SHM) to separate inference from business logic, reducing failure recovery time from >10 minutes to ~10 seconds. Implemented a shared‑memory pool to support >6000 QPS with minimal overhead.
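A minimal sketch of the slot management behind such a pool, assuming POSIX shared memory (`shm_open`/`mmap`); names and sizes are illustrative. The business process writes tensors into a pre‑created slot and sends only the region name and offset over gRPC, so the RPC payload stays tiny regardless of tensor size.

```cpp
// Sketch of a fixed-size shared-memory slot shared by the business process
// and the inference process (POSIX shm). Pooling pre-created slots avoids
// per-request shm_open/ftruncate costs at several thousand QPS.
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

void* MapRequestSlot(const char* name, size_t bytes, bool create) {
  int fd = shm_open(name, create ? (O_CREAT | O_RDWR) : O_RDWR, 0660);
  if (fd < 0) return nullptr;
  if (create && ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
    close(fd);
    return nullptr;
  }
  void* addr = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);  // The mapping remains valid after the descriptor is closed.
  return addr == MAP_FAILED ? nullptr : addr;
}
```

TritonServer natively supports registering system shared‑memory regions for inputs and outputs, so the inference side can read and write tensors in place rather than copying them through the gRPC channel.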

Cross‑node Invocation: Added multi‑node routing using a Power‑of‑Two‑Choices load‑balancing algorithm, RDMA‑based low‑latency data transfer, and an MRPC fallback. This achieves 18% higher throughput and 25% lower 99th‑percentile latency.
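Power‑of‑Two‑Choices is simple enough to sketch in full: sample two nodes uniformly at random and route to the one with fewer in‑flight requests. The load signal here is a plain atomic counter; the production router presumably also accounts for node health and RDMA reachability.

```cpp
// Sketch of Power-of-Two-Choices routing across inference nodes.
#include <atomic>
#include <cstddef>
#include <random>
#include <vector>

struct Node {
  std::atomic<int> in_flight{0};  // Outstanding requests on this node.
};

size_t PickNode(std::vector<Node>& nodes, std::mt19937& rng) {
  std::uniform_int_distribution<size_t> dist(0, nodes.size() - 1);
  const size_t a = dist(rng);
  const size_t b = dist(rng);
  // Route to the less-loaded of the two sampled nodes.
  const size_t chosen =
      nodes[a].in_flight.load() <= nodes[b].in_flight.load() ? a : b;
  nodes[chosen].in_flight.fetch_add(1);  // Decrement when the reply returns.
  return chosen;
}
```

Compared with picking a single node at random, the second sample sharply reduces the chance of piling requests onto an already busy node, while the routing decision stays O(1) as the fleet grows.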

Future Outlook: The framework will adapt to heterogeneous GPU fleets, implement hardware‑aware deployment, and explore multi‑level caching and distributed inference to support larger search spaces and large‑model combinatorial optimization.

References include performance results of the hand‑written CUDA route‑planning algorithm (14.8× speedup over Java) and links to TritonServer documentation and related industry articles.

Tags: performance optimization, machine learning, scalability, GPU, inference, OR, TritonServer
Written by Meituan Technology Team

Over 10,000 engineers power China’s leading lifestyle‑services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
