Scaling Deep Learning Model Serving: High‑Concurrency, Low‑Latency Solutions
This article examines the challenges of deploying dozens of deep‑learning models at Zuoyebang and compares three serving architectures—Gunicorn + Flask + Transformers, Tornado + PyTorch, and Tornado + Triton—highlighting performance trade‑offs and presenting a final high‑concurrency, low‑latency solution in production.
Deep learning is a neural‑network‑based machine learning approach that is rapidly becoming an effective tool for tasks such as text classification and recommendation. Deploying trained models to applications poses challenges such as framework differences, scarce compute resources, and lack of standard implementations.
Zuoyebang operates at large scale and must serve dozens of deep‑learning models under high concurrency and low latency, with frequent model updates and many downstream business teams. Specific problems include tight coupling between data‑preprocessing modules and model versions, version conflicts among those modules, and wasted GPU resources when every model is kept loaded.
Model Deployment Options
1. Gunicorn + Flask + Transformers
Hugging Face Transformers makes model training, loading, and inference straightforward. Flask is simple to use, but its built‑in development server cannot meet high‑concurrency, low‑latency requirements, so we run the Flask app under Gunicorn’s WSGI workers to improve stability and throughput.
Architecture diagram:
Key points: load models via Transformers with an LRU cache that keeps at most eight models in memory; build a Flask prediction service handling preprocessing, inference, and post‑processing; launch Flask under Gunicorn workers for concurrent request handling.
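A minimal sketch of this setup follows, assuming a text‑classification pipeline; the route, payload fields, and model identifiers are illustrative rather than the production API.

```python
# app.py -- minimal sketch of the Flask prediction service (illustrative names)
from functools import lru_cache

from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)

@lru_cache(maxsize=8)  # keep at most eight models resident, per the LRU policy
def get_model(model_name: str):
    # The first request for a model pays the full load cost; later hits are free.
    return pipeline("text-classification", model=model_name)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    model = get_model(payload["model"])      # may evict and reload on a cache miss
    return jsonify(model(payload["text"]))   # pipeline handles pre/post-processing
```

The app is launched under Gunicorn rather than the development server, e.g. `gunicorn -w 4 -b 0.0.0.0:8000 app:app`. Note that each worker process holds its own model cache, which multiplies memory use.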
Issues observed: each worker spawns more than 100 threads, causing resource contention; health‑check timeouts under heavy load lead to pod eviction and further latency spikes; Transformers inference latency is high; and the LRU cache forces model reloads on misses, adding delay. With six serverless pods (16 CPU cores and 16 GB of memory each), the setup peaks at about 150 QPS.
2. Tornado + PyTorch
Tornado is an asynchronous, non‑blocking Python web server that handles high traffic robustly. This solution replaces Transformers with direct PyTorch inference to avoid redundant Trainer overhead and isolates each model into its own sub‑service, removing dynamic loading and reducing resource contention.
Architecture diagram:
Key steps: each model is wrapped by a Tornado sub‑service (preprocess, inference, post‑process); a dispatcher routes requests to the appropriate sub‑service; results are aggregated and returned. Async/await enables efficient concurrency.
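A sketch of the dispatch layer is below; the sub‑service registry, ports, and model names are hypothetical. The handler awaits each downstream call, so one slow model does not block the event loop.

```python
# dispatcher.py -- sketch of the routing layer; the registry below is hypothetical
import json

import tornado.ioloop
import tornado.web
from tornado.httpclient import AsyncHTTPClient

SUB_SERVICES = {
    "intent": "http://localhost:8001/predict",  # one Tornado sub-service per model
    "topic": "http://localhost:8002/predict",
}

class DispatchHandler(tornado.web.RequestHandler):
    async def post(self):
        payload = json.loads(self.request.body)
        url = SUB_SERVICES[payload["model"]]             # route by model name
        response = await AsyncHTTPClient().fetch(        # non-blocking HTTP call
            url, method="POST", body=json.dumps(payload))
        self.write(response.body)                        # return the sub-service result

if __name__ == "__main__":
    tornado.web.Application([(r"/predict", DispatchHandler)]).listen(8000)
    tornado.ioloop.IOLoop.current().start()
```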
Remaining problems: PyTorch still creates many threads per prediction; managing dozens of sub‑services is complex; no dynamic batching, limiting hardware utilization.
3. Single‑Model Performance Test
We benchmarked single‑model deployments using ONNX and TorchScript formats with Tornado and Triton Inference Server. ONNX provides a framework‑agnostic model file; Triton offers features such as dynamic batching, sequence handling, and model repository management.
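For reference, both formats can be produced from a PyTorch model roughly as follows; the toy model and shapes here are placeholders for our production models.

```python
# export.py -- sketch of producing the two benchmarked formats (toy model)
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2)).eval()
example = torch.randn(1, 128)

# TorchScript: trace into a self-contained .pt file that Triton can serve directly
torch.jit.trace(model, example).save("model.pt")

# ONNX: a framework-agnostic graph; the dynamic batch axis enables request batching
torch.onnx.export(
    model, example, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```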
All services run in Docker with 16 CPU cores; Locust generates the load. Without preprocessing, Triton + TorchScript yields the best performance thanks to dynamic batching. Integrating preprocessing into Triton, however, erodes that advantage because the preprocessing executes serially.
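The load generator was along these lines; the endpoint and payload are illustrative. Dynamic batching itself is enabled per model in Triton's config.pbtxt via a `dynamic_batching` block.

```python
# locustfile.py -- sketch of the load test; path and payload are illustrative
from locust import HttpUser, between, task

class PredictUser(HttpUser):
    wait_time = between(0.01, 0.1)  # aggressive pacing to probe peak QPS

    @task
    def predict(self):
        self.client.post("/predict", json={"model": "intent", "text": "sample"})
```

Run with, e.g., `locust -f locustfile.py --host http://<service>` and ramp users until latency degrades.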
4. Final Solution: Tornado + Triton
The chosen architecture packages each preprocessing module as a separate Python package to avoid version conflicts, and exposes multiple endpoints (e.g., xxx_single_predict and xxx_batch_predict) for different business needs.
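To make the version alignment concrete, each model's preprocessing ships as its own pip package, pinned to the model version it must match; the package and module names below are hypothetical.

```python
# Pinned in requirements.txt (hypothetical names):
#   intent-preprocess==1.3.0   # must match intent model v1.3 in the Triton repo
#   topic-preprocess==2.0.1    # must match topic model v2.0
from intent_preprocess import encode as encode_intent
from topic_preprocess import encode as encode_topic

PREPROCESSORS = {"intent": encode_intent, "topic": encode_topic}

def preprocess(model_name: str, text: str):
    return PREPROCESSORS[model_name](text)  # returns model-ready arrays
```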
Architecture diagram:
Tornado receives requests, performs preprocessing, asynchronously calls Triton for inference, then post‑processes and returns results. Performance tests with 21 models on a 16‑core Triton server and Locust traffic demonstrate that this setup achieves the best trade‑off between concurrency, latency, and manageability.
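A sketch of that inference path, assuming the tritonclient async HTTP client; the tensor names, shapes, stand-in preprocessing, and endpoint are all illustrative.

```python
# gateway.py -- sketch of the Tornado -> Triton hop; names are illustrative
import json

import numpy as np
import tornado.web
import tritonclient.http.aio as triton


def preprocess(text: str) -> np.ndarray:
    # Stand-in for the per-model preprocessing packages described above.
    return np.array([[ord(c) for c in text[:16]]], dtype=np.int64)


class SinglePredictHandler(tornado.web.RequestHandler):
    async def post(self):
        payload = json.loads(self.request.body)
        input_ids = preprocess(payload["text"])

        infer_input = triton.InferInput("input_ids", list(input_ids.shape), "INT64")
        infer_input.set_data_from_numpy(input_ids)

        client = triton.InferenceServerClient(url="localhost:8000")
        try:
            # await keeps the event loop free while Triton queues, batches, and runs
            result = await client.infer(payload["model"], [infer_input])
        finally:
            await client.close()

        self.write({"logits": result.as_numpy("output").tolist()})
```

In production you would keep one long‑lived client per process instead of creating one per request; it is inlined here only to keep the sketch self‑contained.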
Conclusion and Outlook
To recap, the key challenges were: roughly 100 models with frequent updates, strong coupling between preprocessing modules and model versions causing conflicts, and strict high‑concurrency, low‑latency requirements in a CPU‑only environment.
Through several experiments we identified a deployment pattern—Tornado + Triton with external preprocessing—that satisfies our complex scenario without GPU resources. Future work includes improving ONNX performance in Triton, decoupling preprocessing from model versions, and exploring more robust version‑alignment strategies.