Artificial Intelligence · 10 min read

InferX Inference Framework: Challenges, Architecture, Optimizations, and Triton Integration

The article presents the background, challenges, and objectives of Bilibili's AI services, introduces the self‑developed InferX inference framework with its quantization and sparsity optimizations, details OCR‑specific enhancements, and describes how integrating InferX with Nvidia Triton dramatically improves throughput, latency, and GPU utilization.

High Availability Architecture
Background: AI algorithm complexity and resource consumption are rising at Bilibili, with computer vision, NLP, and speech workloads demanding efficient inference and deployment across hundreds of scenarios.

Challenges and Goals: Rapid traffic growth pressures response time and QPS; large language models increase model complexity; frame-level video processing (e.g., OCR handling >1 billion 720p images per day) stresses inference services. The objectives are to increase inference throughput, reduce resource growth, improve response time, and enable new business scenarios.

InferX Inference Framework: A generic framework composed of an Interpreter, Graph Optimizer, and Backend, with pre-inference optimizations such as model quantization and sparsity. Recent iterations added ONNX support (covering TensorFlow and Paddle models), reduced CPU usage, and introduced INT8 and sparsity optimizations.
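The Interpreter → Graph Optimizer → Backend pipeline can be sketched in miniature. This is an illustrative mock-up, not InferX's real API: all function names, the graph representation, and the Conv+ReLU fusion pass are assumptions chosen to show how the three stages hand off to each other.

```python
# Hypothetical sketch of InferX's three-stage pipeline:
# Interpreter (parse) -> Graph Optimizer (rewrite) -> Backend (execute).
# All names and the fusion rule are illustrative, not InferX's real API.

def interpret(onnx_nodes):
    """Interpreter: turn a flat ONNX-style (op, inputs) list into a graph."""
    return [{"op": op, "inputs": ins} for op, ins in onnx_nodes]

def optimize(graph):
    """Graph Optimizer: fuse adjacent Conv+Relu pairs into one fused op."""
    fused, i = [], 0
    while i < len(graph):
        if (i + 1 < len(graph)
                and graph[i]["op"] == "Conv"
                and graph[i + 1]["op"] == "Relu"):
            fused.append({"op": "ConvRelu", "inputs": graph[i]["inputs"]})
            i += 2
        else:
            fused.append(graph[i])
            i += 1
    return fused

def run_backend(graph):
    """Backend: dispatch each (possibly fused) op to a kernel launch."""
    return [node["op"] for node in graph]  # stand-in for kernel launches

nodes = [("Conv", ["x"]), ("Relu", ["c0"]), ("MatMul", ["r0", "w"])]
print(run_backend(optimize(interpret(nodes))))  # ['ConvRelu', 'MatMul']
```

Operator fusion of this kind is a typical graph-optimizer pass because it removes an intermediate tensor round-trip to memory between the two kernels.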

Pre-Inference Optimizations:
• Model Quantization: INT8 TensorCore offers double the performance of FP16; InferX provides a quantization SDK, with PTQ already deployed in OCR and copyright detection, achieving near-lossless accuracy and a ~2× speedup.
• Structured Sparsity: Leveraging 2:4 sparsity on Nvidia Ampere TensorCores yields up to 2× acceleration for supported conv/linear ops, though overall model speedup is lower due to limited kernel coverage.
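Both optimizations above have a simple core that can be sketched in plain Python. The calibration policy (max-abs symmetric scaling) and function names are assumptions for illustration; InferX's actual quantization SDK is not shown in the article. The 2:4 check mirrors the Ampere structured-sparsity constraint: at most two nonzeros in every group of four weights.

```python
# Sketch of symmetric INT8 post-training quantization (PTQ) and a 2:4
# structured-sparsity check. Calibration policy and names are assumed
# for illustration, not taken from InferX's SDK.

def calibrate_scale(samples):
    """Pick a scale so the max-abs calibration value maps to 127."""
    max_abs = max(abs(v) for v in samples)
    return max_abs / 127.0 if max_abs else 1.0

def quantize(values, scale):
    """Round to int8, clamping to the representable [-128, 127] range."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize(q, scale):
    return [v * scale for v in q]

def is_2_4_sparse(weights):
    """Ampere 2:4 pattern: <= 2 nonzeros in every group of 4 weights."""
    return all(sum(1 for w in weights[i:i + 4] if w) <= 2
               for i in range(0, len(weights), 4))

acts = [0.5, -1.0, 0.02, 1.0]
s = calibrate_scale(acts)
deq = dequantize(quantize(acts, s), s)
# Round-trip error stays within half a quantization step:
assert all(abs(a - b) <= s / 2 for a, b in zip(acts, deq))
```

The 2:4 constraint is what lets the hardware skip half the multiplications in supported conv/linear kernels, which is where the up-to-2× figure comes from.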

OCR Use-Case Optimizations:
• Model adaptation via ONNX parsing and custom CUDA deformable-convolution kernels (NHWC layout, memory alignment) improves cache friendliness and reduces im2col overhead.
• JPEG decoding moved from CPU-bound libjpeg to a CUDA-accelerated nvjpeg-based library, cutting decode time to a quarter of the CPU baseline.
• Unified video and live-stream OCR services share a single model server, enabling priority handling for live streams, boosting GPU utilization to ~80%, and cutting overall resource usage by 63%.
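The priority handling in the last bullet can be sketched as a two-class queue in front of the shared model server: live-stream frames are drained before video-on-demand frames, with FIFO order preserved within each class. The queue shape, priority values, and batch size are assumptions for illustration.

```python
# Sketch of the shared-server priority handling described above:
# live-stream OCR frames jump ahead of on-demand (VOD) frames.
# Priority values and batching policy are illustrative assumptions.

import heapq
import itertools

LIVE, VOD = 0, 1          # lower number = higher priority
_counter = itertools.count()  # FIFO tie-break within a priority class

def submit(queue, priority, frame_id):
    heapq.heappush(queue, (priority, next(_counter), frame_id))

def next_batch(queue, max_batch=4):
    """Drain up to max_batch frames for inference, live frames first."""
    batch = []
    while queue and len(batch) < max_batch:
        _, _, frame_id = heapq.heappop(queue)
        batch.append(frame_id)
    return batch

q = []
submit(q, VOD, "vod-1")
submit(q, VOD, "vod-2")
submit(q, LIVE, "live-1")
print(next_batch(q, 2))  # ['live-1', 'vod-1']
```

Sharing one server this way keeps the GPU busy with VOD backlog whenever live traffic is quiet, which is consistent with the ~80% utilization figure above.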

Triton Model Service: Chosen for its multi-framework support, dynamic batching, model orchestration (BLS), and metrics collection. Integration with InferX provides low-latency, high-throughput inference pipelines.

Inference Process with Triton + InferX: Incoming HTTP/gRPC requests are dynamically batched, dispatched via BLS scripts to sub-models accelerated by InferX, supporting both parallel and pipelined multi-model scenarios, with synchronous result return and unified monitoring.
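The control flow above can be mimicked in pure Python: batch requests, run a detection sub-model, feed the detected regions to a recognition sub-model, and scatter results back per request. This imitates only the shape of a BLS pipeline; real Triton BLS scripts use `triton_python_backend_utils` and its `InferenceRequest` API, which is not reproduced here, and the two stand-in sub-models are invented for the sketch.

```python
# Pure-Python mock of the BLS-style flow: dynamic batch -> detection
# sub-model -> recognition sub-model -> synchronous per-request results.
# Both sub-models are stand-ins, not real InferX-accelerated models.

def detect(batch):
    """Stand-in for the detection sub-model: two text boxes per image."""
    return [[f"{img}:box{i}" for i in range(2)] for img in batch]

def recognize(boxes):
    """Stand-in for the recognition sub-model, run over all crops."""
    return [f"text({b})" for b in boxes]

def bls_pipeline(requests, max_batch=8):
    """Batch incoming requests, pipeline the two sub-models, and
    scatter recognized text back to each originating request."""
    results = []
    for start in range(0, len(requests), max_batch):
        batch = requests[start:start + max_batch]
        per_image_boxes = detect(batch)
        flat = [b for boxes in per_image_boxes for b in boxes]
        texts = iter(recognize(flat))
        for boxes in per_image_boxes:
            results.append([next(texts) for _ in boxes])
    return results

print(bls_pipeline(["img0", "img1"]))
```

Flattening all crops into one recognition call is what makes the pipelined case batching-friendly: the second model sees one large batch instead of one small call per image.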

Performance Gains: Triton deployment yields 3–8× higher per-instance throughput, 50% fewer GPUs, and >90% GPU utilization under stress; combined with InferX's 4–7× acceleration, the system handles traffic growth without additional hardware.
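A back-of-envelope check of how these figures stack, under the idealized assumption that the two gains compose multiplicatively (real pipelines rarely compose perfectly, and the article does not state whether the 3–8× figure already includes InferX). The baseline QPS is an arbitrary placeholder.

```python
# Back-of-envelope composition of the cited gains. The multiplicative
# assumption and the 100-QPS baseline are illustrative, not measured.

def combined_capacity(triton_gain, inferx_gain, baseline_qps=100):
    """Effective per-instance QPS if the two gains multiply cleanly."""
    return baseline_qps * triton_gain * inferx_gain

low = combined_capacity(3, 4)   # conservative ends of both ranges
high = combined_capacity(8, 7)  # optimistic ends of both ranges
print(low, high)  # 1200 5600
```

Even the conservative end (12× effective capacity per instance) explains how traffic growth can be absorbed without adding hardware.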

Conclusion: The self-developed InferX framework together with the Triton model service substantially improves resource efficiency, reduces costs, maintains low response latency, and accelerates AI service development and deployment across diverse business needs.

Tags: OCR, CUDA, AI optimization, Inference, Triton, Model Quantization, Structured Sparsity
Written by High Availability Architecture

Official account for High Availability Architecture.
