
InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving

Bilibili’s self‑developed InferX framework, combined with NVIDIA Triton Inference Server, streamlines AI model serving by adding quantization, structured sparsity, and custom kernels, delivering up to eight‑fold throughput gains, cutting GPU usage by half, and enabling faster, cost‑effective OCR and large‑model deployments.

Bilibili Tech

This article introduces InferX, a self‑developed inference framework built to address the growing complexity of AI algorithms and the rising pressure on online compute resources at Bilibili. The framework consists of an Interpreter, a Graph Optimizer, and a Backend, and layers on pre‑inference optimizations such as model quantization and structured sparsity.

Background and Challenges

AI algorithm complexity and traffic growth lead to higher CPU/GPU consumption, longer response times, and difficulty scaling services. Specific challenges include:
- Resource consumption that grows linearly with traffic.
- Deployment of large language models (BERT, GPT, T5‑Large).
- Massive frame‑level processing for OCR (over 1 billion 720p images per day).
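To put the OCR figure in perspective, a quick back‑of‑envelope conversion of the daily volume into a sustained rate (my arithmetic, not a figure from the article):

```python
# Rough scale estimate: over 1 billion 720p frames per day,
# converted to a sustained per-second processing rate.
FRAMES_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

frames_per_second = FRAMES_PER_DAY / SECONDS_PER_DAY
print(f"{frames_per_second:,.0f} frames/s sustained")  # ~11,574 frames/s
```

At well over ten thousand frames per second around the clock, even small per‑frame savings translate into large fleet‑level reductions.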

Goals

- Increase inference throughput while slowing resource growth.
- Reduce response time and improve service quality.
- Enable new business scenarios.

InferX Framework Overview

The framework separates inference into three core components (Interpreter, Graph Optimizer, and Backend) and supports ONNX conversion, TensorRT lowering, and backend extensions. Recent iterations added ONNX support, runtime resource‑usage improvements, INT8 quantization, sparsity handling, and expanded image‑operator capabilities.
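The three‑stage split can be sketched as a toy pipeline. All class and method names here are hypothetical stand‑ins; InferX's real interfaces are not public, and the "model" is just a comma‑separated op list for illustration:

```python
# Illustrative three-stage inference pipeline:
# Interpreter -> Graph Optimizer -> Backend. Names are hypothetical.

class Interpreter:
    """Parses a serialized model (e.g. ONNX bytes) into an internal graph."""
    def parse(self, model_bytes: bytes) -> list[str]:
        # Stand-in: treat the model as a comma-separated list of op names.
        return model_bytes.decode().split(",")

class GraphOptimizer:
    """Rewrites the graph, e.g. fusing adjacent ops into one kernel."""
    def optimize(self, ops: list[str]) -> list[str]:
        fused, i = [], 0
        while i < len(ops):
            if i + 1 < len(ops) and ops[i] == "conv" and ops[i + 1] == "relu":
                fused.append("conv_relu")  # classic conv+relu fusion
                i += 2
            else:
                fused.append(ops[i])
                i += 1
        return fused

class Backend:
    """Lowers the optimized graph to an execution plan (e.g. via TensorRT)."""
    def compile(self, ops: list[str]) -> str:
        return " -> ".join(ops)

plan = Backend().compile(GraphOptimizer().optimize(Interpreter().parse(b"conv,relu,matmul")))
print(plan)  # conv_relu -> matmul
```

The value of the separation is that each stage can evolve independently: new model formats extend the Interpreter, new rewrites extend the Optimizer, and new hardware targets extend the Backend.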

Pre‑Inference Optimizations

- Model Quantization: INT8 TensorCore delivers up to 2× the performance of FP16; an in‑house quantization SDK enables straightforward post‑training quantization (PTQ) for OCR and copyright detection with near‑lossless accuracy.
- Structured Sparsity: 2:4 sparsity (supported from NVIDIA Ampere onward) accelerates TensorCore operations; pruning and sparsity‑aware kernels provide roughly 2× speedup for dense layers, though benefits vary across kernels.
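A minimal NumPy sketch of the two techniques, not the in‑house SDK: symmetric per‑tensor INT8 quantization, and 2:4 pruning that zeroes the two smallest‑magnitude weights in every group of four (the pattern Ampere's sparse TensorCores accelerate):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 with a single symmetric scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude weights in each group of 4."""
    out = w.reshape(-1, 4).copy()
    idx = np.argsort(np.abs(out), axis=1)[:, :2]  # 2 smallest per group
    np.put_along_axis(out, idx, 0.0, axis=1)
    return out.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)

q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()   # bounded by scale/2
sparse = prune_2_4(w)

print(f"max quantization error: {err:.4f}")
print(f"zeros after 2:4 pruning: {np.count_nonzero(sparse == 0)} of {sparse.size}")
```

Production PTQ additionally calibrates activation ranges on real data, often per channel; the worked example only shows the core round‑and‑rescale arithmetic and the 2:4 mask shape.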

Case Study: OCR Optimization

OCR workloads require per‑frame processing and consume large amounts of compute. InferX introduced an ONNX parser to handle third‑party operators (e.g., deformable convolution) and implemented a CUDA‑based deformable‑convolution kernel with an NHWC layout and memory‑aligned matrix multiplication. Quantization and sparsity further reduced inference time, yielding an additional ~25% speedup. A custom nvJPEG‑based decoder lowered CPU usage and cut decode time to a quarter of the original CPU‑only path. At the service level, the team unified the video and live‑stream OCR pipelines, introduced priority handling, and raised GPU utilization to 80% while cutting the total GPU count by 63%.
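The NHWC layout choice mentioned above can be illustrated in NumPy (illustration only; the actual kernel is CUDA). NHWC keeps all channel values of a pixel contiguous in memory, which TensorCore‑oriented kernels generally prefer:

```python
import numpy as np

# NCHW -> NHWC layout change: same data, different memory order.
x_nchw = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))  # N, H, W, C

print(x_nchw.shape, "->", x_nhwc.shape)  # (2, 3, 4, 5) -> (2, 4, 5, 3)

# Same pixel (n=0, h=1, w=2), all channels — in NHWC these 3 values
# sit next to each other in memory; in NCHW they are strided apart.
print(x_nchw[0, :, 1, 2], "==", x_nhwc[0, 1, 2, :])
```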

Triton Model Service

To improve throughput and parallelism, the team evaluated open‑source serving stacks and selected NVIDIA Triton Inference Server. Triton provides multi‑framework model support, dynamic batching, Business Logic Scripting (BLS) orchestration, and rich metrics. Integrating InferX as a backend under Triton yields a combined stack that delivers low latency and high throughput.
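For reference, dynamic batching is enabled per model in Triton's `config.pbtxt`. The fragment below is a hypothetical example (model name, sizes, and delays are illustrative, not values from the article):

```protobuf
name: "ocr_recognizer"          # hypothetical model name
platform: "tensorrt_plan"       # e.g. a TensorRT engine
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}
instance_group [ { count: 2, kind: KIND_GPU } ]
```

`max_queue_delay_microseconds` trades a small added latency for the chance to form larger batches; tuning it against the service's latency SLO is the key knob.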

Performance Gains

- Triton‑based services achieve 3–8× higher throughput per instance and reduce GPU card count by ~50%.
- Adding InferX yields an additional 4–7× inference acceleration, pushing GPU utilization above 90% without additional hardware.
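The article reports the two gains separately. If one assumes they compose multiplicatively (a simplification; real workloads rarely compose perfectly), the combined range works out to:

```python
# Back-of-envelope only: assumes the two reported gains multiply.
triton_gain = (3, 8)   # per-instance throughput from Triton alone
inferx_gain = (4, 7)   # additional acceleration from InferX

combined = (triton_gain[0] * inferx_gain[0], triton_gain[1] * inferx_gain[1])
print(f"combined speedup: {combined[0]}x - {combined[1]}x")  # 12x - 56x
```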

Conclusion

The joint deployment of the InferX inference framework and the Triton model service dramatically improves resource efficiency, cuts costs, and maintains low response times, enabling rapid AI service rollout across diverse business scenarios.
