Tagged articles

TensorRT

34 articles · Page 1 of 1

Mar 1, 2026 · Artificial Intelligence

Automating Regression Tests for TensorRT Inference Services

The article outlines a comprehensive, repeatable regression testing framework for TensorRT inference pipelines, covering engine build validation, functional correctness against golden outputs, performance monitoring, common pitfalls, and CI/CD integration to ensure model updates remain both fast and reliable.

INT8 Quantization · MLOps · Performance Regression

0 likes · 12 min read

Sohu Tech Products

Dec 17, 2025 · Artificial Intelligence

How We Cut Vision Transformer Inference Latency from 53 ms to 8 ms

Facing 53.64 ms per‑image latency in a Flask‑served Vision Transformer classifier, we iteratively optimized the pipeline—switching to ONNX Runtime, leveraging TensorRT, replacing Pillow with OpenCV, eliminating URL downloads, and finally batching requests—reducing average server‑side processing to 8.34 ms, a 6.4× speedup.

Batching · Flask · ONNX

0 likes · 28 min read

Tencent Advertising Technology

Jul 17, 2025 · Artificial Intelligence

LEADRE: Knowledge‑Enhanced LLMs Supercharge Display Ad Recommendations

The paper introduces LEADRE, a multi‑faceted knowledge‑enhanced large language model‑driven display advertisement recommender that tackles user interest modeling, knowledge alignment, and low‑latency deployment, achieving significant GMV gains in Tencent’s ad platforms through innovative prompt engineering, semantic alignment, and TensorRT‑accelerated inference.

Knowledge Alignment · LLM · Prompt engineering

0 likes · 16 min read

Network Intelligence Research Center (NIRC)

Jul 2, 2025 · Artificial Intelligence

Optimizing Deep Learning Inference with TensorRT: A Practical Toolchain Walkthrough

This article walks through TensorRT's core optimization features, auxiliary debugging tools, and a step‑by‑step SMPLer‑X case study, showing how graph simplification, mixed‑precision, and engine generation cut inference latency to roughly 22‑29% of the original runtime.

GPU inference · ONNX · Polygraphy

0 likes · 6 min read

58 Tech

Apr 11, 2025 · Artificial Intelligence

Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization

This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference—including image pre‑processing acceleration, TensorRT integration for the ViT module, CUDA‑Graph replay, token‑count reduction, prefix‑cache handling, and weight quantization—demonstrating up to three‑fold throughput gains while maintaining accuracy.

CUDA Graph · Multimodal · Quantization

0 likes · 19 min read

Alibaba Cloud Developer

Nov 22, 2024 · Artificial Intelligence

Master YOLOv8: End-to-End Guide to Object Detection, Training, and Deployment

This comprehensive tutorial walks you through YOLOv8 object detection—from environment setup and dataset preparation to model training, validation, testing, and conversion to ONNX and TensorRT—providing clear commands, code snippets, and visual results for each step.

Model Training · ONNX · TensorRT

0 likes · 8 min read

Zhuanzhuan Tech

Oct 16, 2024 · Artificial Intelligence

Optimizing TorchServe Inference Service Architecture for High‑Performance AI Deployment

This article details the engineering practice of optimizing TorchServe‑based AI inference services, covering background challenges, framework selection, GPU‑accelerated Torch‑TRT integration, CPU‑side preprocessing improvements, and deployment on Kubernetes to achieve higher throughput and lower resource consumption.

GPUOptimization · ModelServing · PyTorch

0 likes · 17 min read

DataFunTalk

Jan 26, 2024 · Artificial Intelligence

Efficient Deployment of Speech AI Models on GPUs

This article explains how to efficiently deploy speech AI models—including ASR and TTS—on GPUs using NVIDIA's Triton Inference Server and TensorRT, covering background challenges, GPU‑based solutions, decoding optimizations, Whisper acceleration with TensorRT‑LLM, streaming TTS improvements, voice‑cloning pipelines, future plans, and a Q&A session.

ASR · GPU · TTS

0 likes · 20 min read

Rare Earth Juejin Tech Community

Jan 17, 2024 · Artificial Intelligence

Building a License Plate Recognition Service with C++, TensorRT, and Go

This article details how to train a YOLOv8‑pose model for license‑plate detection, convert it to a TensorRT engine, implement C++ inference and preprocessing, expose the functionality to Go via CGO, and assemble a lightweight web service for real‑time plate recognition.

C# · CGO · Go

0 likes · 12 min read

Alibaba Cloud Native

Dec 30, 2023 · Artificial Intelligence

How to Accelerate Stable Diffusion with TensorRT on Alibaba Cloud ACK

This guide explains how to set up Alibaba Cloud's ACK environment, install the Cloud Native AI Suite, configure TensorRT, and run Stable Diffusion with dramatically reduced latency and memory usage, including detailed commands, performance metrics, and reproducible code snippets.

AI acceleration · GPU inference · Stable Diffusion

0 likes · 7 min read

NetEase Media Technology Team

Aug 9, 2023 · Artificial Intelligence

GPU Model Inference Optimization Practices in NetEase News Recommendation System

The article outlines practical GPU inference optimization for NetEase’s news recommendation, covering model analysis with Netron, multi‑GPU parallelism, memory‑copy reduction, batch sizing, TensorRT conversion and tuning, custom plugins, and the GRPS serving framework to achieve significant latency and utilization gains.

GPU inference · Model Optimization · Profiling

0 likes · 44 min read

DataFunSummit

Apr 18, 2023 · Artificial Intelligence

Best Practices for Deploying Speech AI on GPUs with Triton and TensorRT

This article presents comprehensive best‑practice guidelines for deploying conversational speech AI—including ASR and TTS pipelines—on GPU servers using NVIDIA Triton Inference Server and TensorRT, covering workflow overview, performance optimizations, streaming inference, and real‑world deployment tips.

ASR · Conversational AI · GPU deployment

0 likes · 14 min read

Kuaishou Large Model

Mar 31, 2023 · Artificial Intelligence

How Kuaishou Elevates Video Quality and AI Performance at NVIDIA GTC 2023

At NVIDIA GTC 2023, Kuaishou engineers unveiled cutting‑edge solutions ranging from video quality assessment and enhancement, 3D digital‑human live streaming, a custom TensorRT‑based performance framework, large‑scale recommendation model acceleration, to multimodal massive‑model deployment for short‑video scenarios.

Recommendation Systems · TensorRT · ai-optimization

0 likes · 9 min read

DeWu Technology

Mar 8, 2023 · Artificial Intelligence

Optimizing Python GPU Inference Services with CPU/GPU Process Separation and TensorRT

By isolating CPU preprocessing and post‑processing from GPU inference into separate processes and applying TensorRT’s FP16/INT8 optimizations, the custom Python framework boosts Python vision inference services from roughly 4.5 to 27.4 QPS—a 5‑10× speedup—while reducing GPU utilization and cost.

CPU-GPU Separation · CUDA · GPU inference

0 likes · 14 min read

Meituan Technology Team

Feb 9, 2023 · Backend Development

Efficient Deployment Architecture for Visual Inference Services: GPU Utilization Optimization

Meituan Visual's engineering team tackled the common low‑GPU‑utilization bottleneck in online inference services by splitting model structures and adopting micro‑service deployment, raising GPU usage from 40% to 100% and more than tripling QPS, and then generalized the approach for other GPU‑based services.

GPU · Microservices · Performance Optimization

0 likes · 21 min read

Bilibili Tech

Nov 8, 2022 · Industry Insights

BANG Engine: Multi‑Level Pipelines & GPU Acceleration for Faster Video Transcoding

To meet Bilibili’s demanding live and on‑demand video transcoding needs, the BANG engine combines a multi‑stage pipeline architecture, frame‑block and multi‑frame parallelism, SIMD‑based CPU acceleration, and TensorRT/TensorFlow GPU inference, offering configurable string‑based pipelines that dramatically increase throughput while simplifying integration.

Bilibili · GPU Acceleration · TensorRT

0 likes · 18 min read

Alimama Tech

Nov 2, 2022 · Artificial Intelligence

Optimizing GPU Utilization for Multimedia AI Services with high_service

The article presents high_service, a high‑performance inference framework that boosts GPU utilization in multimedia AI services by separating CPU‑heavy preprocessing from GPU inference, employing priority‑based auto‑scaling, multi‑tenant sharing, and TensorRT‑accelerated models to eliminate GIL bottlenecks, reduce waste, and adapt to fluctuating traffic, with future work targeting automated bottleneck detection and further CPU‑GPU offloading.

Auto Scaling · GPU Utilization · High-performance computing

0 likes · 19 min read

Alimama Tech

Oct 26, 2022 · Artificial Intelligence

GPU Utilization Analysis and Optimization for Alibaba's Intelligent Creative Video Service

The paper analyzes why Alimama’s intelligent creative video service suffers from low GPU utilization (Python GIL blocking, lack of kernel fusion, and serialized CUDA streams) and details service‑level changes (separate CPU/GPU processes, shared‑memory queues, priority scheduling) and operator‑level kernel‑fusion techniques (channels‑last layouts, custom pooling, TensorRT conversion) that raise utilization from ~30% to near 100% and boost throughput by 75%.

GPU Optimization · Python · TensorRT

0 likes · 20 min read

Meituan Technology Team

Sep 15, 2022 · Artificial Intelligence

YOLOv6 2.0: Enhanced Object Detection Models and Quantization Solutions

The new YOLOv6 2.0 release upgrades lightweight and medium‑large models with a CSPStackRep backbone, self‑distillation, and a custom quantization pipeline, delivering up to 869 FPS for the quantized YOLOv6‑S and achieving 49.5%/52.5% AP on COCO while halving training time.

COCO benchmark · CSPStackRep · Quantization

0 likes · 6 min read

Meituan Technology Team

Jul 6, 2022 · Artificial Intelligence

Engineering Practices for Large-Scale Deep Learning Models in Meituan Takeaway Advertising

The article details Meituan's engineering journey from small DNNs to hundred‑gigabyte deep learning models for food‑delivery ads, analyzing online latency and offline efficiency challenges and presenting distributed storage, CPU/GPU acceleration, OpenVINO, TensorRT, CodeGen, and data‑pipeline optimizations that dramatically improve throughput, memory usage, and sample‑building speed.

CPU acceleration · Distributed storage · GPU Acceleration

0 likes · 45 min read

Yiche Technology

Jan 27, 2022 · Backend Development

C++ Multithreaded Service Architecture for High‑Throughput AI Inference

The article explains how to design a C++‑based multithreaded service that uses Pthreads, channels, and TensorRT to parallelize deep‑learning inference tasks, thereby reducing latency and dramatically increasing throughput for AI applications such as facial‑recognition access control systems.

AI inference · C# · TensorRT

0 likes · 11 min read

58 Tech

Dec 21, 2021 · Artificial Intelligence

dl_inference: Open‑Source Deep Learning Inference Service with TensorRT and MKL Acceleration

dl_inference is an open‑source, production‑grade deep learning inference platform that supports TensorFlow, PyTorch and Caffe models, offering GPU and CPU deployment, TensorRT and MKL acceleration, multi‑node load balancing, and extensive Q&A on model conversion, hardware requirements, INT8 quantization, and performance gains.

CPU · GPU · MKL

0 likes · 8 min read

58 Tech

Dec 8, 2021 · Artificial Intelligence

dl_inference: A General Deep Learning Inference Service with TensorRT and Intel MKL Acceleration

The article introduces dl_inference, an open‑source deep learning inference platform that integrates TensorRT GPU acceleration, Intel MKL CPU optimization, and Caffe support, detailing its features, model conversion workflow, deployment steps, performance gains, and how developers can contribute.

Intel MKL · TensorRT · inference

0 likes · 12 min read

iQIYI Technical Product Team

Nov 5, 2021 · Artificial Intelligence

Accelerating 4K Video Super‑Resolution with TensorRT: iQIYI’s Optimization and Production Practices

iQIYI optimized a 4K video super-resolution model with TensorRT, using graph splitting, operator fusion, custom CUDA kernels, and INT8 quantization to achieve a tenfold speedup (≈180 ms per 1080p frame) and demonstrating the potential of deep customization for large‑scale production.

INT8 Quantization · Model Optimization · TensorRT

0 likes · 17 min read

TiPaiPai Technical Team

Jun 25, 2021 · Artificial Intelligence

Mastering TensorRT: Deploy Deep Learning Models Efficiently

This article introduces TensorRT, explains its deployment workflow from model training to engine generation, shows how to register custom operators for ONNX and create TensorRT plugins, and explores deformable convolution (DCN) implementation strategies for high‑performance AI inference.

AI inference · CUDA · Custom Operators

0 likes · 8 min read

360 Smart Cloud

Mar 4, 2021 · Artificial Intelligence

Optimizing BERT Online Service Deployment at 360 Search

This article describes the challenges of deploying a large BERT model as an online service for 360 Search and details engineering optimizations—including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling—that dramatically improve latency, throughput, and resource utilization.

BERT · FP16 quantization · GPU Optimization

0 likes · 12 min read

360 Tech Engineering

Mar 1, 2021 · Artificial Intelligence

Deploying BERT as an Online Service: Challenges and Optimizations at 360 Search

This article details the engineering challenges of serving a large BERT model in real‑time for 360 Search and describes a series of optimizations—including TensorRT‑based kernel fusion, model quantization, knowledge distillation, multi‑stream execution, caching, and dynamic sequence handling—that together achieve low latency, high throughput, and stable deployment on GPU clusters.

BERT · GPU · Model Optimization

0 likes · 10 min read

DataFunTalk

Jan 10, 2021 · Artificial Intelligence

Didi's Machine Translation System: Architecture, Techniques, and WMT2020 Competition Experience

This article presents a comprehensive overview of Didi's machine translation platform, covering its evolution from statistical to neural models, the Transformer architecture with relative position and larger FFN, data preparation, training strategies such as back‑translation and knowledge distillation, deployment optimizations with TensorRT, and the team's successful participation in the WMT2020 news translation task.

BLEU · Machine Translation · TensorRT

0 likes · 14 min read

58 Tech

Nov 20, 2020 · Artificial Intelligence

Evolution and Practice of the 58.com AI Algorithm Platform (WPAI)

The article details the development, architecture, and optimization of 58.com’s AI algorithm platform (WPAI), covering its background, overall design, large‑scale distributed machine learning, deep‑learning platform features, inference performance enhancements, GPU resource scheduling improvements, and future directions.

AI platform · GPU scheduling · Inference Optimization

0 likes · 15 min read

Didi Tech

Oct 27, 2020 · Artificial Intelligence

Didi's Machine Translation System: Architecture, Techniques, and WMT2020 Competition Experience

Didi's machine translation system combines a Transformer‑big architecture with relative position representations, enlarged feed‑forward networks, iterative back‑translation, knowledge distillation, and domain fine‑tuning, optimized with TensorRT for speed, achieving a BLEU score of 36.6 and third place in the WMT2020 Chinese‑to‑English news task.

BLEU · Machine Translation · TensorRT

0 likes · 15 min read

Zhengtong Technical Team

Aug 14, 2020 · Artificial Intelligence

ZTFace: A High‑Precision, Fast Face Recognition Algorithm

This article presents ZTFace, an end‑to‑end face recognition solution that integrates face detection, alignment, feature embedding, verification, anti‑spoofing and attribute recognition using deep learning, details its backbone networks, loss functions, training datasets, experimental results on WIDER FACE and LFW, and demonstrates acceleration with TensorRT.

TensorRT · ZTFace · computer vision

0 likes · 17 min read

iQIYI Technical Product Team

Jul 3, 2020 · Artificial Intelligence

Optimizing Video Inference Services for High GPU Utilization in AI Applications

By moving decoding, color conversion, preprocessing, inference, and re‑encoding entirely onto the GPU and enabling batch processing with flexible Python scripts, iQIYI’s video‑image enhancement service achieved ten‑fold throughput, over 90 % GPU utilization, and dramatically lower resource use, accelerating AI video inference deployment.

AI Deployment · DeepStream · GPU Optimization

0 likes · 14 min read

58 Tech

Nov 6, 2019 · Artificial Intelligence

TensorRT Acceleration and Integration Design for the 58 AI Platform (WPAI)

This article explains how the 58 AI platform leverages NVIDIA TensorRT to accelerate deep‑learning inference on GPUs, describes three integration approaches, details the TF‑TRT implementation and Kubernetes deployment, and presents performance gains for ResNet‑50 and OCR models.

AI platform · GPU inference · Kubernetes deployment

0 likes · 7 min read

HomeTech

Sep 4, 2019 · Artificial Intelligence

Accelerating TensorFlow Model Inference with NVIDIA TensorRT: Methods, Experiments, and Results

This article explains how to use NVIDIA TensorRT to accelerate TensorFlow model inference by describing TensorRT architecture, optimization techniques such as layer fusion and precision calibration, detailing the conversion of frozen_graph and saved_model formats, presenting experimental setup and performance comparisons, and summarizing the achieved speed‑up.

Model Optimization · TensorFlow · TensorRT

0 likes · 13 min read
