Tagged articles

GPU inference

21 articles · Page 1 of 1

May 5, 2026 · Artificial Intelligence

vLLM 0.20.1 Fixes Instability and Speed Issues for DeepSeek V4

The vLLM 0.20.1 patch, released shortly after 0.20.0, consolidates stability fixes and performance optimizations for DeepSeek V4, adds several bug fixes, updates installation instructions, and provides targeted upgrade recommendations for different user scenarios.

Bug Fix · DeepSeek-V4 · GPU inference

0 likes · 9 min read

Old Zhang's AI Learning

Mar 12, 2026 · Artificial Intelligence

Distilling Claude Opus 4.6 into Qwen3.5‑27B: High‑Quality Reasoning on a Single RTX 3090

The article details how Claude Opus 4.6's chain‑of‑thought data were used to distill the 27‑billion‑parameter Qwen3.5‑27B model with Unsloth and LoRA, achieving full‑context inference on a single RTX 3090/4090, while outlining performance numbers, hyper‑parameter tips, benchmark gains and the trade‑offs of losing multimodal abilities.

Claude Opus 4.6 · GPU inference · LoRA

0 likes · 7 min read

Old Zhang's AI Learning

Jan 30, 2026 · Artificial Intelligence

PaddleOCR‑VL‑1.5: 0.9B Model Beats Billion‑Parameter OCR Models with 94.5% Accuracy

PaddleOCR‑VL‑1.5, the latest Baidu release, uses only 0.9B parameters to achieve 94.5% accuracy on OmniDocBench v1.5, surpassing larger open‑source and commercial OCR models, while offering multi‑task, multi‑language support, lightweight deployment, and detailed performance benchmarks.

DeepSeek-OCR · GPU inference · OCR

0 likes · 9 min read

58 Tech

Jan 6, 2026 · Artificial Intelligence

How vLLM 0.8.4 Implements Multi‑LoRA for Efficient Large‑Model Inference

This article provides a step‑by‑step technical walkthrough of vLLM 0.8.4 on a single GPU, detailing the platform’s startup, model loading, Multi‑LoRA deployment, internal ZMQ communication, request scheduling, and inference execution, while exposing key source‑code snippets and architectural diagrams.

GPU inference · LoRA adapters · Multi-LoRA

0 likes · 35 min read

Past Memory Big Data

Dec 9, 2025 · Artificial Intelligence

A Decade of Evolution: Inside Pinterest’s AI Platform Journey

Over ten years, Pinterest transformed a fragmented machine‑learning stack into a unified AI platform, iterating through stages from early ad‑hoc pipelines to scalable GPU‑accelerated services, while learning that timing, organizational alignment, and efficiency are crucial for lasting impact.

AI platform · GPU inference · ML Ops

0 likes · 25 min read

Alibaba Cloud Observability

Oct 20, 2025 · Artificial Intelligence

How We Boosted Embedding Throughput 16× and Cut Vector Index Costs in a Cloud‑Native Setup

This article examines the high cost and low throughput of generating embedding vectors in log‑processing scenarios, analyzes the performance bottlenecks of inference frameworks, and details a series of cloud‑native optimizations (switching to vLLM, deploying multiple model replicas with Triton, decoupling tokenization, and priority queuing) that together raise throughput by 16× and reduce per‑token pricing by two orders of magnitude.

Embedding · GPU inference · Performance Optimization

0 likes · 9 min read

Instant Consumer Technology Team

Jul 11, 2025 · Artificial Intelligence

Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained

A recent migration of a multimodal image inference system from an internal network to a cloud environment revealed that NVLink bridges dramatically improve multi‑GPU inference speed by reducing inter‑GPU communication overhead, while tensor‑parallel and data‑parallel strategies each have distinct trade‑offs for model deployment.

AI performance · Data Parallel · GPU inference

0 likes · 11 min read

Network Intelligence Research Center (NIRC)

Jul 2, 2025 · Artificial Intelligence

Optimizing Deep Learning Inference with TensorRT: A Practical Toolchain Walkthrough

This article walks through TensorRT's core optimization features, auxiliary debugging tools, and a step‑by‑step SMPLer‑X case study, showing how graph simplification, mixed‑precision, and engine generation cut inference latency to roughly 22‑29% of the original runtime.

GPU inference · ONNX · Polygraphy

0 likes · 6 min read

Data Thinking Notes

Feb 20, 2025 · Artificial Intelligence

How to Deploy DeepSeek R1 671B Model Locally with Ollama: A Step‑by‑Step Guide

This article provides a comprehensive tutorial on locally deploying the 671‑billion‑parameter DeepSeek R1 model using Ollama, covering model selection, hardware requirements, dynamic quantization, detailed installation steps, performance observations, and practical recommendations for consumer‑grade hardware.

AI model optimization · DeepSeek · Dynamic Quantization

0 likes · 14 min read

Baobao Algorithm Notes

Jan 14, 2025 · Industry Insights

Why NVLink Supercharges Llama 3 70B Inference: A Deep Performance Breakdown

An in‑depth analysis shows that NVLink 3.0 reduces all‑reduce communication latency for Llama 3 70B inference from over 1.8 seconds to under 100 ms, delivering a dramatic speedup compared with PCIe 4.0 and highlighting the critical role of high‑bandwidth interconnects in large‑model deployments.

All-reduce · GPU inference · Llama 3

0 likes · 5 min read

JD Cloud Developers

Mar 14, 2024 · Artificial Intelligence

How JD Retail Boosted Online Recommendation Inference with Distributed Heterogeneous Computing

This article details JD Retail's ad‑tech team's deep‑compute optimizations—including a distributed graph‑based heterogeneous framework, GPU‑focused inference engine enhancements, TensorBatch request aggregation, deep‑learning compiler bucket pre‑compilation, asynchronous compilation, and multi‑stream GPU processing—to overcome high‑concurrency, low‑latency online recommendation challenges.

Deep Learning Compiler · Distributed Computing · GPU inference

0 likes · 14 min read

Alibaba Cloud Native

Dec 30, 2023 · Artificial Intelligence

How to Accelerate Stable Diffusion with TensorRT on Alibaba Cloud ACK

This guide explains how to set up Alibaba Cloud's ACK environment, install the Cloud Native AI Suite, configure TensorRT, and run Stable Diffusion with dramatically reduced latency and memory usage, including detailed commands, performance metrics, and reproducible code snippets.

AI acceleration · GPU inference · Stable Diffusion

0 likes · 7 min read

Baidu Geek Talk

Nov 9, 2023 · Artificial Intelligence

Deep Learning Model Architecture Evolution in Baidu Search

The article chronicles Baidu Search’s Model Architecture Group’s evolution of deep‑learning‑driven search, detailing the shift from inverted‑index to semantic vector indexing, the use of transformer‑based models for text and image queries, large‑scale offline/online pipelines, and extensive GPU‑centric optimizations such as pruning, quantization and distillation, all aimed at delivering precise, cost‑effective results to hundreds of millions of users.

ERNIE · GPU inference · Model Optimization

0 likes · 14 min read

Meituan Technology Team

Oct 11, 2023 · Artificial Intelligence

Meituan Vision AI Research Highlights and Open‑Source Releases

This article compiles Meituan's cutting‑edge computer‑vision research and engineering achievements—including CVPR award‑winning segmentation, YOLOv6 releases, GPU inference optimizations, the Food2K dataset, and numerous paper digests—to provide practical insights for visual AI practitioners.

CVPR · Food2K · GPU inference

0 likes · 11 min read

NetEase Media Technology Team

Aug 9, 2023 · Artificial Intelligence

GPU Model Inference Optimization Practices in NetEase News Recommendation System

The article outlines practical GPU inference optimization for NetEase’s news recommendation, covering model analysis with Netron, multi‑GPU parallelism, memory‑copy reduction, batch sizing, TensorRT conversion and tuning, custom plugins, and the GRPS serving framework to achieve significant latency and utilization gains.

GPU inference · Model Optimization · Profiling

0 likes · 44 min read

Alibaba Cloud Big Data AI Platform

Apr 11, 2023 · Artificial Intelligence

How DeepRec Boosted Sparse Model Training and Inference for Large‑Scale Recommendations

This article details how the MetaApp advertising team adopted Alibaba Cloud's open‑source DeepRec to overcome parameter‑server bottlenecks, compress terabyte‑scale embeddings, leverage GPU‑accelerated distributed training, and build a low‑maintenance, high‑performance inference service using DeepRec's Processor and oneDNN optimizations.

DeepRec · EmbeddingVariable · GPU inference

0 likes · 13 min read

DeWu Technology

Mar 8, 2023 · Artificial Intelligence

Optimizing Python GPU Inference Services with CPU/GPU Process Separation and TensorRT

By isolating CPU preprocessing and post‑processing from GPU inference into separate processes and applying TensorRT’s FP16/INT8 optimizations, the custom framework raises the throughput of Python vision inference services from roughly 4.5 to 27.4 QPS, a 5‑10× speedup, while reducing GPU utilization and cost.

CPU-GPU Separation · CUDA · GPU inference

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Dec 15, 2022 · Artificial Intelligence

Vivo’s DeepRec: Dynamic Embedding and GPU Tricks that Raised CTR by 1.2%

Vivo’s AI recommendation team leveraged Alibaba’s DeepRec engine—introducing dynamic Embedding Variables, feature admission/elimination, Parquet datasets, and advanced CPU/GPU inference optimizations such as SessionGroup, device placement, multi‑stream and BladeDISC compilation—resulting in notable gains in model accuracy, latency reduction, and resource efficiency.

DeepRec · GPU inference · Recommendation Systems

0 likes · 13 min read

DataFunSummit

Nov 3, 2022 · Artificial Intelligence

Applying NVIDIA MPS to Boost GPU Utilization for Recommendation Inference

This article explains why traditional CPU inference and naïve GPU usage are inefficient for recommendation workloads, introduces NVIDIA Multi‑Process Service (MPS) technology, describes Vivo's custom Rust‑based inference engine and deployment strategies, and presents performance and cost benefits along with practical deployment considerations.

GPU inference · MPS · Recommendation Systems

0 likes · 13 min read

DataFunTalk

Feb 14, 2021 · Artificial Intelligence

TurboTransformers: An Efficient GPU Serving System for Transformer Models

TurboTransformers introduces a suite of GPU‑centric optimizations—including a high‑throughput batch reduction algorithm, a variable‑length‑aware memory allocator, and a dynamic‑programming‑based batch scheduling strategy—that together deliver significantly lower latency and higher throughput for Transformer‑based NLP services compared with existing frameworks such as PyTorch, TensorFlow, ONNX Runtime and TensorRT.

BERT · Dynamic Batching · GPU inference

0 likes · 13 min read

58 Tech

Nov 6, 2019 · Artificial Intelligence

TensorRT Acceleration and Integration Design for the 58 AI Platform (WPAI)

This article explains how the 58 AI platform leverages NVIDIA TensorRT to accelerate deep‑learning inference on GPUs, describes three integration approaches, details the TF‑TRT implementation and Kubernetes deployment, and presents performance gains for ResNet‑50 and OCR models.

AI platform · GPU inference · Kubernetes deployment

0 likes · 7 min read
