Tagged articles
20 articles
Old Zhang's AI Learning
May 5, 2026 · Artificial Intelligence

vLLM 0.20.1 Fixes Instability and Speed Issues for DeepSeek V4

The vLLM 0.20.1 patch, released shortly after 0.20.0, consolidates stability fixes and performance optimizations for DeepSeek V4, adds several bug fixes, updates installation instructions, and provides targeted upgrade recommendations for different user scenarios.

DeepSeek-V4 · GPU inference · Model Deployment
9 min read
Old Zhang's AI Learning
Mar 12, 2026 · Artificial Intelligence

Distilling Claude Opus 4.6 into Qwen3.5‑27B: High‑Quality Reasoning on a Single RTX 3090

The article details how chain‑of‑thought data from Claude Opus 4.6 was distilled into the 27‑billion‑parameter Qwen3.5‑27B model with Unsloth and LoRA, enabling full‑context inference on a single RTX 3090/4090. It also covers performance numbers, hyper‑parameter tips, benchmark gains, and the trade‑off of losing multimodal abilities.

Claude Opus 4.6 · GPU inference · LoRA
7 min read
58 Tech
Jan 6, 2026 · Artificial Intelligence

How vLLM 0.8.4 Implements Multi‑LoRA for Efficient Large‑Model Inference

This article provides a step‑by‑step technical walkthrough of vLLM 0.8.4 on a single GPU, detailing the platform’s startup, model loading, Multi‑LoRA deployment, internal ZMQ communication, request scheduling, and inference execution, while exposing key source‑code snippets and architectural diagrams.

GPU inference · LoRA adapters · Model Serving
35 min read
Alibaba Cloud Observability
Oct 20, 2025 · Artificial Intelligence

How We Boosted Embedding Throughput 16× and Cut Vector Index Costs in a Cloud‑Native Setup

This article examines the high cost and low throughput of vector embedding in log‑processing scenarios, analyzes the performance bottlenecks of inference frameworks, and details a series of cloud‑native optimizations: switching to vLLM, deploying multiple model replicas with Triton, decoupling tokenization, and priority queuing. Together these raise throughput by 16× and reduce per‑token pricing by two orders of magnitude.

Embedding · GPU inference · Performance Optimization
9 min read
Instant Consumer Technology Team
Jul 11, 2025 · Artificial Intelligence

Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained

A recent migration of a multimodal image inference system from an internal network to a cloud environment revealed that NVLink bridges dramatically improve multi‑GPU inference speed by reducing inter‑GPU communication overhead, while tensor‑parallel and data‑parallel strategies each have distinct trade‑offs for model deployment.

AI Performance · Data Parallel · GPU inference
11 min read
Data Thinking Notes
Feb 20, 2025 · Artificial Intelligence

How to Deploy DeepSeek R1 671B Model Locally with Ollama: A Step‑by‑Step Guide

This article provides a comprehensive tutorial on locally deploying the 671‑billion‑parameter DeepSeek R1 model using Ollama, covering model selection, hardware requirements, dynamic quantization, detailed installation steps, performance observations, and practical recommendations for consumer‑grade hardware.

AI model optimization · DeepSeek · Dynamic Quantization
14 min read
JD Cloud Developers
Mar 14, 2024 · Artificial Intelligence

How JD Retail Boosted Online Recommendation Inference with Distributed Heterogeneous Computing

This article details the deep‑compute optimizations made by JD Retail's ad‑tech team to meet the high‑concurrency, low‑latency demands of online recommendation: a distributed graph‑based heterogeneous framework, GPU‑focused inference engine enhancements, TensorBatch request aggregation, deep‑learning compiler bucket pre‑compilation, asynchronous compilation, and multi‑stream GPU processing.

Deep Learning Compiler · GPU inference · distributed computing
14 min read
Alibaba Cloud Native
Dec 30, 2023 · Artificial Intelligence

How to Accelerate Stable Diffusion with TensorRT on Alibaba Cloud ACK

This guide explains how to set up Alibaba Cloud's ACK environment, install the Cloud Native AI Suite, configure TensorRT, and run Stable Diffusion with dramatically reduced latency and memory usage, including detailed commands, performance metrics, and reproducible code snippets.

AI acceleration · GPU inference · Stable Diffusion
7 min read
Baidu Geek Talk
Nov 9, 2023 · Artificial Intelligence

Deep Learning Model Architecture Evolution in Baidu Search

The article chronicles Baidu Search’s Model Architecture Group’s evolution of deep‑learning‑driven search, detailing the shift from inverted‑index to semantic vector indexing, the use of transformer‑based models for text and image queries, large‑scale offline/online pipelines, and extensive GPU‑centric optimizations such as pruning, quantization and distillation, all aimed at delivering precise, cost‑effective results to hundreds of millions of users.

Ernie · GPU inference · Model Optimization
14 min read
Meituan Technology Team
Oct 11, 2023 · Artificial Intelligence

Meituan Vision AI Research Highlights and Open‑Source Releases

This article compiles Meituan's cutting‑edge computer‑vision research and engineering achievements—including CVPR award‑winning segmentation, YOLOv6 releases, GPU inference optimizations, the Food2K dataset, and numerous paper digests—to provide practical insights for visual AI practitioners.

CVPR · Computer Vision · Deep Learning
11 min read
NetEase Media Technology Team
Aug 9, 2023 · Artificial Intelligence

GPU Model Inference Optimization Practices in NetEase News Recommendation System

The article outlines practical GPU inference optimization for NetEase’s news recommendation, covering model analysis with Netron, multi‑GPU parallelism, memory‑copy reduction, batch sizing, TensorRT conversion and tuning, custom plugins, and the GRPS serving framework to achieve significant latency and utilization gains.

GPU inference · Model Optimization · Profiling
44 min read
Alibaba Cloud Big Data AI Platform
Apr 11, 2023 · Artificial Intelligence

How DeepRec Boosted Sparse Model Training and Inference for Large‑Scale Recommendations

This article details how the metaapp advertising team adopted Alibaba Cloud's open‑source DeepRec to overcome parameter‑server bottlenecks, compress terabyte‑scale embeddings, leverage GPU‑accelerated distributed training, and build a low‑maintenance, high‑performance inference service using DeepRec's Processor and oneDNN optimizations.

DeepRec · Distributed Training · EmbeddingVariable
13 min read
DeWu Technology
Mar 8, 2023 · Artificial Intelligence

Optimizing Python GPU Inference Services with CPU/GPU Process Separation and TensorRT

By isolating CPU preprocessing and post‑processing from GPU inference into separate processes and applying TensorRT’s FP16/INT8 optimizations, the custom Python framework boosts Python vision inference services from roughly 4.5 to 27.4 QPS—a 5‑10× speedup—while reducing GPU utilization and cost.

CPU-GPU Separation · CUDA · GPU inference
14 min read
Alibaba Cloud Big Data AI Platform
Dec 15, 2022 · Artificial Intelligence

Vivo’s DeepRec: Dynamic Embedding and GPU Tricks that Raised CTR by 1.2%

Vivo’s AI recommendation team leveraged Alibaba’s DeepRec engine—introducing dynamic Embedding Variables, feature admission/elimination, Parquet datasets, and advanced CPU/GPU inference optimizations such as SessionGroup, device placement, multi‑stream and BladeDISC compilation—resulting in notable gains in model accuracy, latency reduction, and resource efficiency.

DeepRec · GPU inference · Recommendation Systems
13 min read
DataFunSummit
Nov 3, 2022 · Artificial Intelligence

Applying NVIDIA MPS to Boost GPU Utilization for Recommendation Inference

This article explains why traditional CPU inference and naïve GPU usage are inefficient for recommendation workloads, introduces NVIDIA Multi‑Process Service (MPS) technology, describes Vivo's custom Rust‑based inference engine and deployment strategies, and presents performance and cost benefits along with practical deployment considerations.

GPU inference · Kubernetes · MPS
13 min read
DataFunTalk
Feb 14, 2021 · Artificial Intelligence

TurboTransformers: An Efficient GPU Serving System for Transformer Models

TurboTransformers introduces a suite of GPU‑centric optimizations—including a high‑throughput batch reduction algorithm, a variable‑length‑aware memory allocator, and a dynamic‑programming‑based batch scheduling strategy—that together deliver significantly lower latency and higher throughput for Transformer‑based NLP services compared with existing frameworks such as PyTorch, TensorFlow, ONNX Runtime and TensorRT.

BERT · Dynamic Batching · GPU inference
13 min read
58 Tech
Nov 6, 2019 · Artificial Intelligence

TensorRT Acceleration and Integration Design for the 58 AI Platform (WPAI)

This article explains how the 58 AI platform leverages NVIDIA TensorRT to accelerate deep‑learning inference on GPUs, describes three integration approaches, details the TF‑TRT implementation and Kubernetes deployment, and presents performance gains for ResNet‑50 and OCR models.

AI Platform · GPU inference · Kubernetes deployment
7 min read