Tagged articles
10 articles
Baidu Intelligent Cloud Tech Hub
Mar 23, 2026 · Artificial Intelligence

How vLLM‑Kunlun Unlocks Peak LLM Performance on Kunlun XPU

This article details the technical challenges of adapting the open‑source vLLM inference framework to Baidu's Kunlun XPU, outlines four major performance bottlenecks, and presents a multi‑dimensional optimization roadmap—including custom plugins, operator fusion, INT8 quantization, and CUDA‑Graph techniques—that together boost throughput by up to 8% and narrow the gap with leading GPU hardware.

CUDA Graph · INT8 Quantization · Kunlun XPU
13 min read
Bilibili Tech
Jan 21, 2025 · Artificial Intelligence

Accelerating Large Model Inference: Challenges and Multi‑Level Optimization Strategies

The article outlines how rapidly growing LLM sizes create compute, memory, and latency bottlenecks and proposes a full‑stack solution (operator fusion, high‑performance libraries, quantization, speculative decoding, sharding, continuous batching, PagedAttention, and specialized frameworks like MindIE‑LLM) to boost inference throughput and reduce latency, while highlighting future ultra‑low‑bit and heterogeneous hardware directions.

Continuous Batching · Hardware Optimization · Inference Acceleration
21 min read
Baidu Tech Salon
Aug 20, 2024 · Artificial Intelligence

PaddlePaddle Neural Network Compiler (CINN): Architecture, Optimization Techniques, and Performance

The PaddlePaddle Neural Network Compiler (CINN) combines a PIR‑based frontend and a hardware‑specific backend to apply graph‑level optimizations, operator fusion, schedule transformations and automatic tuning, delivering up to 4× faster kernels and 30‑60% overall speed‑ups for deep‑learning and scientific workloads.

CINN · GPU Optimization · Operator fusion
19 min read
Baidu Geek Talk
Jan 16, 2023 · Artificial Intelligence

Boosting Swin Transformer Speed: Profiling, Mixed Precision, and Kernel Fusion Secrets

This technical walkthrough explains how Swin Transformer training and inference can be dramatically accelerated on NVIDIA GPUs by using Nsight Systems profiling, mixed‑precision tensor‑core kernels, Apex‑based and custom CUDA operator fusion, half2 vectorization, register‑array caching, and INT8 quantization, achieving up to 2.85× training and 7.34× inference speedups while preserving model accuracy.

GPU performance · INT8 Quantization · Nsight Profiling
23 min read
Baidu Intelligent Cloud Tech Hub
Dec 29, 2022 · Artificial Intelligence

Boost Swin Transformer Speed: Profiling, Mixed Precision, and Operator Fusion Techniques

This article details how to use NVIDIA profiling tools, mixed‑precision training, operator fusion, kernel optimizations, and INT8 quantization to identify and eliminate performance bottlenecks in Swin Transformer models, achieving up to 2.85× training speedup and up to 7.34× inference acceleration on modern GPUs.

AI Performance · GPU Optimization · Operator fusion
23 min read
Baidu Intelligent Cloud Tech Hub
Dec 27, 2022 · Artificial Intelligence

How to Supercharge AI Inference: End‑to‑End Acceleration Strategies and Baidu’s AIAK‑Inference

This article presents a comprehensive analysis of AI inference bottlenecks, explores industry acceleration techniques such as model simplification, operator fusion, and single‑operator optimization, and details Baidu Cloud's AIAK‑Inference suite with practical demos showing up to 90% latency reduction.

AI inference · AIAK-Inference · Baidu Cloud
16 min read
DataFunSummit
Jun 14, 2022 · Artificial Intelligence

Practical Acceleration of Deep Model Inference: Case Studies and Optimization Techniques

This talk presents practical methods for accelerating deep model inference, detailing two case studies—text QA and speech QA—along with their technical challenges, and outlines optimization strategies such as model compression, multi‑operator fusion, matrix multiplication tuning, quantization, and dynamic batching.

Deep Learning · Dynamic Batching · Inference Acceleration
12 min read
DataFunTalk
Apr 22, 2022 · Artificial Intelligence

Inference Optimization Techniques and GPU Parallel Acceleration for Tencent Intelligent Dialogue Models

This article presents a comprehensive overview of inference optimization methods—including model pruning, quantization, knowledge distillation, caching, instruction‑set acceleration, and operator fusion—and details a GPU‑centric parallel acceleration methodology with CUDA basics, performance‑analysis tools, theoretical limits, and practical case studies, all illustrated with real‑world examples from Tencent's intelligent dialogue products.

GPU Acceleration · Operator fusion · caching
18 min read
Alibaba Cloud Developer
Jul 2, 2019 · Artificial Intelligence

How MNN Powers Mobile AI: Inside Alibaba’s Open‑Source Inference Engine

Alibaba’s MNN (Mobile Neural Network) engine, now open‑sourced on GitHub, is a lightweight on‑device deep‑learning inference framework. The article shows how it tackles device fragmentation and optimizes model conversion, scheduling, and execution across diverse hardware, delivering significant performance gains for mobile and IoT AI applications.

Inference Engine · MNN · Mobile AI
15 min read