Tagged articles
Lao Guo's Learning Space
May 12, 2026 · Artificial Intelligence

Which Inference Framework Maximizes Your GPU Performance in 2026?

This article compares six popular LLM inference frameworks—vLLM, TensorRT‑LLM, llama.cpp, ds4.c, Ollama, and Omlx—across performance, ease of use, and hardware compatibility, then provides a practical matrix to help users select the best fit for their GPU.

Apple Silicon · GPU performance · LLM inference
0 likes · 10 min read
Tencent Cloud Developer
Sep 26, 2025 · Fundamentals

Why GPUs Really Matter: From Architecture Basics to CUDA Programming

This article explains why GPUs have become the preferred platform for high‑performance computing, covering Dennard scaling, GPU speed advantages, theoretical FLOPS calculations, CUDA programming examples like SAXPY, the SIMT execution model, instruction pipelines, and modern techniques for handling branch divergence and register bank conflicts.

CUDA programming · GPU architecture · GPU performance
0 likes · 38 min read
Architects' Tech Alliance
Apr 3, 2025 · Artificial Intelligence

Which Nvidia GPU Wins the AI Race? A Deep Dive into A100, H100, A800, H800 & H20

This article examines the latest Nvidia GPU lineup (A100, H100, A800, H800, and the upcoming H20), covering their architectures, performance metrics for AI training and inference, and cost considerations, and provides a step‑by‑step guide for building a high‑performance compute center.

AI training · Compute cluster · GPU performance
0 likes · 11 min read
Baidu Intelligent Cloud Tech Hub
Jul 25, 2024 · Artificial Intelligence

How Transformers Work: From Tensor Basics to GPU Performance Analysis

This article provides a comprehensive, engineer‑focused breakdown of transformer architecture—including tensor fundamentals, matrix multiplication, GPU theoretical compute, attention and FFN mechanics, quantitative parameter and FLOP analysis, performance metrics like MFU, parallelism strategies, variant optimizations, and practical exercise questions—offering clear insight into large‑model efficiency and scaling.

FFN · GPU performance · Transformer
0 likes · 33 min read
Alibaba Cloud Infrastructure
Apr 24, 2024 · Artificial Intelligence

Evolution and Challenges of AI Infrastructure: Scaling Large Models on Cloud GPUs

In this talk from the 2024 China Generative AI Conference, Li Peng outlines the escalating computational demands of large‑model training and inference, identifies power, memory and communication walls, and presents Alibaba Cloud’s DeepGPU solutions and best‑practice strategies for scaling AI workloads on cloud GPUs.

DeepGPU · GPU performance · cloud computing
0 likes · 13 min read
Baidu Tech Salon
Sep 20, 2023 · Artificial Intelligence

Live Session: Introduction to NVIDIA Nsight Systems and Compute for AI Performance Analysis

In this live session, NVIDIA senior deep‑learning solutions architect Zhai Jian demonstrates how to use Nsight Systems and Nsight Compute to analyze a simple neural‑network training workload, accelerate BERT with mixed precision, and examine matrix‑transpose kernels. The announcement includes a detailed event schedule, and registration is via QR code.

AI tools · BERT · GPU performance
0 likes · 2 min read
Baidu Geek Talk
Jan 16, 2023 · Artificial Intelligence

Boosting Swin Transformer Speed: Profiling, Mixed Precision, and Kernel Fusion Secrets

This technical walkthrough explains how Swin Transformer training and inference can be dramatically accelerated on NVIDIA GPUs by using Nsight Systems profiling, mixed‑precision tensor‑core kernels, Apex‑based and custom CUDA operator fusion, half2 vectorization, register‑array caching, and INT8 quantization, achieving up to 2.85× training and 7.34× inference speedups while preserving model accuracy.

GPU performance · INT8 Quantization · Nsight Profiling
0 likes · 23 min read
DaTaobao Tech
Sep 7, 2022 · Artificial Intelligence

Online Deep Learning (ODL) Model Optimization for Real‑Time Recommendation

The team improved real‑time recommendation by redesigning its TensorFlow graphs with constant folding, a custom CallGraphOP cache, a simplified dense layer, and CUDA Graph compatibility. The changes raised single‑machine throughput by about 40%, lifted GPU utilization from 30% to 43%, cut latency, and saved roughly 30% of hardware resources.

CUDA Graph · GPU performance · Model Optimization
0 likes · 11 min read