Tagged articles

GPU performance

11 articles · Page 1 of 1

Jun 22, 2026 · Artificial Intelligence

How DiffusionGemma Shifts LLM Inference Bottleneck from Memory Bandwidth to Compute

DiffusionGemma is an experimental discrete text diffusion model built on the 26B MoE Gemma‑4 architecture. It generates whole 256‑token blocks with bidirectional attention, which moves the inference bottleneck from memory bandwidth to GPU compute and yields up to four‑fold speed gains on H100 and RTX 5090 GPUs, though with lower output quality than standard autoregressive models.

DiffusionGemma · GPU performance · LLM Inference

0 likes · 7 min read

Lao Guo's Learning Space

May 12, 2026 · Artificial Intelligence

Which Inference Framework Maximizes Your GPU Performance in 2026?

This article compares six popular LLM inference frameworks—vLLM, TensorRT‑LLM, llama.cpp, ds4.c, Ollama, and Omlx—across performance, ease of use, and hardware compatibility, then provides a practical matrix to help users select the best fit for their GPU.

Apple Silicon · GPU performance · LLM Inference

0 likes · 10 min read

HyperAI Super Neural

Feb 4, 2026 · Artificial Intelligence

Practical Experience: Optimizing Elementwise Operators on HyperAI Cloud Compute Platform

The article walks through a step‑by‑step optimization of a simple elementwise addition kernel (C = A + B) on HyperAI's RTX 5090 cloud instance, covering FP32 baseline, vectorized FP32, several FP16 variants, benchmark methodology, performance results, and the reasoning behind thread‑block sizing.

CUDA · Elementwise · FP16

0 likes · 30 min read

Tencent Cloud Developer

Sep 26, 2025 · Fundamentals

Why GPUs Really Matter: From Architecture Basics to CUDA Programming

This article explains why GPUs have become the preferred platform for high‑performance computing, covering Dennard scaling, GPU speed advantages, theoretical FLOPS calculations, CUDA programming examples like SAXPY, the SIMT execution model, instruction pipelines, and modern techniques for handling branch divergence and register bank conflicts.

CUDA programming · GPU architecture · GPU performance

0 likes · 38 min read

IT Services Circle

Sep 11, 2025 · Mobile Development

iPhone 17 Pro Benchmarks Reveal 15% CPU and 41% GPU Gains Over iPhone 16 Pro

Geekbench scores show the iPhone 17 Pro and Pro Max delivering a 15% single‑core and 22% multi‑core CPU boost plus a 41% GPU performance jump compared with the iPhone 16 Pro, while the new models also feature up to 12 GB of RAM and improved thermal design.

CPU performance · GPU performance · RAM

0 likes · 4 min read

Architects' Tech Alliance

Apr 3, 2025 · Artificial Intelligence

Which Nvidia GPU Wins the AI Race? A Deep Dive into A100, H100, A800, H800 & H20

This article examines the latest Nvidia GPU lineup (A100, H100, A800, H800, and the upcoming H20), details their architectures, performance metrics for AI training and inference, and cost considerations, and provides a step‑by‑step guide for building a high‑performance compute center.

AI training · Compute cluster · GPU performance

0 likes · 11 min read

Baidu Intelligent Cloud Tech Hub

Jul 25, 2024 · Artificial Intelligence

How Transformers Work: From Tensor Basics to GPU Performance Analysis

This article provides a comprehensive, engineer‑focused breakdown of transformer architecture—including tensor fundamentals, matrix multiplication, GPU theoretical compute, attention and FFN mechanics, quantitative parameter and FLOP analysis, performance metrics like MFU, parallelism strategies, variant optimizations, and practical exercise questions—offering clear insight into large‑model efficiency and scaling.

FFN · GPU performance · Transformer

0 likes · 33 min read

Alibaba Cloud Infrastructure

Apr 24, 2024 · Artificial Intelligence

Evolution and Challenges of AI Infrastructure: Scaling Large Models on Cloud GPUs

In this talk from the 2024 China Generative AI Conference, Li Peng outlines the escalating computational demands of large‑model training and inference, identifies power, memory and communication walls, and presents Alibaba Cloud’s DeepGPU solutions and best‑practice strategies for scaling AI workloads on cloud GPUs.

Cloud Computing · DeepGPU · GPU performance

0 likes · 13 min read

Baidu Tech Salon

Sep 20, 2023 · Artificial Intelligence

Live Session: Introduction to NVIDIA Nsight Systems and Compute for AI Performance Analysis

In a live session, NVIDIA senior deep‑learning solutions architect Zhai Jian demonstrates how to use Nsight Systems and Nsight Compute to analyze a simple neural‑network training workload, accelerate BERT with mixed precision, and examine matrix‑transpose kernels. The announcement includes a QR code for registration and a detailed event schedule.

AI tools · BERT · GPU performance

0 likes · 2 min read

Baidu Geek Talk

Jan 16, 2023 · Artificial Intelligence

Boosting Swin Transformer Speed: Profiling, Mixed Precision, and Kernel Fusion Secrets

This technical walkthrough explains how Swin Transformer training and inference can be dramatically accelerated on NVIDIA GPUs by using Nsight Systems profiling, mixed‑precision tensor‑core kernels, Apex‑based and custom CUDA operator fusion, half2 vectorization, register‑array caching, and INT8 quantization, achieving up to 2.85× training and 7.34× inference speedups while preserving model accuracy.

GPU performance · INT8 Quantization · Nsight Profiling

0 likes · 23 min read

DaTaobao Tech

Sep 7, 2022 · Artificial Intelligence

Online Deep Learning (ODL) Model Optimization for Real‑Time Recommendation

The team improved real‑time recommendation by redesigning its TensorFlow graphs with constant folding, a custom CallGraphOP cache, a simplified dense layer, and CUDA Graph compatibility. The changes boosted single‑machine throughput by about 40%, raised GPU utilization from 30% to 43%, cut latency, and saved roughly 30% of hardware resources.

CUDA Graph · GPU performance · Model Optimization

0 likes · 11 min read
