Tagged articles

GPU programming

10 articles

Dec 17, 2025 · Artificial Intelligence

Can cuTile’s Tile Paradigm Disrupt the GPU Programming Landscape and Challenge Triton?

The article analyzes NVIDIA's newly announced cuTile, a tile‑based Python DSL for GPU kernels, examining its technical differences from CUDA's SIMT model, its potential to reshape the GPU programming ecosystem, community reactions, competition with Triton, and the uncertain future that hinges on ecosystem maturity and migration tools.

AI workloads · CUDA · GPU programming

0 likes · 12 min read

Java Tech Enthusiast

Dec 8, 2025 · Artificial Intelligence

Explore CUDA Toolkit 13.1: CUDA Tile, Green Contexts, and Performance Boosts

NVIDIA's CUDA Toolkit 13.1 introduces the groundbreaking CUDA Tile programming model, green context support, enhanced math libraries, and numerous performance improvements for AI and GPU workloads, while also adding new developer tools, MPS features, and deterministic options for CUB.

CUDA · CUDA Tile · GPU programming

0 likes · 16 min read

Linux Kernel Journey

Oct 24, 2025 · Fundamentals

Mastering CUDA Function Type Annotations: A Complete Guide

This article provides a comprehensive overview of CUDA function type annotations—including __global__, __device__, __host__, combined annotations, and memory‑space qualifiers—explains their purposes, characteristics, and syntax, demonstrates practical examples, offers best‑practice guidelines, highlights common pitfalls, and introduces advanced topics such as dynamic parallelism and cooperative groups.

CUDA · GPU programming · device functions

0 likes · 14 min read

Network Intelligence Research Center (NIRC)

Jul 15, 2025 · Fundamentals

How to Write High‑Performance GPU Code with OpenAI Triton

This article introduces OpenAI's Triton language, compares its block‑wise programming model to traditional CUDA, walks through vector‑addition and fused‑softmax kernel implementations, and presents benchmark results that demonstrate significant speedups over native PyTorch operations.

CUDA · GPU programming · PyTorch

0 likes · 10 min read

Tencent Technical Engineering

Mar 21, 2025 · Fundamentals

Fundamentals of GPU Architecture and Programming

The article explains GPU fundamentals—from the end of Dennard scaling and why GPUs excel in parallel throughput, through CUDA programming basics like the SAXPY kernel and SIMT versus SIMD execution, to the evolution of the SIMT stack, modern scheduling, and a three‑step core architecture design.

CUDA · GPU · GPU programming

0 likes · 42 min read

Infra Learning Club

Jan 31, 2025 · Fundamentals

Essential CUDA Learning Guide: Basics, Compilation, and Profiling

This article walks through a practical APOD workflow for CUDA development—assessing bottlenecks, parallelizing with cuBLAS/cuFFT/Thrust, optimizing iteratively, and deploying—while covering nvcc compilation flags, PTX virtual ISA, nvprof profiling, core terminology (SP, SM, warp, grid, block, thread), indexing patterns, and unified memory references.

CUDA · CUDA terminology · GPU programming

0 likes · 8 min read

DeWu Technology

Jan 13, 2025 · Artificial Intelligence

Unlock GPU Power: A Hands‑On Triton Guide for Vector Add, Matrix Multiply & RoPE

This article introduces Triton—a Python‑based GPU programming language—covers essential GPU architecture, walks through practical kernels for vector addition, matrix multiplication, and rotary position encoding, compares performance with PyTorch, and provides debugging tips for high‑performance deep‑learning workloads.

CUDA · GPU programming · Performance Optimization

0 likes · 22 min read

OPPO Kernel Craftsman

Aug 11, 2023 · Game Development

FidelityFX Super Resolution 1.0: Technical Analysis and Implementation

The article delivers an in‑depth technical dissection of AMD’s FidelityFX Super Resolution 1.0, detailing the EASU spatial upscaling pipeline—its Lanczos2‑based polynomial fitting, 12‑point sampling, gradient calculations, and edge handling—and the RCAS contrast‑adaptive sharpening stage, while also outlining mobile‑friendly optimizations such as half‑precision arithmetic and reduced texture fetches.

EASU · FSR 1.0 · GPU programming

0 likes · 6 min read

Zhengcaiyun Technology (政采云技术)

Aug 10, 2021 · Frontend Development

WebGL Concepts and Fundamentals

This article introduces WebGL, covering its definition, history, basic concepts, and working principles, with practical examples of drawing shapes using both the native WebGL API and the Three.js framework.

3D graphics · 3D web development · Browser graphics

0 likes · 17 min read

Tencent Music Tech Team

Apr 30, 2020 · Mobile Development

Edge Deep Learning Inference on Mobile Devices: Challenges, Hardware Diversity, and Optimization Strategies

Edge deep learning inference on mobile devices faces hardware and software fragmentation across diverse CPUs, GPUs, DSPs, and NPUs, along with limited programmability. Optimization techniques such as model selection, quantization, and architecture‑specific tuning enable real‑time performance. Most inference still runs on CPUs, GPUs offer 5–10× speedups, and co‑processor support varies across Android and iOS.

DSP · GPU programming · NPU

0 likes · 17 min read
