Tagged articles
10 articles
HyperAI Super Neural
Dec 17, 2025 · Artificial Intelligence

Can cuTile’s Tile Paradigm Disrupt the GPU Programming Landscape and Challenge Triton?

The article analyzes NVIDIA's newly announced cuTile, a tile‑based Python DSL for GPU kernels, examining its technical differences from CUDA's SIMT model, its potential to reshape the GPU programming ecosystem, community reactions, competition with Triton, and the uncertain future that hinges on ecosystem maturity and migration tools.

AI workloads · CUDA · GPU programming
12 min read
Linux Kernel Journey
Oct 24, 2025 · Fundamentals

Mastering CUDA Function Type Annotations: A Complete Guide

This article provides a comprehensive overview of CUDA function type annotations—including __global__, __device__, __host__, combined annotations, and memory‑space qualifiers—explains their purposes, characteristics, and syntax, demonstrates practical examples, offers best‑practice guidelines, highlights common pitfalls, and introduces advanced topics such as dynamic parallelism and cooperative groups.

CUDA · GPU programming · device functions
14 min read
Tencent Technical Engineering
Mar 21, 2025 · Fundamentals

Fundamentals of GPU Architecture and Programming

The article explains GPU fundamentals—from the end of Dennard scaling and why GPUs excel in parallel throughput, through CUDA programming basics like the SAXPY kernel and SIMT versus SIMD execution, to the evolution of the SIMT stack, modern scheduling, and a three‑step core architecture design.

CUDA · GPU · GPU programming
42 min read
Infra Learning Club
Jan 31, 2025 · Fundamentals

Essential CUDA Learning Guide: Basics, Compilation, and Profiling

This article walks through a practical APOD workflow for CUDA development—assessing bottlenecks, parallelizing with cuBLAS/cuFFT/Thrust, optimizing iteratively, and deploying—while covering nvcc compilation flags, PTX virtual ISA, nvprof profiling, core terminology (SP, SM, warp, grid, block, thread), indexing patterns, and unified memory references.

CUDA · CUDA terminology · GPU programming
8 min read
DeWu Technology
Jan 13, 2025 · Artificial Intelligence

Unlock GPU Power: A Hands‑On Triton Guide for Vector Add, Matrix Multiply & RoPE

This article introduces Triton—a Python‑based GPU programming language—covers essential GPU architecture, walks through practical kernels for vector addition, matrix multiplication, and rotary position encoding, compares performance with PyTorch, and provides debugging tips for high‑performance deep‑learning workloads.

CUDA · Deep Learning · GPU programming
22 min read
OPPO Kernel Craftsman
Aug 11, 2023 · Game Development

FidelityFX Super Resolution 1.0: Technical Analysis and Implementation

The article delivers an in‑depth technical dissection of AMD’s FidelityFX Super Resolution 1.0, detailing the EASU spatial upscaling pipeline—its Lanczos2‑based polynomial fitting, 12‑point sampling, gradient calculations, and edge handling—and the RCAS contrast‑adaptive sharpening stage, while also outlining mobile‑friendly optimizations such as half‑precision arithmetic and reduced texture fetches.

EASU · FSR 1.0 · GPU programming
6 min read
政采云技术
Aug 10, 2021 · Frontend Development

WebGL Concepts and Fundamentals

This article introduces WebGL, covering its definition, history, basic concepts, and working principles, with practical examples of drawing shapes using both the native WebGL API and the Three.js framework.

3D graphics · 3D web development · Browser graphics
17 min read
Tencent Music Tech Team
Apr 30, 2020 · Mobile Development

Edge Deep Learning Inference on Mobile Devices: Challenges, Hardware Diversity, and Optimization Strategies

Edge deep learning inference on mobile devices faces hardware and software fragmentation, a diverse mix of CPUs, GPUs, DSPs, and NPUs, and limited programmability. Optimization techniques such as model selection, quantization, and architecture‑specific tuning enable real‑time performance; most inference runs on CPUs, GPUs offer 5–10× speedups, and co‑processor support varies across Android and iOS.

DSP · GPU programming · NPU
17 min read