Tagged articles
105 articles
Page 1 of 2
Machine Learning Algorithms & Natural Language Processing
May 20, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the ‘Attention Is All You Need’ Authors

The paper shows that applying lightweight L1 regularization can make over 99% of FFN activations zero, and by using a new tile‑wise ELLPACK (TwELL) format together with a hybrid routing scheme, inference speed improves up to 30% while memory usage drops over 24% and energy consumption is reduced, all with negligible impact on downstream task performance.

CUDA · GPU Optimization · Hybrid Routing
0 likes · 8 min read
Machine Learning Algorithms & Natural Language Processing
May 9, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the Original Authors

A new ICML 2026 paper by Sakana AI and NVIDIA shows that applying lightweight L1 regularization can make Feed‑Forward Network activations in Transformers over 99% sparse, and with the TwELL storage format and a hybrid routing scheme this sparsity translates into up to 20.5% inference speedup, 21.9% training‑step acceleration, lower energy consumption and reduced peak memory across 0.5‑2 B‑parameter models while preserving downstream performance.

CUDA · GPU Optimization · Hybrid Routing
0 likes · 9 min read
Machine Heart
May 2, 2026 · Industry Insights

Beyond CUDA: Nvidia’s Token Factory and Supply Chain Guard Its Moat from TPU

The article examines Nvidia’s competitive moat beyond CUDA, detailing how its token‑factory model, extensive supply‑chain commitments, and a flexible accelerator ecosystem contrast with Google’s TPU ASIC approach, while also exploring the impact of AI agents on future compute demand.

AI hardware · CUDA · Nvidia
0 likes · 7 min read
CodeTrend
Apr 19, 2026 · Artificial Intelligence

Understanding NVIDIA Jetpack: Design Framework, Architecture, and Flashing Process

This article explains NVIDIA Jetpack’s three‑layer architecture, its relationship with the SDK Manager installer, step‑by‑step flashing procedures for Jetson devices, common failure points such as the 35.29% stall, and practical troubleshooting and hybrid manual‑automatic solutions.

CUDA · Embedded AI · Flashing
0 likes · 11 min read
Machine Learning Algorithms & Natural Language Processing
Apr 15, 2026 · Artificial Intelligence

Industrial Code LLM Learns to Think Before Writing – InCoder-32B Thinking Tackles Verilog and CUDA Pitfalls

The article analyzes InCoder-32B Thinking, an industrial‑code large language model that incorporates error‑driven chain‑of‑thought and an Industrial Code World Model to predict execution outcomes, adapt reasoning depth, and achieve high accuracy across diverse hardware‑centric benchmarks.

CUDA · Verilog · error-driven chain of thought
0 likes · 7 min read
Machine Heart
Apr 14, 2026 · Artificial Intelligence

When Verilog and CUDA Fail: How Industrial Code Models Are Learning to Think Before They Write

The article analyzes InCoder-32B Thinking, an industrial code large model that integrates error‑driven chain‑of‑thought and a world‑model to predict real‑system outcomes, showing high accuracy on diverse benchmarks and demonstrating adaptive reasoning depth for tasks ranging from Verilog synthesis to CUDA kernel optimization.

AI · CUDA · Verilog
0 likes · 8 min read
SuanNi
Mar 29, 2026 · Artificial Intelligence

How an AI Agent Outperformed NVIDIA Engineers in 7‑Day GPU Kernel Optimization

This article analyzes the AVO system, an autonomous AI agent that replaces traditional evolutionary search pipelines to iteratively improve CUDA attention kernels on NVIDIA's Blackwell B200 GPU, achieving up to 10.5% higher throughput than hand‑tuned implementations after a week of nonstop optimization.

AI · CUDA · GPU Optimization
0 likes · 13 min read
AI Info Trend
Mar 24, 2026 · Industry Insights

NVIDIA’s DLSS 5 & CUDA Flywheel: Transforming AI in Gaming and Enterprise

The GTC 2026 keynote revealed NVIDIA’s latest DLSS 5 technology using 3‑D guided neural rendering to deliver cinematic‑quality graphics in real time, outlined a 20‑year CUDA ecosystem flywheel that fuels AI acceleration across structured and unstructured data, showcased enterprise case studies like Nestlé’s data‑refresh breakthrough, and highlighted a vast partner network, illustrating how AI is moving from experimental labs to everyday production.

AI · CUDA · DLSS
0 likes · 5 min read
Machine Learning Algorithms & Natural Language Processing
Mar 3, 2026 · Artificial Intelligence

How CUDA Agent Lets Anyone Write High‑Performance CUDA Kernels, Challenging Nvidia’s AI Moat

CUDA Agent, a large‑scale reinforcement‑learning system from ByteDance and Tsinghua, automatically generates and optimizes CUDA kernels that outperform torch.compile by up to 2× on simple kernels and run about 40% faster than proprietary models on the hardest benchmarks. The article also covers its data‑synthesis pipeline, training workflow, and current limitations.

CUDA · GPU Optimization · KernelBench
0 likes · 10 min read
AI Explorer
Mar 3, 2026 · Artificial Intelligence

ByteDance & Tsinghua Reveal AI‑Powered CUDA Agent for Self‑Evolving Kernels

ByteDance and Tsinghua University have created the CUDA Agent, an AI compiler that automatically writes and optimizes GPU kernels, delivering up to double the performance, and heralding a shift where AI‑generated low‑level code could reshape the hardware‑software competition landscape.

AI compiler · ByteDance · CUDA
0 likes · 6 min read
21CTO
Jan 26, 2026 · Artificial Intelligence

What’s New in PyTorch 2.10? Deep Dive into GPU and CUDA Enhancements

PyTorch 2.10 introduces extensive upgrades for AMD ROCm, Intel XPU, and NVIDIA CUDA, adds new Torch XPU APIs, expands Python 3.14 support, and brings performance‑focused improvements such as fused kernels and enhanced quantization, all available via the official GitHub release.

CUDA · Deep Learning · GPU
0 likes · 4 min read
HyperAI Super Neural
Dec 17, 2025 · Artificial Intelligence

Can cuTile’s Tile Paradigm Disrupt the GPU Programming Landscape and Challenge Triton?

The article analyzes NVIDIA's newly announced cuTile, a tile‑based Python DSL for GPU kernels, examining its technical differences from CUDA's SIMT model, its potential to reshape the GPU programming ecosystem, community reactions, competition with Triton, and the uncertain future that hinges on ecosystem maturity and migration tools.

AI workloads · CUDA · GPU programming
0 likes · 12 min read
AI2ML AI to Machine Learning
Dec 16, 2025 · Industry Insights

Why Computer Science Majors Must Embrace a Massive Paradigm Shift

The article argues that traditional storage‑centric computer science curricula are becoming obsolete as AI‑driven, compute‑centric paradigms dominate hardware, data‑center operations, and software ecosystems, urging universities and students to rapidly adopt new teaching focus and skills.

AI hardware · CUDA · associative memory
0 likes · 10 min read
Raymond Ops
Dec 16, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Setup to Production

This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA and Docker setup, native and containerized deployment methods, core parameter tuning, advanced sharding, dynamic monitoring, troubleshooting, production best practices, and a real‑world RTX 4090 case study.

AI inference · CUDA · GPU
0 likes · 15 min read
Linux Kernel Journey
Dec 7, 2025 · Fundamentals

CUDA Optimization Basics: Understanding GPU Architecture and Warp Scheduling

This article explains the fundamentals of CUDA performance tuning, covering GPU architectures from Kepler to Volta, the role of SMX, warp schedulers, registers and memory hierarchies, and provides practical guidance on launch configuration, latency hiding, and thread‑block sizing to maximize throughput.

CUDA · GPU architecture · Performance Optimization
0 likes · 21 min read
Linux Kernel Journey
Oct 24, 2025 · Fundamentals

Mastering CUDA Function Type Annotations: A Complete Guide

This article provides a comprehensive overview of CUDA function type annotations—including __global__, __device__, __host__, combined annotations, and memory‑space qualifiers—explains their purposes, characteristics, and syntax, demonstrates practical examples, offers best‑practice guidelines, highlights common pitfalls, and introduces advanced topics such as dynamic parallelism and cooperative groups.

CUDA · GPU programming · device functions
0 likes · 14 min read
Linux Kernel Journey
Oct 21, 2025 · Industry Insights

Bridging the GPU Observability Gap: Why eBPF on GPUs Matters

The article explains how bpftime extends eBPF to NVIDIA and AMD GPUs, exposing fine‑grained execution details that traditional CPU‑side tools miss, and demonstrates a unified, programmable observability stack that overcomes the limitations of existing GPU profilers in both synchronous and asynchronous workloads.

CUDA · GPU · Observability
0 likes · 23 min read
Programmer DD
Oct 12, 2025 · Backend Development

Boost Java Performance: Integrate CUDA GPU Acceleration via JNI

This guide explains why Java struggles with high‑performance or data‑intensive workloads, introduces GPU acceleration with CUDA, compares integration options such as JCuda, JNI, and JNA, walks through a practical encryption use case with performance benchmarks, and provides production‑grade best practices for memory, threading, testing, security, and deployment.

CUDA · GPU · High‑performance computing
0 likes · 23 min read
AI Cyberspace
Sep 28, 2025 · Artificial Intelligence

How to Set Up WSL2 GPU Acceleration and Profile CUDA on Windows 11

This guide walks through configuring Windows 11 with WSL2 and Ubuntu 22.04 for GPU‑accelerated CUDA development, installing NVIDIA drivers and CUDA libraries, setting up SSH and firewall rules, running a CUDA stress‑test program, and using Nsight Systems, Nsight Compute, and NVIDIA DCGM for performance profiling and monitoring.

CUDA · GPU · Linux
0 likes · 39 min read
Linux Kernel Journey
Sep 24, 2025 · Fundamentals

Fine-Grained GPU Code Modifications: Boosting CUDA Performance

This article explains why certain GPU performance gains require direct CUDA kernel edits and walks through fine‑grained techniques such as data‑layout restructuring, warp‑level primitives, tiled memory accesses, kernel fusion, and dynamic execution paths, backed by code examples and benchmark insights.

CUDA · GPU Optimization · dynamic execution
0 likes · 12 min read
Refining Core Development Skills
Sep 11, 2025 · Fundamentals

How Kepler Boosted GPU Performance: Architecture, Specs, and Compute Power

This article examines NVIDIA's Kepler GPU architecture, highlighting its 28 nm process, increased transistor count, expanded CUDA core count, PCIe 3.0 support, enhanced memory hierarchy, new compute units, scheduling improvements like Hyper‑Q, and performance metrics of the Tesla K20X, illustrating the substantial gains over previous generations.

CUDA · Compute · GPU
0 likes · 13 min read
Data STUDIO
Sep 8, 2025 · Artificial Intelligence

CuPy vs NumPy: Achieving Over 10× Speedup with GPU Acceleration

The article explains how replacing NumPy with the GPU‑compatible CuPy library can dramatically accelerate array computations, walks through installation prerequisites, demonstrates benchmark scripts showing up to ten‑fold speed improvements, discusses data type effects, custom kernels, and hybrid CPU‑GPU workflows for large‑scale data processing.

Benchmark · CUDA · CuPy
0 likes · 21 min read
Alibaba Cloud Developer
Sep 8, 2025 · Fundamentals

How to Profile GPU Kernels with PTX Probes: From CUDA Basics to Custom Instrumentation

This article walks through GPU performance analysis, starting with CUDA architecture fundamentals, demonstrating matrix multiplication optimization, explaining PTX assembly, and introducing the Neutrino framework for programmable GPU probes that enable fine‑grained, custom instrumentation and detailed timing measurements of kernel execution.

CUDA · GPU · Neutrino
0 likes · 45 min read
Refining Core Development Skills
Aug 7, 2025 · Fundamentals

Why NVIDIA’s First Data‑Center GPU Revolutionized Computing: Inside the Tesla G80 Architecture

This article explains how NVIDIA transitioned from gaming graphics cards to general‑purpose GPUs with the first data‑center Tesla GPU, detailing the unified shader architecture, the internal components of TPCs and SMs, CUDA 1.0 programming basics, and performance calculations that illustrate the massive computational advantage over contemporary CPUs.

CUDA · GPGPU · GPU architecture
0 likes · 23 min read
AI Cyberspace
Aug 4, 2025 · Artificial Intelligence

From Tesla to Hopper: How NVIDIA GPU Architectures Powered the AI Revolution

This article traces the evolution of NVIDIA GPU architectures—from the early Tesla series through Fermi, Kepler, Maxwell, Pascal, Volta, Turing, Ampere, Hopper, and up to the upcoming Blackwell—explaining their hardware innovations, CUDA programming model, and how each generation enabled breakthroughs in high‑performance computing, deep learning, and AI applications.

AI · CUDA · GPU
0 likes · 67 min read
MaGe Linux Operations
Jul 21, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Zero to Production

This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA setup, native and Docker deployment methods, detailed parameter tuning, advanced sharding strategies, troubleshooting, performance optimization, and production‑grade monitoring to maximize throughput and stability of large language models.

AI deployment · CUDA · Ollama
0 likes · 16 min read
Linux Kernel Journey
Jul 21, 2025 · Fundamentals

Mastering CUDA GPU Performance Analysis and Tracing

This guide walks you through a complete workflow for profiling CUDA applications, covering GPU performance fundamentals, key metrics, NVIDIA Nsight tools, CUPTI programming, example code, common bottlenecks, and best‑practice recommendations to identify and eliminate performance limits.

CUDA · CUPTI · GPU profiling
0 likes · 13 min read
Open Source Linux
Jul 16, 2025 · Artificial Intelligence

How Huawei’s New AI Chip Aims to Rival Nvidia and AMD GPUs

Huawei is developing a new AI‑focused GPU‑style chip that mirrors Nvidia and AMD architectures, aiming to ease Chinese developers’ shift from Nvidia hardware, but still faces software compatibility hurdles due to reliance on CUDA and ongoing U.S. export restrictions.

AI Chip · CUDA · Chip Design
0 likes · 3 min read
Architects' Tech Alliance
Jul 13, 2025 · Artificial Intelligence

How Huawei’s New AI Chip Aims to Rival Nvidia’s GPUs

Huawei is developing a new AI chip that functions more like a general‑purpose GPU, aiming to match Nvidia and AMD architectures and simplify the transition for Chinese AI developers, while still facing challenges such as adapting CUDA‑based software and overcoming export restrictions.

AI Chip · CUDA · GPU
0 likes · 3 min read
Tencent Technical Engineering
Jul 8, 2025 · Artificial Intelligence

Why GPUs Power Large‑Model Inference: From Graphics to GPGPU

This article explains how modern GPUs evolved from graphics rendering to general‑purpose computing, details the CPU‑GPU heterogeneous architecture, walks through a CUDA demo that adds two billion‑element arrays, compares CPU and GPU performance, and describes the compilation, loading, and execution pipeline of CUDA kernels.

AI inference · CUDA · GPGPU
0 likes · 33 min read
Tencent Cloud Developer
Jul 8, 2025 · Artificial Intelligence

How GPUs Power AI: From Graphics to GPGPU Explained

This article explores how GPUs evolved from graphics accelerators to general‑purpose processors for AI, detailing the CPU‑GPU heterogeneous architecture, the CUDA programming workflow, compilation into fat binaries, kernel launch mechanics, hardware components, and the differences between SIMD and SIMT models, with performance comparisons and code examples.

AI · CUDA · GPGPU
0 likes · 31 min read
JavaEdge
Jun 28, 2025 · Backend Development

How Java Developers Can Harness CUDA on NVIDIA A100 GPUs

This guide explains why Java architects should understand CUDA, describes the GPU programming model, compares CPU and GPU designs, and details three practical ways—JNI, JCuda, and TornadoVM—to integrate CUDA acceleration into Java applications, with tips for using A100 GPUs effectively.

A100 · CUDA · GPU
0 likes · 15 min read
Linux Kernel Journey
Jun 9, 2025 · Fundamentals

How to Trace CUDA GPU Operations with eBPF

This tutorial explains how to build an eBPF‑based tracing tool that intercepts CUDA runtime API calls via uprobes, captures detailed event data such as memory sizes, transfer directions, kernel launches and errors, and presents it in a readable format for debugging and performance analysis.

Benchmark · CUDA · GPU tracing
0 likes · 17 min read
Network Intelligence Research Center (NIRC)
Jun 9, 2025 · Artificial Intelligence

How to Build High‑Performance GEMM with NVIDIA CUTLASS

The article explains why standard GEMM libraries may fall short for special matrix shapes, introduces NVIDIA’s open‑source CUTLASS library, details its hierarchical tiling architecture, and walks through a complete device‑API example that customizes tile sizes and data layouts to achieve near‑hand‑written kernel performance on modern GPUs.

CUDA · CUTLASS · GEMM
0 likes · 6 min read
AI Algorithm Path
Jun 3, 2025 · Artificial Intelligence

Inside Tencent’s HunyuanVideo-Avatar: How Open‑Source AI Generates Digital Human Videos

Tencent’s HunyuanVideo-Avatar converts a static portrait and an audio clip into a lip‑synced, expressive video using a multimodal diffusion Transformer, offering open‑source weights, detailed module designs, hardware requirements, code examples, and a candid assessment of its strengths and current limitations.

AI video generation · CUDA · HunyuanVideo-Avatar
0 likes · 8 min read
Python Programming Learning Circle
Jun 2, 2025 · Artificial Intelligence

NVIDIA Adds Native Python Support to CUDA – What It Means for Developers

NVIDIA announced at GTC 2025 that CUDA will now natively support Python, allowing developers to write GPU‑accelerated code directly in Python without C/C++ knowledge, introducing new APIs, libraries, JIT compilation, performance tools, and a tile‑based programming model that aligns with Python’s array‑centric workflow.

AI · CUDA · GPU
0 likes · 7 min read
Infra Learning Club
Mar 23, 2025 · Artificial Intelligence

Getting Started with cuda‑python and an Introduction to cuTile

This article explains the cuda‑python ecosystem, including its core packages, installation via pip or conda, the experimental cuda.core API, a full Python‑to‑CUDA workflow with NVRTC compilation, a performance comparison with C++, and the APIs covered, and gives an overview of NVIDIA's new cuTile programming model.

CUDA · GPU · NVRTC
0 likes · 11 min read
Infra Learning Club
Mar 22, 2025 · Artificial Intelligence

How to Write CUDA Kernels in Python – Insights from Nvidia GTC 2025

The article reviews Nvidia GTC 2025’s session on writing CUDA kernels with Python, compares tools such as Numba, CuPy, PyTorch extensions and cuda‑python, demonstrates a segmented reduction example with C++ and Python code, explains the underlying CUDA concepts, and shows how to install and use cuda‑python to simplify kernel development.

CUDA · CuPy · GPU
0 likes · 10 min read
Tencent Technical Engineering
Mar 21, 2025 · Fundamentals

Fundamentals of GPU Architecture and Programming

The article explains GPU fundamentals—from the end of Dennard scaling and why GPUs excel in parallel throughput, through CUDA programming basics like the SAXPY kernel and SIMT versus SIMD execution, to the evolution of the SIMT stack, modern scheduling, and a three‑step core architecture design.

CUDA · GPU · GPU programming
0 likes · 42 min read
Infra Learning Club
Mar 18, 2025 · Fundamentals

Can You Direct a CUDA Kernel to a Specific SM?

The article explains CUDA’s architecture and SM basics, describes how the warp scheduler and dispatch units assign thread blocks to SMs, and concludes that external control cannot target a specific SM, while mentioning the NanoFlow intra‑device parallelism approach as a possible indirect optimization.

CUDA · GPU architecture · Kernel Scheduling
0 likes · 7 min read
AI Cyberspace
Mar 14, 2025 · Artificial Intelligence

How NCCL Accelerates Distributed AI Training on GPUs

This article explains the origins, core functions, installation steps, and programming examples of NVIDIA’s Collective Communication Library (NCCL), detailing its role in multi‑GPU and multi‑node AI distributed training, topology discovery, path selection, channel search, and various collective communication operations.

CUDA · GPU communication · MPI
0 likes · 33 min read
Infra Learning Club
Feb 23, 2025 · Fundamentals

How to Dynamically Decompress CUDA Fatbin Files Compressed by NVCC

This article explains why enabling NVCC's --fatbin-options -compress-all breaks remote GPU calls, describes the fatbin file layout, shows how to extract and analyze the binary with objcopy, and provides a step‑by‑step implementation of a decompression routine for both ELF and PTX sections.

Binary Format · CUDA · GPU
0 likes · 9 min read
Infra Learning Club
Feb 22, 2025 · Fundamentals

Understanding NVCC Compilation: A Step‑by‑Step Technical Guide

This article walks through the NVCC compilation pipeline, explaining how CUDA source files are transformed into host and device binaries, detailing file extensions, compilation stages, command‑line options, intermediate artifacts, and the role of registration functions such as __nv_cudaEntityRegisterCallback and __sti____cudaRegisterAll.

CUDA · Compilation · GPU
0 likes · 12 min read
Infra Learning Club
Jan 31, 2025 · Fundamentals

Essential CUDA Learning Guide: Basics, Compilation, and Profiling

This article walks through a practical APOD workflow for CUDA development—assessing bottlenecks, parallelizing with cuBLAS/cuFFT/Thrust, optimizing iteratively, and deploying—while covering nvcc compilation flags, PTX virtual ISA, nvprof profiling, core terminology (SP, SM, warp, grid, block, thread), indexing patterns, and unified memory references.

CUDA · CUDA terminology · GPU programming
0 likes · 8 min read
Infra Learning Club
Jan 24, 2025 · Fundamentals

Inside NVCC: How CUDA Code Is Compiled and Linked

The article dissects NVCC’s compilation pipeline, showing how internal registration functions from host_runtime.h are injected into the host binary, how a simple CUDA demo is processed with --dryrun, and how the generated fatbin, PTX, and cubin files are linked and registered for GPU execution.

CUDA · Compilation · FatBinary
0 likes · 10 min read
Infra Learning Club
Jan 23, 2025 · Cloud Native

Getting Started with GPU Remote Invocation Using rCUDA

This article introduces GPU remote invocation, explains rCUDA's architecture, walks through installing the server and client, demonstrates running CUDA samples on a GPU‑less node, and shows how to deploy rCUDA on Kubernetes with example DaemonSet and Job manifests.

CUDA · Docker · GPU remote invocation
0 likes · 7 min read
DeWu Technology
Jan 13, 2025 · Artificial Intelligence

Unlock GPU Power: A Hands‑On Triton Guide for Vector Add, Matrix Multiply & RoPE

This article introduces Triton—a Python‑based GPU programming language—covers essential GPU architecture, walks through practical kernels for vector addition, matrix multiplication, and rotary position encoding, compares performance with PyTorch, and provides debugging tips for high‑performance deep‑learning workloads.

CUDA · Deep Learning · GPU programming
0 likes · 22 min read
AntTech
Nov 16, 2024 · Information Security

WarpDrive: GPU-Based Fully Homomorphic Encryption Acceleration Leveraging Tensor and CUDA Cores Accepted at HPCA 2025

Ant Group’s Computing Systems Lab announced that its GPU‑accelerated fully homomorphic encryption framework WarpDrive, which exploits Tensor and CUDA cores for high‑throughput NTT operations and parallel kernel designs, has been accepted as a paper at the IEEE HPCA 2025 conference.

CUDA · Fully Homomorphic Encryption · GPU
0 likes · 4 min read
Alibaba Cloud Native
Aug 4, 2024 · Artificial Intelligence

Step‑by‑Step Guide: Deploy the Roop AI Face‑Swap Project with Tongyi Lingma

This tutorial walks you through cloning the open‑source Roop AI face‑swap repository, setting up a conda environment, installing CUDA‑enabled PyTorch, configuring FFmpeg, and using the Tongyi Lingma AI coding assistant to explore code, resolve errors, and fine‑tune runtime parameters for successful video swapping.

AI face swap · CUDA · Roop
0 likes · 7 min read
DevOps
Jun 13, 2024 · R&D Management

Jensen Huang on Management Philosophy, Team Structure, and Innovation at NVIDIA

In this interview, NVIDIA founder Jensen Huang shares his management philosophy, emphasizing the value of tackling difficult tasks, maintaining a small yet empowered team, avoiding layoffs, fostering a zero‑market mindset, navigating the early challenges of CUDA, and leveraging AI to drive future innovation.

AI · CUDA · Innovation
0 likes · 12 min read
IT Services Circle
May 2, 2024 · Artificial Intelligence

LLM.c: A 1000‑Line C Implementation for Training GPT‑2

Andrej Karpathy’s LLM.c project demonstrates how a compact, pure‑C (and CUDA) codebase of roughly 1000 lines can train a GPT‑2 model, covering data preparation, memory management, layer implementations, compilation, and practical tips for running and testing the model on CPUs and GPUs.

AI · C · CUDA
0 likes · 10 min read
NewBeeNLP
Apr 11, 2024 · Artificial Intelligence

How Karpathy Built a 1,000‑Line C LLM Trainer Without Any Deep‑Learning Framework

Andrej Karpathy released LLM.c, a pure C/CUDA implementation that trains GPT‑2‑style models in about 1,000 lines of code. The article details its manual forward/backward passes, memory allocation tricks, SIMD CPU acceleration, CUDA porting, and migration tutorials, compares it to PyTorch, and discusses broader LLM OS implications.

C programming · CUDA · GPT
0 likes · 6 min read
Architects' Tech Alliance
Jun 20, 2023 · Fundamentals

Introducing NVIDIA DOCA GPUNetIO: GPU‑Initiated Communication for Real‑Time Packet Processing

NVIDIA's new DOCA GPUNetIO library enables GPU‑initiated communication, allowing packets to be received directly into GPU memory, processed by CUDA kernels, and sent without CPU involvement, offering lower latency, higher scalability, and detailed pipeline examples including IP checksum, HTTP filtering, traffic forwarding, and 5G Aerial SDK integration.

5G · CUDA · DOCA
0 likes · 19 min read
High Availability Architecture
Jun 15, 2023 · Artificial Intelligence

InferX Inference Framework: Challenges, Architecture, Optimizations, and Triton Integration

The article presents the background, challenges, and objectives of Bilibili's AI services, introduces the self‑developed InferX inference framework with its quantization and sparsity optimizations, details OCR‑specific enhancements, and describes how integrating InferX with Nvidia Triton dramatically improves throughput, latency, and GPU utilization.

AI Optimization · CUDA · Inference
0 likes · 10 min read
DeWu Technology
Mar 8, 2023 · Artificial Intelligence

Optimizing Python GPU Inference Services with CPU/GPU Process Separation and TensorRT

By moving CPU preprocessing and post‑processing into processes separate from GPU inference and applying TensorRT’s FP16/INT8 optimizations, the custom Python framework raises the throughput of Python vision inference services from roughly 4.5 to 27.4 QPS, a speedup reported as 5‑10×, while reducing GPU utilization and cost.

CPU-GPU Separation · CUDA · GPU inference
0 likes · 14 min read
Shopee Tech Team
Jun 2, 2022 · Backend Development

Applying GPU Technology for High‑Throughput Image Rendering in Shopee Off‑Platform Ads

The Shopee Off‑Platform Ads team built a GPU‑accelerated Creative Rendering System that uses a four‑layer architecture, CGO‑bridged C/C++ kernels, and template caching to process billions of product images daily under high concurrency, achieving a roughly tenfold speedup at half the cost and with far less rack space.

Advertising · CUDA · GPU
0 likes · 23 min read
Liangxu Linux
Aug 17, 2021 · Cloud Native

How to Enable GPU Acceleration in Docker on Linux

This guide walks you through installing NVIDIA drivers, CUDA, and nvidia-docker2 on a Linux host, configuring Docker to access the GPU, and verifying the setup with commands and sample TensorFlow/PyTorch code, enabling deep‑learning workloads inside containers.

CUDA · Deep Learning · Docker
0 likes · 7 min read
MaGe Linux Operations
Jul 26, 2021 · Fundamentals

Boost NumPy Performance 10× with CuPy: GPU Acceleration Guide

This article explains how CuPy mirrors NumPy's API to run array and matrix operations on NVIDIA GPUs, providing step‑by‑step installation, code examples, and benchmark results that demonstrate speedups ranging from 10× to over 700× compared to CPU‑only NumPy.

CUDA · CuPy · GPU Acceleration
0 likes · 5 min read
TiPaiPai Technical Team
Jun 25, 2021 · Artificial Intelligence

Mastering TensorRT: Deploy Deep Learning Models Efficiently

This article introduces TensorRT, explains its deployment workflow from model training to engine generation, shows how to register custom operators for ONNX and create TensorRT plugins, and explores deformable convolution (DCN) implementation strategies for high‑performance AI inference.

AI inference · CUDA · Custom Operators
0 likes · 8 min read
DataFunTalk
Jun 13, 2021 · Artificial Intelligence

GPU Virtual Sharing for AI Inference Services on Kubernetes

The article presents a GPU virtual‑sharing solution for AI inference workloads that isolates memory and compute resources via CUDA API interception, integrates with Kubernetes using the open‑source aliyun‑gpushare scheduler, and demonstrates doubled GPU utilization and minimal performance loss across multiple tests.

CUDA · GPU virtualization · Kubernetes
0 likes · 16 min read
iQIYI Technical Product Team
May 28, 2021 · Artificial Intelligence

iQIYI GPU Virtual Sharing for AI Inference: Architecture, Isolation, and Scheduling

iQIYI created a custom GPU‑virtual‑sharing system that intercepts CUDA calls to enforce per‑container memory limits, rewrites kernel launches for compute isolation, and integrates with a Kubernetes scheduler extender, allowing multiple AI inference containers to share a single V100 with minimal overhead and more than doubling overall GPU utilization.

AI inference · CUDA · GPU virtualization
0 likes · 16 min read
Architects' Tech Alliance
Mar 20, 2021 · Fundamentals

Evolution of NVIDIA GPU Architectures from Fermi to Ampere

This article outlines the progression of NVIDIA GPU architectures—from the early Fermi and Kepler designs through Maxwell, Pascal, Volta, Turing, and the latest Ampere—detailing compute capabilities, SM structures, FP64/FP32 ratios, Tensor Core introductions, and their impact on AI and high‑performance computing.

AI · CUDA · GPU architecture
0 likes · 19 min read
Architects' Tech Alliance
Mar 15, 2021 · Artificial Intelligence

Evolution of NVIDIA GPU Architectures from Fermi to Ampere

This article provides a comprehensive overview of NVIDIA's GPU architecture evolution—covering Fermi, Kepler, Maxwell, Pascal, Volta, Turing, and Ampere—detailing compute capabilities, SM structures, specialized units such as Tensor Cores, and their impact on AI and high‑performance computing workloads.

AI · CUDA · GPU
0 likes · 19 min read
Programmer DD
Dec 6, 2020 · Cloud Native

Enable GPU Support in Kubernetes with Containerd and NVIDIA Runtime

This guide walks through installing NVIDIA drivers, CUDA toolkit, nvidia-container-runtime, configuring Containerd, deploying the NVIDIA device plugin, and testing GPU access inside Kubernetes pods, providing a complete solution for GPU workloads on containerd‑based clusters.

CUDA · Device Plugins · GPU
0 likes · 11 min read
TAL Education Technology
May 14, 2020 · Artificial Intelligence

An Introduction to GPU Computing and CUDA Architecture

This article provides a concise overview of GPU computing fundamentals, covering GPU hardware components, memory hierarchy, parallel execution models, and the CUDA programming framework, illustrating how CPUs and GPUs cooperate in heterogeneous computing environments.

CUDA · CUDA programming · GPU
0 likes · 16 min read
Architects' Tech Alliance
Dec 28, 2019 · Artificial Intelligence

Understanding CPU vs GPU, GPU Parameters, and NVIDIA Architectures for AI and High‑Performance Computing

The article explains how CPUs and GPUs differ in architecture and workload handling, details key GPU specifications such as CUDA cores, memory bandwidth, and floating‑point precision, reviews NVIDIA's product families and architectural evolution, and highlights the role of GPUs in deep learning training and inference. It closes with a promotion for a related technical ebook.

AI · CPU · CUDA
0 likes · 13 min read
Architects' Tech Alliance
Dec 21, 2019 · Fundamentals

GPU Overview, Usage Methods, and Virtualization Technologies

This article explains the definition and history of GPUs, why dedicated graphics processors are needed, how they are accessed through graphics libraries and vendor APIs such as OpenGL, DirectX, CUDA and OpenCL, and describes various GPU virtualization techniques including virtual graphics cards, passthrough, and vCUDA with their client‑server‑manager architecture.

CUDA · Compute · GPU
0 likes · 20 min read
360 Quality & Efficiency
Dec 6, 2019 · Artificial Intelligence

Accelerating OpenCV Image Matching with GPU (CUDA) in Python

This article demonstrates how compiling OpenCV 3.2 with CUDA 8.0 enables GPU‑accelerated template matching in Python, cutting average processing time from 0.299 seconds on CPU to 0.181 seconds on GPU, a reduction of about 39% for the image‑recognition APIs used in automated testing.

CUDA · GPU · OpenCV
0 likes · 3 min read
Architects' Tech Alliance
Apr 21, 2019 · Fundamentals

Differences Between CPU and GPU Architectures and the Relationship Between OpenCL and CUDA

This article explains the fundamental architectural differences between CPUs and GPUs, their design goals and performance characteristics, and compares OpenCL and CUDA, highlighting OpenCL’s cross‑platform flexibility versus CUDA’s NVIDIA‑specific optimization, while illustrating how each fits various parallel computing tasks.

CPU · CUDA · GPU
0 likes · 7 min read