Tagged articles

CUDA

112 articles · Page 1 of 2

Jun 26, 2026 · Industry Insights

Qualcomm's $3.9B Modular Acquisition Aims to Close AI Software Gap and Challenge CUDA

Qualcomm announced a $3.9 billion all‑stock purchase of AI infrastructure software firm Modular, whose cross‑hardware MAX inference engine and Mojo language aim to fill Qualcomm’s AI software shortfall, reduce reliance on CUDA, and support a broader cloud‑to‑edge AI ecosystem.

AI Infrastructure · CUDA · MAX engine

0 likes · 9 min read

Ubuntu

Jun 15, 2026 · Artificial Intelligence

Running AI/ML Models on WSL with CUDA Acceleration: A PyTorch Hands‑On Guide

This guide shows how to enable NVIDIA GPU passthrough in WSL 2, install the CUDA toolkit, set up a PyTorch GPU environment, verify GPU visibility, and run real‑world AI/ML workloads such as LLM inference, YOLO object detection, and Jupyter monitoring, while providing performance comparisons, optimization tips, and troubleshooting FAQs.

AI · CUDA · GPU

0 likes · 13 min read

DeepHub IMBA

Jun 4, 2026 · Artificial Intelligence

Hand‑Writing a Triton Softmax Kernel: Program Instances, Block Size, Masking & Pointer Arithmetic

This article walks through implementing a row‑wise softmax kernel in Triton, explaining program‑instance mapping, block‑size selection, mask handling, pointer arithmetic, resource‑usage analysis, and an RTX 5090 benchmark that reveals performance cliffs compared to PyTorch.

CUDA · GPU kernel · Python

0 likes · 9 min read

Lao Guo's Learning Space

Jun 3, 2026 · Industry Insights

Can Apple’s M5 Ultra Still Compete After NVIDIA’s RTX Spark Launch?

The RTX Spark desktop processor delivers 1 PFLOP of AI compute—about 14 times the M5 Ultra—while the M5 Ultra retains a three‑times higher memory bandwidth and twice the memory capacity, making it superior for certain inference workloads; the article breaks down specs, benchmarks, ecosystem differences, pricing and market positioning to show how each platform fits distinct AI use cases.

AI compute · Apple M5 Ultra · CUDA

0 likes · 12 min read

Machine Heart

May 24, 2026 · Artificial Intelligence

Can CODA Enable LLMs and Beginners to Write Lightning‑Fast Transformer Kernels?

CODA rewrites Transformer blocks as GEMM‑epilogue programs, exposing five primitive building blocks that let both AI‑generated code and human programmers fuse memory‑intensive operations into the GEMM epilogue, eliminating costly tensor moves and achieving up to 1.8× speed‑ups on H100 GPUs for RMSNorm, SwiGLU, RoPE and other components, while preserving numerical accuracy.

CODA · CUDA · GEMM

0 likes · 11 min read

Machine Learning Algorithms & Natural Language Processing

May 20, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the ‘Attention Is All You Need’ Authors

The paper shows that applying lightweight L1 regularization can make over 99% of FFN activations zero, and by using a new tile‑wise ELLPACK (TwELL) format together with a hybrid routing scheme, inference speed improves up to 30% while memory usage drops over 24% and energy consumption is reduced, all with negligible impact on downstream task performance.

CUDA · GPU Optimization · Hybrid Routing

0 likes · 8 min read

Machine Learning Algorithms & Natural Language Processing

May 9, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the Original Authors

A new ICML 2026 paper by Sakana AI and NVIDIA shows that applying lightweight L1 regularization can make Feed‑Forward Network activations in Transformers over 99% sparse, and with the TwELL storage format and a hybrid routing scheme this sparsity translates into up to 20.5% inference speedup, 21.9% training‑step acceleration, lower energy consumption and reduced peak memory across 0.5‑2 B‑parameter models while preserving downstream performance.

CUDA · GPU Optimization · Hybrid Routing

0 likes · 9 min read

Machine Heart

May 2, 2026 · Industry Insights

Beyond CUDA: Nvidia’s Token Factory and Supply Chain Guard Its Moat from TPU

The article examines Nvidia’s competitive moat beyond CUDA, detailing how its token‑factory model, extensive supply‑chain commitments, and a flexible accelerator ecosystem contrast with Google’s TPU ASIC approach, while also exploring the impact of AI agents on future compute demand.

AI hardware · CUDA · NVIDIA

0 likes · 7 min read

Raymond Ops

Apr 27, 2026 · Artificial Intelligence

vLLM Production Pitfalls: The Ultimate Fix for PagedAttention Memory Fragmentation and OOM

This article analyzes why vLLM's PagedAttention can cause GPU memory fragmentation and out‑of‑memory errors in production, presents four typical OOM scenarios, and provides concrete diagnostics, configuration tweaks, code examples, and monitoring strategies to eliminate the problem.

CUDA · GPU memory · LLM serving

0 likes · 22 min read

CodeTrend

Apr 19, 2026 · Artificial Intelligence

Understanding NVIDIA Jetpack: Design Framework, Architecture, and Flashing Process

This article explains NVIDIA Jetpack’s three‑layer architecture, its relationship with the SDK Manager installer, step‑by‑step flashing procedures for Jetson devices, common failure points such as the 35.29% stall, and practical troubleshooting and hybrid manual‑automatic solutions.

CUDA · Embedded AI · Flashing

0 likes · 11 min read

TonyBai

Apr 17, 2026 · Industry Insights

The 30‑Year Journey: From Parallel Computing to Modern GPU‑Powered AI

This article traces three decades of government‑funded research in parallel computing, graphics systems, and stream processing, showing how those advances migrated to companies like Nvidia, evolved into CUDA and other GPU technologies, and ultimately enabled today’s AI revolution.

AI · CUDA · GPU computing

0 likes · 18 min read

Machine Learning Algorithms & Natural Language Processing

Apr 15, 2026 · Artificial Intelligence

Industrial Code LLM Learns to Think Before Writing – InCoder-32B Thinking Tackles Verilog and CUDA Pitfalls

The article analyzes InCoder-32B Thinking, an industrial‑code large language model that incorporates error‑driven chain‑of‑thought and an Industrial Code World Model to predict execution outcomes, adapt reasoning depth, and achieve high accuracy across diverse hardware‑centric benchmarks.

CUDA · Large Language Model · Verilog

0 likes · 7 min read

Machine Heart

Apr 14, 2026 · Artificial Intelligence

When Verilog and CUDA Fail: How Industrial Code Models Are Learning to Think Before They Write

The article analyzes InCoder-32B Thinking, an industrial code large model that integrates error‑driven chain‑of‑thought and a world‑model to predict real‑system outcomes, showing high accuracy on diverse benchmarks and demonstrating adaptive reasoning depth for tasks ranging from Verilog synthesis to CUDA kernel optimization.

AI · CUDA · Verilog

0 likes · 8 min read

SuanNi

Mar 29, 2026 · Artificial Intelligence

How an AI Agent Outperformed NVIDIA Engineers in 7‑Day GPU Kernel Optimization

This article analyzes the AVO system, an autonomous AI agent that replaces traditional evolutionary search pipelines to iteratively improve CUDA attention kernels on NVIDIA's Blackwell B200 GPU, achieving up to 10.5% higher throughput than hand‑tuned implementations after a week of nonstop optimization.

AI · CUDA · GPU Optimization

0 likes · 13 min read

AI Info Trend

Mar 24, 2026 · Industry Insights

NVIDIA’s DLSS 5 & CUDA Flywheel: Transforming AI in Gaming and Enterprise

The GTC 2026 keynote revealed NVIDIA’s latest DLSS 5 technology using 3‑D guided neural rendering to deliver cinematic‑quality graphics in real time, outlined a 20‑year CUDA ecosystem flywheel that fuels AI acceleration across structured and unstructured data, showcased enterprise case studies like Nestlé’s data‑refresh breakthrough, and highlighted a vast partner network, illustrating how AI is moving from experimental labs to everyday production.

AI · CUDA · DLSS

0 likes · 5 min read

Machine Learning Algorithms & Natural Language Processing

Mar 3, 2026 · Artificial Intelligence

How CUDA Agent Lets Anyone Write High‑Performance CUDA Kernels, Challenging Nvidia’s AI Moat

CUDA Agent, a large‑scale reinforcement‑learning system from ByteDance and Tsinghua, can automatically generate and optimize CUDA kernels that outperform torch.compile by up to 2× on simple kernels and achieve around 40% higher speed than proprietary models on the hardest benchmarks, while detailing its data‑synthesis pipeline, training workflow, and current limitations.

CUDA · GPU Optimization · KernelBench

0 likes · 10 min read

AI Explorer

Mar 3, 2026 · Artificial Intelligence

ByteDance & Tsinghua Reveal AI‑Powered CUDA Agent for Self‑Evolving Kernels

ByteDance and Tsinghua University have created the CUDA Agent, an AI compiler that automatically writes and optimizes GPU kernels, delivering up to double the performance, and heralding a shift where AI‑generated low‑level code could reshape the hardware‑software competition landscape.

AI compiler · ByteDance · CUDA

0 likes · 6 min read

AI Engineering

Feb 27, 2026 · Artificial Intelligence

Ubuntu 26.04 LTS Optimized for Local AI with Plug‑and‑Play GPU Drivers and Sandbox Inference

Ubuntu 26.04 LTS adds automatic detection and installation of NVIDIA CUDA or AMD ROCm drivers and introduces pre‑configured Inference Snaps sandbox containers, building on the AI groundwork laid by 24.04 LTS to dramatically lower the setup barrier for local AI development.

AI inference · CUDA · GPU drivers

0 likes · 4 min read

HyperAI Super Neural

Feb 4, 2026 · Artificial Intelligence

Practical Experience: Optimizing Elementwise Operators on HyperAI Cloud Compute Platform

The article walks through a step‑by‑step optimization of a simple elementwise addition kernel (C = A + B) on HyperAI's RTX 5090 cloud instance, covering FP32 baseline, vectorized FP32, several FP16 variants, benchmark methodology, performance results, and the reasoning behind thread‑block sizing.

CUDA · Elementwise · FP16

0 likes · 30 min read

21CTO

Jan 26, 2026 · Artificial Intelligence

What’s New in PyTorch 2.10? Deep Dive into GPU and CUDA Enhancements

PyTorch 2.10 introduces extensive upgrades for AMD ROCm, Intel XPU, and NVIDIA CUDA, adds new Torch XPU APIs, expands Python 3.14 support, and brings performance‑focused improvements such as fused kernels and enhanced quantization, all available via the official GitHub release.

CUDA · GPU · PyTorch

0 likes · 4 min read

TonyBai

Jan 21, 2026 · Artificial Intelligence

When Go Meets GPU: A Hands‑On Guide to Unlocking Thousand‑Fold Compute with CUDA

This article walks Go developers through the fundamentals of GPU architecture and CUDA, demonstrates a complete CGO‑based matrix‑multiplication project, offers performance‑tuning tips such as minimizing PCIe transfers and leveraging shared memory, and presents a PureGo alternative for seamless Go‑GPU integration.

CGO · CUDA · GPU computing

0 likes · 17 min read

HyperAI Super Neural

Dec 17, 2025 · Artificial Intelligence

Can cuTile’s Tile Paradigm Disrupt the GPU Programming Landscape and Challenge Triton?

The article analyzes NVIDIA's newly announced cuTile, a tile‑based Python DSL for GPU kernels, examining its technical differences from CUDA's SIMT model, its potential to reshape the GPU programming ecosystem, community reactions, competition with Triton, and the uncertain future that hinges on ecosystem maturity and migration tools.

AI workloads · CUDA · GPU programming

0 likes · 12 min read

AI2ML AI to Machine Learning

Dec 16, 2025 · Industry Insights

Why Computer Science Majors Must Embrace a Massive Paradigm Shift

The article argues that traditional storage‑centric computer science curricula are becoming obsolete as AI‑driven, compute‑centric paradigms dominate hardware, data‑center operations, and software ecosystems, urging universities and students to rapidly adopt new teaching focus and skills.

AI hardware · CUDA · associative memory

0 likes · 10 min read

Raymond Ops

Dec 16, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Setup to Production

This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA and Docker setup, native and containerized deployment methods, core parameter tuning, advanced sharding, dynamic monitoring, troubleshooting, production best practices, and a real‑world RTX 4090 case study.

AI inference · CUDA · GPU

0 likes · 15 min read

Java Tech Enthusiast

Dec 8, 2025 · Artificial Intelligence

Explore CUDA Toolkit 13.1: CUDA Tile, Green Contexts, and Performance Boosts

NVIDIA's CUDA Toolkit 13.1 introduces the groundbreaking CUDA Tile programming model, green context support, enhanced math libraries, and numerous performance improvements for AI and GPU workloads, while also adding new developer tools, MPS features, and deterministic options for CUB.

CUDA · CUDA Tile · GPU programming

0 likes · 16 min read

Linux Kernel Journey

Dec 7, 2025 · Fundamentals

CUDA Optimization Basics: Understanding GPU Architecture and Warp Scheduling

This article explains the fundamentals of CUDA performance tuning, covering GPU architectures from Kepler to Volta, the role of SMX, warp schedulers, registers and memory hierarchies, and provides practical guidance on launch configuration, latency hiding, and thread‑block sizing to maximize throughput.

CUDA · GPU architecture · Performance Optimization

0 likes · 21 min read

Python Programming Learning Circle

Oct 28, 2025 · Artificial Intelligence

Why Nvidia Is Making Python a First‑Class Citizen in CUDA

Nvidia announced native Python support for its CUDA toolkit, detailing new Python‑centric APIs, projects like cuTile and CUTLASS, and a layered strategy that democratizes GPU programming for AI developers while preserving performance and expanding the ecosystem.

AI · CUDA · GPU

0 likes · 10 min read

Linux Kernel Journey

Oct 24, 2025 · Fundamentals

Mastering CUDA Function Type Annotations: A Complete Guide

This article provides a comprehensive overview of CUDA function type annotations—including __global__, __device__, __host__, combined annotations, and memory‑space qualifiers—explains their purposes, characteristics, and syntax, demonstrates practical examples, offers best‑practice guidelines, highlights common pitfalls, and introduces advanced topics such as dynamic parallelism and cooperative groups.

CUDA · GPU programming · device functions

0 likes · 14 min read

Linux Kernel Journey

Oct 21, 2025 · Industry Insights

Bridging the GPU Observability Gap: Why eBPF on GPUs Matters

The article explains how bpftime extends eBPF to NVIDIA and AMD GPUs, exposing fine‑grained execution details that traditional CPU‑side tools miss, and demonstrates a unified, programmable observability stack that overcomes the limitations of existing GPU profilers in both synchronous and asynchronous workloads.

CUDA · GPU · Observability

0 likes · 23 min read

BirdNest Tech Talk

Oct 15, 2025 · Artificial Intelligence

How DeepSeek‑V3.2‑Exp Achieves Fast Distributed LLM Inference with FP8 and MoE

This article walks through the DeepSeek‑V3.2‑Exp inference codebase, detailing its MoE architecture, Multi‑Head Latent Attention, FP8 quantization, custom CUDA kernels, and 8‑GPU NCCL‑based distributed execution from initialization through prefill and decode stages.

CUDA · Distributed Inference · FP8 quantization

0 likes · 9 min read

Programmer DD

Oct 12, 2025 · Backend Development

Boost Java Performance: Integrate CUDA GPU Acceleration via JNI

This guide explains why Java struggles with high‑performance or data‑intensive workloads, introduces GPU acceleration with CUDA, compares integration options such as JCuda, JNI, and JNA, walks through a practical encryption use case with performance benchmarks, and provides production‑grade best practices for memory, threading, testing, security, and deployment.

CUDA · GPU · High-performance computing

0 likes · 23 min read

Linux Kernel Journey

Sep 28, 2025 · Fundamentals

Low‑Latency GPU Packet Processing: Techniques, Trade‑offs, and Benchmarks

This article examines how to achieve low‑latency network packet processing on NVIDIA GPUs by comparing CPU and GPU implementations, exploring memory optimizations, batch strategies, stream concurrency, persistent kernels, and CUDA graphs, and presenting detailed performance measurements for each technique.

CUDA · GPU · Performance Optimization

0 likes · 12 min read

AI Cyberspace

Sep 28, 2025 · Artificial Intelligence

How to Set Up WSL2 GPU Acceleration and Profile CUDA on Windows 11

This guide walks through configuring Windows 11 with WSL2 and Ubuntu 22.04 for GPU‑accelerated CUDA development, installing NVIDIA drivers and CUDA libraries, setting up SSH and firewall rules, running a CUDA stress‑test program, and using Nsight Systems, Nsight Compute, and NVIDIA DCGM for performance profiling and monitoring.

CUDA · GPU · Linux

0 likes · 39 min read

Architect's Alchemy Furnace

Sep 27, 2025 · Artificial Intelligence

How to Set Up Xinference with NVIDIA RTX 4090 on Oracle Linux: A Step‑by‑Step Guide

This guide walks you through configuring a high‑performance AI inference server on Oracle Linux, covering hardware specs, NVIDIA driver and CUDA installation, Conda environment setup, Xinference deployment, service startup, and example model loading commands, all with clear code snippets and images.

AI inference · CUDA · Conda

0 likes · 10 min read

Linux Kernel Journey

Sep 24, 2025 · Fundamentals

Fine-Grained GPU Code Modifications: Boosting CUDA Performance

This article explains why certain GPU performance gains require direct CUDA kernel edits and walks through fine‑grained techniques such as data‑layout restructuring, warp‑level primitives, tiled memory accesses, kernel fusion, and dynamic execution paths, backed by code examples and benchmark insights.

CUDA · GPU Optimization · dynamic execution

0 likes · 12 min read

Refining Core Development Skills

Sep 11, 2025 · Fundamentals

How Kepler Boosted GPU Performance: Architecture, Specs, and Compute Power

This article examines NVIDIA's Kepler GPU architecture, highlighting its 28 nm process, increased transistor count, expanded CUDA core count, PCIe 3.0 support, enhanced memory hierarchy, new compute units, scheduling improvements like Hyper‑Q, and performance metrics of the Tesla K20X, illustrating the substantial gains over previous generations.

CUDA · Compute · GPU

0 likes · 13 min read

Data STUDIO

Sep 8, 2025 · Artificial Intelligence

CuPy vs NumPy: Achieving Over 10× Speedup with GPU Acceleration

The article explains how replacing NumPy with the GPU‑compatible CuPy library can dramatically accelerate array computations, walks through installation prerequisites, demonstrates benchmark scripts showing up to ten‑fold speed improvements, discusses data type effects, custom kernels, and hybrid CPU‑GPU workflows for large‑scale data processing.

CUDA · CuPy · GPU Acceleration

0 likes · 21 min read

Alibaba Cloud Developer

Sep 8, 2025 · Fundamentals

How to Profile GPU Kernels with PTX Probes: From CUDA Basics to Custom Instrumentation

This article walks through GPU performance analysis, starting with CUDA architecture fundamentals, demonstrating matrix multiplication optimization, explaining PTX assembly, and introducing the Neutrino framework for programmable GPU probes that enable fine‑grained, custom instrumentation and detailed timing measurements of kernel execution.

CUDA · GPU · Neutrino

0 likes · 45 min read

Refining Core Development Skills

Aug 7, 2025 · Fundamentals

Why NVIDIA’s First Data‑Center GPU Revolutionized Computing: Inside the Tesla G80 Architecture

This article explains how NVIDIA transitioned from gaming graphics cards to general‑purpose GPUs with the first data‑center Tesla GPU, detailing the unified shader architecture, the internal components of TPCs and SMs, CUDA 1.0 programming basics, and performance calculations that illustrate the massive computational advantage over contemporary CPUs.

CUDA · GPGPU · GPU architecture

0 likes · 23 min read

AI Cyberspace

Aug 4, 2025 · Artificial Intelligence

From Tesla to Hopper: How NVIDIA GPU Architectures Powered the AI Revolution

This article traces the evolution of NVIDIA GPU architectures—from the early Tesla series through Fermi, Kepler, Maxwell, Pascal, Volta, Turing, Ampere, Hopper, and up to the upcoming Blackwell—explaining their hardware innovations, CUDA programming model, and how each generation enabled breakthroughs in high‑performance computing, deep learning, and AI applications.

AI · CUDA · GPU

0 likes · 67 min read

MaGe Linux Operations

Jul 21, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Zero to Production

This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA setup, native and Docker deployment methods, detailed parameter tuning, advanced sharding strategies, troubleshooting, performance optimization, and production‑grade monitoring to maximize throughput and stability of large language models.

AI Deployment · CUDA · Ollama

0 likes · 16 min read

Linux Kernel Journey

Jul 21, 2025 · Fundamentals

Mastering CUDA GPU Performance Analysis and Tracing

This guide walks you through a complete workflow for profiling CUDA applications, covering GPU performance fundamentals, key metrics, NVIDIA Nsight tools, CUPTI programming, example code, common bottlenecks, and best‑practice recommendations to identify and eliminate performance limits.

CUDA · CUPTI · GPU profiling

0 likes · 13 min read

Open Source Linux

Jul 16, 2025 · Artificial Intelligence

How Huawei’s New AI Chip Aims to Rival Nvidia and AMD GPUs

Huawei is developing a new AI‑focused GPU‑style chip that mirrors Nvidia and AMD architectures, aiming to ease Chinese developers’ shift from Nvidia hardware, but still faces software compatibility hurdles due to reliance on CUDA and ongoing U.S. export restrictions.

AI chip · CUDA · GPU

0 likes · 3 min read

Network Intelligence Research Center (NIRC)

Jul 15, 2025 · Fundamentals

How to Write High‑Performance GPU Code with OpenAI Triton

This article introduces OpenAI's Triton language, compares its block‑wise programming model to traditional CUDA, walks through vector‑addition and fused‑softmax kernel implementations, and presents benchmark results that demonstrate significant speedups over native PyTorch operations.

CUDA · GPU programming · PyTorch

0 likes · 10 min read

Architects' Tech Alliance

Jul 13, 2025 · Artificial Intelligence

How Huawei’s New AI Chip Aims to Rival Nvidia’s GPUs

Huawei is developing a new AI chip that functions more like a general‑purpose GPU, aiming to match Nvidia and AMD architectures and simplify the transition for Chinese AI developers, while still facing challenges such as adapting CUDA‑based software and overcoming export restrictions.

AI chip · CUDA · GPU

0 likes · 3 min read

Tencent Technical Engineering

Jul 8, 2025 · Artificial Intelligence

Why GPUs Power Large‑Model Inference: From Graphics to GPGPU

This article explains how modern GPUs evolved from graphics rendering to general‑purpose computing, details the CPU‑GPU heterogeneous architecture, walks through a CUDA demo that adds two billion‑element arrays, compares CPU and GPU performance, and describes the compilation, loading, and execution pipeline of CUDA kernels.

AI inference · CUDA · GPGPU

0 likes · 33 min read

Tencent Cloud Developer

Jul 8, 2025 · Artificial Intelligence

How GPUs Power AI: From Graphics to GPGPU Explained

This article explores how GPUs evolved from graphics accelerators to general‑purpose processors for AI, detailing the CPU‑GPU heterogeneous architecture, the CUDA programming workflow, compilation into fat binaries, kernel launch mechanics, hardware components, and the differences between SIMD and SIMT models, with performance comparisons and code examples.

AI · CUDA · GPGPU

0 likes · 31 min read

JavaEdge

Jun 28, 2025 · Backend Development

How Java Developers Can Harness CUDA on NVIDIA A100 GPUs

This guide explains why Java architects should understand CUDA, describes the GPU programming model, compares CPU and GPU designs, and details three practical ways—JNI, JCuda, and TornadoVM—to integrate CUDA acceleration into Java applications, with tips for using A100 GPUs effectively.

A100 · CUDA · GPU

0 likes · 15 min read

Linux Kernel Journey

Jun 9, 2025 · Fundamentals

How to Trace CUDA GPU Operations with eBPF

This tutorial explains how to build an eBPF‑based tracing tool that intercepts CUDA runtime API calls via uprobes, captures detailed event data such as memory sizes, transfer directions, kernel launches and errors, and presents it in a readable format for debugging and performance analysis.

CUDA · GPU Tracing · Linux

0 likes · 17 min read

Network Intelligence Research Center (NIRC)

Jun 9, 2025 · Artificial Intelligence

How to Build High‑Performance GEMM with NVIDIA CUTLASS

The article explains why standard GEMM libraries may fall short for special matrix shapes, introduces NVIDIA’s open‑source CUTLASS library, details its hierarchical tiling architecture, and walks through a complete device‑API example that customizes tile sizes and data layouts to achieve near‑hand‑written kernel performance on modern GPUs.

CUDA · Cutlass · GEMM

0 likes · 6 min read

AI Algorithm Path

Jun 3, 2025 · Artificial Intelligence

Inside Tencent’s HunyuanVideo-Avatar: How Open‑Source AI Generates Digital Human Videos

Tencent’s HunyuanVideo-Avatar converts a static portrait and an audio clip into a lip‑synced, expressive video using a multimodal diffusion Transformer, offering open‑source weights, detailed module designs, hardware requirements, code examples, and a candid assessment of its strengths and current limitations.

AI video generation · CUDA · HunyuanVideo-Avatar

0 likes · 8 min read

Python Programming Learning Circle

Jun 2, 2025 · Artificial Intelligence

NVIDIA Adds Native Python Support to CUDA – What It Means for Developers

NVIDIA announced at GTC 2025 that CUDA will now natively support Python, allowing developers to write GPU‑accelerated code directly in Python without C/C++ knowledge, introducing new APIs, libraries, JIT compilation, performance tools, and a tile‑based programming model that aligns with Python’s array‑centric workflow.

AI · CUDA · GPU

0 likes · 7 min read

Java Tech Enthusiast

May 9, 2025 · Industry Insights

Why NVIDIA’s Native Python Support in CUDA Could Revolutionize GPU Computing

NVIDIA announced native Python support in its CUDA toolkit, enabling developers to write GPU‑accelerated code directly in Python, detailing the new programming model, JIT‑based architecture, performance benefits, and the broader impact on AI development and the developer ecosystem.

AI · CUDA · GPU

0 likes · 15 min read

Architects' Tech Alliance

Apr 16, 2025 · Industry Insights

How Do AI Chip Platforms Stack Up? A Deep Dive into CUDA, CANN, Neuware, and ROCm

This article analyzes the major AI system‑level compute platforms—NVIDIA's CUDA, Huawei's CANN, Cambricon's Neuware, and AMD's ROCm—examining their architectures, ecosystem support, performance features, compatibility layers, and how they shape the AI chip market.

AI · Analysis · CANN

0 likes · 16 min read

360 Zhihui Cloud Developer

Apr 1, 2025 · Artificial Intelligence

DeepGEMM vs Cutlass vs Triton: Which GPU GEMM Library Delivers the Best FP8 Performance?

This article presents a comprehensive benchmark of DeepGEMM, Cutlass, and Triton on NVIDIA H20 and H800 GPUs, analyzing TFLOPS, bandwidth, latency, and speedup across various matrix sizes, and concludes which library is optimal for different workload scenarios.

CUDA · DeepGEMM · FP8

0 likes · 15 min read

Tencent Technical Engineering

Mar 31, 2025 · Artificial Intelligence

Step-by-Step Guide to Local Training of DeepSeek R1 on Multi‑GPU A100 Systems

This step‑by‑step tutorial shows how to set up CUDA 12.4, install required packages, prepare a JSON dataset and custom reward, troubleshoot out‑of‑memory errors, and launch DeepSeek R1 training on an 8‑GPU A100 cluster using Accelerate, DeepSpeed ZeRO‑3 and vLLM configurations.

A100 · CUDA · DeepSeek

0 likes · 9 min read

Infra Learning Club

Mar 23, 2025 · Artificial Intelligence

Getting Started with cuda‑python and an Introduction to cuTicle

This article explains the cuda‑python ecosystem—including its core packages, installation via pip or conda, the experimental cuda.core API, a full Python‑to‑CUDA workflow with NVRTC compilation, performance comparison to C++, the covered APIs, and an overview of NVIDIA's new cuTicle programming model.

CUDA · GPU · NVIDIA

0 likes · 11 min read

Infra Learning Club

Mar 22, 2025 · Artificial Intelligence

How to Write CUDA Kernels in Python – Insights from Nvidia GTC 2025

The article reviews Nvidia GTC 2025’s session on writing CUDA kernels with Python, compares tools such as Numba, CuPy, PyTorch extensions and cuda‑python, demonstrates a segmented reduction example with C++ and Python code, explains the underlying CUDA concepts, and shows how to install and use cuda‑python to simplify kernel development.

CUDA · CuPy · GPU

0 likes · 10 min read

Tencent Technical Engineering

Mar 21, 2025 · Fundamentals

Fundamentals of GPU Architecture and Programming

The article explains GPU fundamentals—from the end of Dennard scaling and why GPUs excel in parallel throughput, through CUDA programming basics like the SAXPY kernel and SIMT versus SIMD execution, to the evolution of the SIMT stack, modern scheduling, and a three‑step core architecture design.

CUDA · GPU · GPU programming

0 likes · 42 min read

Infra Learning Club

Mar 18, 2025 · Fundamentals

Can You Direct a CUDA Kernel to a Specific SM?

The article explains CUDA’s architecture and SM basics, describes how the warp scheduler and dispatch units assign thread blocks to SMs, and concludes that developers cannot direct a kernel to a specific SM from outside the hardware scheduler, while mentioning the NanoFlow intra‑device parallelism approach as a possible indirect optimization.

CUDA · GPU architecture · Kernel Scheduling

0 likes · 7 min read

AI Cyberspace

Mar 14, 2025 · Artificial Intelligence

How NCCL Accelerates Distributed AI Training on GPUs

This article explains the origins, core functions, installation steps, and programming examples of NVIDIA’s Collective Communication Library (NCCL), detailing its role in multi‑GPU and multi‑node AI distributed training, topology discovery, path selection, channel search, and various collective communication operations.

CUDA · GPU communication · MPI

0 likes · 33 min read

Infra Learning Club

Feb 23, 2025 · Fundamentals

How to Dynamically Decompress CUDA Fatbin Files Compressed by NVCC

This article explains why enabling NVCC's --fatbin-options -compress-all breaks remote GPU calls, describes the fatbin file layout, shows how to extract and analyze the binary with objcopy, and provides a step‑by‑step implementation of a decompression routine for both ELF and PTX sections.

Binary Format · CUDA · GPU

0 likes · 9 min read

Infra Learning Club

Feb 22, 2025 · Fundamentals

Understanding NVCC Compilation: A Step‑by‑Step Technical Guide

This article walks through the NVCC compilation pipeline, explaining how CUDA source files are transformed into host and device binaries, detailing file extensions, compilation stages, command‑line options, intermediate artifacts, and the role of registration functions such as __nv_cudaEntityRegisterCallback and __sti____cudaRegisterAll.

CUDA · Compilation · GPU

0 likes · 12 min read

Infra Learning Club

Jan 31, 2025 · Fundamentals

Essential CUDA Learning Guide: Basics, Compilation, and Profiling

This article walks through a practical APOD workflow for CUDA development—assessing bottlenecks, parallelizing with cuBLAS/cuFFT/Thrust, optimizing iteratively, and deploying—while covering nvcc compilation flags, PTX virtual ISA, nvprof profiling, core terminology (SP, SM, warp, grid, block, thread), indexing patterns, and unified memory references.

CUDA · CUDA terminology · GPU programming

0 likes · 8 min read

Infra Learning Club

Jan 24, 2025 · Fundamentals

Inside NVCC: How CUDA Code Is Compiled and Linked

The article dissects NVCC’s compilation pipeline, showing how internal registration functions from host_runtime.h are injected into the host binary, how a simple CUDA demo is processed with --dryrun, and how the generated fatbin, PTX, and cubin files are linked and registered for GPU execution.

CUDA · Compilation · FatBinary

0 likes · 10 min read

Infra Learning Club

Jan 23, 2025 · Cloud Native

Getting Started with GPU Remote Invocation Using rCUDA

This article introduces GPU remote invocation, explains rCUDA's architecture, walks through installing the server and client, demonstrates running CUDA samples on a GPU‑less node, and shows how to deploy rCUDA on Kubernetes with example DaemonSet and Job manifests.

CUDA · Docker · GPU remote invocation

0 likes · 7 min read

Infra Learning Club

Jan 22, 2025 · Fundamentals

User‑Mode vs Kernel‑Mode GPU Virtualization: Architecture, Benefits, and Limits

The article compares user‑mode and kernel‑mode GPU virtualization, detailing their layered architectures, how they intercept APIs, the advantages such as openness, isolation, and unified memory, and the drawbacks including API complexity, kernel intrusion, legal risks, and cross‑process limitations.

API interception · CUDA · GPU virtualization

0 likes · 5 min read

DeWu Technology

Jan 13, 2025 · Artificial Intelligence

Unlock GPU Power: A Hands‑On Triton Guide for Vector Add, Matrix Multiply & RoPE

This article introduces Triton—a Python‑based GPU programming language—covers essential GPU architecture, walks through practical kernels for vector addition, matrix multiplication, and rotary position encoding, compares performance with PyTorch, and provides debugging tips for high‑performance deep‑learning workloads.

CUDA · GPU programming · Performance Optimization

0 likes · 22 min read

AntTech

Nov 16, 2024 · Information Security

WarpDrive: GPU-Based Fully Homomorphic Encryption Acceleration Leveraging Tensor and CUDA Cores Accepted at HPCA 2025

Ant Group’s Computing Systems Lab announced that its GPU‑accelerated fully homomorphic encryption framework WarpDrive, which exploits Tensor and CUDA cores for high‑throughput NTT operations and parallel kernel designs, has been accepted as a paper at the IEEE HPCA 2025 conference.

CUDA · Fully Homomorphic Encryption · GPU

0 likes · 4 min read

Alibaba Cloud Native

Aug 4, 2024 · Artificial Intelligence

Step‑by‑Step Guide: Deploy the Roop AI Face‑Swap Project with Tongyi Lingma

This tutorial walks you through cloning the open‑source Roop AI face‑swap repository, setting up a conda environment, installing CUDA‑enabled PyTorch, configuring FFmpeg, and using the Tongyi Lingma AI coding assistant to explore code, resolve errors, and fine‑tune runtime parameters for successful video swapping.

AI face swap · CUDA · FFmpeg

0 likes · 7 min read

DevOps

Jun 13, 2024 · R&D Management

Jensen Huang on Management Philosophy, Team Structure, and Innovation at NVIDIA

In this interview, NVIDIA founder Jensen Huang shares his management philosophy, emphasizing the value of tackling difficult tasks, maintaining a small yet empowered team, avoiding layoffs, fostering a zero‑market mindset, navigating the early challenges of CUDA, and leveraging AI to drive future innovation.

AI · CUDA · Leadership

0 likes · 12 min read

IT Services Circle

Jun 7, 2024 · Artificial Intelligence

Reader – One‑Click URL to LLM‑Friendly Input, and llm.c – C/CUDA LLM Training Tool

This article introduces Reader, an open‑source Jina AI tool that converts any web URL into a format optimized for large language models, and llm.c, a minimalist C and CUDA project that demonstrates how to train a GPT‑2‑style LLM from scratch.

C# · CUDA · Jina AI

0 likes · 4 min read

Python Programming Learning Circle

Jun 6, 2024 · Fundamentals

Accelerating Python with Numba: JIT Compilation, Decorators, and GPU Support

This article introduces Numba, a Python just‑in‑time compiler, explains why it is advantageous over alternatives, demonstrates how to apply its @jit, @njit, @vectorize and other decorators, and shows how to run accelerated code on CPUs and GPUs using CUDA.

CUDA · GPU · Python

0 likes · 9 min read

IT Services Circle

May 2, 2024 · Artificial Intelligence

LLM.c: A 1000‑Line C Implementation for Training GPT‑2

Andrej Karpathy’s LLM.c project demonstrates how a compact, pure‑C (and CUDA) codebase of roughly 1000 lines can train a GPT‑2 model, covering data preparation, memory management, layer implementations, compilation, and practical tips for running and testing the model on CPUs and GPUs.

AI · C# · CUDA

0 likes · 10 min read

NewBeeNLP

Apr 11, 2024 · Artificial Intelligence

How Karpathy Built a 1,000‑Line C LLM Trainer Without Any Deep‑Learning Framework

Andrej Karpathy released LLM.C, a pure C/CUDA implementation that trains GPT‑2‑style models in about 1,000 lines of code, detailing manual forward/backward passes, memory allocation tricks, SIMD CPU acceleration, CUDA porting, and migration tutorials, while comparing it to PyTorch and discussing broader LLM OS implications.

C Programming · CUDA · GPT

0 likes · 6 min read

Architects' Tech Alliance

Jan 14, 2024 · Industry Insights

Can Chinese GPUs Close the Gap with NVIDIA? 2023 GPGPU Landscape Analysis

A 2023 GPGPU research framework analysis finds that while Chinese GPUs like BR100 and TianGai100 can match or exceed NVIDIA A100 in FP32, they still lag in FP64 and INT8 performance, and the domestic software ecosystem based on OpenCL trails far behind NVIDIA's CUDA, shaping near‑term market dynamics.

AI computing · CUDA · China

0 likes · 6 min read

Architects' Tech Alliance

Jun 20, 2023 · Fundamentals

Introducing NVIDIA DOCA GPUNetIO: GPU‑Initiated Communication for Real‑Time Packet Processing

NVIDIA's new DOCA GPUNetIO library enables GPU‑initiated communication, allowing packets to be received directly into GPU memory, processed by CUDA kernels, and sent without CPU involvement, offering lower latency, higher scalability, and detailed pipeline examples including IP checksum, HTTP filtering, traffic forwarding, and 5G Aerial SDK integration.

5G · CUDA · DOCA

0 likes · 19 min read

High Availability Architecture

Jun 15, 2023 · Artificial Intelligence

InferX Inference Framework: Challenges, Architecture, Optimizations, and Triton Integration

The article presents the background, challenges, and objectives of Bilibili's AI services, introduces the self‑developed InferX inference framework with its quantization and sparsity optimizations, details OCR‑specific enhancements, and describes how integrating InferX with Nvidia Triton dramatically improves throughput, latency, and GPU utilization.

CUDA · Model Quantization · OCR

0 likes · 10 min read

DeWu Technology

Mar 8, 2023 · Artificial Intelligence

Optimizing Python GPU Inference Services with CPU/GPU Process Separation and TensorRT

By isolating CPU preprocessing and post‑processing from GPU inference into separate processes and applying TensorRT’s FP16/INT8 optimizations, the custom Python framework boosts Python vision inference services from roughly 4.5 to 27.4 QPS—a 5‑10× speedup—while reducing GPU utilization and cost.

CPU-GPU Separation · CUDA · GPU inference

0 likes · 14 min read

Python Programming Learning Circle

Mar 7, 2023 · Fundamentals

Accelerating Python with Numba: JIT Compilation, Decorators, and GPU Support

This article introduces Numba, a Just‑in‑Time compiler for Python that transforms functions into fast machine code using LLVM, explains why it lets you stay in pure Python, demonstrates basic @jit/@njit usage, advanced decorators, GPU execution with CUDA, and interoperability with C/C++ libraries.

CUDA · GPU · JIT

0 likes · 11 min read

Python Programming Learning Circle

Nov 15, 2022 · Fundamentals

A Comprehensive Guide to Using Numba for Python JIT Compilation

This article introduces Numba, a just‑in‑time compiler for Python, explains why it is advantageous over alternatives, demonstrates how to apply its decorators such as @jit, @njit, @vectorize, and @cuda.jit for CPU and GPU acceleration, and provides practical code examples and tips for optimal performance.

CUDA · GPU · JIT

0 likes · 10 min read

Kuaishou Large Model

Aug 26, 2022 · Cloud Computing

Boost Cloud Rendering with NVIDIA GPU: Hardware Encoding & Decoding Using FFmpeg

This article explains how to leverage server‑side GPUs for hardware‑accelerated H.264 encoding and decoding with FFmpeg, covering installation, key API calls, format conversion to OpenGL textures, multi‑process considerations, and performance optimizations for cloud‑rendered visual effects.

CUDA · FFmpeg · GPU Acceleration

0 likes · 11 min read

Shopee Tech Team

Jun 2, 2022 · Backend Development

Applying GPU Technology for High‑Throughput Image Rendering in Shopee Off‑Platform Ads

The Shopee Off‑Platform Ads team built a GPU‑accelerated Creative Rendering System that uses a four‑layer architecture, CGO‑bridged C/C++ kernels, and template caching to process billions of product images daily, achieving a roughly tenfold speedup at half the cost and with far less rack space while handling high concurrency.

Advertising · CGO · CUDA

0 likes · 23 min read

MaGe Linux Operations

May 20, 2022 · Operations

Why NVIDIA’s Open‑Source Linux GPU Kernel Driver Is a Game‑Changer

NVIDIA has finally open‑sourced its Linux GPU kernel driver, a landmark move that promises tighter OS integration, easier debugging, and broader support for Turing and Ampere GPUs, while also reshaping the relationship between proprietary drivers, the Nouveau project, and major Linux distributions.

CUDA · GPU · Kernel Driver

0 likes · 9 min read

360 Quality & Efficiency

Oct 22, 2021 · Artificial Intelligence

Troubleshooting CUDA Availability for PyTorch: Installation and Version Compatibility Guide

The article walks through diagnosing why PyTorch cannot access the GPU, reinstalling CUDA, selecting matching PyTorch builds, adjusting versions, and verifying that CUDA becomes available for accelerated training.

CUDA · GPU · Installation

0 likes · 4 min read

Architects' Tech Alliance

Aug 29, 2021 · Fundamentals

GPU Overview: History, Architecture, Processing Workflow, and Acceleration Technologies (CUDA & OpenCL)

This article provides a comprehensive overview of GPUs, covering their history, architecture, processing workflow, and acceleration technologies such as CUDA and OpenCL, while comparing GPU and CPU designs and offering resources for further study.

CUDA · GPU · OpenCL

0 likes · 14 min read

Liangxu Linux

Aug 17, 2021 · Cloud Native

How to Enable GPU Acceleration in Docker on Linux

This guide walks you through installing NVIDIA drivers, CUDA, and nvidia-docker2 on a Linux host, configuring Docker to access the GPU, and verifying the setup with commands and sample TensorFlow/PyTorch code, enabling deep‑learning workloads inside containers.

CUDA · Docker · GPU

0 likes · 7 min read

MaGe Linux Operations

Jul 26, 2021 · Fundamentals

Boost NumPy Performance 10× with CuPy: GPU Acceleration Guide

This article explains how CuPy mirrors NumPy's API to run array and matrix operations on NVIDIA GPUs, providing step‑by‑step installation, code examples, and benchmark results that demonstrate speedups ranging from 10× to over 700× compared to CPU‑only NumPy.

CUDA · CuPy · GPU Acceleration

0 likes · 5 min read

TiPaiPai Technical Team

Jun 25, 2021 · Artificial Intelligence

Mastering TensorRT: Deploy Deep Learning Models Efficiently

This article introduces TensorRT, explains its deployment workflow from model training to engine generation, shows how to register custom operators for ONNX and create TensorRT plugins, and explores deformable convolution (DCN) implementation strategies for high‑performance AI inference.

AI inference · CUDA · Custom Operators

0 likes · 8 min read

DataFunTalk

Jun 13, 2021 · Artificial Intelligence

GPU Virtual Sharing for AI Inference Services on Kubernetes

The article presents a GPU virtual‑sharing solution for AI inference workloads that isolates memory and compute resources via CUDA API interception, integrates with Kubernetes using the open‑source aliyun‑gpushare scheduler, and demonstrates doubled GPU utilization and minimal performance loss across multiple tests.

CUDA · GPU virtualization · NVIDIA

0 likes · 16 min read

iQIYI Technical Product Team

May 28, 2021 · Artificial Intelligence

iQIYI GPU Virtual Sharing for AI Inference: Architecture, Isolation, and Scheduling

iQIYI created a custom GPU‑virtual‑sharing system that intercepts CUDA calls to enforce per‑container memory limits, rewrites kernel launches for compute isolation, and integrates with a Kubernetes scheduler extender, allowing multiple AI inference containers to share a single V100 with minimal overhead and more than doubling overall GPU utilization.

AI inference · CUDA · GPU virtualization

0 likes · 16 min read

Python Programming Learning Circle

Apr 28, 2021 · Fundamentals

Getting Started with Numba: Python JIT Compilation and GPU Acceleration

This article introduces Numba, a Python just‑in‑time compiler, explains why it’s advantageous over alternatives, and provides detailed guidance on using its decorators such as @jit, @njit, @vectorize, and @cuda.jit, including code examples for CPU and GPU acceleration.

CUDA · GPU · JIT

0 likes · 12 min read

Architects' Tech Alliance

Mar 20, 2021 · Fundamentals

Evolution of NVIDIA GPU Architectures from Fermi to Ampere

This article outlines the progression of NVIDIA GPU architectures—from the early Fermi and Kepler designs through Maxwell, Pascal, Volta, Turing, and the latest Ampere—detailing compute capabilities, SM structures, FP64/FP32 ratios, Tensor Core introductions, and their impact on AI and high‑performance computing.

AI · CUDA · GPU architecture

0 likes · 19 min read

Architects' Tech Alliance

Mar 15, 2021 · Artificial Intelligence

Evolution of NVIDIA GPU Architectures from Fermi to Ampere

This article provides a comprehensive overview of NVIDIA's GPU architecture evolution—covering Fermi, Kepler, Maxwell, Pascal, Volta, Turing, and Ampere—detailing compute capabilities, SM structures, specialized units such as Tensor Cores, and their impact on AI and high‑performance computing workloads.

AI · CUDA · GPU

0 likes · 19 min read

Open Source Linux

Feb 8, 2021 · Operations

How to Set Up Docker with NVIDIA GPU for Deep Learning on Linux

This guide walks you through installing NVIDIA drivers, CUDA, and nvidia-docker2 on a Linux host and configuring Docker containers to access the GPU, including verification steps and sample TensorFlow and PyTorch commands.

CUDA · Docker · GPU

0 likes · 8 min read

Programmer DD

Dec 6, 2020 · Cloud Native

Enable GPU Support in Kubernetes with Containerd and NVIDIA Runtime

This guide walks through installing NVIDIA drivers, CUDA toolkit, nvidia-container-runtime, configuring Containerd, deploying the NVIDIA device plugin, and testing GPU access inside Kubernetes pods, providing a complete solution for GPU workloads on containerd‑based clusters.

CUDA · Device Plugins · GPU

0 likes · 11 min read

Tencent Cloud Developer

Jul 7, 2020 · Artificial Intelligence

Remote Development Guide on Tencent Cloud GPU Instances: Driver, CUDA, cuDNN Installation and PyCharm/Jupyter Integration

This guide walks researchers through selecting a Tencent Cloud GN7 GPU instance, installing NVIDIA drivers, CUDA 10.2, cuDNN, setting up PyTorch and Jupyter, and configuring remote development with PyCharm, enabling efficient, cost‑effective AI development on a Tesla T4 GPU server.

AI · CUDA · GPU

0 likes · 12 min read

TAL Education Technology

May 14, 2020 · Artificial Intelligence

An Introduction to GPU Computing and CUDA Architecture

This article provides a concise overview of GPU computing fundamentals, covering GPU hardware components, memory hierarchy, parallel execution models, and the CUDA programming framework, illustrating how CPUs and GPUs cooperate in heterogeneous computing environments.

CUDA · CUDA programming · GPU

0 likes · 16 min read

Architects' Tech Alliance

May 5, 2020 · Fundamentals

Why Heterogeneous Computing Is the Future: CPUs, GPUs, FPGAs, and More Explained

The article provides a comprehensive overview of heterogeneous computing, detailing its definition, real‑world system examples, performance advantages, key programming frameworks such as OpenCL and CUDA, industry trends like SoC integration, and a comparative analysis of CPUs, GPUs, FPGAs, and ASICs.

CPU · CUDA · FPGA

0 likes · 9 min read

Architects' Tech Alliance

Dec 28, 2019 · Artificial Intelligence

Understanding CPU vs GPU, GPU Parameters, and NVIDIA Architectures for AI and High‑Performance Computing

The article explains how CPUs and GPUs differ in architecture and workload handling, details key GPU specifications such as CUDA cores, memory bandwidth and floating‑point precision, reviews NVIDIA's product families and architectural evolution, and highlights the role of GPUs in deep learning training and inference while also mentioning a related technical ebook promotion.

AI · CPU · CUDA

0 likes · 13 min read
