Tag: CUDA


Python Programming Learning Circle
Jun 2, 2025 · Artificial Intelligence

NVIDIA Adds Native Python Support to CUDA – What It Means for Developers

At GTC 2025, NVIDIA announced that CUDA now natively supports Python, allowing developers to write GPU‑accelerated code directly in Python without C/C++ knowledge. The release introduces new APIs and libraries, JIT compilation, performance tools, and a tile‑based programming model that aligns with Python’s array‑centric workflow.

AI · Accelerated Computing · CUDA
0 likes · 7 min read
360 Zhihui Cloud Developer
Apr 1, 2025 · Artificial Intelligence

DeepGEMM vs Cutlass vs Triton: Which GPU GEMM Library Delivers the Best FP8 Performance?

This article presents a comprehensive benchmark of DeepGEMM, Cutlass, and Triton on NVIDIA H20 and H800 GPUs, analyzing TFLOPS, bandwidth, latency, and speedup across various matrix sizes, and concludes which library is optimal for different workload scenarios.
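For context on the TFLOPS figures such benchmarks report: throughput follows directly from the matrix dimensions and the measured latency via the standard 2·M·N·K FLOP count for GEMM. A small sketch, with hypothetical sample numbers:

```python
def gemm_tflops(m: int, n: int, k: int, latency_s: float) -> float:
    """A GEMM of shapes (m x k) @ (k x n) performs 2*m*n*k FLOPs
    (one multiply and one add per inner-product element)."""
    flops = 2 * m * n * k
    return flops / latency_s / 1e12  # convert FLOP/s to TFLOPS

# Hypothetical example: a 4096x4096x4096 GEMM measured at 0.5 ms
print(round(gemm_tflops(4096, 4096, 4096, 0.5e-3), 1))  # 274.9
```

Dividing one library's TFLOPS by another's at the same shape gives the speedup ratios such comparisons quote.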

Benchmark · CUDA · DeepGEMM
0 likes · 15 min read
Tencent Technical Engineering
Mar 31, 2025 · Artificial Intelligence

Step-by-Step Guide to Local Training of DeepSeek R1 on Multi‑GPU A100 Systems

This step‑by‑step tutorial shows how to set up CUDA 12.4, install the required packages, prepare a JSON dataset and a custom reward function, troubleshoot out‑of‑memory errors, and launch DeepSeek R1 training on an 8‑GPU A100 cluster using Accelerate, DeepSpeed ZeRO‑3, and vLLM configurations.

A100 · CUDA · DeepSeek
0 likes · 9 min read
Tencent Technical Engineering
Mar 21, 2025 · Fundamentals

Fundamentals of GPU Architecture and Programming

The article explains GPU fundamentals—from the end of Dennard scaling and why GPUs excel in parallel throughput, through CUDA programming basics like the SAXPY kernel and SIMT versus SIMD execution, to the evolution of the SIMT stack, modern scheduling, and a three‑step core architecture design.
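SAXPY computes y = a·x + y with one GPU thread per element. A minimal NumPy sketch of the same computation, with the CUDA thread-indexing scheme shown in comments:

```python
import numpy as np

def saxpy(a: float, x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # In the CUDA kernel, each thread computes exactly one element:
    #   i = blockIdx.x * blockDim.x + threadIdx.x
    #   if i < n: y[i] = a * x[i] + y[i]
    # Here the whole array is computed at once; on a GPU the elements
    # are processed in parallel by thousands of threads (SIMT execution).
    return a * x + y

x = np.arange(4, dtype=np.float32)
y = np.ones(4, dtype=np.float32)
print(saxpy(2.0, x, y))  # [1. 3. 5. 7.]
```

The bounds check `if i < n` is needed on the GPU because the thread grid is launched in fixed-size blocks and may overshoot the array length.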

CUDA · GPU · GPU programming
0 likes · 42 min read
AntTech
Nov 16, 2024 · Information Security

WarpDrive: GPU-Based Fully Homomorphic Encryption Acceleration Leveraging Tensor and CUDA Cores Accepted at HPCA 2025

Ant Group’s Computing Systems Lab announced that WarpDrive, its GPU‑accelerated fully homomorphic encryption framework that exploits Tensor and CUDA cores for high‑throughput NTT operations and parallel kernel designs, has been accepted at IEEE HPCA 2025.

CUDA · Fully Homomorphic Encryption · GPU
0 likes · 4 min read
DevOps
Jun 13, 2024 · R&D Management

Jensen Huang on Management Philosophy, Team Structure, and Innovation at NVIDIA

In this interview, NVIDIA founder Jensen Huang shares his management philosophy: tackling difficult tasks, keeping the team small but empowered, avoiding layoffs, pursuing "zero‑billion‑dollar" markets before they exist, navigating the early challenges of CUDA, and leveraging AI to drive future innovation.

AI · CUDA · Innovation
0 likes · 12 min read
Python Programming Learning Circle
Jun 6, 2024 · Fundamentals

Accelerating Python with Numba: JIT Compilation, Decorators, and GPU Support

This article introduces Numba, a Python just‑in‑time compiler, explains why it is advantageous over alternatives, demonstrates how to apply its @jit, @njit, @vectorize and other decorators, and shows how to run accelerated code on CPUs and GPUs using CUDA.
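The decorator workflow the article covers can be sketched as below. The try/except shim is only so the sketch also runs where Numba is not installed; the function itself is a hypothetical example of the loop-heavy numeric code Numba accelerates:

```python
try:
    from numba import njit  # JIT-compiles the function to machine code via LLVM
except ImportError:
    def njit(func):          # fallback: run as plain Python if Numba is absent
        return func

@njit
def sum_of_squares(n: int) -> int:
    # Numba compiles this loop on the first call; later calls run at
    # near-C speed instead of interpreting each iteration.
    total = 0
    for i in range(n):
        total += i * i
    return total

print(sum_of_squares(10))  # 285
```

The first call pays a one-time compilation cost, which is why Numba benchmarks typically warm the function up before timing it.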

CUDA · GPU · JIT
0 likes · 9 min read
IT Services Circle
May 2, 2024 · Artificial Intelligence

LLM.c: A 1000‑Line C Implementation for Training GPT‑2

Andrej Karpathy’s LLM.c project demonstrates how a compact, pure‑C (and CUDA) codebase of roughly 1000 lines can train a GPT‑2 model, covering data preparation, memory management, layer implementations, compilation, and practical tips for running and testing the model on CPUs and GPUs.

AI · C++ · CUDA
0 likes · 10 min read
Architects' Tech Alliance
Jun 20, 2023 · Fundamentals

Introducing NVIDIA DOCA GPUNetIO: GPU‑Initiated Communication for Real‑Time Packet Processing

NVIDIA's new DOCA GPUNetIO library enables GPU‑initiated communication: packets are received directly into GPU memory, processed by CUDA kernels, and sent back without CPU involvement, yielding lower latency and higher scalability. The article walks through detailed pipeline examples covering IP checksum, HTTP filtering, traffic forwarding, and 5G Aerial SDK integration.

5G · CUDA · DOCA
0 likes · 19 min read
High Availability Architecture
Jun 15, 2023 · Artificial Intelligence

InferX Inference Framework: Challenges, Architecture, Optimizations, and Triton Integration

The article presents the background, challenges, and objectives of Bilibili's AI services, introduces the self‑developed InferX inference framework with its quantization and sparsity optimizations, details OCR‑specific enhancements, and describes how integrating InferX with NVIDIA Triton dramatically improves throughput, latency, and GPU utilization.

AI optimization · CUDA · Inference
0 likes · 10 min read
DeWu Technology
Mar 8, 2023 · Artificial Intelligence

Optimizing Python GPU Inference Services with CPU/GPU Process Separation and TensorRT

By isolating CPU pre‑ and post‑processing from GPU inference into separate processes and applying TensorRT's FP16/INT8 optimizations, the custom Python framework boosts Python vision inference services from roughly 4.5 to 27.4 QPS (a 5‑10× speedup across services) while reducing GPU utilization and cost.
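The process-separation pattern can be sketched with the standard library alone. The preprocessing and inference functions below are hypothetical stand-ins; a real service would run a TensorRT engine in the GPU process:

```python
from multiprocessing import Process, Queue

def preprocess(frame):
    # CPU-bound work: stand-in for decode/resize/normalize.
    return [v / 255.0 for v in frame]

def preprocess_worker(in_q: Queue, out_q: Queue) -> None:
    # Runs in its own process, so CPU work never blocks the GPU process
    # and is not serialized behind it by the GIL.
    while (frame := in_q.get()) is not None:
        out_q.put(preprocess(frame))
    out_q.put(None)  # forward the shutdown sentinel

def infer(batch):
    # Stand-in for the GPU inference step (e.g. a TensorRT FP16/INT8 engine).
    return sum(batch)

if __name__ == "__main__":
    in_q, out_q = Queue(), Queue()
    Process(target=preprocess_worker, args=(in_q, out_q), daemon=True).start()
    for frame in ([255, 255], [0, 255]):
        in_q.put(frame)
    in_q.put(None)  # sentinel: no more frames
    results = []
    while (batch := out_q.get()) is not None:
        results.append(infer(batch))
    print(results)  # [2.0, 1.0]
```

Separate processes rather than threads matter here because CPython's GIL would otherwise let heavy Python preprocessing starve the thread feeding the GPU.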

CPU-GPU Separation · CUDA · GPU inference
0 likes · 14 min read
Python Programming Learning Circle
Mar 7, 2023 · Fundamentals

Accelerating Python with Numba: JIT Compilation, Decorators, and GPU Support

This article introduces Numba, a Just‑in‑Time compiler for Python that transforms functions into fast machine code using LLVM, explains why it lets you stay in pure Python, demonstrates basic @jit/@njit usage, advanced decorators, GPU execution with CUDA, and interoperability with C/C++ libraries.

CUDA · Decorators · GPU
0 likes · 11 min read
Python Programming Learning Circle
Nov 15, 2022 · Fundamentals

A Comprehensive Guide to Using Numba for Python JIT Compilation

This article introduces Numba, a Python Just-in-time compiler, explains why it is advantageous over alternatives, demonstrates how to apply its decorators such as @jit, @njit, @vectorize, and @cuda for CPU and GPU acceleration, and provides practical code examples and tips for optimal performance.

CUDA · GPU · JIT
0 likes · 10 min read
Kuaishou Large Model
Aug 26, 2022 · Cloud Computing

Boost Cloud Rendering with NVIDIA GPU: Hardware Encoding & Decoding Using FFmpeg

This article explains how to leverage server‑side GPUs for hardware‑accelerated H.264 encoding and decoding with FFmpeg, covering installation, key API calls, format conversion to OpenGL textures, multi‑process considerations, and performance optimizations for cloud‑rendered visual effects.
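For reference, FFmpeg's standard CUDA options for this workflow are `-hwaccel cuda` for NVDEC decoding and the `h264_nvenc` encoder. A sketch that only assembles the command line; the file names and preset are placeholders, and actually running it requires an FFmpeg build with NVENC support:

```python
import shlex

# Decode on the GPU (NVDEC), keep frames in GPU memory, encode on the GPU (NVENC).
cmd = [
    "ffmpeg",
    "-hwaccel", "cuda",                # hardware-accelerated decoding
    "-hwaccel_output_format", "cuda",  # keep decoded frames on the GPU
    "-i", "input.mp4",                 # placeholder input file
    "-c:v", "h264_nvenc",              # H.264 encoding via NVENC
    "-preset", "p4",                   # NVENC quality/speed preset
    "output.mp4",                      # placeholder output file
]
print(shlex.join(cmd))
```

Keeping frames in GPU memory between decode and encode avoids the PCIe round trips that otherwise dominate transcoding latency.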

CUDA · FFmpeg · GPU Acceleration
0 likes · 11 min read
Shopee Tech Team
Jun 2, 2022 · Backend Development

Applying GPU Technology for High‑Throughput Image Rendering in Shopee Off‑Platform Ads

The Shopee Off‑Platform Ads team built a GPU‑accelerated Creative Rendering System with a four‑layer architecture, CGO‑bridged C/C++ kernels, and template caching to process billions of product images daily, achieving roughly a ten‑fold speedup, half the cost, and a large reduction in rack space while handling high concurrency.

CUDA · GPU · Go
0 likes · 23 min read
DataFunTalk
Jun 13, 2021 · Artificial Intelligence

GPU Virtual Sharing for AI Inference Services on Kubernetes

The article presents a GPU virtual‑sharing solution for AI inference workloads that isolates memory and compute resources via CUDA API interception, integrates with Kubernetes using the open‑source aliyun‑gpushare scheduler, and demonstrates doubled GPU utilization and minimal performance loss across multiple tests.
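The memory-isolation mechanism, intercepting allocation calls and rejecting requests beyond a per-container quota, can be sketched in pure Python. The class and numbers below are hypothetical stand-ins for what such interception layers do when hooking driver calls like cuMemAlloc:

```python
class QuotaAllocator:
    """Wraps an allocator with a per-container byte quota, mimicking
    what a CUDA API interception layer enforces for cuMemAlloc."""

    def __init__(self, limit_bytes: int):
        self.limit = limit_bytes
        self.used = 0

    def malloc(self, nbytes: int) -> bool:
        # Reject the request instead of letting one container
        # exhaust the whole card's memory.
        if self.used + nbytes > self.limit:
            return False  # a real layer would return an out-of-memory error
        self.used += nbytes
        return True

    def free(self, nbytes: int) -> None:
        self.used -= nbytes

# Hypothetical container limited to 4 GiB of a 16 GiB card
alloc = QuotaAllocator(limit_bytes=4 << 30)
print(alloc.malloc(3 << 30))  # True: within quota
print(alloc.malloc(2 << 30))  # False: would exceed the 4 GiB limit
```

Because the interception happens below the framework, TensorFlow or PyTorch inside the container sees an ordinary, smaller GPU and needs no code changes.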

CUDA · GPU virtualization · Kubernetes
0 likes · 16 min read
iQIYI Technical Product Team
May 28, 2021 · Artificial Intelligence

iQIYI GPU Virtual Sharing for AI Inference: Architecture, Isolation, and Scheduling

iQIYI created a custom GPU‑virtual‑sharing system that intercepts CUDA calls to enforce per‑container memory limits, rewrites kernel launches for compute isolation, and integrates with a Kubernetes scheduler extender. This allows multiple AI inference containers to share a single V100 with minimal overhead, more than doubling overall GPU utilization.

AI inference · CUDA · GPU virtualization
0 likes · 16 min read