Practical Experience: Optimizing Elementwise Operators on HyperAI Cloud Compute Platform

The article walks through a step‑by‑step optimization of a simple elementwise addition kernel (C = A + B) on HyperAI's RTX 5090 cloud instance, covering FP32 baseline, vectorized FP32, several FP16 variants, benchmark methodology, performance results, and the reasoning behind thread‑block sizing.


Introduction

HyperAI’s cloud compute platform provides on‑demand high‑performance GPUs at a low price, making it convenient for developers to run heavy AI workloads. This article shares a real‑world experience of optimizing an elementwise addition operator (C = A + B) to approach native PyTorch performance, i.e., the memory‑bandwidth limit of the hardware.

Core Goal and Difficulty

Goal: Transform a naïve elementwise add implementation into a version that fully utilizes the GPU’s memory bandwidth.

Difficulty: Elementwise kernels are memory‑bound, so the real question is how efficiently each issued instruction moves data, not how fast the arithmetic runs.

Compute is not the bottleneck (a single add costs almost nothing on a GPU).

The limiting factors are the instruction‑issue side and the memory‑move side: the kernel must keep the memory system busy without wasting issue slots.

Optimization principle: use the fewest instructions to move the most bytes.
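
To make the principle concrete, the following back‑of‑the‑envelope calculation reproduces the per‑warp figures quoted in the version descriptions below (32 threads per warp multiplied by the per‑thread access width):

# Bytes moved by one warp-level load instruction (32 threads) for each per-thread access width.
WARP = 32
for name, bytes_per_thread in [("FP32 scalar", 4), ("float4", 16),
                               ("FP16 scalar", 2), ("half2", 4), ("128-bit pack", 16)]:
    print(f"{name:12s}: {WARP * bytes_per_thread:3d} B per warp-level load")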

Experiment Environment

All benchmarks were run on a HyperAI instance equipped with:

GPU: NVIDIA RTX 5090 (32 GB VRAM)

RAM: 40 GB

Software: PyTorch 2.8 / CUDA 12.8

The Jupyter environment was used to compile and run the kernels.

Stage 1: FP32 Optimization Series

Version 1 – FP32 Baseline (scalar)

The naïve kernel issues one 32‑bit load per thread, so a warp‑level load instruction moves only 32 × 4 B = 128 bytes. Instruction efficiency is low: the kernel has to issue many instructions to move relatively few bytes.

%%writefile v1_f32.cu
#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void elementwise_add_f32_kernel(float *a, float *b, float *c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        c[idx] = a[idx] + b[idx];
    }
}

void elementwise_add_f32(torch::Tensor a, torch::Tensor b, torch::Tensor c) {
    int N = a.numel();
    int threads_per_block = 256;
    int blocks_per_grid = (N + threads_per_block - 1) / threads_per_block;
    elementwise_add_f32_kernel<<<blocks_per_grid, threads_per_block>>>(a.data_ptr<float>(), b.data_ptr<float>(), c.data_ptr<float>(), N);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { m.def("add", &elementwise_add_f32, "FP32 Add"); }

Version 2 – FP32×4 Vectorized

Uses float4 to force a 128‑bit load per thread, so a single warp‑level instruction now moves 32 × 16 B = 512 bytes (four times the baseline). Instruction efficiency improves fourfold, shifting the bottleneck from instruction issue to memory bandwidth.

%%writefile v2_f32x4.cu
#include <torch/extension.h>
#include <cuda_runtime.h>
#define FLOAT4(value) (reinterpret_cast<float4 *>(&(value))[0])

__global__ void elementwise_add_f32x4_kernel(float *a, float *b, float *c, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = 4 * tid;                  // each thread owns four consecutive floats
    if (idx + 3 < N) {
        float4 reg_a = FLOAT4(a[idx]);  // one 128-bit load replaces four 32-bit loads
        float4 reg_b = FLOAT4(b[idx]);
        float4 reg_c;
        reg_c.x = reg_a.x + reg_b.x;
        reg_c.y = reg_a.y + reg_b.y;
        reg_c.z = reg_a.z + reg_b.z;
        reg_c.w = reg_a.w + reg_b.w;
        FLOAT4(c[idx]) = reg_c;
    } else {
        for (int i = 0; i < 4; ++i) {
            if (idx + i < N) c[idx + i] = a[idx + i] + b[idx + i];
        }
    }
}

void elementwise_add_f32x4(torch::Tensor a, torch::Tensor b, torch::Tensor c) {
    int N = a.numel();
    int threads_per_block = 256 / 4;            // 64 threads, each handling 4 elements
    int blocks_per_grid = (N + 256 - 1) / 256;  // each block still covers 256 elements
    elementwise_add_f32x4_kernel<<<blocks_per_grid, threads_per_block>>>(a.data_ptr<float>(), b.data_ptr<float>(), c.data_ptr<float>(), N);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { m.def("add", &elementwise_add_f32x4, "FP32x4 Add"); }

Stage 2: FP16 Optimization Series

Version 3 – FP16 Baseline (scalar)

Switches to half to halve the total memory traffic, but each warp‑level load now moves only 32 × 2 B = 64 bytes, so the instruction‑issue side becomes the bottleneck again.

%%writefile v3_f16.cu
#include <torch/extension.h>
#include <cuda_fp16.h>

__global__ void elementwise_add_f16_kernel(half *a, half *b, half *c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        c[idx] = __hadd(a[idx], b[idx]);
    }
}

void elementwise_add_f16(torch::Tensor a, torch::Tensor b, torch::Tensor c) {
    int N = a.numel();
    int threads_per_block = 256;
    int blocks_per_grid = (N + threads_per_block - 1) / threads_per_block;
    elementwise_add_f16_kernel<<<blocks_per_grid, threads_per_block>>>(reinterpret_cast<half*>(a.data_ptr<at::Half>()), reinterpret_cast<half*>(b.data_ptr<at::Half>()), reinterpret_cast<half*>(c.data_ptr<at::Half>()), N);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { m.def("add", &elementwise_add_f16, "FP16 Add"); }

Version 4 – FP16 Vectorized (half2)

Uses half2 (4 bytes per thread) and the SIMD __hadd2 instruction, restoring the instruction efficiency of the FP32 baseline (128 bytes per warp‑level load), but still far short of the FP32×4 vectorized version.

%%writefile v4_f16x2.cu
#include <torch/extension.h>
#include <cuda_fp16.h>
#define HALF2(value) (reinterpret_cast<half2 *>(&(value))[0])

__global__ void elementwise_add_f16x2_kernel(half *a, half *b, half *c, int N) {
    int idx = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (idx + 1 < N) {
        half2 reg_a = HALF2(a[idx]);
        half2 reg_b = HALF2(b[idx]);
        half2 reg_c = __hadd2(reg_a, reg_b);
        HALF2(c[idx]) = reg_c;
    } else if (idx < N) {
        c[idx] = __hadd(a[idx], b[idx]);
    }
}

void elementwise_add_f16x2(torch::Tensor a, torch::Tensor b, torch::Tensor c) {
    int N = a.numel();
    int threads_per_block = 256 / 2;
    int blocks_per_grid = (N + 256 - 1) / 256;
    elementwise_add_f16x2_kernel<<<blocks_per_grid, threads_per_block>>>(reinterpret_cast<half*>(a.data_ptr<at::Half>()), reinterpret_cast<half*>(b.data_ptr<at::Half>()), reinterpret_cast<half*>(c.data_ptr<at::Half>()), N);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { m.def("add", &elementwise_add_f16x2, "FP16x2 Add"); }

Version 5 – FP16×8 Unroll

Manually unrolls the loop so that each thread processes eight half values as four half2 registers. For each input array the scheduler can issue four independent 32‑bit loads back‑to‑back, improving instruction‑level parallelism (ILP) and latency hiding.

%%writefile v5_f16x8.cu
#include <torch/extension.h>
#include <cuda_fp16.h>
#define HALF2(value) (reinterpret_cast<half2 *>(&(value))[0])

__global__ void elementwise_add_f16x8_kernel(half *a, half *b, half *c, int N) {
    int idx = 8 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (idx + 7 < N) {
        half2 ra0 = HALF2(a[idx + 0]);
        half2 ra1 = HALF2(a[idx + 2]);
        half2 ra2 = HALF2(a[idx + 4]);
        half2 ra3 = HALF2(a[idx + 6]);
        half2 rb0 = HALF2(b[idx + 0]);
        half2 rb1 = HALF2(b[idx + 2]);
        half2 rb2 = HALF2(b[idx + 4]);
        half2 rb3 = HALF2(b[idx + 6]);
        HALF2(c[idx + 0]) = __hadd2(ra0, rb0);
        HALF2(c[idx + 2]) = __hadd2(ra1, rb1);
        HALF2(c[idx + 4]) = __hadd2(ra2, rb2);
        HALF2(c[idx + 6]) = __hadd2(ra3, rb3);
    } else {
        for (int i = 0; i < 8; ++i) {
            if (idx + i < N) c[idx + i] = __hadd(a[idx + i], b[idx + i]);
        }
    }
}

void elementwise_add_f16x8(torch::Tensor a, torch::Tensor b, torch::Tensor c) {
    int N = a.numel();
    int threads_per_block = 256 / 8;
    int blocks_per_grid = (N + 256 - 1) / 256;
    elementwise_add_f16x8_kernel<<<blocks_per_grid, threads_per_block>>>(reinterpret_cast<half*>(a.data_ptr<at::Half>()), reinterpret_cast<half*>(b.data_ptr<at::Half>()), reinterpret_cast<half*>(c.data_ptr<at::Half>()), N);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { m.def("add", &elementwise_add_f16x8, "FP16x8 Add"); }

Version 6 – FP16×8 Pack (final optimization)

Combines the wide‑load trick of Version 2 with the ILP of Version 5. By reinterpreting the half arrays as float4, a single 128‑bit load or store moves eight half values at once, and because the temporary pack_a / pack_b / pack_c arrays are only accessed with unrolled constant indices, the compiler keeps them entirely in registers.

%%writefile v6_f16x8_pack.cu
#include <torch/extension.h>
#include <cuda_fp16.h>
#define LDST128BITS(value) (reinterpret_cast<float4 *>(&(value))[0])
#define HALF2(value) (reinterpret_cast<half2 *>(&(value))[0])

__global__ void elementwise_add_f16x8_pack_kernel(half *a, half *b, half *c, int N) {
    int idx = 8 * (blockIdx.x * blockDim.x + threadIdx.x);  // each thread owns eight halves (16 bytes)
    half pack_a[8], pack_b[8], pack_c[8];  // unrolled constant-index accesses keep these in registers
    if (idx + 7 < N) {
        LDST128BITS(pack_a[0]) = LDST128BITS(a[idx]);  // one 128-bit load fetches all eight halves
        LDST128BITS(pack_b[0]) = LDST128BITS(b[idx]);
        #pragma unroll
        for (int i = 0; i < 8; i += 2) {
            HALF2(pack_c[i]) = __hadd2(HALF2(pack_a[i]), HALF2(pack_b[i]));  // SIMD add, two halves per instruction
        }
        LDST128BITS(c[idx]) = LDST128BITS(pack_c[0]);  // one 128-bit store writes all eight halves
    } else {
        for (int i = 0; i < 8; ++i) {
            if (idx + i < N) c[idx + i] = __hadd(a[idx + i], b[idx + i]);
        }
    }
}

void elementwise_add_f16x8_pack(torch::Tensor a, torch::Tensor b, torch::Tensor c) {
    int N = a.numel();
    int threads_per_block = 256 / 8;
    int blocks_per_grid = (N + 256 - 1) / 256;
    elementwise_add_f16x8_pack_kernel<<<blocks_per_grid, threads_per_block>>>(reinterpret_cast<half*>(a.data_ptr<at::Half>()), reinterpret_cast<half*>(b.data_ptr<at::Half>()), reinterpret_cast<half*>(c.data_ptr<at::Half>()), N);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { m.def("add", &elementwise_add_f16x8_pack, "FP16x8 Pack Add"); }

Benchmark Methodology

A Python script loads each .cu file with torch.utils.cpp_extension.load, creates FP32 and FP16 tensors, runs a warm‑up phase, then measures the average execution time over 1,000 iterations. Effective bandwidth (GB/s) is computed as 3 * N * element_size / time, the factor of 3 accounting for two reads (A, B) and one write (C). Three data sizes are used to stress different parts of the memory hierarchy (a minimal sketch of the harness follows the list):

Cache‑latency test: 1 M elements (4 MB per FP32 tensor) – the whole working set fits in L2 cache.

L2‑throughput test: 16 M elements (64 MB per FP32 tensor) – the FP32 working set exceeds L2 capacity, forcing VRAM traffic.

VRAM‑bandwidth test: 256 M elements (1 GB per FP32 tensor) – fully memory‑bound.
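
The exact script is not included in the post, so the sketch below is a plausible reconstruction of the harness described above; the module name, warm‑up count, and use of CUDA events are assumptions.

import torch
from torch.utils.cpp_extension import load

# Compile one of the .cu cells written above into a loadable PyTorch extension.
mod = load(name="v2_f32x4", sources=["v2_f32x4.cu"], verbose=False)

def bench(fn, a, b, c, iters=1000, warmup=10):
    for _ in range(warmup):
        fn(a, b, c)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(a, b, c)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average milliseconds per call

N = 16 * 1024 * 1024
a = torch.randn(N, device="cuda", dtype=torch.float32)
b = torch.randn_like(a)
c = torch.empty_like(a)

ms = bench(mod.add, a, b, c)
gbps = 3 * N * a.element_size() / (ms * 1e-3) / 1e9  # two reads + one write
print(f"{ms:.4f} ms   {gbps:.0f} GB/s")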

Results on RTX 5090

Key observations:

At the 1 M element scale, all kernels take ~0.004 ms, which means kernel launch overhead dominates; the measurement is latency‑bound rather than bandwidth‑bound.

For the 16 M element case, the FP32 working set (192 MB across the three tensors) spills out of L2, so the FP32 kernels achieve ~1700 GB/s, close to the card's raw VRAM bandwidth. The FP16 working set still fits in L2, so those kernels reach ~2890 GB/s, and PyTorch's native kernel even exceeds 6800 GB/s, a figure well above physical VRAM bandwidth that is only possible because the data is served from L2, on top of PyTorch's aggressively optimized instruction stream.

At the 256 M element scale, every implementation tops out around 1570–1580 GB/s, approaching the physical GDDR7 bandwidth limit. FP16 runs roughly twice as fast as FP32 (≈1 ms vs. ≈2 ms): both saturate the bus, but FP16 moves half the bytes, which translates directly into a 2× end‑to‑end speedup.
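
Those 256 M timings follow directly from the measured bandwidth and the 3 * N * element_size traffic formula used by the benchmark:

# Cross-check of the 256M-element timings: total traffic = 2 reads + 1 write.
N = 256 * 1024 * 1024
bandwidth = 1575e9  # ~1570-1580 GB/s measured above
for name, elem_size in [("FP32", 4), ("FP16", 2)]:
    total_bytes = 3 * N * elem_size
    print(f"{name}: {total_bytes / 1e9:.2f} GB -> {total_bytes / bandwidth * 1e3:.2f} ms")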

Thread‑Block Size FAQ

The experiments consistently use threads_per_block = 256. The selection follows a four‑step line of reasoning (a small occupancy calculation illustrating it follows the list):

Warp alignment: Block size must be a multiple of 32 to avoid wasted warps.

Occupancy floor (≥ 96%): Ensures enough active warps to hide memory latency.

Scheduling atomicity: Block size should divide the SM’s maximum thread capacity (commonly 1024, 1536, 2048), narrowing candidates to 128, 192, 256, 384, 512.

Register pressure: Large blocks (≥ 512) concentrate more register demand in a single block, making it easier to hit the SM's register budget, which reduces occupancy or forces spills. Hence 128 and 256 are the safe "sweet spots", with 256 being optimal for simple elementwise kernels.
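
Steps 2 and 3 can be sanity‑checked with a deliberately simplified occupancy estimate. The helper below only models the SM's maximum number of resident threads and ignores register and shared‑memory limits; max_threads_per_sm = 1536 is an assumed figure, so query torch.cuda.get_device_properties(0).max_threads_per_multi_processor for the real value on your GPU.

# Simplified occupancy estimate: the only limit modelled is the SM's resident-thread cap.
def thread_occupancy(block_size, max_threads_per_sm=1536):
    resident_blocks = max_threads_per_sm // block_size         # whole blocks that fit by thread count alone
    return resident_blocks * block_size / max_threads_per_sm   # fraction of thread slots actually usable

for bs in (128, 192, 256, 384, 512, 768, 1024):
    print(f"block={bs:4d}  thread-slot occupancy={thread_occupancy(bs):.0%}")

Under this model every candidate that divides the 1,536‑thread cap keeps 100% of the thread slots, while a 1,024‑thread block strands roughly a third of them, which is one reason the candidate list stops well short of the maximum block size.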

Conclusion

The step‑wise optimizations show how vectorized memory accesses, instruction‑level parallelism, and a little type‑punning take a trivial elementwise add kernel from a naïve baseline to the hardware's bandwidth limit. The final ceiling is set by the GPU's physical memory system, but switching to FP16 halves the bytes that must be moved and therefore the runtime, delivering a practical 2× speedup for large‑scale AI workloads.

Appendix – Jupyter Run Example
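
As a textual stand‑in for the original run, here is an illustrative Jupyter cell that compiles the v6 kernel written by the %%writefile cell above and sanity‑checks it against PyTorch's native add; this sketch is an assumption about the workflow, not a capture of the original session.

import torch
from torch.utils.cpp_extension import load

# Compile the final kernel; the first call triggers nvcc, later calls hit the build cache.
mod = load(name="v6_f16x8_pack", sources=["v6_f16x8_pack.cu"], verbose=True)

N = 1024 * 1024 + 3  # deliberately not a multiple of 8, to exercise the scalar tail path
a = torch.randn(N, device="cuda", dtype=torch.float16)
b = torch.randn_like(a)
c = torch.empty_like(a)

mod.add(a, b, c)                      # custom kernel
torch.testing.assert_close(c, a + b)  # compare against PyTorch's native add
print("OK, max abs diff:", (c - (a + b)).abs().max().item())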

Tags: CUDA, benchmark, PyTorch, GPU performance, kernel optimization, FP16, elementwise