When Go Meets GPU: A Hands‑On Guide to Unlocking Thousand‑Fold Compute with CUDA

This article walks Go developers through the fundamentals of GPU architecture and CUDA, demonstrates a complete CGO‑based matrix‑multiplication project, offers performance‑tuning tips such as minimizing PCIe transfers and leveraging shared memory, and presents a PureGo alternative for seamless Go‑GPU integration.

TonyBai

Jan 21, 2026

Why Go Developers Should Care About GPUs

As Moore's law slows, CPU single‑core performance stalls while GPUs provide thousands of lightweight cores for massive parallelism, making them ideal for graphics, video transcoding, AI inference, scientific simulations, and cryptographic hashing.

GPU Architecture and CUDA Programming Model Overview

Sam Burns likens a CPU to a handful of expert specialists and a GPU to a disciplined army. In CUDA, a kernel is a function that executes on the GPU. Launching a kernel starts thousands of threads, and each thread combines its block index (blockIdx) with its index inside the block (threadIdx) to pick the distinct data element it processes.

1. Thread Model – From Thread to Grid

Thread : the smallest execution unit, similar to a CPU thread but extremely lightweight.

Block : a group of threads that share fast on‑chip shared memory and can synchronise with one another by calling __syncthreads().

Grid : the collection of all blocks that run the same kernel.
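The mapping from thread, block, and grid to a data element can be emulated on the CPU. The following is a minimal Go sketch, not CUDA code: the type dim2 and the helpers ceilDiv and globalIndex are hypothetical names introduced only for illustration, and the 16 × 16 block size and 1024‑wide matrix are the values this article's example uses.

```go
package main

import "fmt"

// dim2 is a hypothetical stand-in for the x and y fields of CUDA's dim3.
type dim2 struct{ x, y int }

// ceilDiv returns how many blocks of the given size are needed to cover n elements.
func ceilDiv(n, block int) int { return (n + block - 1) / block }

// globalIndex mirrors the usual kernel index arithmetic:
// row = blockIdx.y*blockDim.y + threadIdx.y
// col = blockIdx.x*blockDim.x + threadIdx.x
func globalIndex(blockIdx, blockDim, threadIdx dim2) (row, col int) {
	row = blockIdx.y*blockDim.y + threadIdx.y
	col = blockIdx.x*blockDim.x + threadIdx.x
	return row, col
}

func main() {
	const width = 1024
	blockDim := dim2{16, 16}
	grid := dim2{ceilDiv(width, blockDim.x), ceilDiv(width, blockDim.y)}
	fmt.Printf("grid = %dx%d blocks, %d threads in total\n",
		grid.x, grid.y, grid.x*grid.y*blockDim.x*blockDim.y)

	// Thread (x=3, y=5) of block (x=2, y=1) handles this matrix element.
	row, col := globalIndex(dim2{2, 1}, blockDim, dim2{3, 5})
	fmt.Printf("row = %d, col = %d\n", row, col)
}
```

Because the grid size is rounded up, a width that is not a multiple of the block size produces some threads that fall outside the matrix, which is why kernels usually guard their work with a bounds check.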

2. Memory Hierarchy – Speed vs. Capacity

Registers : fastest, private to each thread, limited in number.

Shared Memory : L1‑cache‑level speed, private to a block, crucial for reducing global memory traffic.

Global Memory : large (e.g., 24 GB GDDR6X) but high latency.

Constant Memory : cached read‑only memory for broadcasting values.

Efficient CUDA code keeps data in registers and shared memory as much as possible.

Go + CUDA Hands‑On: Matrix Multiplication

The example builds a Go project that offloads a 1024 × 1024 matrix multiplication to the GPU via CGO.

1. Project Layout

go-cuda-cgo-demo/
├── main.go       # Go entry point (CGO bridge)
├── matrix.cu     # CUDA kernel (C++ code running on GPU)
└── matrix.h     # C header exposing the kernel wrapper

2. CUDA Kernel (matrix.cu)

#include <cuda_runtime.h>
#include <stdio.h>

// Each thread computes one element of C
__global__ void matrixMulKernel(float *a, float *b, float *c, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0;
        for (int k = 0; k < width; k++) {
            sum += a[row * width + k] * b[k * width + col];
        }
        c[row * width + col] = sum;
    }
}

extern "C" {
    // Wrapper called from Go
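    void runMatrixMul(float *h_a, float *h_b, float *h_c, int width) {
        // size_t avoids int overflow in the byte count for large matrices
        size_t size = (size_t)width * width * sizeof(float);
        float *d_a, *d_b, *d_c;
        // Allocate device buffers
        cudaMalloc((void**)&d_a, size);
        cudaMalloc((void**)&d_b, size);
        cudaMalloc((void**)&d_c, size);
        // Copy both inputs from host to device
        cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
        // 16x16 = 256 threads per block; round the grid up to cover the whole matrix
        dim3 threadsPerBlock(16, 16);
        dim3 numBlocks((width + threadsPerBlock.x - 1) / threadsPerBlock.x,
                      (width + threadsPerBlock.y - 1) / threadsPerBlock.y);
        matrixMulKernel<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, width);
        // A kernel launch returns no status, so query the last error explicitly
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        }
        // This blocking copy also waits for the kernel to finish
        cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
    }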
}

3. Header (matrix.h)

#ifndef MATRIX_H
#define MATRIX_H

void runMatrixMul(float *a, float *b, float *c, int width);

#endif

4. Go Main Program (main.go)

package main

/*
#cgo LDFLAGS: -L. -lmatrix -L/usr/local/cuda/lib64 -lcudart
#include "matrix.h"
*/
import "C"
import (
    "fmt"
    "math/rand"
    "time"
    "unsafe"
)

const width = 1024 // 1024×1024 matrix (1 M elements)

func main() {
    size := width * width
    h_a := make([]float32, size)
    h_b := make([]float32, size)
    h_c := make([]float32, size)
    rand.Seed(time.Now().UnixNano())
    for i := 0; i < size; i++ {
        h_a[i] = rand.Float32()
        h_b[i] = rand.Float32()
    }
    fmt.Printf("Starting Matrix Multiplication (%dx%d) on GPU...\n", width, width)
    start := time.Now()
    C.runMatrixMul((*C.float)(unsafe.Pointer(&h_a[0])),
        (*C.float)(unsafe.Pointer(&h_b[0])),
        (*C.float)(unsafe.Pointer(&h_c[0])),
        C.int(width))
    elapsed := time.Since(start)
    fmt.Printf("Done. Time elapsed: %v\n", elapsed)
    fmt.Printf("Result[0][0] = %f\n", h_c[0])
}
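The program prints Result[0][0] but never checks it. One way to gain confidence in the GPU output is to recompute a few elements on the CPU and compare within a tolerance. The sketch below is an illustration under assumptions: cpuElement and closeEnough are hypothetical helpers that are not part of the original project, and the tolerance is a guess that would need tuning for real data.

```go
package main

import (
	"fmt"
	"math"
)

// cpuElement computes C[row][col] of A×B on the CPU for row-major square
// matrices, accumulating in float64 to limit rounding error.
func cpuElement(a, b []float32, width, row, col int) float32 {
	var sum float64
	for k := 0; k < width; k++ {
		sum += float64(a[row*width+k]) * float64(b[k*width+col])
	}
	return float32(sum)
}

// closeEnough reports whether got is within a relative tolerance of want.
func closeEnough(got, want float32, relTol float64) bool {
	diff := math.Abs(float64(got) - float64(want))
	return diff <= relTol*math.Max(1, math.Abs(float64(want)))
}

func main() {
	// A 2x2 example: A = [[1 2] [3 4]], B = [[5 6] [7 8]].
	a := []float32{1, 2, 3, 4}
	b := []float32{5, 6, 7, 8}
	want := cpuElement(a, b, 2, 0, 0) // 1*5 + 2*7
	fmt.Println("C[0][0] =", want)
	fmt.Println("matches:", closeEnough(19.00001, want, 1e-4))
}
```

In main.go, the same idea would compare h_c[0] against cpuElement(h_a, h_b, width, 0, 0). Small differences are expected because the kernel accumulates in float32 and in a different order than the CPU.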

5. Build & Run

Compile the CUDA code:

nvcc -c matrix.cu -o matrix.o
ar rcs libmatrix.a matrix.o

Build the Go binary linking the static library:

go build -o gpu-cgo-demo main.go

Execute:

./gpu-cgo-demo

Typical output:

Starting Matrix Multiplication (1024x1024) on GPU...
Done. Time elapsed: 611.815451ms
Result[0][0] = 262.440918

Performance Tuning – From Working to Optimal

Even after the code runs, Sam recommends profiling with NVIDIA Nsight Systems (nsys) to spot bottlenecks. The dominant slowdown is often PCIe data transfer.

Reduce Transfers : Move all data to the GPU once, run as many kernels as needed on it while it stays in device memory, then copy back only the final results.

Leverage Shared Memory : Implement tiling so each block loads sub‑matrices into shared memory, dramatically cutting global memory bandwidth usage.
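The effect of tiling on global memory traffic can be illustrated with a CPU emulation. The following Go sketch is not GPU code: it treats every read of the input slices as one global‑memory load and uses two small buffers as a stand‑in for shared memory. The function names and the load‑counting model are assumptions made for illustration, and real hardware caches make the actual numbers less clear‑cut.

```go
package main

import "fmt"

// naive multiplies row-major n×n matrices and counts every read of a or b,
// standing in for one global-memory load per read.
func naive(a, b []float32, n int) (c []float32, loads int) {
	c = make([]float32, n*n)
	for row := 0; row < n; row++ {
		for col := 0; col < n; col++ {
			var sum float32
			for k := 0; k < n; k++ {
				sum += a[row*n+k] * b[k*n+col]
				loads += 2
			}
			c[row*n+col] = sum
		}
	}
	return c, loads
}

// tiled handles one tile×tile output block at a time. Each step copies a tile
// of a and a tile of b into small buffers (the shared-memory stand-in), and
// only those copies count as global loads. n must be a multiple of tile.
func tiled(a, b []float32, n, tile int) (c []float32, loads int) {
	c = make([]float32, n*n)
	ta := make([]float32, tile*tile)
	tb := make([]float32, tile*tile)
	for br := 0; br < n; br += tile {
		for bc := 0; bc < n; bc += tile {
			for t := 0; t < n; t += tile {
				// Load phase: one read per element of each tile.
				for i := 0; i < tile; i++ {
					for j := 0; j < tile; j++ {
						ta[i*tile+j] = a[(br+i)*n+t+j]
						tb[i*tile+j] = b[(t+i)*n+bc+j]
						loads += 2
					}
				}
				// Compute phase: works only on the buffered tiles.
				for i := 0; i < tile; i++ {
					for j := 0; j < tile; j++ {
						for k := 0; k < tile; k++ {
							c[(br+i)*n+bc+j] += ta[i*tile+k] * tb[k*tile+j]
						}
					}
				}
			}
		}
	}
	return c, loads
}

func main() {
	const n, tile = 64, 16
	a := make([]float32, n*n)
	b := make([]float32, n*n)
	for i := range a {
		a[i] = float32(i % 7)
		b[i] = float32(i % 5)
	}
	_, naiveLoads := naive(a, b, n)
	_, tiledLoads := tiled(a, b, n, tile)
	fmt.Printf("naive: %d loads, tiled: %d loads, ratio %d\n",
		naiveLoads, tiledLoads, naiveLoads/tiledLoads)
}
```

Under this model the naive version performs 2·n³ loads while the tiled version performs 2·n³ / tile, so a 16 × 16 tile cuts the counted loads by a factor of 16. In a CUDA kernel the two phases would be separated by __syncthreads() so that no thread reads a tile before it is fully loaded.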

Appendix: PureGo – A CGO‑Free Alternative

PureGo lets Go load a shared library (.so/.dll) at runtime, avoiding CGO’s compilation overhead and external toolchain dependencies.

1. Build a Shared CUDA Library

nvcc -shared -Xcompiler -fPIC matrix.cu -o libmatrix.so

2. PureGo Main Program

package main

import (
    "fmt"
    "math/rand"
    "runtime"
    "time"
    "github.com/ebitengine/purego"
)

const width = 1024

func main() {
    libMatrix, err := purego.Dlopen("libmatrix.so", purego.RTLD_NOW|purego.RTLD_GLOBAL)
    if err != nil { panic(err) }
    // Load CUDA runtime as well
    _, err = purego.Dlopen("/usr/local/cuda/lib64/libcudart.so", purego.RTLD_NOW|purego.RTLD_GLOBAL)
    if err != nil { panic(err) }
    // The C parameter is a 32-bit int, so declare it as int32 rather than Go's int
    var runMatrixMul func(a, b, c *float32, w int32)
    purego.RegisterLibFunc(&runMatrixMul, libMatrix, "runMatrixMul")
    size := width * width
    h_a := make([]float32, size)
    h_b := make([]float32, size)
    h_c := make([]float32, size)
    rand.Seed(time.Now().UnixNano())
    for i := 0; i < size; i++ { h_a[i] = rand.Float32(); h_b[i] = rand.Float32() }
    fmt.Println("Starting Matrix Multiplication via PureGo...")
    start := time.Now()
    runMatrixMul(&h_a[0], &h_b[0], &h_c[0], width)
    fmt.Printf("Done. Time: %v\n", time.Since(start))
    fmt.Printf("Result[0][0] = %f\n", h_c[0])
    // Keep slices alive for the duration of the call
    runtime.KeepAlive(h_a)
    runtime.KeepAlive(h_b)
    runtime.KeepAlive(h_c)
}

3. Run the PureGo Version

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:.
CGO_ENABLED=0 go run main.go

Typical output:

Starting Matrix Multiplication via PureGo...
Done. Time: 584.397195ms
Result[0][0] = 260.088806

Advantages :

Fast compilation – no CGO overhead.

Zero external build dependencies – only the shared library is needed at runtime, which is handy for lightweight CI/CD pipelines.

Caveats : PureGo bypasses some of CGO’s type‑safety checks; developers must manually ensure memory stays alive (e.g., using runtime.KeepAlive).
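Because the native function receives only raw pointers and a width, it cannot tell whether the Go slices are actually large enough. A small guard on the Go side can catch mismatches before they turn into out‑of‑bounds reads or writes in native code. The sketch below assumes a hypothetical helper named checkSquare; it is not part of purego or of the original project.

```go
package main

import (
	"errors"
	"fmt"
)

// checkSquare is a hypothetical guard: it verifies that every slice holds
// exactly width*width elements before first-element pointers are handed to
// native code, which has no way to check lengths itself.
func checkSquare(width int, mats ...[]float32) error {
	if width <= 0 {
		return errors.New("width must be positive")
	}
	for i, m := range mats {
		if len(m) != width*width {
			return fmt.Errorf("matrix %d has %d elements, want %d", i, len(m), width*width)
		}
	}
	return nil
}

func main() {
	const width = 4
	a := make([]float32, width*width)
	b := make([]float32, width*width)
	c := make([]float32, width*width-1) // deliberately one element short
	if err := checkSquare(width, a, b, c); err != nil {
		fmt.Println("refusing to call native code:", err)
	}
}
```

The same check applies equally to the CGO version, where &h_a[0] would also panic on an empty slice.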

Conclusion

Combining Go with CUDA opens a high‑performance computing path for Go applications, allowing the familiar Go development experience to offload heavy numeric work to GPUs. Whether using CGO or the PureGo approach, developers can achieve substantial speedups for data‑intensive tasks such as image processing, recommendation algorithms, or cryptographic workloads.
