
Can Go Harness SIMD for High‑Performance Computing? A Deep Dive

This article examines SIMD (Single Instruction Multiple Data) technology, its relevance to Go’s performance goals, the challenges of integrating SIMD into Go’s design, current standard‑library limitations, third‑party libraries, compiler support, and practical assembly examples, concluding with prospects for future Go SIMD adoption.

BirdNest Tech Talk

Background and Motivation

SIMD (Single Instruction Multiple Data) lets a single instruction operate on several data elements simultaneously. Modern CPUs expose SIMD extensions such as Intel SSE/AVX and ARM NEON, which can accelerate image processing, machine‑learning kernels, and scientific simulations by orders of magnitude.

Go’s performance goals

Go emphasizes simplicity, fast compilation, and efficient concurrency, but its numeric kernels lag behind C/C++ SIMD implementations. Examples of what SIMD makes possible include simdjson (https://github.com/simdjson/simdjson), billion‑integer‑per‑second decoding (https://people.csail.mit.edu/jshun/6886-s19/lectures/lecture19-1.pdf), vectorized quicksort (https://opensource.googleblog.com/2022/06/Vectorized%20and%20performance%20portable%20Quicksort.html), and Hyperscan (https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-hyperscan.html); several Go‑specific case studies (https://sourcegraph.com/blog/slow-to-simd, https://gorse.io/posts/avx512-in-golang.html#convert-assembly) apply the same techniques in Go.

Basic SIMD concepts

A scalar add processes one pair of numbers per instruction; a SIMD add can process 4, 8, or more pairs in a single instruction, increasing throughput roughly in proportion to the vector width.

Arithmetic: add, sub, mul, div

Logical: and, or, xor, not

Data movement: load, store, shuffle

Comparison: element‑wise compare → mask

Special: sqrt, abs, min, max

Common SIMD instruction sets

Intel SIMD extensions

MMX (1996) – 8×64‑bit integer registers (MM0‑MM7), no floating‑point.

SSE (1999) – 8×128‑bit XMM registers for single‑precision floats; SSE2 (2001) extended them to integer and double‑precision operations.

AVX (2011) – 16×256‑bit YMM registers, widening floating‑point vectors to 256 bits and introducing three‑operand VEX encoding.

AVX‑512 (2016) – 32×512‑bit ZMM registers plus eight opmask registers for per‑lane predication and broadcast operations, widely used in HPC and AI.

ARM SIMD extensions

NEON (2005) – 16×128‑bit Q registers, integer and single‑precision float; AArch64 extends this to 32 vector registers and adds double‑precision support.

SVE (2016) – Scalable vector length from 128‑bit to 2048‑bit, predicate registers for conditional execution, aimed at HPC/ML.

Compiler built‑ins and auto‑vectorization

GCC, Clang and MSVC expose intrinsics that map directly to SIMD instructions, allowing developers to write portable vector code without assembly. Compilers can also auto‑vectorize loops when invoked with appropriate flags, e.g.:

gcc -O3 -mavx2 -o program program.c

SIMD support in Go

Standard library

The Go standard library provides no SIMD APIs, and the gc compiler does not perform auto‑vectorization. The long‑standing discussion in issue #67520 (https://github.com/golang/go/issues/67520) shows ongoing debate, often focusing on build‑tag‑based workarounds.

Third‑party libraries

kelindar/simd (https://github.com/kelindar/simd) generates vectorized math functions with clang’s auto‑vectorizer and emits them as Go PLAN9 assembly. It currently ships AVX2 kernels; AVX‑512 and SVE back‑ends are straightforward to add.

sum := simd.SumFloat32s([]float32{1, 2, 3, 4, 5})

alivanz/go-simd (https://github.com/alivanz/go-simd) wraps ARM NEON intrinsics for Go. The example below creates two 8‑element int8 vectors, adds and multiplies them with NEON instructions, and prints the results.

package main

import (
    "log"
    "github.com/alivanz/go-simd/arm"
    "github.com/alivanz/go-simd/arm/neon"
)

func main() {
    var a, b arm.Int8X8
    var add, mul arm.Int16X8
    for i := 0; i < 8; i++ {
        a[i] = arm.Int8(i)
        b[i] = arm.Int8(i * i)
    }
    log.Printf("a = %+v", a)
    log.Printf("b = %+v", b)
    neon.VaddlS8(&add, &a, &b)
    neon.VmullS8(&mul, &a, &b)
    log.Printf("add = %+v", add)
    log.Printf("mul = %+v", mul)
}

pehringer/simd (https://github.com/pehringer/simd) implements arithmetic, bitwise, min/max, and other primitives directly in Go assembly. Benchmarks on AMD64 and ARM64 report 100%–400% speedups over scalar Go code.

Go assembly for SIMD

Go permits hand‑written assembly that invokes SIMD instructions. The following AVX‑based vector‑addition routine demonstrates alignment handling, loop unrolling, and a remainder path.

#include "textflag.h"

// Simple AVX vector addition in Go assembly:
// func add(a, b, result *float64, len int)
TEXT ·add(SB), NOSPLIT, $0-32
    MOVQ a+0(FP), DI          // load pointer to a
    MOVQ b+8(FP), SI          // load pointer to b
    MOVQ result+16(FP), DX    // load pointer to result
    MOVQ len+24(FP), CX       // length in elements
    TESTQ CX, CX
    JZ done
    MOVQ CX, R8               // preserve original length
    SHRQ $2, CX               // CX = length/4 (iterations of 4‑wide vectors)
    JZ remainder
    XORQ R9, R9               // index = 0
loop:
    VMOVUPD (DI)(R9*8), Y0    // load 4 doubles from a
    VMOVUPD (SI)(R9*8), Y1    // load 4 doubles from b
    VADDPD Y0, Y1, Y0         // add
    VMOVUPD Y0, (DX)(R9*8)    // store result
    ADDQ $4, R9
    DECQ CX
    JNZ loop
remainder:
    ANDQ $3, R8               // leftover element count (0‑3)
    JZ done
    // remainder handling omitted for brevity
done:
    VZEROUPPER                // clear upper YMM state before returning
    RET

Because the routine uses unaligned moves (VMOVUPD), misaligned buffers will not fault, but aligning the three buffers to 32 bytes (the YMM register width) still helps reach peak throughput.

Conclusion

Go’s native SIMD support is limited, but the ecosystem supplies viable alternatives: auto‑generated assembly libraries (kelindar/simd), NEON wrappers for ARM (alivanz/go-simd), pure‑assembly kernels (pehringer/simd), and custom Go assembly routines. Ongoing improvements to the gc compiler and potential standard‑library SIMD APIs (as discussed in issue #67520) are expected to broaden the performance envelope for compute‑heavy Go workloads.

Tags: Performance, Go Assembly, SIMD, Vectorization, Third‑party libraries
Written by BirdNest Tech Talk

Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.
