Unlock Go’s New SIMD API: Boost Performance with GOEXPERIMENT=simd
This article explains the motivation for adding SIMD support to Go and the two‑level design of the experimental simd/archsimd package. It then walks through configuration step by step, presents code examples for common data‑processing tasks, and reports benchmark results showing up to nearly nine‑fold speedups with no extra memory allocations.
Environment and Version
Go 1.26rc2 (Windows/amd64) on a 13th Gen Intel(R) Core(TM) i5‑1335U CPU.
Why SIMD in Go and How It Was Introduced
SIMD is essential for high‑performance computing. Go historically required hand‑written assembly for SIMD, which was hard to maintain, hindered compiler optimizations, and was not portable across architectures. Community issues (#35307, #53171, #64634, #67520) requested a language‑level SIMD API without language changes.
Two‑Level Design Approach
Architecture‑specific low‑level API (simd/archsimd)
Similar to the syscall package, providing one‑to‑one mappings to machine instructions.
Each architecture can define its own operations, prioritising performance.
High‑level portable vector API (planned)
Built on top of the low‑level API, similar to the os package.
Provides a unified, safe interface for most data‑processing and AI workloads.
Design philosophy: Most code should use the high‑level API; only rare architecture‑specific optimisations should “sink” to archsimd.
Low‑Level API Goals
Expressiveness: Cover the majority of useful hardware operations.
Relative ease of use: Keep code readable for developers without deep hardware knowledge.
Best‑effort portability: Provide a unified API when an operation is supported on multiple platforms.
Building block for high‑level API: Serve as the implementation foundation for future portable vector APIs.
Current Status and Roadmap
Go 1.26 (current): GOEXPERIMENT=simd enables the experimental simd/archsimd package for AMD64.
Short‑term: Extend support to ARM64 (NEON/SVE) and RISC‑V.
Long‑term: Design and implement a scalable high‑level vector API, eventually supporting matrix extensions such as ARM SME and Intel AMX.
Practical Code Examples and Configuration
Enabling SIMD in VS Code
Add the following to .vscode/settings.json:
{
"go.toolsEnvVars": {"GOEXPERIMENT": "simd"},
"go.testEnvVars": {"GOEXPERIMENT": "simd"},
"terminal.integrated.env.windows": {"GOEXPERIMENT": "simd"}
}

Command‑line activation:
# Windows PowerShell
$env:GOEXPERIMENT="simd"; go test -bench=. -benchmem -run=^$
# Linux / macOS
GOEXPERIMENT=simd go test -bench=. -benchmem -run=^$

Core Code Samples – Scalar vs SIMD
Scenario 1: Cosine Similarity
Scalar version:
func CosineSimilarityScalar(a, b []float32) float32 {
var dot, normA, normB float32
for i := range a {
ai, bi := a[i], b[i]
dot += ai * bi
normA += ai * ai
normB += bi * bi
}
return dot / (float32(math.Sqrt(float64(normA))) * float32(math.Sqrt(float64(normB))))
}

SIMD version (using archsimd):
import "simd/archsimd"
func CosineSimilaritySIMD(a, b []float32) float32 {
if len(a) != len(b) { panic("vectors must have same length") }
var dotVec, normAVec, normBVec archsimd.Float32x8
i := 0
for ; i <= len(a)-8; i += 8 {
va := archsimd.LoadFloat32x8Slice(a[i:])
vb := archsimd.LoadFloat32x8Slice(b[i:])
dotVec = dotVec.Add(va.Mul(vb))
normAVec = normAVec.Add(va.Mul(va))
normBVec = normBVec.Add(vb.Mul(vb))
}
// Horizontal reductions
dotSum1 := dotVec.AddPairsGrouped(dotVec)
normASum1 := normAVec.AddPairsGrouped(normAVec)
normBSum1 := normBVec.AddPairsGrouped(normBVec)
dotSum2 := dotSum1.AddPairsGrouped(dotSum1)
normASum2 := normASum1.AddPairsGrouped(normASum1)
normBSum2 := normBSum1.AddPairsGrouped(normBSum1)
var sums [3][8]float32
dotSum2.Store(&sums[0])
normASum2.Store(&sums[1])
normBSum2.Store(&sums[2])
dotProduct := sums[0][0] + sums[0][4]
normA := sums[1][0] + sums[1][4]
normB := sums[2][0] + sums[2][4]
// Tail handling
for ; i < len(a); i++ {
ai, bi := a[i], b[i]
dotProduct += ai * bi
normA += ai * ai
normB += bi * bi
}
if normA == 0 || normB == 0 { return 0 }
return dotProduct / (float32(math.Sqrt(float64(normA))) * float32(math.Sqrt(float64(normB))))
}

Scenario 2: Mean and Standard Deviation
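The SIMD code in this scenario (and in the later summation and dot‑product scenarios) calls a horizontalSum helper that the article never shows. A plausible definition, assuming the Go 1.26 GOEXPERIMENT=simd toolchain and the same AddPairsGrouped/Store reduction used in the cosine‑similarity example above; this is a sketch, not a confirmed part of the source:

```go
// horizontalSum reduces an 8-lane vector to a single float32.
// Not shown in the original article; mirrors the AddPairsGrouped
// reduction from the cosine-similarity example.
func horizontalSum(v archsimd.Float32x8) float32 {
	// Two rounds of pairwise adds leave the totals of each
	// 128-bit half in lanes 0 and 4.
	s := v.AddPairsGrouped(v)
	s = s.AddPairsGrouped(s)
	var out [8]float32
	s.Store(&out)
	return out[0] + out[4]
}
```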
Scalar version:
func ScalarMeanStd(data []float32) (mean, std float32) {
var sum float32
for _, v := range data { sum += v }
mean = sum / float32(len(data))
var sumSq float32
for _, v := range data {
diff := v - mean
sumSq += diff * diff
}
std = float32(math.Sqrt(float64(sumSq / float32(len(data)))))
return
}

SIMD version:
func SimdMeanStd(data []float32) (mean, std float32) {
n := len(data)
var sumVec archsimd.Float32x8
i := 0
for ; i <= n-8; i += 8 {
v := archsimd.LoadFloat32x8Slice(data[i:])
sumVec = sumVec.Add(v)
}
sum := horizontalSum(sumVec)
for ; i < n; i++ { sum += data[i] }
mean = sum / float32(n)
broadcastMean := archsimd.BroadcastFloat32x8(mean)
var varianceVec archsimd.Float32x8
i = 0
for ; i <= n-8; i += 8 {
v := archsimd.LoadFloat32x8Slice(data[i:])
diff := v.Sub(broadcastMean)
varianceVec = varianceVec.Add(diff.Mul(diff))
}
sumSq := horizontalSum(varianceVec)
for ; i < n; i++ {
diff := data[i] - mean
sumSq += diff * diff
}
variance := sumSq / float32(n)
std = float32(math.Sqrt(float64(variance)))
return
}

Scenario 3: Byte‑Array Comparison
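To make the mask logic in this scenario's SIMD version concrete, here is a plain‑Go model of what Equal(...).ToBits() computes: one bit per byte lane, so 32 equal bytes yield the mask 0xFFFFFFFF. The function name byteMask32 is hypothetical, used only for illustration:

```go
package main

import "fmt"

// byteMask32 models the 32-lane SIMD comparison: it compares two
// 32-byte chunks and packs one bit per lane into a uint32 -- the
// same shape the Equal(...).ToBits() check expects.
func byteMask32(a, b []byte) uint32 {
	var mask uint32
	for lane := 0; lane < 32; lane++ {
		if a[lane] == b[lane] {
			mask |= 1 << lane
		}
	}
	return mask
}

func main() {
	x := make([]byte, 32)
	y := make([]byte, 32)
	fmt.Printf("equal chunks:     %#x\n", byteMask32(x, y)) // all 32 bits set
	y[5] = 1
	fmt.Printf("one lane differs: %#x\n", byteMask32(x, y)) // bit 5 cleared
}
```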
Scalar version:
func ScalarByteCompare(a, b []byte) bool {
if len(a) != len(b) { return false }
for i := range a {
if a[i] != b[i] { return false }
}
return true
}

SIMD version:
func SimdByteCompare(a, b []byte) bool {
if len(a) != len(b) { return false }
i := 0
for ; i <= len(a)-32; i += 32 {
va := archsimd.LoadUint8x32Slice(a[i:])
vb := archsimd.LoadUint8x32Slice(b[i:])
if va.Equal(vb).ToBits() != 0xFFFFFFFF { return false }
}
for ; i < len(a); i++ {
if a[i] != b[i] { return false }
}
return true
}

Scenario 4: Array Summation
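One subtlety worth knowing for this scenario: the SIMD version accumulates into 8 independent lanes, so floating‑point rounding can differ slightly from the strict left‑to‑right scalar sum. A plain‑Go model of the 8‑accumulator order (sum8Lanes is a hypothetical name for illustration):

```go
package main

import "fmt"

// sum8Lanes models the SIMD summation order: eight independent
// accumulators over strided elements, combined at the end, plus a
// scalar tail loop for the leftover elements.
func sum8Lanes(data []float32) float32 {
	var acc [8]float32
	i := 0
	for ; i <= len(data)-8; i += 8 {
		for lane := 0; lane < 8; lane++ {
			acc[lane] += data[i+lane]
		}
	}
	var sum float32
	for _, a := range acc {
		sum += a
	}
	for ; i < len(data); i++ { // scalar tail
		sum += data[i]
	}
	return sum
}

func main() {
	data := make([]float32, 1000)
	for i := range data {
		data[i] = float32(i) * 0.001
	}
	var scalar float32
	for _, v := range data {
		scalar += v
	}
	// Close, but not guaranteed bit-identical in general.
	fmt.Println(sum8Lanes(data), scalar)
}
```

This accumulation‑order difference is exactly why the tolerance‑based testing recommended later in the article matters.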
Scalar version:
func ScalarSum(data []float32) float32 {
var sum float32
for _, v := range data { sum += v }
return sum
}

SIMD version:
func SimdSum(data []float32) float32 {
var sumVec archsimd.Float32x8
i := 0
for ; i <= len(data)-8; i += 8 {
v := archsimd.LoadFloat32x8Slice(data[i:])
sumVec = sumVec.Add(v)
}
sum := horizontalSum(sumVec)
for ; i < len(data); i++ { sum += data[i] }
return sum
}

Scenario 5: Vector Dot Product
Scalar version:
func ScalarDotProduct(a, b []float32) float32 {
var dot float32
for i := range a { dot += a[i] * b[i] }
return dot
}

SIMD version:
func SimdDotProduct(a, b []float32) float32 {
var dotVec archsimd.Float32x8
i := 0
for ; i <= len(a)-8; i += 8 {
va := archsimd.LoadFloat32x8Slice(a[i:])
vb := archsimd.LoadFloat32x8Slice(b[i:])
dotVec = dotVec.Add(va.Mul(vb))
}
dot := horizontalSum(dotVec)
for ; i < len(a); i++ { dot += a[i] * b[i] }
return dot
}

Performance Overview
Single‑pair cosine similarity (384‑dim): scalar 203.1 ns/op → SIMD 156.7 ns/op (~1.3×), 0 B allocated.
Batch cosine similarity (1000 × 384‑dim): scalar 250,381 ns/op → SIMD 167,838 ns/op (~1.5×), 0 B allocated.
Mean & standard deviation (1024 float32): scalar 3,363 ns/op → SIMD 1,778 ns/op (~1.9×), 0 B allocated.
Byte‑array comparison (256 bytes): scalar 280.7 ns/op → SIMD 31.75 ns/op (~8.8×), 0 B allocated.
Array summation (1024 float32): scalar 1,205 ns/op → SIMD 432 ns/op (~2.8×), 0 B allocated.
Vector dot product (384‑dim): scalar 178.5 ns/op → SIMD 122.3 ns/op (~1.5×), 0 B allocated.
Key observations:
All SIMD implementations allocate zero additional memory; speedup comes purely from parallel computation.
Speedup grows with the degree of parallelism; byte‑array comparison sees the highest gain because 32 bytes are processed per iteration.
Performance depends on how well the data length aligns with the SIMD width; leftover elements handled by the scalar tail loop can reduce overall gains.
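The alignment point can be made concrete with a little arithmetic: with 8‑lane float32 vectors, a slice of length n splits into n/8 full vector iterations plus n%8 scalar tail elements. A quick illustration (vectorSplit is a hypothetical helper, not part of any API):

```go
package main

import "fmt"

// vectorSplit reports how a slice of length n divides into full
// vector iterations and scalar tail elements for a given lane count.
func vectorSplit(n, lanes int) (vecIters, tail int) {
	return n / lanes, n % lanes
}

func main() {
	for _, n := range []int{1024, 384, 387, 5} {
		v, t := vectorSplit(n, 8)
		fmt.Printf("n=%4d -> %3d vector iterations, %d tail elements\n", n, v, t)
	}
}
```

The benchmark sizes above (1024, 384) divide evenly by 8, which is the best case; a length like 387 leaves 3 elements to the slower scalar tail, and very short slices run entirely in scalar code.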
Current Limitations and Future Outlook
API Gaps
Missing high‑level reduction primitives such as ReduceSum, ReduceMax, etc.
Lack of domain‑specific instructions like Gather/Scatter, VAESENC, VPTERNLOGD.
Architecture support currently limited to AMD64; ARM64 (NEON/SVE) and RISC‑V are under development.
Practical Advice for Developers
Adopt gradually: use build constraints (e.g., //go:build go1.26 && goexperiment.simd && amd64) or runtime detection to fall back to scalar code when SIMD is unavailable.
Validate correctness: write tests that compare SIMD results against scalar implementations within an acceptable tolerance (e.g., 1e‑6).
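Both pieces of advice combine into a small pattern: keep the scalar implementation as the default, let a build‑constrained file swap in the SIMD version, and test the two against each other with a tolerance. The sketch below is self‑contained, so sumFast is just a stand‑in stub; in a real project the reassignment would live in a file guarded by a constraint like //go:build goexperiment.simd && amd64:

```go
package main

import (
	"fmt"
	"math"
)

// Sum is a package-level function variable: scalar by default, so
// the package builds on any platform. A build-constrained file would
// reassign it in an init function when SIMD is available.
var Sum = sumScalar

func sumScalar(data []float32) float32 {
	var s float32
	for _, v := range data {
		s += v
	}
	return s
}

// sumFast stands in for a SIMD implementation in this sketch.
func sumFast(data []float32) float32 { return sumScalar(data) }

// almostEqual is the tolerance check recommended for validating
// SIMD results against the scalar reference.
func almostEqual(a, b, tol float32) bool {
	return math.Abs(float64(a-b)) <= float64(tol)
}

func main() {
	data := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9}
	Sum = sumFast // what the build-constrained init would do
	fmt.Println(almostEqual(Sum(data), sumScalar(data), 1e-6)) // true
}
```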
Conclusion
The experimental simd/archsimd package in Go 1.26rc2 demonstrates that Go can achieve substantial performance gains—ranging from 1.3× to nearly 9×—without any extra memory allocations. While the API is still experimental and lacks some advanced reduction operations and broader architecture support, it provides a clear migration path from scalar to vectorised code. Future releases adding more architectures and a high‑level portable vector API will make Go a stronger contender for machine‑learning inference, scientific computing, and real‑time data‑processing workloads.
