Unlock Go’s New SIMD API: Boost Performance with GOEXPERIMENT=simd
This article explains the motivation for adding SIMD support to Go and the two‑level design of the experimental simd/archsimd package. It then walks through configuration step by step, presents code examples for common data‑processing tasks, and reports benchmark results showing up to nearly nine‑fold speedups with no extra memory allocations.
Environment and Version
Go 1.26rc2 (Windows/amd64) on a 13th Gen Intel(R) Core(TM) i5‑1335U CPU.
Why SIMD in Go and How It Was Introduced
SIMD is essential for high‑performance computing. Go historically required hand‑written assembly for SIMD, which was hard to maintain, hindered compiler optimizations, and was not portable across architectures. Community issues (#35307, #53171, #64634, #67520) requested a language‑level SIMD API without language changes.
Two‑Level Design Approach
Architecture‑specific low‑level API (simd/archsimd)
Similar to the syscall package, providing one‑to‑one mappings to machine instructions.
Each architecture can define its own operations, prioritising performance.
High‑level portable vector API (planned)
Built on top of the low‑level API, similar to the os package.
Provides a unified, safe interface for most data‑processing and AI workloads.
Design philosophy: Most code should use the high‑level API; only rare architecture‑specific optimisations should “sink” to archsimd.
Low‑Level API Goals
Expressiveness: Cover the majority of useful hardware operations.
Relative ease of use: Keep code readable for developers without deep hardware knowledge.
Best‑effort portability: Provide a unified API when an operation is supported on multiple platforms.
Building block for high‑level API: Serve as the implementation foundation for future portable vector APIs.
Current Status and Roadmap
Go 1.26 (current): GOEXPERIMENT=simd enables the experimental simd/archsimd package for AMD64.
Short‑term: Extend support to ARM64 (NEON/SVE) and RISC‑V.
Long‑term: Design and implement a scalable high‑level vector API, eventually supporting matrix extensions such as ARM SME and Intel AMX.
Practical Code Examples and Configuration
Enabling SIMD in VS Code
Add the following to .vscode/settings.json:
{
"go.toolsEnvVars": {"GOEXPERIMENT": "simd"},
"go.testEnvVars": {"GOEXPERIMENT": "simd"},
"terminal.integrated.env.windows": {"GOEXPERIMENT": "simd"}
}

Command‑line activation:
# Windows PowerShell
$env:GOEXPERIMENT="simd"; go test -bench=. -benchmem -run=^$
# Linux / macOS
GOEXPERIMENT=simd go test -bench=. -benchmem -run=^$

Core Code Samples – Scalar vs SIMD
Scenario 1: Cosine Similarity
Scalar version:
func CosineSimilarityScalar(a, b []float32) float32 {
var dot, normA, normB float32
for i := range a {
ai, bi := a[i], b[i]
dot += ai * bi
normA += ai * ai
normB += bi * bi
}
return dot / (float32(math.Sqrt(float64(normA))) * float32(math.Sqrt(float64(normB))))
}

SIMD version (using archsimd):
import "simd/archsimd"
func CosineSimilaritySIMD(a, b []float32) float32 {
if len(a) != len(b) { panic("vectors must have same length") }
var dotVec, normAVec, normBVec archsimd.Float32x8
i := 0
for ; i <= len(a)-8; i += 8 {
va := archsimd.LoadFloat32x8Slice(a[i:])
vb := archsimd.LoadFloat32x8Slice(b[i:])
dotVec = dotVec.Add(va.Mul(vb))
normAVec = normAVec.Add(va.Mul(va))
normBVec = normBVec.Add(vb.Mul(vb))
}
// Horizontal reductions
dotSum1 := dotVec.AddPairsGrouped(dotVec)
normASum1 := normAVec.AddPairsGrouped(normAVec)
normBSum1 := normBVec.AddPairsGrouped(normBVec)
dotSum2 := dotSum1.AddPairsGrouped(dotSum1)
normASum2 := normASum1.AddPairsGrouped(normASum1)
normBSum2 := normBSum1.AddPairsGrouped(normBSum1)
var sums [3][8]float32
dotSum2.Store(&sums[0])
normASum2.Store(&sums[1])
normBSum2.Store(&sums[2])
dotProduct := sums[0][0] + sums[0][4]
normA := sums[1][0] + sums[1][4]
normB := sums[2][0] + sums[2][4]
// Tail handling
for ; i < len(a); i++ {
ai, bi := a[i], b[i]
dotProduct += ai * bi
normA += ai * ai
normB += bi * bi
}
if normA == 0 || normB == 0 { return 0 }
return dotProduct / (float32(math.Sqrt(float64(normA))) * float32(math.Sqrt(float64(normB))))
}

Scenario 2: Mean and Standard Deviation
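The SIMD code in this scenario (and in the later summation and dot‑product scenarios) calls a horizontalSum helper that the article never shows. A plausible definition, assuming the Go 1.26 GOEXPERIMENT=simd toolchain and the same AddPairsGrouped/Store reduction used in the cosine‑similarity example above; this is a sketch, not a confirmed part of the source:

```go
// horizontalSum reduces an 8-lane vector to a single float32.
// Not shown in the original article; mirrors the AddPairsGrouped
// reduction from the cosine-similarity example.
func horizontalSum(v archsimd.Float32x8) float32 {
	// Two rounds of pairwise adds leave the totals of each
	// 128-bit half in lanes 0 and 4.
	s := v.AddPairsGrouped(v)
	s = s.AddPairsGrouped(s)
	var out [8]float32
	s.Store(&out)
	return out[0] + out[4]
}
```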
Scalar version:
func ScalarMeanStd(data []float32) (mean, std float32) {
var sum float32
for _, v := range data { sum += v }
mean = sum / float32(len(data))
var sumSq float32
for _, v := range data {
diff := v - mean
sumSq += diff * diff
}
std = float32(math.Sqrt(float64(sumSq / float32(len(data)))))
return
}

SIMD version:
func SimdMeanStd(data []float32) (mean, std float32) {
n := len(data)
var sumVec archsimd.Float32x8
i := 0
for ; i <= n-8; i += 8 {
v := archsimd.LoadFloat32x8Slice(data[i:])
sumVec = sumVec.Add(v)
}
sum := horizontalSum(sumVec)
for ; i < n; i++ { sum += data[i] }
mean = sum / float32(n)
broadcastMean := archsimd.BroadcastFloat32x8(mean)
var varianceVec archsimd.Float32x8
i = 0
for ; i <= n-8; i += 8 {
v := archsimd.LoadFloat32x8Slice(data[i:])
diff := v.Sub(broadcastMean)
varianceVec = varianceVec.Add(diff.Mul(diff))
}
sumSq := horizontalSum(varianceVec)
for ; i < n; i++ {
diff := data[i] - mean
sumSq += diff * diff
}
variance := sumSq / float32(n)
std = float32(math.Sqrt(float64(variance)))
return
}

Scenario 3: Byte‑Array Comparison
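To make the mask logic in this scenario's SIMD version concrete, here is a plain‑Go model of what Equal(...).ToBits() computes: one bit per byte lane, so 32 equal bytes yield the mask 0xFFFFFFFF. The function name byteMask32 is hypothetical, used only for illustration:

```go
package main

import "fmt"

// byteMask32 models the 32-lane SIMD comparison: it compares two
// 32-byte chunks and packs one bit per lane into a uint32 -- the
// same shape the Equal(...).ToBits() check expects.
func byteMask32(a, b []byte) uint32 {
	var mask uint32
	for lane := 0; lane < 32; lane++ {
		if a[lane] == b[lane] {
			mask |= 1 << lane
		}
	}
	return mask
}

func main() {
	x := make([]byte, 32)
	y := make([]byte, 32)
	fmt.Printf("equal chunks:     %#x\n", byteMask32(x, y)) // all 32 bits set
	y[5] = 1
	fmt.Printf("one lane differs: %#x\n", byteMask32(x, y)) // bit 5 cleared
}
```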
Scalar version:
func ScalarByteCompare(a, b []byte) bool {
if len(a) != len(b) { return false }
for i := range a {
if a[i] != b[i] { return false }
}
return true
}

SIMD version:
func SimdByteCompare(a, b []byte) bool {
if len(a) != len(b) { return false }
i := 0
for ; i <= len(a)-32; i += 32 {
va := archsimd.LoadUint8x32Slice(a[i:])
vb := archsimd.LoadUint8x32Slice(b[i:])
if va.Equal(vb).ToBits() != 0xFFFFFFFF { return false }
}
for ; i < len(a); i++ {
if a[i] != b[i] { return false }
}
return true
}

Scenario 4: Array Summation
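One subtlety worth knowing for this scenario: the SIMD version accumulates into 8 independent lanes, so floating‑point rounding can differ slightly from the strict left‑to‑right scalar sum. A plain‑Go model of the 8‑accumulator order (sum8Lanes is a hypothetical name for illustration):

```go
package main

import "fmt"

// sum8Lanes models the SIMD summation order: eight independent
// accumulators over strided elements, combined at the end, plus a
// scalar tail loop for the leftover elements.
func sum8Lanes(data []float32) float32 {
	var acc [8]float32
	i := 0
	for ; i <= len(data)-8; i += 8 {
		for lane := 0; lane < 8; lane++ {
			acc[lane] += data[i+lane]
		}
	}
	var sum float32
	for _, a := range acc {
		sum += a
	}
	for ; i < len(data); i++ { // scalar tail
		sum += data[i]
	}
	return sum
}

func main() {
	data := make([]float32, 1000)
	for i := range data {
		data[i] = float32(i) * 0.001
	}
	var scalar float32
	for _, v := range data {
		scalar += v
	}
	// Close, but not guaranteed bit-identical in general.
	fmt.Println(sum8Lanes(data), scalar)
}
```

This accumulation‑order difference is exactly why the tolerance‑based testing recommended later in the article matters.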
Scalar version:
func ScalarSum(data []float32) float32 {
var sum float32
for _, v := range data { sum += v }
return sum
}

SIMD version:
func SimdSum(data []float32) float32 {
var sumVec archsimd.Float32x8
i := 0
for ; i <= len(data)-8; i += 8 {
v := archsimd.LoadFloat32x8Slice(data[i:])
sumVec = sumVec.Add(v)
}
sum := horizontalSum(sumVec)
for ; i < len(data); i++ { sum += data[i] }
return sum
}

Scenario 5: Vector Dot Product
Scalar version:
func ScalarDotProduct(a, b []float32) float32 {
var dot float32
for i := range a { dot += a[i] * b[i] }
return dot
}

SIMD version:
func SimdDotProduct(a, b []float32) float32 {
var dotVec archsimd.Float32x8
i := 0
for ; i <= len(a)-8; i += 8 {
va := archsimd.LoadFloat32x8Slice(a[i:])
vb := archsimd.LoadFloat32x8Slice(b[i:])
dotVec = dotVec.Add(va.Mul(vb))
}
dot := horizontalSum(dotVec)
for ; i < len(a); i++ { dot += a[i] * b[i] }
return dot
}

Performance Overview
Single‑pair cosine similarity (384‑dim): scalar 203.1 ns/op → SIMD 156.7 ns/op (~1.3×), 0 B allocated.
Batch cosine similarity (1000 × 384‑dim): scalar 250,381 ns/op → SIMD 167,838 ns/op (~1.5×), 0 B allocated.
Mean & standard deviation (1024 float32): scalar 3,363 ns/op → SIMD 1,778 ns/op (~1.9×), 0 B allocated.
Byte‑array comparison (256 bytes): scalar 280.7 ns/op → SIMD 31.75 ns/op (~8.8×), 0 B allocated.
Array summation (1024 float32): scalar 1,205 ns/op → SIMD 432 ns/op (~2.8×), 0 B allocated.
Vector dot product (384‑dim): scalar 178.5 ns/op → SIMD 122.3 ns/op (~1.5×), 0 B allocated.
Key observations:
All SIMD implementations allocate zero additional memory; speedup comes purely from parallel computation.
Speedup grows with the degree of parallelism; byte‑array comparison sees the highest gain because 32 bytes are processed per iteration.
Performance depends on how well the data length aligns with the SIMD width; leftover elements handled by the scalar tail loop can reduce overall gains.
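The alignment point can be made concrete with a little arithmetic: with 8‑lane float32 vectors, a slice of length n splits into n/8 full vector iterations plus n%8 scalar tail elements. A quick illustration (vectorSplit is a hypothetical helper, not part of any API):

```go
package main

import "fmt"

// vectorSplit reports how a slice of length n divides into full
// vector iterations and scalar tail elements for a given lane count.
func vectorSplit(n, lanes int) (vecIters, tail int) {
	return n / lanes, n % lanes
}

func main() {
	for _, n := range []int{1024, 384, 387, 5} {
		v, t := vectorSplit(n, 8)
		fmt.Printf("n=%4d -> %3d vector iterations, %d tail elements\n", n, v, t)
	}
}
```

The benchmark sizes above (1024, 384) divide evenly by 8, which is the best case; a length like 387 leaves 3 elements to the slower scalar tail, and very short slices run entirely in scalar code.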
Current Limitations and Future Outlook
API Gaps
Missing high‑level reduction primitives such as ReduceSum, ReduceMax, etc.
Lack of domain‑specific instructions like Gather/Scatter, VAESENC, VPTERNLOGD.
Architecture support currently limited to AMD64; ARM64 (NEON/SVE) and RISC‑V are under development.
Practical Advice for Developers
Adopt gradually: use build constraints (e.g., //go:build go1.26 && goexperiment.simd && amd64) or runtime detection to fall back to scalar code when SIMD is unavailable.
Validate correctness: write tests that compare SIMD results against scalar implementations within an acceptable tolerance (e.g., 1e‑6).
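Both pieces of advice combine into a small pattern: keep the scalar implementation as the default, let a build‑constrained file swap in the SIMD version, and test the two against each other with a tolerance. The sketch below is self‑contained, so sumFast is just a stand‑in stub; in a real project the reassignment would live in a file guarded by a constraint like //go:build goexperiment.simd && amd64:

```go
package main

import (
	"fmt"
	"math"
)

// Sum is a package-level function variable: scalar by default, so
// the package builds on any platform. A build-constrained file would
// reassign it in an init function when SIMD is available.
var Sum = sumScalar

func sumScalar(data []float32) float32 {
	var s float32
	for _, v := range data {
		s += v
	}
	return s
}

// sumFast stands in for a SIMD implementation in this sketch.
func sumFast(data []float32) float32 { return sumScalar(data) }

// almostEqual is the tolerance check recommended for validating
// SIMD results against the scalar reference.
func almostEqual(a, b, tol float32) bool {
	return math.Abs(float64(a-b)) <= float64(tol)
}

func main() {
	data := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9}
	Sum = sumFast // what the build-constrained init would do
	fmt.Println(almostEqual(Sum(data), sumScalar(data), 1e-6)) // true
}
```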
Conclusion
The experimental simd/archsimd package in Go 1.26rc2 demonstrates that Go can achieve substantial performance gains—ranging from 1.3× to nearly 9×—without any extra memory allocations. While the API is still experimental and lacks some advanced reduction operations and broader architecture support, it provides a clear migration path from scalar to vectorised code. Future releases adding more architectures and a high‑level portable vector API will make Go a stronger contender for machine‑learning inference, scientific computing, and real‑time data‑processing workloads.
