
Optimizing Vector Retrieval in Go: SIMD and Plan9 Assembly for High‑Performance Vector Search

This article presents a backend‑focused study on reducing the latency of vector‑based ad‑recommendation retrieval by leveraging Gonum, SIMD AVX2 intrinsics, and direct Plan9 assembly integration in Go, validated with detailed performance benchmarks and CPU usage analysis.

IEG Growth Platform Technology Team

Background – In high‑throughput ad‑recommendation scenarios, high vector‑search latency in the recall service leads to timeouts and degraded downstream ranking performance. Storing sub‑million‑scale vector sets in process memory eliminates network overhead but raises CPU and memory consumption.
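For reference, the naive scalar baseline that all the optimizations below are measured against can be sketched as follows (function and variable names are illustrative, not from the original service):

```go
package main

import "fmt"

// innerProduct is the naive scalar baseline: one multiply-add per element,
// no vectorization, no parallelism.
func innerProduct(x, y []float32) float32 {
	var sum float32
	for i := range x {
		sum += x[i] * y[i]
	}
	return sum
}

func main() {
	x := []float32{1, 2, 3, 4}
	y := []float32{5, 6, 7, 8}
	fmt.Println(innerProduct(x, y)) // 1*5 + 2*6 + 3*7 + 4*8 = 70
}
```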

Solution Exploration – The team first evaluated Gonum, an open‑source Go scientific computing library, for float32 inner‑product calculations. While Gonum's parallel implementation achieved an 8× speed‑up, its limited function set (no cosine similarity or Euclidean distance) prompted a deeper dive into SIMD.
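The chunk-and-reduce idea behind a Gonum-style parallel reduction can be sketched in plain Go with goroutines. This is an illustrative stand-in for the splitting strategy, not Gonum's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// parallelDot splits the vectors across workers, computes partial dot
// products concurrently, then sums the partials. This sketches the
// parallel-reduction strategy, not Gonum's real code.
func parallelDot(x, y []float32, workers int) float32 {
	partial := make([]float32, workers)
	chunk := (len(x) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo := w * chunk
		hi := lo + chunk
		if hi > len(x) {
			hi = len(x)
		}
		if lo >= hi {
			continue
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			var s float32
			for i := lo; i < hi; i++ {
				s += x[i] * y[i]
			}
			partial[w] = s // one slot per worker: no shared-write contention
		}(w, lo, hi)
	}
	wg.Wait()
	var sum float32
	for _, s := range partial {
		sum += s
	}
	return sum
}

func main() {
	x := make([]float32, 256)
	y := make([]float32, 256)
	for i := range x {
		x[i], y[i] = 1, 2
	}
	fmt.Println(parallelDot(x, y, 4)) // 256 * 1 * 2 = 512
}
```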

SIMD Computation – SIMD (Single Instruction, Multiple Data) with Intel AVX2 256‑bit registers can process eight 32‑bit floats per instruction. The article provides concrete AVX2 implementations for inner product, Euclidean distance, and cosine similarity, each loading data with _mm256_loadu_ps, performing vectorized arithmetic, and reducing results with horizontal adds.

#include <immintrin.h>

// Inner product of two 256-dimensional float32 vectors using AVX2.
// Assumes the dimension is a multiple of 8; each iteration multiplies
// and accumulates eight floats in a 256-bit register.
void VecInnerProductAVX2(const float* x, const float* y, float* z) {
    int d = 256;
    __m256 msum1 = _mm256_setzero_ps();
    while (d >= 8) {
        __m256 mx = _mm256_loadu_ps(x); x += 8;
        __m256 my = _mm256_loadu_ps(y); y += 8;
        msum1 = _mm256_add_ps(msum1, _mm256_mul_ps(mx, my));
        d -= 8;
    }
    // Horizontal reduction: fold the upper 128-bit half onto the lower,
    // then two hadds collapse the four partial sums into one scalar.
    __m128 msum2 = _mm256_extractf128_ps(msum1, 1);
    msum2 = _mm_add_ps(msum2, _mm256_extractf128_ps(msum1, 0));
    msum2 = _mm_hadd_ps(msum2, msum2);
    msum2 = _mm_hadd_ps(msum2, msum2);
    _mm_store_ss(z, msum2); // store only the scalar result to *z
}
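The dataflow of this kernel can be mirrored in scalar Go: eight independent accumulators, one per lane of the 256‑bit register, followed by a final horizontal reduction, which is exactly what the _mm256_add_ps and _mm_hadd_ps sequence does in hardware. A minimal sketch (function name is illustrative):

```go
package main

import "fmt"

// dot8Lane mirrors the AVX2 kernel in scalar Go: eight independent lane
// accumulators, then a horizontal reduction of the eight lane sums.
// Like the C kernel, it assumes len(x) is a multiple of 8.
func dot8Lane(x, y []float32) float32 {
	var lanes [8]float32
	for i := 0; i < len(x); i += 8 {
		for l := 0; l < 8; l++ {
			lanes[l] += x[i+l] * y[i+l] // one multiply-add per lane
		}
	}
	// horizontal reduction, the scalar analogue of the hadd sequence
	var sum float32
	for _, v := range lanes {
		sum += v
	}
	return sum
}

func main() {
	x := make([]float32, 16)
	y := make([]float32, 16)
	for i := range x {
		x[i], y[i] = float32(i), 1
	}
	fmt.Println(dot8Lane(x, y)) // 0 + 1 + ... + 15 = 120
}
```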

Calling SIMD from Go – Two integration methods were compared:

CGO: Direct C function calls incur context‑switch overhead between the Go and C runtimes, yielding only a ~2× speed‑up.

Plan9 Assembly: Using the c2goasm and asm2plan9s toolchain, the compiler‑generated C/AVX2 assembly is converted to Go's Plan9 assembly format. The resulting Go function (e.g., _VecInnerProductAVX2) can be invoked without CGO overhead.

Example Go declaration for the Plan9 assembly function:

//go:noescape
func _VecInnerProductAVX2(x, y, z *float32)
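In practice the pointer‑based assembly entry point is wrapped in a slice‑friendly Go API. A hedged sketch of that wrapper pattern follows; the wrapper name is illustrative, and a pure‑Go stand‑in replaces the assembly body so the shape is runnable without the .s file:

```go
package main

import (
	"fmt"
	"unsafe"
)

const dim = 256 // the AVX2 kernel above hard-codes 256-dim vectors

// vecInnerProductAVX2 is a pure-Go stand-in for the Plan9 assembly entry
// point _VecInnerProductAVX2, used here only so this sketch compiles and
// runs without the generated .s file.
func vecInnerProductAVX2(x, y, z *float32) {
	xs := unsafe.Slice(x, dim)
	ys := unsafe.Slice(y, dim)
	var sum float32
	for i := range xs {
		sum += xs[i] * ys[i]
	}
	*z = sum
}

// InnerProduct is the slice-friendly API a service would expose: it checks
// lengths, then hands raw element pointers to the kernel.
func InnerProduct(x, y []float32) float32 {
	if len(x) != dim || len(y) != dim {
		panic("InnerProduct: vectors must be 256-dimensional")
	}
	var z float32
	vecInnerProductAVX2(&x[0], &y[0], &z)
	return z
}

func main() {
	x := make([]float32, dim)
	y := make([]float32, dim)
	for i := range x {
		x[i], y[i] = 1, 3
	}
	fmt.Println(InnerProduct(x, y)) // 256 * 1 * 3 = 768
}
```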

Performance Comparison – Benchmarks on a 4‑core Intel Xeon VM (CentOS 8) show:

Gonum: ~8× faster than naïve Go inner‑product.

SIMD‑CGO: ~2× faster.

SIMD‑Plan9 Assembly: ~8.7× faster, with the lowest CPU utilization.
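Numbers like these can be reproduced outside `go test` with testing.Benchmark; a minimal harness sketch follows (the article's exact benchmark code is not shown, so names and setup here are illustrative, with each competing implementation substituted for `dot` in turn):

```go
package main

import (
	"fmt"
	"testing"
)

// dot is the kernel under test; in the real comparison each implementation
// (naive, Gonum, SIMD-CGO, SIMD-Plan9) would be benchmarked the same way.
func dot(x, y []float32) float32 {
	var s float32
	for i := range x {
		s += x[i] * y[i]
	}
	return s
}

// benchDot times dot over b.N iterations. testing.Benchmark works from
// ordinary main code, without the `go test` harness.
func benchDot(x, y []float32) testing.BenchmarkResult {
	var sink float32 // keep the result live so the loop isn't optimized away
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = dot(x, y)
		}
	})
	_ = sink
	return res
}

func main() {
	x := make([]float32, 256)
	y := make([]float32, 256)
	for i := range x {
		x[i], y[i] = 1, 2
	}
	res := benchDot(x, y)
	fmt.Printf("%d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```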

Similar gains were observed for Euclidean and cosine distance calculations.
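For completeness, scalar Go references for those two other kernels show what each AVX2 version computes (function names are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// euclideanDist: sqrt of the sum of squared element differences.
func euclideanDist(x, y []float32) float32 {
	var s float32
	for i := range x {
		d := x[i] - y[i]
		s += d * d
	}
	return float32(math.Sqrt(float64(s)))
}

// cosineSim: dot product normalized by the two vector magnitudes.
func cosineSim(x, y []float32) float32 {
	var dot, nx, ny float32
	for i := range x {
		dot += x[i] * y[i]
		nx += x[i] * x[i]
		ny += y[i] * y[i]
	}
	return dot / float32(math.Sqrt(float64(nx))*math.Sqrt(float64(ny)))
}

func main() {
	x := []float32{1, 0, 0, 0}
	y := []float32{0, 1, 0, 0}
	fmt.Println(euclideanDist(x, y)) // sqrt(2) ≈ 1.414
	fmt.Println(cosineSim(x, y))     // orthogonal vectors: 0
}
```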

Conclusion – By converting AVX2 intrinsics to Plan9 assembly and invoking them directly from Go, the solution achieves the best overall performance, reducing latency and CPU load for in‑process vector retrieval, and is recommended for production deployment.

Tags: backend, performance optimization, Go, vector search, assembly, SIMD
Written by

IEG Growth Platform Technology Team

Official account of Tencent IEG Growth Platform Technology Team, showcasing cutting‑edge achievements across front‑end, back‑end, client, algorithm, testing and other domains.
