Why Go’s Memory Usage Explodes in Million‑Thread Benchmarks – A Deep Dive

The article analyses a large‑scale benchmark comparing Go, C, Rust, C# and other languages under single, 100 k and 1 M concurrent tasks, revealing how Go's 8‑byte default int and per‑goroutine stack overhead cause dramatically higher memory consumption despite comparable CPU performance.

BirdNest Tech Talk

Cache‑line impact on the 10 billion‑iteration benchmark

The benchmark iterates over an array ten billion times. In C the element type is int32_t (4 bytes) and Rust uses u32 (4 bytes), while Go uses int, which is 8 bytes on a 64‑bit platform. A 64‑byte cache line therefore holds 16 elements for C/Rust but only 8 for Go, halving the number of elements delivered per cache‑line fetch.

A community pull request changed the Go array element type to int32. After rebuilding, Go matched the execution time of C, Rust and Zig, confirming that element size was the dominant factor.

Massive‑concurrency memory‑usage benchmark

Three concurrency scenarios were measured:

Single task.

100 000 concurrent tasks.

1 000 000 concurrent tasks.

Each task simply sleeps for 10 seconds. The languages use their native concurrency primitives: goroutines (go statements) for Go, async/await for Rust and C#, and an equivalent pattern for Zig (not shown). CPU time and memory usage were recorded.

Observed results

Single task – Rust, C# and Go show comparable CPU time and low memory.

100 k tasks – Rust and C# keep memory modest; Go’s memory usage begins to increase noticeably.

1 M tasks – Go’s RSS grows to several gigabytes, while Rust and C# stay under 500 MB.

Instrumentation to expose Go’s memory spike

A helper printUsage function was added. It sleeps 5 seconds, then reads runtime.MemStats and prints Alloc, StackSys, Sys. It also runs ps -o rss -p $PID to capture the OS‑level RSS.

package main

import (
    "fmt"
    "os"
    "os/exec"
    "runtime"
    "strconv"
    "strings"
    "sync"
    "time"
)

func main() {
    fmt.Printf("pid: %d\n", os.Getpid())
    numRoutines := 100000
    if len(os.Args) > 1 {
        if n, err := strconv.Atoi(os.Args[1]); err == nil {
            numRoutines = n
        }
    }
    start := time.Now()
    var wg sync.WaitGroup
    for i := 0; i < numRoutines; i++ {
        wg.Add(1)
        go func() {
            time.Sleep(10 * time.Second)
            wg.Done()
        }()
    }
    go printUsage()
    wg.Wait()
    fmt.Printf("Time taken = %v\n", time.Since(start))
}

// printUsage waits until the goroutines are parked, then reports
// Go-level allocation stats and the OS-level RSS of the process.
func printUsage() {
    time.Sleep(5 * time.Second)
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("Alloc = %v MiB\n", bToMb(m.Alloc))
    fmt.Printf("Stack = %v MiB\n", bToMb(m.StackSys))
    fmt.Printf("Sys = %v MiB\n", bToMb(m.Sys))
    output, err := exec.Command("ps", "-o", "rss", "-p", fmt.Sprintf("%d", os.Getpid())).Output()
    if err == nil {
        lines := strings.Split(string(output), "\n")
        if len(lines) >= 2 {
            // ps reports RSS in KiB; convert to bytes before bToMb.
            rss, err := strconv.ParseInt(strings.TrimSpace(lines[1]), 10, 64)
            if err == nil {
                fmt.Printf("RSS: %v MiB\n", bToMb(uint64(rss)*1024))
            }
        }
    }
}

func bToMb(b uint64) uint64 { return b / 1024 / 1024 }

Analysis of the memory profile

On an older Linux host the RSS reached several gigabytes. Stacks were the largest contributor, at roughly 2 KB per goroutine: the default initial stack size was raised from 4 KB to 8 KB in Go 1.2, cut to 2 KB in Go 1.4, and since Go 1.19 the runtime sizes new stacks adaptively based on historical average usage. The heap added hundreds of megabytes more, because every goroutine also allocates a runtime descriptor (the g struct); across one million goroutines these descriptors totalled roughly 435 MB.
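The per‑goroutine footprint can be estimated empirically. This sketch (an illustration, not code from the article) parks N goroutines on a channel and divides the growth of runtime.MemStats.Sys by N; the exact number varies by Go version and platform, but it lands in the low kilobytes, consistent with the 2 KB initial stack plus the g descriptor.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// perGoroutineSys spawns n parked goroutines and returns the growth of
// runtime.MemStats.Sys divided by n — a rough per-goroutine footprint.
func perGoroutineSys(n int) uint64 {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	stop := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			<-stop // park; the stack and g struct stay allocated
		}()
	}
	runtime.ReadMemStats(&after)
	per := (after.Sys - before.Sys) / uint64(n)

	close(stop)
	wg.Wait()
	return per
}

func main() {
	fmt.Printf("~%d bytes of runtime memory per goroutine\n", perGoroutineSys(100000))
}
```

Note that Sys is cumulative memory obtained from the OS, so repeated runs within one process can report a smaller delta once freed stacks are reused from the runtime's stack pool.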

By contrast, Rust futures occupy 64‑128 bytes each, and C# async tasks use roughly 100‑200 bytes. Consequently, at 1 M concurrent units Rust and C# stay under 500 MB while Go reaches several gigabytes.
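A back‑of‑envelope calculation makes the gap concrete. Using the per‑unit sizes quoted above (2 KiB initial stack plus an assumed ~400 B g descriptor for Go, and the 128 B upper bound for a Rust future), one million concurrent units work out to:

```go
package main

import "fmt"

func main() {
	const units = 1_000_000

	// Assumed per-unit costs, taken from the figures quoted in the text:
	// Go: 2 KiB initial stack + ~400 B runtime g descriptor.
	// Rust: upper end of the 64-128 B range per future.
	goBytes := units * (2048 + 400)
	rustBytes := units * 128

	fmt.Printf("Go:   ~%d MiB\n", goBytes/1024/1024)   // → Go:   ~2334 MiB
	fmt.Printf("Rust: ~%d MiB\n", rustBytes/1024/1024) // → Rust: ~122 MiB
}
```

Roughly 2.3 GiB for Go versus about 122 MiB for Rust, in line with the measured results.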

Attempts to mitigate the blow‑up

Tuning runtime settings via GODEBUG – no measurable reduction (Go exposes no knob for the initial goroutine stack size; it is a compile‑time runtime constant).

Introducing a goroutine pool to reuse stacks – memory remained high.

Using a time‑wheel scheduler – did not affect the per‑goroutine overhead.

References

First edition of the benchmark: https://hez2010.github.io/async-runtimes-benchmarks-2024

Second edition (updated code and results): https://hez2010.github.io/async-runtimes-benchmarks-2024/take2.html

Tags: Performance, Concurrency, Go, Benchmark, Programming Languages, Memory Usage
Written by

BirdNest Tech Talk

Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.
