How Go’s Netpoller Powers Millions of Connections – 5 Real‑World Cases

This article explains why Go programs often fail to reach C10M concurrency, analyzes five real‑world incidents, reveals the inner workings of Go's netpoller, and provides concrete code‑level optimizations, configuration tweaks, and load‑testing practices to achieve stable million‑connection services.

Code Wrench

1. Real‑world Failure: 85 k Connections on a 32‑core Server

Monitoring showed 85 000 established TCP connections, 95 % CPU usage and 60 GB of memory on a 32-core, 128 GB RAM machine. The service spawned two goroutines per connection, roughly 170 000 goroutines in total, and each connection held about two file descriptors. The excessive goroutine count caused massive context-switch overhead and CPU saturation.

$ lsof -p 12345 | wc -l
165432   # ~165 k file descriptors
$ netstat -an | grep ESTABLISHED | wc -l
85234    # actual connections
# Approx. 2 FDs per connection

Problem code created a separate goroutine for reading and writing:

// ❌ two goroutines per connection
func handleConn(conn net.Conn) {
    go readLoop(conn)   // goroutine 1
    go writeLoop(conn)  // goroutine 2
}

Fix – handle the connection in a single goroutine using buffered I/O and an explicit loop:

// ✅ single goroutine per connection
func handleConn(conn net.Conn) {
    defer conn.Close()
    reader := bufio.NewReader(conn)
    writer := bufio.NewWriter(conn)
    for {
        msg, err := readMessage(reader)
        if err != nil { return }
        response := processMessage(msg)
        writeMessage(writer, response)
        writer.Flush()
    }
}

After the change, CPU dropped to 15 %, memory to 8 GB, and the service supported more than 500 k connections.

2. Three Secrets of Go’s Network Model

Secret 1 – Synchronous Code, Asynchronous Execution

Application code appears synchronous (e.g., conn.Read(buf) blocks), but the runtime performs the following steps:

System call returns EAGAIN when data is not ready.

The file descriptor is added to epoll/kqueue.

The current goroutine parks (gopark) while the OS thread continues executing other goroutines.

When the fd becomes ready, the poller wakes the goroutine, which resumes the read.

This model combines the simplicity of synchronous code with the scalability of event‑driven I/O.

Secret 2 – Extremely Low Memory Footprint

Memory consumption for 1 million connections:

Thread‑per‑connection: 1 MB stack × 1 M = 1 TB (impractical).

Node.js (single thread): ~2‑4 GB, but cannot utilise multiple cores.

Go netpoller: 8 KB goroutine stack + ~17 KB TCP buffers ≈ 25 KB per connection, i.e. ≈ 25 GB total.

In a 64 GB, 16‑core test the process used 28 GB and 8‑12 % CPU, confirming feasibility.
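The total above is easy to sanity-check. The sketch below redoes the arithmetic with the per-connection figures quoted in this section (8 KB stack, ~17 KB TCP buffers):

```go
package main

import "fmt"

func main() {
	const (
		connections = 1_000_000
		stackKB     = 8  // goroutine stack figure from the text
		tcpBufKB    = 17 // per-connection TCP buffer figure from the text
	)
	totalKB := connections * (stackKB + tcpBufKB)
	totalGB := float64(totalKB) / (1 << 20) // KiB -> GiB
	fmt.Printf("%.1f GB for %d connections\n", totalGB, connections)
	// prints "23.8 GB for 1000000 connections", in line with the ~25 GB quoted
}
```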

Secret 3 – Scheduler Overhead Much Lower Than Threads

Typical costs:

Thread switch: ~1‑2 µs (kernel‑user transition, TLB flush).

Goroutine switch: ~0.2 µs (user‑mode only).

For 1 M connections with one switch per second, thread switching would consume 1‑2 seconds of CPU time, whereas goroutine switching consumes only 0.2 seconds.

3. Netpoller Working Principle

Core Data Structure (simplified)

// runtime/netpoll.go
type pollDesc struct {
    rg uintptr // read‑waiting goroutine
    wg uintptr // write‑waiting goroutine
    rd int64   // read timeout
    wd int64   // write timeout
}
var (
    epfd int32 // Linux epoll fd
    kqfd int32 // BSD/kqueue fd
)

Read Flow Example

1. User code: conn.Read(buf)
2. runtime: internal/poll.(*FD).Read()
3. Set fd non‑blocking (syscall.SetNonblock)
4. Attempt read → EAGAIN?
   ├─ No → return data
   └─ Yes → add fd to epoll/kqueue (epoll_ctl)
5. Park current goroutine (gopark)
6. Background poller: epoll_wait() → fd ready
7. Wake goroutine (goready) and place it on run queue
8. Retry syscall.Read() → success

The netpoller is tightly integrated with the scheduler; no separate event‑loop thread is required.

4. Five Optimisation Cases for Million‑Connection Services

Case 1 – Avoid Multiple Goroutines per Connection

Creating two goroutines per connection doubles memory usage (8 KB stack each). The recommended pattern is a single goroutine handling both read and write, reducing memory by ~50 %.

Case 2 – Buffer Size Tuning

Using a tiny 128‑byte buffer generates millions of syscalls. Switching to a 32 KB buffered reader/writer reduces syscalls dramatically and can increase throughput by an order of magnitude.

reader := bufio.NewReaderSize(conn, 32*1024)
writer := bufio.NewWriterSize(conn, 32*1024)

Empirical recommendations (based on message size):

Small messages (<1 KB): 4‑8 KB buffer.

Medium messages (1‑10 KB): 16‑32 KB buffer.

Large messages (>100 KB): 64‑128 KB buffer.

Case 3 – Set Timeouts to Prevent Resource Leaks

Connections without read/write deadlines can stay idle indefinitely, allowing slow‑loris attacks to exhaust file descriptors. Apply per‑operation deadlines or an idle timer.

for {
    conn.SetReadDeadline(time.Now().Add(30 * time.Second))
    n, err := conn.Read(buf)
    if err != nil { return }
    // process buf[:n]
}

Case 4 – Connection Pool for Outbound Connections

Creating a new TCP connection for every request incurs a three‑way handshake and consumes ports (TIME_WAIT). A simple pool reuses net.Conn objects and falls back to a factory when the pool is empty.

type ConnPool struct {
    conns   chan net.Conn
    factory func() (net.Conn, error)
    maxConns int
}
func (p *ConnPool) Get() (net.Conn, error) { /* … */ }
func (p *ConnPool) Put(conn net.Conn) { /* … */ }

var dbPool = NewConnPool(100, func() (net.Conn, error) { return net.Dial("tcp", "db:3306") })

The standard library http.Client already embeds a connection pool; creating a new client per request should be avoided.

Case 5 – System‑level Tuning and Listener Configuration

Linux kernel parameters required for million‑connection servers:

File-descriptor limit: ulimit -n 1048576

TCP buffers: net.core.rmem_max = 134217728 and net.core.wmem_max = 134217728

Reuse TIME_WAIT sockets: net.ipv4.tcp_tw_reuse = 1

Connection queues: net.core.somaxconn = 65535 and net.ipv4.tcp_max_syn_backlog = 8192

Local port range: net.ipv4.ip_local_port_range = 10000 65535

The listener can be created with SO_REUSEADDR and SO_REUSEPORT via net.ListenConfig.Control. Individual connections should enable TCP_NODELAY and keep-alive with a reasonable keep-alive period.
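For reference, the sysctl values above could be made persistent with a drop-in file; the sketch below uses a hypothetical file name and the article's suggested values, which are not universal defaults.

```shell
# /etc/sysctl.d/99-conn-tuning.conf (hypothetical file name)
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_tw_reuse = 1
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.ip_local_port_range = 10000 65535
```

```shell
# Apply without a reboot, and raise the fd limit for the current shell:
sysctl --system
ulimit -n 1048576
```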

func createListener(addr string) (net.Listener, error) {
    lc := net.ListenConfig{Control: func(network, address string, c syscall.RawConn) error {
        var serr error
        if err := c.Control(func(fd uintptr) {
            // SO_REUSEADDR allows fast restarts; SO_REUSEPORT lets several
            // listeners share one port for kernel-level load balancing.
            serr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_REUSEADDR, 1)
            if serr == nil {
                serr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_REUSEPORT, 1)
            }
        }); err != nil {
            return err
        }
        return serr
    }}
    return lc.Listen(context.Background(), "tcp", addr)
}

func configConn(conn net.Conn) error {
    if tcp, ok := conn.(*net.TCPConn); ok {
        tcp.SetNoDelay(true)
        tcp.SetKeepAlive(true)
        tcp.SetKeepAlivePeriod(3 * time.Minute)
    }
    return nil
}

5. Load‑Testing Practice

Server (echo) implementation

package main
import (
    "bufio"
    "flag"
    "fmt"
    "log"
    "net"
    "runtime"
    "sync/atomic"
    "time"
)
var (
    addr  = flag.String("addr", ":8080", "listen address")
    conns int64
)
func main() {
    flag.Parse()
    ln, err := net.Listen("tcp", *addr)
    if err != nil { log.Fatal(err) }
    log.Printf("Listening on %s", *addr)
    go monitor()
    for {
        conn, err := ln.Accept()
        if err != nil { continue }
        atomic.AddInt64(&conns, 1)
        go handleConn(conn)
    }
}
func handleConn(conn net.Conn) {
    defer func(){ conn.Close(); atomic.AddInt64(&conns, -1) }()
    if tcp, ok := conn.(*net.TCPConn); ok {
        tcp.SetNoDelay(true)
        tcp.SetKeepAlive(true)
        tcp.SetKeepAlivePeriod(3 * time.Minute)
    }
    reader := bufio.NewReaderSize(conn, 4096)
    writer := bufio.NewWriterSize(conn, 4096)
    for {
        conn.SetReadDeadline(time.Now().Add(5 * time.Minute))
        line, err := reader.ReadBytes('\n')
        if err != nil { return }
        writer.Write(line)
        writer.Flush()
    }
}
func monitor() {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Printf("Connections: %d, Goroutines: %d, Memory: %.2f GB\n",
            atomic.LoadInt64(&conns), runtime.NumGoroutine(), float64(m.Alloc)/(1<<30))
    }
}

Client (stress) implementation

package main
import (
    "flag"
    "log"
    "net"
    "sync"
    "sync/atomic"
    "time"
)
var (
    target      = flag.String("target", "localhost:8080", "target server")
    count       = flag.Int("c", 100000, "number of connections")
    rate        = flag.Int("rate", 1000, "connections per second")
    established int64
    errors      int64
)
func main() {
    flag.Parse()
    log.Printf("Target: %s, Connections: %d", *target, *count)
    go monitor()
    ticker := time.NewTicker(time.Second / time.Duration(*rate))
    defer ticker.Stop()
    var wg sync.WaitGroup
    for i := 0; i < *count; i++ {
        <-ticker.C
        wg.Add(1)
        go func(){
            defer wg.Done()
            if err := connect(); err != nil { atomic.AddInt64(&errors, 1) }
        }()
    }
    wg.Wait()
    select {}
}
func connect() error {
    conn, err := net.DialTimeout("tcp", *target, 10*time.Second)
    if err != nil { return err }
    atomic.AddInt64(&established, 1)
    go func(){
        defer conn.Close()
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()
        for range ticker.C {
            conn.SetWriteDeadline(time.Now().Add(5 * time.Second))
            if _, err := conn.Write([]byte("ping\n")); err != nil { return }
            buf := make([]byte, 5)
            conn.SetReadDeadline(time.Now().Add(5 * time.Second))
            if _, err := conn.Read(buf); err != nil { return }
        }
    }()
    return nil
}
func monitor() {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        log.Printf("Established: %d, Errors: %d", atomic.LoadInt64(&established), atomic.LoadInt64(&errors))
    }
}

Typical test steps:

Apply the kernel parameters listed in Case 5.

Run the server: go run server.go -addr :8080.

Run the client on another host:

go run client.go -target <server-ip>:8080 -c 1000000 -rate 5000

Observe server metrics (connections, goroutine count, memory) and client metrics (established connections, errors).

6. Core Takeaways

Advantages of Go’s Network Model

Developer‑friendly : Write straightforward synchronous code.

High performance : Runtime converts it to non‑blocking asynchronous execution.

Scalable : Same code works from a few connections up to millions.

Multi‑core aware : Scheduler automatically balances work across CPUs.

Key Factors for Million‑Connection Services

One goroutine per connection (avoid extra goroutines).

Buffer size 16‑32 KB for typical workloads.

Set appropriate read/write and idle timeouts.

Tune OS limits (ulimit, sysctl).

Reuse outbound connections with a pool.

Continuously monitor and load‑test.

Common Misconceptions

More goroutines always improve throughput – false.

Timeouts are optional – false.

Default OS/network settings are sufficient for millions of connections – false.

Skipping load‑testing is safe – false.

Optimisation Roadmap

Stage 1: Functional correctness – get the service running.
Stage 2: Single‑connection tuning – buffers, deadlines, error handling.
Stage 3: Concurrency tuning – goroutine count, connection pool.
Stage 4: System tuning – kernel parameters, monitoring.
Stage 5: Load‑test validation – identify bottlenecks, iterate.

7. Final Thoughts

Understanding the netpoller lets you write high‑performance network services, quickly locate bottlenecks, and make informed architectural decisions. The initial failure was caused by treating Go like a thread‑per‑connection model and spawning unnecessary goroutine pairs. Keep the code simple, let the runtime handle asynchronous I/O, and you can reliably achieve million‑connection scalability.

Tags: Go, Performance Tuning, High Concurrency, Network Programming, Netpoller