How Go’s Netpoller Powers Millions of Connections – 5 Real‑World Cases
This article explains why Go programs often fail to reach C10M concurrency, analyzes five real‑world incidents, reveals the inner workings of Go's netpoller, and provides concrete code‑level optimizations, configuration tweaks, and load‑testing practices to achieve stable million‑connection services.
Abstract: Go can theoretically handle millions of concurrent connections, but many services fail due to misuse of the netpoller. This summary extracts the key technical findings from five real‑world cases, explains Go's network model and the netpoller implementation, and provides concrete optimisation patterns for building stable million‑connection services.
1. Real‑world Failure: 85 k Connections on a 32‑core Server
Monitoring showed 85 000 established TCP connections, 95 % CPU usage and 60 GB of memory on a 32‑core, 128 GB RAM machine. lsof reported roughly two file descriptors per connection, and the code spawned two goroutines per connection, resulting in roughly 170 000 goroutines. The excessive goroutine count caused massive context‑switch overhead and CPU saturation.
$ lsof -p 12345 | wc -l
165432 # ~160 k file descriptors
$ netstat -an | grep ESTABLISHED | wc -l
85234 # actual connections
# Approx. 2 FDs per connection
The problem code created a separate goroutine for reading and writing:
// ❌ two goroutines per connection
func handleConn(conn net.Conn) {
go readLoop(conn) // goroutine 1
go writeLoop(conn) // goroutine 2
}
Fix – handle the connection in a single goroutine using buffered I/O and an explicit loop:
// ✅ single goroutine per connection
func handleConn(conn net.Conn) {
defer conn.Close()
reader := bufio.NewReader(conn)
writer := bufio.NewWriter(conn)
for {
msg, err := readMessage(reader)
if err != nil { return }
response := processMessage(msg)
writeMessage(writer, response)
writer.Flush()
}
}
After the change, CPU dropped to 15 %, memory to 8 GB, and the service supported >500 k connections.
2. Three Secrets of Go’s Network Model
Secret 1 – Synchronous Code, Asynchronous Execution
Application code appears synchronous (e.g., conn.Read(buf) blocks), but the runtime performs the following steps:
1. The read system call returns EAGAIN when data is not ready.
2. The file descriptor is added to epoll/kqueue.
3. The current goroutine parks (gopark) while the OS thread continues executing other goroutines.
4. When the fd becomes ready, the poller wakes the goroutine, which resumes the read.
This model combines the simplicity of synchronous code with the scalability of event‑driven I/O.
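All the application ever writes is a plain, blocking‑looking loop. A minimal sketch of such a handler (handle is a hypothetical application callback, not something from the runtime):
// Synchronous-looking code; the runtime turns the blocking Read into a
// park/wake cycle on the netpoller behind the scenes.
func serve(conn net.Conn, handle func([]byte)) {
    defer conn.Close()
    buf := make([]byte, 4096)
    for {
        n, err := conn.Read(buf) // parks this goroutine on EAGAIN, resumes when the fd is ready
        if err != nil {
            return
        }
        handle(buf[:n])
    }
}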
Secret 2 – Extremely Low Memory Footprint
Memory consumption for 1 million connections:
Thread‑per‑connection: 1 MB stack × 1 M = 1 TB (impractical).
Node.js (single thread): ~2‑4 GB, but cannot utilise multiple cores.
Go netpoller: 8 KB goroutine stack + ~17 KB TCP buffers ≈ 25 KB per connection, ~25 GB total.
In a 64 GB, 16‑core test the process used 28 GB and 8‑12 % CPU, confirming feasibility.
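A quick back‑of‑the‑envelope check of that figure (a sketch; the 8 KB stack and ~17 KB buffer numbers are the assumptions stated above):
// Per-connection cost assumed above: goroutine stack + kernel TCP buffers.
const (
    connections = 1_000_000
    perConnKB   = 8 + 17 // 8 KB stack + ~17 KB socket buffers
)

// estimateGB projects the total footprint: ~23.8 GB, i.e. roughly 25 GB.
func estimateGB() float64 {
    return float64(connections*perConnKB) / (1 << 20) // KB → GB
}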
Secret 3 – Scheduler Overhead Much Lower Than Threads
Typical costs:
Thread switch: ~1‑2 µs (kernel‑user transition, TLB flush).
Goroutine switch: ~0.2 µs (user‑mode only).
For 1 M connections each switching once per second, thread switching would consume 1‑2 CPU‑seconds every second (one to two full cores), whereas goroutine switching consumes only about 0.2 CPU‑seconds.
3. Netpoller Working Principle
Core Data Structure (simplified)
// runtime/netpoll.go
type pollDesc struct {
rg uintptr // read‑waiting goroutine
wg uintptr // write‑waiting goroutine
rd int64 // read timeout
wd int64 // write timeout
}
var (
epfd int32 // Linux epoll fd
kqfd int32 // BSD/kqueue fd
)
Read Flow Example
1. User code: conn.Read(buf)
2. runtime: internal/poll.(*FD).Read()
3. Set fd non‑blocking (syscall.SetNonblock)
4. Attempt read → EAGAIN?
├─ No → return data
└─ Yes → add fd to epoll/kqueue (epoll_ctl)
5. Park current goroutine (gopark)
6. Background poller: epoll_wait() → fd ready
7. Wake goroutine (goready) and place it on run queue
8. Retry syscall.Read() → success
The netpoller is tightly integrated with the scheduler; no separate event‑loop thread is required.
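To make the flow tangible, here is a hand‑rolled, Linux‑only sketch of the same non‑blocking‑read‑plus‑epoll mechanics using golang.org/x/sys/unix. It is illustrative only: the runtime parks the goroutine instead of blocking a thread in epoll_wait, and it shares one poller across all fds rather than creating one per read.
//go:build linux

package sketch

import "golang.org/x/sys/unix"

// readWhenReady mimics steps 3–8 above: try a non-blocking read, and on
// EAGAIN wait for epoll to report the fd readable before retrying.
func readWhenReady(fd int, buf []byte) (int, error) {
    if err := unix.SetNonblock(fd, true); err != nil {
        return 0, err
    }
    epfd, err := unix.EpollCreate1(0)
    if err != nil {
        return 0, err
    }
    defer unix.Close(epfd)

    ev := unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(fd)}
    if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, fd, &ev); err != nil {
        return 0, err
    }
    events := make([]unix.EpollEvent, 1)
    for {
        n, err := unix.Read(fd, buf)
        if err == nil {
            return n, nil // data was already available
        }
        if err != unix.EAGAIN {
            return 0, err
        }
        // Where the runtime would gopark() the goroutine, this sketch blocks
        // the whole thread in epoll_wait until the fd becomes readable.
        if _, err := unix.EpollWait(epfd, events, -1); err != nil && err != unix.EINTR {
            return 0, err
        }
    }
}
The difference at step 5 is the whole point: the sketch blocks an OS thread, whereas the runtime parks only the goroutine, so a handful of threads can service millions of parked connections.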
4. Five Optimisation Cases for Million‑Connection Services
Case 1 – Avoid Multiple Goroutines per Connection
Creating two goroutines per connection doubles memory usage (8 KB stack each). The recommended pattern is a single goroutine handling both read and write, as shown in the fix in Section 1, reducing memory by ~50 %.
Case 2 – Buffer Size Tuning
Using a tiny 128‑byte buffer generates millions of syscalls. Switching to a 32 KB buffered reader/writer reduces syscalls dramatically and can increase throughput by an order of magnitude.
reader := bufio.NewReaderSize(conn, 32*1024)
writer := bufio.NewWriterSize(conn, 32*1024)
Empirical recommendations (based on message size):
Small messages (<1 KB): 4‑8 KB buffer.
Medium messages (1‑10 KB): 16‑32 KB buffer.
Large messages (>100 KB): 64‑128 KB buffer.
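Those recommendations can be captured in a small helper. The thresholds mirror the list above; the helper itself is illustrative and not from the original article:
// bufferSize picks a bufio reader/writer size from the expected message size.
func bufferSize(expectedMsgBytes int) int {
    switch {
    case expectedMsgBytes < 1<<10: // small messages (<1 KB)
        return 8 << 10 // 8 KB
    case expectedMsgBytes <= 10<<10: // medium messages (1–10 KB)
        return 32 << 10 // 32 KB
    default: // larger messages
        return 128 << 10 // upper end of the 64–128 KB range
    }
}
It would then feed bufio.NewReaderSize and bufio.NewWriterSize exactly as in the snippet above.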
Case 3 – Set Timeouts to Prevent Resource Leaks
Connections without read/write deadlines can stay idle indefinitely, allowing slow‑loris attacks to exhaust file descriptors. Apply per‑operation deadlines or an idle timer.
for {
conn.SetReadDeadline(time.Now().Add(30 * time.Second))
n, err := conn.Read(buf)
if err != nil { return }
// process buf[:n]
}
Case 4 – Connection Pool for Outbound Connections
Creating a new TCP connection for every request incurs a three‑way handshake and consumes ports (TIME_WAIT). A simple pool reuses net.Conn objects and falls back to a factory when the pool is empty.
type ConnPool struct {
conns chan net.Conn
factory func() (net.Conn, error)
maxConns int
}
func (p *ConnPool) Get() (net.Conn, error) { /* … */ }
func (p *ConnPool) Put(conn net.Conn) { /* … */ }
var dbPool = NewConnPool(100, func() (net.Conn, error) { return net.Dial("tcp", "db:3306") })
The standard library http.Client already embeds a connection pool; creating a new client per request should be avoided.
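For completeness, one plausible implementation of the pool's elided constructor and Get/Put methods (a sketch: it assumes Get falls back to the factory when the pool is empty and Put closes the connection when the pool is full; the article's actual version is not shown):
func NewConnPool(maxConns int, factory func() (net.Conn, error)) *ConnPool {
    return &ConnPool{
        conns:    make(chan net.Conn, maxConns),
        factory:  factory,
        maxConns: maxConns,
    }
}

// Get returns a pooled connection, or dials a fresh one if the pool is empty.
func (p *ConnPool) Get() (net.Conn, error) {
    select {
    case conn := <-p.conns:
        return conn, nil
    default:
        return p.factory()
    }
}

// Put returns a connection to the pool, closing it if the pool is already full.
func (p *ConnPool) Put(conn net.Conn) {
    select {
    case p.conns <- conn:
    default:
        conn.Close()
    }
}
Callers pair every Get with a Put once the request completes, so healthy connections keep circulating instead of being re‑dialled.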
Case 5 – System‑level Tuning and Listener Configuration
Linux kernel parameters required for million‑connection servers:
File‑descriptor limit: ulimit -n 1048576
TCP buffers: net.core.rmem_max = 134217728 and net.core.wmem_max = 134217728
Reuse TIME_WAIT: net.ipv4.tcp_tw_reuse = 1
Connection queue: net.core.somaxconn = 65535 and net.ipv4.tcp_max_syn_backlog = 8192
Port range: net.ipv4.ip_local_port_range = 10000 65535
The listener can be created with SO_REUSEADDR and SO_REUSEPORT via net.ListenConfig.Control. Individual connections should enable TCP_NODELAY, keep‑alive and a reasonable keep‑alive period.
func createListener(addr string) (net.Listener, error) {
    lc := net.ListenConfig{Control: func(network, address string, c syscall.RawConn) error {
        return c.Control(func(fd uintptr) {
            // SO_REUSEPORT is not exported by the syscall package on Linux,
            // so both options are set via golang.org/x/sys/unix.
            unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEADDR, 1)
            unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
        })
    }}
    return lc.Listen(context.Background(), "tcp", addr)
}
func configConn(conn net.Conn) error {
if tcp, ok := conn.(*net.TCPConn); ok {
tcp.SetNoDelay(true)
tcp.SetKeepAlive(true)
tcp.SetKeepAlivePeriod(3 * time.Minute)
}
return nil
}
5. Load‑Testing Practice
Server (echo) implementation
package main
import (
"bufio"
"flag"
"fmt"
"log"
"net"
"runtime"
"sync/atomic"
"time"
)
var (
addr = flag.String("addr", ":8080", "listen address")
conns int64
)
func main() {
flag.Parse()
ln, err := net.Listen("tcp", *addr)
if err != nil { log.Fatal(err) }
log.Printf("Listening on %s", *addr)
go monitor()
for {
conn, err := ln.Accept()
if err != nil { continue }
atomic.AddInt64(&conns, 1)
go handleConn(conn)
}
}
func handleConn(conn net.Conn) {
defer func(){ conn.Close(); atomic.AddInt64(&conns, -1) }()
if tcp, ok := conn.(*net.TCPConn); ok {
tcp.SetNoDelay(true)
tcp.SetKeepAlive(true)
tcp.SetKeepAlivePeriod(3 * time.Minute)
}
reader := bufio.NewReaderSize(conn, 4096)
writer := bufio.NewWriterSize(conn, 4096)
for {
conn.SetReadDeadline(time.Now().Add(5 * time.Minute))
line, err := reader.ReadBytes('\n')
if err != nil { return }
writer.Write(line)
writer.Flush()
}
}
func monitor() {
ticker := time.NewTicker(5 * time.Second)
defer ticker.Stop()
for range ticker.C {
var m runtime.MemStats
runtime.ReadMemStats(&m)
fmt.Printf("Connections: %d, Goroutines: %d, Memory: %.2f GB
",
atomic.LoadInt64(&conns), runtime.NumGoroutine(), float64(m.Alloc)/1<<30)
}
}
Client (stress) implementation
package main
import (
"flag"
"log"
"net"
"sync"
"sync/atomic"
"time"
)
var (
target = flag.String("target", "localhost:8080", "target server")
count = flag.Int("c", 100000, "number of connections")
rate = flag.Int("rate", 1000, "connections per second")
established int64
errors int64
)
func main() {
flag.Parse()
log.Printf("Target: %s, Connections: %d", *target, *count)
go monitor()
ticker := time.NewTicker(time.Second / time.Duration(*rate))
defer ticker.Stop()
var wg sync.WaitGroup
for i := 0; i < *count; i++ {
<-ticker.C
wg.Add(1)
go func(){
defer wg.Done()
if err := connect(); err != nil { atomic.AddInt64(&errors, 1) }
}()
}
wg.Wait()
select {}
}
func connect() error {
conn, err := net.DialTimeout("tcp", *target, 10*time.Second)
if err != nil { return err }
atomic.AddInt64(&established, 1)
go func(){
defer conn.Close()
ticker := time.NewTicker(30 * time.Second)
defer ticker.Stop()
for range ticker.C {
conn.SetWriteDeadline(time.Now().Add(5 * time.Second))
if _, err := conn.Write([]byte("ping
")); err != nil { return }
buf := make([]byte, 5)
conn.SetReadDeadline(time.Now().Add(5 * time.Second))
if _, err := conn.Read(buf); err != nil { return }
}
}()
return nil
}
func monitor() {
ticker := time.NewTicker(5 * time.Second)
defer ticker.Stop()
for range ticker.C {
log.Printf("Established: %d, Errors: %d", atomic.LoadInt64(&established), atomic.LoadInt64(&errors))
}
}
Typical test steps:
1. Apply the kernel parameters listed in Case 5.
2. Run the server: go run server.go -addr :8080.
3. Run the client on another host: go run client.go -target <server-ip>:8080 -c 1000000 -rate 5000.
4. Observe server metrics (connections, goroutine count, memory) and client metrics (established connections, errors).
6. Core Takeaways
Advantages of Go’s Network Model
Developer‑friendly : Write straightforward synchronous code.
High performance : Runtime converts it to non‑blocking asynchronous execution.
Scalable : Same code works from a few connections up to millions.
Multi‑core aware : Scheduler automatically balances work across CPUs.
Key Factors for Million‑Connection Services
One goroutine per connection (avoid extra goroutines).
Buffer size 16‑32 KB for typical workloads.
Set appropriate read/write and idle timeouts.
Tune OS limits (ulimit, sysctl).
Reuse outbound connections with a pool.
Continuously monitor and load‑test.
Common Misconceptions
More goroutines always improve throughput – false.
Timeouts are optional – false.
Default OS/network settings are sufficient for millions of connections – false.
Skipping load‑testing is safe – false.
Optimisation Roadmap
Stage 1: Functional correctness – get the service running.
Stage 2: Single‑connection tuning – buffers, deadlines, error handling.
Stage 3: Concurrency tuning – goroutine count, connection pool.
Stage 4: System tuning – kernel parameters, monitoring.
Stage 5: Load‑test validation – identify bottlenecks, iterate.
7. Final Thoughts
Understanding the netpoller lets you write high‑performance network services, quickly locate bottlenecks, and make informed architectural decisions. The initial failure was caused by treating Go like a thread‑per‑connection model and spawning unnecessary goroutine pairs. Keep the code simple, let the runtime handle asynchronous I/O, and you can reliably achieve million‑connection scalability.
Code Wrench
Focuses on code debugging, performance optimization, and real-world engineering, sharing efficient development tips and pitfall guides. We break down technical challenges in a down-to-earth style, helping you craft handy tools so every line of code becomes a problem‑solving weapon. 🔧💻