Root Cause Analysis of Connection Reset by Peer in a Go Backend Service
This article details a production incident in which a Go backend service answered clients with "connection reset by peer": a saturated database connection pool caused requests to queue and exhaust the process's file descriptors, so new connections could no longer be accepted. It walks through the step-by-step troubleshooting, the socket internals involved, and the eventual fix.
Background
A colleague reported that a client request to a core service was repeatedly failing with "connection reset by peer" in production, prompting an urgent high‑priority investigation.
1. Immediate Damage Control
The issue affected only a subset of instances, suggesting an environmental or resource problem. Since a simple restart restored service availability, the operations team restarted the affected instances immediately while preserving the faulty instances for later analysis.
2. Problem Identification
Steps taken:
Verified that the client request consistently reproduced the error on the problematic instance.
Checked logs and found no access log entries, only the reset error.
Used curl -v 'http://10.xx.xx.35:2133/xx/xx/checkalive' to reproduce the issue.
Ran tail -f ./log/xxx.log to confirm lack of logging.
Captured packets with tcpdump (e.g., /usr/sbin/tcpdump -i eth0 -n -nn host 10.xx.xx.35 ) and observed that the three‑way handshake completed, but the server responded with a reset instead of an ACK during data transmission.
3. Socket Workflow Analysis
The server uses Go's net/http package, which accepts connections via net/tcpsock.go and internal/poll/fd_unix.go. When the accept call fails (here with EMFILE, because the process had exhausted its file descriptors), connections that have already completed the handshake back up in the kernel's accept queue. Once that queue is full the kernel stops admitting new connections: with net.ipv4.tcp_abort_on_overflow set it answers with a reset immediately, and even without it the client eventually receives a reset when it sends data on a connection the server has silently dropped. This matches the capture: the three-way handshake completed, then the data segment was answered with RST.
The accept call chain:

net/http/server.go             func (srv *Server) Serve(l net.Listener) error
net/tcpsock.go                 func (l *TCPListener) AcceptTCP() (*TCPConn, error)
net/tcpsock_posix.go           func (ln *TCPListener) accept() (*TCPConn, error)
net/fd_unix.go                 func (fd *netFD) accept() (netfd *netFD, err error)
internal/poll/fd_unix.go       func (fd *FD) Accept() (int, syscall.Sockaddr, string, error)
internal/poll/sock_cloexec.go  func accept(s int) (int, syscall.Sockaddr, string, error)
4. Resource Exhaustion Check
Using netstat -an | grep <port>, ss -ant | grep <port>, lsof -p <pid>, and ulimit -a, the team discovered that the process had hit its file-descriptor limit (10240). Many connections sat in CLOSE_WAIT state, but the dominant factor was the exhausted descriptor count.
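The same descriptor accounting can be done from inside the process. A Linux-specific sketch that counts /proc/self/fd entries and compares them against the soft rlimit, mirroring what ls /proc/$pid/fd/ | wc -l and ulimit -n report from the shell:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// openFDs counts the descriptors currently held by this process by
// listing /proc/self/fd (Linux-specific).
func openFDs() (int, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	n, err := openFDs()
	if err != nil {
		fmt.Println("no /proc on this system:", err)
		return
	}
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		fmt.Println("getrlimit:", err)
		return
	}
	fmt.Printf("open=%d soft_limit=%d\n", n, lim.Cur)
}
```

Exporting these two numbers as metrics would have turned this incident into an alert long before descriptors ran out.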
5. CLOSE_WAIT Analysis
Excessive CLOSE_WAIT connections indicated that the server was not finishing the TCP teardown: the client had sent its FIN, but the application never closed its side of the socket, typically because it kept the connection open after the client was done, leaking descriptors.
6. Root Cause – Database Connection Pool Saturation
Monitoring revealed a sudden increase in request latency, traced to a DB query slowdown. The Go DB connection pool reached its maximum open connections, causing requests to queue, consume more descriptors, and eventually fill the process descriptor table, resulting in the reset behavior.
Relevant Go struct:
type DBStats struct {
    MaxOpenConnections int           // pool ceiling (SetMaxOpenConns)

    OpenConnections int              // established connections, in use plus idle
    InUse           int              // connections currently executing queries
    Idle            int              // idle connections

    WaitCount    int64               // total number of connections waited for
    WaitDuration time.Duration       // total time blocked waiting for a connection

    MaxIdleClosed     int64          // connections closed due to SetMaxIdleConns
    MaxLifetimeClosed int64          // connections closed due to SetConnMaxLifetime
}

7. Fix and Verification
Raising the DB pool's maximum open connections removed the bottleneck; load testing against the old, low limit reproduced the resets, confirming the hypothesis.
8. Lessons Learned
Promptly restart services to stop damage while preserving state for analysis.
Maintain comprehensive logs and metrics (CPU, memory, descriptors, latency) for post‑mortem.
Monitor DB connection pool health and adjust limits according to traffic growth.
Understand TCP state transitions and be proficient with Linux tools (curl, tcpdump, netstat, ss, lsof, ulimit).
Implement robust alerting for descriptor usage and CLOSE_WAIT spikes.
Common Diagnostic Commands
curl -v 'http://10.xx.xx.35:21xx/xx/xx/checkalive'
whereis tcpdump
ifconfig
/usr/sbin/tcpdump -i eth0 -n -nn host 10.xx.xx.35
netstat -an | grep xxxx
ps -ef | grep xxx
lsof -p xxx
ulimit -a
pmap -x xxx
cat /proc/$pid/smaps
strace -p $pid
pstack $pid
ls /proc/$pid/fd/ | wc -l

360 Quality & Efficiency
360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.