
Root Cause Analysis of Connection Reset by Peer in a Go Backend Service

This article details a production incident where a Go backend service returned "connection reset by peer" due to exhausted process file descriptors caused by a saturated database connection pool, and describes the step‑by‑step troubleshooting, socket internals, and the eventual fix.

360 Quality & Efficiency

Background

A colleague reported that a client request to a core service was repeatedly failing with "connection reset by peer" in production, prompting an urgent high‑priority investigation.

1. Immediate Damage Control

The issue affected only a subset of instances, suggesting an environmental or resource problem. The team first verified that a simple restart restored service availability, so OPS performed a rapid restart while preserving the faulty instances for later analysis.

2. Problem Identification

Steps taken:

Verified that the client request consistently reproduced the error on the problematic instance.

Checked logs and found no access log entries, only the reset error.

Used curl -v 'http://10.xx.xx.35:2133/xx/xx/checkalive' to reproduce the issue.

Ran tail -f ./log/xxx.log to confirm lack of logging.

Captured packets with tcpdump (e.g., /usr/sbin/tcpdump -i eth0 -n -nn host 10.xx.xx.35 ) and observed that the three‑way handshake completed, but the server responded with a reset instead of an ACK during data transmission.

3. Socket Workflow Analysis

The server uses Go's net/http package, which ultimately calls into net/tcpsock.go and internal/poll/fd_unix.go to accept connections. When the accept call fails — for example because the process has no free file descriptors or the accept queue overflows — pending connections back up and the kernel ends up sending a reset to the client.

net/http/server.go             func (srv *Server) Serve(l net.Listener) error
net/tcpsock.go                 func (l *TCPListener) AcceptTCP() (*TCPConn, error)
net/tcpsock_posix.go           func (ln *TCPListener) accept() (*TCPConn, error)
net/fd_unix.go                 func (fd *netFD) accept() (netfd *netFD, err error)
internal/poll/fd_unix.go       func (fd *FD) Accept() (int, syscall.Sockaddr, string, error)
internal/poll/sock_cloexec.go  func accept(s int) (int, syscall.Sockaddr, string, error)

4. Resource Exhaustion Check

Using netstat -an | grep <port> , ss -ant | grep <port> , lsof -p <pid> and ulimit -a , the team discovered that the per-process file-descriptor limit (10240) had been reached. Many connections sat in CLOSE_WAIT state, but the dominant factor was the exhausted descriptor count.
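Descriptor pressure of this kind can also be watched from inside the process itself. A minimal Linux-only sketch (countOpenFDs and fdLimit are hypothetical helper names, equivalent to `ls /proc/self/fd | wc -l` and `ulimit -n`):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// countOpenFDs returns the number of file descriptors the current
// process holds open, by listing /proc/self/fd (Linux only).
func countOpenFDs() (int, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

// fdLimit returns the soft limit on open descriptors (ulimit -n).
func fdLimit() (uint64, error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, err
	}
	return rl.Cur, nil
}

func main() {
	n, err := countOpenFDs()
	if err != nil {
		panic(err)
	}
	limit, err := fdLimit()
	if err != nil {
		panic(err)
	}
	fmt.Printf("open fds: %d / %d\n", n, limit)
	if float64(n) > 0.9*float64(limit) {
		fmt.Println("WARNING: descriptor usage above 90%")
	}
}
```

Exporting this ratio as a metric would have turned the silent climb toward 10240 into an alert long before clients saw resets.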

5. CLOSE_WAIT Analysis

Excessive CLOSE_WAIT connections indicated that the server was not completing the four‑way handshake termination, often because the application held the socket open after the client closed, leading to descriptor leakage.
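A sketch of the server-side discipline that avoids this pile-up — the handle and roundTrip helpers are illustrative, but the deferred conn.Close() is the part that matters:

```go
package main

import (
	"bufio"
	"fmt"
	"net"
)

// handle reads one line, replies, and always closes the connection.
// Forgetting conn.Close() is exactly what accumulates CLOSE_WAIT
// sockets: the kernel has already seen the client's FIN, but the
// descriptor stays allocated until the application releases it.
func handle(conn net.Conn) {
	defer conn.Close() // release the fd even on an early return
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return
	}
	fmt.Fprintf(conn, "echo: %s", line)
}

// roundTrip starts a throwaway listener, sends one line through it,
// and returns the server's reply.
func roundTrip(msg string) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()
	go func() {
		conn, err := ln.Accept()
		if err == nil {
			handle(conn)
		}
	}()
	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()
	fmt.Fprintf(conn, "%s\n", msg)
	return bufio.NewReader(conn).ReadString('\n')
}

func main() {
	reply, err := roundTrip("ping")
	if err != nil {
		panic(err)
	}
	fmt.Print(reply) // echo: ping
}
```

The same rule applies to every code path that can return early — errors, timeouts, panics recovered mid-handler — which is why defer, rather than a Close() at the end of the happy path, is the idiomatic form.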

6. Root Cause – Database Connection Pool Saturation

Monitoring revealed a sudden increase in request latency, traced to a DB query slowdown. The Go DB connection pool reached its maximum open connections, causing requests to queue, consume more descriptors, and eventually fill the process descriptor table, resulting in the reset behavior.

Relevant Go struct:

type DBStats struct {
    MaxOpenConnections int
    OpenConnections    int
    InUse              int
    Idle               int
    WaitCount          int64
    WaitDuration       time.Duration
    MaxIdleClosed      int64
    MaxLifetimeClosed  int64
}

7. Fix and Verification

Increasing the DB pool's maximum connections eliminated the bottleneck; load testing reproduced the issue when the limit was low, confirming the hypothesis.

8. Lessons Learned

Promptly restart services to stop damage while preserving state for analysis.

Maintain comprehensive logs and metrics (CPU, memory, descriptors, latency) for post‑mortem.

Monitor DB connection pool health and adjust limits according to traffic growth.

Understand TCP state transitions and be proficient with Linux tools (curl, tcpdump, netstat, ss, lsof, ulimit).

Implement robust alerting for descriptor usage and CLOSE_WAIT spikes.

Common Diagnostic Commands

curl -v 'http://10.xx.xx.35:21xx/xx/xx/checkalive'
whereis tcpdump
ifconfig
/usr/sbin/tcpdump -i eth0 -n -nn host 10.xx.xx.35
netstat -an | grep xxxx
ps -ef | grep xxx
lsof -p xxx
ulimit -a
pmap -x xxx
cat /proc/$pid/smaps
strace -p $pid
pstack $pid
ls /proc/$pid/fd/ | wc -l
Go · Network Troubleshooting · TCP · Database Connection Pool · Backend Debugging · Linux Tools · Connection Reset
Written by

360 Quality & Efficiency

360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.
