Root Cause Analysis of Dubbo Connect Timeout in High‑Concurrency Scenarios and Backlog Tuning
This article presents a detailed case study of intermittent Dubbo connect-timeout errors in a high-concurrency deployment. It walks through the diagnostics step by step, from port-status checks and registry verification to tcpdump packet analysis, and explains how enlarging the server's backlog, and with it the accept queue, eliminated the SYN drops.
Problem Background
A core service in Ctrip's vacation division runs on 80 Docker containers (4C8G each) across two data centers, serving over 1,300 client machines. After switching from HTTP to TCP via CDubbo, occasional client connect-timeout errors appeared during deployments.
Investigation Steps
1. Port Opening Check: Verified that CDubbo opens its port synchronously before registering with the registry, ruling out the possibility that clients were handed an address whose port was not yet open.
2. Registry Push Verification: Examined the Dubbo logs; port-opening timestamps (e.g., 16:57:19) preceded the failed connections (16:57:51), indicating the registry was not at fault.
3. Port Closure Hypothesis: Added a shell script to poll the port status every second. The port remained in LISTEN state throughout, so it was not being closed unexpectedly.
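A minimal sketch of such a poll (the original script was not preserved; the regex and the one-second interval are assumptions, and the port stays elided as 20xxx):

```shell
#!/bin/sh
# Report whether a port appears in LISTEN state, given `ss -lnt`-style
# output on stdin; meant to be called once per second in a loop.
check_listen() {
  # $1: port number to look for in the local-address column
  if grep -Eq "^LISTEN.*[:.]$1[[:space:]]"; then
    echo LISTEN
  else
    echo ABSENT
  fi
}

# Demo against a canned ss line; prints LISTEN
printf 'LISTEN 0 50 *:8080 *:*\n' | check_listen 8080

# Live usage: while true; do ss -lnt | check_listen 20xxx; sleep 1; done
```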
4. Accept-Log Enhancement: Inserted logging in Netty's channelConnected callback to record when connections were accepted. The logs showed that some connections were accepted while others were rejected.
5. Server-Side TCP Dump: Captured packets on the server; SYN packets arrived but no SYN-ACK was returned, suggesting the loss occurred in the kernel before the connection could be accepted.
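A capture along these lines makes the drop visible: an accepted connection shows SYN followed by SYN-ACK, while a rejected one shows a SYN that is never answered. The interface name and port below are placeholders, not values from the original investigation:

```shell
# pcap filter matching only packets with the SYN flag set (SYN and
# SYN-ACK), so unanswered SYNs stand out in the trace
FILTER='tcp[tcpflags] & tcp-syn != 0'

# Typical invocation (requires root; eth0 and 20xxx are placeholders):
#   tcpdump -i eth0 -nn "port 20xxx and $FILTER"
echo "$FILTER"
```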
6. Container-Side Connectivity Test: Ran a Bash script inside the container to telnet the service port repeatedly. Intermittent failures (a grep count of "0", meaning the telnet banner never appeared) confirmed that SYN packets were being dropped inside the container as well:
#!/bin/bash
# Probe the port 3600 times, roughly once every 0.1 s; grep -c prints 1
# when the telnet banner ("Escape character is ...") appears, i.e. the
# connection was accepted, and 0 when it was not.
for i in `seq 1 3600`
do
    t=`timeout 0.1 telnet localhost 20xxx 2>&1 | grep -c 'Escape character is'`
    echo $(date) "20xxx check result:" $t
    sleep 0.005
done

7. SYN Queue Overflow Analysis: netstat -s reported 3220 listen-queue overflows and 3220 SYNs to LISTEN sockets dropped, indicating the accept (full-connection) queue was saturated while the SYN (half-connection) queue was not overflowing.
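The two counters in step 7 come from lines that netstat -s prints verbatim; a short awk filter (a sketch, assuming the standard Linux wording of those lines) pulls them out for monitoring:

```shell
#!/bin/sh
# Extract the accept-queue pressure counters from `netstat -s` output on
# stdin. Both counters increment together when SYNs are discarded because
# the accept queue of a listening socket is already full.
queue_drops() {
  awk '/times the listen queue of a socket overflowed/ { print "overflows", $1 }
       /SYNs to LISTEN sockets dropped/                { print "syn_drops", $1 }'
}

# Demo against the counts seen in step 7; live usage: netstat -s | queue_drops
printf '3220 times the listen queue of a socket overflowed\n3220 SYNs to LISTEN sockets dropped\n' | queue_drops
```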
8. Backlog Configuration: Examined ss -lnt output; the accept-queue capacity (the Send-Q column of a listening socket) was 50, far below the kernel's somaxconn cap of 128. Netty 3 defaults to a backlog of 50, whereas Netty 4 uses 1024.
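The queue the kernel actually uses is not the application's backlog alone: on Linux, listen() clamps the backlog to net.core.somaxconn, so both values must be raised together. A sketch of the clamp:

```shell
#!/bin/sh
# Effective accept-queue limit = min(application backlog, net.core.somaxconn).
# With Netty 3's default backlog of 50 and a somaxconn of 128, `ss -lnt`
# shows 50 in Send-Q; raising only the application backlog to 1024 while
# somaxconn stays 128 yields an effective limit of 128, not 1024.
effective_backlog() {
  if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

effective_backlog 50 128     # prints 50
effective_backlog 1024 128   # prints 128

# The live cap can be read with: sysctl net.core.somaxconn
```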
9. Backlog Adjustment Experiment: Tested various backlog values on an 8-core server with 10 client containers. Results:

Backlog    Connections/s    SYN drops?
128        3,000            No
128        5,000            Few
1024       5,000            No
1024       10,000           No
Increasing the backlog to 1024 eliminated SYN drops even at 10,000 connections per second.
Conclusion
The connect‑timeout issue was caused by the server’s accept queue being full, leading the kernel to drop incoming SYN packets. Adjusting the Netty backlog (and consequently the accept queue) to a higher value resolved the problem, demonstrating the importance of proper socket backlog tuning in high‑concurrency backend services.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.