
Why Did Our Nginx Hit Connection Limits? A Deep Dive into Misdiagnosis and Rate‑Limiting Redesign

This postmortem explains how an Nginx connection‑saturation incident was initially misidentified as a traffic surge, details the metrics and command‑line checks that revealed a connection‑lifecycle failure, and describes the step‑by‑step redesign of rate‑limiting, budgeting, monitoring, and runbook procedures that restored stability.

Ops Community

The core problem can be summarized in a single sentence: we treated the incident as a "traffic spike" when it was actually a "connection‑lifecycle governance failure". Misidentifying the symptom drives the wrong actions: scaling out machines and raising connection limits, while the slow upstream requests that keep connections from being released go unnoticed until much later.

During the first 12 minutes the on‑call team observed rising 5xx errors, intermittent recoveries, and normal CPU usage, which led to the incorrect hypothesis of application‑instance jitter. At minute 12 we confirmed that active connections were approaching the worker_connections limit and that the reading/writing/waiting ratios were abnormal, indicating the connection watermark was approaching exhaustion.

[Incident window 20:08–20:36]
QPS:              21k -> 28k -> 24k
5xx:              0.3% -> 8.9% -> 1.1%
P99:              180ms -> 5.8s -> 320ms
active_conn:      9.8k -> 15.7k -> 10.2k
accepted/s:       2.2k -> 5.6k -> 2.4k
closed/s:         2.1k -> 3.0k -> 2.3k
Conclusion:       establishment rate consistently exceeds release rate; the connection watermark rises monotonically

We standardized a "first‑round three‑step" check to be executed within 30 seconds: fetch nginx_status, run ss -s, and tail the last 100 lines of the error log (a scripted sketch follows the list below). The three judgment criteria are:

Is active_conn persistently near the upper limit?

Is accepted/s consistently higher than closed/s?

Are writing or waiting abnormally high?
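For teams that want this first round scripted, a minimal sketch is shown below; the status URL and error‑log path are assumptions and should be adjusted to the local environment.

#!/usr/bin/env bash
# First-round three-step check in one shot; URL and log path are assumptions.
STATUS_URL="http://127.0.0.1/nginx_status"
ERROR_LOG="/var/log/nginx/error.log"
curl -s "$STATUS_URL"        # 1) connection state: active / reading / writing / waiting
ss -s                        # 2) socket summary: estab, timewait, synrecv
tail -n 100 "$ERROR_LOG" | \
  grep -E "worker_connections|limiting (requests|connections)|upstream timed out" | tail -n 10   # 3) limit/timeout messages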

Principle

The essence of Nginx connection incidents can be expressed as:

connection_watermark_change = establishment_rate - release_rate + idle_reservation - reclamation_efficiency

When upstream becomes slow, clients retry aggressively, keep‑alive is too permissive, and alerts only monitor CPU, the watermark keeps rising. Early symptoms mimic ordinary jitter – occasional timeouts, local 5xx, brief recoveries – until the connection limit is breached and the failure spreads from edge interfaces to core transaction paths.
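A quick way to confirm net inflow is to sample stub_status twice and compare the accepted‑connection delta with the change in active connections. A rough sketch, assuming the status endpoint used in the checks above:

S1=$(curl -s http://127.0.0.1/nginx_status); sleep 10
S2=$(curl -s http://127.0.0.1/nginx_status)
# stub_status line 1: "Active connections: N"; line 3: accepts / handled / requests counters
A1=$(echo "$S1" | awk 'NR==1{print $3}'); A2=$(echo "$S2" | awk 'NR==1{print $3}')
C1=$(echo "$S1" | awk 'NR==3{print $1}'); C2=$(echo "$S2" | awk 'NR==3{print $1}')
echo "accepted/s ~ $(( (C2 - C1) / 10 ))"
echo "active delta over 10s: $((A2 - A1))    # persistently positive => net inflow"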

A minimal sampling set must cover three dimensions: connection state, queue state, and upstream latency.

curl -s http://127.0.0.1/nginx_status    # connection state: active / reading / writing / waiting
ss -ant | head -n 30                     # queue state: per-socket TCP states (SYN_RECV, TIME_WAIT)
netstat -s | head -n 40                  # protocol counters: retransmits, resets, listen-queue overflows

Decision guidance (the commands behind it are sketched after the list):

If active_conn rises and release cannot keep up, prioritize fixing connection residency.

If SYN_RECV spikes, investigate the accept queue and handshake pressure.

If writing is high while upstream is slow, focus on timeout and retry tuning.
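The commands behind this guidance, as a sketch (the listener port 443 is an assumption):

ss -ant state syn-recv | tail -n +2 | wc -l              # handshake pressure: half-open connections
ss -lnt 'sport = :443'                                   # Recv-Q on the listener approximates accept-queue depth
netstat -s | grep -iE "overflow|SYNs to LISTEN"          # accept-queue overflows / dropped SYNs
curl -s http://127.0.0.1/nginx_status | grep Writing     # high writing + slow upstream => timeout/retry tuning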

Architecture

Before the incident the gateway architecture shared a single connection budget across all traffic and applied a one‑dimensional rate‑limit. During peak load, anonymous, retry, and login traffic competed for the same slots, causing non‑critical traffic to exhaust capacity and critical transactions to suffer.

We abstracted the old architecture as:

Unified entry (single connection pool for all paths)

Shared pool (no business‑level guarantees)

Alert lag (triggered by 5xx instead of connection watermark)

# Old architecture (pre‑incident)
Client → L4 SLB → Nginx Gateway (single pool)
  /api/public/*
  /api/order/*
  /api/user/*
  /api/pay/*
Risk: any traffic spike raises the overall connection watermark.

The problematic configuration was a rate‑limit without a concurrent‑connection limit and an oversized burst value, which let short‑term spikes accumulate in the gateway and, combined with a slow upstream, turn into long‑lived connections.

# Pre‑incident simplified config
limit_req_zone $binary_remote_addr zone=req_per_ip:20m rate=120r/s;
server {
    listen 443 ssl;
    location /api/ {
        limit_req zone=req_per_ip burst=200 nodelay;   # oversized burst, and no limit_conn at all
        proxy_connect_timeout 3s;
        proxy_read_timeout 60s;                        # a slow upstream can hold a connection for up to 60s
    }
}

The redesigned target architecture emphasizes "budget first, dual‑dimensional rate‑limit, core guarantee, rollback‑able changes". We split connection capacity into total, reserved, and shared budgets and defined per‑traffic caps and degradation order.

connection_budget:
  gateway_total: 16000
  reserved:
    login: 3000
    order_submit: 2500
    payment_callback: 1200
  shared:
    public_api: 5800
    internal_api: 3500

degrade_order:
  - public_api
  - low_priority_query
  - async_callback
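The reserved and shared pools are sized so that together they exactly fill the gateway total; a trivial consistency check with the values hard‑coded from the block above:

TOTAL=16000
RESERVED=$((3000 + 2500 + 1200))   # login + order_submit + payment_callback
SHARED=$((5800 + 3500))            # public_api + internal_api
echo "reserved + shared = $((RESERVED + SHARED)) of total $TOTAL"
[ $((RESERVED + SHARED)) -le "$TOTAL" ] && echo "budget consistent" || echo "over-allocated"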

At the Nginx layer we combine concurrent‑connection limits, request‑rate limits, traffic‑class maps, and upstream timeout convergence into a single policy package. The goal is not minimal latency but ensuring controllable behavior during abnormal windows.

limit_conn_zone $binary_remote_addr zone=conn_ip:20m;             # concurrent-connection dimension, keyed by client IP
limit_req_zone  $binary_remote_addr zone=req_ip:20m rate=40r/s;   # request-rate dimension, keyed by client IP

map $http_authorization $traffic_class {
    default anonymous;
    ~^Bearer login;          # requests with a bearer token are classed as logged-in traffic
}

map $traffic_class $conn_cap {
    anonymous 15;
    login 80;
}

server {
    listen 443 ssl;
    location /api/public/ {
        limit_conn conn_ip 15;                    # anonymous concurrency cap
        limit_req zone=req_ip burst=20 nodelay;
        proxy_read_timeout 8s;                    # converge timeouts on non-critical paths
    }
    location /api/order/ {
        limit_conn conn_ip 80;                    # core-path concurrency cap
        proxy_read_timeout 20s;
    }
}
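After a reload, the quickest way to confirm the policy package is actually live is to dump the effective configuration; a small sketch:

nginx -t && nginx -s reload
# nginx -T tests the configuration and dumps the full effective config, including includes
nginx -T 2>/dev/null | grep -nE "limit_conn |limit_req |proxy_read_timeout" | head -n 20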

Before deployment we added a "budget verification" step that checks worker_rlimit_nofile, ulimit, somaxconn, and alert thresholds; any mismatch aborts the release.
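A sketch of the abort‑on‑mismatch logic; the 2x ratio between the file‑descriptor limit and worker_connections is our own rule of thumb (one downstream plus one upstream socket per in‑flight request), not an Nginx requirement, and the paths are assumptions:

set -e
FD_LIMIT=$(awk '/Max open files/{print $4}' /proc/"$(cat /run/nginx.pid)"/limits)
WORKER_CONN=$(awk '$1=="worker_connections"{gsub(";","",$2); print $2; exit}' /etc/nginx/nginx.conf)
if [ "$FD_LIMIT" -lt $((WORKER_CONN * 2)) ]; then
    echo "ABORT: open-files limit $FD_LIMIT < 2 x worker_connections $WORKER_CONN" >&2
    exit 1
fi
echo "FD budget OK: $FD_LIMIT >= $((WORKER_CONN * 2))"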

Deep Case Study

Minute‑level timeline (2026‑01‑18 20:07‑20:36 UTC+8) shows the exact actions, outputs, and decisions.

Incident severity: P1
Affected services: gateway, order placement, query, login
Peak impact: Active 15.7k/16k, 5xx 8.9%, P99 5.8s

20:07‑20:09 – Warm‑up, weak signal but already abnormal

curl -s http://127.0.0.1/nginx_status
ss -s
Active connections: 12241
Reading: 166 Writing: 2194 Waiting: 9881
TCP: inuse 12602 tw 21931

20:10‑20:12 – First mis‑diagnosis (treated as compute issue)

The on‑call team checked pod resources and GC, found no CPU spike, and performed an ineffective restart, wasting the golden 2‑minute window.

kubectl top pod -n prod
kubectl logs deploy/app-order -n prod --since=5m | head -n 50
CPU: 40%~52%, memory stable
GC: no surge in Full GC observed
Conclusion: Not a compute bottleneck, switch to connection view.

20:13‑20:15 – Evidence shifts to connection structure

for i in {1..5}; do
  date +%H:%M:%S
  curl -s http://127.0.0.1/nginx_status | sed -n '1,4p'
  sleep 10
done

Active connections rose from 13.6k to 14.9k, writing grew from 2.8k to 3.5k, confirming rapid establishment and slow release.

20:16‑20:18 – Confirm upstream slowdown and blocked release

tail -n 2000 /var/log/nginx/access.log | tail -n 20
tail -n 300 /var/log/nginx/error.log | tail -n 30
Key observations:
1) upstream_response_time P99 ↑ from 320ms to 4.2s
2) upstream timed out spikes within 2 min
3) writing stays > 3k
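How observation (1) was obtained: pull the upstream time out of the recent access‑log lines and sort it. A sketch, assuming $upstream_response_time is logged as the last field of the access‑log format:

tail -n 2000 /var/log/nginx/access.log | \
  awk '{print $NF}' | grep -E '^[0-9.]+$' | sort -n | \
  awk '{v[NR]=$1} END{if (NR) {i=int(NR*0.99); if (i<1) i=1; print "upstream p99 ~", v[i], "s"}}'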

20:19‑20:21 – First round of bleeding control (anonymous traffic)

location /api/public/ {
    limit_conn conn_ip 15;
    limit_req zone=req_ip burst=20 nodelay;
    proxy_read_timeout 12s;
}
location /api/order/submit {
    limit_conn conn_ip 80;
    proxy_read_timeout 25s;
}
nginx -t
nginx -s reload
curl -s http://127.0.0.1/nginx_status

20:22‑20:25 – Second round (tighten reclamation)

keepalive_timeout 20s;
keepalive_requests 500;
reset_timedout_connection on;
client_header_timeout 10s;
client_body_timeout 10s;
send_timeout 12s;
nginx -t
nginx -s reload
ss -s

20:26‑20:30 – Observation window (no further changes)

for i in {1..4}; do
  date +%H:%M
  curl -s http://127.0.0.1/nginx_status | sed -n '1,4p'
  sleep 60
done
20:27 Active 15102
20:28 Active 14680
20:29 Active 13911
20:30 Active 13126
Trend: connection watermark begins to fall.

20:31‑20:36 – Recovery and verification

curl -s http://127.0.0.1/nginx_status
ss -s
tail -n 50 /var/log/nginx/error.log
5xx: 8.9% -> 1.1%
P99: 5.8s -> 320ms
Core transaction path was not throttled out by the emergency limits.

Three key decision points:

First pivot: from application view to connection view.

Second pivot: stop bleeding before full recovery.

Third pivot: change only one variable per round and keep a verification window.

Small Cases

Three concise templates demonstrate the "symptom‑command‑fix‑verify" workflow.

Case 1 – Short‑connection storm

active_conn: 10.8k -> 14.9k
accept_rate: 2.1k/s -> 6.2k/s
close_rate: 2.0k/s -> 3.1k/s
TIME_WAIT: 12k -> 48k
# Investigation
sar -n TCP,ETCP 1 10
# Fix
keepalive_timeout 20s;
keepalive_requests 2000;
# Verify
python3 check_metrics.py --metric conn_create_rate --window 10m

Case 2 – Upstream slow causing connection hold

QPS: 22k -> 23k (stable)
active_conn: 11.2k -> 15.4k
writing: 900 -> 4200
upstream_p99: 0.4s -> 6.2s
# Investigation
awk '{print $(NF-2)}' /var/log/nginx/upstream_timing.log | sort -n | tail -n 20
# Fix
proxy_read_timeout 3s;
proxy_next_upstream error timeout http_502 http_503 http_504;
# Verify
python3 check_metrics.py --metric upstream_p99 --window 15m

Case 3 – Threshold mis‑configuration

limit_conn hit rate: 0.2% -> 7.8%
HTTP 499: 0.4% -> 5.6%
Core order success: 99.3% -> 93.1%
# Investigation
grep -n "limiting connections" /var/log/nginx/error.log | tail -n 30
# Fix
map $request_uri $conn_cap {
    default 30;
    ~^/api/order/create 200;
    ~^/api/payment 200;
}
location /api/ { limit_conn conn_ip $conn_cap; }
# Verify
python3 check_metrics.py --metric core_api_success_rate --window 10m

Engineering Checklist

Monitoring & Alert List

monitoring:
  must_have:
    - active_connections
    - conn_create_rate
    - conn_release_rate
    - upstream_p99
    - status_5xx_rate
    - time_wait_count
alerts:
  - name: conn_watermark_high
    rule: active_connections > 0.85 * conn_budget for 3m
    level: P1
  - name: conn_turnover_degrade
    rule: conn_create_rate > conn_release_rate for 5m
    level: P1
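The conn_watermark_high rule can be spot‑checked from the command line before it is wired into the alerting stack; a sketch, with the budget taken from the earlier connection_budget block:

BUDGET=16000
ACTIVE=$(curl -s http://127.0.0.1/nginx_status | awk 'NR==1{print $3}')
awk -v a="$ACTIVE" -v b="$BUDGET" \
  'BEGIN{printf "active=%d (%.1f%% of budget)%s\n", a, 100*a/b, (a > 0.85*b) ? "  -> above the P1 threshold" : ""}'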

Release Gate Checklist

# Pre‑release checks
nginx -t
sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
cat /proc/$(cat /run/nginx.pid)/limits | grep -i "open files"
python3 lint_nginx_limit.py /etc/nginx/nginx.conf
python3 precheck_conn_budget.py --service edge-gw
# Gate criteria
- Config syntax passes
- backlog matches kernel parameters
- FD budget margin >= 20%
- Rate‑limit policy includes high/low tiers

On‑call Runbook

# Step 1
curl -s http://127.0.0.1:18080/nginx_status
ss -s
# Step 2 – top error codes
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head
# Step 3 – backup & reload
cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.rollback
nginx -t && nginx -s reload

Action Items

Short‑term (1‑2 weeks): launch tiered rate‑limit policy (Owner: sre‑li, DDL: 2026‑03‑05, acceptance: 5xx peak ↓ 40%).

Short‑term: ship connection‑watermark dashboard (Owner: obs‑zhang, DDL: 2026‑03‑08, acceptance: 8 metrics covered).

Mid‑term (1‑2 months): upstream slow‑query remediation (Owner: app‑wang, DDL: 2026‑04‑20, acceptance: upstream P99 < 300 ms).

Mid‑term: platform‑wide tiered‑limit rollout (Owner: platform‑chen, DDL: 2026‑04‑30, acceptance: one‑click gray‑scale & rollback).

Long‑term (quarter+): incorporate connection‑budget review into change‑approval process (Owner: arch‑liu, DDL: 2026‑06‑30, acceptance: 100 % of new services pass budget review).

Pitfalls

Only watching CPU hides connection‑turnover problems – the early signal is "establishment rate > release rate".

Raising worker_connections alone does not solve slow upstream; you must also improve upstream latency and apply concurrent limits.

Treating limit_req as a cure‑all ignores the need for a concurrent‑limit dimension.

Alerting solely on 5xx misses the process indicators; combine capacity, turnover, and business impact metrics.

Never skip rollback rehearsals: the sequence cp /etc/nginx/nginx.conf.rollback /etc/nginx/nginx.conf && nginx -t && nginx -s reload should be exercised monthly.

Conclusion

The incident taught us that the real change was not a single parameter tweak but the establishment of a repeatable method:

First, verify whether connections are in net inflow.

Then locate whether the problem is "establishment too fast" or "release too slow".

Next decide bleeding‑control versus root‑cause remediation actions.

Finally codify the experience into gate checks, alerts, rehearsals, and review mechanisms.

We can express the stability formula as:

Stability = connection_budget_governance × turnover_observability × tiered_rate_limit × upstream_latency_governance

If any factor stays at zero for a long period, a future peak will inevitably trigger a failure.

Next three actionable items (to be completed this week):

Complete the connection‑watermark and turnover dashboard (run python3 create_dashboard.py --template nginx_conn_governance).

Roll out the tiered rate‑limit policy to 10% of traffic (run python3 rollout_limit_policy.py --service edge-gw --percent 10).

Conduct a 30‑minute connection‑failure drill (run python3 run_drill.py --scenario nginx_conn_saturation --duration 30m).

Tags: Monitoring, incident response, Nginx, rate limiting, connection limits