How Nginx’s max_fails and fail_timeout Really Work: A Hands‑On Demo
This article explains the meaning of Nginx upstream directives max_fails and fail_timeout, shows their default values, walks through a step‑by‑step experiment with two PHP‑FPM backends, and clarifies common misconceptions about failure handling and timeout settings.
Many developers ask how the Nginx upstream directives max_fails and fail_timeout control failure attempts and the unavailability period for upstream (backend) servers.
The official documentation states that max_fails is the number of unsuccessful attempts to communicate with a server that must occur within the fail_timeout window for Nginx to consider that server unavailable. The default is 1, meaning a single failure within the window marks the server as unavailable. The default fail_timeout is 10 seconds.
fail_timeout has two related meanings: (1) the time window during which failed connection attempts are counted, and (2) the duration the server is considered unavailable after reaching the failure limit.
Experimental environment
Nginx
Two PHP‑FPM instances (upstream servers)
The upstream is configured with the defaults max_fails=1 and fail_timeout=10s. Nginx forwards PHP requests to the PHP‑FPM backends via FastCGI.
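A configuration along these lines would reproduce the setup. It is a minimal sketch, not the author's actual file: the upstream group name php_backend, the listen addresses, and the document root are assumptions, since the article does not state them. The max_fails and fail_timeout values are the defaults, written out explicitly for clarity.

```nginx
upstream php_backend {
    # Hypothetical addresses: the article does not say where the
    # PHP-FPM instances listen.
    server 127.0.0.1:9000 max_fails=1 fail_timeout=10s;  # PHP-FPM1
    server 127.0.0.1:9001 max_fails=1 fail_timeout=10s;  # PHP-FPM2
    # No balancing directive is given, so round-robin applies.
}

server {
    listen 80;
    root /var/www/html;  # assumed document root

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        # Forward PHP requests to the upstream group over FastCGI.
        fastcgi_pass php_backend;
    }
}
```

Omitting max_fails and fail_timeout from the server lines should behave the same way, because these are the default values.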
Step‑by‑step observations
1. Monitor the logs of both PHP‑FPM instances with tail -f and send four requests. Because the default load‑balancing method is round‑robin, the requests are distributed evenly between the two backends.
2. Stop PHP‑FPM1 and send another request. Nginx logs a single failure for PHP‑FPM1, then immediately forwards the request to PHP‑FPM2, demonstrating that only one failure attempt is made before the request is retried on another upstream.
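3. Change max_fails to 2 and repeat the request sequence. The per‑request behavior remains the same: after the first failure, Nginx retries the request on the healthy backend. What changes is the threshold: PHP‑FPM1 is now marked unavailable only after two failed attempts within the fail_timeout window.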
4. Restart PHP‑FPM1; subsequent requests again follow the round‑robin distribution.
5. Keep PHP‑FPM1 stopped and issue many requests. Once PHP‑FPM1 has accumulated two failures (the max_fails value from step 3) within the 10‑second fail_timeout window, Nginx marks it unavailable and sends all traffic to PHP‑FPM2 for the next 10 seconds. When that period expires, Nginx tries PHP‑FPM1 again with a live request; if it is still failing, it is excluded for another fail_timeout period.
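The change made in step 3 might look like the following. This is a sketch under the same assumptions as before (hypothetical group name and addresses); only the max_fails value comes from the article.

```nginx
upstream php_backend {
    # Two failed attempts within 10 seconds are now needed before a
    # server is considered unavailable; it is then skipped for 10 seconds.
    server 127.0.0.1:9000 max_fails=2 fail_timeout=10s;  # PHP-FPM1 (address assumed)
    server 127.0.0.1:9001 max_fails=2 fail_timeout=10s;  # PHP-FPM2 (address assumed)
}
```

Reload Nginx after editing the file so the new threshold takes effect.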
Common misconceptions
Nginx does not immediately return an error when an upstream connection fails; it logs the failure and retries the request on another available upstream. Only when all upstreams are down does Nginx return an error response.
max_fails counts failed attempts accumulated across requests within the fail_timeout window; it is not the number of times a single request is retried. A retried request is still routed by the configured load‑balancing algorithm.
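For FastCGI backends, the conditions under which a request is passed to the next server are set by the fastcgi_next_upstream directive. The fragment below is a sketch that only makes the documented default explicit; the upstream group name php_backend is an assumption, not something the article specifies.

```nginx
location ~ \.php$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass php_backend;  # hypothetical upstream group name

    # Documented default: pass the request to the next server on a
    # connection error or a timeout. Attempts that fail in these ways
    # also count toward max_fails.
    fastcgi_next_upstream error timeout;
}
```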
A shorter fail_timeout reduces the period an unhealthy server is considered down but may cause frequent reconnection attempts, consuming TCP resources under high traffic.
A longer fail_timeout reduces reconnection overhead but can lead to load imbalance, as traffic may stay on the healthy servers while the downed server remains excluded.
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
