How Nginx’s max_fails and fail_timeout Really Work: A Hands‑On Demo
This article explains the meaning of Nginx upstream directives max_fails and fail_timeout, demonstrates their behavior with a PHP‑FPM test setup, and clarifies common misconceptions and best‑practice settings for reliable load balancing.
Many users ask how the Nginx upstream directives max_fails and fail_timeout control load‑balancing failures and downtime.
According to the official documentation, max_fails is the number of failed attempts to communicate with a server that must occur within the period defined by fail_timeout for the server to be considered unavailable. By default, max_fails is 1, so a single failure marks the server as unavailable for the duration of fail_timeout, and the request is passed to the next server in the upstream group.
The fail_timeout directive has two meanings:
It sets the time window within which max_fails failed attempts must occur for the server to be considered unavailable.
It also sets how long the server is then considered unavailable before Nginx tries it again.
By default, fail_timeout is 10 seconds.
To illustrate, a test environment was built with Nginx and two PHP‑FPM instances (upstream servers). Nginx forwards PHP requests to the PHP‑FPM pool via FastCGI. The upstream was left with the default configuration (max_fails=1, fail_timeout=10s).
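A setup like the one described could be configured roughly as follows. This is a minimal sketch, not the article's actual configuration: the upstream name php_backend, the addresses and ports, and the document root are all placeholders, and the defaults are written out explicitly only to make them visible.

```nginx
# Goes inside the http { } context.
# Upstream name, addresses, and ports are assumptions for illustration.
upstream php_backend {
    # These parameters equal the defaults; omitting them has the same effect.
    server 127.0.0.1:9000 max_fails=1 fail_timeout=10s;  # PHP-FPM1
    server 127.0.0.1:9001 max_fails=1 fail_timeout=10s;  # PHP-FPM2
}

server {
    listen 80;
    root /var/www/html;  # assumed document root

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        # Requests are balanced round-robin across the upstream group.
        fastcgi_pass php_backend;
    }
}
```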
Four requests were sent, and the logs showed round‑robin distribution to both PHP‑FPM instances. After stopping PHP‑FPM1, the next request (which should have gone to PHP‑FPM1) failed to connect; Nginx logged the failure once and then redirected the request to PHP‑FPM2.
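To see which backend served each request, the access log can record the upstream address. The sketch below assumes a format name (upstream_trace) and log path chosen for illustration; the variables $upstream_addr and $upstream_status are standard Nginx variables.

```nginx
# Goes inside the http { } context. The format name is a placeholder.
log_format upstream_trace '$remote_addr [$time_local] "$request" '
                          'status=$status '
                          'upstream_addr=$upstream_addr '
                          'upstream_status=$upstream_status';

# Assumed log path; adjust to the local layout.
access_log /var/log/nginx/access.log upstream_trace;
```

When Nginx contacts more than one server while processing a request, $upstream_addr lists the addresses separated by commas, so a request that failed on one backend and succeeded on the other shows both. The connection failure itself is written to the error log.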
When max_fails was increased to 2, the same steps were repeated. After stopping PHP‑FPM1 again, multiple requests were issued. All of them were served by PHP‑FPM2. Each attempt against PHP‑FPM1 produced a single logged failure, and once the failures reached the new max_fails value, Nginx stopped sending requests to PHP‑FPM1.
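For this second run, only the server parameters need to change. As before, the upstream name and addresses are placeholders rather than the article's actual values.

```nginx
# Goes inside the http { } context.
upstream php_backend {
    # Two failed attempts within 10s are now required to mark a server unavailable.
    server 127.0.0.1:9000 max_fails=2 fail_timeout=10s;  # PHP-FPM1
    server 127.0.0.1:9001 max_fails=2 fail_timeout=10s;  # PHP-FPM2
}
```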
After the fail_timeout period (10 s) elapsed, Nginx tried to contact the previously failed PHP‑FPM1 again. Two consecutive failures within the timeout caused the server to be marked down again, demonstrating that max_fails counts failures only within the fail_timeout window.
Common misconceptions:
A failed connection to one upstream server does not immediately return an error to the client; Nginx logs the failure and passes the request to another available server.
max_fails counts failed attempts that accumulate within fail_timeout across requests; it is not the number of times Nginx immediately retries the same server for a single request.
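Per-request retry behavior is governed by separate directives, which helps keep the two ideas apart. The sketch below assumes the same placeholder upstream name (php_backend) and a retry cap of 2 chosen purely for illustration; fastcgi_next_upstream and fastcgi_next_upstream_tries are standard directives of the FastCGI module.

```nginx
# Goes inside a server { } block.
location ~ \.php$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass php_backend;

    # Conditions under which a request is passed to the next server.
    # "error timeout" is the documented default, shown here explicitly.
    fastcgi_next_upstream error timeout;

    # Cap on tries for a single request; 0 (the default) removes the cap.
    # The value 2 is an illustrative assumption.
    fastcgi_next_upstream_tries 2;
}
```

These directives decide what happens to the current request, while max_fails and fail_timeout decide whether a server stays in rotation for later requests.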
Guidelines for fail_timeout:
Setting it too short causes Nginx to retry an unavailable server frequently, which wastes TCP connections under high traffic.
Setting it too long can lead to load imbalance, because a server that has already recovered stays out of rotation until the period expires, leaving the remaining servers to carry its share.
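As a starting point for tuning, both parameters can be set per server. The values below are illustrative assumptions, not recommendations from the test above, and the upstream name and addresses are placeholders; suitable numbers depend on traffic volume and on how quickly backends usually recover.

```nginx
# Goes inside the http { } context.
upstream php_backend {
    # Tolerate up to 3 failures in 30s, then keep the server out of rotation for 30s.
    server 127.0.0.1:9000 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:9001 max_fails=3 fail_timeout=30s;
}
```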
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.