
How Nginx’s max_fails and fail_timeout Really Work: A Hands‑On Demo

This article explains the meaning of Nginx upstream directives max_fails and fail_timeout, demonstrates their behavior with a PHP‑FPM test setup, and clarifies common misconceptions and best‑practice settings for reliable load balancing.

Efficient Ops

Many users ask how the Nginx upstream directives max_fails and fail_timeout control load‑balancing failures and downtime.

According to the official documentation, max_fails is the number of unsuccessful attempts to communicate with a server that must occur within the period set by fail_timeout for the server to be considered unavailable. The default is 1: a single failure within fail_timeout marks the server as unavailable, and the request is passed to the next server in the upstream group.

The fail_timeout directive has two meanings:

It defines the time window within which max_fails unsuccessful attempts must occur for the server to be considered unavailable.

It also defines how long the server is then considered unavailable.

By default, fail_timeout is 10 seconds.

To illustrate, a test environment was built with Nginx and two PHP‑FPM instances (upstream servers). Nginx forwards PHP requests to the PHP‑FPM pool via fastcgi. The upstream was left with the default configuration (max_fails=1, fail_timeout=10s).
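The default setup described above could be written out explicitly as follows. This is a minimal sketch, not the configuration actually used in the test: the upstream name php_backend, the two local ports, and the document root are assumptions made for illustration.

```nginx
# Hypothetical upstream name and ports; the article does not give the real ones.
upstream php_backend {
    # max_fails=1 and fail_timeout=10s are the defaults, spelled out for clarity.
    server 127.0.0.1:9000 max_fails=1 fail_timeout=10s;  # PHP-FPM1
    server 127.0.0.1:9001 max_fails=1 fail_timeout=10s;  # PHP-FPM2
}

server {
    listen 80;
    root /var/www/html;  # assumed document root

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass php_backend;  # round-robin across both instances
    }
}
```

Omitting max_fails and fail_timeout from the server lines gives the same behavior, since both values match the defaults.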

Four requests were sent, and the logs showed round‑robin distribution to both PHP‑FPM instances. After stopping PHP‑FPM1, the next request (which should have gone to PHP‑FPM1) failed to connect; Nginx logged the failure once and then redirected the request to PHP‑FPM2.

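With max_fails raised to 2, the same steps were repeated. After PHP‑FPM1 was stopped again, multiple requests were issued. All of them were ultimately served by PHP‑FPM2. Each request that Nginx first routed to PHP‑FPM1 logged a single connection failure before being passed on, and once two such failures had accumulated, PHP‑FPM1 was marked unavailable and skipped, consistent with the new max_fails value.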
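For the second run, only the server parameters need to change. A sketch under the same assumptions as before (php_backend and the ports are hypothetical names, not taken from the test):

```nginx
# Hypothetical upstream; only max_fails differs from the default setup.
upstream php_backend {
    # Two failed attempts within 10s are needed before a server is skipped.
    server 127.0.0.1:9000 max_fails=2 fail_timeout=10s;  # PHP-FPM1
    server 127.0.0.1:9001 max_fails=2 fail_timeout=10s;  # PHP-FPM2
}
```

After editing, the configuration can be checked with `nginx -t` and applied with `nginx -s reload`.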

After the fail_timeout period (10 s) elapsed, Nginx tried to contact the previously failed PHP‑FPM1 again. Two consecutive failures within the timeout caused the server to be marked down again, demonstrating that max_fails counts failures only within the fail_timeout window.

Common misconceptions:

Failing to connect to an upstream does not immediately return an error to the client; Nginx logs the failure and retries other healthy upstreams.

max_fails counts failed attempts that accumulate within fail_timeout; it is not the number of immediate retries made for a single request.
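The retry behavior in the first point is governed by the fastcgi_next_upstream directive, whose default value is `error timeout`. The sketch below makes that default explicit. It assumes a hypothetical upstream named php_backend, and the tries limit is an illustrative value rather than a setting from the test.

```nginx
location ~ \.php$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass php_backend;  # hypothetical upstream name

    # Default behavior: pass the request to the next server
    # on a connection error or a timeout.
    fastcgi_next_upstream error timeout;

    # Illustrative cap on attempts per request; the default, 0, means no limit.
    fastcgi_next_upstream_tries 2;
}
```

Setting `fastcgi_next_upstream off;` would disable the retry, in which case a failed connection is reported to the client instead of being passed to another server.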

Guidelines for fail_timeout:

Setting it too short can cause frequent reconnection attempts to an unavailable server, consuming excessive TCP resources under high traffic.

Setting it too long can lead to load imbalance: a server that has already recovered stays out of rotation for an extended period, so the remaining servers carry all the traffic.
