
Understanding Nginx Failure Retry Mechanism and Common Pitfalls

This article explains Nginx's built‑in failure retry mechanism: how failures are defined via proxy_next_upstream, the default and custom error types, retry limits, backup servers, and common pitfalls, with configuration examples and practical scenarios.

NetEase Game Operations Platform

Background

Nginx, as a widely used reverse proxy, provides a failure retry mechanism to keep services available when upstream servers misbehave. This article uses simple examples to explain how the mechanism works and how to avoid common mistakes.

How Failures Are Defined

The proxy_next_upstream directive determines which conditions count as failures and trigger a retry on the next server. Failures fall into two categories:

Default errors: error and timeout.

Custom errors: invalid_header and various HTTP status codes (http_500, http_502, http_503, http_504, and so on).

Default Errors

error: occurs when a connection to the upstream server cannot be established, a request cannot be passed to it, or the response header cannot be read.

timeout: occurs when the connection, read, or send timeout is reached. The relevant directives are proxy_connect_timeout, proxy_read_timeout, and proxy_send_timeout.
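For reference, here is a minimal sketch of where these timeouts are set; the values are illustrative only, not recommendations:

location / {
    proxy_pass http://test;

    # Maximum time to establish a TCP connection to the upstream.
    proxy_connect_timeout 3s;
    # Timeout between two successive write operations to the upstream,
    # not for transmitting the whole request.
    proxy_send_timeout 10s;
    # Timeout between two successive read operations from the upstream,
    # not for receiving the whole response.
    proxy_read_timeout 10s;
}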

Custom Errors

invalid_header: the upstream returns an empty or invalid response header, e.g., non‑standard HTTP or malformed headers.

NOTE: Only error and timeout are counted toward max_fails by default; adding other types via proxy_next_upstream includes them in the fail count.
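As an illustration, a sketch of widening the failure definition (the set of error types chosen here is an example, not a recommendation):

location / {
    proxy_pass http://test;

    # In addition to the default error and timeout, treat malformed
    # headers and 500/502/503 responses as failures: such a request is
    # retried on the next server, and the failed attempt counts toward
    # max_fails for the server that produced it.
    proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
}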

Retry Mechanism Analysis

The basic retry scenario (no special proxy_next_upstream configuration) uses a simple upstream block with two servers. When a request to one server fails, Nginx retries it on the other server and increments the failed server's fail count. Once max_fails is reached, that server is marked unavailable for fail_timeout seconds.

upstream test {
    server 127.0.0.1:8001 fail_timeout=60s max_fails=2; # Server A
    server 127.0.0.1:8002 fail_timeout=60s max_fails=2; # Server B
}

If every server in the upstream becomes unavailable, Nginx logs a no live upstreams error and returns 502 Bad Gateway to the client.
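For context, this is roughly how the upstream above would be wired into a virtual server (the listen port and server_name are placeholders):

server {
    listen 80;
    server_name example.local; # placeholder

    location / {
        # With the default proxy_next_upstream (error timeout), a failed
        # attempt on Server A is transparently retried on Server B, and
        # vice versa.
        proxy_pass http://test;
    }
}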

Retry Limiting

The parameter proxy_next_upstream_tries (default 0, i.e., unlimited) caps the total number of attempts, including the first one, and proxy_next_upstream_timeout (default 0, i.e., unlimited) caps the total time the retry process may take.

proxy_connect_timeout 3s;
proxy_next_upstream_timeout 6s;
proxy_next_upstream_tries 3;

upstream test {
    server 127.0.0.1:8001 fail_timeout=60s max_fails=2;
    server 127.0.0.1:8002 fail_timeout=60s max_fails=2;
    server 127.0.0.1:8003;
}

In this example, suppose every server is unreachable and each connection attempt consumes the full proxy_connect_timeout of 3 seconds: the first two attempts use up the 6‑second proxy_next_upstream_timeout, so the retry process stops before the third server is tried and Nginx returns 504 Gateway Timeout.

Backup Servers

The backup parameter marks a server as a standby that receives requests only when all of the primary servers are unavailable. While any primary server is up, backup servers take no part in load balancing or retries.

upstream test {
    server 127.0.0.1:8001 fail_timeout=60s max_fails=2; # Server A
    server 127.0.0.1:8002 fail_timeout=60s max_fails=2; # Server B
    server 127.0.0.1:8003 backup; # Server C (backup)
}
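With this configuration, Server C starts receiving traffic only after both A and B have failed or been marked down; the no live upstreams error then appears only if the backup fails as well.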

Common Pitfalls

Retry not effective: since Nginx 1.9.13, requests with non‑idempotent methods (POST, LOCK, PATCH) are not retried by default; add non_idempotent to proxy_next_upstream to re‑enable retries for them (see the sketch after this list).

Disabling retries: set proxy_next_upstream off to turn off all retries.

Performance issues: overly aggressive retry settings (e.g., adding many HTTP status codes to proxy_next_upstream) can cascade under load, since a burst of upstream errors quickly pushes every server past max_fails and produces frequent no live upstreams errors.

Response timeout retries: long‑running idempotent requests may be retried unintentionally, while the first attempt is still being processed upstream, if proxy_read_timeout is shorter than the request's normal processing time.

Connection timeout retries: a poorly chosen proxy_connect_timeout can cause prolonged hangs, because each unreachable server consumes the full connect timeout before the next one is tried; combine it with sensible retry limits.
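As a reference point, here is a hedged sketch that combines these mitigations; every value is illustrative and should be tuned against the actual workload:

# These directives are valid in http, server, and location contexts.

# Retry on errors and timeouts, and also retry non‑idempotent requests
# (POST, LOCK, PATCH); drop non_idempotent if duplicate execution is
# unacceptable for the application.
proxy_next_upstream error timeout non_idempotent;

# Bound the retry process so a single request cannot walk the whole
# upstream list indefinitely.
proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 5s;

# Keep the connect timeout short so an unreachable server does not
# stall the request for long before the next attempt.
proxy_connect_timeout 2s;

# Alternatively, disable retries entirely:
# proxy_next_upstream off;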

Proper analysis of business requirements and careful configuration of retry parameters are essential to avoid these pitfalls.

Tags: backend, operations, Nginx, failure-retry, proxy_next_upstream, load-balancing
Written by

NetEase Game Operations Platform

The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.
