Why Nginx Community Health Checks Fail and How to Diagnose Them

This article examines the limited health-check mechanism of open-source Nginx. It demonstrates a test setup with two Tomcat backends, analyzes the access and error logs to show how Nginx retries failed servers, explains the roles of max_fails and fail_timeout, and compares the behavior with Nginx Plus features.

MaGe Linux Operations

May 19, 2018

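Many people know that Nginx can act as a reverse proxy and load balancer, but fewer understand how it checks backend health. The community (open-source) edition provides only a passive check, configured with the max_fails and fail_timeout parameters of the server directive in the upstream block: a server is judged solely by whether real client requests sent to it succeed.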

Nginx Configuration

#worker_processes 1;

events {
    worker_connections 1024;
}

http {
    include mime.types;
    default_type application/octet-stream;
    log_format main '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$http_x_forwarded_for"';
    access_log logs/access.log main;
    sendfile on;
    keepalive_timeout 65;
    upstream backend {
        # After 1 failed attempt, consider the server unavailable for 40 seconds.
        server localhost:9090 max_fails=1 fail_timeout=40s;
        server localhost:9191 max_fails=1 fail_timeout=40s;
    }
    server {
        listen 80;
        server_name localhost;
        location / {
            proxy_pass http://backend;
            # Give up on a backend after 1 second to connect or 1 second without data from it.
            proxy_connect_timeout 1;
            proxy_read_timeout 1;
        }
        error_page 500 502 503 504 /50x.html;
        location = /50x.html { root html; }
    }
}

The test environment uses CentOS 6.4, Nginx 1.6.0, and two Tomcat 8.0.15 instances as backends. One Tomcat is deliberately delayed (10 minutes) to simulate a server that is unavailable during startup.

access.log (initial request)

192.168.42.254 - - [29/Dec/2014:11:24:23 +0800] "GET /response/ HTTP/1.1" 504 537 720 380 "Mozilla/5.0 ..." 2.004 host:health.iflytek.com
192.168.42.254 - - [29/Dec/2014:11:24:24 +0800] "GET /favicon.ico HTTP/1.1" 502 537 715 311 "Mozilla/5.0 ..." 0.000 host:health.iflytek.com

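Because both Tomcat servers are still starting, the first request times out on each backend in turn and ends with a 504. Two 1-second read timeouts are consistent with the 2.004 seconds logged for that request. Both servers are then marked unavailable, so the following request for /favicon.ico fails immediately with a 502, and the error log reports no live upstreams while connecting to upstream. This is a passive form of health checking: Nginx learns that a server is down only when a real request to it fails.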

error.log (initial request)

2014/12/29 11:24:22 [error] 6318#0: *4785892017 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.42.254, server: health.iflytek.com, request: "GET /response/ HTTP/1.1", upstream: "http://192.168.42.249:9090/response/", host: "health.iflytek.com"
2014/12/29 11:24:23 [error] 6318#0: *4785892017 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.42.254, server: health.iflytek.com, request: "GET /response/ HTTP/1.1", upstream: "http://192.168.42.249:9191/response/", host: "health.iflytek.com"
2014/12/29 11:24:24 [error] 6318#0: *4785892017 no live upstreams while connecting to upstream, client: 192.168.42.254, server: health.iflytek.com, request: "GET /favicon.ico HTTP/1.1", upstream: "http://health/favicon.ico", host: "health.iflytek.com"
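
The retries visible in this error log come from the proxy module's failover rule, which the test configuration leaves at its default. As a minimal sketch, the location block could spell that default out. The proxy_next_upstream directive and its error and timeout values are standard Nginx; everything else is copied from the configuration above.

```nginx
location / {
    proxy_pass http://backend;
    proxy_connect_timeout 1;
    proxy_read_timeout 1;

    # Default value, written out explicitly: on a connection error or a
    # timeout, pass the request to the next server in the upstream group.
    # Each such failed attempt also counts toward that server's max_fails.
    proxy_next_upstream error timeout;
}
```

Once every server in the group has been marked unavailable, there is nothing left to try, which is when the no live upstreams message appears.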

After 40 seconds the first Tomcat (port 9090) finishes starting. A new request shows a successful 200 response from the healthy server.

access.log (after 9090 is up)

192.168.42.254 - - [29/Dec/2014:11:54:18 +0800] "GET /response/ HTTP/1.1" 200 19 194 423 "Mozilla/5.0 ..." 0.210 host:health.iflytek.com
192.168.42.254 - - [29/Dec/2014:11:54:18 +0800] "GET /favicon.ico HTTP/1.1" 404 453 674 311 "Mozilla/5.0 ..." 0.212 host:health.iflytek.com

The client receives the expected response (9090), confirming that Nginx correctly retried the healthy backend.

access.log (subsequent request, while 9191 is still starting)

192.168.42.254 - - [29/Dec/2014:13:43:13 +0800] "GET /response/ HTTP/1.1" 200 19 194 423 "Mozilla/5.0 ..." 1.005 host:health.iflytek.com

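The error log now contains another timeout for the 9191 server, which is still starting, but the client still receives a successful response because Nginx retries the request on the healthy 9090 server. The retry is not free: the logged time of 1.005 seconds is consistent with the 1-second read timeout on 9191 being spent before 9090 answers.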
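
One way to make such hidden retries visible in the access log is to record the upstream variables. The following is a sketch to be merged into the existing http block, not the format used in the test above. The format name upstream_debug and the file name logs/upstream.log are made up for illustration, while the variables themselves are standard Nginx variables.

```nginx
http {
    # Hypothetical format name; the $upstream_* variables are built in.
    log_format upstream_debug '$remote_addr [$time_local] "$request" $status '
                              'upstream=$upstream_addr '
                              'upstream_status=$upstream_status '
                              'upstream_time=$upstream_response_time '
                              'request_time=$request_time';

    # Hypothetical file name, kept separate from the main access log.
    access_log logs/upstream.log upstream_debug;
}
```

When a request is passed to more than one server, $upstream_addr and $upstream_status list every attempt separated by commas, so a retried request shows both the server that failed and the one that finally answered.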

Thus, with max_fails=1 and fail_timeout=40s, a single failed attempt marks a server unavailable for 40 seconds. Once that period ends, Nginx sends it live requests again regardless of its actual state, and a real client request must pay the timeout before the server is marked unavailable once more. This illustrates the weakness of the community edition's health check: it merely delays retries and performs no true health monitoring.
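
The two parameters interact in a way that is easy to miss when max_fails is 1. A sketch with illustrative values follows; 3 and 30s are assumptions chosen for the example, not values from the test.

```nginx
upstream backend {
    # If 3 attempts to reach this server fail within a 30-second window,
    # Nginx considers it unavailable for the next 30 seconds. fail_timeout
    # sets both the counting window and the length of the unavailable period.
    server localhost:9090 max_fails=3 fail_timeout=30s;

    # max_fails=0 disables failure accounting: this server is never marked
    # unavailable, however often requests to it fail.
    server localhost:9191 max_fails=0;
}
```

If the parameters are omitted, Nginx uses max_fails=1 and fail_timeout=10s. In every case only proxied client requests update the failure count; open-source Nginx does not probe the server on its own.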

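The commercial Nginx Plus and Alibaba's Tengine provide richer health checks. Nginx Plus adds active probing through the health_check directive, which requires a shared-memory zone in the upstream group, and a slow_start server parameter that ramps traffic to a recovered server up gradually, which helps with cache warm-up. Tengine offers active checks through its upstream check module. Both detect server state more reliably than the passive mechanism.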
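
For comparison, here is a minimal sketch of active checking in Nginx Plus, written as a fragment of the http block. It assumes an Nginx Plus build, since the health_check directive and the slow_start parameter are not available in open-source Nginx. The zone size, interval, thresholds, slow_start duration, and probe URI /response/ are illustrative assumptions.

```nginx
upstream backend {
    # Shared-memory zone, required so that worker processes share
    # health-check state for the group.
    zone backend 64k;

    # slow_start ramps a recovered server's weight up from zero over 30
    # seconds instead of sending it a full share of traffic at once.
    server localhost:9090 slow_start=30s;
    server localhost:9191 slow_start=30s;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;

        # Probe each server every 5 seconds with a request for /response/.
        # Mark it unhealthy after 3 consecutive failed probes and healthy
        # again after 2 consecutive passed probes.
        health_check interval=5 fails=3 passes=2 uri=/response/;
    }
}
```

Tengine's upstream check module follows a similar idea with its own directives. Again, this is a sketch with illustrative values:

```nginx
upstream backend {
    server localhost:9090;
    server localhost:9191;

    # Send a probe every 3000 ms with a 1000 ms timeout. A server is marked
    # down after 5 failed probes and up again after 2 successful ones.
    check interval=3000 rise=2 fall=5 timeout=1000 type=http;
    check_http_send "HEAD / HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}
```

In both cases the proxy decides a server's state from its own probes, so a backend that is still starting is detected without sacrificing a client request.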
