
Master Nginx Rate Limiting & Anti‑Crawler Techniques: A Complete Ops Engineer Guide

This guide walks operations engineers through the principles and practical configurations of Nginx rate limiting and anti‑crawler protection, covering token‑bucket and leaky‑bucket algorithms, IP and URI based limits, geo‑based controls, advanced User‑Agent filtering, JavaScript challenges, monitoring, performance tuning, and troubleshooting.

MaGe Linux Operations

Introduction

Websites routinely face traffic spikes and malicious crawlers. Operations engineers must keep the service available to legitimate users while throttling or blocking abusive traffic. This article covers Nginx‑based rate limiting and anti‑crawler techniques, from the underlying algorithms to working configurations.

Why Rate Limiting and Anti‑Crawler Are Needed

Business Pain Points

Traffic bursts cause server overload : sudden traffic spikes or CC (HTTP flood) attacks.

Malicious crawlers consume resources : frequent requests waste bandwidth and increase load.

Data leakage risk : sensitive information can be harvested in bulk.

User experience degradation : normal users experience slow or blocked access.

Technical Advantages of Using Nginx

High performance : event‑driven model handles tens of thousands of concurrent connections per server.

Low memory usage : consumes fewer resources than Apache and similar servers.

Modular design : rich third‑party modules enable extensive feature extensions.

Flexible configuration : supports complex rule definitions and dynamic updates.

Nginx Rate‑Limiting Core Principles

Token Bucket Algorithm

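Rate limiting is commonly explained with the token bucket model. Note that the official documentation for ngx_http_limit_req_module describes its method as a "leaky bucket"; with burst and nodelay set, its observable behavior is close to a token bucket. The core ideas of the token bucket are: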

The system adds tokens to the bucket at a constant rate.

Incoming requests must take a token from the bucket.

When the bucket is full, new tokens overflow.

If the bucket is empty, requests are rejected or delayed.

Token bucket diagram:
┌───────────────┐
│  Token Bucket │ ←── tokens added at a constant rate
│  ○ ○ ○ ○ ○    │
│  ○ ○ ○        │
└───────────────┘
       ↓
  user requests consume tokens
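
The token bucket steps can be sketched in a few lines of Python. This is a minimal illustration of the algorithm, not Nginx's implementation; the class name TokenBucket and its fields are hypothetical, and time is passed in explicitly so the behavior is reproducible.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float        # tokens added per second
    capacity: float    # maximum number of tokens the bucket can hold
    tokens: float = 0.0
    last: float = 0.0  # time of the last refill, in seconds

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then try to take one token."""
        elapsed = now - self.last
        # Tokens beyond the capacity overflow and are lost.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=10, capacity=5, tokens=5)
# Eight requests arrive at the same instant: only the 5 stored tokens are available.
burst = [bucket.allow(now=0.0) for _ in range(8)]
# After 0.2 s the bucket has gained 10 tokens/s * 0.2 s = 2 tokens.
later = [bucket.allow(now=0.2) for _ in range(3)]
print(burst)
print(later)
```

The first five requests pass and the next three are refused; 0.2 seconds later exactly two more requests pass.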

Leaky Bucket Algorithm

The leaky bucket provides a constant output rate:

Requests enter the bucket and queue.

They are processed at a fixed rate.

If the bucket is full, new requests are dropped.
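
For comparison, here is a small Python sketch of a leaky bucket that queues requests and releases them at a fixed rate. It is an illustration under simple assumptions (a single queue, capacity counted as requests still waiting); the function name leaky_bucket and its parameters are hypothetical.

```python
def leaky_bucket(arrivals, rate, capacity):
    """Return (arrival, departure) pairs; dropped requests get departure None.

    arrivals: sorted arrival times in seconds
    rate: requests released per second
    capacity: maximum number of requests allowed to wait in the bucket
    """
    interval = 1.0 / rate
    next_free = 0.0   # earliest time the next request may leave the bucket
    departures = []   # departure times of accepted requests
    results = []
    for t in arrivals:
        # Requests still waiting at time t are those that leave after t.
        waiting = sum(1 for d in departures if d > t)
        if waiting >= capacity:
            results.append((t, None))  # bucket full: drop the request
            continue
        depart = max(t, next_free)
        next_free = depart + interval
        departures.append(depart)
        results.append((t, depart))
    return results


# Four requests at the same instant, drained at 2 requests per second.
schedule = leaky_bucket([0.0, 0.0, 0.0, 0.0], rate=2, capacity=2)
print(schedule)
```

The first request leaves immediately, two more wait and leave at 0.5 s and 1.0 s, and the fourth is dropped because the bucket is full. The output rate stays constant no matter how bursty the input is.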

Basic Rate‑Limiting Configuration

3.1 IP‑Based Request Frequency Limiting

Common IP limiting configuration:

http {
    # Define limit zone based on client IP
    limit_req_zone $binary_remote_addr zone=ip_limit:10m rate=10r/s;

    # Define connection limit zone
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

    server {
        listen 80;
        server_name example.com;

        location / {
            # Apply IP limit: 10 requests per second, burst up to 5
            limit_req zone=ip_limit burst=5 nodelay;
            # Limit maximum connections per IP to 10
            limit_conn conn_limit 10;
            # Custom response status for limit violations
            limit_req_status 429;
            limit_conn_status 429;
            proxy_pass http://backend;
        }

        error_page 429 /429.html;
        location = /429.html {
            root /var/www/html;
            internal;
        }
    }
}

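Configuration notes:

$binary_remote_addr : the client address in binary form (4 bytes for IPv4, 16 bytes for IPv6), which uses less memory than the string form in $remote_addr.

zone=ip_limit:10m : allocates a 10 MB shared memory zone for limit state. The Nginx documentation estimates that 1 MB holds about 16,000 64‑byte states or about 8,000 128‑byte states.

rate=10r/s : allows an average of 10 requests per second per key, which Nginx enforces as one request every 100 ms.

burst=5 : lets up to 5 requests exceed the rate and queue instead of being rejected; requests beyond the burst are rejected.

nodelay : serves queued burst requests immediately instead of spacing them out at the configured rate. Requests beyond the burst are still rejected, here with status 429.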
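
To see how rate, burst, and nodelay interact, the following Python sketch models the accounting for a single client. It is a simplified approximation of how limit_req behaves, not Nginx's actual code; the function name simulate_limit_req and its parameters are made up for illustration.

```python
def simulate_limit_req(arrivals_ms, rate_per_s, burst, nodelay):
    """Simplified model of limit_req accounting for one key.

    Returns one (decision, delay_ms) tuple per arrival time (milliseconds).
    """
    excess = 0.0  # how many requests the client is currently ahead of the rate
    last = None   # time of the last accepted request, in ms
    out = []
    for now in arrivals_ms:
        if last is None:
            new_excess = 0.0
        else:
            # The bucket drains at rate_per_s; each new request adds one unit.
            drained = rate_per_s * (now - last) / 1000.0
            new_excess = max(0.0, excess - drained + 1.0)
        if new_excess > burst:
            out.append(("rejected", 0))  # beyond the burst: refuse
            continue
        excess, last = new_excess, now
        # Without nodelay, queued requests wait until the rate allows them.
        delay = 0 if nodelay else int(new_excess * 1000 / rate_per_s)
        out.append(("passed" if delay == 0 else "delayed", delay))
    return out


# Eight requests at the same instant with rate=10r/s and burst=5.
nodelay_run = simulate_limit_req([0] * 8, rate_per_s=10, burst=5, nodelay=True)
queued_run = simulate_limit_req([0] * 8, rate_per_s=10, burst=5, nodelay=False)
print(nodelay_run)
print(queued_run)
```

In this model the first request passes, the next five fill the burst, and the last two are rejected. With nodelay the six accepted requests are served at once; without it they are spaced 100 ms apart.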

3.2 URI‑Based Differential Limiting

Apply different limits to various interfaces:

http {
    # API rate limit
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=5r/s;
    # Static resources limit
    limit_req_zone $binary_remote_addr zone=static_limit:10m rate=50r/s;
    # Login interface strict limit
    limit_req_zone $binary_remote_addr zone=login_limit:10m rate=1r/s;

    server {
        listen 80;
        server_name api.example.com;

        location /api/ {
            limit_req zone=api_limit burst=2 nodelay;
            proxy_pass http://api_backend;
        }
        location ~* \.(jpg|jpeg|png|gif|css|js)$ {
            limit_req zone=static_limit burst=20;
            expires 1d;
            add_header Cache-Control "public, immutable";
        }
        location /api/login {
            limit_req zone=login_limit burst=1;
            access_log /var/log/nginx/login_limit.log combined;
            proxy_pass http://auth_backend;
        }
    }
}

3.3 Geo‑Based Rate Limiting

Combine the third‑party GeoIP2 module (ngx_http_geoip2_module) to limit traffic by country:

http {
    geoip2 /usr/share/GeoIP/GeoLite2-Country.mmdb {
        auto_reload 5m;
        $geoip2_metadata_country_build metadata build_epoch;
        $geoip2_data_country_code country iso_code;
        $geoip2_data_country_name country names en;
    }

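    # The rate= parameter of limit_req_zone does not accept variables, so define
    # one zone per tier. A request whose key is empty is not counted in that zone.
    map $geoip2_data_country_code $key_default {
        default $binary_remote_addr;
        CN "";
        US "";
        RU "";
        UA "";
    }
    map $geoip2_data_country_code $key_cn     { default ""; CN $binary_remote_addr; }
    map $geoip2_data_country_code $key_us     { default ""; US $binary_remote_addr; }
    map $geoip2_data_country_code $key_strict { default ""; RU $binary_remote_addr; UA $binary_remote_addr; }

    limit_req_zone $key_default zone=country_default:10m rate=10r/s;
    limit_req_zone $key_cn      zone=country_cn:10m      rate=20r/s; # China higher limit
    limit_req_zone $key_us      zone=country_us:10m      rate=15r/s; # United States
    limit_req_zone $key_strict  zone=country_strict:10m  rate=5r/s;  # Russia, Ukraine strict limit

    server {
        listen 80;
        server_name global.example.com;
        location / {
            # Only the zone whose key is non-empty for this request takes effect
            limit_req zone=country_default burst=5;
            limit_req zone=country_cn burst=5;
            limit_req zone=country_us burst=5;
            limit_req zone=country_strict burst=5;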
            add_header X-Country-Code $geoip2_data_country_code;
            add_header X-Country-Name $geoip2_data_country_name;
            proxy_pass http://backend;
        }
    }
}

Advanced Anti‑Crawler Strategies

4.1 User‑Agent Detection and Filtering

http {
    map $http_user_agent $is_crawler {
        default 0;
        ~*bot 1;
        ~*spider 1;
        ~*crawler 1;
        ~*scraper 1;
        ~*python-requests 1;
        ~*curl 1;
        ~*wget 1;
        ~*scrapy 1;
        ~*beautifulsoup 1;
        "" 1;
        ~^.{0,10}$ 1;
    }
    map $http_user_agent $allowed_crawler {
        default 0;
        ~*googlebot 1;
        ~*bingbot 1;
        ~*baiduspider 1;
        ~*slurp 1;
    }

    server {
        listen 80;
        server_name example.com;
        location / {
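            set $block_crawler 0;  # initialize so the variable is always defined
            if ($is_crawler) { set $block_crawler 1; }
            # The allowlist wins, but User-Agent is trivial to spoof; treat it as a hint only
            if ($allowed_crawler) { set $block_crawler 0; }
            if ($block_crawler) { return 403; }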
            proxy_pass http://backend;
        }
        location /robots.txt {
            root /var/www/html;
            add_header Cache-Control "public, max-age=3600";
        }
    }
}
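
Before deploying User‑Agent rules, it can help to check them offline against sample strings. The Python sketch below mirrors the two map blocks above with case‑insensitive regular expressions; the function names and the sample User‑Agent strings are illustrative assumptions, and Python's regex engine is only an approximation of the PCRE matching Nginx uses.

```python
import re

# Patterns copied from the $is_crawler map (an empty or very short UA also counts).
CRAWLER_PATTERNS = ["bot", "spider", "crawler", "scraper", "python-requests",
                    "curl", "wget", "scrapy", "beautifulsoup", r"^.{0,10}$"]
# Patterns copied from the $allowed_crawler map.
ALLOWED_PATTERNS = ["googlebot", "bingbot", "baiduspider", "slurp"]


def matches_any(patterns, user_agent):
    """True if any pattern matches, ignoring case (like the ~* operator)."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


def should_block(user_agent):
    """Block crawlers unless they are on the allowlist."""
    is_crawler = matches_any(CRAWLER_PATTERNS, user_agent)
    allowed = matches_any(ALLOWED_PATTERNS, user_agent)
    return is_crawler and not allowed


samples = {
    "": should_block(""),
    "curl/8.4.0": should_block("curl/8.4.0"),
    "Googlebot": should_block(
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"),
    "Chrome": should_block(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"),
}
for name, blocked in samples.items():
    print(repr(name), "blocked" if blocked else "allowed")
```

Note that any client can send a Googlebot User‑Agent and pass this check, so the allowlist should be combined with the rate limits from the previous section rather than used alone.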

4.2 Intelligent Detection Based on Request Features

http {
    map $http_referer $suspicious_referer {
        default 0;
        "" 1;
        "-" 1;
    }
    map "$http_accept:$http_accept_language:$http_accept_encoding" $suspicious_headers {
        default 0;
        "::" 1;              # all three headers missing
        ~^[^:]*:[^:]*:$ 1;   # Accept-Encoding missing
    }
    # limit_req is not allowed inside "if", so select suspicious requests through
    # the zone key instead: requests with an empty key are not rate limited.
    map "$suspicious_referer$suspicious_headers" $is_suspicious {
        default 0;
        "11" 1;
    }
    map $is_suspicious $suspicious_key {
        default "";
        1 $binary_remote_addr;
    }
    limit_req_zone $suspicious_key zone=freq_check:10m rate=30r/s;
    server {
        listen 80;
        server_name example.com;
        location / {
            access_log /var/log/nginx/access.log combined;
            access_log /var/log/nginx/suspicious.log combined if=$is_suspicious;
            limit_req zone=freq_check burst=1 nodelay;
            proxy_pass http://backend;
        }
    }
}

4.3 JavaScript Challenge Verification

http {
    lua_package_path "/usr/local/openresty/lualib/?.lua;;";
    lua_shared_dict challenge_cache 10m;
    server {
        listen 80;
        server_name secure.example.com;
        location /challenge {
            content_by_lua_block {
                -- Requires the third-party lua-resty-template library
                local template = require "resty.template"
                local cache = ngx.shared.challenge_cache
                -- Bind the token to the issue time and client IP; replace "secret_key" with a real secret
                local challenge = ngx.time() .. ":" .. ngx.var.remote_addr
                local hash = ngx.encode_base64(ngx.hmac_sha1("secret_key", challenge))
                -- Remember the issued token for 60 seconds so /verify can check it
                cache:set("challenge:" .. hash, ngx.var.remote_addr, 60)
                local html = [[
<!DOCTYPE html>
<html>
<head><title>Verification Required</title><meta name="robots" content="noindex, nofollow"></head>
<body><h1>Verifying your browser...</h1>
<script>
    var result = Math.pow(2,3) + 5;
    var challenge = "{*challenge*}";
    setTimeout(function(){
        var form=document.createElement('form');
        form.method='POST';form.action='/verify';
        var c=document.createElement('input');c.type='hidden';c.name='challenge';c.value=challenge;
        var a=document.createElement('input');a.type='hidden';a.name='answer';a.value=result;
        form.appendChild(c);form.appendChild(a);document.body.appendChild(form);form.submit();
    },2000);
</script>
</body>
</html>
                ]]
                ngx.say(template.compile(html)({challenge=hash}))
            }
        }
        location /verify {
            content_by_lua_block {
                if ngx.var.request_method ~= "POST" then ngx.status=405; ngx.say("Method not allowed"); return end
                ngx.req.read_body()
                local args = ngx.req.get_post_args()
                local cache = ngx.shared.challenge_cache
                -- Accept the answer only together with a token that was issued to this client
                local token = type(args.challenge) == "string" and args.challenge or ""
                if args.answer == "13" and cache:get("challenge:" .. token) == ngx.var.remote_addr then
                    cache:delete("challenge:" .. token)
                    cache:set(ngx.var.remote_addr, "verified", 3600)
                    return ngx.redirect("/")
                else
                    ngx.status=403; ngx.say("Verification failed")
                end
            }
        }
        location / {
            access_by_lua_block {
                local cache = ngx.shared.challenge_cache
                local verified = cache:get(ngx.var.remote_addr)
                if not verified then return ngx.redirect("/challenge") end
            }
            proxy_pass http://backend;
        }
    }
}

Dynamic Protection and Monitoring

5.1 Real‑Time Monitoring and Alerts

http {
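    # $geoip2_data_country_code requires the geoip2 block from section 3.3
    log_format security_log '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_time $upstream_response_time $geoip2_data_country_code';
    vhost_traffic_status_zone;  # provided by the third-party nginx-module-vts
    # $limit_req_status (nginx 1.17.6+) holds PASSED, DELAYED, REJECTED,
    # DELAYED_DRY_RUN or REJECTED_DRY_RUN, never an HTTP status code
    map $limit_req_status $log_rejected {
        default 0;
        REJECTED 1;
    }
    server {
        listen 80;
        server_name monitor.example.com;
        location / {
            access_log /var/log/nginx/security.log security_log;
            # Log rejected requests separately; this only fires where limit_req is applied
            access_log /var/log/nginx/rate_limit.log security_log if=$log_rejected;
            proxy_pass http://backend;
        }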
        location /nginx_status {
            vhost_traffic_status_display;
            vhost_traffic_status_display_format html;
            allow 10.0.0.0/8;
            allow 172.16.0.0/12;
            allow 192.168.0.0/16;
            deny all;
        }
    }
}

5.2 Automated Blacklist Management

#!/bin/bash
# auto_blacklist.sh: generate a blacklist from the security log
LOG_FILE="/var/log/nginx/security.log"
BLACKLIST_FILE="/etc/nginx/conf.d/blacklist.conf"
TEMP_FILE="$(mktemp)"
trap 'rm -f "$TEMP_FILE"' EXIT
# Fields in security_log: $1 = client IP, $4 = [timestamp, $9 = status, $10 = body bytes sent.
# Flags IPs with more than 100 rejected requests (assumes limit_req_status 429)
# or more than 50 large responses during the current hour.
LC_ALL=C awk -v date="$(LC_ALL=C date '+%d/%b/%Y:%H')" '
    index($4, date) {
        ip = $1; total[ip]++
        if ($9 == "429" || $9 == "403") suspicious[ip]++
        if ($10 > 10000) large[ip]++
    }
    END { for (ip in total) if (suspicious[ip] > 100 || large[ip] > 50) print "deny " ip ";" }
' "$LOG_FILE" > "$TEMP_FILE"
if [ -s "$TEMP_FILE" ]; then
    { echo "# Auto-generated blacklist - $(date)"; cat "$TEMP_FILE"; } > "$BLACKLIST_FILE"
    nginx -t && nginx -s reload
    echo "Blacklist updated with $(wc -l < "$TEMP_FILE") entries"
fi
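
The per‑IP counting can also be prototyped offline before it is pointed at a live log. The Python sketch below parses lines with a regular expression instead of relying on field positions; the function name find_offenders, the threshold parameter, and the sample log lines (which use documentation‑reserved IP addresses) are all made up for illustration.

```python
import re
from collections import Counter

# Start of a combined-style line: IP, identity, user, [time], "request", status, bytes.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" (\d{3}) (\d+|-)')


def find_offenders(lines, hour_prefix, max_rejected=100):
    """Return IPs with more than max_rejected 403/429 responses in the given hour."""
    rejected = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip unparseable lines instead of guessing field positions
        ip, timestamp, status, _size = m.groups()
        if timestamp.startswith(hour_prefix) and status in ("403", "429"):
            rejected[ip] += 1
    return sorted(ip for ip, count in rejected.items() if count > max_rejected)


# Made-up sample data: three rejected requests from one IP, five normal ones from another.
sample = (
    ['203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /api/x HTTP/1.1" 429 0 "-" "curl/8.0"'] * 3
    + ['198.51.100.2 - - [10/Oct/2024:13:56:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"'] * 5
)
offenders = find_offenders(sample, "10/Oct/2024:13", max_rejected=2)
print(offenders)
```

Running the function against a copy of the log with different thresholds shows how many addresses a rule would block before any deny lines are written.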

Performance Optimization and Best Practices

6.1 Memory Usage Optimization

http {
    # Optimize memory for rate limiting
    limit_req_zone $binary_remote_addr zone=main_limit:50m rate=10r/s;
    map $request_uri $normalized_uri {
        ~^/api/v1/([^/]+) /api/v1/$1;
        ~^/static/ /static;
        default $request_uri;
    }
    limit_req_zone "$binary_remote_addr:$normalized_uri" zone=uri_limit:30m rate=20r/s;
    server {
        location / {
            limit_req zone=main_limit burst=10;
            limit_req zone=uri_limit burst=5;
            proxy_pass http://backend;
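            proxy_cache my_cache;  # assumes a proxy_cache_path with keys_zone=my_cache is defined elsewhere
            proxy_cache_valid 200 1m;
            # Key the cache on the full URI; keying on $normalized_uri would serve
            # one cached response for different resources
            proxy_cache_key "$scheme$proxy_host$request_uri";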
        }
    }
}

6.2 Modular Configuration Files

# /etc/nginx/conf.d/rate_limits.conf
limit_req_zone $binary_remote_addr zone=global_limit:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=5r/s;
limit_req_zone $binary_remote_addr zone=auth_limit:10m rate=1r/s;

# /etc/nginx/conf.d/security_maps.conf
map $http_user_agent $is_malicious_bot { include /etc/nginx/maps/malicious_bots.map; }
map $geoip2_data_country_code $is_blocked_country { include /etc/nginx/maps/blocked_countries.map; }

# /etc/nginx/conf.d/security_headers.conf
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;

Troubleshooting and Debugging

7.1 Common Issue Diagnosis

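# Inspect the response headers of a single request
curl -I http://example.com/api/test
# Rapidly send multiple requests to test limits; expect 429 (503 by default) once the limit is hit
for i in {1..20}; do curl -s -o /dev/null -w "%{http_code}\n" http://example.com/api/test; done
# Show the rate-limit zones in the effective configuration (nginx -T prints configuration, not runtime statistics)
nginx -T | grep -A 10 limit_req_zone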

7.2 Performance Monitoring Script

#!/bin/bash
check_nginx_performance() {
    echo "=== Nginx Performance Report ==="
    echo "Time: $(date)"
    echo "Established connections on port 80:"
    ss -Htn state established '( sport = :80 )' | wc -l
    echo -e "\nConfigured rate-limit zones:"
    nginx -T 2>/dev/null | grep -c limit_req_zone
    echo -e "\nStatus code distribution (last 100 requests):"
    tail -100 /var/log/nginx/access.log | awk '{print $9}' | sort | uniq -c | sort -nr
    echo -e "\nNginx memory usage (RSS):"
    ps -C nginx -o rss= | awk '{sum+=$1} END {print sum/1024 " MB"}'
}
check_nginx_performance