Master Nginx Rate Limiting & Anti‑Crawler: Complete Guide with Token Bucket, GeoIP, Lua & JS Challenges
This comprehensive guide explains why modern web services need rate limiting and anti‑crawler protection, compares token‑bucket and leaky‑bucket algorithms, and provides step‑by‑step Nginx configurations for IP, URI, and geographic throttling, advanced user‑agent filtering, JavaScript challenges, real‑time monitoring, performance tuning, and troubleshooting.
Why Rate Limiting and Anti‑Crawler Are Needed
In today’s fast‑growing internet services, websites face traffic spikes and malicious crawlers that can overload servers, waste bandwidth, expose sensitive data, and degrade user experience.
Business Pain Points
Sudden traffic surges or CC attacks cause server overload.
Malicious crawlers consume resources and bandwidth.
Data leakage risk from bulk scraping.
User experience degrades when legitimate users face slow or blocked access.
Why Choose Nginx
High performance with event‑driven architecture.
Low memory footprint compared with Apache.
Modular design with many third‑party extensions.
Flexible configuration and dynamic updates.
Nginx Rate‑Limiting Core Principle: Token Bucket
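The token‑bucket algorithm is the usual mental model for burst‑tolerant rate limiting. Note that the nginx documentation describes ngx_http_limit_req_module as using the leaky‑bucket method; combined with the burst parameter, its observable behavior is close to a token bucket whose capacity equals the burst size. A token bucket works as follows: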
The system adds tokens to the bucket at a constant rate.
Each request consumes a token.
When the bucket is full, new tokens overflow.
If the bucket is empty, the request is rejected or delayed.
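The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than nginx's implementation: the class name `TokenBucket`, its parameters, and the explicit `now` argument (used instead of a real clock so the run is deterministic) are assumptions made for the example.

```python
class TokenBucket:
    """Minimal token bucket. Time is passed in explicitly so runs are deterministic."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.tokens = capacity      # start with a full bucket
        self.last = 0.0             # time of the previous refill, in seconds

    def allow(self, now: float) -> bool:
        # Refill at a constant rate; tokens beyond capacity overflow and are lost.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Each request consumes one token; an empty bucket means rejection.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=10, capacity=5)
burst = [bucket.allow(now=0.0) for _ in range(8)]  # eight requests at the same instant
later = bucket.allow(now=0.2)                      # 0.2 s later, two tokens have been added
print(burst.count(True), burst.count(False), later)
```

With a capacity of 5, the first five simultaneous requests pass and the remaining three are rejected; a request arriving 0.2 s later passes again because the bucket has been partially refilled.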
Token bucket diagram:
┌──────────────┐
│ Token Bucket │ ←── tokens added at a constant rate
│ ○ ○ ○ ○ ○    │
│ ○ ○ ○        │
└──────────────┘
       ↓
Requests consume tokens
Leaky Bucket Algorithm
The leaky‑bucket algorithm processes requests at a fixed output rate, queuing excess requests and discarding them when the bucket is full.
Requests enter the bucket queue.
Processed at a constant rate.
When full, new requests are dropped.
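The queueing behavior described above can be sketched as a small deterministic simulation. The function name `leaky_bucket` and its parameters are hypothetical, and the model assumes that a request arriving at an idle bucket is processed immediately and does not count as waiting.

```python
def leaky_bucket(arrivals, rate, capacity):
    """Return the departure time of each request, or None if it was dropped.

    arrivals: arrival timestamps in seconds, in ascending order
    rate:     requests processed per second (the fixed output rate)
    capacity: maximum number of requests allowed to wait in the bucket
    """
    interval = 1.0 / rate
    departures = []   # departure times of accepted requests, in order
    results = []
    for t in arrivals:
        # Accepted requests that have not departed yet are still waiting.
        waiting = sum(1 for d in departures if d > t)
        if waiting >= capacity:
            results.append(None)          # bucket full: drop the request
            continue
        # Leave no earlier than one interval after the previous departure.
        start = max(t, departures[-1] + interval) if departures else t
        departures.append(start)
        results.append(start)
    return results


# Five requests at t=0 and one at t=2, drained at 2 requests per second.
results = leaky_bucket([0.0] * 5 + [2.0], rate=2, capacity=3)
print(results)
```

Unlike the token bucket, accepted requests leave at evenly spaced times no matter how bursty the arrivals are; the fifth simultaneous request finds three requests already waiting and is dropped.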
Basic Rate‑Limiting Configuration (IP‑Based)
http {
    # Define IP‑based limit zones
    limit_req_zone $binary_remote_addr zone=ip_limit:10m rate=10r/s;
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

    server {
        listen 80;
        server_name example.com;

        location / {
            limit_req zone=ip_limit burst=5 nodelay;
            limit_conn conn_limit 10;
            limit_req_status 429;
            limit_conn_status 429;
            proxy_pass http://backend;
        }

        error_page 429 /429.html;
        location = /429.html {
            root /var/www/html;
            internal;
        }
    }
}
Configuration Explanation
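$binary_remote_addr: stores the client IP in binary form to save memory.
zone=ip_limit:10m: allocates a 10 MB shared‑memory zone for rate‑limit state.
rate=10r/s: limits each client IP to an average of 10 requests per second.
burst=5: lets up to 5 requests exceed the rate before nginx starts rejecting.
nodelay: serves requests within the burst immediately instead of delaying them to match the rate; requests beyond the burst are rejected with the configured status (429 here).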
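How rate, burst, and nodelay interact is easier to see with numbers. The sketch below is a simplified Python model of the per-key accounting that limit_req is documented to perform. It is not nginx source code: the function name `limit_req`, the millisecond timestamps, and the tuple return format are assumptions for illustration.

```python
def limit_req(arrivals_ms, rate_per_s, burst, nodelay=True):
    """Simplified model of limit_req accounting for a single key.

    Returns one (decision, delay_ms) tuple per request.
    """
    excess = 0.0   # requests currently counted above the configured rate
    last = None    # arrival time of the last accepted request, in ms
    out = []
    for now in arrivals_ms:
        if last is None:
            new_excess = 0.0
        else:
            # Excess drains at the configured rate and grows by one per request.
            drained = rate_per_s * (now - last) / 1000.0
            new_excess = max(0.0, excess - drained + 1.0)
        if new_excess > burst:
            out.append(("rejected", 0))   # state is left untouched on rejection
            continue
        excess, last = new_excess, now
        # Without nodelay, excess requests are held back to match the rate.
        delay = 0 if nodelay else int(excess * 1000 / rate_per_s)
        out.append(("delayed" if delay else "passed", delay))
    return out


# rate=10r/s burst=5 nodelay, hit by 20 requests in the same millisecond
decisions = [d for d, _ in limit_req([0] * 20, rate_per_s=10, burst=5)]
print(decisions.count("passed"), decisions.count("rejected"))
```

Under this model the configuration above lets 6 of 20 simultaneous requests through (one at the base rate plus a burst of five) and rejects the other 14. Calling the function with `nodelay=False` shows the alternative: burst requests are accepted but spaced 100 ms apart.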
URI‑Based Differential Rate Limiting
http {
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=5r/s;
    limit_req_zone $binary_remote_addr zone=static_limit:10m rate=50r/s;
    limit_req_zone $binary_remote_addr zone=login_limit:10m rate=1r/s;

    server {
        listen 80;
        server_name api.example.com;

        location /api/ {
            limit_req zone=api_limit burst=2 nodelay;
            proxy_pass http://api_backend;
        }

        location ~* \.(jpg|jpeg|png|gif|css|js)$ {
            limit_req zone=static_limit burst=20;
            expires 1d;
            add_header Cache-Control "public, immutable";
        }

        location /api/login {
            limit_req zone=login_limit burst=1;
            access_log /var/log/nginx/login_limit.log combined;
            proxy_pass http://auth_backend;
        }
    }
}
Geolocation‑Based Rate Limiting
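http {
    geoip2 /usr/share/GeoIP/GeoLite2-Country.mmdb {
        auto_reload 5m;
        $geoip2_data_country_code country iso_code;
        $geoip2_data_country_name country names en;
    }

    # limit_req_zone does not accept variables in rate=, so each tier gets its
    # own zone. A request is counted only in the zone whose key is non-empty.
    map $geoip2_data_country_code $limit_key_cn {
        default     "";
        CN          $binary_remote_addr;   # Higher limit for China
    }
    map $geoip2_data_country_code $limit_key_us {
        default     "";
        US          $binary_remote_addr;   # US limit
    }
    map $geoip2_data_country_code $limit_key_strict {
        default     "";
        ~^(RU|UA)$  $binary_remote_addr;   # Strict limit for Russia, Ukraine
    }
    map $geoip2_data_country_code $limit_key_default {
        default     $binary_remote_addr;
        CN          "";
        US          "";
        ~^(RU|UA)$  "";
    }

    limit_req_zone $limit_key_default zone=country_default:10m rate=10r/s;
    limit_req_zone $limit_key_cn      zone=country_cn:10m      rate=20r/s;
    limit_req_zone $limit_key_us      zone=country_us:10m      rate=15r/s;
    limit_req_zone $limit_key_strict  zone=country_strict:10m  rate=5r/s;

    server {
        listen 80;
        server_name global.example.com;

        location / {
            limit_req zone=country_default burst=5;
            limit_req zone=country_cn      burst=5;
            limit_req zone=country_us      burst=5;
            limit_req zone=country_strict  burst=5;
            add_header X-Country-Code $geoip2_data_country_code;
            add_header X-Country-Name $geoip2_data_country_name;
            proxy_pass http://backend;
        }
    }
}
Advanced Anti‑Crawler Strategies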
User‑Agent Detection and Filtering
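The decision order used in this section (allowlisted crawlers first, then empty or very short User-Agent strings, then known tool signatures) can be sketched in Python before reading the nginx configuration. The pattern lists mirror the ones in the configuration; the function name `should_block` is hypothetical.

```python
import re

# Substrings that suggest automated clients (matched case-insensitively).
BLOCK_RE = re.compile(
    r"bot|spider|crawler|scraper|python-requests|curl|wget|scrapy|beautifulsoup",
    re.IGNORECASE,
)
# Search-engine crawlers that should be let through.
ALLOW_RE = re.compile(r"googlebot|bingbot|baiduspider|slurp", re.IGNORECASE)


def should_block(user_agent: str) -> bool:
    """Return True when the request should receive a 403."""
    if ALLOW_RE.search(user_agent):
        return False                 # the allowlist wins over the blocklist
    if len(user_agent) <= 10:
        return True                  # empty or suspiciously short
    return bool(BLOCK_RE.search(user_agent))


samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "python-requests/2.31.0",
    "",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
]
for ua in samples:
    print(should_block(ua), repr(ua[:40]))
```

Because the User-Agent header is controlled by the client, this kind of filter only stops unsophisticated tools; a scraper that copies a browser string, or claims to be Googlebot, passes it.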
http {
    map $http_user_agent $is_crawler {
        default 0;
        ~*bot 1;
        ~*spider 1;
        ~*crawler 1;
        ~*scraper 1;
        ~*python-requests 1;
        ~*curl 1;
        ~*wget 1;
        ~*scrapy 1;
        ~*beautifulsoup 1;
        "" 1;                 # empty User-Agent
        "~^.{0,10}$" 1;       # very short User-Agent (a regex with braces must be quoted)
    }

    # User-Agent strings are easy to spoof; this allowlist is a convenience, not proof of identity
    map $http_user_agent $allowed_crawler {
        default 0;
        ~*googlebot 1;
        ~*bingbot 1;
        ~*baiduspider 1;
        ~*slurp 1;
    }

    # Block only crawlers that are not on the allowlist
    map "$is_crawler:$allowed_crawler" $block_crawler {
        default 0;
        "1:0" 1;
    }

    server {
        listen 80;
        server_name example.com;

        location / {
            if ($block_crawler) { return 403; }
            proxy_pass http://backend;
        }
    }
}
Intelligent Request‑Feature Detection
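http {
    map $http_referer $suspicious_referer {
        default 0;
        "" 1;
        "-" 1;
    }
    map "$http_accept:$http_accept_language:$http_accept_encoding" $suspicious_headers {
        default 0;
        "::" 1;               # all three headers missing
        ~^[^:]*:[^:]*:$ 1;    # Accept-Encoding missing
    }

    # A combined value of "11" means both signals fired
    map "$suspicious_referer$suspicious_headers" $is_suspicious {
        default 0;
        "11" 1;
    }
    # limit_req is not allowed inside "if", so suspicious requests are selected
    # through the zone key instead: requests with an empty key are not limited
    map $is_suspicious $suspicious_key {
        default "";
        1 $binary_remote_addr;
    }
    limit_req_zone $suspicious_key zone=freq_check:10m rate=30r/s;

    server {
        listen 80;
        server_name example.com;

        location / {
            # access_log declared here replaces any inherited access_log
            access_log /var/log/nginx/access.log combined;
            access_log /var/log/nginx/suspicious.log combined if=$is_suspicious;
            limit_req zone=freq_check burst=1 nodelay;
            proxy_pass http://backend;
        }
    }
}
JavaScript Challenge Verification (Lua)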
http {
    lua_package_path "/usr/local/openresty/lualib/?.lua;;";
    lua_shared_dict challenge_cache 10m;

    server {
        listen 80;
        server_name secure.example.com;

        location /challenge {
            content_by_lua_block {
                -- resty.template comes from the third-party lua-resty-template library
                local template = require "resty.template"
                local challenge = ngx.now() .. ngx.var.remote_addr
                -- secret_key is a placeholder and should be replaced with a real secret
                local hash = ngx.encode_base64(ngx.hmac_sha1("secret_key", challenge))
                local html = [[
<!DOCTYPE html>
<html>
<head><title>Verification Required</title><meta name="robots" content="noindex, nofollow"></head>
<body><h1>Verifying your browser...</h1>
<script>
var result = Math.pow(2,3) + 5;
var challenge = "{*challenge*}";
setTimeout(function(){
var form=document.createElement('form');
form.method='POST';form.action='/verify';
var c=document.createElement('input');c.type='hidden';c.name='challenge';c.value=challenge;
var a=document.createElement('input');a.type='hidden';a.name='answer';a.value=result;
form.appendChild(c);form.appendChild(a);document.body.appendChild(form);form.submit();
},2000);
</script>
</body>
</html>
]]
                -- Serve the page as HTML so the browser runs the script
                ngx.header.content_type = "text/html"
                ngx.say(template.compile(html)({challenge=hash}))
            }
        }

        location /verify {
            content_by_lua_block {
                if ngx.var.request_method ~= "POST" then
                    ngx.status = 405
                    ngx.say("Method not allowed")
                    return
                end
                ngx.req.read_body()
                local args = ngx.req.get_post_args()
                -- Only the arithmetic answer is checked; the challenge token is not validated here
                if args.answer == "13" then
                    ngx.shared.challenge_cache:set(ngx.var.remote_addr, "verified", 3600)
                    return ngx.redirect("/")
                else
                    ngx.status = 403
                    ngx.say("Verification failed")
                end
            }
        }

        location / {
            access_by_lua_block {
                local cache = ngx.shared.challenge_cache
                if not cache:get(ngx.var.remote_addr) then
                    return ngx.redirect("/challenge")
                end
            }
            proxy_pass http://backend;
        }
    }
}
Dynamic Protection and Monitoring
Real‑Time Monitoring and Alerts
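http {
    # $geoip2_data_country_code requires the geoip2 block from the geolocation example
    log_format security_log '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_time $upstream_response_time $geoip2_data_country_code';
    # Provided by the third-party nginx-module-vts
    vhost_traffic_status_zone;

    # $limit_req_status (nginx 1.17.6+) holds PASSED, DELAYED, REJECTED, or a
    # *_DRY_RUN variant. It is not an HTTP status code and is only known after
    # the rewrite phase, so it is tested with access_log if= rather than "if".
    map $limit_req_status $log_rate_limited {
        default 0;
        REJECTED 1;
    }

    server {
        listen 80;
        server_name monitor.example.com;

        location / {
            # A limit_req directive for a zone defined elsewhere is assumed to apply here
            access_log /var/log/nginx/security.log security_log;
            access_log /var/log/nginx/rate_limit.log security_log if=$log_rate_limited;
            proxy_pass http://backend;
        }

        location /nginx_status {
            vhost_traffic_status_display;
            vhost_traffic_status_display_format html;
            allow 10.0.0.0/8;
            allow 172.16.0.0/12;
            allow 192.168.0.0/16;
            deny all;
        }
    }
}
Automated Blacklist Management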
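Any script that reads security.log depends on field positions in the security_log format from the monitoring configuration. Splitting on whitespace is reliable only for the leading fields, because the quoted User-Agent may contain spaces. The Python sketch below parses a line with a regular expression instead and lists IPs with many rejected requests. The function name `blacklist_candidates`, the threshold, and the sample log lines are made up for illustration, and the pattern assumes a single upstream response time per line.

```python
import re
from collections import Counter

# One security_log line: ip - user [time] "request" status bytes "referer" "ua" rt urt country
LINE_RE = re.compile(
    r'^(?P<ip>\S+) - \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) (?P<bytes>\d+) '
    r'"[^"]*" "[^"]*" (?P<request_time>[\d.]+) \S+ \S+$'
)


def blacklist_candidates(lines, max_rejections=100):
    """Return sorted IPs with more than max_rejections 403/429 responses."""
    rejected = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("status") in ("403", "429"):
            rejected[m.group("ip")] += 1
    return sorted(ip for ip, n in rejected.items() if n > max_rejections)


# Made-up sample lines using documentation IP ranges
sample = [
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /api/items HTTP/1.1" '
    '429 0 "-" "python-requests/2.31.0" 0.000 - US',
] * 3 + [
    '198.51.100.4 - - [10/Oct/2024:13:55:37 +0000] "GET / HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64)" 0.012 0.011 DE',
]
candidates = blacklist_candidates(sample, max_rejections=2)
print(candidates)
```

In this format the status and body size are the 9th and 10th whitespace-separated fields, while the request time is the third field from the end of the line.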
#!/bin/bash
# auto_blacklist.sh - generate deny rules from security.log
LOG_FILE="/var/log/nginx/security.log"
BLACKLIST_FILE="/etc/nginx/conf.d/blacklist.conf"
TEMP_FILE="$(mktemp)"
trap 'rm -f "$TEMP_FILE"' EXIT

# For the current hour, count per client IP the rejected requests (403/429)
# and the large responses. In the security_log format, $9 is the status and
# $10 is the body size; LC_ALL=C keeps the month name in the log's format.
awk -v date="$(LC_ALL=C date '+%d/%b/%Y:%H')" '
    index($0, date) {
        ip = $1; total[ip]++
        if ($9 == "429" || $9 == "403") suspicious[ip]++
        if ($10 > 10000) large[ip]++
    }
    END { for (ip in total) if (suspicious[ip] > 100 || large[ip] > 50) print "deny " ip ";" }
' "$LOG_FILE" > "$TEMP_FILE"

if [ -s "$TEMP_FILE" ]; then
    echo "# Auto-generated blacklist - $(date)" > "$BLACKLIST_FILE"
    cat "$TEMP_FILE" >> "$BLACKLIST_FILE"
    nginx -t && nginx -s reload
    echo "Blacklist updated with $(wc -l < "$TEMP_FILE") entries"
fi
Performance Optimization and Best Practices
Memory Usage Optimization
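A rough capacity estimate helps when choosing zone sizes. The nginx documentation states that, for a zone keyed by $binary_remote_addr, each state occupies 128 bytes on 64-bit platforms (64 bytes on 32-bit), so a one-megabyte zone holds about 8 thousand states. The helper below simply applies that arithmetic; the function name `zone_capacity` is hypothetical, and the result is an upper bound because shared-memory bookkeeping also takes space.

```python
def zone_capacity(zone_mb: int, state_bytes: int = 128) -> int:
    """Approximate upper bound on the number of states a limit_req zone can hold."""
    return zone_mb * 1024 * 1024 // state_bytes


for size_mb in (10, 50):
    print(f"{size_mb}m zone: about {zone_capacity(size_mb):,} client states")
```

Zones keyed by longer values, such as a client address combined with a URI, store fewer entries per megabyte because each state also holds the key itself. That is why collapsing URIs into a small set of normalized values keeps memory use predictable.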
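http {
    limit_req_zone $binary_remote_addr zone=main_limit:50m rate=10r/s;

    # Collapse URIs into fewer keys so the zone does not grow with every distinct URL
    map $request_uri $normalized_uri {
        ~^/api/v1/([^/]+) /api/v1/$1;
        ~^/static/ /static;
        default $request_uri;
    }
    limit_req_zone "$binary_remote_addr:$normalized_uri" zone=uri_limit:30m rate=20r/s;

    # Cache storage used by proxy_cache below (path and sizes are placeholders)
    proxy_cache_path /var/cache/nginx keys_zone=my_cache:10m max_size=1g inactive=10m;

    server {
        location / {
            limit_req zone=main_limit burst=10;
            limit_req zone=uri_limit burst=5;
            proxy_pass http://backend;
            proxy_cache my_cache;
            proxy_cache_valid 200 1m;
            # Key the cache on the full URI: keying on $normalized_uri would make
            # distinct URLs (for example every file under /static/) share one response
            proxy_cache_key "$scheme$proxy_host$request_uri";
        }
    }
}
Configuration Modularity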
# /etc/nginx/conf.d/rate_limits.conf
limit_req_zone $binary_remote_addr zone=global_limit:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=5r/s;
limit_req_zone $binary_remote_addr zone=auth_limit:10m rate=1r/s;
# /etc/nginx/conf.d/security_maps.conf
map $http_user_agent $is_malicious_bot { include /etc/nginx/maps/malicious_bots.map; }
map $geoip2_data_country_code $is_blocked_country { include /etc/nginx/maps/blocked_countries.map; }
# /etc/nginx/conf.d/security_headers.conf
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
Troubleshooting and Debugging
Common Issue Diagnosis
# Verify rate‑limit works
curl -I http://example.com/api/test
# Send many requests to trigger limit
for i in {1..20}; do curl -s -o /dev/null -w "%{http_code}\n" http://example.com/api/test; done
# Show limit‑req zones
nginx -T | grep -A 10 limit_req_zone
Performance Monitoring Script
#!/bin/bash
check_nginx_performance() {
    echo "=== Nginx Performance Report ==="
    echo "Time: $(date)"
    # Count established connections on port 80 (ss -tln would list only listening sockets)
    echo "Active Connections:"
    ss -tn state established '( sport = :80 )' | tail -n +2 | wc -l
    echo -e "\nConfigured limit_req zones:"
    nginx -T 2>/dev/null | grep -c limit_req_zone
    echo -e "\nStatus Codes (Last 100 requests):"
    tail -100 /var/log/nginx/access.log | awk '{print $9}' | sort | uniq -c | sort -nr
    echo -e "\nNginx Memory Usage:"
    ps aux | grep '[n]ginx' | awk '{sum+=$6} END {print sum/1024 " MB"}'
}
check_nginx_performance
Conclusion and Outlook
Core Advantages
Multi‑layer protection from basic rate limiting to advanced challenges.
Intelligent detection using User‑Agent, GeoIP, request patterns.
Performance‑focused configuration for high concurrency.
Operations‑friendly monitoring, alerts, and automated blacklist.
Implementation Recommendations
Start with basic limits, then add advanced modules gradually.
Use a gray (canary) release to test new rules on a subset of traffic before applying them everywhere.
Set up real‑time monitoring and alerts.
Periodically review metrics and adjust parameters.
Future Trends
Behavior‑based AI analysis for smarter bot detection.
Real‑time learning models that adapt limits automatically.
Collaborative defense with shared threat intelligence across nodes.