Uncovering Bot Traffic: Why Bots, Including AI Crawlers, Make Up 47% of My Site’s Requests
Almost a year of Nginx access logs shows that bots, including AI crawlers, generate nearly half of all requests (47.1%), while real users account for 42.6%. The analysis also covers security threats, attack patterns, and the effectiveness of blacklist defenses across monthly, daily, and hourly dimensions.
Overview
The personal blog 2tuan.work has been running for almost a year since a domain change. I analyzed its Nginx access logs from an operational and security angle to understand traffic composition, bot behavior, attack activity, and user engagement.
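Every figure in this analysis starts from parsed access-log lines. The following is a minimal sketch of that step, assuming the logs use Nginx's default "combined" format. The site's actual log_format is not given, and the names LINE_RE and parse_line, as well as the sample line, are illustrative.

```python
import re
from datetime import datetime

# Assumption: Nginx "combined" log format. A custom log_format
# would need a different pattern.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Nginx writes timestamps like 10/Feb/2025:18:04:11 +0800.
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    # The request field is normally "METHOD PATH PROTOCOL"; malformed
    # requests (common in scans) may not split into three parts.
    parts = rec["request"].split(" ")
    rec["path"] = parts[1] if len(parts) >= 2 else ""
    rec["status"] = int(rec["status"])
    return rec


# Made-up sample line using a documentation-reserved IP address.
sample = (
    '203.0.113.7 - - [10/Feb/2025:18:04:11 +0800] '
    '"GET /.env HTTP/1.1" 404 153 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.0)"'
)
record = parse_line(sample)
```

Lines that do not match return None, so they can be counted separately instead of silently skewing the totals.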
Key Metrics
Total requests (excluding admin IPs): 416,179
Detected attacks: 37,392
Attack‑origin IPs: 2,087
Blacklist‑blocked IPs: 11,530
Unique visitor IPs: 39,036
Real user visits: 19,874 (real users generate 42.62% of requests)
Traffic Composition
Normal user traffic: 42.6% (177,375 requests)
Bot traffic (including SEO, search engines, AI crawlers): 47.1% (196,199 requests)
Requests without User‑Agent: 10.2% (42,605 requests)
Attack traffic: 9.0% (37,392 attempts). This figure overlaps the three categories above, which already add up to the 416,179 total.
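The percentages above can be checked directly from the reported counts. The short sketch below recomputes each share and confirms that the first three categories add up to the total, which means attack requests are counted inside those categories and not on top of them. The variable names are illustrative.

```python
# Counts as reported in this article.
TOTAL = 416_179
composition = {
    "normal users": 177_375,
    "bots": 196_199,
    "no User-Agent": 42_605,
}
ATTACKS = 37_392

# Share of total requests for each category, in percent.
shares = {name: round(100 * n / TOTAL, 1) for name, n in composition.items()}
attack_share = round(100 * ATTACKS / TOTAL, 1)

# The three categories partition the traffic: their sum equals TOTAL.
partition_sum = sum(composition.values())

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"attacks (overlapping): {attack_share}%")
```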
Bot Classification
Four major bot categories were identified:
SEO crawlers: 37,877 requests, 2,296 unique IPs, avg 16.5 req/IP.
Search engine bots: 45,783 requests, 4,111 unique IPs, avg 11.1 req/IP.
AI crawlers: 29,228 requests, 1,989 unique IPs, avg 14.7 req/IP.
Monitoring scanners: 4,162 requests, 496 unique IPs, avg 8.4 req/IP.
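A classification like this can be approximated by matching substrings in the User-Agent header and then counting requests and unique IPs per category. The sketch below is one way to do it. The token lists are illustrative assumptions, not the signature set used for this analysis, and the sample records are made up.

```python
from collections import defaultdict

# Assumption: a small, illustrative set of User-Agent tokens per category.
SIGNATURES = {
    "ai": ("bytespider", "gptbot", "claudebot"),
    "seo": ("ahrefsbot", "semrushbot", "mj12bot"),
    "search": ("googlebot", "bingbot", "baiduspider", "yandexbot"),
    "monitoring": ("uptimerobot", "censysinspect"),
}


def classify(user_agent):
    """Return the first category whose token appears in the User-Agent."""
    ua = user_agent.lower()
    for category, tokens in SIGNATURES.items():
        if any(token in ua for token in tokens):
            return category
    return "other"


def summarize(records):
    """records: iterable of (ip, user_agent) pairs.

    Returns {category: (requests, unique_ips, avg_requests_per_ip)}.
    """
    requests = defaultdict(int)
    ips = defaultdict(set)
    for ip, user_agent in records:
        category = classify(user_agent)
        requests[category] += 1
        ips[category].add(ip)
    return {
        c: (requests[c], len(ips[c]), round(requests[c] / len(ips[c]), 1))
        for c in requests
    }


# Made-up sample records using documentation-reserved IP addresses.
sample_records = [
    ("198.51.100.1", "Mozilla/5.0 (compatible; GPTBot/1.0)"),
    ("198.51.100.1", "Mozilla/5.0 (compatible; Bytespider)"),
    ("198.51.100.2", "Mozilla/5.0 (compatible; ClaudeBot/1.0)"),
    ("198.51.100.3", "Mozilla/5.0 (compatible; Googlebot/2.1)"),
]
summary = summarize(sample_records)
```

User-Agent strings are trivially spoofed, so substring matching only identifies bots that announce themselves.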
AI Crawler Insights
AI bot traffic grew sharply in February 2025 (5,529 requests, 17.18% of that month's traffic) and peaked again in August 2025 (4,766 requests, 10.47%). The dominant AI crawler is Bytespider (49.6% of AI bot requests), followed by GPTBot (23.9%) and ClaudeBot (14.2%).
Tooling and Language Distribution
Other/custom clients: 84.5% of malicious/unknown bot requests.
Go‑based tools: 7.1% (4,693 requests).
Python tools: 4.8% (3,196 requests).
cURL: 3.4% (2,251 requests).
Java, Node.js, Wget each < 0.3%.
Attack Types
Information‑leakage scans: 27,393 (73.3%).
Command injection: 4,521 (12.1%).
CVE exploitation: 4,104 (11.0%).
Path traversal: 726 (1.9%).
Brute‑force login attempts: 609 (1.6%).
SQL injection: 38 (0.1%).
XSS: 1 (0.0%).
Typical detection patterns include requests for /.env, /.git/, and /vendor/phpunit/phpunit, as well as shell‑execution patterns such as the regular expression ;.*sh\s+.
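Patterns like these translate naturally into a small rule table applied to each request path. The sketch below is a hedged illustration. It assumes the shell pattern is the regular expression `;.*sh\s+`, the mapping of each pattern to an attack category is a guess, and URL-decoding before matching is a design choice of this example, not something the analysis describes.

```python
import re
from urllib.parse import unquote_plus

# Assumption: category labels and pattern-to-category mapping are illustrative.
ATTACK_PATTERNS = [
    ("info_leak", re.compile(r"/\.env|/\.git/")),
    ("cve_exploit", re.compile(r"/vendor/phpunit/phpunit")),
    ("command_injection", re.compile(r";.*sh\s+")),
]


def detect_attack(path):
    """Return the label of the first matching pattern, or None."""
    # Decode %20, + and similar escapes so payloads match the raw patterns.
    decoded = unquote_plus(path)
    for label, pattern in ATTACK_PATTERNS:
        if pattern.search(decoded):
            return label
    return None


# Made-up request paths for illustration.
examples = [
    "/.env",
    "/.git/config",
    "/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php",
    "/cgi-bin/test?cmd=;/bin/sh%20-c%20id",
    "/posts/hello-world",
]
labels = [detect_attack(p) for p in examples]
```

Rules are checked in order and the first match wins, so more specific patterns belong earlier in the list.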
Blacklist Effectiveness
The OpenResty blacklist script blocked 11,530 malicious IPs, rejecting 159,980 requests (≈9% of total traffic). This reduced server load and mitigated many attack vectors.
User Behavior
Total real‑user requests: 177,375.
Average requests per user: 7.9.
User activity distribution: 91.5% light users (<10 visits), 6.8% moderate (10‑49 visits), 1.7% heavy (≥50 visits).
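The light, moderate, and heavy tiers are simple thresholds on visits per user. A minimal sketch of the bucketing follows, assuming one visit count per user key (for example, per IP address). The function names and sample counts are illustrative.

```python
from collections import Counter


def bucket(visits):
    """Map a visit count to the tiers used above."""
    if visits < 10:
        return "light"
    if visits < 50:
        return "moderate"
    return "heavy"


def activity_distribution(visits_per_user):
    """visits_per_user: mapping of user key -> visit count.

    Returns the percentage of users in each tier.
    """
    counts = Counter(bucket(v) for v in visits_per_user.values())
    total = sum(counts.values())
    tiers = ("light", "moderate", "heavy")
    if total == 0:
        return {tier: 0.0 for tier in tiers}
    return {tier: round(100 * counts[tier] / total, 1) for tier in tiers}


# Made-up visit counts for eight users.
sample_visits = {"a": 1, "b": 3, "c": 9, "d": 10, "e": 49, "f": 50, "g": 2, "h": 4}
distribution = activity_distribution(sample_visits)
```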
Temporal Patterns
Total requests peaked in July 2025 (52,106), while real‑user requests peaked in March 2025 (20,129). By hour of the week, normal users are most active on Monday 22:00‑22:59, while bots are most active on Monday 18:00‑18:59.
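Peak slots like these come from grouping request timestamps by weekday and hour and taking the most frequent pair. The sketch below shows the idea, assuming timestamps in Nginx's default time_local format and using the hour as written in the log (no timezone conversion). The function name and sample timestamps are illustrative.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def peak_slot(raw_timestamps):
    """Return (weekday, hour, count) for the busiest weekday/hour slot."""
    slots = Counter()
    for raw in raw_timestamps:
        ts = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
        slots[(WEEKDAYS[ts.weekday()], ts.hour)] += 1
    (weekday, hour), count = slots.most_common(1)[0]
    return weekday, hour, count


# Made-up timestamps: two on a Monday evening, one on a Tuesday morning.
sample_times = [
    "03/Mar/2025:22:05:00 +0800",
    "03/Mar/2025:22:40:10 +0800",
    "04/Mar/2025:09:00:00 +0800",
]
peak = peak_slot(sample_times)
```

Running the same function separately on real-user and bot requests gives one peak slot per group.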
Key Takeaways
Real users generate less than half of the traffic (42.6%); bots account for 47.1%.
AI crawlers are a growing segment, with Bytespider leading at 49.6% of AI bot requests.
Security threats are primarily information‑leakage scans (73.3% of attacks) and command‑injection attempts (12.1%).
Blacklist rules are effective, blocking over 11k malicious IPs.
Understanding bot signatures (User‑Agent strings, request patterns) helps refine detection.
