Build a High‑Performance Web Log Analyzer with Python and MongoDB
This article introduces a Python‑based log analysis tool for web servers that provides minute‑level aggregation, abstracted URI and argument patterns, and multi‑dimensional performance metrics, along with installation steps, core features, implementation details, usage commands, and deployment guidelines.
Introduction
Log analysis plays a crucial role in troubleshooting and performance analysis of web systems. This tool focuses on fine‑grained, minute‑level logs, providing abstraction and summarisation to locate anomalies and evaluate performance.
Environment Installation
Python 3.4+
pymongo 3.4.0+
MongoDB server
Key Terminology
uri – the part of a request without parameters.
request_uri – the original request, with or without parameters.
args – the parameter part of a request (as defined in nginx).
uri_abs and args_abs – abstracted strings of uri and args used for classification, e.g. "/sub/0/100414/4070?channel=ios&version=1.4.5" becomes uri_abs: "/sub/*/*/*" and args_abs: "channel=*&version=*".
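The abstraction in the example above can be sketched in a few lines of Python. This is only an illustration of the default behaviour as described in this article (numeric path segments and all argument values become *); the function name abstract_request is hypothetical and is not taken from the tool's source.

```python
def abstract_request(request_uri):
    """Split a request into uri and args, then return (uri_abs, args_abs)."""
    uri, _, args = request_uri.partition("?")
    # Replace purely numeric path segments with "*".
    uri_abs = "/".join("*" if seg.isdigit() else seg for seg in uri.split("/"))
    # Keep argument names, replace every value with "*".
    pairs = [p for p in args.split("&") if p]
    args_abs = "&".join(p.split("=", 1)[0] + "=*" for p in pairs)
    return uri_abs, args_abs


print(abstract_request("/sub/0/100414/4070?channel=ios&version=1.4.5"))
# ('/sub/*/*/*', 'channel=*&version=*')
```

Grouping by these two strings is what lets thousands of distinct request lines collapse into a small number of request classes.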
Features
Provides a single entry point for log analysis across all servers, with filtering by time period and server.
Supports analysis of request URI, IP, and response code based on request count, response size, and response time.
Core idea: treat a class of uri or its corresponding args as a dimension; abstract uri into uri_abs and args_abs.
Default abstraction rules satisfy most needs; custom rules can be defined for flexible abstraction.
URI analysis shows which request classes have the most hits, the largest traffic, or the longest latency, and can display their distribution over minute, ten‑minute, hour, or day intervals. It also allows drilling down to args_abs for a specific uri_abs.
IP analysis groups requests into three sources (CDN/proxy, reverse proxy, direct client), shows the top N IPs per source, their metric distribution over time, and the uri_abs distribution per IP.
Uses quartiles rather than the arithmetic mean to describe response time and size, since quartiles are less distorted by a few extreme values.
High performance: the script runs on each web server via cron; on a 3×7200 rpm RAID5 server with gigabit LAN it processes 20 000–30 000 lines per second.
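To see why quartiles describe response times better than the mean, consider the following sketch. It uses statistics.quantiles from the standard library, which requires Python 3.8 or later (newer than the minimum version listed above); the sample values and the summarize helper are made up for illustration and do not come from the tool.

```python
import statistics


def summarize(values):
    """Return min, the three quartile cut points, max, and the mean."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
    }


# Hypothetical response times in seconds: seven fast requests, one very slow one.
times = [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 9.0]
print(summarize(times))
```

Here the single 9-second request pulls the mean up to roughly 1.15 s, while the median stays near 0.03 s, which is much closer to what a typical request experienced.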
Implementation Idea
The analysis script log_analyse.py is deployed on each web server and scheduled with crontab. It uses Python's re module to parse logs, extracting uri, args, timestamp, status code, response size, response time, and server name, then stores the processed data into MongoDB. The viewer script log_show.py serves as the entry point for querying and visualising the aggregated data. Real‑time capability depends on the execution frequency of log_analyse.py.
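The minute-level aggregation step can be pictured roughly as follows. This is a minimal sketch, not the code of log_analyse.py: the record field names (time_local, uri_abs, bytes, request_time) are assumptions, and the minute key format yymmddHHMM is inferred from the sample output shown later in this article.

```python
from collections import defaultdict
from datetime import datetime


def minute_key(time_local):
    """Convert an nginx $time_local value into a yymmddHHMM string."""
    dt = datetime.strptime(time_local, "%d/%b/%Y:%H:%M:%S %z")
    return dt.strftime("%y%m%d%H%M")


def aggregate(records):
    """Sum hits, bytes, and request time per (minute, uri_abs) bucket."""
    buckets = defaultdict(lambda: {"hits": 0, "bytes": 0, "time": 0.0})
    for rec in records:
        bucket = buckets[(minute_key(rec["time_local"]), rec["uri_abs"])]
        bucket["hits"] += 1
        bucket["bytes"] += rec["bytes"]
        bucket["time"] += rec["request_time"]
    return dict(buckets)


# Hypothetical parsed records.
records = [
    {"time_local": "09/Mar/2018:16:54:03 +0800", "uri_abs": "/view/*/*.json",
     "bytes": 1200, "request_time": 0.02},
    {"time_local": "09/Mar/2018:16:54:41 +0800", "uri_abs": "/view/*/*.json",
     "bytes": 800, "request_time": 0.03},
]
print(aggregate(records))
```

Each bucket could then be written to MongoDB as one document per minute and request class; the actual document schema used by the tool is not described here.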
Prerequisites
All servers store log files in a unified path.
Log format and naming rules are consistent (e.g., xxx.access.log).
Daily log rotation at midnight.
Log Format Example
log_format access '$remote_addr - [$time_local] "$request" '
'$status $body_bytes_sent $request_time "$http_referer" '
'"$http_user_agent" - $http_x_forwarded_for';
The same principle can be applied to other nginx or Apache log formats with minor adjustments.
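A regular expression matching the log_format shown above might look like the following. It is a sketch written against that format only, not the pattern used by log_analyse.py; the sample line, the user agent in it, and the parse_line helper are invented for illustration.

```python
import re

# One named group per nginx variable in the log_format above.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'(?P<request_time>\d+(?:\.\d+)?) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)" - '
    r'(?P<http_x_forwarded_for>.*)$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    parts = rec["request"].split()
    if len(parts) != 3:  # expect: method, request_uri, protocol
        return None
    rec["method"], rec["request_uri"], rec["protocol"] = parts
    rec["status"] = int(rec["status"])
    rec["body_bytes_sent"] = int(rec["body_bytes_sent"])
    rec["request_time"] = float(rec["request_time"])
    return rec


# Hypothetical log line in the format above.
SAMPLE = ('140.206.109.174 - [09/Mar/2018:16:54:03 +0800] '
          '"GET /sub/0/100414/4070?channel=ios&version=1.4.5 HTTP/1.1" '
          '200 1532 0.012 "-" "curl/7.58.0" - -')
print(parse_line(SAMPLE))
```

Adapting this to another format mostly means reordering or renaming the groups so that they follow the new log_format string.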
Handling Abnormal Logs
Instead of simple split(), the tool uses the re module to tolerate irregular records. Tolerable anomalies are processed with custom logic; intolerable ones are discarded and logged separately. Defining a unique delimiter (e.g., |) in the nginx log format can simplify parsing.
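The delimiter suggestion can be illustrated with a small sketch. It assumes a hypothetical nginx format of the form '$remote_addr|$time_local|$request|$status|$body_bytes_sent|$request_time', which is not the format used elsewhere in this article, and it shows one possible way to keep unparseable lines aside instead of silently dropping them.

```python
FIELDS = ("remote_addr", "time_local", "request",
          "status", "body_bytes_sent", "request_time")


def parse_delimited(line):
    """Parse one '|'-delimited line; return None for irregular records."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != len(FIELDS):
        return None
    rec = dict(zip(FIELDS, parts))
    try:
        rec["status"] = int(rec["status"])
        rec["body_bytes_sent"] = int(rec["body_bytes_sent"])
        rec["request_time"] = float(rec["request_time"])
    except ValueError:
        return None
    return rec


def split_good_bad(lines):
    """Separate parsed records from lines that should be logged for review."""
    good, bad = [], []
    for line in lines:
        rec = parse_delimited(line)
        if rec is None:
            bad.append(line)
        else:
            good.append(rec)
    return good, bad


# Hypothetical input: one well-formed line and one truncated line.
lines = [
    "10.0.0.1|09/Mar/2018:16:54:03 +0800|GET /a HTTP/1.1|200|512|0.004\n",
    "10.0.0.2|09/Mar/2018:16:54:04 +0800|GET /b\n",
]
good, bad = split_good_bad(lines)
print(len(good), len(bad))
```

A delimiter only helps if it cannot appear inside a field; a request string containing | would still be rejected by the field-count check here.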
Usage of log_show.py
Run log_show.py --help for help. Main sub‑commands:
request – analyse requests for a site, optionally filtered by time, server, or specific URI.
ip – analyse traffic based on IP address, showing source breakdown.
distribution – aggregate metrics (hits, bytes, time) over minute, ten‑minute, hour, or day granularity for all requests or a specific URI.
detail – drill down into a specific uri_abs to see args_abs distribution, or into an IP to see its uri_abs distribution.
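One way to picture the granularity option of the distribution sub-command: if minute buckets are keyed by a yymmddHHMM string (as the sample output in this article suggests), coarser buckets can be derived by truncating the key. The sketch below is an assumption about how such a roll-up could work, not the tool's implementation; the granularity name ten_min and both function names are hypothetical.

```python
from collections import Counter

# Number of leading characters of a yymmddHHMM key kept per granularity.
KEY_LENGTH = {"minute": 10, "ten_min": 9, "hour": 8, "day": 6}


def bucket_key(minute_key, granularity):
    """Truncate a minute key to the requested granularity."""
    return minute_key[:KEY_LENGTH[granularity]]


def roll_up(minute_hits, granularity):
    """Sum per-minute hit counts into coarser buckets."""
    totals = Counter()
    for key, hits in minute_hits.items():
        totals[bucket_key(key, granularity)] += hits
    return dict(totals)


# Hypothetical per-minute hit counts.
minute_hits = {"1803091654": 10, "1803091655": 5, "1803091701": 7}
print(roll_up(minute_hits, "hour"))
# {'18030916': 15, '18030917': 7}
```

Because the finest data is stored per minute, every coarser view can be computed at query time without re-reading the raw logs.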
Examples (output trimmed for brevity):
# Request example
$ log_show api request -f 180201 -l 3
Total_hits: 10069 Total_bytes: 7.62 MB
uri_abs: /recommend/batchUpdate
... (hits, bytes, time, args_abs breakdown)
# IP example
$ log_show api ip -t 180314 distribution "140.206.109.174" -l 0
IP: 140.206.109.174
Total_hits: 10999 Total_bytes: 4.83 MB
hour hits(%) bytes(%)
18031306 1273 11.57% 765.40 KB 15.47%
...
# Distribution example
$ log_show api request distribution "/view/*/*.json" -g minute -l 5
Total_hits: 17130 Total_bytes: 23.92 MB
minute hits hits(%) bytes bytes(%) time_distribution(s) bytes_distribution(B)
1803091654 1543 9.01% 2.15 MB 8.98% ...
...
# Detail example
$ log_show api request detail "/recommend/update" -l 3
uri_abs: /recommend/batchUpdate
Total_hits: 10069 Total_bytes: 7.62 MB
args_abs breakdown:
uid=* & category_id=* & channel=* & version=* 4568 hits 45.37% ...
Deployment
Add a cron job to run the analyser periodically, for example:
*/15 * * * * export LANG=zh_CN.UTF-8; python3 /home/ljk/log_analyse.py > /tmp/log_analyse.log
Note on Abstraction
uri_abs abstracts numeric path segments to *; args_abs replaces all argument values with *. Custom rules can be defined in analyse_config.py to suit specific log formats.
MaGe Linux Operations
