
Build a High‑Performance Web Log Analyzer with Python and MongoDB

This article introduces a Python‑based log analysis tool for web servers that provides minute‑level aggregation, abstracted URI and argument patterns, and multi‑dimensional performance metrics, along with installation steps, core features, implementation details, usage commands, and deployment guidelines.

MaGe Linux Operations

Introduction

Log analysis plays a crucial role in troubleshooting and performance analysis of web systems. This tool focuses on fine‑grained, minute‑level logs, providing abstraction and summarisation to locate anomalies and evaluate performance.

Environment Installation

Python 3.4+

pymongo 3.4.0+

MongoDB server

Key Terminology

uri – the part of a request without parameters.

request_uri – the original request, with or without parameters.

args – the parameter part of a request (as defined in nginx).

uri_abs and args_abs – abstracted strings of uri and args, used for classification. For example, "/sub/0/100414/4070?channel=ios&version=1.4.5" becomes uri_abs: "/sub/*/*/*" and args_abs: "channel=*&version=*".
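As a minimal sketch of how these terms relate, the snippet below splits a request_uri into its uri and args parts at the first "?". The function name split_request_uri is an assumption made for illustration and is not taken from the tool's source.

```python
def split_request_uri(request_uri):
    """Split a raw request_uri into (uri, args); args is '' when there are none."""
    # Everything before the first '?' is the uri, everything after it is args.
    uri, _, args = request_uri.partition("?")
    return uri, args


uri, args = split_request_uri("/sub/0/100414/4070?channel=ios&version=1.4.5")
print(uri)   # /sub/0/100414/4070
print(args)  # channel=ios&version=1.4.5
```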

Features

Provides a unified entry for log analysis across all servers, with filtering by time period and server.

Supports analysis of request URI, IP, and response code based on request count, response size, and response time.

Core idea: treat a class of uri, or a class of its corresponding args, as one dimension of analysis. To do this, uri and args are abstracted into uri_abs and args_abs.

Default abstraction rules satisfy most needs; custom rules can be defined for flexible abstraction.

uri analysis shows which request classes have many hits, large traffic, or long latency, and can display their distribution over minute, ten‑minute, hour, or day intervals. It also allows drilling down to args_abs for a specific uri_abs.

IP analysis groups requests into three sources (CDN/proxy, reverse proxy, and direct client access), then shows the top N IPs per source, their metric distribution over time, and the uri_abs distribution per IP.
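The article does not say how the three sources are told apart. One plausible approach, sketched here purely as an assumption, compares $remote_addr with $http_x_forwarded_for from the log format: no forwarded address means a direct client, a forwarded address arriving from an internal peer suggests the site's own reverse proxy, and a forwarded address arriving from a public peer suggests a CDN or external proxy. The function name and the returned labels are hypothetical.

```python
import ipaddress


def classify_source(remote_addr, x_forwarded_for):
    """Guess the request source from two log fields (assumed rules, not the tool's)."""
    # nginx logs '-' when the X-Forwarded-For header is absent.
    forwarded = [ip.strip() for ip in x_forwarded_for.split(",")
                 if ip.strip() not in ("", "-")]
    if not forwarded:
        return "client_directly"
    # Assumption: an internal peer address means our own reverse proxy sits in front.
    if ipaddress.ip_address(remote_addr).is_private:
        return "reverse_proxy"
    return "cdn_proxy"


print(classify_source("140.206.109.174", "-"))
print(classify_source("10.0.0.5", "140.206.109.174"))
print(classify_source("8.8.8.8", "140.206.109.174"))
```

A real deployment would likely replace the is_private check with an explicit list of known proxy and CDN addresses.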

Uses quartiles to describe response time and response size, which represent skewed data more accurately than the arithmetic mean.
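To illustrate why quartiles are preferred, the sketch below computes the minimum, the three quartiles, and the maximum with linear interpolation, then compares them with the mean for a sample that contains one slow request. The interpolation method and the function name are assumptions; the tool may compute its quartiles differently.

```python
def quartiles(values):
    """Return (min, Q1, median, Q3, max) using linear interpolation."""
    data = sorted(values)
    if not data:
        raise ValueError("quartiles() needs at least one value")

    def percentile(p):
        # Fractional 0-based position of the p-th percentile in the sorted data.
        pos = (len(data) - 1) * p
        low = int(pos)
        high = min(low + 1, len(data) - 1)
        return data[low] + (data[high] - data[low]) * (pos - low)

    return data[0], percentile(0.25), percentile(0.5), percentile(0.75), data[-1]


# Nine fast responses and one very slow one, in seconds.
times = [0.02, 0.02, 0.03, 0.03, 0.03, 0.04, 0.04, 0.05, 0.05, 9.0]
print("mean:", sum(times) / len(times))
print("quartiles:", quartiles(times))
```

The single 9-second outlier pulls the mean close to one second, while the median and third quartile stay near the typical response time of a few hundredths of a second.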

High performance: the script runs on each web server via cron. On a server with three 7200 rpm disks in RAID 5 and a gigabit LAN, it processes 20,000 to 30,000 lines per second.

Implementation Idea

The analysis script log_analyse.py is deployed on each web server and scheduled with crontab. It uses Python's re module to parse logs, extracting uri, args, timestamp, status code, response size, response time, and server name, then stores the processed data into MongoDB. The viewer script log_show.py serves as the entry point for querying and visualising the aggregated data. Real‑time capability depends on the execution frequency of log_analyse.py.
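The minute-level aggregation step could look roughly like the sketch below. It assumes each record has already been parsed into a dict with time_local, uri_abs, bytes, and request_time keys; those key names, the document layout, and both function names are assumptions for illustration, not the tool's actual schema.

```python
from collections import defaultdict

MONTHS = {name: "%02d" % number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def minute_key(time_local):
    """Turn nginx $time_local ('09/Mar/2018:16:54:07 +0800') into 'yymmddHHMM'."""
    date_part, hour, minute, _ = time_local.split(":", 3)
    day, month, year = date_part.split("/")
    return year[2:] + MONTHS[month] + day + hour + minute


def aggregate(records):
    """Group parsed records by (minute, uri_abs) and build one document per group."""
    groups = defaultdict(lambda: {"hits": 0, "bytes": 0, "times": []})
    for rec in records:
        group = groups[(minute_key(rec["time_local"]), rec["uri_abs"])]
        group["hits"] += 1
        group["bytes"] += rec["bytes"]
        group["times"].append(rec["request_time"])
    return [
        {"minute": minute, "uri_abs": uri_abs, "hits": g["hits"],
         "bytes": g["bytes"], "times": sorted(g["times"])}
        for (minute, uri_abs), g in sorted(groups.items())
    ]


records = [
    {"time_local": "09/Mar/2018:16:54:07 +0800", "uri_abs": "/sub/*/*/*",
     "bytes": 1532, "request_time": 0.023},
    {"time_local": "09/Mar/2018:16:54:41 +0800", "uri_abs": "/sub/*/*/*",
     "bytes": 900, "request_time": 0.011},
    {"time_local": "09/Mar/2018:16:55:02 +0800", "uri_abs": "/sub/*/*/*",
     "bytes": 1200, "request_time": 0.050},
]
for doc in aggregate(records):
    print(doc)
```

The resulting list of dicts is the kind of structure that pymongo's Collection.insert_many() accepts, so writing it to MongoDB would be a single call on a collection object of your choosing.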

Prerequisites

All servers store log files in a unified path.

Log format and naming rules are consistent (e.g., xxx.access.log).

Daily log rotation at midnight.

Log Format Example

log_format  access  '$remote_addr - [$time_local] "$request" '
'$status $body_bytes_sent $request_time "$http_referer" '
'"$http_user_agent" - $http_x_forwarded_for';

The same principle can be applied to other nginx or Apache log formats with minor adjustments.
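For the access format shown above, a regular expression with named groups might look like the following sketch. The pattern, the parse_line helper, and the sample line are written for this article's log_format only and are assumptions rather than the tool's own code.

```python
import re

# One named group per variable in the log_format above.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) (?P<request_time>[\d.]+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)" - '
    r'(?P<http_x_forwarded_for>.+)$'
)


def parse_line(line):
    """Return a dict of typed fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["body_bytes_sent"] = int(fields["body_bytes_sent"])
    fields["request_time"] = float(fields["request_time"])
    return fields


SAMPLE = ('140.206.109.174 - [09/Mar/2018:16:54:07 +0800] '
          '"GET /sub/0/100414/4070?channel=ios&version=1.4.5 HTTP/1.1" '
          '200 1532 0.023 "-" "Mozilla/5.0" - -')
print(parse_line(SAMPLE))
```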

Handling Abnormal Logs

Instead of simple split(), the tool uses the re module to tolerate irregular records. Tolerable anomalies are processed with custom logic; intolerable ones are discarded and logged separately. Defining a unique delimiter (e.g., |) in the nginx log format can simplify parsing.
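As one hypothetical example of a tolerable anomaly, consider a request field whose URI contains an unencoded space: split() would return four parts instead of three, while an anchored regular expression still recovers the method, URI, and protocol. The sketch below shows that idea and logs anything it cannot parse; the pattern, function name, and logger name are assumptions.

```python
import logging
import re

# Method at the start, protocol at the end, and whatever lies between is the URI.
REQUEST_PATTERN = re.compile(
    r'^(?P<method>[A-Z]+) (?P<request_uri>.+) (?P<protocol>HTTP/\d(?:\.\d)?)$'
)

# Hypothetical logger for records that cannot be parsed at all.
discard_log = logging.getLogger("log_analyse.discarded")


def split_request(request):
    """Return (method, request_uri, protocol), or None for an unusable field."""
    match = REQUEST_PATTERN.match(request)
    if match is None:
        discard_log.warning("discarded request field: %r", request)
        return None
    return match.group("method"), match.group("request_uri"), match.group("protocol")


print(split_request("GET /search?q=hello world HTTP/1.1"))
print(split_request("-"))
```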

Usage of log_show.py

Run log_show.py --help for help. Main sub‑commands:

request – analyse requests for a site, optionally filtered by time, server, or specific URI.

ip – analyse traffic based on IP address, showing source breakdown.

distribution – aggregate metrics (hits, bytes, time) over minute, ten‑minute, hour, or day granularity for all requests or a specific URI.

detail – drill down into a specific uri_abs to see args_abs distribution, or into an IP to see its uri_abs distribution.

Examples (output trimmed for brevity):

# Request example
$ log_show api request -f 180201 -l 3
Total_hits: 10069  Total_bytes: 7.62 MB
uri_abs: /recommend/batchUpdate
... (hits, bytes, time, args_abs breakdown)
# IP example
$ log_show api ip -t 180314 distribution "140.206.109.174" -l 0
IP: 140.206.109.174
Total_hits: 10999  Total_bytes: 4.83 MB
hour  hits(%)  bytes(%)
18031306 1273 11.57% 765.40 KB 15.47%
...
# Distribution example
$ log_show api request distribution "/view/*/*.json" -g minute -l 5
Total_hits: 17130  Total_bytes: 23.92 MB
minute   hits  hits(%)  bytes  bytes(%)  time_distribution(s)  bytes_distribution(B)
1803091654 1543 9.01% 2.15 MB 8.98% ...
...
# Detail example
$ log_show api request detail "/recommend/batchUpdate" -l 3
uri_abs: /recommend/batchUpdate
Total_hits: 10069  Total_bytes: 7.62 MB
args_abs breakdown:
uid=* & category_id=* & channel=* & version=*  4568 hits 45.37% ...
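The time keys in the output above look like yymmddHHMM for minutes and yymmddHH for hours. If data is stored per minute, coarser granularities can be produced by truncating the key and summing, as in this sketch. The nine-character ten-minute key, the granularity names, and the roll_up function are assumptions for illustration.

```python
# Number of leading characters of 'yymmddHHMM' kept for each granularity.
PREFIX_LENGTH = {"minute": 10, "ten_min": 9, "hour": 8, "day": 6}


def roll_up(minute_hits, granularity):
    """Sum per-minute hit counts into coarser buckets by truncating the key."""
    length = PREFIX_LENGTH[granularity]
    buckets = {}
    for minute, hits in minute_hits.items():
        bucket = minute[:length]
        buckets[bucket] = buckets.get(bucket, 0) + hits
    total = sum(buckets.values())
    # Each row: (bucket key, hits, share of all hits in percent).
    return [(bucket, hits, round(100.0 * hits / total, 2) if total else 0.0)
            for bucket, hits in sorted(buckets.items())]


minute_hits = {"1803091654": 3, "1803091655": 1, "1803091701": 4}
for row in roll_up(minute_hits, "hour"):
    print(row)
```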

Deployment

Add a cron job to run the analyser periodically, for example:

*/15 * * * * export LANG=zh_CN.UTF-8; python3 /home/ljk/log_analyse.py > /tmp/log_analyse.log

Note on Abstraction

uri_abs abstracts numeric path segments to *; args_abs replaces all argument values with *. Custom rules can be defined in analyse_config.py to suit specific log formats.
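The two default rules described above can be sketched in a few lines. This covers only what the article states (purely numeric path segments and argument values); the tool's real defaults and the custom rules in analyse_config.py may do more, and the function names here are assumptions.

```python
def abstract_uri(uri):
    """Replace purely numeric path segments with '*'."""
    return "/".join("*" if segment.isdigit() else segment
                    for segment in uri.split("/"))


def abstract_args(args):
    """Replace every argument value with '*', keeping names and their order."""
    abstracted = []
    for pair in args.split("&"):
        if not pair:
            continue  # skip empty pieces such as a trailing '&'
        name, separator, _ = pair.partition("=")
        # Arguments without '=' have no value to hide, so keep them unchanged.
        abstracted.append(name + "=*" if separator else name)
    return "&".join(abstracted)


print(abstract_uri("/sub/0/100414/4070"))          # /sub/*/*/*
print(abstract_args("channel=ios&version=1.4.5"))  # channel=*&version=*
```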
