
Mastering Web Log Analysis: Spotting Anomalies and Boosting Performance

This guide explains how to set up a Python‑based log‑analysis tool for web servers, defines key terminology, outlines installation requirements, describes its abstract‑based aggregation features, and demonstrates usage of the accompanying command‑line utility for request, IP, distribution, and detail analysis.

MaGe Linux Operations

Log Analysis

Log analysis is crucial for troubleshooting and performance analysis in web systems. Unlike typical PV/UV dashboards, this tool provides fine‑grained (minute‑level) abstraction and aggregation of logs for anomaly detection and performance insights.

Environment Installation

Python 3.4+

pymongo 3.4.0+

MongoDB server

Key Terminology

uri – the part of the request without parameters.

request_uri – the original request, with or without parameters.

args – the parameter part of the request.

These three terms follow nginx's definitions. Their abstracted forms are:

uri_abs – the abstracted uri (e.g., "/sub/*/*/*").

args_abs – the abstracted args (e.g., "channel=*&version=*").

Features

Provides a unified entry for log analysis across all servers of a site, with filtering by time range and server.

Supports analysis of request URI, IP, and response code, focusing on request count, response size, and response time.

Core idea: treat a class of uri, or the args that belong to it, as the dimension of analysis. To make this possible, each uri is abstracted into uri_abs and each args string into args_abs.

Default abstraction satisfies most needs; custom rules can be defined to abstract any request part.

URI analysis shows request volume, latency, and traffic distribution over various granularities (minute, ten‑minute, hour, day) and can drill down to specific uri_abs and its args.

IP analysis groups requests into three sources (cdn/proxy, reverse proxy, direct client) and ranks IPs by request count within each group.

Uses quartile statistics for more accurate description of response time and size, avoiding misleading arithmetic averages.

High performance: the script runs on each web server via cron; processing speed reaches 20 000–30 000 lines/s on typical hardware.
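The three-way IP grouping listed among the features can be approximated from the $remote_addr and $http_x_forwarded_for fields. The sketch below is a guess at one workable heuristic, not the tool's confirmed logic: the function names, the group labels, and the rule "private remote_addr means reverse proxy" are all assumptions made for illustration.

```python
import ipaddress
from collections import Counter


def classify_source(remote_addr, x_forwarded_for):
    """Return (source_group, client_ip) for one request.

    Assumed heuristic (hypothetical, for illustration only):
      * X-Forwarded-For empty or "-"        -> direct client
      * X-Forwarded-For set, private peer   -> reverse proxy
      * X-Forwarded-For set, public peer    -> cdn/proxy
    An unparseable remote_addr raises ValueError.
    """
    xff = x_forwarded_for.strip()
    if xff in ("", "-"):
        return "direct_client", remote_addr
    # By convention the first X-Forwarded-For entry is the original client.
    client_ip = xff.split(",")[0].strip()
    if ipaddress.ip_address(remote_addr).is_private:
        return "reverse_proxy", client_ip
    return "cdn_proxy", client_ip


def top_ips(requests, limit=5):
    """Rank client IPs by request count inside each source group.

    `requests` is an iterable of (remote_addr, x_forwarded_for) pairs.
    """
    groups = {}
    for remote_addr, xff in requests:
        group, ip = classify_source(remote_addr, xff)
        groups.setdefault(group, Counter())[ip] += 1
    return {group: counts.most_common(limit) for group, counts in groups.items()}
```

Note that X-Forwarded-For is supplied by the client side of the connection, so a real deployment would only trust it when the peer is a known proxy or CDN node.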

Implementation Idea

The analysis script log_analyse.py is deployed on each web server and scheduled with crontab. It uses Python's re module to parse logs and extracts fields such as uri, args, timestamp, status code, response size, response time, and server name, then stores the processed data in MongoDB. The entry script log_show.py reads the stored data and provides various sub‑commands for analysis.
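To make the "parse, then aggregate, then store" flow more concrete, here is a minimal sketch of the aggregation step only. It assumes records have already been parsed into dictionaries. The record keys and the shape of the output documents are hypothetical, and the MongoDB write (for example with pymongo's insert_many) is deliberately left out.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_by_minute(records):
    """Group parsed records by (minute, uri_abs).

    Each record is assumed to be a dict with the keys
    "time" (datetime), "uri_abs", "bytes" and "request_time".
    Returns one summary dict per (minute, uri_abs), sorted by that key.
    """
    buckets = defaultdict(lambda: {"hits": 0, "bytes": 0, "times": []})
    for rec in records:
        minute = rec["time"].strftime("%y%m%d%H%M")
        bucket = buckets[(minute, rec["uri_abs"])]
        bucket["hits"] += 1
        bucket["bytes"] += rec["bytes"]
        # Keep raw response times so quartiles can be computed later.
        bucket["times"].append(rec["request_time"])

    docs = []
    for (minute, uri_abs), stats in sorted(buckets.items()):
        docs.append({
            "minute": minute,
            "uri_abs": uri_abs,
            "hits": stats["hits"],
            "bytes": stats["bytes"],
            "times": stats["times"],
        })
    return docs
```

Aggregating on the web server before writing keeps the stored volume proportional to the number of distinct (minute, uri_abs) pairs rather than to the number of log lines.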

Prerequisites

All servers store log files in a unified path.

Log format and naming rules are consistent (e.g., xxx.access.log).

Logs are rotated daily at midnight.

The log format determines the regular expressions used in analyse_config.py. Example nginx log format:

log_format access '$remote_addr - [$time_local] "$request" '
               '$status $body_bytes_sent $request_time "$http_referer" '
               '"$http_user_agent" - $http_x_forwarded_for';

For other nginx or Apache formats, adjust the regex accordingly.
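As an illustration of how the log format maps to a regular expression, the sketch below parses lines written in the example format above. It is not the pattern shipped in analyse_config.py: the names LOG_PATTERN and parse_line, and the decision to return None for lines that do not fit, are assumptions made for this example.

```python
import re
from datetime import datetime

# One named group per variable in the example log_format.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'(?P<request_time>\d+(?:\.\d+)?) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)" - '
    r'(?P<http_x_forwarded_for>.*)$'
)


def parse_line(line):
    """Parse one access-log line; return a dict, or None if it does not fit."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    rec = match.groupdict()

    # "$request" is normally "<method> <request_uri> <protocol>".
    parts = rec["request"].split()
    if len(parts) != 3:
        return None
    method, request_uri, protocol = parts
    uri, _, args = request_uri.partition("?")

    try:
        # $time_local looks like 14/Mar/2018:10:05:01 +0800
        when = datetime.strptime(rec["time_local"], "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None

    rec.update(
        method=method,
        request_uri=request_uri,
        protocol=protocol,
        uri=uri,
        args=args,
        time=when,
        status=int(rec["status"]),
        body_bytes_sent=int(rec["body_bytes_sent"]),
        request_time=float(rec["request_time"]),
    )
    return rec
```

The %b directive depends on the process locale, so month names parse as expected only while Python's LC_TIME is left at its default.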

Handling Abnormal Logs

Because logs can be irregular, the tool prefers the re module over simple split(). Tolerable anomalies are processed with custom logic; intolerable ones are discarded and saved to a separate file. Defining a unique delimiter (e.g., "|") in nginx can simplify parsing.
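If the nginx format is changed to use a "|" delimiter, parsing can fall back to a plain split plus a sanity check, with anything that fails the check set aside. The sketch below shows that idea under stated assumptions: the field order in FIELDS, the function name, and the reject-file handling are all hypothetical.

```python
# Hypothetical field order for a "|"-delimited log_format.
FIELDS = ("remote_addr", "time_local", "request", "status",
          "body_bytes_sent", "request_time")


def split_lines(lines, reject_path=None):
    """Split "|"-delimited lines into dicts and collect lines that do not fit.

    Returns (parsed, rejected). If reject_path is given, rejected lines
    are also appended to that file for later inspection.
    """
    parsed, rejected = [], []
    for line in lines:
        parts = line.rstrip("\n").split("|")
        # Accept only lines with the expected field count and a numeric status.
        if len(parts) == len(FIELDS) and parts[3].isdigit():
            parsed.append(dict(zip(FIELDS, parts)))
        else:
            rejected.append(line)

    if reject_path and rejected:
        with open(reject_path, "a", encoding="utf-8") as handle:
            for line in rejected:
                handle.write(line if line.endswith("\n") else line + "\n")
    return parsed, rejected
```

A literal "|" inside a request or user agent would still change the field count, which is why such lines are routed to the reject list here rather than guessed at.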

log_show.py Usage

Help:

log_show --help
Usage: log_show <site_name> [options] request [distribution <request>|detail <uri>]
       log_show <site_name> [options] ip [distribution <ip>|detail <ip>]
       log_show <site_name> [options] error [distribution <error_code>|detail <error_code>]
Options:
  -h --help           Show this screen.
  -f --from <start>   Start time (format: %y%m%d[%H[%M]]).
  -t --to <end>       End time (same format as --from).
  -l --limit <num>    Limit number of output lines (0 = no limit, default 5).
  -s --server <srv>   Specify server hostname.
  -g --group_by <g>   Group by minute, ten_min, hour, or day (default hour).
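
The --from and --to values use the format %y%m%d[%H[%M]], so the hour and minute are optional. One way such a value could be turned into a datetime is sketched below. The function name parse_time_option is hypothetical, and this is not taken from log_show.py itself.

```python
from datetime import datetime

# Accepted lengths: yymmdd, yymmddHH, yymmddHHMM.
_FORMATS = {6: "%y%m%d", 8: "%y%m%d%H", 10: "%y%m%d%H%M"}


def parse_time_option(value):
    """Convert a --from/--to style string into a datetime."""
    fmt = _FORMATS.get(len(value))
    if fmt is None or not value.isdigit():
        raise ValueError("expected %y%m%d[%H[%M]], got {!r}".format(value))
    return datetime.strptime(value, fmt)
```

With this reading, "180314" means midnight on 14 March 2018 and "1803141005" means 10:05 on the same day.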

Sub‑commands:

request – analyze request URIs for the specified site, using the data stored so far today by default.

ip – analyze logs grouped by IP source (cdn/proxy, reverse proxy, direct client).

distribution – aggregate metrics over a chosen time granularity for all requests or a specific URI.

detail – drill down into a specific uri_abs to see metrics per args_abs, or into an IP to see its request distribution across uri_abs.
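
The distribution sub-command depends on mapping each timestamp to the start of its minute, ten-minute, hour, or day bucket. A minimal sketch of that mapping follows. The function names are hypothetical, and the real tool may well perform this grouping inside MongoDB instead.

```python
from collections import Counter
from datetime import datetime


def bucket_start(ts, group_by="hour"):
    """Truncate a datetime to the start of its --group_by bucket."""
    if group_by == "minute":
        return ts.replace(second=0, microsecond=0)
    if group_by == "ten_min":
        return ts.replace(minute=ts.minute - ts.minute % 10,
                          second=0, microsecond=0)
    if group_by == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if group_by == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError("unknown group_by value: {!r}".format(group_by))


def distribution(timestamps, group_by="hour"):
    """Count requests per bucket, returned in chronological order."""
    counts = Counter(bucket_start(ts, group_by) for ts in timestamps)
    return sorted(counts.items())
```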

Examples (abbreviated):

# Request distribution for "/view/*/*.json" grouped by minute
log_show api request distribution "/view/*/*.json" -g minute

# IP distribution for IP "140.206.109.174" over a day
log_show api ip -t 180314 distribution "140.206.109.174" -l 0

# Detail analysis for a specific URI
log_show api request detail "/recommend/batchUpdate" -l 3

These commands output total hits, bytes, time, and percentile distributions, helping identify abnormal URIs or IPs.
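
To see why quartiles describe response time better than an arithmetic average, consider a handful of requests with one slow outlier. The sketch below uses statistics.quantiles, which requires Python 3.8 or later (newer than the 3.4 minimum listed above). The summarize helper is hypothetical and is not the tool's own code.

```python
import statistics


def summarize(values):
    """Return min, quartiles, max and mean for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
    }


# Response times in milliseconds; a single 9-second request skews the mean.
times_ms = [100, 100, 100, 200, 200, 200, 300, 300, 9000]
summary = summarize(times_ms)
```

Here the median is 200 ms and the third quartile is 300 ms, while the mean exceeds 1100 ms because of the single outlier. The quartiles show that most requests are fast, and the max isolates the anomaly.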

Deployment of log_analyse.py

Add a cron entry on each web server, e.g.:

*/15 * * * * export LANG=zh_CN.UTF-8; python3 /home/ljk/log_analyse.py > /tmp/log_analyse.log 2>&1

Note

The default abstraction rules replace numeric path segments with "*" in uri_abs and replace all argument values with "*" in args_abs. Custom rules can be added in common/common.py.
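
The default rules described in this note can be sketched in a few lines. This is an illustration, not the code in common/common.py: the function names are hypothetical, and treating a numeric file stem such as "456.json" as "*.json" is an assumption made so that the output matches the "/view/*/*.json" pattern used in the examples.

```python
import re

# A run of digits that makes up a whole path segment, or its stem before ".".
_NUMERIC_STEM = re.compile(r"^\d+(?=$|\.)")


def abstract_uri(uri):
    """Replace numeric path segments with "*" to build uri_abs."""
    return "/".join(_NUMERIC_STEM.sub("*", seg) for seg in uri.split("/"))


def abstract_args(args):
    """Replace every argument value with "*" to build args_abs."""
    if not args:
        return ""
    abstracted = []
    for pair in args.split("&"):
        key, sep, _value = pair.partition("=")
        # Keep bare keys (no "=") untouched.
        abstracted.append(key + "=*" if sep else key)
    return "&".join(abstracted)
```

With these rules, "/view/123/456.json" becomes "/view/*/*.json" and "channel=ios&version=2" becomes "channel=*&version=*", while a path without numeric segments, such as "/recommend/batchUpdate", is left unchanged.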

Author: jkklee – 6 years of operations experience, focusing on turning operational knowledge into reusable tools.
