How to Build a Fast, Accurate Log Analyzer with Python and MySQL
This article explains how to create a lightweight yet reliable Python‑based log analysis tool that parses nginx logs with regular expressions, stores detailed metrics in MySQL, and provides fine‑grained performance and anomaly reports for web services.
Log analysis is crucial for troubleshooting and performance tuning in web systems. While the open‑source ELK stack is powerful, its deployment and learning costs are high, so the author implemented a simpler, accurate, and efficient Python solution focused on short‑term (three‑day or weekly) fine‑grained anomaly and performance analysis.
Pain points include handling massive traffic (≈50 million PV per day), detecting CDN back‑origin spikes, isolating abnormal URLs, and analyzing brief traffic fluctuations between servers, databases, or caches. The goal is to run analysis locally on each application server, aggregate results in MySQL, and achieve near‑real‑time detection.
Requirements are unified log paths, consistent log formats, and daily log rotation at midnight.
The author’s nginx log format is shown (image omitted). The analysis works by using Python’s re module to extract fields such as URI, arguments, timestamp, status code, response size, response time, client IP, CDN IP, and server name, then storing them in a database.
For other nginx or Apache formats, the same principle applies: adjust the regular expression and database schema accordingly.
Log records are parsed with a single regex; individual fields are accessed via log_pattern_obj.search(log).group('field_name'). Unusual or malformed lines are either tolerated with custom logic or discarded, returning empty strings and logging the problematic line.
To simplify parsing, defining a unique delimiter (e.g., "|") in the nginx log format can allow plain string splitting instead of regex.
Database usage relies on the pymysql package and a MySQL 5.6 instance with InnoDB Barracuda file format and compressed row format to reduce storage size by about 50%.
Sample SQL queries demonstrate how to retrieve daily/hourly PV, total URL counts (or counts within a time window), average response time rankings, and average response size rankings, using the uri_abs_crc32 and args_abs_crc32 columns for efficient grouping.
Performance testing on a 4‑core virtual machine shows the script’s execution efficiency (image omitted).
To schedule regular analysis, a cron job like
*/30 * * * * export LANG=zh_CN.UTF-8;python3 /root/log_analyse_parall.py > /tmp/log_analyse.py3is used.
The key factors for such a log‑analysis script are generality —the ability to adapt to different log formats with minimal changes—and execution efficiency , ensuring the analysis does not impact the production service.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
MaGe Linux Operations
Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
