How to Build an Automated Log‑Clustering Engine for Exception Monitoring
This article explains why monitoring abnormal code branches is crucial, outlines the challenges of log analysis, and proposes a log-clustering engine built around the Spell and DBSCAN algorithms. It then describes the engine's architecture, workflow, and implementation details, and highlights the benefits for system stability and operational efficiency.
Background
"Most accidents can be traced back to failures in regulation and protection mechanisms." — James Reason
Importance of Monitoring Abnormal Branches
An abnormal branch occurs when a program deviates from the normal control flow due to bad input, unexpected operations, or unavailable resources.
Monitoring abnormal branches in code is essential because:
It improves system stability by detecting unexpected execution paths that could cause crashes or functional errors.
It enables rapid problem localization and fixes, reducing downtime.
It supports preventive maintenance by identifying patterns that may indicate future issues.
How to Monitor Abnormal Branches
The main purpose of monitoring is to continuously observe and analyze targets, automatically triggering alerts when anomalies appear.
Example code snippets:
private void example1() {
    try {
        doMethod();
    } catch (BusinessException businessException) {
        // abnormal branch
    }
}

private void example2() {
    if (doMethod() == null) {
        // abnormal branch
    }
}

Current monitoring options and their drawbacks:
Prometheus metrics: not feasible because not all abnormal branches need metrics, excessive instrumentation burdens the monitoring platform, and defining metrics is time‑consuming.
Throwing exceptions: not suitable for all abnormal branches.
Logging: useful for post‑mortem analysis but cannot trigger real‑time alerts.
Conclusion: there is no effective method to monitor abnormal branches directly.
Introducing a Log‑Clustering Engine
Since error logs are already recorded for abnormal branches, enhancing log monitoring can indirectly monitor these branches.
Current manual log analysis is inefficient and unreliable. Effective log monitoring requires:
Accurate log classification to grasp the overall log landscape.
Time‑ordered storage of classified data for trend analysis.
Automatic alerts when new anomalous log categories appear or trends shift.
We plan to develop a "Log Clustering Engine" that classifies logs, automatically detects anomalies, and triggers alerts, thereby improving system monitoring and response speed.
Challenges
Determining Log Categories
How to decide which logs belong to which categories without manual labeling?
Correct Classification
Log messages are free-form and lack fixed patterns, so simple string grouping is ineffective; the sketch below illustrates the problem.
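A minimal illustration of why exact-string grouping fails: the timeout messages below differ only in a variable value, yet naive grouping treats each as its own category. The log lines here are hypothetical.

import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class NaiveGrouping {
    public static void main(String[] args) {
        List<String> logs = List.of(
                "Connection timeout after 30 seconds",
                "Connection timeout after 45 seconds",
                "Connection timeout after 12 seconds");

        // Exact-string grouping: every variable value creates its own "category".
        Map<String, Long> groups = logs.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // Prints three groups of size 1 instead of one group of size 3.
        groups.forEach((msg, count) -> System.out.println(count + "  " + msg));
    }
}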
Simple Integration Across Services
Different services produce diverse log formats; the tool must adapt automatically to new formats.
Benefits of Log Clustering
Fills the gap in abnormal‑branch monitoring.
Reduces manual log analysis costs.
Improves system stability through preventive maintenance and real‑time anomaly alerts.
Design Overview
Functional Breakdown
Scheduled trigger using XXL‑JOB.
Log retrieval via Kibana API.
Log clustering using selected algorithm.
Store templates and detailed logs in a database.
Dashboard for developers and ops to view clustering results and trends.
Send key metrics to Prometheus for monitoring and alerting (see the pipeline sketch after this list).
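The sketch below shows one way these steps could be wired together, assuming XXL-JOB 2.x's @XxlJob handler annotation. The four collaborator interfaces (LogFetcher, Clusterer, TemplateRepository, MetricsExporter) are hypothetical placeholders for the Kibana client, the Spell implementation, the database layer, and the Prometheus exporter.

import com.xxl.job.core.handler.annotation.XxlJob;
import java.util.List;
import java.util.Map;

// Hypothetical skeleton of the engine's scheduled entry point.
public class LogClusteringJob {

    interface LogFetcher { List<String> fetchSinceLastRun(); }          // Kibana API
    interface Clusterer { Map<String, Long> cluster(List<String> l); }  // template -> count
    interface TemplateRepository { void save(Map<String, Long> c); }    // database
    interface MetricsExporter { void export(Map<String, Long> c); }     // Prometheus

    private final LogFetcher fetcher;
    private final Clusterer clusterer;
    private final TemplateRepository repo;
    private final MetricsExporter exporter;

    public LogClusteringJob(LogFetcher fetcher, Clusterer clusterer,
                            TemplateRepository repo, MetricsExporter exporter) {
        this.fetcher = fetcher;
        this.clusterer = clusterer;
        this.repo = repo;
        this.exporter = exporter;
    }

    // Invoked on a fixed schedule by the XXL-JOB admin console.
    @XxlJob("logClusteringJobHandler")
    public void run() {
        List<String> rawLogs = fetcher.fetchSinceLastRun();      // 1. retrieve logs
        Map<String, Long> clusters = clusterer.cluster(rawLogs); // 2. cluster into templates
        repo.save(clusters);                                     // 3. persist for the dashboard
        exporter.export(clusters);                               // 4. expose counts to Prometheus
    }
}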
Algorithm Selection
We evaluated Spell, K-means, hierarchical clustering, and DBSCAN. Spell is tailored to log data, extracts templates automatically, and needs no preset number of clusters. DBSCAN handles clusters of arbitrary shape but is sensitive to its parameters; K-means requires a predefined cluster count; and hierarchical clustering is computationally heavy.
Initial experiments showed Spell outperforms DBSCAN in clustering quality, maintenance cost, and template extraction, so we chose Spell.
Spell Algorithm Overview
Spell parses logs, extracts constant and variable parts, and generates log templates using the Longest Common Subsequence (LCS) algorithm based on dynamic programming.
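As a concrete illustration, here is a simplified sketch of that merge step: compute the LCS of two token sequences with dynamic programming, keep the shared tokens as constants, and collapse everything else into a <*> wildcard. This is a sketch of the idea, not the full Spell implementation.

import java.util.ArrayList;
import java.util.List;

public class SpellMerge {

    // Merge two tokenized log lines into a template via LCS.
    static List<String> merge(List<String> a, List<String> b) {
        // Standard dynamic-programming LCS table.
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                dp[i][j] = a.get(i - 1).equals(b.get(j - 1))
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        // Backtrack: shared tokens stay; any divergence collapses to one <*>.
        List<String> template = new ArrayList<>();
        int i = a.size(), j = b.size();
        while (i > 0 && j > 0) {
            if (a.get(i - 1).equals(b.get(j - 1))) {
                template.add(0, a.get(i - 1));
                i--; j--;
            } else {
                if (template.isEmpty() || !template.get(0).equals("<*>")) {
                    template.add(0, "<*>");
                }
                if (dp[i - 1][j] >= dp[i][j - 1]) i--; else j--;
            }
        }
        if ((i > 0 || j > 0) && (template.isEmpty() || !template.get(0).equals("<*>"))) {
            template.add(0, "<*>");
        }
        return template;
    }

    public static void main(String[] args) {
        List<String> a = List.of("Deprecated", "function", "called", "on", "line", "111");
        List<String> b = List.of("Deprecated", "function", "called", "on", "line", "42");
        // Prints: [Deprecated, function, called, on, line, <*>]
        System.out.println(merge(a, b));
    }
}

In Spell proper, a new line is merged into an existing template only when the LCS length is at least half the sequence length; otherwise it seeds a new cluster.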
Features
Automatic Log Clustering
Example raw logs:
Error: Failed to load resource: the server responded with a status of 404 (Not Found)
Warning: Deprecated function called in /var/www/html/app.php on line 111
Error: Connection timeout while connecting to the database
... (additional log lines)
Clustered result (size indicates occurrence count):
size:3 Error: Failed to load resource: the server responded with a status of 404 (Not Found)
size:3 Error: Connection timeout while connecting to the database
size:3 Warning: Deprecated function called in /var/www/html/app.php on line <*>
size:2 Fatal: Maximum execution time of <*> seconds exceeded in /var/www/html/script.php
size:3 Notice: Undefined index: name in /var/www/html/index.php on line <*>
size:3 Warning: Invalid argument supplied for foreach() in /var/www/html/loop.php
size:3 Fatal: Out of memory (allocated <*>) (tried to allocate <*> bytes) in /var/www/html/memory.php
Data Visualization
Clustering results and per-template trends are presented on a dashboard for developers and ops to review.
Self‑Generated Alerts
Cluster templates are exported as metrics to Prometheus, enabling trend‑based alerts.
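A minimal sketch of that export, assuming the Prometheus Java simpleclient; the metric and label names are hypothetical. Each template becomes one labeled series whose value is its occurrence count in the latest run, so alert rules can fire when a new series appears or an existing one trends upward.

import io.prometheus.client.Gauge;

public class ClusterMetrics {

    // One labeled series per log template; value is the latest occurrence count.
    private static final Gauge CLUSTER_SIZE = Gauge.build()
            .name("log_cluster_size")
            .help("Occurrences per log template in the last clustering run")
            .labelNames("template")
            .register();

    public static void record(String template, long count) {
        CLUSTER_SIZE.labels(template).set(count);
    }
}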
Advantages
Efficient log parsing without predefined templates.
Self‑adapting to new log types without manual intervention.
Low resource consumption compared to heavyweight machine‑learning solutions.
Easy integration for new services, reducing onboarding effort.
"You cannot manage what you do not measure." — Peter Drucker