How to Build an Automated Log‑Clustering Engine for Exception Monitoring
This article explains why monitoring abnormal code branches is crucial, outlines the challenges of log analysis, and proposes a log-clustering engine built around the Spell and DBSCAN algorithms. It then describes the engine's architecture, workflow, and implementation details, and highlights the benefits for system stability and operational efficiency.
Background
"Most accidents can be traced back to failures in regulation and protection mechanisms." — James Reason
Importance of Monitoring Abnormal Branches
An abnormal branch occurs when a program deviates from the normal control flow due to bad input, unexpected operations, or unavailable resources.
Monitoring abnormal branches in code is essential because:
It improves system stability by detecting unexpected execution paths that could cause crashes or functional errors.
It enables rapid problem localization and fixes, reducing downtime.
It supports preventive maintenance by identifying patterns that may indicate future issues.
How to Monitor Abnormal Branches
The main purpose of monitoring is to continuously observe and analyze targets, automatically triggering alerts when anomalies appear.
Example code snippets:
private void example1() {
    try {
        doMethod();
    } catch (BusinessException businessException) {
        // abnormal branch
    }
}

private void example2() {
    if (doMethod() == null) {
        // abnormal branch
    }
}

Current monitoring options and their drawbacks:
Prometheus metrics: not feasible because not all abnormal branches need metrics, excessive instrumentation burdens the monitoring platform, and defining metrics is time‑consuming.
Throwing exceptions: not suitable for all abnormal branches.
Logging: useful for post‑mortem analysis but cannot trigger real‑time alerts.
Conclusion: there is no effective method to monitor abnormal branches directly.
Introducing a Log‑Clustering Engine
Since error logs are already recorded for abnormal branches, enhancing log monitoring can indirectly monitor these branches.
Current manual log analysis is inefficient and unreliable. Effective log monitoring requires:
Accurate log classification to grasp the overall log landscape.
Time‑ordered storage of classified data for trend analysis.
Automatic alerts when new anomalous log categories appear or trends shift.
We plan to develop a "Log Clustering Engine" that classifies logs, automatically detects anomalies, and triggers alerts, thereby improving system monitoring and response speed.
Challenges
Determining Log Categories
How to decide which logs belong to which categories without manual labeling?
Correct Classification
Log messages are free-form and lack fixed patterns, so simple string grouping is ineffective; the sketch below illustrates the problem.
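A minimal illustration of why exact-string grouping fails: the timeout messages below differ only in a variable value, yet naive grouping treats each as its own category. The log lines here are hypothetical.

import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class NaiveGrouping {
    public static void main(String[] args) {
        List<String> logs = List.of(
                "Connection timeout after 30 seconds",
                "Connection timeout after 45 seconds",
                "Connection timeout after 12 seconds");

        // Exact-string grouping: every variable value creates its own "category".
        Map<String, Long> groups = logs.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // Prints three groups of size 1 instead of one group of size 3.
        groups.forEach((msg, count) -> System.out.println(count + "  " + msg));
    }
}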
Simple Integration Across Services
Different services produce diverse log formats; the tool must adapt automatically to new formats.
Benefits of Log Clustering
Fills the gap in abnormal‑branch monitoring.
Reduces manual log analysis costs.
Improves system stability through preventive maintenance and real‑time anomaly alerts.
Design Overview
Functional Breakdown
Scheduled trigger using XXL‑JOB.
Log retrieval via Kibana API.
Log clustering using selected algorithm.
Store templates and detailed logs in a database.
Dashboard for developers and ops to view clustering results and trends.
Send key metrics to Prometheus for monitoring and alerting (see the pipeline sketch after this list).
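The sketch below shows one way these steps could be wired together, assuming XXL-JOB 2.x's @XxlJob handler annotation. The four collaborator interfaces (LogFetcher, Clusterer, TemplateRepository, MetricsExporter) are hypothetical placeholders for the Kibana client, the Spell implementation, the database layer, and the Prometheus exporter.

import com.xxl.job.core.handler.annotation.XxlJob;
import java.util.List;
import java.util.Map;

// Hypothetical skeleton of the engine's scheduled entry point.
public class LogClusteringJob {

    interface LogFetcher { List<String> fetchSinceLastRun(); }          // Kibana API
    interface Clusterer { Map<String, Long> cluster(List<String> l); }  // template -> count
    interface TemplateRepository { void save(Map<String, Long> c); }    // database
    interface MetricsExporter { void export(Map<String, Long> c); }     // Prometheus

    private final LogFetcher fetcher;
    private final Clusterer clusterer;
    private final TemplateRepository repo;
    private final MetricsExporter exporter;

    public LogClusteringJob(LogFetcher fetcher, Clusterer clusterer,
                            TemplateRepository repo, MetricsExporter exporter) {
        this.fetcher = fetcher;
        this.clusterer = clusterer;
        this.repo = repo;
        this.exporter = exporter;
    }

    // Invoked on a fixed schedule by the XXL-JOB admin console.
    @XxlJob("logClusteringJobHandler")
    public void run() {
        List<String> rawLogs = fetcher.fetchSinceLastRun();      // 1. retrieve logs
        Map<String, Long> clusters = clusterer.cluster(rawLogs); // 2. cluster into templates
        repo.save(clusters);                                     // 3. persist for the dashboard
        exporter.export(clusters);                               // 4. expose counts to Prometheus
    }
}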
Algorithm Selection
We evaluated Spell, K-means, hierarchical clustering, and DBSCAN. Spell is tailored to log data, extracts templates automatically, and needs no preset number of clusters. DBSCAN handles clusters of arbitrary shape but is sensitive to its parameters; K-means requires a predefined cluster count; and hierarchical clustering is computationally heavy.
Initial experiments showed Spell outperforms DBSCAN in clustering quality, maintenance cost, and template extraction, so we chose Spell.
Spell Algorithm Overview
Spell parses logs, extracts constant and variable parts, and generates log templates using the Longest Common Subsequence (LCS) algorithm based on dynamic programming.
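As a concrete illustration, here is a simplified sketch of that merge step: compute the LCS of two token sequences with dynamic programming, keep the shared tokens as constants, and collapse everything else into a <*> wildcard. This is a sketch of the idea, not the full Spell implementation.

import java.util.ArrayList;
import java.util.List;

public class SpellMerge {

    // Merge two tokenized log lines into a template via LCS.
    static List<String> merge(List<String> a, List<String> b) {
        // Standard dynamic-programming LCS table.
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                dp[i][j] = a.get(i - 1).equals(b.get(j - 1))
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        // Backtrack: shared tokens stay; any divergence collapses to one <*>.
        List<String> template = new ArrayList<>();
        int i = a.size(), j = b.size();
        while (i > 0 && j > 0) {
            if (a.get(i - 1).equals(b.get(j - 1))) {
                template.add(0, a.get(i - 1));
                i--; j--;
            } else {
                if (template.isEmpty() || !template.get(0).equals("<*>")) {
                    template.add(0, "<*>");
                }
                if (dp[i - 1][j] >= dp[i][j - 1]) i--; else j--;
            }
        }
        if ((i > 0 || j > 0) && (template.isEmpty() || !template.get(0).equals("<*>"))) {
            template.add(0, "<*>");
        }
        return template;
    }

    public static void main(String[] args) {
        List<String> a = List.of("Deprecated", "function", "called", "on", "line", "111");
        List<String> b = List.of("Deprecated", "function", "called", "on", "line", "42");
        // Prints: [Deprecated, function, called, on, line, <*>]
        System.out.println(merge(a, b));
    }
}

In Spell proper, a new line is merged into an existing template only when the LCS length is at least half the sequence length; otherwise it seeds a new cluster.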
Features
Automatic Log Clustering
Example raw logs:
Error: Failed to load resource: the server responded with a status of 404 (Not Found)
Warning: Deprecated function called in /var/www/html/app.php on line 111
Error: Connection timeout while connecting to the database
... (additional log lines)
Clustered result (size indicates occurrence count):
size:3 Error: Failed to load resource: the server responded with a status of 404 (Not Found)
size:3 Error: Connection timeout while connecting to the database
size:3 Warning: Deprecated function called in /var/www/html/app.php on line <*>
size:2 Fatal: Maximum execution time of <*> seconds exceeded in /var/www/html/script.php
size:3 Notice: Undefined index: name in /var/www/html/index.php on line <*>
size:3 Warning: Invalid argument supplied for foreach() in /var/www/html/loop.php
size:3 Fatal: Out of memory (allocated <*>) (tried to allocate <*> bytes) in /var/www/html/memory.php
Data Visualization
Clustering results and per-template trends are presented on a dashboard for developers and ops to review.
Self‑Generated Alerts
Cluster templates are exported as metrics to Prometheus, enabling trend‑based alerts.
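A minimal sketch of that export, assuming the Prometheus Java simpleclient; the metric and label names are hypothetical. Each template becomes one labeled series whose value is its occurrence count in the latest run, so alert rules can fire when a new series appears or an existing one trends upward.

import io.prometheus.client.Gauge;

public class ClusterMetrics {

    // One labeled series per log template; value is the latest occurrence count.
    private static final Gauge CLUSTER_SIZE = Gauge.build()
            .name("log_cluster_size")
            .help("Occurrences per log template in the last clustering run")
            .labelNames("template")
            .register();

    public static void record(String template, long count) {
        CLUSTER_SIZE.labels(template).set(count);
    }
}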
Advantages
Efficient log parsing without predefined templates.
Self‑adapting to new log types without manual intervention.
Low resource consumption compared to heavyweight machine‑learning solutions.
Easy integration for new services, reducing onboarding effort.
"You cannot manage what you do not measure." — Peter Drucker