Operations 11 min read

How Intelligent Log Classification and Anomaly Detection Slash Fault‑Diagnosis Time

This article explains a two‑stage streaming log classification system that uses a prefix‑tree and longest‑common‑subsequence matching to generate log templates, handles traceback logs and cold‑start issues, and combines statistical unsupervised methods to detect anomalies while reducing false alarms.

dbaplus Community

Jun 5, 2022

How Intelligent Log Classification and Anomaly Detection Slash Fault‑Diagnosis Time

Background

As systems become more complex, the volume of generated logs grows dramatically, making manual fault isolation costly. Diverse log formats, massive log volume, and noisy error messages hinder effective troubleshooting.

Intelligent Log Classification

Design Overview

The initial service extracted log templates with the Drain algorithm. To improve accuracy, a two‑level model was introduced: a first‑stage pre‑classification followed by a second‑stage refinement that merges un‑classified logs into final categories.

1. First‑stage Classification

A modified prefix‑tree (trie) is used. Logs are tokenised by spaces, then routed by token length (first layer) and by token values (subsequent layers). When a leaf reaches a threshold, a wildcard <*> node is added to catch unmatched logs. The tree depth is deliberately shallow for fast matching.

Pre‑process: split each log line into tokens using delimiters/spaces.

Navigate to the length‑layer node matching the token count.

Iteratively split nodes according to token order, limited by depth‑2 (excluding root and length layers).

At leaf nodes, compute similarity simSeq against existing templates; return the highest‑scoring template above threshold st.

Update the parse tree: replace differing tokens with <*> or add a new template if none matches.

2. Second‑stage Classification

The second stage applies a longest common subsequence (LCS) matching algorithm to merge templates produced by the first stage, improving clustering quality. To mitigate LCS’s high computational cost, two pre‑matching filters—another prefix‑tree and a simple loop—are applied before full LCS.

Traceback Log Handling

Unlike regular logs, traceback logs have a line‑oriented structure. Each line is treated as a token, allowing similar stack traces to be grouped together for easier investigation.

Cold‑Start Problem

The classification algorithm naturally extracts constant parts of logs and replaces variable parts with <*>. However, feeding raw logs leads to slow template convergence. By pre‑replacing numbers, base64 strings, and encoded addresses with <*> using regular expressions, convergence time drops from a week to near‑instant after deployment.

Log Anomaly Detection

Design Overview

An unsupervised, statistical template‑based anomaly detector runs on top of the classification service. It requires no manual labeling, is computationally cheap, and provides interpretable results.

Algorithm Steps (1‑minute granularity)

Collect per‑machine log‑template counts and their historical volumes.

Aggregate counts across templates using a chi‑square distribution.

Perform a first‑level anomaly check on the aggregated metric; if normal, return “OK”.

If abnormal, drill down to individual templates, compute anomaly scores, and return the top‑5 anomalous templates.

Emit alerts when anomalies are detected, otherwise report normal.

Two statistical methods are combined: a 3‑sigma rule on the chi‑square‑based global metric and a box‑plot‑based score on individual templates. Their scores are fused into a final anomaly score.

Reducing Periodic False Alarms

Historical anomalies from the same time of day (yesterday or last week) are smoothed with a rolling‑max window and subtracted from the current score, suppressing recurring false positives while accounting for possible time shifts.

New Log Category Alerts

When a surge of new log templates appears, the system adapts: if the daily count of new categories exceeds a threshold, alerts are suppressed to avoid noise; otherwise, a warning is issued.

Practical Results

The anomaly detector highlights the offending template when an anomaly is found.

Log count dashboards show a clear spike at the moment of the detected anomaly, confirming the system’s effectiveness.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Operations Anomaly Detection log analysis Prefix Tree stream clustering LCS

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.