How Intelligent Log Classification and Anomaly Detection Slash Fault‑Diagnosis Time
This article describes a two‑stage streaming log classification system that generates log templates with a prefix tree followed by longest‑common‑subsequence (LCS) matching, and explains how it handles traceback logs and the cold‑start problem. It then covers an unsupervised statistical anomaly detector built on those templates and the measures used to reduce false alarms.
Background
As systems become more complex, the volume of generated logs grows dramatically, making manual fault isolation costly. Diverse log formats, massive log volume, and noisy error messages hinder effective troubleshooting.
Intelligent Log Classification
Design Overview
The initial service extracted log templates with the Drain algorithm. To improve accuracy, a two‑level model was introduced: a first‑stage pre‑classification followed by a second‑stage refinement that merges the first‑stage results into final categories.
1. First‑stage Classification
A modified prefix tree (trie) is used. Logs are tokenised by spaces, then routed by token count (first layer) and by token values (subsequent layers). When the number of branches under a node reaches a threshold, a wildcard <*> node is added to catch logs that match none of the existing branches. The tree depth is deliberately shallow for fast matching.
Pre‑process: split each log line into tokens using delimiters/spaces.
Navigate to the length‑layer node matching the token count.
Descend through the token layers one token at a time, in token order, for at most depth − 2 layers (the root and length layers are not counted).
At leaf nodes, compute similarity simSeq against existing templates; return the highest‑scoring template above threshold st.
Update the parse tree: replace differing tokens with <*> or add a new template if none matches.
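The five steps above can be sketched roughly as follows. This is a minimal Drain-style illustration, not the service's actual code: the class name PrefixTreeClassifier, the max_children parameter, and the definition of sim_seq as the fraction of positions with identical tokens are all assumptions made for the example.

```python
from dataclasses import dataclass, field

WILDCARD = "<*>"

@dataclass
class Template:
    tokens: list
    count: int = 1

@dataclass
class Node:
    children: dict = field(default_factory=dict)   # key -> Node
    templates: list = field(default_factory=list)  # only filled at leaves

def sim_seq(template, tokens):
    """Assumed similarity: fraction of positions where tokens are identical."""
    same = sum(1 for a, b in zip(template, tokens) if a == b)
    return same / len(tokens) if tokens else 1.0

class PrefixTreeClassifier:
    def __init__(self, depth=4, st=0.5, max_children=100):
        self.token_layers = max(depth - 2, 1)  # root and length layers excluded
        self.st = st
        self.max_children = max_children       # hypothetical branch limit
        self.root = Node()

    def _leaf_for(self, tokens):
        # Length layer: route by token count.
        node = self.root.children.setdefault(len(tokens), Node())
        # Token layers: route by the leading tokens, in order.
        for tok in tokens[: self.token_layers]:
            if tok not in node.children:
                # A node that already has too many branches sends
                # unseen tokens to its wildcard child instead.
                if len(node.children) >= self.max_children:
                    tok = WILDCARD
                node.children.setdefault(tok, Node())
            node = node.children[tok]
        return node

    def classify(self, line):
        tokens = line.split()
        leaf = self._leaf_for(tokens)
        best = max(leaf.templates,
                   key=lambda t: sim_seq(t.tokens, tokens), default=None)
        if best is not None and sim_seq(best.tokens, tokens) >= self.st:
            # Update: positions that differ become wildcards.
            best.tokens = [a if a == b else WILDCARD
                           for a, b in zip(best.tokens, tokens)]
            best.count += 1
            return best
        new = Template(list(tokens))
        leaf.templates.append(new)
        return new
```

Feeding it "Connection from 10.0.0.1 closed" and then "Connection from 10.0.0.2 closed" places both lines in one template whose third token has become <*>.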
2. Second‑stage Classification
The second stage applies a longest common subsequence (LCS) matching algorithm to merge templates produced by the first stage, improving clustering quality. Because LCS is computationally expensive, two cheaper pre‑matching filters, another prefix tree and a simple loop, are tried first, and full LCS runs only when they do not settle the match.
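A rough sketch of LCS-based merging with one cheap pre-filter is shown here. The article does not detail its filters, so the multiset-overlap bound used in place of the "simple loop", the ratio threshold, and the function names are assumptions for illustration only. The prefix-tree filter is omitted.

```python
from collections import Counter

WILDCARD = "<*>"

def lcs_tokens(a, b):
    """Longest common subsequence of two token lists (O(len(a)*len(b)) DP)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack from the bottom-right corner to recover the tokens.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

def overlap_bound(a, b):
    """Cheap upper bound on LCS length: size of the multiset intersection."""
    return sum((Counter(a) & Counter(b)).values())

def _push(merged, tok):
    # Collapse consecutive wildcards into one.
    if tok == WILDCARD and merged and merged[-1] == WILDCARD:
        return
    merged.append(tok)

def try_merge(a, b, ratio=0.5):
    """Return a merged template, or None if a and b are too dissimilar."""
    need = ratio * max(len(a), len(b))
    if overlap_bound(a, b) < need:   # filter: LCS cannot reach the threshold
        return None
    common = lcs_tokens(a, b)
    if len(common) < need:
        return None
    merged, i, j = [], 0, 0
    for tok in common:
        gap = False
        while a[i] != tok:           # tokens skipped in a
            i, gap = i + 1, True
        while b[j] != tok:           # tokens skipped in b
            j, gap = j + 1, True
        if gap:
            _push(merged, WILDCARD)
        _push(merged, tok)
        i, j = i + 1, j + 1
    if i < len(a) or j < len(b):     # trailing tokens not in the LCS
        _push(merged, WILDCARD)
    return merged
```

Because the overlap bound can never be smaller than the true LCS length, pairs rejected by the filter would also have been rejected by the full computation, so the filter saves time without changing results.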
Traceback Log Handling
Unlike regular logs, traceback logs have a line‑oriented structure. Each line is treated as a token, allowing similar stack traces to be grouped together for easier investigation.
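As an illustration of line-level tokenisation, the sketch below treats each non-empty line of a traceback as one token. Detecting tracebacks by the Python header "Traceback (most recent call last):" and the helper names are assumptions. The article does not say how tracebacks are recognised or which languages are covered.

```python
TRACEBACK_HEADER = "Traceback (most recent call last):"  # assumed marker

def tokenize(log):
    """Line-level tokens for multi-line tracebacks, word-level otherwise."""
    if "\n" in log and log.lstrip().startswith(TRACEBACK_HEADER):
        return [line.strip() for line in log.splitlines() if line.strip()]
    return log.split()

def same_position_ratio(a, b):
    """Fraction of identical tokens at the same position (0 if lengths differ)."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

tb1 = ("Traceback (most recent call last):\n"
       '  File "app.py", line 10, in handle\n'
       "    return load(key)\n"
       "KeyError: 'user_17'")
tb2 = tb1.replace("user_17", "user_42")

# Three of the four line-tokens are identical, so the two stack traces
# look alike to a token-based matcher even though the key differs.
ratio = same_position_ratio(tokenize(tb1), tokenize(tb2))
```

With word-level tokens the same two traces would be compared across many more positions and the file path, line number, and code text would each count separately. Line-level tokens keep the comparison at the granularity of stack frames.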
Cold‑Start Problem
The classification algorithm naturally extracts constant parts of logs and replaces variable parts with <*>. However, feeding raw logs leads to slow template convergence. By pre‑replacing numbers, base64 strings, and encoded addresses with <*> using regular expressions, convergence time drops from a week to near‑instant after deployment.
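The pre-replacement step might look like the sketch below. The concrete patterns are assumptions, since the article names only the categories (numbers, base64 strings, encoded addresses). Here "encoded addresses" is read as hex and IPv4 addresses, and the 20-character minimum for base64-like runs is an arbitrary choice.

```python
import re

WILDCARD = "<*>"

# Order matters: the more specific patterns run before the generic number rule.
MASKS = [
    # Hex addresses such as 0x7ffee4 (assumed reading of "encoded addresses").
    re.compile(r"\b0x[0-9a-fA-F]+\b"),
    # IPv4 addresses with an optional port.
    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?::\d+)?\b"),
    # Base64-like runs of 20 or more characters with optional padding.
    re.compile(r"(?<![A-Za-z0-9+/=])[A-Za-z0-9+/]{20,}={0,2}(?![A-Za-z0-9+/=])"),
    # Standalone numbers (not glued to letters, so "node3" is left alone).
    re.compile(r"(?<![A-Za-z_])\d+(?:\.\d+)?(?![A-Za-z_])"),
]

def mask(line):
    """Replace obviously variable fields with <*> before classification."""
    for pattern in MASKS:
        line = pattern.sub(WILDCARD, line)
    return line
```

For example, mask("user 1024 logged in from 10.0.0.7:8080") yields "user <*> logged in from <*>". One caveat of this sketch is that the base64 rule also matches long slash-separated paths, so real deployments would likely need stricter patterns tuned to their own logs.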
Log Anomaly Detection
Design Overview
An unsupervised, statistical template‑based anomaly detector runs on top of the classification service. It requires no manual labeling, is computationally cheap, and provides interpretable results.
Algorithm Steps (1‑minute granularity)
Collect per‑machine log‑template counts and their historical volumes.
Aggregate the counts across templates into a single chi‑square statistic.
Perform a first‑level anomaly check on the aggregated metric; if normal, return “OK”.
If abnormal, drill down to individual templates, compute anomaly scores, and return the top‑5 anomalous templates.
Emit alerts when anomalies are detected, otherwise report normal.
Two statistical methods are combined: a 3‑sigma rule on the chi‑square‑based global metric and a box‑plot‑based score on individual templates. Their scores are fused into a final anomaly score.
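The two-level check can be sketched as below, using only the Python standard library. Several details are assumptions because the article does not specify them: the expected count is the historical mean per template, the chi-square denominator is smoothed by eps, the box-plot score is the distance above the upper whisker in IQR units, and the fusion is a plain product of the global z-score and the per-template score.

```python
import statistics

def chi_square(observed, expected, eps=1.0):
    """Chi-square-style statistic; eps avoids division by zero (assumption)."""
    return sum((o - e) ** 2 / (e + eps) for o, e in zip(observed, expected))

def box_plot_score(value, history):
    """How far value lies above Q3 + 1.5*IQR, measured in IQR units."""
    q1, _, q3 = statistics.quantiles(history, n=4, method="inclusive")
    iqr = max(q3 - q1, 1.0)  # floor keeps a flat history from dividing by zero
    return max(0.0, (value - (q3 + 1.5 * iqr)) / iqr)

def detect(history, current, top_k=5):
    """history: non-empty list of per-minute {template: count} dicts
    (at least two minutes); current: the {template: count} dict to test."""
    templates = sorted(set(current).union(*history))
    expected = [statistics.fmean(h.get(t, 0) for h in history)
                for t in templates]
    # Global metric for every past minute and for the current one.
    past = [chi_square([h.get(t, 0) for t in templates], expected)
            for h in history]
    now = chi_square([current.get(t, 0) for t in templates], expected)
    # First level: 3-sigma rule on the aggregated metric.
    mu, sigma = statistics.fmean(past), statistics.pstdev(past)
    z = (now - mu) / max(sigma, 1e-9)
    if z <= 3:
        return {"status": "OK", "z": z, "top": []}
    # Second level: drill down to individual templates.
    scored = [(t, z * box_plot_score(current.get(t, 0),
                                     [h.get(t, 0) for h in history]))
              for t in templates]
    top = sorted((s for s in scored if s[1] > 0), key=lambda s: -s[1])[:top_k]
    return {"status": "ANOMALY", "z": z, "top": top}
```

With thirty minutes of history in which template A hovers around 100 and template B stays at 5, a minute where B jumps to 60 fails the global check and the drill-down returns B as the only offending template.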
Reducing Periodic False Alarms
Historical anomalies from the same time of day (yesterday or last week) are smoothed with a rolling‑max window and subtracted from the current score, suppressing recurring false positives while accounting for possible time shifts.
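A minimal sketch of this suppression follows. It assumes per-minute anomaly scores are stored as lists indexed by minute of day, that the baseline is the larger of yesterday's and last week's smoothed scores, and that the window half-width is five minutes. None of these choices are stated in the article.

```python
def rolling_max(scores, minute, half_window):
    """Largest score within +/- half_window minutes of the given minute."""
    lo = max(0, minute - half_window)
    hi = min(len(scores), minute + half_window + 1)
    return max(scores[lo:hi], default=0.0)

def suppress_periodic(score_now, minute_of_day, yesterday, last_week,
                      half_window=5):
    """Subtract the smoothed historical score from the current score."""
    baseline = max(rolling_max(yesterday, minute_of_day, half_window),
                   rolling_max(last_week, minute_of_day, half_window))
    return max(0.0, score_now - baseline)
```

The rolling maximum is what tolerates time shifts: a nightly job that spiked at minute 123 yesterday still cancels most of a spike at minute 120 today, while a spike at an hour with no history passes through unchanged.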
New Log Category Alerts
When a surge of new log templates appears, the system adapts: if the daily count of new categories exceeds a threshold, alerts are suppressed to avoid noise; otherwise, a warning is issued.
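The rule reduces to a small decision function, sketched here. The threshold value of 50 and the return labels are hypothetical. The article gives neither.

```python
def new_category_decision(new_templates_today, daily_threshold=50):
    """Decide what to do about templates first seen today.

    Returns "none" when nothing new appeared, "warn" for a handful of
    new categories, and "suppressed" when so many appeared that
    individual warnings would only add noise.
    """
    count = len(set(new_templates_today))
    if count == 0:
        return "none"
    if count > daily_threshold:
        return "suppressed"
    return "warn"
```

A release that rewrites many log statements would push the daily count over the threshold and silence the warnings, whereas a single unfamiliar error template on an otherwise quiet day would still be reported.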
Practical Results
The anomaly detector highlights the offending template when an anomaly is found.
Log count dashboards show a clear spike at the moment of the detected anomaly, confirming the system’s effectiveness.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
