
Intelligent Log Classification and Anomaly Detection: Design and Implementation

This article presents a two‑stage streaming log classification system built on an improved prefix tree and a longest common subsequence (LCS) algorithm, together with a statistical, unsupervised anomaly detection method that uses chi‑square aggregation and box‑plot scoring. The design aims to reduce false alarms and speed up template convergence.

NetEase Game Operations Platform

Background

As systems become more complex, the volume of logs they generate grows dramatically. Manually locating anomalies in massive error logs is costly: log formats are diverse, and a high volume of alerts obscures the real problems.

Intelligent Log Classification

Design

The initial service used the Drain algorithm to extract log templates. To improve accuracy, a two‑stage streaming model was introduced: a first‑stage pre‑classification, followed by a second stage that merges the logs left unmatched to produce the final classification.

First‑stage Classification

The first stage is implemented with an improved prefix tree.

Below the root, the first layer groups logs by length (token count); subsequent nodes split on token matches. Once a threshold is reached, tokens that match no existing child are routed to a wildcard <*> node. Because the tree depth is limited, matching is fast. In effect, the structure is a prefix tree augmented with a length layer.

Algorithm steps:

1. Pre‑process: split each log into tokens on delimiters and spaces.

2. Use the token count to locate the length node.

3. Descend the tree token by token, for at most depth − 2 levels.

4. At the leaf, compute the similarity (simSeq) between the log and each template, and select the template with the highest similarity above the threshold st.

5. Update the parse tree: if a template was selected, replace its differing tokens with <*>; otherwise add the log as a new template.
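The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the service's actual implementation: the names `LengthPrefixTree`, `sim_seq`, and `max_children`, the default values, and the choice to accept a similarity equal to `st` are all hypothetical.

```python
from dataclasses import dataclass, field

WILDCARD = "<*>"


@dataclass
class Node:
    children: dict = field(default_factory=dict)   # key (length or token) -> Node
    templates: list = field(default_factory=list)  # token lists, used at leaves only


def sim_seq(template, tokens):
    """Fraction of positions where the template token equals the log token."""
    if not tokens:
        return 1.0
    same = sum(1 for t, x in zip(template, tokens) if t == x)
    return same / len(tokens)


class LengthPrefixTree:
    def __init__(self, depth=4, st=0.5, max_children=100):
        self.root = Node()
        self.token_levels = max(depth - 2, 0)  # token layers below the length layer
        self.st = st                           # similarity threshold (assumed value)
        self.max_children = max_children       # assumed cap before using <*>

    def _find_leaf(self, tokens):
        # Layer 1: group by token count.
        node = self.root.children.setdefault(len(tokens), Node())
        # Next layers: split on the leading tokens, up to the depth limit.
        for tok in tokens[: self.token_levels]:
            if tok not in node.children and len(node.children) >= self.max_children:
                tok = WILDCARD  # node is full: route unmatched tokens to <*>
            node = node.children.setdefault(tok, Node())
        return node

    def add(self, line):
        """Classify one log line and return its (possibly updated) template."""
        tokens = line.split()
        leaf = self._find_leaf(tokens)
        best = max(leaf.templates, key=lambda tpl: sim_seq(tpl, tokens), default=None)
        if best is not None and sim_seq(best, tokens) >= self.st:
            # Replace differing positions with the wildcard, in place.
            best[:] = [t if t == x else WILDCARD for t, x in zip(best, tokens)]
            return " ".join(best)
        leaf.templates.append(tokens)
        return " ".join(tokens)


tree = LengthPrefixTree()
tree.add("connect to 10.0.0.1 failed")
print(tree.add("connect to 10.0.0.2 failed"))  # connect to <*> failed
```

Because every template in a leaf has the same token count as the incoming log, the position‑wise comparison in `sim_seq` is well defined; logs of different lengths never meet in this stage.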

Second‑stage Classification

The second stage uses a longest common subsequence (LCS) matching algorithm.

Because the first stage groups logs by length, it can miss similar logs that differ in length or whose tokens are misaligned. Merging by LCS improves template quality. LCS is expensive, since each comparison takes time proportional to the product of the two token counts, so a prefix‑tree pre‑match and a simple loop pre‑match are applied first to reduce the number of LCS computations.

Algorithm steps:

1. Pre‑process: split each log into tokens.

2. Prefix‑tree pre‑match: if the log matches, return its classification and template.

3. Simple loop match: if the log matches, return its classification and template.

4. LCS match: if the log matches, update the template by replacing the differing tokens with <*>; otherwise create a new classification and template.
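The final LCS step can be illustrated with a short Python sketch. It omits the two pre‑match steps, and it rests on assumptions the article does not state: a log matches a template when their LCS covers at least `min_ratio` of the log's tokens, and each run of differing tokens collapses into a single <*>. The function names are hypothetical.

```python
WILDCARD = "<*>"


def lcs_tokens(a, b):
    """Return one longest common subsequence of two token lists (O(len(a)*len(b)))."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    out, i, j = [], m, n
    while i > 0 and j > 0:  # walk back through the table to recover the tokens
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def merge_template(template, tokens):
    """Keep the common tokens; collapse every differing run into one <*>."""
    merged, i, j = [], 0, 0
    for tok in lcs_tokens(template, tokens) + [None]:  # None flushes the tail
        gap = False
        while i < len(template) and template[i] != tok:
            gap, i = True, i + 1
        while j < len(tokens) and tokens[j] != tok:
            gap, j = True, j + 1
        if gap and (not merged or merged[-1] != WILDCARD):
            merged.append(WILDCARD)
        if tok is not None and not (tok == WILDCARD and merged and merged[-1] == WILDCARD):
            merged.append(tok)
        i, j = i + 1, j + 1
    return merged


def lcs_match(templates, tokens, min_ratio=0.5):
    """Return the index of the matched (and updated) or newly created template."""
    best, best_len = None, 0
    for idx, tpl in enumerate(templates):
        length = len(lcs_tokens(tpl, tokens))
        if length > best_len:
            best, best_len = idx, length
    if best is not None and best_len >= min_ratio * len(tokens):
        templates[best] = merge_template(templates[best], tokens)
        return best
    templates.append(list(tokens))
    return len(templates) - 1


templates = []
lcs_match(templates, "user alice logged in from 10.0.0.1".split())
lcs_match(templates, "user bob smith logged in from 10.0.0.2".split())
print(" ".join(templates[0]))  # user <*> logged in from <*>
```

The two example logs have different token counts, so the first stage would place them in different length groups; the LCS merge is what brings them under one template.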

Traceback Log Handling

Traceback logs are tokenized line by line, with each line treated as a single token, so that similar stack traces are clustered together.
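As a small illustration, line‑level tokenization might look like the following. The helper name `tokenize_traceback` and the decision to strip indentation and drop empty lines are assumptions made for this sketch.

```python
def tokenize_traceback(text):
    """Treat each non-empty line of a multi-line traceback as one token."""
    return [line.strip() for line in text.splitlines() if line.strip()]


tb_a = (
    "Traceback (most recent call last):\n"
    '  File "app.py", line 10, in handle\n'
    "    user = load(uid)\n"
    "KeyError: 42\n"
)
tb_b = tb_a.replace("KeyError: 42", "KeyError: 77")

tokens_a = tokenize_traceback(tb_a)
tokens_b = tokenize_traceback(tb_b)
# Positions where the two traces differ; only the last line changes here.
differing = [i for i, (x, y) in enumerate(zip(tokens_a, tokens_b)) if x != y]
print(differing)  # [3]
```

With whole lines as tokens, two traces that share the same call path differ in only a few tokens, so the same similarity and merging logic used for ordinary logs applies to them.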

Cold‑Start Problem

Templates converge slowly when raw logs are fed in directly. Pre‑replacing variables (numbers, base64 strings, addresses) with <*> using regular expressions speeds up convergence dramatically: templates that used to take about a week to stabilize become usable immediately.
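A possible shape for this pre‑replacement step is sketched below. The patterns are illustrative assumptions rather than the rules used in production; in particular, the base64 pattern is a heuristic that may also mask long paths or identifiers.

```python
import re

WILDCARD = "<*>"

# Applied in order: more specific patterns first. All patterns are assumptions.
_PATTERNS = [
    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?::\d+)?\b"),  # IPv4, optional port
    re.compile(r"\b0x[0-9a-fA-F]+\b"),                    # hex addresses
    # Long base64-looking runs (heuristic).
    re.compile(r"(?<![A-Za-z0-9+/])[A-Za-z0-9+/]{20,}={0,2}(?![A-Za-z0-9+/=])"),
    re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?!\w)"),         # standalone numbers
]


def mask_variables(line):
    """Replace variable-looking substrings with the wildcard before parsing."""
    for pattern in _PATTERNS:
        line = pattern.sub(WILDCARD, line)
    return line


print(mask_variables("conn from 10.0.0.1:8080 took 35 ms at 0x7ffe12ab"))
# conn from <*> took <*> ms at <*>
```

Masking before classification means that two logs differing only in such values already look identical to the parser, so a template is correct from its first occurrence instead of being refined over many samples.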

Log Anomaly Detection

Design

The detection method is statistical and unsupervised, requiring no manual labeling and offering good interpretability. It relies on the real‑time templates produced by the classification service and evaluates historical log volume per template.

Specific Algorithm

Detection runs at 1‑minute granularity:

1. Collect the templates seen on each machine and their historical log counts.

2. Aggregate the counts across templates into a single metric, modeled with a chi‑square distribution.

3. Run anomaly detection on the aggregated metric; if it is normal, return normal.

4. Otherwise, run anomaly detection on each template's historical counts and keep the top 5 anomalous templates.

5. If no anomalous template is found, the overall result is normal; otherwise, raise an alert listing the anomalous templates.

Two‑step detection reduces false alarms and CPU load by first checking a global metric before drilling down to individual templates.
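The control flow of the two‑step check can be sketched as below. The function name, the shared `threshold`, and the idea of passing the per‑template scoring as a callable are assumptions chosen to make the CPU saving visible: the per‑template work runs only when the global score is anomalous.

```python
def two_step_detect(global_score, template_scores, threshold=3.0, top_k=5):
    """Return the anomalous templates for this minute, or [] if all is normal.

    global_score: anomaly score of the aggregated metric.
    template_scores: zero-argument callable returning {template: score};
        it is only invoked when the global check fails.
    """
    if global_score <= threshold:
        return []  # step 1 passed: skip the per-template drill-down entirely
    scores = template_scores()
    anomalous = sorted(
        ((score, tpl) for tpl, score in scores.items() if score > threshold),
        reverse=True,
    )
    return [tpl for _, tpl in anomalous[:top_k]]


def expensive_scores():
    # Stand-in for scoring every template's history (hypothetical values).
    return {"login failed <*>": 5.0, "heartbeat ok": 1.0, "timeout <*>": 9.0}


print(two_step_detect(1.0, expensive_scores))  # []
print(two_step_detect(4.0, expensive_scores))  # ['timeout <*>', 'login failed <*>']
```

An empty list in the second case would also count as normal, matching the rule that an alert is raised only when at least one template is anomalous.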

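The global metric is the sum of squares of the template frequencies, which is assumed to follow a chi‑square distribution. A 3‑sigma rule is applied to this metric: a value more than three standard deviations from its historical mean is flagged as anomalous.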
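A minimal sketch of this global check follows, assuming raw per‑minute counts are squared and summed, and that only upward deviations are flagged. Both choices, along with the function names, are assumptions; the article does not say whether counts are standardized before squaring.

```python
from statistics import mean, pstdev


def global_metric(counts):
    """Sum of squares of the per-template counts for one minute."""
    return sum(c * c for c in counts)


def is_global_anomaly(history, current, k=3.0):
    """history: list of per-minute count lists; current: this minute's counts."""
    past = [global_metric(counts) for counts in history]
    mu, sigma = mean(past), pstdev(past)
    value = global_metric(current)
    if sigma == 0:
        return value > mu  # flat history: any increase stands out
    return (value - mu) / sigma > k


history = [[10, 5], [11, 4], [9, 6], [10, 5]]
print(is_global_anomaly(history, [10, 5]))  # False
print(is_global_anomaly(history, [30, 5]))  # True
```

Squaring the counts makes a spike in a single template dominate the metric, so one number per minute is enough to decide whether the drill‑down is needed.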

In addition to the 3‑sigma method, a box‑plot‑based score is computed and fused with the sigma score.
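The article does not give the scoring or fusion formulas, so the sketch below is only one plausible reading. It assumes each score is scaled so that 1.0 sits on the usual boundary (three standard deviations, or 1.5 times the interquartile range above the third quartile), that scores are capped, and that fusion is a weighted average; all names and constants are hypothetical.

```python
from statistics import mean, pstdev, quantiles

CAP = 10.0  # assumed upper bound so a flat history cannot yield infinity


def _scaled(excess, scale):
    """Map an excess over the baseline to a capped, non-negative score."""
    if excess <= 0:
        return 0.0
    if scale == 0:
        return CAP
    return min(excess / scale, CAP)


def sigma_score(history, x):
    """1.0 means x sits exactly three standard deviations above the mean."""
    return _scaled(x - mean(history), 3 * pstdev(history))


def box_score(history, x):
    """1.0 means x sits exactly on the upper whisker, Q3 + 1.5 * IQR."""
    q1, _, q3 = quantiles(history, n=4)
    return _scaled(x - q3, 1.5 * (q3 - q1))


def fused_score(history, x, weight=0.5):
    """Weighted average of the two scores (the fusion rule is an assumption)."""
    return weight * sigma_score(history, x) + (1 - weight) * box_score(history, x)


history = [10, 12, 11, 9, 10, 13, 11, 10]
print(fused_score(history, 11) < 1.0)  # True: within normal range
print(fused_score(history, 40) > 1.0)  # True: clear spike
```

The box‑plot score depends on quartiles rather than on the mean and standard deviation, so a few extreme values in the history distort it less than they distort the sigma score.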

Reducing Periodic False Alarms

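To suppress alarms caused by periodic spikes, the current score is reduced by historical scores from the same time of day. A rolling maximum is taken over a window around that time on the previous day and in the previous week, and the larger of the two values is subtracted from the current score.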
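One way to express this suppression in Python is shown below. The centered window, its default width, and clamping the result at zero are assumptions made for illustration; `scores` is taken to be a list of per‑minute anomaly scores indexed by minute.

```python
MINUTES_PER_DAY = 24 * 60


def rolling_max(scores, center, half_window):
    """Maximum score within +/- half_window minutes of center (0.0 if no data)."""
    lo = max(center - half_window, 0)
    hi = min(center + half_window + 1, len(scores))
    return max(scores[lo:hi]) if lo < hi else 0.0


def suppress_periodic(scores, t, half_window=5):
    """Subtract the larger of yesterday's and last week's windowed peak."""
    yesterday = rolling_max(scores, t - MINUTES_PER_DAY, half_window)
    last_week = rolling_max(scores, t - 7 * MINUTES_PER_DAY, half_window)
    return max(scores[t] - max(yesterday, last_week), 0.0)


scores = [0.0] * (8 * MINUTES_PER_DAY)
t = 7 * MINUTES_PER_DAY + 600      # 10:00 on the eighth day
scores[t] = 5.0                    # current spike
scores[t - MINUTES_PER_DAY + 2] = 4.0  # similar spike yesterday, 2 minutes later
print(suppress_periodic(scores, t))    # 1.0
```

The window tolerates small shifts in timing, so a daily job that fires a couple of minutes earlier or later than the day before is still recognized as periodic.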

New Log Category Alerts

An adaptive alert triggers when the number of new log categories in a day exceeds a threshold; otherwise alerts are suppressed to avoid noise during template stabilization.

Application Results

When an anomaly is detected, the system notifies the user of the anomalous template, and the log count page shows a sudden spike in the corresponding logs.

Written by

NetEase Game Operations Platform

The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.
