Real-Time Log Intelligent Classification Practice
This article describes how NetEase built a real‑time log intelligent classification system using Flink and AI algorithms, detailing the challenges of massive log volumes, the Drain template‑extraction method, algorithm workflow, performance results, and a practical case study that demonstrates reduced alert storms and faster issue diagnosis.
In game and platform back‑ends, logs are increasingly used to record system state and behavior, but massive log volumes cause difficulties such as hidden anomalies, noisy alerts, undetected behavior changes, and alert storms.
NetEase addressed these problems by leveraging the Flink real‑time computation engine and AI models to provide real‑time log intelligent classification, combined with alerting and full‑text search, which reduces the cost of discovering and locating abnormal logs.
Real‑Time Log Intelligent Analysis Overview
Loghub real‑time intelligent analysis automatically groups similar logs by extracting log templates using an AI model.
In a game server scenario, the entire data processing flow is illustrated below:
Thanks to Loghub's real‑time ingestion capability, massive logs can be classified and output automatically within seconds.
Key Features
No need to define log classification rules; the AI model automatically merges similar logs.
Customizable classification precision and flexible control of the number of classes.
Support for merging similar error logs before alerting, reducing alert noise.
Application Scenarios
When manual log categorization is infeasible and rule‑based configuration is difficult.
Large log volumes and excessive alerts, requiring consolidation of similar anomalies.
Diverse log formats that need preliminary grouping before further analysis.
Classification Algorithm Overview
After reviewing various log clustering solutions, NetEase chose a template‑extraction approach for classification.
Log Template Concepts
A log template captures the parts shared by all logs in a class, with the * symbol denoting the variable sections; for example, the logs "Receive block blk_1" and "Receive block blk_2" belong to the same class and share the template "Receive block *".
Using templates makes it possible to judge whether a group of logs is reasonable and to understand what the class represents.
Drain Algorithm
Among template‑extraction algorithms, the Drain algorithm was selected for its robustness, accuracy, speed, and incremental learning capability, making it suitable for online real‑time classification.
Concept Explanation: In NLP and text analysis, texts are split into tokens. For log classification, each token may be a word, phrase, or even a whole sentence. For example, the sentence "This is a test sentence; it illustrates what a token means" is split into two tokens: "This is a test sentence" and "it illustrates what a token means".
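As a minimal illustration of tokenization, the sketch below splits a raw log line on whitespace; real deployments often split on additional delimiters such as '=', ':' or ',', and the example log line is hypothetical:

```python
import re

def tokenize(log_line: str) -> list[str]:
    # Split a raw log line into tokens on whitespace -- the simplest
    # possible tokenizer for log-template extraction.
    return re.split(r"\s+", log_line.strip())

tokens = tokenize("Receive block blk_3587 from /10.0.0.1")
# tokens == ["Receive", "block", "blk_3587", "from", "/10.0.0.1"]
```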
Parse Tree Structure
The algorithm builds a fixed‑depth tree where the root is Root, the second layer groups logs by length, and leaf nodes store the templates belonging to that path.
Each internal node corresponds to a token; a log follows a path only if its tokens match the nodes along it. With the typical depth of 4, the root and length layers are followed by two layers that match the log's first two tokens, and when a node's child count exceeds a threshold (maxChild), a special * child absorbs new tokens.
Algorithm Workflow
Step1: Preprocess – optionally replace parts of the log with * using regex to improve speed and accuracy, then split the log into tokens.
Step2: Use the token length to locate the corresponding node on the second layer.
Step3: Split the log token‑by‑token down the tree, limited by the configured depth.
Step4: At the leaf node, compute similarity simSeq between the log and each template; return the template with the highest similarity above threshold st.
Step5: Update the Parse Tree – if a log matches a template but has differing tokens, replace those tokens with *; if no template matches, add a new template to the leaf.
Step3 Detail: Each node represents a token and is traversed in order (e.g., "Receive" → "from" → "node" → "4"). If the first token is a variable, a * child limits branch explosion, controlled by a maxChild parameter.
Step4 Detail: Similarity is calculated as the proportion of positions where the template token exactly matches the log token (wildcard * positions do not count as matches); the template with the highest similarity is selected, provided that similarity exceeds the threshold st.
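Steps 4 and 5 can be sketched as follows. This is a hedged illustration of the matching-and-update logic, not the production implementation; the function names and the default st of 0.5 are assumptions:

```python
def sim_seq(template: list[str], tokens: list[str]) -> float:
    # Proportion of positions where the template token exactly equals
    # the log token; "*" positions do not count as matches.
    assert len(template) == len(tokens)
    same = sum(1 for t, w in zip(template, tokens) if t == w)
    return same / len(tokens)

def match_or_update(leaf_templates: list[list[str]],
                    tokens: list[str], st: float = 0.5) -> list[str]:
    # Step4: find the template at the leaf with the highest similarity.
    best, best_sim = None, -1.0
    for tpl in leaf_templates:
        s = sim_seq(tpl, tokens)
        if s > best_sim:
            best, best_sim = tpl, s
    if best is not None and best_sim >= st:
        # Step5a: merge -- positions that differ become "*".
        for i, (t, w) in enumerate(zip(best, tokens)):
            if t != w:
                best[i] = "*"
        return best
    # Step5b: no template is similar enough -- start a new one.
    leaf_templates.append(list(tokens))
    return leaf_templates[-1]
```

For example, matching "Receive block blk_2" against an existing template ["Receive", "block", "blk_1"] yields a similarity of 2/3, which exceeds st, so the template is generalized to ["Receive", "block", "*"].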
Traceback Mode
For stack‑trace logs, a special set of parameters and preprocessing rules are applied to handle long lines and variable information, improving classification accuracy for traceback logs.
Other algorithms such as Rebucket consider frame distance and weighting but suffer from higher complexity and require clean traceback logs.
Test Results
Tests on several open‑source datasets show high classification accuracy. In a production scenario, 660k error‑level logs were grouped into 139 classes, 17k critical logs into 20 classes, and 30k traceback logs into 190 classes, achieving up to 800× reduction in alert noise.
Computational Efficiency
The algorithm’s time complexity is O((d + c·m)·n), where d is tree depth, c is the number of templates per leaf, m is log length, and n is the number of logs. Since d, m, and c are effectively constants, the overall complexity is linear O(n). In single‑core tests on large game logs, the Golang SDK achieved roughly 170 K QPS.
Case Study
A high‑concurrency salt‑value management service required >100 k QPS and >200 TB of key‑salt mappings. The architecture (illustrated below) was stress‑tested at 10 k QPS, generating massive logs.
During the test, >30 % of requests failed, producing a flood of diverse error logs. By enabling real‑time log classification, the dashboard displayed aggregated error categories, quickly revealing that most failures originated from a specific MongoDB cluster and the QueryOneByAppKey method.
Identifying the bottleneck allowed the team to add a cache and resolve the performance issue.
Even without prior real‑time classification, Logtail’s query‑time classification can group logs on‑the‑fly, as shown below.
Overall, Loghub’s real‑time intelligent log analysis provides low‑cost, high‑value assistance for unknown problem discovery and rapid issue localization.
Future plans combine intelligent classification with anomaly detection to monitor unexpected spikes while avoiding alert storms, e.g., automatically classifying Python traceback logs and suppressing known issues unless their volume changes dramatically.
NetEase Game Operations Platform
The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.