Predicting Server Memory Failures with Machine Learning: Feature Selection, Data Preprocessing, and Model Evaluation
This article presents a machine‑learning approach to predict DRAM failures in large‑scale data centers by analyzing server logs, selecting state, log, and static features through statistical tests and mutual information, preprocessing the data, and employing a tree‑based ensemble classifier that outperforms industry baselines.
Memory (DRAM) failures are a common hardware issue that can cause outages in large‑scale data centers; predicting such failures using server logs and machine‑learning models is essential for reducing unexpected downtime.
The study categorizes predictive features into three groups: (1) state information such as CPU load, memory usage, temperature, and power consumption; (2) log information from system logs like mcelog; and (3) static information describing server and memory attributes (vendor, firmware, speed, etc.).
For feature selection, a T‑test was applied to the state time‑series data to identify variables with significant differences between failing and normal servers within the six days preceding a failure. The most significant features (XXXX1, XXXX2, XXXX3) were chosen as inputs (see Table 1).
| Feature | p-value |
| --- | --- |
| XXXX1 | 0.0067 |
| XXXX2 | 6e-6 |
| XXXX3 | 0.04 |
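The T‑test step above can be sketched as follows. This is a minimal illustration on synthetic data (the series lengths, metric values, and 0.05 threshold are assumptions, not values from the study), using SciPy's two‑sample test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 6-day state time series (hourly samples) for one metric,
# e.g. memory usage: failing servers sit higher than healthy ones.
failing = rng.normal(loc=0.80, scale=0.05, size=144)
healthy = rng.normal(loc=0.60, scale=0.05, size=144)

# Two-sample T-test (Welch's variant): a small p-value means the metric
# separates the two populations and is worth keeping as a model input.
t_stat, p_value = stats.ttest_ind(failing, healthy, equal_var=False)
significant = p_value < 0.05
```

In practice this test would be run per metric, and only the metrics with the smallest p‑values (as in Table 1) kept as features.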
Log selection involved counting occurrences of different log messages in servers that later experienced memory failures, focusing only on logs generated up to five minutes before the failure. The most frequent log type (xxx log) was retained as a feature (see Table 2).
| Log Content | Total Failures | xxx log Count |
| --- | --- | --- |
| Machine Count | 492 | 252 |
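The counting step can be sketched as below. The log record timestamps, message types, and failure time are hypothetical; the only point illustrated is the five‑minute cutoff before the failure and counting message types:

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical parsed log records (timestamp, message type) for one server,
# plus that server's known failure time.
failure_time = datetime(2023, 5, 1, 12, 0, 0)
records = [
    (datetime(2023, 5, 1, 11, 40), "corrected_error"),
    (datetime(2023, 5, 1, 11, 50), "corrected_error"),
    (datetime(2023, 5, 1, 11, 52), "thermal_event"),
    (datetime(2023, 5, 1, 11, 54), "corrected_error"),
    (datetime(2023, 5, 1, 11, 58), "corrected_error"),  # inside the 5-min cutoff, ignored
]

# Keep only logs generated up to five minutes before the failure,
# then count occurrences of each message type.
cutoff = failure_time - timedelta(minutes=5)
counts = Counter(msg for ts, msg in records if ts <= cutoff)
most_common_type, most_common_count = counts.most_common(1)[0]
```

Aggregating these per‑server counts across all failing machines yields a table like Table 2, and the dominant message type becomes a log feature.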
Static features, being categorical strings, were first encoded numerically. Mutual information was then used to rank these features; three static attributes (XXX1, XXX2, XXX3) showed the highest relevance (see Table 3).
| Static Feature | Correlation |
| --- | --- |
| XXX1 | 0.9 |
| XXX2 | 0.6 |
| XXX3 | 0.1 |
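Encoding the categorical attributes and ranking them by mutual information can be sketched with scikit‑learn. The attribute names, values, and label relationship below are synthetic assumptions for illustration:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 400

# Hypothetical categorical static attributes per server.
vendor = rng.choice(["A", "B"], size=n)
speed = rng.choice(["2400", "2666", "3200"], size=n)

# Synthetic failure label: correlated with vendor, independent of speed.
y = (vendor == "A").astype(int)
y[rng.random(n) < 0.1] ^= 1  # flip 10% of labels as noise

# Encode the categorical strings as integer codes.
X = np.column_stack([
    LabelEncoder().fit_transform(vendor),
    LabelEncoder().fit_transform(speed),
])

# discrete_features=True tells sklearn the columns are categorical codes.
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
```

The features with the highest mutual‑information scores (here `vendor`) would be retained, mirroring the ranking in Table 3.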
Data preprocessing includes sliding‑window segmentation, balancing the heavily skewed positive‑negative sample ratio (memory failures are rare), and shifting the failure label forward to align with relevant log windows.
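The three preprocessing steps can be sketched as below. Window length, prediction horizon, and the downsampling ratio are illustrative assumptions, not the study's actual parameters:

```python
import numpy as np

def make_windows(series, labels, window=6, horizon=3):
    """Cut a per-server time series into fixed-length windows and shift the
    failure label forward: a window is positive if a failure occurs within
    `horizon` steps after the window ends."""
    X, y = [], []
    for start in range(len(series) - window - horizon + 1):
        end = start + window
        X.append(series[start:end])
        y.append(int(labels[end:end + horizon].any()))
    return np.array(X), np.array(y)

def downsample_negatives(X, y, ratio=1.0, seed=0):
    """Balance the skewed classes by randomly downsampling negatives to
    `ratio` times the number of positives."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep = rng.choice(neg, size=min(len(neg), int(ratio * len(pos))),
                      replace=False)
    idx = np.sort(np.concatenate([pos, keep]))
    return X[idx], y[idx]

# Toy example: one metric over 20 steps, with a failure at step 15.
series = np.arange(20.0)
labels = np.zeros(20, dtype=int)
labels[15] = 1
Xw, yw = make_windows(series, labels)
Xb, yb = downsample_negatives(Xw, yw)
```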
The prediction task is a binary classification problem. After comparing several supervised models, a tree‑based ensemble classifier was selected for its ability to handle mixed data types, strong interpretability, modest data requirements, good generalization, and relative insensitivity to class imbalance.
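A tree‑based ensemble of this kind can be sketched with a random forest; the source does not name the exact classifier, and the feature matrix below is synthetic. `class_weight="balanced"` illustrates the relative insensitivity to class imbalance:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 600

# Hypothetical mixed feature matrix: two continuous state features,
# one log count, one integer-encoded static attribute.
X = np.column_stack([
    rng.normal(size=n),          # e.g. memory usage
    rng.normal(size=n),          # e.g. temperature
    rng.poisson(2, size=n),      # e.g. xxx log count
    rng.integers(0, 3, size=n),  # e.g. encoded vendor
])
# Rare positive class driven by high log counts plus elevated state.
y = ((X[:, 2] > 4) & (X[:, 0] > 0)).astype(int)

# class_weight="balanced" reweights the rare positive class.
clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
).fit(X, y)

importances = clf.feature_importances_
```

The per‑feature importances also give the interpretability the text mentions: one can read off which state, log, or static features drive a prediction.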
Experimental results show that the proposed model improves recall and precision by at least 10 % over current industry solutions, as illustrated in the accompanying figure.
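Recall and precision on a held‑out set are computed as below; the labels and predictions here are made up purely to show the metric definitions, not the study's results:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical held-out labels (1 = failure) and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
```

For failure prediction, recall bounds how many real failures are caught, while precision bounds how many maintenance actions are wasted on false alarms.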
Alibaba Cloud Infrastructure