Applying the EDAC Framework for Memory Error Detection and Prediction in Vivo Servers
The article explains how Vivo’s data centers use the Linux EDAC framework—combined with sysfs mapping, driver selection, and APEI error‑injection testing—to monitor per‑DIMM correctable and uncorrectable errors, enabling early fault prediction, reducing server crashes, and supporting broader RAS strategies.
With the rapid growth of internet services, infrastructure availability has become a critical concern. Memory failures are both frequent and high-impact: in Vivo’s data centers, where more than 400,000 DIMMs are deployed, they are the second most common hardware fault after disks. An uncorrectable error (UCE) can crash a server immediately, while an accumulation of correctable errors (CE) on a DIMM can eventually lead to a UCE.
The traditional approach of relying on Machine Check Exception (MCE) logs and BMC SEL records can only detect faults after a crash. The EDAC (Error Detection And Correction) framework provides a proactive solution by exposing per‑DIMM CE counts and allowing early identification of failing memory modules.
1. EDAC Overview
EDAC is a Linux kernel framework consisting of a core module (edac_core.ko) and a set of memory‑controller drivers. Its subsystems (edac_mc, edac_device, PCI bus scanning) collect error reports from memory controllers, other controllers (e.g., L3 cache), and PCI devices.
The EDAC core gathers CE and UCE events through several key functions:
edac_mc_alloc(): allocates a mem_ctl_info structure describing a memory controller.
edac_device_handle_ce(): records a correctable error on a non-memory device managed by edac_device.
edac_device_handle_ue(): records an uncorrectable error on a non-memory device managed by edac_device.
edac_mc_handle_error(): reports a memory event to user space, including its location in the memory hierarchy and the error type.
edac_raw_mc_handle_error(): reports a memory event whose details have already been filled in, such as an error detected by the BIOS.
EDAC exposes its view of the hardware through the sysfs filesystem. Each memory controller appears as an mcN directory containing csrowX (chip-select row) and chX (channel) entries, which allows errors to be mapped to specific DIMM slots.
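The csrowX and chX entries described above can also be read programmatically. The following is a minimal Python sketch, assuming the legacy csrow layout with per-channel chN_ce_count attributes; the function names are illustrative and not part of edac-utils, and newer kernels may expose a different layout.

```python
from pathlib import Path

# Root of the EDAC memory-controller tree in sysfs.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_int(path):
    """Return the integer stored in a sysfs attribute, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def collect_counts(root=EDAC_MC_ROOT):
    """Walk mc*/csrow* and return {(mc, csrow, channel): ce_count}."""
    counts = {}
    for csrow in sorted(Path(root).glob("mc*/csrow*")):
        for attr in sorted(csrow.glob("ch*_ce_count")):
            channel = attr.name.split("_", 1)[0]  # "ch0_ce_count" -> "ch0"
            value = read_int(attr)
            if value is not None:
                counts[(csrow.parent.name, csrow.name, channel)] = value
    return counts
```

Calling collect_counts() on a host with a loaded EDAC driver returns one entry per channel of each csrow; on a host without EDAC data it returns an empty dictionary.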
Example sysfs path to view error counters:
# ls /sys/devices/system/edac/mc/mc0/csrow0/
ce_count ch0_ce_count ch0_dimm_label ch1_ce_count ch1_dimm_label dev_type edac_mode mem_type power size_mb subsystem ue_count uevent
The labels.db file stores the relationship between motherboard DIMM labels and the logical mc.row.channel identifiers used by EDAC.
# cat /etc/edac/labels.db
# EDAC Motherboard DIMM labels Database file.
#
# $Id: labels.db 102 2008-09-25 15:52:07Z grondo $
#
# Vendor-name and model-name are found from the program 'dmidecode'
# labels are found from the silk screen on the motherboard.
#
#Vendor: <vendor-name>
#  Model: <model-name>
#    <label>:  <mc>.<row>.<channel>
2. EDAC Support in Linux
EDAC is supported in Linux kernels 2.6.16 and newer. The available driver modules vary by distribution and CPU architecture. The following command lists the modules present on a CentOS 7 system:
# ls /lib/modules/3.10.0-693.el7.x86_64/kernel/drivers/edac/
amd64_edac_mod.ko.xz edac_core.ko.xz i3000_edac.ko.xz i5000_edac.ko.xz i5400_edac.ko.xz i7core_edac.ko.xz ie31200_edac.ko.xz skx_edac.ko.xz
e752x_edac.ko.xz edac_mce_amd.ko.xz i3200_edac.ko.xz i5100_edac.ko.xz i7300_edac.ko.xz i82975x_edac.ko.xz sb_edac.ko.xz x38_edac.ko.xz
Driver selection depends on the CPU generation. For example:
# modinfo sb_edac
filename: /lib/modules/3.10.0-693.el7.x86_64/kernel/drivers/edac/sb_edac.ko.xz
description: MC Driver for Intel Sandy Bridge and Ivy Bridge memory controllers - Ver: 1.1.1
...
# modinfo skx_edac
filename: /lib/modules/3.10.0-693.el7.x86_64/kernel/drivers/edac/skx_edac.ko.xz
description: MC Driver for Intel Skylake server processors
...
If no installed driver matches the CPU's memory controller, no controller is registered in sysfs and edac-util reports “No memory controller data found”.
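A quick way to tell a driver mismatch from a healthy setup is to check which EDAC modules are loaded and whether any memory controller has been registered. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the standard /proc/modules format in which the module name is the first field of each line.

```python
from pathlib import Path


def edac_modules(proc_modules_text):
    """Return the names of loaded modules whose name contains 'edac'."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "edac" in fields[0]:
            names.append(fields[0])
    return names


def has_memory_controller(edac_root="/sys/devices/system/edac/mc"):
    """True if at least one mcN directory exists under the EDAC sysfs root."""
    return any(p.is_dir() for p in Path(edac_root).glob("mc[0-9]*"))
```

On a real host, edac_modules(Path("/proc/modules").read_text()) lists the loaded drivers. If a driver is loaded but has_memory_controller() is false, the driver most likely does not match the CPU's memory controller.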
3. Configuration of Physical Slot Mapping
The edac-utils tools use labels.db to map sysfs identifiers to the physical slot names shown in server inventories. After editing the file, verify the mapping in two steps:
Check the number of entries in /sys/devices/system/edac using edac-ctl.
Confirm that dmidecode -t memory reports matching DIMM names.
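To make the mapping concrete, here is a minimal Python sketch that parses labels.db-style text into a lookup table. It assumes the `label: mc.row.channel;` layout described in the file header, and the vendor, model, and label names in SAMPLE are made up for illustration. It is not a reimplementation of edac-ctl.

```python
def parse_labels(text):
    """Parse labels.db-style text into {(vendor, model): {(mc, row, ch): label}}."""
    boards = {}
    vendor = model = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, rest = line.partition(":")
        if key.strip().lower() == "vendor":
            vendor, model = rest.strip(), None
        elif key.strip().lower() == "model":
            model = rest.strip()
            boards[(vendor, model)] = {}
        elif vendor is not None and model is not None:
            # A label line may hold several "label: mc.row.ch" entries split by ';'.
            for entry in line.split(";"):
                label, _, locations = entry.partition(":")
                for loc in locations.split(","):
                    loc = loc.strip()
                    if loc:
                        mc, row, ch = (int(part) for part in loc.split("."))
                        boards[(vendor, model)][(mc, row, ch)] = label.strip()
    return boards


# Hypothetical board entry, for illustration only.
SAMPLE = """\
Vendor: ExampleVendor
  Model: ExampleBoard
    DIMM_A1: 0.0.0;  DIMM_A2: 0.0.1;
    DIMM_B1: 0.1.0;  DIMM_B2: 0.1.1;
"""

if __name__ == "__main__":
    table = parse_labels(SAMPLE)[("ExampleVendor", "ExampleBoard")]
    print(table[(0, 1, 0)])
```

Stripping whitespace from the vendor and model strings before using them as keys matters here, because a lookup on the strings reported by dmidecode only succeeds on an exact match.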
In Vivo’s environment a packaging issue caused extra spaces in the motherboard model string, preventing edac-ctl from recognizing the model. The issue was fixed by trimming whitespace in the source code:
# vim edac-utils-0.9/src/util/edac-ctl
$vendor =~ s/^\s+|\s+$//g;
$model =~ s/^\s+|\s+$//g;
4. Testing and Validation via APEI Error Injection
To verify that EDAC correctly attributes CE events to the right DIMM, error injection is performed using the ACPI Platform Error Interface (APEI) EINJ table. The required kernel configuration includes:
# cat /boot/config-3.10.0-693.el7.x86_64 | grep -E "CONFIG_DEBUG_FS|CONFIG_ACPI_APEI|CONFIG_ACPI_APEI_EINJ"
CONFIG_DEBUG_FS=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_EINJ=m
Typical injection steps:
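# ls /sys/firmware/acpi/tables/EINJ # confirm the firmware exposes an EINJ table
# modprobe einj # EINJ is built as a module (=m) in this kernel
# ls /sys/kernel/debug/apei/einj/
# cd /sys/kernel/debug/apei/einj/
# cat available_error_type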
0x00000008 Memory Correctable
0x00000010 Memory Uncorrectable non-fatal
0x00000020 Memory Uncorrectable fatal
# echo 0x8 > error_type
# echo 0xfffffffffffff000 > param2 # address mask
# echo 0x32dec000 > param1 # target address
# echo 0x0 > notrigger
# echo 1 > error_inject # trigger injection
# tail /var/log/messages
... EDAC MC0: 1 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 ...
# edac-util -v # shows increased CE count for the DIMM
Successful injection demonstrates that CE counters are updated for the specific DIMM, confirming the correctness of the EDAC configuration.
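The check against the kernel log can be automated by extracting the controller, error kind, and DIMM label from the EDAC message. The pattern below is a sketch written against the single log excerpt shown above; real kernel messages carry additional fields and their wording varies by driver and kernel version, so treat the regular expression as an assumption.

```python
import re

# Matches e.g. "EDAC MC0: 1 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1"
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on (?P<label>\S+)"
)


def parse_edac_line(line):
    """Return a dict describing an EDAC error line, or None if it does not match."""
    match = EDAC_LINE.search(line)
    if match is None:
        return None
    return {
        "mc": int(match.group("mc")),
        "count": int(match.group("count")),
        "kind": match.group("kind"),
        "label": match.group("label"),
    }
```

Comparing the parsed label with the DIMM that owns the injected address confirms whether the error was attributed to the expected slot.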
5. Summary and Outlook
EDAC provides per‑DIMM CE counts, enabling threshold‑based monitoring and predictive analysis. Since its deployment in Vivo’s production fleet, over 450 memory‑related incidents have been detected early, significantly reducing server crashes.
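Threshold-based monitoring of this kind can be sketched as a comparison between two snapshots of per-DIMM CE counters. The example below is illustrative only: the article does not state the thresholds or sampling intervals used in production, so the threshold value and DIMM labels are hypothetical.

```python
def ce_deltas(previous, current):
    """Return the per-DIMM CE increase between two snapshots (label -> count)."""
    deltas = {}
    for dimm, count in current.items():
        before = previous.get(dimm, 0)
        # Counters restart from zero after a reboot or driver reload,
        # so a drop is treated as a fresh start rather than a negative delta.
        deltas[dimm] = count - before if count >= before else count
    return deltas


def dimms_over_threshold(previous, current, threshold):
    """Return the sorted labels of DIMMs whose CE increase reached the threshold."""
    deltas = ce_deltas(previous, current)
    return sorted(dimm for dimm, delta in deltas.items() if delta >= threshold)


if __name__ == "__main__":
    earlier = {"DIMM_A1": 10, "DIMM_B1": 5}
    later = {"DIMM_A1": 12, "DIMM_B1": 60, "DIMM_C1": 1}
    print(dimms_over_threshold(earlier, later, threshold=50))
```

Working on deltas rather than absolute counts keeps a long-lived server with a slow, steady CE rate from being flagged alongside a DIMM whose error rate has suddenly climbed.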
EDAC is a component of the broader RAS (Reliability, Availability, Serviceability) strategy. Future work includes integrating additional RAS mechanisms such as Machine Check Architecture (MCA) recovery to further mitigate hardware faults.
vivo Internet Technology