
Applying the EDAC Framework for Memory Error Detection and Prediction in Vivo Servers

The article explains how Vivo’s data centers use the Linux EDAC framework—combined with sysfs mapping, driver selection, and APEI error‑injection testing—to monitor per‑DIMM correctable and uncorrectable errors, enabling early fault prediction, reducing server crashes, and supporting broader RAS strategies.

vivo Internet Technology

With the rapid growth of internet services, infrastructure availability has become a critical concern. Memory failures are both frequent and high‑impact: they are the second most common hardware fault after disk failures in Vivo’s data centers, where more than 400,000 DIMMs are deployed. An uncorrectable error (UCE) can crash a server immediately, and a DIMM that keeps accumulating correctable errors (CE) is at increased risk of eventually producing a UCE.

The traditional approach of relying on Machine Check Exception (MCE) logs and BMC SEL records can only detect faults after a crash. The EDAC (Error Detection And Correction) framework provides a proactive solution by exposing per‑DIMM CE counts and allowing early identification of failing memory modules.

1. EDAC Overview

EDAC is a Linux kernel framework consisting of a core module (edac_core.ko) and a set of memory‑controller drivers. Its subsystems (edac_mc, edac_device, PCI bus scanning) collect error reports from memory controllers, other controllers (e.g., L3 cache), and PCI devices.

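EDAC collects CE and UCE events through several key functions. The edac_mc_* calls belong to the edac_mc subsystem, while the edac_device_* calls belong to the edac_device subsystem, which covers non‑memory hardware such as caches:

edac_mc_alloc() : allocates a mem_ctl_info structure describing a memory controller.

edac_device_handle_ce() : records a correctable error on an edac_device instance.

edac_device_handle_ue() : records an uncorrectable error on an edac_device instance.

edac_mc_handle_error() : reports a memory event to user space, including its position in the controller hierarchy and the error type.

edac_raw_mc_handle_error() : reports a memory event to user space without trying to locate it, for errors whose details are supplied by the caller (for example, errors detected by the BIOS).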

EDAC exposes its view of the hardware through sysfs. Each memory controller appears as an mcX directory; the csrowX (chip‑select row) entries beneath it hold per‑channel chX counters and labels, which allows errors to be mapped to specific DIMM slots.

Example sysfs path to view error counters:

# ls /sys/devices/system/edac/mc/mc0/csrow0/
ce_count  ch0_ce_count  ch0_dimm_label  ch1_ce_count  ch1_dimm_label  dev_type  edac_mode  mem_type  power  size_mb  subsystem  ue_count  uevent
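The counters in the listing above can be collected programmatically. The following is a minimal sketch, assuming the csrowX/chX layout shown here; the function name read_channel_ce_counts and the returned dictionary shape are illustrative and not part of EDAC or edac-utils.

```python
from pathlib import Path

# Default location of the EDAC memory-controller tree in sysfs.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_channel_ce_counts(root=EDAC_MC_ROOT):
    """Return {(mc, csrow, channel): (label, ce_count)} for every chX_ce_count file.

    Hypothetical helper: it assumes the mcX/csrowX/chX_* layout from the listing.
    """
    counts = {}
    for counter in sorted(Path(root).glob("mc*/csrow*/ch*_ce_count")):
        csrow_dir = counter.parent
        channel = counter.name.split("_")[0]            # "ch0_ce_count" -> "ch0"
        label_file = csrow_dir / f"{channel}_dimm_label"
        # A channel without a label file is reported with an empty label.
        label = label_file.read_text().strip() if label_file.exists() else ""
        key = (csrow_dir.parent.name, csrow_dir.name, channel)
        counts[key] = (label, int(counter.read_text().strip()))
    return counts
```

Passing a different root makes it possible to run the function against a copied or mocked sysfs tree, which is convenient when no EDAC driver is loaded on the development machine.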

The labels.db file stores the relationship between motherboard DIMM labels and the logical mc.row.channel identifiers used by EDAC.

# cat /etc/edac/labels.db
# EDAC Motherboard DIMM labels Database file.
#
# $Id: labels.db 102 2008-09-25 15:52:07Z grondo $
#
#  Vendor-name and model-name are found from the program 'dmidecode'
#  labels are found from the silk screen on the motherboard.
#
#Vendor: <vendor-name>
#  Model: <model-name>
#    <label>:  <mc>.<row>.<channel>
#
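To make the mapping concrete, here is a small parsing sketch. It assumes entries of the form "Vendor:", "Model:", then "label: mc.row.channel" lines, with ";" separating several labels on one line and "," separating several locations for one label. That entry syntax is an assumption about the file format, and parse_labels_db is a hypothetical helper, not a function from edac-utils.

```python
def parse_labels_db(text):
    """Parse labels.db-style text into {(vendor, model): {label: [(mc, row, channel), ...]}}.

    Assumed format (not taken from edac-utils itself):
        Vendor: <vendor-name>
          Model: <model-name>
            <label>: <mc>.<row>.<channel>[, ...]; <label>: ...
    """
    boards = {}
    vendor = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):            # skip blanks and comments
            continue
        key, _, value = line.partition(":")
        if key.strip().lower() == "vendor":
            vendor = value.strip()
            current = None
        elif key.strip().lower() == "model":
            current = boards.setdefault((vendor, value.strip()), {})
        elif current is not None:
            for entry in line.split(";"):
                label, sep, locations = entry.partition(":")
                if not sep:
                    continue
                for location in locations.split(","):
                    parts = location.strip().split(".")
                    if len(parts) == 3 and all(p.isdigit() for p in parts):
                        mc, row, channel = (int(p) for p in parts)
                        current.setdefault(label.strip(), []).append((mc, row, channel))
    return boards
```

Keying the result by (vendor, model) mirrors how the file groups labels per motherboard, so a lookup with the strings reported by dmidecode returns the slot map for that board.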

2. EDAC Support in Linux

EDAC is supported in Linux kernels 2.6.16 and newer. The available driver modules vary by distribution and CPU architecture. The following command lists the modules present on a CentOS 7 system:

# ls /lib/modules/3.10.0-693.el7.x86_64/kernel/drivers/edac/
amd64_edac_mod.ko.xz  edac_core.ko.xz     i3000_edac.ko.xz  i5000_edac.ko.xz  i5400_edac.ko.xz  i7core_edac.ko.xz   ie31200_edac.ko.xz  skx_edac.ko.xz
 e752x_edac.ko.xz      edac_mce_amd.ko.xz  i3200_edac.ko.xz  i5100_edac.ko.xz  i7300_edac.ko.xz  i82975x_edac.ko.xz  sb_edac.ko.xz       x38_edac.ko.xz

Driver selection depends on the CPU generation. For example:

# modinfo sb_edac
filename:       /lib/modules/3.10.0-693.el7.x86_64/kernel/drivers/edac/sb_edac.ko.xz
description:    MC Driver for Intel Sandy Bridge and Ivy Bridge memory controllers -  Ver: 1.1.1
...
# modinfo skx_edac
filename:       /lib/modules/3.10.0-693.el7.x86_64/kernel/drivers/edac/skx_edac.ko.xz
description:    MC Driver for Intel Skylake server processors
...

If no installed driver matches the CPU’s memory controller, no controller is registered in sysfs and edac-util reports “No memory controller data found”.
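A quick way to confirm that a suitable driver is active is to look at the loaded modules and at the controllers registered in sysfs. The sketch below is a heuristic, assuming drivers are built as modules whose names contain "edac"; both function names are illustrative.

```python
import re
from pathlib import Path


def loaded_edac_drivers(proc_modules_text):
    """Return EDAC-related module names found in /proc/modules content.

    Heuristic: any module with "edac" in its name, except the edac_core framework.
    Drivers compiled into the kernel do not appear in /proc/modules.
    """
    drivers = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "edac" in fields[0] and fields[0] != "edac_core":
            drivers.append(fields[0])
    return drivers


def registered_controllers(root="/sys/devices/system/edac/mc"):
    """Return the sorted mcN directory names under the EDAC sysfs tree."""
    return sorted(
        p.name for p in Path(root).glob("mc*") if re.fullmatch(r"mc\d+", p.name)
    )
```

An empty result from registered_controllers() corresponds to the situation in which no memory controller data is available, regardless of which modules are loaded.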

3. Configuration of Physical Slot Mapping

The edac-utils tools use labels.db to map sysfs identifiers to the physical slot names shown in server inventories. After editing the file, verify the mapping in two steps:

Use edac-ctl to check the number of entries under /sys/devices/system/edac.

Confirm that dmidecode -t memory reports matching DIMM names.
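The second check can be automated by comparing the populated slots reported by dmidecode with the labels known to EDAC. This is a sketch under the assumption that the labels in labels.db use the same strings as the dmidecode "Locator" field; populated_locators and unlabeled_slots are hypothetical helper names.

```python
def populated_locators(dmidecode_text):
    """Return the Locator of every populated "Memory Device" block.

    Expects the text printed by `dmidecode -t memory`, where blocks are
    separated by blank lines and empty slots show "No Module Installed".
    """
    locators = []
    for block in dmidecode_text.split("\n\n"):
        lines = [line.strip() for line in block.splitlines()]
        if "Memory Device" not in lines:
            continue
        fields = {}
        for line in lines:
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        size = fields.get("Size", "")
        if "Locator" in fields and size and size != "No Module Installed":
            locators.append(fields["Locator"])
    return locators


def unlabeled_slots(dmidecode_text, edac_labels):
    """Return populated slots that have no matching EDAC label."""
    return sorted(set(populated_locators(dmidecode_text)) - set(edac_labels))
```

An empty list from unlabeled_slots() suggests that every installed DIMM has a label; any slot it returns points at a missing or misspelled labels.db entry.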

In Vivo’s environment a packaging issue caused extra spaces in the motherboard model string, preventing edac-ctl from recognizing the model. The issue was fixed by trimming whitespace in the source code:

vim edac-utils-0.9/src/util/edac-ctl
$vendor =~ s/^\s+|\s+$//g;
$model  =~ s/^\s+|\s+$//g;

4. Testing and Validation via APEI Error Injection

To verify that EDAC correctly attributes CE events to the right DIMM, error injection is performed using the ACPI Platform Error Interface (APEI) EINJ table. The required kernel configuration includes:

# cat /boot/config-3.10.0-693.el7.x86_64 | grep -E "CONFIG_DEBUG_FS|CONFIG_ACPI_APEI|CONFIG_ACPI_APEI_EINJ"
CONFIG_DEBUG_FS=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_EINJ=m
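The same check can be scripted so that it runs across many hosts. The sketch below reads the text of a kernel config file and reports which of the three options from the grep above are not enabled; missing_options is an illustrative name.

```python
import re

# The options named in the grep command above.
REQUIRED_OPTIONS = ("CONFIG_DEBUG_FS", "CONFIG_ACPI_APEI", "CONFIG_ACPI_APEI_EINJ")


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to 'y' or 'm'."""
    enabled = set()
    for line in config_text.splitlines():
        match = re.fullmatch(r"(CONFIG_\w+)=([ym])", line.strip())
        if match:
            enabled.add(match.group(1))
    return [option for option in required if option not in enabled]
```

Lines such as "# CONFIG_ACPI_APEI_EINJ is not set" do not match the pattern, so the corresponding option is reported as missing.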

Typical injection steps:

# ls /sys/firmware/acpi/tables/EINJ   # confirm the platform exposes an EINJ table
# modprobe einj                       # EINJ is built as a module (CONFIG_ACPI_APEI_EINJ=m)
# cd /sys/kernel/debug/apei/einj/
# cat available_error_type
0x00000008  Memory Correctable
0x00000010  Memory Uncorrectable non-fatal
0x00000020  Memory Uncorrectable fatal
# echo 0x8 > error_type
# echo 0xfffffffffffff000 > param2   # address mask
# echo 0x32dec000 > param1          # target address
# echo 0x0 > notrigger
# echo 1 > error_inject            # trigger injection
# tail /var/log/messages
... EDAC MC0: 1 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 ...
# edac-util -v   # shows increased CE count for the DIMM
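For repeated test runs, the same sequence of writes can be wrapped in a small script. This is a sketch only: it writes the values used in the transcript to the EINJ debugfs files, in the same order, and the function names and the einj_dir parameter are illustrative. On a real host it needs root privileges and will inject an actual memory error, so it should be run only on test machines.

```python
from pathlib import Path

EINJ_DIR = Path("/sys/kernel/debug/apei/einj")
MEMORY_CORRECTABLE = 0x8   # "Memory Correctable" in available_error_type


def supported_error_types(einj_dir=EINJ_DIR):
    """Parse available_error_type into {code: description}."""
    types = {}
    text = (Path(einj_dir) / "available_error_type").read_text()
    for line in text.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2:
            types[int(fields[0], 16)] = fields[1].strip()
    return types


def inject_memory_ce(address, mask=0xFFFFFFFFFFFFF000, einj_dir=EINJ_DIR):
    """Write the EINJ control files in the order used in the transcript."""
    einj_dir = Path(einj_dir)
    if MEMORY_CORRECTABLE not in supported_error_types(einj_dir):
        raise RuntimeError("platform does not advertise Memory Correctable injection")
    steps = [
        ("error_type", f"{MEMORY_CORRECTABLE:#x}"),
        ("param2", f"{mask:#x}"),        # address mask
        ("param1", f"{address:#x}"),     # target address
        ("notrigger", "0x0"),
        ("error_inject", "1"),           # this write triggers the injection
    ]
    for name, value in steps:
        (einj_dir / name).write_text(value + "\n")
    return steps
```

Checking available_error_type first avoids writing an error type that the firmware does not support.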

Successful injection demonstrates that CE counters are updated for the specific DIMM, confirming the correctness of the EDAC configuration.
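The log check can also be automated. The sketch below extracts the controller, count, error kind, and DIMM label from kernel log lines shaped like the one in the transcript. The regular expression is an assumption based on that single sample line and expects labels without spaces; other drivers or kernel versions may format the message differently.

```python
import re
from collections import Counter

# Pattern derived from the sample line above; treat it as an assumption.
EDAC_LINE = re.compile(
    r"EDAC (?P<mc>MC\d+): (?P<count>\d+) (?P<kind>CE|UE) (?P<detail>.+?) on (?P<label>\S+)"
)


def count_errors_by_dimm(log_lines):
    """Sum reported error counts per (DIMM label, error kind)."""
    totals = Counter()
    for line in log_lines:
        match = EDAC_LINE.search(line)
        if match:
            totals[(match["label"], match["kind"])] += int(match["count"])
    return totals
```

After an injection, the total for the targeted label should rise by the number of injected errors, which gives a scriptable pass/fail criterion for the mapping test.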

5. Summary and Outlook

EDAC provides per‑DIMM CE counts, enabling threshold‑based monitoring and predictive analysis. Since its deployment in Vivo’s production fleet, over 450 memory‑related incidents have been detected early, significantly reducing server crashes.
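Threshold‑based monitoring can be as simple as comparing two snapshots of the per‑DIMM counters. The following is a minimal sketch; the function name and the threshold value are hypothetical, since the article does not state the thresholds used in production.

```python
def dimms_over_threshold(previous, current, threshold):
    """Return {dimm: new_ce_count} for DIMMs whose CE count grew by at least `threshold`.

    `previous` and `current` map a DIMM label to its cumulative CE count.
    """
    flagged = {}
    for dimm, count in current.items():
        delta = count - previous.get(dimm, 0)
        if delta < 0:
            # The counter went backwards (for example after a reboot or a
            # driver reload), so count the current value as new errors.
            delta = count
        if delta >= threshold:
            flagged[dimm] = delta
    return flagged


# Example with made-up labels and a hypothetical threshold of 10 new CEs per interval.
alerts = dimms_over_threshold({"DIMM_A1": 5, "DIMM_B1": 100},
                              {"DIMM_A1": 30, "DIMM_B1": 3, "DIMM_C1": 1},
                              threshold=10)
```

Working on deltas rather than absolute counts keeps long‑running servers from being flagged for errors that accumulated slowly over months.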

EDAC is a component of the broader RAS (Reliability, Availability, Serviceability) strategy. Future work includes integrating additional RAS mechanisms such as Machine Check Architecture (MCA) recovery to further mitigate hardware faults.

