AI-Powered Prediction Prevents Memory Errors: Insights from IEEE RAS Data Center Summit
At the inaugural IEEE RAS in Data Center Summit (June 11-12, California), ByteDance's STE team presented two papers: one on AI-guided, prediction-based memory error prevention and another on building high-reliability AI server head nodes. Together they detail challenges, solutions, and future directions for data-center RAS.
IEEE RAS in Data Center Summit Overview
The first IEEE RAS in Data Center Summit was held on June 11-12 in California, focusing on Reliability, Availability, and Serviceability (RAS) in data centers. The program included 30 papers covering Data Center RAS, Memory and Interconnects, AI and RAS, and Testing and Resilience.
Selected Papers from ByteDance STE Team
Paper 1: Reducing Memory Errors On‑the‑fly with Prediction‑Guided Failure Prevention
This work builds on previous research on memory error prediction and feature analysis (2021‑2022). Using large‑scale AI models, the authors predict imminent memory faults and apply system‑level RAS actions to repair them before they affect the system. The solution has been deployed on ByteDance DDR4/DDR5 servers with Intel Xeon processors, achieving significant reductions in both uncorrectable (UE) and corrected (CE) errors.
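The predict-then-act loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: a simple sliding-window threshold on corrected-error counts stands in for the AI model, and the record fields, class names, and action names (`offline_page`, `schedule_dimm_replacement`) are all hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectedError:
    """One corrected-error (CE) record; all fields are hypothetical."""
    timestamp: float  # seconds since some reference point
    dimm: str         # DIMM identifier
    page: int         # physical page where the CE was observed


class CeRatePredictor:
    """Stand-in for a learned model: flags a DIMM once its CE count
    inside a sliding time window reaches a threshold."""

    def __init__(self, window_s: float = 3600.0, threshold: int = 3):
        self.window_s = window_s
        self.threshold = threshold
        self._events = defaultdict(deque)

    def observe(self, ce: CorrectedError) -> bool:
        """Record a CE and return True if the DIMM is now flagged."""
        events = self._events[ce.dimm]
        events.append(ce)
        # Drop events that have fallen out of the sliding window.
        while events and ce.timestamp - events[0].timestamp > self.window_s:
            events.popleft()
        return len(events) >= self.threshold

    def choose_action(self, dimm: str) -> str:
        """Pick a (hypothetical) mitigation for a flagged DIMM."""
        pages = {e.page for e in self._events[dimm]}
        # Errors confined to one page: retire that page; otherwise escalate.
        return "offline_page" if len(pages) == 1 else "schedule_dimm_replacement"


def run(events):
    """Feed CE records in time order and collect the triggered actions."""
    predictor = CeRatePredictor()
    actions = []
    for ce in events:
        if predictor.observe(ce):
            actions.append((ce.dimm, predictor.choose_action(ce.dimm)))
    return actions
```

The point of the sketch is the structure: errors are observed continuously, a predictor decides when a fault looks imminent, and a preventive action is chosen before an uncorrectable error occurs. A production system would replace the threshold with a trained model and the action strings with real platform RAS operations.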
Paper 2: Build High Reliability / Availability / Serviceability Head Node for AI Server
AI servers, equipped with multiple GPUs and large memory/storage, exhibit fault rates roughly six times higher than conventional servers. By constructing a detailed fault‑data repository based on CPU‑provided RAS capabilities and analyzing fault distributions, the authors address PCIe peripheral errors, error correction, UE recovery, and hot‑plug stability. The paper discusses engineering challenges, solutions, and future improvement directions for AI server head nodes.
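As a rough illustration of what analyzing fault distributions over such a repository might involve, the sketch below computes each component's share of recorded faults. The record layout and the `component` field are assumptions made for this example, and the sample records are placeholders, not data from the paper.

```python
from collections import Counter


def fault_distribution(records):
    """Return each component's share of all faults, most frequent first.

    `records` is an iterable of dicts with a hypothetical "component" key.
    An empty input yields an empty dict.
    """
    counts = Counter(r["component"] for r in records)
    total = sum(counts.values())
    return {component: n / total for component, n in counts.most_common()}


# Placeholder records for illustration only; not measurements from the paper.
sample_records = [
    {"component": "pcie", "kind": "link_error"},
    {"component": "pcie", "kind": "hot_plug_failure"},
    {"component": "memory", "kind": "uncorrectable_error"},
    {"component": "gpu", "kind": "device_fault"},
]

if __name__ == "__main__":
    for component, share in fault_distribution(sample_records).items():
        print(f"{component}: {share:.0%}")
```

Ranking components this way is one simple way to decide where RAS engineering effort should go first, which is the kind of prioritization the paper's fault-data repository is meant to support.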
Additional Context
The team also presented related findings at the IEEE International Conference on Computer Design (ICCD) and SuperComputing (SC22), covering algorithmic memory error prediction and the relationship between corrected and uncorrectable errors.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.