
AI-Powered Prediction Prevents Memory Errors: Insights from IEEE RAS Data Center Summit

At the inaugural IEEE RAS in Data Center Summit (June 11‑12, California), ByteDance’s STE team presented two papers: one on AI‑guided, prediction‑based memory error prevention, and one on building high‑reliability head nodes for AI servers. Both papers detail the challenges, solutions, and future directions for data‑center RAS.

ByteDance SYS Tech

Jun 7, 2024


IEEE RAS in Data Center Summit Overview

The first IEEE RAS in Data Center Summit will be held on June 11‑12 in California, focusing on Reliability, Availability, and Serviceability (RAS) in data centers. The program includes 30 papers covering Data Center RAS, Memory and Interconnects, AI and RAS, and Testing and Resilience.

Selected Papers from ByteDance STE Team

Paper 1: Reducing Memory Errors On‑the‑fly with Prediction‑Guided Failure Prevention

This work builds on previous research on memory error prediction and feature analysis (2021‑2022). Using large‑scale AI models, the authors predict imminent memory faults and apply system‑level RAS actions to repair them before they affect the system. The solution has been deployed on ByteDance DDR4/DDR5 servers with Intel Xeon processors, where it significantly reduced both uncorrectable errors (UEs) and corrected errors (CEs).
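The predict-then-act loop described above can be illustrated with a minimal sketch. It is not the authors' method: a simple sliding-window count of corrected errors stands in for the AI model, and the class name `PreventionLoop`, the action labels `offline_page` and `schedule_replacement`, the DIMM labels, and the thresholds are all hypothetical, chosen only to show the control flow.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectedError:
    timestamp: float  # seconds since an arbitrary epoch
    dimm: str         # hypothetical DIMM label, e.g. "A1"
    row: int          # DRAM row reported for the error


class PreventionLoop:
    """Toy stand-in for a learned predictor plus a RAS action chooser."""

    def __init__(self, window_s: float = 3600.0, ce_threshold: int = 3):
        self.window_s = window_s          # assumed look-back window
        self.ce_threshold = ce_threshold  # assumed CE count that flags risk
        self._recent = defaultdict(deque)

    def observe(self, ev: CorrectedError) -> str:
        """Record one CE and return a hypothetical preventive action."""
        recent = self._recent[ev.dimm]
        recent.append(ev)
        # Drop events that have aged out of the look-back window.
        while ev.timestamp - recent[0].timestamp > self.window_s:
            recent.popleft()
        if len(recent) < self.ce_threshold:
            return "none"
        # Errors confined to one row suggest a local fault that isolating
        # the affected page may contain; a wider spread suggests replacement.
        rows = {e.row for e in recent}
        return "offline_page" if len(rows) == 1 else "schedule_replacement"
```

In a real deployment the threshold check would be replaced by a trained model's risk score, and the returned label would map onto whatever RAS actions the platform actually exposes.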

Paper 2: Build High Reliability / Availability / Serviceability Head Node for AI Server

AI servers, equipped with multiple GPUs and large memory/storage, exhibit fault rates roughly six times higher than conventional servers. By constructing a detailed fault‑data repository based on CPU‑provided RAS capabilities and analyzing fault distributions, the authors address PCIe peripheral errors, error correction, UE recovery, and hot‑plug stability. The paper discusses engineering challenges, solutions, and future improvement directions for AI server head nodes.
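As a rough illustration of analyzing fault distributions from such a repository, the sketch below computes each component's share of recorded faults. The record layout (a `component` field and a `kind` field) and the sample records are assumptions made for this example; they are synthetic and do not come from the paper.

```python
from collections import Counter
from typing import Iterable, Mapping


def fault_distribution(records: Iterable[Mapping[str, str]]) -> dict[str, float]:
    """Return each component's share of all recorded faults, largest first."""
    counts = Counter(r["component"] for r in records)
    total = sum(counts.values())
    return {comp: n / total for comp, n in counts.most_common()}


# Synthetic records for illustration only; not data from the paper.
records = [
    {"component": "pcie", "kind": "correctable"},
    {"component": "pcie", "kind": "uncorrectable"},
    {"component": "memory", "kind": "correctable"},
    {"component": "cpu", "kind": "correctable"},
]
shares = fault_distribution(records)
```

Ranking components by share like this is one simple way to decide which error class to harden first.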

Additional Context

The team also presented related findings at the IEEE International Conference on Computer Design (ICCD) and at Supercomputing (SC22), covering algorithmic memory error prediction and the relationship between corrected and uncorrectable errors.


