Tag

RAS


ByteDance SYS Tech
May 28, 2025 · Cloud Computing

How Cloud Firmware 3.0 Is Powering AI‑Driven Data‑Center Reliability

The second Firmware Technology Summit in Changsha, co‑hosted by Volcano Engine and Alibaba Cloud, gathered over 200 experts to showcase cloud firmware 3.0, OpenSFI, Efistub, AI‑focused RAS, and a new IPMI‑HTTPS security model, highlighting open‑source collaboration and industry standards for next‑gen hardware‑software integration.

AI · Open-source · OpenBMC
8 min read
ByteDance SYS Tech
Jun 7, 2024 · Operations

AI-Powered Prediction Prevents Memory Errors: Insights from IEEE RAS Data Center Summit

At the inaugural IEEE RAS in Data Center Summit (June 11‑12, California), ByteDance’s STE team presented two papers—one on AI‑guided, prediction‑based memory error prevention and another on building high‑reliability AI server head nodes—detailing challenges, solutions, and future directions for data‑center RAS.

AI · Memory Errors · RAS
10 min read
vivo Internet Technology
Jul 20, 2022 · Operations

Applying the EDAC Framework for Memory Error Detection and Prediction in Vivo Servers

The article explains how Vivo’s data centers use the Linux EDAC framework—combined with sysfs mapping, driver selection, and APEI error‑injection testing—to monitor per‑DIMM correctable and uncorrectable errors, enabling early fault prediction, reducing server crashes, and supporting broader RAS strategies.

EDAC · Error Injection · Linux
16 min read
Architects' Tech Alliance
Feb 24, 2021 · Fundamentals

Error‑Correcting Code (ECC) Schemes in DDR Memory

The article explains how DDR and LPDDR memory systems use several error-correcting code (ECC) schemes, including side-band, inline, on-die, and link ECC, to provide reliability, availability, and serviceability (RAS) protection against hard and soft memory errors in modern computing devices.

DDR · DRAM · ECC
11 min read
Architects' Tech Alliance
Aug 31, 2016 · Fundamentals

Understanding Reliability Evaluation for Storage, Servers, and Distributed Systems

The article explains how the reliability of storage, servers, and distributed systems is assessed using standards, metrics such as MTBF and MTTR, RAS features, the CAP and BASE theories, and end-to-end solutions, emphasizing the gap between theoretical metrics and real-world operational data.

MTBF · RAS · Reliability
13 min read
Architects' Tech Alliance
Jun 5, 2016 · Fundamentals

Open‑Architecture X86 Small Mainframes: Evolution, Design, and Financial Use Cases

The article reviews the historical development from closed‑architecture mainframes to open X86‑based small servers, analyzes Huawei's Kunlun and competing products, and discusses their RAS features, partitioning technologies, and suitability for high‑availability financial database workloads.

Kunlun · RAS · financial systems
15 min read