LIBRA and CARE: Memory Bandwidth Management and Fault‑Tolerance Innovations Presented at HPCA 2021
The article reviews two HPCA 2021 papers from Alibaba Cloud—LIBRA, a dynamic memory‑bandwidth management framework that boosts data‑center utilization, and CARE, a cache‑based fault‑tolerance architecture that delivers near‑Chipkill reliability with minimal overhead—while also highlighting future research directions in ML systems, quantum computing, and cache computing.
HPCA (the IEEE International Symposium on High‑Performance Computer Architecture) is one of the most prestigious conferences in computer architecture and high‑performance computing. Two papers authored by Alibaba Cloud infrastructure experts were presented at HPCA 2021: one targets data‑center resource utilization, the other reliability.
LIBRA addresses the challenge of allocating shared memory bandwidth among heterogeneous data‑center workloads. The bandwidth controls built into existing server chips are inflexible and slow to respond, so bandwidth is wasted. LIBRA introduces a Dynamic‑Rate‑Control (DRC) technique that dynamically throttles low‑priority jobs. According to the authors, this significantly improves the performance of high‑priority workloads, raises server utilization, and reduces total cost of ownership.
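To make the idea of dynamically throttling low‑priority jobs concrete, here is a minimal sketch of a feedback loop in Python. It is not LIBRA's DRC algorithm, whose details this article does not describe. The class name, the latency signal, the additive‑increase/multiplicative‑decrease policy, and all numeric values are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ThrottleController:
    """Hypothetical feedback loop; NOT LIBRA's actual DRC algorithm."""

    total_bw_gbps: float        # assumed total memory bandwidth of the socket
    min_cap_gbps: float = 1.0   # floor so low-priority jobs are never starved
    step_up_gbps: float = 2.0   # additive increase when there is headroom
    backoff: float = 0.5        # multiplicative decrease under contention
    cap_gbps: float = 0.0       # current cap on low-priority bandwidth

    def __post_init__(self) -> None:
        # Start unthrottled: low-priority jobs may use the full bandwidth.
        self.cap_gbps = self.total_bw_gbps

    def update(self, hp_latency_ns: float, hp_target_ns: float) -> float:
        """Adjust the low-priority cap from one high-priority latency sample."""
        if hp_latency_ns > hp_target_ns:
            # High-priority work is suffering: cut the cap sharply.
            self.cap_gbps = max(self.min_cap_gbps, self.cap_gbps * self.backoff)
        else:
            # Target met: hand bandwidth back gradually.
            self.cap_gbps = min(self.total_bw_gbps,
                                self.cap_gbps + self.step_up_gbps)
        return self.cap_gbps


if __name__ == "__main__":
    ctl = ThrottleController(total_bw_gbps=100.0)
    for latency in (150.0, 150.0, 80.0):
        print(ctl.update(hp_latency_ns=latency, hp_target_ns=100.0))
```

The sketch backs off quickly and recovers slowly, which favors the high‑priority workload at the cost of some low‑priority throughput. A real controller would have to act through whatever bandwidth‑control mechanism the hardware exposes.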
CARE (Coordinated Augmentation for Elastic Resilience on DRAM Errors) tackles the growing impact of memory errors as compute density and DRAM capacity rise. Traditional ECC solutions impose large performance, power, or capacity penalties, or require extensive system changes. CARE adds a cache‑like structure inside the memory controller to collect error statistics and perform proactive correction, achieving reliability close to Chipkill without sacrificing memory capacity and with negligible performance overhead.
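As a rough software analogy for a cache‑like structure that collects error statistics, the Python sketch below keeps a small, LRU‑evicted table of corrected‑error counts per DRAM row and flags a row once it crosses a threshold. This is not CARE's hardware design. The class name, the row‑address key, the capacity, the threshold, and the idea of flagging a row for proactive handling are all assumptions made for illustration.

```python
from collections import OrderedDict


class ErrorTracker:
    """Hypothetical LRU table of per-row error counts; NOT CARE's design."""

    def __init__(self, capacity: int = 4, threshold: int = 3) -> None:
        self.capacity = capacity    # assumed number of tracked rows
        self.threshold = threshold  # assumed count that triggers action
        self.counts: "OrderedDict[int, int]" = OrderedDict()

    def record_corrected_error(self, row_addr: int) -> bool:
        """Record one corrected error; return True if the row looks faulty."""
        count = self.counts.pop(row_addr, 0) + 1
        self.counts[row_addr] = count  # reinsert as most recently used
        if len(self.counts) > self.capacity:
            self.counts.popitem(last=False)  # evict least recently used row
        return count >= self.threshold


if __name__ == "__main__":
    tracker = ErrorTracker(capacity=2, threshold=3)
    for _ in range(3):
        flagged = tracker.record_corrected_error(0x10)
    print(flagged)
```

A bounded table like this keeps the tracking cost fixed regardless of DRAM capacity. The trade‑off is that a row with infrequent errors can be evicted before it reaches the threshold.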
The sharing session concluded with the experts’ outlook on future directions in computer architecture: continued momentum for machine‑learning systems and accelerators, rapid advances in quantum computing and its integration into architectural research, and ongoing challenges in cache computing.
Alibaba Cloud Infrastructure
For uninterrupted computing services