Fine‑Grained Activation Offloading: Cutting Memory Use While Preserving LLM Throughput

The article introduces a fine‑grained activation offloading technique implemented in Megatron‑Core. It offloads activations at module granularity to CPU memory, overlaps the transfers with computation, and remains compatible with pipeline and virtual pipeline parallelism, substantially reducing peak GPU memory for large language models with little loss in throughput.
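
To make the memory argument concrete, here is a toy model of GPU‑resident activation memory during one forward and backward pass. It is not Megatron‑Core's implementation. It assumes that every layer produces an activation of the same size, that an activation's copy to the host finishes while the next `prefetch_depth` layers compute, and that the backward pass prefetches the next `prefetch_depth` activations while the current layer's gradient is computed. `simulate_peak` and `prefetch_depth` are hypothetical names chosen for illustration.

```python
def simulate_peak(num_layers: int, act_size: int, offload: bool,
                  prefetch_depth: int = 1) -> int:
    """Toy model: peak GPU-resident activation memory over one forward+backward pass.

    Hypothetical helper. Assumes equal-sized per-layer activations and that a
    device-to-host copy (or host-to-device prefetch) completes within
    `prefetch_depth` layers of compute.
    """
    on_gpu: set[int] = set()  # layer indices whose activations live on the GPU
    peak = 0

    # Forward pass: layer i produces its activation on the GPU.
    for i in range(num_layers):
        on_gpu.add(i)
        if offload:
            # Copies of activations older than the in-flight window have finished,
            # so their GPU buffers can be released.
            on_gpu -= {j for j in on_gpu if j < i - prefetch_depth}
        peak = max(peak, len(on_gpu) * act_size)

    # Backward pass: gradients flow from the last layer to the first.
    for i in reversed(range(num_layers)):
        on_gpu.add(i)  # activation i must be on the GPU (reloaded if offloaded)
        if offload:
            # Prefetch the activations backward needs next, overlapping the
            # host-to-device copy with layer i's gradient computation.
            on_gpu |= set(range(max(0, i - prefetch_depth), i))
        peak = max(peak, len(on_gpu) * act_size)
        on_gpu.discard(i)  # layer i's gradient is done; free its activation

    return peak


if __name__ == "__main__":
    print(simulate_peak(8, 100, offload=False))  # 800
    print(simulate_peak(8, 100, offload=True))   # 200
```

In this model the peak no longer grows with depth once offloading is on: only the activation being computed and the ones still in flight occupy GPU memory. Real training also keeps parameters, gradients, and optimizer state on the GPU, so end‑to‑end savings are smaller than this ratio alone suggests.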

LLM · Megatron · Memory Optimization

DataFunSummit

Apr 11, 2023 · Artificial Intelligence

OneFlow Coop: Joint Optimization of Dynamic‑Graph Recomputation and Memory Allocation

This article introduces OneFlow Coop, a memory‑optimization technique that jointly optimizes dynamic‑graph recomputation strategies and GPU memory allocation. It analyzes the limitations of the existing DTR (Dynamic Tensor Rematerialization) approach, proposes three modules (recomputable in‑place, op‑guided tensor allocation, and layout‑aware eviction), and reports superior experimental results.
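
For context on the DTR baseline the article analyzes: when memory runs short, DTR evicts the tensor with the lowest score h(t) = c(t) / (m(t) · s(t)), where c is the cost of recomputing it, m is its size, and s is the time since it was last accessed. The sketch below is a simplified version of that heuristic that scores each tensor by its own recompute cost rather than DTR's neighborhood‑aware cost. The `Tensor` record, `dtr_score`, and `evict_until_fits` are hypothetical names, and this is not OneFlow's or Coop's code.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    """Hypothetical bookkeeping record for one evictable tensor."""
    name: str
    size: int            # bytes occupied on the GPU
    compute_cost: float  # time needed to recompute it from its inputs
    last_access: int     # logical timestamp of its most recent use


def dtr_score(t: Tensor, now: int) -> float:
    """Simplified DTR heuristic: cheap-to-recompute, large, stale tensors score low."""
    staleness = max(now - t.last_access, 1)  # avoid division by zero
    return t.compute_cost / (t.size * staleness)


def evict_until_fits(resident: list[Tensor], request: int, budget: int,
                     now: int) -> list[str]:
    """Return names of tensors to evict so `request` more bytes fit within `budget`."""
    used = sum(t.size for t in resident)
    evicted: list[str] = []
    # Evict in ascending score order until the new allocation fits.
    for t in sorted(resident, key=lambda x: dtr_score(x, now)):
        if used + request <= budget:
            break
        used -= t.size
        evicted.append(t.name)
    if used + request > budget:
        raise MemoryError("request cannot fit even after evicting every tensor")
    return evicted


if __name__ == "__main__":
    pool = [
        Tensor("a", size=40, compute_cost=10.0, last_access=9),
        Tensor("b", size=30, compute_cost=1.0, last_access=2),
        Tensor("c", size=20, compute_cost=5.0, last_access=5),
    ]
    print(evict_until_fits(pool, request=30, budget=100, now=10))  # ['b']
    print(evict_until_fits(pool, request=60, budget=100, now=10))  # ['b', 'c']
```

The heuristic chooses victims purely by score, without regard to where they sit in memory, so the freed bytes need not form one contiguous block. That placement question is what a layout‑aware eviction policy, such as the module described in the article, takes into account.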

Deep Learning · Dynamic Graph · GPU Memory
