Baidu Geek Talk
Apr 19, 2023 · Artificial Intelligence
Why Does Recompute Crash Distributed Training? A Deep Dive into Checkpoint Issues and Fixes
When training deep learning models with large batch sizes, developers often use recompute (activation checkpointing) to trade computation for memory. In dynamic graph frameworks, however, this can trigger synchronization errors in distributed data parallel (DDP) training. The article explains the underlying DDP mechanics, illustrates the error, and offers a practical no_sync workaround with code examples.
Checkpoint · Distributed Training · PyTorch
14 min read
