Tagged articles
2 articles
Baidu Geek Talk
Apr 19, 2023 · Artificial Intelligence

Why Does Recompute Crash Distributed Training? A Deep Dive into Checkpoint Issues and Fixes

When training deep learning models with large batches, developers often use recompute (checkpointing) to trade computation for memory. In dynamic graph frameworks, however, this can trigger synchronization errors in distributed data parallel (DDP) training. The article explains the underlying DDP mechanics, illustrates the error, and offers a practical no_sync workaround with code examples.

Checkpoint · Distributed Training · PyTorch
14 min read
Baidu Geek Talk
Apr 8, 2022 · Artificial Intelligence

Golang Object Pool for Reducing GC Pressure, FFmpeg Concurrency Control, and Paddle Static vs. Dynamic Graphs

The article covers three topics. It explains how Go's lock‑free sync.Pool can cut garbage‑collection overhead, shows how to tune FFmpeg thread parameters to balance CPU use and latency for video filtering versus encoding, and compares PaddlePaddle's static and dynamic graph modes, including debugging tips and dynamic‑to‑static conversion.

Deep Learning · Dynamic Graph · Static Graph
13 min read