DataFunSummit
Nov 29, 2021 · Artificial Intelligence
Horovod Distributed Training Plugin: Design, Usage, and Deadlock Prevention
This article reviews Horovod, a popular third-party plugin for distributed deep-learning training. It explains Horovod's simple three-line integration, the risk of deadlock in all-reduce operations, and the architectural components (background threads, coordinators, and MPI/Gloo controllers) that enable scalable, efficient data-parallel training.
Data Parallel · Gloo · Horovod
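
The deadlock risk mentioned in the summary can be illustrated with a small, self-contained simulation. All-reduce is a collective operation, so every rank must enter it for the same tensor. If rank 0 starts reducing one gradient while rank 1 starts reducing another, each waits on the other indefinitely. The sketch below models the general idea of a coordinator that schedules a tensor only after every rank has reported it ready, which yields a single order that all ranks follow. It is a simplified model under stated assumptions, not Horovod's actual code: the function `coordinate`, the round-robin message arrival, and the tensor names `grad_a` and `grad_b` are all hypothetical.

```python
from collections import defaultdict


def coordinate(ready_orders):
    """Build one global all-reduce order from per-rank readiness reports.

    ready_orders[r] lists tensor names in the order rank r finished
    computing them. A tensor is scheduled only once every rank has
    reported it, so no rank enters a collective that the others skip.
    (Hypothetical helper for illustration; not part of Horovod's API.)
    """
    num_ranks = len(ready_orders)
    reported = defaultdict(set)  # tensor name -> ranks that reported it
    schedule = []

    # Assumption: reports arrive round-robin, one per rank per step.
    longest = max((len(order) for order in ready_orders), default=0)
    for step in range(longest):
        for rank, order in enumerate(ready_orders):
            if step >= len(order):
                continue
            name = order[step]
            reported[name].add(rank)
            if len(reported[name]) == num_ranks:
                schedule.append(name)
    return schedule


# The two ranks finish their gradients in opposite orders. Reducing each
# tensor as soon as it is locally ready would block both ranks forever.
rank0 = ["grad_a", "grad_b"]
rank1 = ["grad_b", "grad_a"]
schedule = coordinate([rank0, rank1])

# A tensor that only one rank ever reports is never scheduled.
partial = coordinate([["x", "y"], ["x"]])
```

Here both ranks would execute the all-reduce calls in the order given by `schedule` rather than in their local readiness order, so the collectives always match up across ranks.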