Tagged articles
2 articles
Page 1 of 1
DataFunSummit
DataFunSummit
Nov 29, 2021 · Artificial Intelligence

Horovod Distributed Training Plugin: Design, Usage, and Deadlock Prevention

This article reviews Horovod, a popular third‑party distributed deep‑learning training plugin, explaining its simple three‑line integration, the challenges of deadlocks in all‑reduce operations, and the architectural components—including background threads, coordinators, and MPI/Gloo controllers—that enable scalable and efficient data‑parallel training.

Data ParallelDeep LearningDistributed Training
0 likes · 8 min read
Horovod Distributed Training Plugin: Design, Usage, and Deadlock Prevention