Tag

Horovod


DataFunSummit
Nov 29, 2021 · Artificial Intelligence

Horovod Distributed Training Plugin: Design, Usage, and Deadlock Prevention

This article reviews Horovod, a popular third‑party plugin for distributed deep‑learning training. It covers the three‑line integration, the deadlock risk in all‑reduce operations, and the architectural components (background threads, coordinators, and MPI/Gloo controllers) that enable scalable and efficient data‑parallel training.
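The deadlock risk mentioned above arises when workers reach the same collective operations in different orders, so each one blocks waiting for peers that are busy with another tensor. The following is a toy sketch of the coordinator idea, not Horovod's actual code: a tensor is released for all‑reduce only once every rank has reported it ready, which gives all ranks one agreed order. The function name `coordinate` and the event format are assumptions made for illustration.

```python
from collections import defaultdict


def coordinate(ready_events, num_ranks):
    """Return the order in which tensors are released for all-reduce.

    ready_events: iterable of (rank, tensor_name) pairs in arrival order
                  (hypothetical format, chosen for this sketch).
    A tensor is released only after every rank has reported it ready,
    so all ranks execute the collectives in the same order.
    """
    ranks_ready = defaultdict(set)  # tensor_name -> ranks that reported it
    release_order = []
    for rank, name in ready_events:
        ranks_ready[name].add(rank)
        if len(ranks_ready[name]) == num_ranks:
            release_order.append(name)
    return release_order


# Rank 0 finishes "a" first, rank 1 finishes "b" first.
events = [(0, "a"), (1, "b"), (0, "b"), (1, "a")]
order = coordinate(events, num_ranks=2)
```

Without such agreement, rank 0 would start the all‑reduce for "a" while rank 1 starts the one for "b", and neither would complete.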

Data Parallel · Gloo · Horovod
8 min read
Tencent Cloud Developer
May 22, 2020 · Artificial Intelligence

Distributed Training for WeChat Scan-to-Identify Using Horovod, MPI, and NCCL

WeChat’s Scan‑to‑Identify system now trains its CNN models across multiple GPUs using Horovod’s data‑parallel, synchronous Ring All‑Reduce architecture built on MPI and NCCL. This cuts training time from several days to under one day while maintaining accuracy. Future work will target I/O and further scaling.
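As a rough illustration of the Ring All‑Reduce pattern named above, here is a pure‑Python simulation, assuming each worker's gradient is a flat list whose length is divisible by the number of workers. It is a sketch of the general algorithm (reduce‑scatter followed by all‑gather), not the Horovod, MPI, or NCCL implementation; the function name `ring_allreduce_mean` is hypothetical.

```python
def ring_allreduce_mean(grads):
    """Simulate synchronous ring all-reduce and return averaged gradients.

    grads: list of per-worker gradient lists, all of the same length,
           with that length divisible by the number of workers.
    """
    n = len(grads)
    length = len(grads[0])
    assert length % n == 0, "sketch assumes evenly divisible chunks"
    k = length // n
    data = [list(g) for g in grads]

    def chunk(c):
        return range(c * k, (c + 1) * k)

    # Reduce-scatter: in step s, worker r sends chunk (r - s) % n to its
    # right neighbour, which adds it to its own copy.
    for s in range(n - 1):
        sends = [(r, (r - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in chunk(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(chunk(c), payload):
                data[dst][i] += v

    # All-gather: worker r now owns the fully reduced chunk (r + 1) % n
    # and passes completed chunks around the ring.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in chunk(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(chunk(c), payload):
                data[dst][i] = v

    # Data-parallel training typically averages the summed gradients.
    return [[v / n for v in worker] for worker in data]


result = ring_allreduce_mean([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each worker sends and receives only one chunk per step, so per‑worker traffic stays roughly constant as workers are added, which is why the pattern scales well across GPUs.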

AI · Horovod · MPI
12 min read