Alibaba’s Distributed Training Boosts Neural Machine Translation Speed
Since its 2013 debut, Neural Machine Translation (NMT) has approached human quality, but it is expensive to train. In 2017 Alibaba’s team developed a distributed NMT system that uses data parallelism, model averaging, BMUF, Downpour SGD, and Ring‑AllReduce to cut training time from over 20 days to a few days while maintaining translation quality.
Background
Neural Machine Translation (NMT) was first proposed in academia in 2013 and has rapidly improved, reaching near‑human translation quality for some language pairs and scenarios.
Alibaba’s translation team began independent NMT research in October 2016 and, by November, applied NMT outputs to Chinese‑English messaging with notable quality gains.
Training Challenges and Distributed Solution
Because NMT models are large and training requires massive computation, training on a single GPU with 30 million sentence pairs can take more than 20 days. To address this, in February 2017 Alibaba’s translation team collaborated with Alibaba Cloud’s Large Scale Learning group to develop a distributed NMT system, completing the first version by the end of March 2017.
Project Results
In an English‑Russian e‑commerce translation project (April 2017), the distributed system reduced training time from 20 days to 4 days, greatly accelerating iteration.
With 4 GPUs the convergence speedup exceeds 3×, with 8 GPUs over 5×, and with 16 GPUs over 9×; further scaling is expected to continue improving the acceleration ratio.
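The speedup figures above can be read as scaling efficiency, that is, the fraction of ideal linear speedup actually achieved. The short sketch below does that arithmetic. The figures are the lower bounds quoted in the text, and the helper name scaling_efficiency is illustrative only.

```python
# Lower-bound convergence speedups quoted in the text: number of GPUs -> speedup.
reported_speedup = {4: 3.0, 8: 5.0, 16: 9.0}


def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear speedup (speedup == num_gpus) that was achieved."""
    return speedup / num_gpus


for gpus, speedup in reported_speedup.items():
    eff = scaling_efficiency(speedup, gpus)
    print(f"{gpus:>2} GPUs: speedup > {speedup:.0f}x, efficiency > {eff:.1%}")
```

Read this way, absolute speedup keeps growing with GPU count while per‑GPU efficiency falls, which is the usual pattern when communication cost grows with cluster size.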
Implementation Details
Data Parallel (Synchronous SGD)
Each worker holds a full model replica and processes a portion of each mini‑batch. After computing local gradients, workers push gradients to a parameter server and pull updated parameters. The total communication per iteration is 2 × num_of_gpus × model_size.
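As a rough illustration of the push/pull cycle described above, the following sketch simulates synchronous data‑parallel SGD in a single process. The one‑parameter model y = w * x, the data, and all function names are assumptions made for the example and are not taken from Alibaba's system.

```python
def local_gradient(weights, shard):
    """Mean-squared-error gradient of the toy model y = w * x on one worker's shard."""
    w = weights[0]
    return [sum(2 * (w * x - y) * x for x, y in shard) / len(shard)]


def sync_sgd_step(weights, shards, lr):
    """One synchronous iteration: workers push gradients, the server averages and updates."""
    pushed = [local_gradient(weights, shard) for shard in shards]  # push phase
    averaged = [sum(g[i] for g in pushed) / len(pushed) for i in range(len(weights))]
    # The returned weights are what every worker pulls before the next iteration.
    return [w - lr * g for w, g in zip(weights, averaged)]


def comm_per_iteration(num_gpus, model_size):
    """Traffic per iteration: each GPU pushes one gradient and pulls one model copy."""
    return 2 * num_gpus * model_size


# Toy mini-batch generated from y = 3x, split evenly across two workers.
batch = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [batch[0:2], batch[2:4]]
weights = [0.0]
for _ in range(200):
    weights = sync_sgd_step(weights, shards, lr=0.01)
print(weights)  # approaches [3.0]
```

Because the traffic term grows linearly with the number of GPUs, the parameter server becomes the bottleneck as the cluster grows.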
Hybrid Parallel
Hybrid parallelism combines data parallelism for convolutional layers with model parallelism for fully‑connected layers. In NMT models the layers are similar in size and computation, which makes hybrid parallelism less effective.
Exploring Distributed Strategies
Beyond basic model averaging (MA), the team evaluated Downpour SGD, AllReduce SGD, and BMUF (Blockwise Model‑Update Filtering) to improve scalability.
Model Average Scheme
Each worker trains locally; periodically the workers’ models are averaged and the average is broadcast as the new baseline. Implementation uses TensorFlow graphs split into forward‑backward sub‑graphs on workers, a Reduce‑Average sub‑graph on the parameter server, and a broadcast sub‑graph.
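The averaging cycle can be sketched without TensorFlow as plain Python. This is only a minimal single‑process simulation: the three functions loosely mirror the forward‑backward, Reduce‑Average, and broadcast sub‑graphs, and the toy model y = w * x and all names are assumptions made for the example.

```python
def local_train(weights, shard, lr, steps):
    """Worker side: run several local SGD steps on a private copy of the model."""
    w = list(weights)
    for _ in range(steps):
        grad = sum(2 * (w[0] * x - y) * x for x, y in shard) / len(shard)
        w[0] -= lr * grad
    return w


def average_models(models):
    """Server side: element-wise average of the workers' models."""
    n = len(models)
    return [sum(m[i] for m in models) / n for i in range(len(models[0]))]


def model_average_round(global_weights, shards, lr, local_steps):
    """One round: broadcast the baseline, train locally, average into a new baseline."""
    local_models = [local_train(global_weights, s, lr, local_steps) for s in shards]
    return average_models(local_models)


# Two workers, each holding a shard of data generated from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weights = [0.0]
for _ in range(50):
    weights = model_average_round(weights, shards, lr=0.01, local_steps=5)
print(weights)  # approaches [3.0]
```

Workers communicate once per round rather than once per mini‑batch, which is what reduces traffic compared with synchronous SGD.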
Hyper‑parameter Tuning
When using 2 machines with 4 GPUs, keeping baseline batch size and learning rate leads to a maximum speedup of 1.5×; the learning rate should be scaled by the number of GPUs to compensate for gradient averaging.
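The reason for scaling the learning rate can be seen in a simplified case. The derivation below assumes that each of the N workers takes a single local SGD step from the same weights w with learning rate \eta and gradient g_i; real training runs many local steps per averaging round, so this is only an approximation.

```latex
\begin{align*}
w_i &= w - \eta\, g_i, \qquad i = 1, \dots, N \\
\bar{w} &= \frac{1}{N} \sum_{i=1}^{N} w_i
         = w - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} g_i \\
\bar{w}\big|_{\eta \to N\eta} &= w - \eta \sum_{i=1}^{N} g_i
\end{align*}
```

With the baseline learning rate, averaging shrinks the combined step to the size of a single worker's step, so N workers consume N times the data without moving the model further. Multiplying the learning rate by N restores a step equal to the sum of the N individual updates.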
BMUF Enhancement
BMUF adds a momentum term to the averaged model update instead of applying the plain average directly, reducing the need to raise the learning rate as the number of GPUs grows. The method introduces a block learning rate (blr) and a block momentum rate (bmr) to keep the effective learning rate stable.
Experimental results show BMUF achieves comparable or better convergence than plain MA while maintaining higher learning rates.
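A minimal sketch of one BMUF synchronization step follows. It uses the classical block‑momentum form of the commonly published BMUF formulation; which variant (classical or Nesterov) the team used is not stated in the text, and the function and argument names are assumptions made for the example.

```python
def bmuf_update(global_weights, local_models, prev_delta, blr, bmr):
    """One BMUF synchronization step (classical block momentum).

    global_weights: baseline model broadcast at the start of the block
    local_models:   models returned by the workers after local training
    prev_delta:     filtered update applied at the previous synchronization
    blr, bmr:       block learning rate and block momentum rate
    """
    n = len(local_models)
    averaged = [sum(m[i] for m in local_models) / n for i in range(len(global_weights))]
    # Block-level update: how far plain model averaging would move the baseline.
    block_update = [a - g for a, g in zip(averaged, global_weights)]
    # Filter the update with momentum instead of applying it directly.
    delta = [bmr * d + blr * b for d, b in zip(prev_delta, block_update)]
    new_weights = [g + d for g, d in zip(global_weights, delta)]
    return new_weights, delta


# With blr = 1 and bmr = 0 the step reduces to plain model averaging.
print(bmuf_update([0.0], [[1.0], [3.0]], [0.0], blr=1.0, bmr=0.0))
# With momentum, part of the previous update is carried over.
print(bmuf_update([0.0], [[1.0], [3.0]], [1.0], blr=1.0, bmr=0.5))
```

If the block update stayed constant from round to round, the filtered update would settle at blr / (1 - bmr) times that update (a geometric series), so bmr can grow with the worker count in place of the base learning rate.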
Downpour SGD (Asynchronous SGD)
Workers pull model weights from the parameter server, train locally, accumulate gradients, and asynchronously push accumulated gradients back. Additional hyper‑parameters include push/pull step intervals and gradient clipping norm to avoid NaNs.
Experiments indicate that larger batch sizes and learning rates improve speed but may degrade accuracy; careful tuning of step intervals and clipping norms is required.
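The pull, local‑train, accumulate, and push cycle can be illustrated with a small simulation. The sketch below runs the workers round‑robin in one process rather than truly asynchronously, and the class names, the n_fetch and n_push intervals, and the toy model y = w * x are all assumptions made for the example.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]


class ParameterServer:
    def __init__(self, weights, lr):
        self.weights, self.lr = list(weights), lr

    def pull(self):
        return list(self.weights)

    def push(self, grad):
        # Apply a worker's accumulated gradient as soon as it arrives.
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grad)]


class Worker:
    def __init__(self, server, shard, lr, n_fetch, n_push, clip_norm):
        self.server, self.shard, self.lr = server, shard, lr
        self.n_fetch, self.n_push, self.clip_norm = n_fetch, n_push, clip_norm
        self.weights = server.pull()
        self.accum = [0.0]
        self.step = 0

    def train_step(self):
        if self.step % self.n_fetch == 0:  # refresh the local replica
            self.weights = self.server.pull()
        x, y = self.shard[self.step % len(self.shard)]
        grad = clip_by_norm([2 * (self.weights[0] * x - y) * x], self.clip_norm)
        self.weights[0] -= self.lr * grad[0]  # local update
        self.accum[0] += grad[0]              # accumulate for the next push
        self.step += 1
        if self.step % self.n_push == 0:      # push without waiting for other workers
            self.server.push(self.accum)
            self.accum = [0.0]


server = ParameterServer([0.0], lr=0.01)
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
workers = [Worker(server, s, lr=0.01, n_fetch=2, n_push=2, clip_norm=5.0) for s in shards]
for _ in range(200):
    for worker in workers:  # round-robin stands in for asynchronous execution
        worker.train_step()
print(server.weights)  # approaches [3.0]
```

Larger push and pull intervals reduce traffic but let the worker replicas drift further from the server copy, which is the speed versus accuracy trade‑off described above.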
Ring‑AllReduce SGD
Ring‑AllReduce splits the model into shards and performs a reduce‑scatter followed by an all‑gather, which keeps the communication volume per GPU roughly constant regardless of GPU count. Compared with gRPC, CUDA‑aware MPI offers lower latency.
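The two phases can be sketched as a single‑process simulation in which each worker only ever sends to its right‑hand neighbour. This illustrates the general ring algorithm, not the team's MPI implementation; the function name and the sent‑element counter are assumptions made for the example.

```python
def ring_allreduce(worker_vectors):
    """Sum equal-length vectors across workers arranged in a ring.

    Returns (per-worker result vectors, number of elements each worker sent).
    """
    n = len(worker_vectors)
    length = len(worker_vectors[0])
    if length % n != 0:
        raise ValueError("vector length must be divisible by the number of workers")
    size = length // n
    # data[r][c] is worker r's current copy of chunk c.
    data = [[list(v[c * size:(c + 1) * size]) for c in range(n)] for v in worker_vectors]
    sent = 0

    # Phase 1, reduce-scatter: after n - 1 steps worker r holds the full sum of chunk (r + 1) % n.
    for step in range(n - 1):
        outgoing = [(r, (r - step) % n) for r in range(n)]
        payloads = [list(data[r][c]) for r, c in outgoing]  # all sends in a step are simultaneous
        for (r, c), payload in zip(outgoing, payloads):
            dst = (r + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], payload)]
            sent += len(payload)

    # Phase 2, all-gather: circulate the finished chunks until every worker has all of them.
    for step in range(n - 1):
        outgoing = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [list(data[r][c]) for r, c in outgoing]
        for (r, c), payload in zip(outgoing, payloads):
            data[(r + 1) % n][c] = payload
            sent += len(payload)

    results = [[x for chunk in row for x in chunk] for row in data]
    return results, sent // n


grads = [[float(r)] * 8 for r in range(4)]  # four workers, eight gradient values each
summed, sent_per_worker = ring_allreduce(grads)
averaged = [x / len(grads) for x in summed[0]]  # divide by worker count to average
```

Each worker sends 2 × (N − 1) / N × model_size elements in total, which stays below 2 × model_size however many workers join the ring.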
With a total batch size eight times the baseline, the system achieved a 4× compute speedup on InfiniBand (10 GB/s) and 2.56× on 10 Gb Ethernet, though convergence peak was slightly lower than baseline.
Future Work
Next steps include further exploiting distributed training acceleration through system‑algorithm co‑optimization, decoupling optimization strategies from model architecture for componentized scaling, and applying model compression and architecture simplification to improve inference speed and reduce online latency and hardware cost.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
