How DeepSpeed Ulysses Cuts Communication Overhead Compared to Megatron

This article gives a technical analysis of DeepSpeed Ulysses: it walks through the sequence‑parallel workflow, compares its communication volume with Megatron's, and examines how All2All operations and ZeRO‑3 integration affect scalability and efficiency.

Baobao Algorithm Notes

Nov 4, 2024

1. Overall Ulysses Workflow

Ulysses processes a sequence of length N with hidden size d across P GPUs. To keep the shapes simple, this walkthrough assigns one attention head to each GPU, so P also serves as the head count. The input X of shape (N, d) is split into P sequence chunks of shape (N/P, d), one per GPU.

N = seq_len

d = hidden_size

P = gpu_num (also interpreted as the head count)

The forward pass proceeds as follows:

Chunk the input along the sequence dimension. Each GPU receives a (N/P, d) chunk.

Compute local Q, K, V chunks. Because every GPU holds the full (d, d) projection matrices, each GPU produces Q/K/V chunks of shape (N/P, d).

All2All communication of Q/K/V. After the All2All, each GPU holds Q/K/V for the full sequence but only one head, with shape (N, d/P).

Local attention computation. Each GPU computes attention on its head, producing an output chunk of shape (N, d/P).

Second All2All to restore the original chunk shape. The output is reshaped back to (N/P, d).

MLP computation. Since MLPs are independent across sequence chunks, each GPU processes its chunk without further communication.

Repeat for each layer until the loss is computed.

The loss for a single GPU corresponds to the loss of its assigned sequence chunk, implying that gradient aggregation will later require an AllReduce across GPUs.
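The shape flow of the steps above can be checked with a small single-process simulation. This is only a sketch under stated assumptions: the P "GPUs" are simulated as entries of a Python list, `all_to_all` is a hypothetical stand-in for the real collective, there is one head per rank, and no causal mask or output projection is applied. It is not the DeepSpeed implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def all_to_all(send):
    """Simulated All2All (hypothetical helper): send[src][dst] is the block
    that rank src sends to rank dst; returns recv[dst][src]."""
    P = len(send)
    return [[send[src][dst] for src in range(P)] for dst in range(P)]

def ulysses_attention(X, Wq, Wk, Wv, P):
    """Returns a list of P output chunks, each of shape (N/P, d)."""
    N, d = X.shape
    dh = d // P                            # one head per rank (assumption)
    chunks = np.split(X, P, axis=0)        # step 1: (N/P, d) per rank

    def project_and_exchange(W):
        local = [c @ W for c in chunks]    # step 2: (N/P, d) per rank
        # step 3: rank src sends head slice j, shape (N/P, d/P), to rank j
        send = [np.split(t, P, axis=1) for t in local]
        recv = all_to_all(send)
        return [np.concatenate(r, axis=0) for r in recv]   # (N, d/P) per rank

    Q = project_and_exchange(Wq)
    K = project_and_exchange(Wk)
    V = project_and_exchange(Wv)

    # step 4: each rank attends over the full sequence for its own head
    O = [softmax(q @ k.T / np.sqrt(dh)) @ v for q, k, v in zip(Q, K, V)]

    # step 5: rank src sends sequence slice j, shape (N/P, d/P), to rank j
    send = [np.split(o, P, axis=0) for o in O]
    recv = all_to_all(send)
    return [np.concatenate(r, axis=1) for r in recv]       # (N/P, d) per rank

def reference_attention(X, Wq, Wk, Wv, P):
    """Single-device multi-head attention with P heads, for comparison."""
    N, d = X.shape
    dh = d // P
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(P):
        s = slice(h * dh, (h + 1) * dh)
        heads.append(softmax(Q[:, s] @ K[:, s].T / np.sqrt(dh)) @ V[:, s])
    return np.concatenate(heads, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d, P = 8, 4, 2
    X = rng.standard_normal((N, d))
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    out = np.concatenate(ulysses_attention(X, Wq, Wk, Wv, P), axis=0)
    print("max abs difference:", np.abs(out - reference_attention(X, Wq, Wk, Wv, P)).max())
```

Stacking the per-rank outputs along the sequence dimension should reproduce the single-device result, which is what makes the two All2Alls a pure re-layout rather than a change to the computation.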

2. Megatron vs. Ulysses Communication

2.1 Megatron Communication Volume

Megatron combines tensor parallelism (TP) with sequence parallelism (SP); the all‑gather and reduce‑scatter collectives counted here come from that combination, not from pipeline parallelism. In the attention block, the forward pass performs one all‑gather and one reduce‑scatter, and the backward pass performs the same two operations in reverse. The MLP block follows the same pattern, contributing another two all‑gathers and two reduce‑scatters over forward and backward. Ignoring extra all‑gather steps that can be overlapped with computation, the total per layer is 4 all‑gathers + 4 reduce‑scatters. Each is counted as moving Nd elements, for a total communication volume of 8Nd.

2.2 Ulysses Communication Volume

Ulysses relies on All2All operations. Each All2All moves data of size (N·d)/P per GPU. The forward pass performs four All2Alls (three for Q/K/V and one for the combined attention result). The backward pass also performs four All2Alls for gradients of Q, K, V, and the attention output. Assuming no overlap, the total number of All2Alls is eight, each costing (N·d)/P, giving a total communication volume of (8Nd)/P.

2.3 Comparison

Megatron TP+SP: Per‑GPU communication of 8Nd per layer, independent of the number of GPUs.

DeepSpeed Ulysses: Per‑GPU communication of (8Nd)/P per layer, which shrinks as P grows. However, the All2All splits Q/K/V by head, so P cannot exceed the number of attention heads and scalability is not unlimited.
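To make the comparison concrete, the two formulas can be evaluated side by side. The sketch below simply encodes the counts from this section (8Nd versus 8Nd/P, in elements rather than bytes); the function names and the `num_heads` parameter are hypothetical, and the example sizes are arbitrary illustrative values rather than measurements.

```python
def megatron_volume(N: int, d: int) -> int:
    """Per-GPU, per-layer volume in elements: 4 all-gathers + 4 reduce-scatters,
    each counted as N*d elements, as in the analysis above."""
    return (4 + 4) * N * d

def ulysses_volume(N: int, d: int, P: int, num_heads: int) -> float:
    """Per-GPU, per-layer volume in elements: 8 All2Alls of N*d/P elements each.
    Assumes the heads must divide evenly across the P GPUs."""
    if P > num_heads or num_heads % P != 0:
        raise ValueError("P must divide num_heads and cannot exceed it")
    return 8 * N * d / P

if __name__ == "__main__":
    N, d, num_heads = 32768, 4096, 32   # illustrative sizes only
    for P in (2, 4, 8, 16, 32):
        ratio = megatron_volume(N, d) / ulysses_volume(N, d, P, num_heads)
        print(f"P={P:2d}  Megatron/Ulysses volume ratio = {ratio:.0f}")
```

Under these formulas the ratio between the two is exactly P, and the `ValueError` branch reflects the head-count ceiling on P noted above.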

3. Ulysses + ZeRO‑3 Integration

When combined with ZeRO‑3, the model weights are partitioned across all GPUs; with sp_size = 2 and dp_size = 2, the four GPUs hold shards M0 – M3. Before the forward computation, an all‑gather reassembles the full weights on each GPU, after which the standard Ulysses workflow proceeds. This adds an extra communication step but leaves the Ulysses communication pattern unchanged.
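The sharding and gathering described here can be illustrated with a toy example. This is a sketch under assumptions, not DeepSpeed's implementation: the four ranks are simulated in one process, the rank layout (consecutive ranks sharing a sequence-parallel group) is an assumed convention, and `rank_coords` and `all_gather` are hypothetical helpers.

```python
import numpy as np

sp_size, dp_size = 2, 2
world_size = sp_size * dp_size          # four simulated GPUs

def rank_coords(rank: int) -> tuple:
    """Map a global rank to (dp_index, sp_index).
    Assumed layout: consecutive ranks belong to the same SP group."""
    return rank // sp_size, rank % sp_size

# A toy (4, 4) weight matrix, flattened and split into shards M0..M3,
# one shard per rank across the combined DP x SP group.
W = np.arange(16, dtype=np.float32).reshape(4, 4)
shards = np.split(W.reshape(-1), world_size)

def all_gather(local_shards):
    """Simulated all-gather: every rank ends up with all shards concatenated."""
    full = np.concatenate(local_shards)
    return [full.copy() for _ in local_shards]

# After the all-gather, every rank holds the full weight matrix and can run
# the Ulysses steps on its own (N/P, d) sequence chunk.
gathered = [flat.reshape(W.shape) for flat in all_gather(shards)]

if __name__ == "__main__":
    for rank in range(world_size):
        dp_idx, sp_idx = rank_coords(rank)
        print(f"rank {rank}: dp={dp_idx} sp={sp_idx} holds shard M{rank} "
              f"with {shards[rank].size} elements")
```

In this layout, ranks that share a dp index would work on different chunks of the same sequence, while each rank stores only a quarter of the weights between uses.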

4. References

https://arxiv.org/pdf/2309.14509

https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/sequence/layer.py

https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/model/transformer.py

https://www.deepspeed.ai/tutorials/ds-sequence/
