Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained

A recent migration of a multimodal image inference system from an internal network to a cloud environment revealed that NVLink bridges dramatically improve multi‑GPU inference speed by reducing inter‑GPU communication overhead, while tensor‑parallel and data‑parallel strategies each have distinct trade‑offs for model deployment.

Instant Consumer Technology Team

Environment Migration Questions

During a recent migration of a multimodal image inference system from an internal network to a cloud environment, both setups used dual RTX 3090 GPUs and otherwise identical hardware. Yet the cloud setup showed several-fold higher latency for single-image inference with the Qwen2.5-VL 7B model (bf16, non-quantized) served by vLLM, despite identical software versions and launch parameters.

Network latency was suspected first but ruled out. The real difference turned out to be hardware: the internal machine had an NVLink bridge connecting the two GPUs, while the cloud instance lacked one. Even a lightweight inference workload can be significantly affected by the absence of NVLink.

Tensor Parallelism

vLLM offers a --tensor-parallel-size (or -tp) parameter that controls how many GPUs the model's tensors are sharded across. From the vLLM documentation:

--tensor-parallel-size, -tp
Number of tensor-parallel replicas.
Default: 1
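
The same knob is exposed in vLLM's offline Python API as the tensor_parallel_size argument. The sketch below is illustrative only: the model name and prompt are placeholders, and it assumes two visible GPUs.

# Shard the model's weight matrices across 2 GPUs (same effect as `vllm serve ... -tp 2`)
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # illustrative model id
    tensor_parallel_size=2,               # split tensors across both GPUs
    dtype="bfloat16",
)
outputs = llm.generate(["Describe this picture."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)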

Tensor parallelism splits large weight matrices across GPUs to enable parallel computation of matrix multiplications such as Y = X * W. When a weight matrix W is too large for a single GPU, it is divided vertically (e.g., into W1 and W2), the input X is duplicated to each GPU, and each GPU computes its partial result.

Split weights: Divide W vertically into chunks like W1, W2.

Copy inputs: Replicate the input vector X to every GPU responsible for a chunk.

Parallel compute: Each GPU multiplies its chunk of W with the full X.

GPU 0 computes Y1_part = X * W1; GPU 1 computes Y2_part = X * W2.

Merge results: Concatenate Y1_part and Y2_part to obtain the final Y = [Y1_part, Y2_part].

This process mirrors the MapReduce paradigm, where data is sharded, processed in parallel, and then combined.
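
The arithmetic is easy to verify on a CPU. The toy NumPy sketch below simulates the two GPUs, splits W by columns, and checks that the merged result matches the ordinary single-device product.

# Toy simulation of column-wise tensor parallelism (two "GPUs" on one CPU)
import numpy as np

X = np.random.randn(4, 8)        # input activations, copied to every GPU
W = np.random.randn(8, 6)        # full weight matrix
W1, W2 = np.split(W, 2, axis=1)  # split W vertically (by columns)

Y1_part = X @ W1                 # GPU 0's partial result
Y2_part = X @ W2                 # GPU 1's partial result

Y = np.concatenate([Y1_part, Y2_part], axis=1)  # merge (the All-Gather step)
assert np.allclose(Y, X @ W)     # identical to computing on a single device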

NVLink

NVLink bridges provide a high-speed direct connection between GPUs, so the All-Gather step of tensor parallelism no longer has to route data through the CPU over the PCIe bus. This dramatically reduces communication overhead: performance measurements show roughly a ten-fold bandwidth advantage over PCIe for inter-GPU data exchange.

To check whether a system has NVLink, run:

nvidia-smi topo -m

The output is a matrix listing the link between every GPU pair; an entry such as NV4 (a bonded set of four NVLink links) between the two GPUs confirms NVLink connectivity.

If NVLink is absent, GPUs fall back to the default PCIe bus for communication.
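
If this check needs to be scripted, one rough approach (an assumption, not an official nvidia-smi interface) is to scan the topology matrix for NV entries:

# Rough sketch: detect NVLink by scanning `nvidia-smi topo -m` output for NV<k> cells
import re
import subprocess

topo = subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True, text=True, check=True,
).stdout
# Cells such as "NV4" mean the two GPUs are joined by a bonded set of 4 NVLinks
has_nvlink = any(re.search(r"\bNV\d+\b", line) for line in topo.splitlines())
print("NVLink detected" if has_nvlink else "No NVLink: GPUs fall back to PCIe")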

Data Parallelism

When NVLink is unavailable or the model fits within a single GPU’s memory, data parallelism can be used. Each GPU loads the full model independently and processes separate requests without any inter‑GPU communication, effectively acting as stateless compute nodes. While this does not speed up a single request, it increases overall system throughput and fault tolerance.

vLLM does not yet provide built-in data-parallel serving, but the same effect can be achieved by running one vLLM instance per GPU behind a simple Nginx reverse proxy.

# On GPU 0
CUDA_VISIBLE_DEVICES=0 vllm serve your-model \
  --port 8000 \
  --max-num-seqs 8192

# On GPU 1
CUDA_VISIBLE_DEVICES=1 vllm serve your-model \
  --port 8001 \
  --max-num-seqs 8192

Configure Nginx to balance requests across the instances:

http {
  upstream vllm_servers {
    least_conn;             # route each request to the instance with the fewest active connections
    server localhost:8000;  # GPU0 instance
    server localhost:8001;  # GPU1 instance
    # Add more instances...
  }

  server {
    listen 8080;
    location / {
      proxy_pass http://vllm_servers;
      # Retry the other instance on errors, timeouts, or 503 responses
      proxy_next_upstream error timeout http_503;
    }
  }
}

This setup spreads incoming requests across the GPUs while keeping each request isolated to a single instance.
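
Clients then talk to port 8080 as if it were a single vLLM server. A quick sanity check against the OpenAI-compatible endpoint might look like the following; the model name is whatever was passed to vllm serve.

# Send a chat request through the Nginx proxy; either GPU instance may serve it
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "your-model",  # must match the model passed to `vllm serve`
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])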

Other Parallel Strategies

Beyond tensor and data parallelism, vLLM also supports expert parallelism (for MoE models such as DeepSeek, Qwen3MoE, Llama‑4) and pipeline parallelism. These strategies split model components or layers across GPUs in different ways, but detailed coverage is omitted here; refer to the vLLM documentation for more information.

Tags: vLLM, Tensor Parallelism, GPU inference, multi-GPU, Data Parallel, NVLink, AI performance