Optimizing Multi‑Node Distributed LLM Inference with ACK Gateway and vLLM
This article presents a step‑by‑step guide for deploying and optimizing large‑language‑model inference across multiple GPU‑enabled nodes using ACK Gateway with Inference Extension, vLLM’s tensor‑ and pipeline‑parallel techniques, and Kubernetes resources such as LeaderWorkerSet, PVCs, and custom routing policies, followed by performance benchmarking and analysis.
The ACK Gateway with Inference Extension component is designed for LLM inference scenarios. It provides Layer 4 and Layer 7 traffic routing, plus load balancing that is aware of model-server load. It also supports custom traffic-splitting strategies, such as canary (gray) releases and traffic mirroring, through the InferencePool and InferenceModel CRDs.
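As a rough illustration of what load-aware balancing means, the sketch below picks the model server with the lowest load score. It is not the gateway's actual algorithm: the `Endpoint` fields, the `cache_weight` parameter, and the scoring formula are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    waiting_requests: int   # hypothetical metric: requests queued on the server
    kv_cache_usage: float   # hypothetical metric: KV cache fill ratio, 0.0 to 1.0


def load_score(endpoint: Endpoint, cache_weight: float = 10.0) -> float:
    """Combine queue depth and cache pressure into one number (lower is better)."""
    return endpoint.waiting_requests + cache_weight * endpoint.kv_cache_usage


def pick_endpoint(endpoints: list[Endpoint], cache_weight: float = 10.0) -> Endpoint:
    """Return the endpoint with the lowest load score."""
    if not endpoints:
        raise ValueError("no endpoints available")
    return min(endpoints, key=lambda e: load_score(e, cache_weight))


servers = [
    Endpoint("leader-0", waiting_requests=5, kv_cache_usage=0.2),
    Endpoint("leader-1", waiting_requests=1, kv_cache_usage=0.9),
    Endpoint("leader-2", waiting_requests=3, kv_cache_usage=0.1),
]
print(pick_endpoint(servers).name)
```

A round-robin balancer would ignore both metrics, so a server with a nearly full cache could still receive new long prompts. Scoring on server state is what distinguishes the load-aware approach.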
vLLM provides high‑performance inference for massive language models by employing tensor parallelism (splitting weight matrices across GPUs) and pipeline parallelism (partitioning model layers across devices), enabling efficient multi‑node deployment.
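The two parallelism schemes can be illustrated with a toy example in plain Python. This is a conceptual sketch only: the "devices" are simulated with lists, the helper names are made up for illustration, and none of it reflects how vLLM is implemented internally.

```python
def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Tensor parallelism: cut w column-wise into `parts` equal shards.
    Assumes the column count is divisible by `parts`."""
    step = len(w[0]) // parts
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Each simulated device multiplies x by its shard; results are concatenated."""
    out = []
    for shard in split_columns(w, parts):
        out.extend(matmul(x, shard))
    return out


def pipeline_forward(x, layers, stages):
    """Pipeline parallelism: consecutive layers are grouped into `stages`,
    and activations flow from one stage to the next.
    Assumes the layer count is divisible by `stages`."""
    per_stage = len(layers) // stages
    for s in range(stages):
        for w in layers[s * per_stage:(s + 1) * per_stage]:
            x = matmul(x, w)
    return x


x = [1.0, 2.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]
layers = [
    [[1.0, 1.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[1.0, 0.0], [1.0, 1.0]],
]

print(tensor_parallel_matmul(x, w, parts=2))   # same values as matmul(x, w)
print(pipeline_forward(x, layers, stages=2))   # same values as applying all layers in order
```

Tensor parallelism splits the work inside a single layer, so the shards must exchange results at every layer. Pipeline parallelism only passes activations at stage boundaries, which is why it is typically the scheme used across nodes.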
Environment preparation includes creating a GPU‑enabled Kubernetes cluster, ensuring at least four GPUs across nodes, and installing the LeaderWorkerSet controller.
Step 1 – Model data: download the QwQ-32B model, upload it to OSS, and configure a PersistentVolume (PV) and PersistentVolumeClaim (PVC) for the model files.
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/QwQ-32B.git
cd QwQ-32B
git lfs pull
# Return to the parent directory so that ./QwQ-32B resolves
cd ..
ossutil mkdir oss://<Your-Bucket-Name>/QwQ-32B
ossutil cp -r ./QwQ-32B oss://<Your-Bucket-Name>/QwQ-32B
Step 2 – Deploy inference service: apply a LeaderWorkerSet manifest that creates one leader pod and one worker pod, each with 2 GPUs. The two pods form a Ray cluster and run vLLM with --tensor-parallel-size 2 and --pipeline-parallel-size 2. The manifest below is abridged.
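The GPU arithmetic behind this layout can be checked with a short sketch. vLLM needs one GPU per (tensor rank, pipeline rank) pair, so the total is the product of the two sizes. The `ParallelLayout` class and its field names are hypothetical helpers, not part of vLLM or LeaderWorkerSet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    tensor_parallel_size: int
    pipeline_parallel_size: int
    gpus_per_pod: int

    @property
    def world_size(self) -> int:
        """Total GPUs required: one per (tensor rank, pipeline rank) pair."""
        return self.tensor_parallel_size * self.pipeline_parallel_size

    @property
    def pods_needed(self) -> int:
        """Number of pods (leader plus workers) needed to supply those GPUs."""
        if self.world_size % self.gpus_per_pod != 0:
            raise ValueError("world size must be a multiple of GPUs per pod")
        return self.world_size // self.gpus_per_pod


layout = ParallelLayout(tensor_parallel_size=2, pipeline_parallel_size=2, gpus_per_pod=2)
print(layout.world_size, layout.pods_needed)  # 4 2
```

With 2 GPUs per pod, the 2 x 2 layout needs four GPUs in two pods, which matches the one-leader, one-worker group and the four-GPU minimum from the environment preparation.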
kubectl apply -f- <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: qwq-dist-v1-cm
  labels:
    app: distributed-serving
    release: qwq-dist-v1
    role: leader
    servingName: qwq-dist
    servingType: distributed-serving
    servingVersion: v1
data:
  hostfile-0: |-
    qwq-dist-v1.qwq-dist-v1-0.default
    qwq-dist-v1.qwq-dist-v1-0-0.default
...
EOF
Step 3 – ACK Gateway configuration: create a GatewayClass and a Gateway exposing ports 8080 (standard HTTP routing) and 8081 (inference-extension routing), then define a BackendTrafficPolicy, a ClientTrafficPolicy, and an HTTPRoute that forwards traffic to the distributed-serving Service.
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
  - name: http
    protocol: HTTP
    port: 8080
  - name: llm-gw
    protocol: HTTP
    port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: reasoning-backend
spec:
  parentRefs:
  - name: inference-gateway
    sectionName: http
  rules:
  - backendRefs:
    - kind: Service
      name: qwq-dist-v1
      port: 8000
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /
EOF
Step 4 – Enable inference extension: create an InferencePool that selects the leader pods and an InferenceModel that routes 100 % of requests for the model name qwq to the QwQ-32B model.
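The effect of the weight field can be sketched as a weighted random choice among target models. This is an illustration under the assumption that weights act as relative proportions. The `choose_target` helper and the `qwq-canary` target are hypothetical and do not appear in the deployment.

```python
import random


def choose_target(targets, rng=random):
    """Pick one target model name, with probability proportional to its weight.

    `targets` is a list of (name, weight) pairs; `rng` is the random module or
    a random.Random instance, so tests can pass a seeded generator.
    """
    names = [name for name, _ in targets]
    weights = [weight for _, weight in targets]
    if sum(weights) <= 0:
        raise ValueError("total weight must be positive")
    return rng.choices(names, weights=weights, k=1)[0]


# A single target with weight 100 always receives the request.
print(choose_target([("qwq", 100)]))

# Hypothetical canary split: roughly 90% to the stable model, 10% to the canary.
rng = random.Random(0)
picks = [choose_target([("qwq", 90), ("qwq-canary", 10)], rng) for _ in range(10_000)]
print(picks.count("qwq") / len(picks))
```

With only one target, the weight value has no effect on routing. It starts to matter once a second target is added for a canary release.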
kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: reasoning-pool
spec:
  selector:
    app: distributed-serving
    release: qwq-dist-v1
    role: leader
  targetPortNumber: 8000
EOF
kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: reasoning-model
spec:
  modelName: qwq
  targetModels:
  - name: qwq
    weight: 100
EOF
Step 5 – Benchmarking: deploy a vLLM benchmark pod, download a ShareGPT dataset, and run the provided Python benchmark script against both ports: 8080 (standard HTTP routing) and 8081 (inference-extension routing). In the reported results, routing through the inference extension reduces the average time to first token (TTFT) from 10,909 ms to 7,336 ms, a reduction of about 33 %, and slightly increases token throughput.
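The relative TTFT reduction can be recomputed from the two averages quoted above. The `relative_reduction` helper is a hypothetical name used only for this calculation.

```python
def relative_reduction(before_ms: float, after_ms: float) -> float:
    """Return the percentage by which a latency dropped from before_ms to after_ms."""
    if before_ms <= 0:
        raise ValueError("baseline latency must be positive")
    return (before_ms - after_ms) / before_ms * 100.0


baseline_ttft_ms = 10_909   # average TTFT through port 8080 (standard routing)
extension_ttft_ms = 7_336   # average TTFT through port 8081 (inference extension)

print(f"{relative_reduction(baseline_ttft_ms, extension_ttft_ms):.1f}%")  # 32.8%
```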
Overall, the combination of ACK Gateway’s load‑aware routing, vLLM’s parallelism, and Kubernetes orchestration delivers lower latency, higher throughput, and better cache utilization for large‑scale LLM inference workloads.
