Tackling the ‘Impossible Triangle’: Scaling vLLM on Alibaba Cloud GPU Reservations
This article examines the performance, cost, and stability challenges of large‑scale vLLM deployments, explains the “impossible triangle” dilemma, and provides a detailed, cloud‑native solution using Alibaba Cloud Function Compute GPU reserved instances with step‑by‑step deployment instructions and code examples.
The rapid growth of AI models like DeepSeek has created powerful inference capabilities, but individual users and enterprises face steep hurdles in deploying these large language models (LLMs) due to complex setup, high resource demands, and operational costs.
What is vLLM?
vLLM is an open‑source inference engine that handles the heavy lifting of model loading, request scheduling, and batching, and exposes an OpenAI‑compatible HTTP API for sending inference requests. It lowers the technical barrier and brings LLM capabilities closer to end‑users.
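Because the server speaks the OpenAI API, any OpenAI‑compatible client can call it. Here is a minimal sketch of an inference request; the endpoint URL and served model name are placeholders to be replaced with your deployment's values:

```python
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # placeholder: your vLLM endpoint
    api_key="EMPTY",  # vLLM accepts any key unless API-key auth is configured
)

response = client.chat.completions.create(
    model="Qwen/Qwen-14B-Chat",  # placeholder: must match --served-model-name
    messages=[{"role": "user", "content": "Explain vLLM in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```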
Challenges of Scaling vLLM
Enterprises encounter three major issues when scaling vLLM:
Massive parameter size – Models can occupy tens to hundreds of gigabytes, causing long download times and GPU loading delays.
High‑performance inference – Real‑time interaction requires sub‑second responses, demanding efficient caching and continuous conversation handling.
Context continuity – Maintaining dialogue state across requests needs stable long‑lived connections and coordinated resource management (see the sketch below).
Beyond these, large‑scale GPU clusters must manage resource utilization, peak‑valley load patterns, uneven load distribution, and the high cost of GPUs, leading to the so‑called “impossible triangle” of performance, cost, and stability.
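To make the context‑continuity point concrete: the HTTP API itself is stateless, so the client resends the accumulated message history on every turn, and the server's scheduling affinity and cache reuse determine how cheaply that history is reprocessed. A minimal multi‑turn sketch, again with a placeholder endpoint and model name:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")  # placeholder endpoint
history = []  # the client carries the dialogue state between requests

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="Qwen/Qwen-14B-Chat",  # placeholder: must match --served-model-name
        messages=history,  # the full history is resent on every turn
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

print(chat("My name is Ada."))
print(chat("What is my name?"))  # answerable only because the history was resent
```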
FC GPU Reserved Instances: A Balanced Solution
Alibaba Cloud Function Compute (FC) offers GPU reserved instances with idle‑billing, directly addressing the triangle:
Performance optimization – Pre‑started vLLM instances eliminate cold‑start latency; FC’s cache reuse ensures fast responses even under high concurrency.
Cost control – Idle‑billing charges only a small fee for reserved capacity while it sits idle; active usage is billed at normal rates, and timed reservations let capacity track predictable traffic patterns (see the sketch after this list).
Stability guarantee – FC's custom scheduler balances GPU memory and container placement, and supports long‑lived connections of up to 24 hours as well as WebSocket calls, keeping services stable during traffic spikes.
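As an illustration of timed reservations, the sketch below shows the general shape of a scheduled reservation policy: scale the reserved pool up ahead of the daily peak and back down afterwards. The field names are illustrative, not the exact Function Compute API schema; consult the FC documentation on reserved/provisioned instances for the authoritative format:

```python
# Illustrative only: the general shape of a scheduled reservation policy.
# Field names are hypothetical; check the FC documentation for the real schema.
provision_config = {
    "functionName": "vllm-inference",  # hypothetical function name
    "target": 2,                       # baseline of reserved vLLM instances
    "scheduledActions": [
        {
            "name": "scale-up-for-daytime-peak",
            "scheduleExpression": "cron(0 0 8 * * *)",   # daily at 08:00
            "target": 10,                                # pre-warm ten instances
        },
        {
            "name": "scale-down-overnight",
            "scheduleExpression": "cron(0 0 22 * * *)",  # daily at 22:00
            "target": 2,                                 # back to baseline
        },
    ],
}
```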
Step‑by‑Step vLLM Deployment on FC
1. Upload the official vLLM Docker image to Alibaba Cloud Container Registry (ACR).
2. Create a new GPU function in the FC console, selecting the appropriate runtime.
3. Configure the startup command, adding --enforce-eager to force eager execution (skipping CUDA graph capture), which shortens startup and reduces GPU memory overhead:

```bash
python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model ${NAS_PATH} --trust-remote-code --served-model-name ${MODEL_NAME} ... --port ${PORT}
```

4. Select an Ada‑series GPU (e.g., fc.gpu.ada.1) with full‑card memory.
5. Complete function creation and wait for initialization.
6. Mount the model from NAS for centralized management, so the startup command loads weights from the shared mount, for example:

```bash
python3 -m vllm.entrypoints.openai.api_server --model /prod/models --trust-remote-code --served-model-name Qwen/Qwen-14B-Chat --gpu-memory-utilization 0.9 --max-model-len 4096 --port 8080
```

7. Configure reserved instances and enable idle‑billing to keep a pool of ready vLLM services.
8. (Optional) Bind a custom domain for direct HTTP/WebSocket access; a quick health check is sketched after this list.
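Once the function is up and a domain is bound, a quick way to confirm the service is healthy is to list the served models through vLLM's OpenAI‑compatible endpoint. The domain below is a placeholder:

```python
# pip install requests
import requests

BASE_URL = "https://vllm.example.com"  # placeholder: your FC endpoint or custom domain

# vLLM's OpenAI-compatible server exposes GET /v1/models.
resp = requests.get(f"{BASE_URL}/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # should print the name passed via --served-model-name
```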
vLLM Application Integration
After deployment, vLLM can be exposed directly via a custom domain, or further wrapped into higher‑level services. FC handles instance startup, scheduling, load balancing, and affinity, allowing developers to focus on business logic.
For users who prefer an even simpler experience, the Cloud Application Platform (CAP) abstracts the deployment process, enabling one‑click vLLM launches without managing the underlying instances.
Conclusion
By leveraging FC GPU reserved instances with idle‑billing, enterprises can achieve a practical balance among performance, cost, and stability when running large‑scale vLLM workloads, while retaining efficient development and operations workflows.