Deploy FastChat on Alibaba Cloud ASK: A Serverless AI Model Tutorial
This guide shows how to quickly deploy the open‑source FastChat AI assistant on Alibaba Cloud ASK's serverless Kubernetes platform, covering prerequisites, YAML configuration, GPU handling, verification steps, and three usage scenarios including web UI, API calls, and a VSCode extension.
Background
AI‑generated content (AIGC) tools such as ChatGPT, Midjourney, FastGPT, MOSS and Stable Diffusion have spurred rapid innovation, but deploying these large models typically requires complex, resource‑intensive setups.
Alibaba Serverless Kubernetes (ASK)
ASK (Alibaba Serverless Kubernetes) is a fully managed, serverless container service. Users create workloads via the standard Kubernetes API without managing nodes. ASK automatically scales, isolates workloads, and provides a free‑trial quota.
Challenges for Large‑Scale AI Model Deployment
GPU scarcity and high cost: training and inference require GPUs that many developers cannot afford.
Heterogeneous GPU environments: different GPU series require matching CUDA, driver and nvidia‑container‑cli versions, adding configuration overhead.
Large container images: AI images often exceed dozens of gigabytes, leading to slow pull times.
Solution with ASK
ASK abstracts GPU resources, CUDA dependencies and image handling. Pods can request GPUs on demand, and ASK caches container images so that a second pod starts in seconds.
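As a sketch of what on‑demand GPU scheduling looks like on ASK (the annotation key and instance spec below are illustrative examples, and the image name is a placeholder; check the current Alibaba Cloud ECI documentation for supported values):

```yaml
# Illustrative pod sketch: request one GPU on ASK (backed by ECI).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
  annotations:
    # Example ECI GPU instance spec, not a recommendation.
    k8s.aliyun.com/eci-use-specs: "ecs.gn6i-c4g1.xlarge"
spec:
  containers:
  - name: app
    image: registry.example.com/fastchat:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: "1"  # one GPU, provisioned on demand
```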
Deployment Procedure
Prerequisites
Create an ASK cluster (refer to Alibaba documentation).
Download the LLaMA‑7B model and upload it to OSS.
Configure YAML
Replace the placeholders in secret.yaml and deployment.yaml with your credentials and OSS path:
${your-ak} – AccessKey ID
${your-sk} – AccessKey Secret
${oss-endpoint-url} – OSS endpoint (e.g., oss-cn-shanghai.aliyuncs.com)
${llama-oss-path} – OSS object path, e.g. oss://my-bucket/llama-7b-hf (no trailing slash)
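For orientation, a minimal sketch of what secret.yaml might look like (the secret name and field names are assumptions; match them to the names your deployment.yaml actually references):

```yaml
# Illustrative secret.yaml sketch holding the OSS AccessKey credentials.
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
type: Opaque
stringData:
  akId: ${your-ak}      # AccessKey ID
  akSecret: ${your-sk}  # AccessKey Secret
```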
Apply resources
kubectl apply -f secret.yaml
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
Wait for readiness
When the FastChat pod reaches the Ready state, open a browser at http://<external-ip>:7860 to view the web UI.
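service.yaml itself is not shown in this guide; a minimal sketch that exposes the web UI through a LoadBalancer might look like the following (the service name and selector labels are assumed to match the deployment):

```yaml
# Illustrative service.yaml sketch: expose the FastChat web UI on port 7860.
apiVersion: v1
kind: Service
metadata:
  name: fastchat
spec:
  type: LoadBalancer
  selector:
    app: fastchat
  ports:
  - port: 7860
    targetPort: 7860
```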
Verification
Check pod and service status:
kubectl get po -l app=fastchat
# Example output:
# fastchat-69ff78cf46-tpbvp 1/1 Running 0 20m
kubectl get svc fastchat
# Example output:
# fastchat   LoadBalancer   192.168.230.108   203.0.113.45   7860:31444/TCP   22m
Use Cases
Console Interaction
Open the web UI and ask FastChat to generate code, e.g. “Create a Kubernetes Deployment for an Nginx image”. The model returns a ready‑to‑use YAML file.
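For reference, a minimal Nginx Deployment of the kind this prompt asks for looks like the manifest below (the model's actual output will vary from run to run):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
```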
API Call
FastChat exposes an HTTP API on port 8000. Example curl request:
curl http://<external-ip>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.1",
"messages": [{"role": "user", "content": "generate a hello world in golang"}]
}'
VSCode Extension
A minimal VSCode extension can call the FastChat API and insert the generated snippet into the active editor. The three essential files are shown below.
// extension.ts
import * as vscode from 'vscode';
import axios from 'axios';
export function activate(context: vscode.ExtensionContext) {
const fastchat = async () => {
const input = await vscode.window.showInputBox({ prompt: 'Enter code prompt' });
if (!input) return;
const response = await axios.post('http://<external-ip>:8000/v1/chat/completions', {
model: 'vicuna-7b-v1.1',
messages: [{ role: 'user', content: input }]
}, { headers: { 'Content-Type': 'application/json' } });
const content = response.data.choices[0].message.content;
// Extract the body of the first fenced code block from the reply
const match = content.match(/```[^\n]*\n([\s\S]*?)```/);
if (match) {
const editor = vscode.window.activeTextEditor;
editor?.edit(editBuilder => {
const pos = editor.selection.active;
editBuilder.insert(pos, match[1].trim());
});
}
};
const cmd = vscode.commands.registerCommand('fastchat', fastchat);
context.subscriptions.push(cmd);
}
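The code‑fence extraction in extension.ts can be exercised in isolation. This standalone sketch assumes the model wraps its answer in a standard Markdown code fence:

```typescript
// Standalone version of the snippet extraction used in extension.ts:
// return the body of the first fenced code block, or null if none exists.
function extractSnippet(content: string): string | null {
  // ``` followed by an optional language tag, a newline, then the body.
  const match = content.match(/```[^\n]*\n([\s\S]*?)```/);
  return match ? match[1].trim() : null;
}

const reply = 'Here you go:\n```go\nfmt.Println("hello")\n```\nEnjoy!';
console.log(extractSnippet(reply)); // prints: fmt.Println("hello")
```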
// package.json
{
"name": "fastchat",
"version": "1.0.0",
"publisher": "yourname",
"engines": { "vscode": "^1.77.0" },
"activationEvents": ["onCommand:fastchat"],
"main": "./dist/extension.js",
"contributes": { "commands": [{ "command": "fastchat", "title": "fastchat code generator" }] },
"dependencies": { "axios": "^1.3.6" },
"devDependencies": { "@types/node": "^18.16.1", "@types/vscode": "^1.77.0", "typescript": "^5.0.4" }
}
// tsconfig.json
{
"compilerOptions": {
"target": "ES2018",
"module": "commonjs",
"outDir": "./dist",
"strict": true,
"esModuleInterop": true,
"resolveJsonModule": true,
"declaration": true
},
"include": ["src/**/*"],
"exclude": ["node_modules", "**/*.test.ts"]
}
Conclusion
ASK’s serverless, auto‑scaling, GPU‑abstraction and image‑caching capabilities make it an ideal platform for deploying large AI models such as LLaMA‑7B or Vicuna‑7B, eliminating the operational overhead of node management and image handling.
Appendix
Model download (LLaMA‑7B) from Hugging Face:
# Install git‑lfs on an Alibaba Cloud ECS instance
yum install -y git-lfs
# Clone the repository containing the model files (only LFS pointers at first)
git clone https://huggingface.co/decapoda-research/llama-7b-hf
# Initialise Git LFS and pull the large model objects inside the repo
cd llama-7b-hf
git lfs install
git lfs pull
Upload the model directory to OSS using the OSS command-line tool (see Alibaba Cloud OSS documentation).
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.