Deploy FastChat on Alibaba Cloud ASK: A Serverless AI Model Tutorial
This guide shows how to quickly deploy the open‑source FastChat AI assistant on Alibaba Cloud ASK's serverless Kubernetes platform, covering prerequisites, YAML configuration, GPU handling, verification steps, and three usage scenarios including web UI, API calls, and a VSCode extension.
Background
AI‑generated content (AIGC) tools such as ChatGPT, Midjourney, FastGPT, MOSS and Stable Diffusion have spurred rapid innovation, but deploying these large models typically requires complex, resource‑intensive setups.
Alibaba Serverless Kubernetes (ASK)
ASK (Alibaba Serverless Kubernetes) is a fully managed, serverless container service. Users create workloads via the standard Kubernetes API without managing nodes. ASK automatically scales, isolates workloads, and provides a free‑trial quota.
Challenges for Large‑Scale AI Model Deployment
GPU scarcity and high cost: training and inference require GPUs that many developers cannot afford.
Heterogeneous GPU environments: different GPU series require matching CUDA, driver and nvidia‑container‑cli versions, adding configuration overhead.
Large container images: AI images often exceed dozens of gigabytes, leading to slow pull times.
Solution with ASK
ASK abstracts GPU resources, CUDA dependencies and image handling. Pods can request GPUs on demand, and ASK caches container images so that a second pod starts in seconds.
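As a sketch of what on‑demand GPU scheduling looks like on ASK (the annotation key and instance spec below are illustrative examples, and the image name is a placeholder; check the current Alibaba Cloud ECI documentation for supported values):

```yaml
# Illustrative pod sketch: request one GPU on ASK (backed by ECI).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
  annotations:
    # Example ECI GPU instance spec, not a recommendation.
    k8s.aliyun.com/eci-use-specs: "ecs.gn6i-c4g1.xlarge"
spec:
  containers:
  - name: app
    image: registry.example.com/fastchat:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: "1"  # one GPU, provisioned on demand
```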
Deployment Procedure
Prerequisites
Create an ASK cluster (refer to Alibaba documentation).
Download the LLaMA‑7B model and upload it to OSS.
Configure YAML
Replace the placeholders in secret.yaml and deployment.yaml with your credentials and OSS path:
${your-ak} – AccessKey ID
${your-sk} – AccessKey Secret
${oss-endpoint-url} – OSS endpoint (e.g., oss-cn-shanghai.aliyuncs.com)
${llama-oss-path} – OSS object path, e.g. oss://my-bucket/llama-7b-hf (no trailing slash)
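For orientation, a minimal sketch of what secret.yaml might look like (the secret name and field names are assumptions; match them to the names your deployment.yaml actually references):

```yaml
# Illustrative secret.yaml sketch holding the OSS AccessKey credentials.
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
type: Opaque
stringData:
  akId: ${your-ak}      # AccessKey ID
  akSecret: ${your-sk}  # AccessKey Secret
```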
Apply resources
kubectl apply -f secret.yaml
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
Wait for readiness
When the FastChat pod reaches the Ready state, open a browser at http://<external-ip>:7860 to view the web UI.
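service.yaml itself is not shown in this guide; a minimal sketch that exposes the web UI through a LoadBalancer might look like the following (the service name and selector labels are assumed to match the deployment):

```yaml
# Illustrative service.yaml sketch: expose the FastChat web UI on port 7860.
apiVersion: v1
kind: Service
metadata:
  name: fastchat
spec:
  type: LoadBalancer
  selector:
    app: fastchat
  ports:
  - port: 7860
    targetPort: 7860
```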
Verification
Check pod and service status:
kubectl get po -l app=fastchat
# Example output:
# fastchat-69ff78cf46-tpbvp 1/1 Running 0 20m
kubectl get svc fastchat
# Example output:
# fastchat   LoadBalancer   192.168.230.108   203.0.113.45   7860:31444/TCP   22m
Use Cases
Console Interaction
Open the web UI and ask FastChat to generate code, e.g. “Create a Kubernetes Deployment for an Nginx image”. The model returns a ready‑to‑use YAML file.
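For reference, a minimal Nginx Deployment of the kind this prompt asks for looks like the manifest below (the model's actual output will vary from run to run):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
```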
API Call
FastChat exposes an HTTP API on port 8000. Example curl request:
curl http://<external-ip>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.1",
"messages": [{"role": "user", "content": "generate a hello world in golang"}]
}'
VSCode Extension
A minimal VSCode extension can call the FastChat API and insert the generated snippet into the active editor. The three essential files are shown below.
// extension.ts
import * as vscode from 'vscode';
import axios from 'axios';
export function activate(context: vscode.ExtensionContext) {
const fastchat = async () => {
const input = await vscode.window.showInputBox({ prompt: 'Enter code prompt' });
if (!input) return;
const response = await axios.post('http://<external-ip>:8000/v1/chat/completions', {
model: 'vicuna-7b-v1.1',
messages: [{ role: 'user', content: input }]
}, { headers: { 'Content-Type': 'application/json' } });
const content = response.data.choices[0].message.content;
// Extract the body of the first fenced code block from the reply
const match = content.match(/```[^\n]*\n([\s\S]*?)```/);
if (match) {
const editor = vscode.window.activeTextEditor;
editor?.edit(editBuilder => {
const pos = editor.selection.active;
editBuilder.insert(pos, match[1].trim());
});
}
};
const cmd = vscode.commands.registerCommand('fastchat', fastchat);
context.subscriptions.push(cmd);
}
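The code‑fence extraction in extension.ts can be exercised in isolation. This standalone sketch assumes the model wraps its answer in a standard Markdown code fence:

```typescript
// Standalone version of the snippet extraction used in extension.ts:
// return the body of the first fenced code block, or null if none exists.
function extractSnippet(content: string): string | null {
  // ``` followed by an optional language tag, a newline, then the body.
  const match = content.match(/```[^\n]*\n([\s\S]*?)```/);
  return match ? match[1].trim() : null;
}

const reply = 'Here you go:\n```go\nfmt.Println("hello")\n```\nEnjoy!';
console.log(extractSnippet(reply)); // prints: fmt.Println("hello")
```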
// package.json
{
"name": "fastchat",
"version": "1.0.0",
"publisher": "yourname",
"engines": { "vscode": "^1.77.0" },
"activationEvents": ["onCommand:fastchat"],
"main": "./dist/extension.js",
"contributes": { "commands": [{ "command": "fastchat", "title": "fastchat code generator" }] },
"dependencies": { "axios": "^1.3.6" },
"devDependencies": { "@types/node": "^18.16.1", "@types/vscode": "^1.77.0", "typescript": "^5.0.4" }
}
// tsconfig.json
{
"compilerOptions": {
"target": "ES2018",
"module": "commonjs",
"outDir": "./dist",
"strict": true,
"esModuleInterop": true,
"resolveJsonModule": true,
"declaration": true
},
"include": ["src/**/*"],
"exclude": ["node_modules", "**/*.test.ts"]
}
Conclusion
ASK’s serverless, auto‑scaling, GPU‑abstraction and image‑caching capabilities make it an ideal platform for deploying large AI models such as LLaMA‑7B or Vicuna‑7B, eliminating the operational overhead of node management and image handling.
Appendix
Model download (LLaMA‑7B) from Hugging Face:
# Install git‑lfs on an Alibaba Cloud ECS instance
yum install -y git-lfs
# Clone the repository containing the model files (only LFS pointers at first)
git clone https://huggingface.co/decapoda-research/llama-7b-hf
# Initialise Git LFS and pull the large model objects inside the repo
cd llama-7b-hf
git lfs install
git lfs pull
Upload the model directory to OSS using the OSS command-line tool (see Alibaba Cloud OSS documentation).
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.