Deploy ChatGLM2‑6B on UCloud K8S: Complete Guide to Large Language Model Inference
This article reviews the architectures, training methods, and key characteristics of major open‑source large language models such as BERT, GPT, T5, LLaMA and ChatGLM, and then provides a step‑by‑step tutorial for deploying ChatGLM2‑6B on UCloud's UK8S platform using UFS storage, Kubernetes manifests, and command‑line tools.
To meet customer demand for large‑model usage, UCloud’s image market now offers deployment and compute scheduling for open‑source models including Alpaca‑LoRA, ChatGLM, T5, MiniGPT‑4, Stable Diffusion, LLaMA2 and Milvus vector database, enabling rapid setup of fine‑tuning or inference environments.
Model Architectures and Characteristics
Since Google introduced the Transformer in 2017, mainstream models such as GPT, BERT, T5, ChatGLM and the LLaMA series have been built on this core architecture.
BERT
Uses the Transformer encoder. Characteristics: bidirectional attention that captures full context, excels at text understanding, not suited for generation tasks.
GPT
Uses the Transformer decoder. Characteristics: unidirectional (left‑to‑right) attention, ideal for text generation.
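The difference between BERT's and GPT's attention patterns is easiest to see as mask matrices. A minimal NumPy sketch (our illustration, not code from either model):

import numpy as np

seq_len = 5

# BERT (encoder): bidirectional mask -- every token attends to every other
# token, so full left and right context is visible.
bert_mask = np.ones((seq_len, seq_len))

# GPT (decoder): causal mask -- token i attends only to positions <= i,
# which is what makes left-to-right generation possible.
gpt_mask = np.tril(np.ones((seq_len, seq_len)))

print(gpt_mask)
# [[1. 0. 0. 0. 0.]
#  [1. 1. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]]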
T5
Adopts an encoder‑decoder structure. Modifications: a simplified layer normalization without bias, layer‑norm moved outside the residual path, and a learned relative position bias added to the attention scores after the first self‑attention query‑key multiplication. Characteristics: encoder attention is bidirectional, decoder attention is unidirectional, handles both understanding and generation, and has a large parameter count.
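To show where that relative position bias enters, here is a simplified NumPy sketch (the real T5 uses log-spaced, direction-aware distance buckets and a bias table per attention head; the bucketing below is deliberately naive):

import numpy as np

d, seq_len, num_buckets = 8, 6, 4            # toy sizes
q = np.random.randn(seq_len, d)
k = np.random.randn(seq_len, d)
bias_table = np.random.randn(num_buckets)    # learned scalar per distance bucket

# relative distance j - i between key and query positions, clipped to buckets
rel = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
buckets = np.clip(np.abs(rel), 0, num_buckets - 1)

# the bias is added to the attention logits after the query-key product,
# instead of adding absolute position vectors to the token embeddings
scores = q @ k.T / np.sqrt(d) + bias_table[buckets]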
LLaMA
Uses the Transformer decoder. Modifications: pre‑norm (normalizing inputs of each sub‑layer), SwiGLU activation replacing ReLU, rotary embeddings replacing absolute position embeddings. Characteristics: LLaMA‑13B is ten times smaller than GPT‑3 (175B) yet outperforms it on most benchmarks; Chinese performance is weak because Chinese corpora were not included in pre‑training.
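Rotary embeddings encode position by rotating pairs of feature dimensions rather than adding a position vector to the input. A toy NumPy sketch of the rotation (in LLaMA it is applied to the query and key heads inside attention):

import numpy as np

def rotary_embed(x, positions, base=10000.0):
    # Rotate consecutive dimension pairs of x by a position-dependent angle.
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)       # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]    # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    out[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return out

q = np.random.randn(6, 8)                  # (seq_len, head_dim)
q_rot = rotary_embed(q, np.arange(6))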
ChatGLM
Based on GLM‑130B, combines advantages of BERT, GPT and T5 via a custom mask matrix. Modifications: custom mask matrix, reordered layer‑norm and residual connections, added a separate linear layer for output prediction, replaced ReLU with GeLU, introduced 2‑D position encoding. Characteristics: integrates the strengths of the three models.
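A toy sketch of the idea behind the custom mask (our illustration, not GLM's actual implementation): the context is attended bidirectionally, BERT-style, while the span being generated is attended causally, GPT-style:

import numpy as np

def glm_style_mask(context_len, span_len):
    # Context tokens see the whole context (bidirectional); generated-span
    # tokens see the whole context plus earlier span tokens (causal).
    n = context_len + span_len
    mask = np.zeros((n, n))
    mask[:, :context_len] = 1
    for i in range(context_len, n):
        mask[i, context_len:i + 1] = 1
    return mask

print(glm_style_mask(3, 3))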
Training Methods and Objectives
These large language models broadly share a two‑stage paradigm: massive unsupervised pre‑training on unlabeled text, followed by supervised fine‑tuning on downstream tasks.
BERT
Pre‑training objectives include Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). After pre‑training, the model is fine‑tuned on task‑specific labeled data.
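The MLM corruption follows the 80/10/10 rule from the BERT paper; a simplified sketch:

import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "mat", "dog"]   # toy vocabulary

def mlm_corrupt(tokens, mask_prob=0.15):
    # Select ~15% of tokens as prediction targets; of those, 80% become
    # [MASK], 10% a random token, and 10% are left unchanged.
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB)
    return inputs, targets

print(mlm_corrupt(["the", "cat", "sat", "on", "the", "mat"]))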
T5
Training consists of Encoder‑Decoder Pretraining (masking tokens and predicting them) and Denoising Auto‑Encoder Pretraining (adding noise or random permutations and reconstructing the original sequence). The pre‑training objective is similar to MLM but can mask contiguous spans.
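Span masking is what distinguishes T5's objective from token-level MLM: each contiguous span is collapsed into a single sentinel token and the decoder reconstructs the spans. A simplified sketch using T5's sentinel naming convention:

def span_corrupt(tokens, spans):
    # Replace each (start, end) span with a sentinel; the target sequence
    # lists each sentinel followed by the tokens it replaced.
    inputs, targets, prev_end = [], [], 0
    for idx, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{idx}>"
        inputs += tokens[prev_end:start] + [sentinel]
        targets += [sentinel] + tokens[start:end]
        prev_end = end
    inputs += tokens[prev_end:]
    return inputs, targets

tokens = ["thank", "you", "for", "inviting", "me", "to", "your", "party"]
print(span_corrupt(tokens, [(2, 4), (6, 7)]))
# inputs : ['thank', 'you', '<extra_id_0>', 'me', 'to', '<extra_id_1>', 'party']
# targets: ['<extra_id_0>', 'for', 'inviting', '<extra_id_1>', 'your']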
GPT
First stage: unsupervised pre‑training on large text corpora (left‑to‑right generation). Second stage: supervised fine‑tuning on tasks such as natural language inference, QA, semantic similarity and classification. Starting with InstructGPT (built on GPT‑3) and carried into ChatGPT, Reinforcement Learning from Human Feedback (RLHF) is added to align the model with human preferences.
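The first-stage objective is plain next-token prediction. A minimal NumPy sketch of the loss (our illustration):

import numpy as np

def next_token_loss(logits, token_ids):
    # GPT objective: at each position t, score the logits against the
    # token actually observed at t+1 (mean cross-entropy).
    logits, labels = logits[:-1], token_ids[1:]
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

vocab_size, seq_len = 10, 5
loss = next_token_loss(np.random.randn(seq_len, vocab_size),
                       np.random.randint(0, vocab_size, seq_len))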
LLaMA
Unsupervised pre‑training followed by supervised fine‑tuning, reward‑model training and RLHF for alignment.
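The reward model is typically trained on human preference pairs with a pairwise ranking loss (this is the InstructGPT formulation; the LLaMA‑2‑Chat recipe is similar in spirit). A minimal sketch:

import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    # -log(sigmoid(r_chosen - r_rejected)): push the score of the
    # human-preferred answer above the rejected one.
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(reward_model_loss(2.0, 0.5))   # small loss: ranking already correct
print(reward_model_loss(0.5, 2.0))   # large loss: ranking is wrong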
ChatGLM
Unlabeled pre‑training plus supervised fine‑tuning, feedback learning and RLHF.
Large Language Model Summary
The training approach for large language models is fundamentally massive unsupervised pre‑training followed by downstream fine‑tuning; beginning with InstructGPT (built on GPT‑3), RLHF is incorporated to better align model outputs with human preferences.
Practical Deployment of ChatGLM2‑6B on UCloud UK8S
1. Obtain the project code from the GitHub repository https://github.com/THUDM/ChatGLM2-6B/tree/main and download the model weights from the Hugging Face repository THUDM/chatglm2-6b (the GitHub repo hosts code only). A scripted download option is sketched below.
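If you prefer to script the weight download, a minimal sketch using huggingface_hub (our assumption; the article does not prescribe a download method, and local_dir requires a reasonably recent huggingface_hub):

from huggingface_hub import snapshot_download

# Pull the full ChatGLM2-6B snapshot so it can be uploaded to UFS in step 2.
local_path = snapshot_download(
    repo_id="THUDM/chatglm2-6b",  # official weights repository
    local_dir="./chatglm2-6b",    # directory you will upload to UFS
)
print("model files in:", local_path)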
2. Create a UFS file system, upload the model files, and add a mount point.
3. In the UCloud console, create a UK8S cluster and configure the desired node specifications.
4. Install Docker, NVIDIA GPU drivers, and the NVIDIA Container Toolkit on each node.
5. Follow the UFS documentation to create a PersistentVolume (PV) and PersistentVolumeClaim (PVC).
6. Apply the following Kubernetes deployment manifest (saved as ufspod.yml) to mount the model files and expose a service:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myfrontend
spec:
  selector:
    matchLabels:
      app: myfrontend
  replicas: 1
  template:
    metadata:
      labels:
        app: myfrontend
    spec:
      containers:
      - name: myfrontend
        image: uhub.service.ucloud.cn/yaoxl/chatglm2-6b:y1
        volumeMounts:
        - mountPath: "/app/models"
          name: mypd
        ports:
        - containerPort: 7861
      volumes:
      - name: mypd
        persistentVolumeClaim:
          claimName: ufsclaim
---
apiVersion: v1
kind: Service
metadata:
  name: myufsservice
spec:
  selector:
    app: myfrontend
  type: NodePort
  ports:
  - name: http
    protocol: TCP
    port: 7861
    targetPort: 7861
    nodePort: 30619
7. Deploy the manifest: kubectl apply -f ufspod.yml
8. Retrieve the pod name: kubectl get po
9. Open a bash shell inside the pod: kubectl exec -it <pod_name> -- /bin/bash
10. Launch the web demo inside the container: python3 web_demo.py. The demo will be accessible via the NodePort defined in the Service (e.g., http://<node_ip>:30619).
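To verify that the UFS-mounted weights load correctly, you can also run the canonical usage snippet from the ChatGLM2-6B README inside the pod, pointed at the mount path from the manifest (this assumes /app/models contains the full Hugging Face snapshot; adjust if your layout differs):

from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model from the UFS mount instead of downloading.
tokenizer = AutoTokenizer.from_pretrained("/app/models", trust_remote_code=True)
model = AutoModel.from_pretrained("/app/models", trust_remote_code=True).half().cuda()
model = model.eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)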
Future Work
UCloud will continue to track developments in large language models and will publish further articles on LLaMA2 practice, LangChain‑based cloud inference environments, and more.
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.