Deploy ChatGLM2‑6B on UCloud K8S: Complete Guide to Large Language Model Inference
This article reviews the architectures, training methods, and key characteristics of major open‑source large language models such as BERT, GPT, T5, LLaMA and ChatGLM, and then provides a step‑by‑step tutorial for deploying ChatGLM2‑6B on UCloud's UK8S platform using UFS storage, Kubernetes manifests, and command‑line tools.
To meet customer demand for large‑model usage, UCloud’s image market now offers deployment and compute scheduling for open‑source models including Alpaca‑LoRA, ChatGLM, T5, MiniGPT‑4, Stable Diffusion, LLaMA2 and Milvus vector database, enabling rapid setup of fine‑tuning or inference environments.
Model Architectures and Characteristics
Since Google introduced the Transformer in 2017, mainstream models such as GPT, BERT, T5, ChatGLM and the LLaMA series have been built on this core architecture.
BERT
Uses the Transformer encoder. Characteristics: bidirectional attention that captures full context, excels at text understanding, not suited for generation tasks.
GPT
Uses the Transformer decoder. Characteristics: unidirectional (left‑to‑right) attention, ideal for text generation.
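The difference between BERT's and GPT's attention patterns is easiest to see as mask matrices. A minimal NumPy sketch (our illustration, not code from either model):

import numpy as np

seq_len = 5

# BERT (encoder): bidirectional mask -- every token attends to every other
# token, so full left and right context is visible.
bert_mask = np.ones((seq_len, seq_len))

# GPT (decoder): causal mask -- token i attends only to positions <= i,
# which is what makes left-to-right generation possible.
gpt_mask = np.tril(np.ones((seq_len, seq_len)))

print(gpt_mask)
# [[1. 0. 0. 0. 0.]
#  [1. 1. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]]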
T5
Adopts an encoder‑decoder structure. Modifications: a simplified layer normalization without bias, layer‑norm moved outside the residual path, and a learned relative position bias added to the attention scores after the first self‑attention query‑key multiplication. Characteristics: encoder attention is bidirectional, decoder attention is unidirectional, handles both understanding and generation, and has a large parameter count.
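To show where that relative position bias enters, here is a simplified NumPy sketch (the real T5 uses log-spaced, direction-aware distance buckets and a bias table per attention head; the bucketing below is deliberately naive):

import numpy as np

d, seq_len, num_buckets = 8, 6, 4            # toy sizes
q = np.random.randn(seq_len, d)
k = np.random.randn(seq_len, d)
bias_table = np.random.randn(num_buckets)    # learned scalar per distance bucket

# relative distance j - i between key and query positions, clipped to buckets
rel = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
buckets = np.clip(np.abs(rel), 0, num_buckets - 1)

# the bias is added to the attention logits after the query-key product,
# instead of adding absolute position vectors to the token embeddings
scores = q @ k.T / np.sqrt(d) + bias_table[buckets]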
LLaMA
Uses the Transformer decoder. Modifications: pre‑norm (normalizing inputs of each sub‑layer), SwiGLU activation replacing ReLU, rotary embeddings replacing absolute position embeddings. Characteristics: LLaMA‑13B is ten times smaller than GPT‑3 (175B) yet outperforms it on most benchmarks; Chinese performance is weak because Chinese corpora were not included in pre‑training.
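Rotary embeddings encode position by rotating pairs of feature dimensions rather than adding a position vector to the input. A toy NumPy sketch of the rotation (in LLaMA it is applied to the query and key heads inside attention):

import numpy as np

def rotary_embed(x, positions, base=10000.0):
    # Rotate consecutive dimension pairs of x by a position-dependent angle.
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)       # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]    # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    out[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return out

q = np.random.randn(6, 8)                  # (seq_len, head_dim)
q_rot = rotary_embed(q, np.arange(6))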
ChatGLM
Based on GLM‑130B, combines advantages of BERT, GPT and T5 via a custom mask matrix. Modifications: custom mask matrix, reordered layer‑norm and residual connections, added a separate linear layer for output prediction, replaced ReLU with GeLU, introduced 2‑D position encoding. Characteristics: integrates the strengths of the three models.
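A toy sketch of the idea behind the custom mask (our illustration, not GLM's actual implementation): the context is attended bidirectionally, BERT-style, while the span being generated is attended causally, GPT-style:

import numpy as np

def glm_style_mask(context_len, span_len):
    # Context tokens see the whole context (bidirectional); generated-span
    # tokens see the whole context plus earlier span tokens (causal).
    n = context_len + span_len
    mask = np.zeros((n, n))
    mask[:, :context_len] = 1
    for i in range(context_len, n):
        mask[i, context_len:i + 1] = 1
    return mask

print(glm_style_mask(3, 3))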
Training Methods and Objectives
These large language models broadly share a two‑stage paradigm: massive unsupervised pre‑training on unlabeled text, followed by supervised fine‑tuning on downstream tasks.
BERT
Pre‑training objectives include Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). After pre‑training, the model is fine‑tuned on task‑specific labeled data.
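The MLM corruption follows the 80/10/10 rule from the BERT paper; a simplified sketch:

import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "mat", "dog"]   # toy vocabulary

def mlm_corrupt(tokens, mask_prob=0.15):
    # Select ~15% of tokens as prediction targets; of those, 80% become
    # [MASK], 10% a random token, and 10% are left unchanged.
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB)
    return inputs, targets

print(mlm_corrupt(["the", "cat", "sat", "on", "the", "mat"]))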
T5
Training consists of Encoder‑Decoder Pretraining (masking tokens and predicting them) and Denoising Auto‑Encoder Pretraining (adding noise or random permutations and reconstructing the original sequence). The pre‑training objective is similar to MLM but can mask contiguous spans.
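Span masking is what distinguishes T5's objective from token-level MLM: each contiguous span is collapsed into a single sentinel token and the decoder reconstructs the spans. A simplified sketch using T5's sentinel naming convention:

def span_corrupt(tokens, spans):
    # Replace each (start, end) span with a sentinel; the target sequence
    # lists each sentinel followed by the tokens it replaced.
    inputs, targets, prev_end = [], [], 0
    for idx, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{idx}>"
        inputs += tokens[prev_end:start] + [sentinel]
        targets += [sentinel] + tokens[start:end]
        prev_end = end
    inputs += tokens[prev_end:]
    return inputs, targets

tokens = ["thank", "you", "for", "inviting", "me", "to", "your", "party"]
print(span_corrupt(tokens, [(2, 4), (6, 7)]))
# inputs : ['thank', 'you', '<extra_id_0>', 'me', 'to', '<extra_id_1>', 'party']
# targets: ['<extra_id_0>', 'for', 'inviting', '<extra_id_1>', 'your']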
GPT
First stage: unsupervised pre‑training on large text corpora (left‑to‑right generation). Second stage: supervised fine‑tuning on tasks such as natural language inference, QA, semantic similarity and classification. Starting with InstructGPT (built on GPT‑3) and carried into ChatGPT, Reinforcement Learning from Human Feedback (RLHF) is added to align the model with human preferences.
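The first-stage objective is plain next-token prediction. A minimal NumPy sketch of the loss (our illustration):

import numpy as np

def next_token_loss(logits, token_ids):
    # GPT objective: at each position t, score the logits against the
    # token actually observed at t+1 (mean cross-entropy).
    logits, labels = logits[:-1], token_ids[1:]
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

vocab_size, seq_len = 10, 5
loss = next_token_loss(np.random.randn(seq_len, vocab_size),
                       np.random.randint(0, vocab_size, seq_len))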
LLaMA
Unsupervised pre‑training followed by supervised fine‑tuning, reward‑model training and RLHF for alignment.
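The reward model is typically trained on human preference pairs with a pairwise ranking loss (this is the InstructGPT formulation; the LLaMA‑2‑Chat recipe is similar in spirit). A minimal sketch:

import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    # -log(sigmoid(r_chosen - r_rejected)): push the score of the
    # human-preferred answer above the rejected one.
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(reward_model_loss(2.0, 0.5))   # small loss: ranking already correct
print(reward_model_loss(0.5, 2.0))   # large loss: ranking is wrong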
ChatGLM
Unlabeled pre‑training plus supervised fine‑tuning, feedback learning and RLHF.
Large Language Model Summary
The training approach for large language models is fundamentally massive unsupervised pre‑training followed by downstream fine‑tuning; beginning with InstructGPT (built on GPT‑3), RLHF is incorporated to better align model outputs with human preferences.
Practical Deployment of ChatGLM2‑6B on UCloud UK8S
1. Obtain the project code from the GitHub repository https://github.com/THUDM/ChatGLM2-6B/tree/main and download the model weights from the Hugging Face repository THUDM/chatglm2-6b (the GitHub repo hosts code only). A scripted download option is sketched below.
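If you prefer to script the weight download, a minimal sketch using huggingface_hub (our assumption; the article does not prescribe a download method, and local_dir requires a reasonably recent huggingface_hub):

from huggingface_hub import snapshot_download

# Pull the full ChatGLM2-6B snapshot so it can be uploaded to UFS in step 2.
local_path = snapshot_download(
    repo_id="THUDM/chatglm2-6b",  # official weights repository
    local_dir="./chatglm2-6b",    # directory you will upload to UFS
)
print("model files in:", local_path)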
2. Create a UFS file system, upload the model files, and add a mount point.
3. In the UCloud console, create a UK8S cluster and configure the desired node specifications.
4. Install Docker, NVIDIA GPU drivers, and the NVIDIA Container Toolkit on each node.
5. Follow the UFS documentation to create a PersistentVolume (PV) and PersistentVolumeClaim (PVC).
6. Apply the following Kubernetes deployment manifest (saved as ufspod.yml) to mount the model files and expose a service:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myfrontend
spec:
  selector:
    matchLabels:
      app: myfrontend
  replicas: 1
  template:
    metadata:
      labels:
        app: myfrontend
    spec:
      containers:
      - name: myfrontend
        image: uhub.service.ucloud.cn/yaoxl/chatglm2-6b:y1
        volumeMounts:
        - mountPath: "/app/models"
          name: mypd
        ports:
        - containerPort: 7861
      volumes:
      - name: mypd
        persistentVolumeClaim:
          claimName: ufsclaim
---
apiVersion: v1
kind: Service
metadata:
  name: myufsservice
spec:
  selector:
    app: myfrontend
  type: NodePort
  ports:
  - name: http
    protocol: TCP
    port: 7861
    targetPort: 7861
    nodePort: 30619
7. Deploy the manifest: kubectl apply -f ufspod.yml
8. Retrieve the pod name: kubectl get po
9. Open a bash shell inside the pod: kubectl exec -it <pod_name> -- /bin/bash
10. Launch the web demo inside the container: python3 web_demo.py. The demo will be accessible via the NodePort defined in the Service (e.g., http://<node_ip>:30619).
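To verify that the UFS-mounted weights load correctly, you can also run the canonical usage snippet from the ChatGLM2-6B README inside the pod, pointed at the mount path from the manifest (this assumes /app/models contains the full Hugging Face snapshot; adjust if your layout differs):

from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model from the UFS mount instead of downloading.
tokenizer = AutoTokenizer.from_pretrained("/app/models", trust_remote_code=True)
model = AutoModel.from_pretrained("/app/models", trust_remote_code=True).half().cuda()
model = model.eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)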
Future Work
UCloud will continue to track developments in large language models and will publish further articles on LLaMA2 practice, LangChain‑based cloud inference environments, and more.
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.