
How ColossalChat Replicates ChatGPT with a Complete Open‑Source RLHF Pipeline

ColossalChat, an open-source project built on LLaMA, provides a full RLHF pipeline: supervised fine-tuning, reward-model training, and reinforcement learning. It enables low-cost, bilingual ChatGPT-like models with 4-bit quantized inference, and ships with training code, a dataset, and performance optimizations.

21CTO

Mar 31, 2023

Why an Open‑Source ChatGPT Clone Matters

In recent months, AI applications such as ChatGPT and GPT‑4 have sparked a new industrial revolution, but OpenAI has not open‑sourced the underlying models. Colossal‑AI provides a fully open‑source solution that reproduces the complete RLHF workflow.

ColossalChat Overview

ColossalChat is built on the LLaMA foundation model. At the time of writing, it is presented as the most practical open-source project that follows ChatGPT's original technical approach.

Open‑source repository: https://github.com/hpcaitech/ColossalAI

Demo: online model demo without registration or waiting list.

Training code: complete RLHF training code, supporting 7B and 13B model sizes.

Dataset: a bilingual (Chinese‑English) dataset with 104K examples.

Inference deployment: 4‑bit quantized 7B‑parameter model runs on a single 4 GB GPU.

Model weights: can be reproduced on a single server with modest compute.

Future: larger models, datasets, and optimizations will be added continuously.

Affordable Model, Strong Capability

With fewer than 10 billion parameters and RLHF fine-tuning, ColossalChat is reported to reach bilingual (Chinese and English) performance comparable to ChatGPT and GPT-3.5.

[Figures: an example Chinese-English QA interaction, a generated email draft, and an algorithm sketch.]

Full ChatGPT Clone Solution

Meta's LLaMA and Stanford's Alpaca both demonstrate strong performance, but each stops short of ChatGPT's training recipe: LLaMA has no instruction fine-tuning stage, and Alpaca applies supervised fine-tuning only, without RLHF alignment. ColossalChat implements the entire RLHF pipeline, making it the closest open-source replica of ChatGPT's original training strategy.

RLHF Algorithm Reproduction

Stage 1 – Supervised Fine-Tuning (SFT): fine-tune the LLaMA model on the bilingual dataset.

Stage 2 – Reward Model Training: collect multiple responses per prompt, rank them, and train a reward model to predict human preferences.

Stage 3 – Reinforcement Learning (PPO): generate experiences with the SFT model, the Actor, the Reward Model, and the Critic; store them in a buffer; then update the Actor with the policy loss and the Critic with the value loss. The PTX term adds the pre-training cross-entropy loss to preserve the original language-model knowledge.
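
To make the losses behind Stages 2 and 3 concrete, here is a minimal sketch in plain Python on scalar values. It is not taken from the ColossalChat code: the function names, the clipping range of 0.2, and the additive weighting of the PTX term (`ptx_coef`) are assumptions chosen for illustration, and real implementations operate on batched tensors and may weight the terms differently.

```python
import math


def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Stage 2 (sketch): -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def clipped_policy_loss(logp_new: float, logp_old: float,
                        advantage: float, clip_eps: float = 0.2) -> float:
    """Stage 3 (sketch): PPO clipped surrogate for one action, to be minimized."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return -min(ratio * advantage, clipped_ratio * advantage)


def value_loss(value_pred: float, value_target: float) -> float:
    """Stage 3 (sketch): squared error between the Critic's estimate and the target."""
    return 0.5 * (value_pred - value_target) ** 2


def actor_loss_with_ptx(policy_loss: float, ptx_cross_entropy: float,
                        ptx_coef: float = 0.9) -> float:
    """Add the pre-training cross-entropy term (PTX) to the policy loss.

    The additive form and the default ptx_coef are assumptions for this sketch.
    """
    return policy_loss + ptx_coef * ptx_cross_entropy


if __name__ == "__main__":
    print(pairwise_ranking_loss(2.0, 0.0))   # preferred answer scored higher: small loss
    print(pairwise_ranking_loss(0.0, 2.0))   # ranking violated: larger loss
    print(clipped_policy_loss(logp_new=1.0, logp_old=0.0, advantage=1.0))
```

The clipping keeps a single update from moving the Actor too far from the policy that generated the experiences, while the PTX term pulls it back toward ordinary language-modeling behavior.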

Quick Start

# Training with a 4‑GPU server (SFT)
colossalai run --nproc_per_node=4 train_sft.py \
  --pretrain "/path/to/LLaMa-7B/" \
  --model 'llama' \
  --strategy colossalai_zero2 \
  --log_interval 10 \
  --save_path /path/to/Coati-7B \
  --dataset /path/to/data.json \
  --batch_size 4 \
  --accimulation_steps 8 \
  --lr 2e-5

# Training with a 4‑GPU server (Reward Model)
colossalai run --nproc_per_node=4 train_reward_model.py \
  --pretrain "/path/to/LLaMa-7B/" \
  --model 'llama' \
  --strategy colossalai_zero2 \
  --dataset /path/to/datasets

# Training with an 8‑GPU server (RL / PPO)
colossalai run --nproc_per_node=8 train_prompts.py prompts.csv \
  --strategy colossalai_zero2 \
  --pretrain "/path/to/Coati-7B" \
  --model 'llama' \
  --pretrain_dataset /path/to/dataset

After obtaining the final weights, quantize the model to 4-bit and serve it on a single GPU with roughly 4 GB of memory:

python server.py /path/to/pretrained \
  --quant 4bit \
  --gptq_checkpoint /path/to/coati-7b-4bit-128g.pt \
  --gptq_group_size 128

# Loading the quantized model in Python: 4 is the bit width, followed by the group size
if args.quant == '4bit':
    model = load_quant(args.pretrained, args.gptq_checkpoint, 4, args.gptq_group_size)
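
The `--gptq_group_size 128` flag suggests that weights are quantized in groups of 128 values, each group sharing its own scale and offset. The sketch below illustrates that idea with plain round-to-nearest quantization; it is deliberately not GPTQ, which additionally compensates for quantization error, and every function name here is hypothetical rather than part of the project's code.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1                 # 15 distinct steps for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + lo for c in codes]


def quantize(weights, group_size=128, bits=4):
    """Split a flat weight list into groups and quantize each group independently."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


if __name__ == "__main__":
    weights = [i / 100 for i in range(-128, 128)]   # 256 toy weights, i.e. two groups
    groups = quantize(weights)
    restored = [w for g in groups for w in dequantize_group(*g)]
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(len(groups), worst)
```

Smaller groups track the local weight range more closely at the cost of storing more scales and offsets, which is the trade-off a group-size setting controls.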

System Performance Optimizations

Colossal-AI's ZeRO (Zero Redundancy Optimizer) and Gemini memory manager reduce memory redundancy across GPUs, so larger models fit on the same hardware. Compared with the FSDP (Fully Sharded Data Parallel) setup used by Alpaca, training is more than twice as fast.
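
As a rough illustration of what removing redundancy buys, the sketch below estimates per-GPU memory for model states under mixed-precision Adam, using the usual accounting of 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer states. The function name is hypothetical, and the estimate ignores activations, buffers, and anything Gemini does on top, so treat the numbers as back-of-the-envelope figures only.

```python
def per_gpu_model_state_gb(num_params: int, num_gpus: int, zero_stage: int) -> float:
    """Approximate per-GPU memory (GB) for weights, gradients, and optimizer states."""
    weights = 2 * num_params    # FP16 weights
    grads = 2 * num_params      # FP16 gradients
    optim = 12 * num_params     # FP32 master weights, momentum, and variance
    if zero_stage >= 1:
        optim /= num_gpus       # stage 1 shards optimizer states
    if zero_stage >= 2:
        grads /= num_gpus       # stage 2 also shards gradients
    if zero_stage >= 3:
        weights /= num_gpus     # stage 3 also shards the weights themselves
    return (weights + grads + optim) / 1e9


if __name__ == "__main__":
    params_7b = 7_000_000_000
    print(per_gpu_model_state_gb(params_7b, num_gpus=4, zero_stage=0))  # 112.0
    print(per_gpu_model_state_gb(params_7b, num_gpus=4, zero_stage=2))  # 38.5
```

Under these assumptions, the `colossalai_zero2` strategy on four GPUs would cut the model-state footprint of a 7B model to roughly a third of the unsharded figure.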

Low-rank adaptation (LoRA) makes fine-tuning cheap: the base model's weights stay frozen, and only a pair of small low-rank matrices, whose product is added to each adapted weight matrix, is trained.
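
A minimal sketch of the idea, in plain Python with nested lists standing in for tensors. The helper names, the 4096 x 4096 layer size, and the rank of 8 are illustrative assumptions, not values taken from ColossalChat.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, scaling=1.0):
    """Compute y = W x + scaling * B (A x).

    W (d_out x d_in) is frozen; only A (r x d_in) and B (d_out x r) are trained.
    """
    base = matvec(W, x)                 # frozen pretrained path
    update = matvec(B, matvec(A, x))    # low-rank path through the rank-r bottleneck
    return [b + scaling * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Return (parameters in the full matrix, trainable parameters with LoRA)."""
    return d_out * d_in, rank * (d_in + d_out)


if __name__ == "__main__":
    full, trainable = lora_param_counts(4096, 4096, rank=8)
    print(full, trainable)              # 16777216 65536

    W = [[1.0, 0.0], [0.0, 1.0]]
    A = [[1.0, 1.0]]                    # rank 1
    B = [[0.0], [0.0]]                  # zero B leaves the base output unchanged
    print(lora_forward([1.0, 2.0], W, A, B))   # [1.0, 2.0]
```

For that hypothetical layer, the trainable parameters amount to 1/256 of the full matrix, which is where the savings in optimizer memory come from.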

GPTQ 4-bit quantization cuts the GPU memory needed for weights by about 75% versus FP16, with minimal impact on throughput and perplexity. A 7B-parameter model can then run on a consumer-grade GPU (e.g., RTX 3060), and quantized loading is enabled with a single line of code.
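
The 75% figure follows directly from the bit widths, as this small calculation shows. The function name is hypothetical, and the estimate covers weights only: per-group scales and offsets, activations, and the generation cache add some overhead on top.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params_7b = 7_000_000_000
    fp16 = weight_memory_gb(params_7b, 16)
    int4 = weight_memory_gb(params_7b, 4)
    print(fp16, int4, 1 - int4 / fp16)   # 14.0 3.5 0.75
```

At about 3.5 GB for the weights, a 7B model fits the roughly 4 GB budget mentioned for single-GPU inference.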

Open Collaboration

Contributions are welcome via GitHub issues or pull requests, community Slack/WeChat groups, or formal partnership proposals sent to [email protected].