How I Built a 1B‑Parameter Chinese LLM on a Single A100: Lessons Learned

This article details the end‑to‑end process of pre‑training, fine‑tuning, and evaluating a 1‑billion‑parameter Chinese LLM named Steel‑LLM on limited hardware, covering data collection, pipeline design, training framework choices, architectural tweaks, performance results, and practical lessons for resource‑constrained developers.

Baobao Algorithm Notes

Nov 14, 2024

Background and Goals

The Steel‑LLM project started in March 2024 when the author obtained a single A100 GPU and decided to train a small, Chinese‑focused LLM from scratch. The objectives were to keep the model under 1 B parameters and to use a terabyte‑scale dataset while documenting every engineering detail for the community.

Data Collection & Processing

All pre‑training data are open‑source, primarily Skywork/Skypile‑150B (600 GB), wanjuan1.0 (1 TB), and StarCoder code slices (200 GB). English data are limited to ~400 GB. Additional dialogue data (Baidu Baike QA, BELLE, Moss) are mixed in using a ChatML‑style prompt format:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
{answer}<|im_end|>
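
As a minimal sketch of how a QA pair could be rendered into the template above: the function name `format_dialogue` and the sample question are illustrative and do not come from the Steel‑LLM codebase.

```python
SYSTEM_PROMPT = "You are a helpful assistant."


def format_dialogue(question: str, answer: str, system: str = SYSTEM_PROMPT) -> str:
    """Render one question/answer pair in the prompt template shown above."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n{answer}<|im_end|>"
    )


sample = format_dialogue("What is the capital of China?", "Beijing.")
print(sample)
```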

The processing pipeline normalises inputs into a common format and converts them to token IDs. Small datasets are filtered with Alibaba’s data‑juicer tool, which runs a chain of operators defined in a YAML config. The pipeline needs 3–4 TB of disk space and relies on multi‑process execution to finish within hours.
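
The normalisation step could look roughly like the sketch below, which maps records from differently shaped sources onto one `{"text", "source"}` layout and emits JSONL. The field mapping in `TEXT_FIELD` and both function names are assumptions made for illustration; the real pipeline's schema is not given in this article. Tokenisation is left out because it depends on the tokenizer files.

```python
import json
from typing import Iterable, Iterator, Optional

# Hypothetical mapping: which key holds the raw text in each source dataset.
TEXT_FIELD = {"skypile": "text", "wanjuan": "content", "starcoder": "content"}


def normalise_record(source: str, record: dict) -> Optional[dict]:
    """Return a unified record, or None when the record has no usable text."""
    raw = record.get(TEXT_FIELD[source])
    if not isinstance(raw, str):
        return None
    text = raw.strip()
    if not text:
        return None
    return {"text": text, "source": source}


def to_jsonl(source: str, records: Iterable[dict]) -> Iterator[str]:
    """Yield one JSON line per usable record, keeping Chinese text unescaped."""
    for record in records:
        item = normalise_record(source, record)
        if item is not None:
            yield json.dumps(item, ensure_ascii=False)


lines = list(to_jsonl("wanjuan", [{"content": "  大模型 ABC \n"}, {"content": "   "}, {"id": 7}]))
print(lines)
```

Because each record is handled independently, a function like `normalise_record` can be mapped over file shards by a process pool, which is one way to get the multi‑process speed‑up mentioned above.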

Training Framework

The codebase is forked from TinyLlama and adapted for HuggingFace Transformers compatibility. Key improvements include:

Support for HuggingFace model definitions, enabling easy swapping of architectures.

Precise data‑progress checkpointing that records file names and per‑sample indices, allowing exact resumption after interruptions.

Dynamic data‑addition handling that reshuffles new and old samples to avoid distribution drift.

MD5‑based duplicate detection that hashes only the head and tail of each file for efficiency.
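
Two of the items above, head‑and‑tail MD5 fingerprints and per‑sample progress records, can be sketched with the standard library alone. The chunk size, the JSON layout, and every function name here are assumptions for illustration, not the project's actual code.

```python
import hashlib
import json
import os
import tempfile
from typing import Optional

CHUNK_BYTES = 1 << 20  # assumption: fingerprint the first and last 1 MiB


def head_tail_md5(path: str, chunk: int = CHUNK_BYTES) -> str:
    """MD5 over the first and last `chunk` bytes of a file, not the whole file."""
    size = os.path.getsize(path)
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        digest.update(fh.read(chunk))
        if size > chunk:
            # Start the tail after the head so overlapping bytes are hashed once.
            fh.seek(max(size - chunk, chunk))
            digest.update(fh.read(chunk))
    return digest.hexdigest()


def save_progress(state_path: str, file_name: str, sample_index: int) -> None:
    """Record which file and which sample the data loader should read next."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump({"file": file_name, "sample_index": sample_index}, fh)
    os.replace(tmp_path, state_path)  # swap in the new state in a single step


def load_progress(state_path: str) -> Optional[dict]:
    """Return the saved position, or None when training starts from scratch."""
    if not os.path.exists(state_path):
        return None
    with open(state_path, encoding="utf-8") as fh:
        return json.load(fh)


with tempfile.TemporaryDirectory() as workdir:
    first = os.path.join(workdir, "shard_a.bin")
    copy = os.path.join(workdir, "shard_a_copy.bin")
    other = os.path.join(workdir, "shard_b.bin")
    for path, payload in ((first, b"a" * 64), (copy, b"a" * 64), (other, b"b" * 64)):
        with open(path, "wb") as fh:
            fh.write(payload)
    duplicate_found = head_tail_md5(first, chunk=16) == head_tail_md5(copy, chunk=16)
    distinct_found = head_tail_md5(first, chunk=16) != head_tail_md5(other, chunk=16)

    state_path = os.path.join(workdir, "data_progress.json")
    save_progress(state_path, "shard_a.bin", 1234)
    resumed = load_progress(state_path)

print(duplicate_found, distinct_found, resumed)
```

The trade‑off of hashing only the ends is that two files that differ only in the middle would share a fingerprint. For large, independently produced data shards this is unlikely, and it avoids reading terabytes just to detect re‑added files.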

Training runs on native PyTorch FSDP (Fully‑Sharded Data‑Parallel) rather than Megatron, as the 1 B model fits comfortably on a single‑node 8‑GPU setup.

Model Architecture Tweaks

Beyond the standard transformer stack (self‑attention, RMSNorm, RoPE), the author experimented with two FFN modifications:

Soft MoE: A soft mixture‑of‑experts replaces hard top‑k routing, reducing memory pressure while still leveraging expert specialization. Three versions were tried; the third achieved acceptable convergence and efficiency.

SENet‑style FFN: Inspired by Squeeze‑and‑Excitation networks, the second FFN layer was re‑implemented as a gated MLP: gate_proj and up_proj run in parallel, the activated gate output multiplies the up_proj output element‑wise, and down_proj maps the result back to the hidden size, mirroring Qwen‑2’s design.
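
The article does not spell out which soft‑routing variant was used. Assuming it follows the Soft MoE formulation of Puigcerver et al., the layer can be written as below; the notation is introduced here and is not taken from the Steel‑LLM code.

```latex
% Assumed notation: X \in \mathbb{R}^{m \times d} holds m tokens,
% \Phi \in \mathbb{R}^{d \times (n p)} holds one learnable vector per slot,
% for n experts f_1, \dots, f_n with p slots each.
\begin{aligned}
D_{ij} &= \frac{\exp\big((X\Phi)_{ij}\big)}{\sum_{i'=1}^{m} \exp\big((X\Phi)_{i'j}\big)}
  && \text{(dispatch: softmax over tokens)} \\
\tilde{X} &= D^{\top} X
  && \text{(each slot is a weighted average of all tokens)} \\
\tilde{Y}_{j} &= f_{\lceil j/p \rceil}\big(\tilde{X}_{j}\big)
  && \text{(each expert processes its own slots)} \\
C_{ij} &= \frac{\exp\big((X\Phi)_{ij}\big)}{\sum_{j'=1}^{np} \exp\big((X\Phi)_{ij'}\big)}
  && \text{(combine: softmax over slots)} \\
Y &= C\,\tilde{Y}
\end{aligned}
```

Under this formulation every token contributes to every slot with a soft weight, so no token is dropped and no discrete top‑k selection is needed.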

Relevant code snippet:

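import torch.nn as nn
from transformers.activations import ACT2FN

class Qwen2MoeMLP(nn.Module):
    """Gated MLP: down_proj(act_fn(gate_proj(x)) * up_proj(x))."""

    def __init__(self, config, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        # Callers must pass intermediate_size; nn.Linear rejects the default None.
        self.intermediate_size = intermediate_size
        # Two parallel projections up to the intermediate width.
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        # One projection back down to the hidden size.
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        # The activated gate scales the up projection element-wise.
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))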

Training Process

Pre‑training used a maximum sequence length of 2048 and ran two epochs over ~1.1 T tokens (~1 M steps), with batch size 8, gradient accumulation 8, the AdamW optimizer, a peak LR of 3e‑4, and a cosine‑annealing schedule with warm‑up. On an 8×A100 80 GB cluster the run would take ~60 days; on 8×H800 it drops to ~30 days. Training was monitored via Weights & Biases (https://api.wandb.ai/links/steel-llm-lab/vqf297nr).
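
The step count can be sanity‑checked with a little arithmetic, and the schedule sketched in a few lines. This assumes the batch size of 8 is per GPU on an 8‑GPU node and that ~1.1 T is the total token budget. `WARMUP_STEPS` and `MIN_LR` are hypothetical values, since the article gives neither.

```python
import math

SEQ_LEN = 2048
MICRO_BATCH = 8          # "batch size 8", assumed to be per GPU
GRAD_ACCUM = 8
NUM_GPUS = 8             # assumption: one 8-GPU node
TOTAL_TOKENS = 1.1e12    # assumption: ~1.1 T is the total budget
PEAK_LR = 3e-4
WARMUP_STEPS = 2_000     # hypothetical, not stated in the article
MIN_LR = 3e-5            # hypothetical, not stated in the article

# Tokens consumed by one optimizer step across all GPUs.
tokens_per_step = SEQ_LEN * MICRO_BATCH * GRAD_ACCUM * NUM_GPUS
total_steps = round(TOTAL_TOKENS / tokens_per_step)


def lr_at(step: int, total: int = total_steps) -> float:
    """Linear warm-up to PEAK_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, total - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


print(tokens_per_step, total_steps)
print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(total_steps))
```

Under these assumptions each step consumes about one million tokens, which is consistent with the quoted ~1 M steps for ~1.1 T tokens.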

Mid‑project hardware loss forced a pause, but a top‑3 university provided an H800 for the final phase.

Fine‑tuning and Evaluation

Fine‑tuning employed LLaMA‑Factory on four datasets: BAAI/Infinity‑Instruct (filtered to 70 k Chinese examples), wanjuan Chinese multiple‑choice, ruozhiba (forum‑derived Q&A), and a self‑knowledge set to teach the model its own name. The model was evaluated on the Chinese benchmarks C‑Eval and CMMLU.

Results:

C‑Eval accuracy: 38 % (after fine‑tuning on the curated Chinese subset).

CMMLU accuracy: 33 % (baseline), improved to 36 % when the benchmark data were added to SFT.

Attempts to use chain‑of‑thought prompting did not yield further gains.

Comparison chart (image) shows Steel‑LLM versus MiniCPM, MAP‑Neo, and other open‑source models.

Conclusions and Future Work

Even with limited resources, a small Chinese‑centric LLM can achieve competitive scores on standard benchmarks. Limitations include a weak tokenizer (reused from Qwen), modest English capability, and incomplete data‑cleaning pipelines. Future plans involve more extensive SFT, reinforcement learning, and multimodal extensions.

All code and model checkpoints are publicly available:

GitHub repository: https://github.com/zhanshijinwat/Steel-LLM

Model mirrors: https://hf-mirror.com/gqszhanshijin/Steel-LLM and https://modelscope.cn/models/zhanshijin/Steel-LLM

