How to Pre‑train a 20M‑Parameter LLaMA‑3 Mini Model with Hugging Face Trainer
This step‑by‑step guide shows how to use Hugging Face's Trainer API to pre‑train an ultra‑small LLaMA‑3 model (under 20 M parameters) on the TinyStories dataset, covering model configuration, tokenizer setup, data preprocessing, collators, training arguments, and inference results.
1. Preparation
We aim to pre‑train a tiny LLaMA‑3 model (≈20 M parameters) using Hugging Face's Trainer. The goal is to reproduce the TinyStories experiment with a minimal model and dataset for learning purposes.
Required libraries (install the latest versions):
transformers
accelerate
datasets
Typical versions used:
torch==2.2.1
transformers==4.40.0
accelerate==0.29.3
datasets==2.18.0
2. Original Work Overview
The TinyStories paper investigates how small language models perform on short‑story generation. It created an English story dataset using GPT‑3.5/GPT‑4 and trained GPT‑Neo‑style models of various sizes, evaluating creativity, grammar, consistency, and instruction following.
3. Model Initialization
3.1 Model Configuration
We use the LLaMA‑3 architecture already integrated in transformers. The chosen hyper‑parameters (based on the original study) are hidden_size=256, num_hidden_layers=4, intermediate_size=768 (≈8/3 × hidden_size, rounded up to a multiple of 128), num_attention_heads=16, and num_key_value_heads=8 (grouped‑query attention, GQA).
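With grouped‑query attention, the 16 query heads share 8 key/value heads, so the K and V projections are half the width of Q. A quick arithmetic check of the implied projection shapes (plain Python, not part of the training script):

```python
# Sanity check of the GQA projection widths implied by the chosen config
hidden_size = 256
num_attention_heads = 16
num_key_value_heads = 8

head_dim = hidden_size // num_attention_heads   # 16 dims per head
q_width = num_attention_heads * head_dim        # Q projects to 256
kv_width = num_key_value_heads * head_dim       # K/V project to only 128

print(head_dim, q_width, kv_width)  # → 16 256 128
```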
Configuration code:
# Model configuration
from transformers import AutoConfig
hidden_size = 256
intermediate_size = (int(hidden_size * 8/3 / 128) + 1) * 128
config = AutoConfig.for_model(
model_type='llama',
hidden_size=hidden_size,
intermediate_size=intermediate_size,
num_attention_heads=16,
num_hidden_layers=4,
num_key_value_heads=8
)
3.2 Tokenizer
We adopt the LLaMA‑2 tokenizer (32 k vocab) to keep the vocabulary small for a tiny model.
# Tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('NousResearch/Llama-2-7b-hf')
# LLaMA-2 ships without a pad token; reuse EOS so the collator can pad batches
tokenizer.pad_token = tokenizer.eos_token
# Ensure left-padding for decoder-only generation
tokenizer.padding_side = 'left'
3.3 Model Instantiation
Instantiate the model from the config (no pretrained weights) and move it to the appropriate device.
# Model
import torch
from transformers import AutoModelForCausalLM
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float32).to(device)
The parameter count is ~19.5 M, with the embedding and output layers dominating the size.
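The ~19.5 M figure can be reproduced by hand from the config (a back‑of‑the‑envelope estimate assuming the 32 000‑token LLaMA‑2 vocabulary and untied input/output embeddings, the LLaMA default):

```python
# Back-of-the-envelope parameter count for the tiny LLaMA config above
vocab, hidden, layers, inter = 32000, 256, 4, 768
heads, kv_heads = 16, 8
head_dim = hidden // heads

embed = vocab * hidden                      # input embedding table
lm_head = vocab * hidden                    # output projection (untied)
attn = hidden * (heads * head_dim)          # q_proj
attn += 2 * hidden * (kv_heads * head_dim)  # k_proj and v_proj (GQA: half width)
attn += (heads * head_dim) * hidden         # o_proj
mlp = 3 * hidden * inter                    # gate, up, and down projections
norms = 2 * hidden                          # two RMSNorm weights per layer

total = embed + lm_head + layers * (attn + mlp + norms) + hidden  # + final norm
print(f"{total/1e6:.1f} M, embeddings = {100*(embed+lm_head)/total:.0f}%")
# → 19.5 M, embeddings = 84%
```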
4. Dataset Handling
4.1 Loading the TinyStoriesV2 Dataset
We load the dataset from Hugging Face. For quick experiments we use only 10 % of the training split.
from datasets import load_dataset
dataset_name = 'noanabeshima/TinyStoriesV2'
ds_train = load_dataset(dataset_name, split='train[:10%]')
ds_val = load_dataset(dataset_name, split='validation')
4.2 Pre‑processing
Each example is tokenized without special tokens, truncated to its last 2047 tokens so the sequence stays within 2048, and an eos_token_id is appended.
def process_func(examples):
    max_token = 2048
    encoded = tokenizer(examples['text'], add_special_tokens=False)
    input_ids = encoded['input_ids']
    new_input_ids, new_attn_mask = [], []
    for ids in input_ids:
        # Keep the tail of long sequences, leaving room for the appended EOS
        temp = ids[-max_token+1:] + [tokenizer.eos_token_id]
        new_input_ids.append(temp)
        new_attn_mask.append([1] * len(temp))
    return {'input_ids': new_input_ids, 'attention_mask': new_attn_mask}
num_proc = 8
ds_train = ds_train.shuffle().map(process_func, batched=True, num_proc=num_proc, remove_columns=ds_train.column_names, desc='Tokenizing train')
ds_val = ds_val.map(process_func, batched=True, num_proc=num_proc, remove_columns=ds_val.column_names, desc='Tokenizing val')
4.3 DataCollator for Language Modeling
We use DataCollatorForLanguageModeling with mlm=False, which pads each batch (left‑padding, as set in the tokenizer), copies input_ids to labels, and masks padded positions with -100; the causal shift itself happens inside the model's loss computation.
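What the collator produces can be sketched in plain Python (a toy batch and a hypothetical pad id, not the real implementation):

```python
# Toy sketch of causal-LM collation: pad to the longest sequence,
# copy input_ids to labels, and mask padded positions with -100
PAD_ID = 0  # hypothetical; the real collator uses tokenizer.pad_token_id

def collate(batch, pad_id=PAD_ID):
    width = max(len(ids) for ids in batch)
    input_ids, labels = [], []
    for ids in batch:
        pad = [pad_id] * (width - len(ids))
        input_ids.append(pad + ids)              # left-padding, as configured
        labels.append([-100] * len(pad) + ids)   # ignore pad positions in the loss
    return {'input_ids': input_ids, 'labels': labels}

batch = collate([[5, 6, 7], [8, 9]])
print(batch['labels'])  # → [[5, 6, 7], [-100, 8, 9]]
```

One subtlety: the real collator masks every position whose id equals pad_token_id, so when pad is aliased to EOS (as in our tokenizer setup), genuine EOS tokens are masked out of the loss as well.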
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
5. Training Setup
5.1 Training Arguments
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir='saves',
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    evaluation_strategy='steps',  # required for eval_steps to take effect
    eval_steps=1000,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    lr_scheduler_type='cosine',
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
    logging_steps=50,
    report_to='none',  # the string 'none' disables logging integrations; None does not
    num_train_epochs=2,
    save_steps=1000,
    save_total_limit=2,
    seed=3407
)
5.2 Trainer Instantiation
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    tokenizer=tokenizer,
    data_collator=data_collator
)
5.3 Running Training and Saving
Start training (≈1.5 h for 2 epochs on a single GPU) and monitor the loss, which settles around 1.6 (a cross‑entropy of ~1.6 corresponds to a perplexity of exp(1.6) ≈ 5). After training, the model can be used directly for inference or saved.
trainer.train()
# Save locally
model.save_pretrained('tiny_llama3')
tokenizer.save_pretrained('tiny_llama3')
Optionally push to Hugging Face Hub:
from huggingface_hub import notebook_login
notebook_login()
model.push_to_hub('TinyStories-LLaMA3-20M')
tokenizer.push_to_hub('TinyStories-LLaMA3-20M')
6. Inference Example
def inference(model, tokenizer, input_text='Once upon a time, ', max_new_tokens=256):
    inputs = tokenizer(input_text, return_tensors='pt').to(device)
    outputs = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=40,
        top_p=0.95,
        temperature=0.8
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
inference(model, tokenizer, 'Once upon a time, in a beautiful garden, there lived a little rabbit named Peter Rabbit.', 256)
The 20 M model generates fluent, grammatically correct short stories, though consistency of characters and plot can still be improved.
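The sampling knobs used above (temperature, top_k, top_p) can be illustrated with a small pure‑Python sketch of the candidate filtering they perform (a simplified illustration, not transformers' actual implementation):

```python
import math

def filter_candidates(logits, top_k=40, top_p=0.95, temperature=0.8):
    # Sharpen/flatten the distribution with temperature, then keep the
    # smallest set of highest-probability tokens whose cumulative mass
    # reaches top_p, restricted to at most top_k tokens.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked[:top_k]:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept  # token indices that remain eligible for sampling

print(filter_candidates([2.0, 1.0, 0.1, -1.0], top_k=3, top_p=0.9))  # → [0, 1]
```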
7. Conclusions
We successfully reproduced a tiny LLaMA‑3 model capable of story continuation. The experiment demonstrates that even sub‑20 M parameter models can learn meaningful language patterns when trained on a focused dataset. Future work could involve instruction‑fine‑tuning (SFT) or expanding the dataset to improve consistency.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
