How to Pre‑train a 20M‑Parameter LLaMA‑3 Mini Model with Hugging Face Trainer

This step‑by‑step guide shows how to use Hugging Face's Trainer API to pre‑train an ultra‑small LLaMA‑3 model (under 20 M parameters) on the TinyStories dataset, covering model configuration, tokenizer setup, data preprocessing, collators, training arguments, and inference results.

1. Preparation

We aim to pre‑train a tiny LLaMA‑3 model (≈20 M parameters) using Hugging Face's Trainer. The goal is to reproduce the TinyStories experiment with a minimal model and dataset for learning purposes.

Required libraries (install the latest versions):

transformers
accelerate
datasets

Typical versions used:

torch==2.2.1
transformers==4.40.0
accelerate==0.29.3
datasets==2.18.0
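
A quick sanity check that the environment roughly matches these versions (a minimal sketch):

# Print installed versions of the core libraries
import torch, transformers, accelerate, datasets

print(torch.__version__, transformers.__version__, accelerate.__version__, datasets.__version__)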

2. Original Work Overview

The TinyStories paper investigates how small language models perform on short‑story generation. It created an English story dataset using GPT‑3.5/GPT‑4 and trained GPT‑Neo‑style models of various sizes, evaluating creativity, grammar, consistency, and instruction following.

3. Model Initialization

3.1 Model Configuration

We use the LLaMA architecture already integrated in transformers (the same LlamaForCausalLM implementation that backs LLaMA‑3). The chosen hyper‑parameters (based on the original study) are hidden_size=256, num_hidden_layers=4, intermediate_size=768 (≈ 8/3 × hidden, rounded up to a multiple of 128), num_attention_heads=16, and num_key_value_heads=8 for grouped‑query attention (GQA).

Configuration code:

# Model configuration
from transformers import AutoConfig

hidden_size = 256
# ≈ 8/3 × hidden, rounded up to the next multiple of 128 (768 here)
intermediate_size = (int(hidden_size * 8 / 3 / 128) + 1) * 128

config = AutoConfig.for_model(
    model_type='llama',
    hidden_size=hidden_size,
    intermediate_size=intermediate_size,
    num_attention_heads=16,
    num_hidden_layers=4,
    num_key_value_heads=8
)

3.2 Tokenizer

We adopt the LLaMA‑2 tokenizer (32 k vocabulary) rather than the much larger LLaMA‑3 vocabulary (~128 k), which keeps the embedding tables small for a tiny model.

# Tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('NousResearch/Llama-2-7b-hf')
# Ensure left-padding for decoder-only generation
tokenizer.padding_side = 'left'
# LLaMA tokenizers may not define a pad token; the data collator needs one
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

3.3 Model Instantiation

Instantiate the model from the config (no pretrained weights) and move it to the appropriate device.

# Model
import torch
from transformers import AutoModelForCausalLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float32).to(device)

Parameter count is ~19.5 M, with embeddings dominating the size.
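As a quick check of that figure, you can count the parameters and the embedding share directly; a minimal sketch using the standard transformers module accessors:

# Total parameter count and the share taken by the input embedding table
total_params = sum(p.numel() for p in model.parameters())
embed_params = model.get_input_embeddings().weight.numel()
print(f'total: {total_params / 1e6:.1f}M, input embeddings: {embed_params / 1e6:.1f}M '
      f'({100 * embed_params / total_params:.0f}%)')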

4. Dataset Handling

4.1 Loading the TinyStoriesV2 Dataset

We load the dataset from Hugging Face. For quick experiments we use only 10 % of the training split.

from datasets import load_dataset

dataset_name = 'noanabeshima/TinyStoriesV2'

ds_train = load_dataset(dataset_name, split='train[:10%]')
ds_val   = load_dataset(dataset_name, split='validation')

4.2 Pre‑processing

Each example is tokenized without special tokens, truncated to at most 2048 tokens (keeping the end of the text), and an eos_token_id is appended.

def process_func(examples):
    max_token = 2048
    encoded = tokenizer(examples['text'], add_special_tokens=False)
    input_ids = encoded['input_ids']
    new_input_ids, new_attn_mask = [], []
    for ids in input_ids:
        # Keep at most max_token-1 trailing tokens and append EOS
        temp = ids[-max_token + 1:] + [tokenizer.eos_token_id]
        new_input_ids.append(temp)
        new_attn_mask.append([1] * len(temp))
    return {'input_ids': new_input_ids, 'attention_mask': new_attn_mask}

num_proc = 8
ds_train = ds_train.shuffle().map(process_func, batched=True, num_proc=num_proc, remove_columns=ds_train.column_names, desc='Tokenizing train')
ds_val   = ds_val.map(process_func, batched=True, num_proc=num_proc, remove_columns=ds_val.column_names, desc='Tokenizing val')

4.3 DataCollator for Language Modeling

We use DataCollatorForLanguageModeling with mlm=False, so labels are a copy of input_ids (the model applies the causal shift internally for next‑token prediction) and pad positions are masked out with -100. Padding to the longest sequence in each batch is handled automatically by the tokenizer (left‑padding as set above).

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
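
To sanity‑check the collator output, you can pass it a couple of tokenized examples; a minimal sketch (the shapes depend on the longest example in the batch):

# Inspect one collated batch: padded input_ids, attention_mask, and labels
batch = data_collator([ds_train[0], ds_train[1]])
print({k: v.shape for k, v in batch.items()})
print(batch['labels'][0][:10])  # pad positions (if any) are masked to -100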

5. Training Setup

5.1 Training Arguments

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='saves',
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    evaluation_strategy='steps',   # evaluate every eval_steps
    eval_steps=1000,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    lr_scheduler_type='cosine',
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
    logging_steps=50,
    report_to='none',              # disable external loggers such as wandb
    num_train_epochs=2,
    save_steps=1000,
    save_total_limit=2,
    seed=3407
)

5.2 Trainer Instantiation

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    tokenizer=tokenizer,
    data_collator=data_collator
)

5.3 Running Training and Saving

Start training (≈1.5 h for 2 epochs on a single GPU) and monitor loss (final ~1.6). After training, the model can be used directly for inference or saved.

trainer.train()

# Save locally
model.save_pretrained('tiny_llama3')
tokenizer.save_pretrained('tiny_llama3')
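
You can also run a final evaluation on the validation split and turn the loss into perplexity; a minimal sketch:

# Evaluate on the validation set; perplexity = exp(eval loss)
import math

metrics = trainer.evaluate()
print(metrics['eval_loss'], math.exp(metrics['eval_loss']))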

Optionally push to Hugging Face Hub:

from huggingface_hub import notebook_login
notebook_login()
model.push_to_hub('TinyStories-LLaMA3-20M')
tokenizer.push_to_hub('TinyStories-LLaMA3-20M')
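
If you later reload the checkpoint for inference, the saved directory can be loaded directly (a minimal sketch, using the 'tiny_llama3' path from the save step above):

# Reload the locally saved model and tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('tiny_llama3').to(device)
tokenizer = AutoTokenizer.from_pretrained('tiny_llama3')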

6. Inference Example

def inference(model, tokenizer, input_text='Once upon a time, ', max_new_tokens=256):
    inputs = tokenizer(input_text, return_tensors='pt').to(device)
    outputs = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=40,
        top_p=0.95,
        temperature=0.8
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

inference(model, tokenizer, 'Once upon a time, in a beautiful garden, there lived a little rabbit named Peter Rabbit.', 256)

The 20 M model generates fluent, grammatically correct short stories, though consistency of characters and plot can still be improved.

7. Conclusions

We successfully reproduced a tiny LLaMA‑3 model capable of story continuation. The experiment demonstrates that even sub‑20 M parameter models can learn meaningful language patterns when trained on a focused dataset. Future work could involve instruction‑fine‑tuning (SFT) or expanding the dataset to improve consistency.
