From Transformers to LLaMA 4: A Journey Through the Biggest LLMs
This article surveys the most influential large language models released since 2017, detailing the core innovations of the Transformer, BERT, the GPT series, T5, Retrieval‑Augmented Generation, and the latest LLaMA and other Meta models, and highlighting their architectures, training paradigms, and impact on NLP research.
Attention Is All You Need (2017)
https://arxiv.org/abs/1706.03762
The paper introduces the Transformer architecture, which replaces recurrent and convolutional layers with a pure attention mechanism. Multi‑head self‑attention allows each token to attend to all others in parallel, enabling efficient training on long sequences and improving representation capacity.
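To make the mechanism concrete, below is a minimal numpy sketch of scaled dot‑product self‑attention; the sequence length, hidden size, and random inputs are purely illustrative, and the multi‑head projections are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to every key; weights come from a softmax
    over scaled dot products."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V                                   # weighted sum of value vectors

# Toy example: 4 tokens with 8-dimensional hidden states (sizes are illustrative).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)              # self-attention: Q = K = V = x
print(out.shape)                                         # (4, 8)
```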
BERT: Bidirectional Encoder Representations from Transformers (2018)
https://arxiv.org/abs/1810.04805
BERT uses the encoder stack of the Transformer and is pretrained with two objectives:
Masked Language Modeling (MLM): 15% of tokens are masked and the model predicts them using both left and right context.
Next Sentence Prediction (NSP): the model learns to predict whether two sentences follow each other, capturing inter‑sentence relationships.
After pretraining, a simple classification head is added and the whole model is fine‑tuned on downstream tasks (e.g., QA, NER, text classification), establishing the pre‑training → fine‑tuning paradigm.
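As a concrete illustration of the MLM objective described above, here is a small sketch of the 80/10/10 corruption scheme from the paper; the token IDs, [MASK] id, and vocabulary size below are placeholders rather than values from an actual tokenizer.

```python
import random

MASK_ID = 103          # hypothetical [MASK] token id
VOCAB_SIZE = 30000     # illustrative vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Return the corrupted input and the prediction targets (-100 = not predicted)."""
    random.seed(seed)
    corrupted, targets = [], []
    for tid in token_ids:
        if random.random() < mask_prob:
            targets.append(tid)                               # the model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK_ID)                     # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                corrupted.append(tid)                         # 10%: keep the original token
        else:
            corrupted.append(tid)
            targets.append(-100)                              # position is not predicted
    return corrupted, targets

print(mask_tokens([7592, 2088, 2003, 2307, 1012]))
```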
T5: Text‑to‑Text Transfer Transformer (2019)
https://arxiv.org/abs/1910.10683
T5 reformulates every NLP task as a text‑to‑text problem: given an input string, generate an output string. Examples:
Translation: translate English to German: That is good. → Das ist gut.
Text classification: cola sentence: The course is jumping well. → not acceptable
Summarization: summarize: [original text] → [summary]
The model is pretrained on a massive unsupervised corpus and can be fine‑tuned for any downstream task without architectural changes.
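The uniform interface is easiest to see in code; the snippet below simply assembles (input, target) string pairs using the task prefixes quoted above (the summarization text is a placeholder).

```python
# Every task becomes a string-to-string pair distinguished only by its prefix.
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("summarize: [original text]", "[summary]"),
]
for source, target in examples:
    print(f"input : {source}")
    print(f"target: {target}\n")
```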
Retrieval‑Augmented Generation (RAG) (2020)
https://arxiv.org/abs/2005.11401
RAG combines a parametric generator with a non‑parametric external knowledge base:
Retriever: a dual‑encoder (based on BERT) encodes the query and documents, then retrieves the top‑K most similar passages from a large index (e.g., Wikipedia).
Generator: a seq2seq model (e.g., BART) conditions on the retrieved passages. Two variants exist:
RAG‑Sequence: the same set of retrieved documents is used for the entire generated output.
RAG‑Token: at each generation step the model can attend to different retrieved documents, allowing finer‑grained information stitching.
This architecture mitigates knowledge staleness, reduces hallucination, and provides source attribution.
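The retrieve‑then‑generate flow can be sketched as follows; encode() is a toy stand‑in for the BERT‑based dual encoder and the generation step is only indicated by a comment, so this is an illustration of the idea rather than RAG's actual implementation.

```python
import numpy as np

def encode(text: str) -> np.ndarray:
    """Placeholder embedding: hashes characters into a fixed-size unit vector."""
    vec = np.zeros(64)
    for i, ch in enumerate(text.encode()):
        vec[(i + ch) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    q = encode(query)
    scores = [float(q @ encode(p)) for p in passages]   # inner-product similarity
    top = np.argsort(scores)[::-1][:k]                   # top-K most similar passages
    return [passages[i] for i in top]

passages = [
    "The Eiffel Tower is in Paris.",
    "BART is a seq2seq model.",
    "Paris is in France.",
]
context = retrieve("Where is the Eiffel Tower?", passages)
prompt = " ".join(context) + " Question: Where is the Eiffel Tower?"
print(prompt)   # a real RAG system would condition the generator on this prompt
```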
GPT‑1: Generative Pre‑Training (2018)
https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
GPT‑1 demonstrates a semi‑supervised “pre‑training → fine‑tuning” pipeline:
Unsupervised pre‑training: a Transformer decoder is trained to predict the next token on a large unlabeled corpus.
Supervised fine‑tuning: the pretrained weights initialize a task‑specific model that is trained on a small labeled dataset.
This approach shows that a single language model can be adapted to many tasks with minimal task‑specific data.
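The unsupervised objective is ordinary next‑token prediction; the sketch below computes the average negative log‑likelihood of each token given its left context, with random logits standing in for the decoder's outputs.

```python
import numpy as np

def next_token_nll(logits, token_ids):
    """Average negative log-likelihood of each token given its left context.
    logits: (seq_len, vocab) scores from a decoder-only Transformer (stubbed here)."""
    shifted_logits = logits[:-1]       # predictions at position t score the token at t+1
    targets = token_ids[1:]
    log_probs = shifted_logits - np.log(np.exp(shifted_logits).sum(-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(targets)), targets])

rng = np.random.default_rng(0)
token_ids = np.array([5, 17, 3, 42, 8])                 # toy token ids
logits = rng.normal(size=(len(token_ids), 50))          # stand-in for model outputs
print(next_token_nll(logits, token_ids))
```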
GPT‑2: Scaling Up (2019)
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
GPT‑2 scales the decoder‑only Transformer to 1.5 B parameters and trains on a diverse web‑scale corpus. The authors argue that a sufficiently large language model becomes an unsupervised multitask learner, implicitly acquiring abilities such as translation, summarization, and question answering without explicit supervision.
GPT‑3: Few‑Shot Learning (2020)
https://arxiv.org/abs/2005.14165
GPT‑3 (175 B parameters) demonstrates “in‑context learning”: the model can perform a new task by conditioning on a prompt that includes a natural‑language instruction and a few examples, without any gradient updates. Three prompting regimes are defined:
Zero‑shot: only an instruction (e.g., “Translate English to French: …”).
One‑shot: one input‑output example followed by a new query.
Few‑shot: several examples are provided.
Performance often matches or exceeds fine‑tuned state‑of‑the‑art models.
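A few‑shot prompt is just concatenated text; the sketch below assembles one for the translation task, with a made‑up demonstration format and no API call.

```python
# Illustrative few-shot prompt assembly; the examples and the "=>" format are
# only for demonstration.
instruction = "Translate English to French:"
examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
query = "peppermint"

prompt_lines = [instruction]
for en, fr in examples:                 # few-shot: each demo is an input-output pair
    prompt_lines.append(f"{en} => {fr}")
prompt_lines.append(f"{query} =>")      # the model completes this final line
print("\n".join(prompt_lines))
```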
ChatGPT: Conversational Interface (2022)
https://openai.com/blog/chatgpt
ChatGPT is a GPT‑based model fine‑tuned for dialogue using Reinforcement Learning from Human Feedback (RLHF):
Supervised fine‑tuning: human AI trainers generate high‑quality user‑assistant conversations.
Reward model training: trainers rank multiple model responses to the same prompt; the rankings train a reward model.
Proximal Policy Optimization (PPO): the base model is optimized against the reward model to produce more helpful, truthful, and safe replies.
The resulting system can answer follow‑up questions, admit mistakes, challenge false premises, and refuse harmful requests.
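Step 2 hinges on a pairwise ranking loss: the reward model should score the preferred response above the rejected one. A minimal sketch, with hand‑picked scalar rewards standing in for a real reward model:

```python
import math

def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): small when the preferred response scores higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(pairwise_ranking_loss(reward_chosen=1.3, reward_rejected=-0.4))  # correct ranking: low loss
print(pairwise_ranking_loss(reward_chosen=-0.4, reward_rejected=1.3))  # wrong ranking: high loss
```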
GPT‑4: Multimodal Capabilities (2023)
https://arxiv.org/abs/2303.08774
GPT‑4 is a large multimodal model that accepts image and text inputs and generates text outputs. It achieves near‑human performance on many professional and academic benchmarks while still lagging behind humans on complex real‑world reasoning.
OpenAI Sora: World Simulation (2024)
https://openai.com/sora
Sora is a diffusion‑based video generation system that creates physically plausible world simulations from textual descriptions. It ensures temporal consistency across long sequences and supports camera‑movement simulation.
Google PaLM (2022)
https://arxiv.org/abs/2204.02311
PaLM (Pathways Language Model) contains 540 B parameters and is trained with the Pathways system, which efficiently scales training across thousands of accelerators. Using chain‑of‑thought prompting, PaLM outperforms many fine‑tuned SOTA models on multi‑step reasoning tasks.
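Chain‑of‑thought prompting changes only the prompt: the demonstration includes its intermediate reasoning steps before the answer. An illustrative example (the arithmetic problems are standard toy examples, not taken from the PaLM paper):

```python
# The demo shows worked reasoning, encouraging the model to reason step by step
# before answering the new question. No model call is made here.
demo = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.\n"
)
question = "Q: A bakery had 23 cupcakes, sold 7, then baked 12 more. How many are there now?\nA:"
print(demo + question)   # a real run would send this prompt to the model
```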
Switch Transformer (Mixture‑of‑Experts) (2021)
https://arxiv.org/abs/2101.03961
The Switch Transformer applies a Mixture‑of‑Experts (MoE) architecture: for each input token, a routing network activates only one expert (a sub‑network of parameters), as sketched after the list below. This yields:
Parameter counts up to trillions while keeping inference compute comparable to a much smaller dense model.
Training speedups of up to 7× relative to dense models of similar compute budget.
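Below is a minimal numpy sketch of top‑1 (“switch”) routing: a softmax router picks one expert per token and the gate value scales that expert's output. The sizes and expert networks are toy values, and the load‑balancing loss is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_model = 4, 8
router_w = rng.normal(size=(d_model, num_experts))                          # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)] # toy expert layers

def switch_layer(tokens):
    logits = tokens @ router_w
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # router softmax
    chosen = probs.argmax(-1)                                       # top-1 expert per token
    out = np.empty_like(tokens)
    for i, (tok, e) in enumerate(zip(tokens, chosen)):
        # Only one expert's parameters are used per token, so compute stays constant
        # even as the number of experts (total parameters) grows.
        out[i] = probs[i, e] * (tok @ experts[e])                   # gate value scales the output
    return out, chosen

tokens = rng.normal(size=(5, d_model))
out, chosen = switch_layer(tokens)
print(chosen)   # which expert processed each token
```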
Meta OPT (2022)
https://arxiv.org/abs/2205.01068
OPT is a suite of decoder‑only Transformers ranging from 125 M to 175 B parameters. Trained on NVIDIA A100 GPUs with efficient strategies, OPT‑175B’s carbon footprint is roughly one‑seventh that of GPT‑3. The models are released openly to facilitate research on robustness, bias, and toxicity.
LLaMA 1 (2023)
https://arxiv.org/abs/2302.13971
LLaMA (Large Language Model Meta AI) provides models from 7 B to 65 B parameters trained exclusively on publicly available data. The authors demonstrate that smaller models trained on more data can match or exceed the performance of larger proprietary models.
Stanford Alpaca (2023)
https://crfm.stanford.edu/2023/03/13/alpaca.html
Alpaca‑7B fine‑tunes LLaMA‑7B on 52 k instruction‑following examples generated via the OpenAI API. Despite its modest size, Alpaca achieves instruction‑following ability comparable to OpenAI’s text-davinci-003 at a total cost under $600.
LLaMA 2 (2023)
https://arxiv.org/abs/2307.09288
LLaMA 2 releases both pretrained base models (7 B–70 B parameters) and chat‑tuned variants (LLaMA‑2‑Chat). The paper details the supervised fine‑tuning (SFT) pipeline and RLHF alignment process, enabling commercial use while preserving open‑source accessibility.
LLaMA 3 (2024)
https://ai.meta.com/blog/meta-llama-3/
LLaMA 3 adds 8 B and 70 B parameter models, claiming to be the strongest open‑source models and competitive with closed‑source systems such as Claude Sonnet and GPT‑3.5. A 400 B model is in development, with future multimodal and multilingual extensions planned.
LLaMA 4 (2025)
https://ai.meta.com/blog/llama-4-multimodal-intelligence/
LLaMA 4 adopts a Mixture‑of‑Experts architecture with native multimodal support. Two variants are released:
Llama 4 Scout: 17 B active parameters, 16 experts, fits on a single NVIDIA H100 GPU, and offers a 10 million‑token context window for long‑document analysis and code‑base reasoning.
Llama 4 Maverick: 17 B active parameters, 128 experts (400 B total parameters), delivering a strong performance‑to‑cost ratio for image‑text understanding and general‑assistant chat scenarios.