From Transformers to DeepSeek‑R1: Tracing the Evolution of Large Language Models (2017‑2025)

This article chronicles the rapid development of large language models from the 2017 Transformer breakthrough through successive milestones such as BERT, GPT‑3, ChatGPT, multimodal GPT‑4 variants, open‑weight releases, and the cost‑efficient DeepSeek‑R1, highlighting key architectural innovations, training paradigms, alignment techniques, and industry impact.

AI Frontier Lectures

1. Language Models and Autoregressive LLMs

Language models (LMs) learn statistical patterns from large text corpora to predict the next token. Large language models (LLMs) are a subset of LMs with billions of parameters (e.g., GPT‑3’s 175 B). Most LLMs operate autoregressively: given a sequence of tokens they output a probability distribution for the next token, enabling text generation. Decoding can be greedy (choose the highest‑probability token) or stochastic (sample from the distribution), which yields diverse outputs.
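The difference between greedy and stochastic decoding can be illustrated with a minimal sketch. The vocabulary and logit values below are hypothetical toy numbers, not output from any real model, and the `temperature` parameter is a common addition that the text above does not discuss.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(probs):
    """Return the index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_decode(probs, rng):
    """Draw one token index at random, weighted by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical toy vocabulary and next-token logits (illustrative values only).
vocab = ["cat", "dog", "car", "tree"]
logits = [2.0, 1.0, 0.5, -1.0]

probs = softmax(logits)
greedy_token = vocab[greedy_decode(probs)]  # always the same token

rng = random.Random(0)  # seeded so the run is repeatable
sampled_tokens = [vocab[sample_decode(probs, rng)] for _ in range(5)]  # may vary
```

Greedy decoding returns the same token every time for a given distribution, while sampling can return any token with non-zero probability, which is where the diversity comes from.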

2. Transformer Architecture (2017)

The Transformer replaces recurrent networks with self‑attention, allowing each token to attend to all others in parallel. Core components include:

Self‑attention: Attention(Q,K,V)=softmax(QK^T/√d_k)V, where Q, K, V are query, key, and value matrices.

Multi‑head attention: multiple attention heads run in parallel and their outputs are concatenated, capturing different relational aspects.

Position‑wise feed‑forward networks (FFN) applied to each token independently.

Layer normalization and residual connections for stable deep training.

Sinusoidal positional encodings to inject token order information.

These innovations allow all sequence positions to be processed in parallel during training, scale to very large models, and capture context more effectively than the recurrent networks they replaced.
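The self‑attention formula and the sinusoidal positional encoding listed above can be sketched in plain Python. This is a simplified single‑head illustration: it assumes Q = K = V = the position‑augmented input and omits the learned projection matrices, multi‑head splitting, FFN, and normalization that a real Transformer layer includes. The input values are arbitrary.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def sinusoidal_positions(seq_len, d_model):
    """Even dimensions use sin, odd dimensions use cos of the same angle."""
    pe = []
    for pos in range(seq_len):
        row = []
        for j in range(d_model):
            angle = pos / (10000 ** (2 * (j // 2) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

# Three toy token embeddings of width 4 (arbitrary illustrative values).
x = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0, 0.0]]
pe = sinusoidal_positions(seq_len=3, d_model=4)
x_pos = [[a + b for a, b in zip(row, pe_row)] for row, pe_row in zip(x, pe)]

# Simplification: no learned W_Q, W_K, W_V projections.
output, weights = scaled_dot_product_attention(x_pos, x_pos, x_pos)
```

Each row of `weights` is a probability distribution over all tokens, so every output row is a weighted average of the value vectors.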

Transformer architecture diagram

3. Pre‑training Era (2018‑2020)

3.1 BERT: Bidirectional Context

BERT (2018) uses a Transformer encoder trained with two objectives:

Masked Language Modeling (MLM): random tokens are replaced with [MASK] and the model predicts them, forcing bidirectional context usage.

Next Sentence Prediction (NSP): the model learns to predict whether two sentences appear consecutively.

These tasks yielded state‑of‑the‑art results on GLUE, SQuAD and many downstream benchmarks.
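The MLM corruption step can be sketched as follows. The 15 % selection rate and the 80/10/10 split (replace with [MASK], replace with a random token, keep unchanged) come from the original BERT paper rather than from this article; the function name `mask_tokens` and the toy sentence are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted inputs, labels); labels are None where no prediction is required."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80 %: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10 %: replace with a random token
        # remaining 10 %: keep the original token unchanged
    return inputs, labels

rng = random.Random(42)
tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))
inputs, labels = mask_tokens(tokens, vocab, rng, mask_prob=0.15)
```

Because the model sees the tokens on both sides of each selected position, it has to use left and right context together to recover the original token.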

BERT architecture

3.2 GPT Series: Generative Pre‑training

GPT (2018) introduced causal (autoregressive) pre‑training using a Transformer decoder. Subsequent models scaled dramatically:

GPT‑2 (2019): 1.5 B parameters, demonstrated strong zero‑shot capabilities across text generation, summarisation and translation.

GPT‑3 (2020): 175 B parameters, achieved strong few‑shot and zero‑shot performance on a wide range of tasks without task‑specific fine‑tuning, lending empirical support to scaling laws: larger models trained on more data tend to become more capable.
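Causal pre‑training, the objective shared by all GPT models, can be sketched in a few lines: each position may only attend to earlier positions, and the loss is the negative log‑likelihood of the next token. The token ids and the uniform "model" probabilities below are made‑up placeholders standing in for a real model's output.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def next_token_pairs(token_ids):
    """Shift by one: the token at step t is the input, the token at t+1 is the target."""
    return list(zip(token_ids[:-1], token_ids[1:]))

def mean_nll(step_probs, targets):
    """Average negative log-likelihood of the target token at each step."""
    return -sum(math.log(p[t]) for p, t in zip(step_probs, targets)) / len(targets)

token_ids = [0, 2, 1, 3]                   # toy sequence over a vocabulary of 4 ids
pairs = next_token_pairs(token_ids)        # [(0, 2), (2, 1), (1, 3)]
targets = [target for _, target in pairs]
mask = causal_mask(len(pairs))

# Placeholder predictions: an untrained model that is uniform over the vocabulary.
step_probs = [[0.25, 0.25, 0.25, 0.25] for _ in targets]
loss = mean_nll(step_probs, targets)       # equals log(4) for uniform predictions
```

Training lowers this loss by shifting probability mass toward the tokens that actually follow in the corpus.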

GPT‑3 scale and capabilities

4. Post‑Training Alignment (2021‑2022)

4.1 Supervised Fine‑Tuning (SFT)

SFT (instruction tuning) fine‑tunes a pre‑trained LLM on high‑quality input‑output pairs or demonstrations, teaching the model to follow explicit user instructions.
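One way to picture an SFT training example is as a prompt followed by a demonstration response, with the loss computed only on the response tokens. Masking out the prompt is a common convention rather than something this article specifies, and the helper names, tokens, and loss values below are hypothetical.

```python
def build_sft_example(prompt_tokens, response_tokens):
    """Concatenate prompt and response; mark which tokens contribute to the loss."""
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

def masked_mean(per_token_losses, loss_mask):
    """Average the per-token losses over response positions only."""
    total = sum(loss * m for loss, m in zip(per_token_losses, loss_mask))
    return total / sum(loss_mask)

prompt = ["Translate", "to", "French", ":", "cat"]
response = ["chat", "<eos>"]
tokens, loss_mask = build_sft_example(prompt, response)

# Made-up per-token losses; the large prompt values are ignored by the mask.
per_token_losses = [9.0, 9.0, 9.0, 9.0, 9.0, 0.5, 1.5]
sft_loss = masked_mean(per_token_losses, loss_mask)
```

With this masking, the model is optimized to produce the demonstrated answer given the instruction, not to reproduce the instruction itself.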

4.2 Reinforcement Learning from Human Feedback (RLHF)

RLHF adds a second stage to improve alignment:

Train a reward model on human‑ranked outputs (preference data).

Use Proximal Policy Optimization (PPO) to fine‑tune the LLM against the reward model, encouraging helpful, honest and safe responses.

This two‑stage pipeline steers outputs toward human preferences and can reduce, though not eliminate, hallucinated or harmful responses.
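The two RLHF stages each rest on a small formula, sketched below. The pairwise logistic (Bradley‑Terry style) reward‑model loss is the form commonly used in the RLHF literature, which this article does not spell out, and the PPO function shows only the clipped surrogate term; production pipelines typically add a KL penalty against the SFT model and a value‑function loss. All numeric inputs are illustrative.

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred output scores higher."""
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# Stage 1: the reward model is penalized when it ranks the rejected output higher.
loss_correct_order = pairwise_reward_loss(r_chosen=2.0, r_rejected=0.0)
loss_wrong_order = pairwise_reward_loss(r_chosen=0.0, r_rejected=2.0)

# Stage 2: the probability ratio new_policy / old_policy is clipped to limit step size.
gain_capped = ppo_clipped_objective(ratio=1.5, advantage=1.0)     # capped at 1.2
penalty_kept = ppo_clipped_objective(ratio=0.5, advantage=-1.0)   # stays at -0.8
```

The clipping keeps any single update from moving the policy too far from the one that generated the samples, which is a large part of why PPO is stable enough for this setting.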

RLHF pipeline

5. Multimodal Models (2023‑2024)

5.1 GPT‑4V: Vision‑Language

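GPT‑4V adds image understanding to the GPT‑4 language core. OpenAI has not published the architecture; a vision encoder coupled to the language model through cross‑modal attention is a commonly assumed design. The model can caption images, answer visual questions and interpret specialised imagery such as medical scans, demonstrating fluent text‑image interaction.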

5.2 GPT‑4o: Full‑Modality

GPT‑4o adds audio and video streams, enabling speech‑to‑text transcription, video description and text‑to‑speech synthesis within a single model.

Multimodal GPT‑4o example

6. Open‑Source and Open‑Weight Models (2023‑2024)

Open‑weight LLMs (e.g., Meta’s LLaMA, Mistral 7B) release model weights for inference and fine‑tuning but typically withhold the training data and the full training pipeline. Fully open‑source models (e.g., OPT, BERT) publish weights together with code, enabling community‑driven research, parameter‑efficient fine‑tuning (PEFT) methods such as LoRA, and domain‑specific adaptations.
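LoRA, one of the fine‑tuning methods that open weights make possible, freezes the released weight matrix W and learns a low‑rank update, so the effective weight is W + (alpha / r) · B A. The sketch below uses tiny hand‑picked matrices purely for illustration; real adapters are trained with a library such as PEFT, and the function names here are hypothetical.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)            # A has shape (r x d_in), B has shape (d_out x r)
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update with shape (d_out x d_in)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# Frozen pre-trained weight (d_out = 2, d_in = 3); toy values.
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
a = [[0.5, 0.0, 1.0]]     # trainable, rank r = 1
b = [[1.0], [2.0]]        # trainable; LoRA initializes B to zeros before training

w_adapted = lora_effective_weight(w, a, b, alpha=2.0)

# Trainable parameter counts: full fine-tuning versus the rank-1 adapter.
full_params = len(w) * len(w[0])
lora_params = len(a) * len(a[0]) + len(b) * len(b[0])
```

In this toy case the saving is negligible (6 versus 5 parameters), but for a 4096 × 4096 layer with rank 8 the adapter has about 65 thousand trainable parameters instead of roughly 16.8 million.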

Open‑source model ecosystem

7. Reasoning Models (2024‑2025)

7.1 OpenAI‑o1

Released in September 2024, o1‑preview introduces “Long Chain‑of‑Thought” (Long CoT) internal reasoning. The model decomposes problems, self‑critiques solutions and explores alternatives before emitting a final answer. This approach, which trades extra inference‑time computation for accuracy, yields performance approaching that of skilled human competitors on math (AIME 2024), coding (Codeforces) and scientific reasoning benchmarks.

o1 reasoning workflow

7.2 OpenAI‑o3

OpenAI announced o3 in December 2024, and the smaller o3‑mini followed in January 2025. o3 builds on o1 and reported breakthrough scores on ARC‑AGI (87.5 % accuracy in its high‑compute configuration), SWE‑Bench Verified (71.7 %) and FrontierMath (25.2 %).

o3 benchmark performance

8. Cost‑Efficient Inference Models: DeepSeek‑R1 (2025)

8.1 DeepSeek‑V3 (2024‑12)

DeepSeek‑V3 contains 671 B total parameters, of which about 37 B are activated per token, and uses a Mixture‑of‑Experts (MoE) design to keep training cost low (a reported ≈$5.6 M in GPU time for the final training run). Key architectural innovations are:

Multi‑Head Latent Attention (MLA): compresses keys and values (and, separately, queries) into low‑rank latent vectors, shrinking the KV cache while preserving attention quality; RoPE positional embeddings are applied through a decoupled component.

DeepSeekMoE: combines shared and routed experts in the feed‑forward network for balanced utilization.

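Multi‑Token Prediction (MTP): trains the model to predict several future tokens at each position, which densifies the training signal; the extra prediction modules can also be reused for speculative decoding to speed up generation.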
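The shared‑plus‑routed expert idea can be illustrated with a generic top‑k router. This is a minimal sketch, not DeepSeek's implementation: it uses scalar inputs, toy lambda "experts", and a softmax over the selected scores, whereas the DeepSeek‑V3 report describes its own gating function and load‑balancing strategy, which are not reproduced here.

```python
import math

def top_k_route(scores, k):
    """Pick the k highest-scoring routed experts and normalize their gates to sum to 1."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = {i: math.exp(scores[i]) for i in top}
    total = sum(exps.values())
    return {i: exps[i] / total for i in top}

def moe_ffn(x, shared_experts, routed_experts, router_scores, k):
    """Shared experts always run; only the top-k routed experts are evaluated."""
    out = sum(expert(x) for expert in shared_experts)
    for i, gate in top_k_route(router_scores, k).items():
        out += gate * routed_experts[i](x)
    return out

# Toy experts acting on a scalar "token" (hypothetical stand-ins for FFN blocks).
shared = [lambda x: x]
routed = [lambda x: 2 * x, lambda x: 3 * x, lambda x: -x, lambda x: 10 * x]
router_scores = [0.0, 5.0, 5.0, -3.0]   # made-up router affinities for this token

gates = top_k_route(router_scores, k=2)  # experts 1 and 2, each with gate 0.5
y = moe_ffn(1.0, shared, routed, router_scores, k=2)
```

Only two of the four routed experts are evaluated for this token, which is the mechanism that lets a model hold far more parameters than it activates per token.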

DeepSeek‑V3 architecture

8.2 DeepSeek‑R1‑Zero and DeepSeek‑R1 (2025)

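DeepSeek‑R1‑Zero skips the SFT stage and applies reinforcement learning directly to the pre‑trained base model. The policy is optimized with Group Relative Policy Optimization (GRPO), which drops the separate critic model and instead estimates each response's advantage relative to a group of responses sampled for the same prompt. The rewards themselves are rule‑based (answer accuracy and output format) rather than produced by a learned reward model, which keeps the method simple and efficient to scale. DeepSeek‑R1 adds a small curated set of cold‑start examples for an initial SFT step, followed by further RL and SFT phases (including rejection sampling to build additional training data), to improve readability and alignment with human preferences.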
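The group‑relative advantage at the core of GRPO can be sketched as follows: several completions are sampled for one prompt, each is scored by a rule‑based reward, and each reward is normalized against the group's mean and standard deviation. The `<answer>` tag check, the 1.0 / 0.5 reward weights, and the use of the population standard deviation are illustrative assumptions, not DeepSeek's exact reward rules; the full GRPO objective also includes a clipped policy ratio and a KL term that are omitted here.

```python
import re
import statistics

def rule_based_reward(completion, reference_answer):
    """Toy reward: 0.5 for using the answer tags, plus 1.0 if the tagged answer is correct."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    format_reward = 0.5 if match else 0.0
    correct = bool(match) and match.group(1).strip() == reference_answer
    return format_reward + (1.0 if correct else 0.0)

def group_relative_advantages(rewards):
    """A_i = (r_i - mean(group)) / std(group); all zeros if every reward is equal."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four hypothetical completions sampled for the prompt "What is 2 + 2?".
completions = [
    "<answer>4</answer>",
    "<answer>5</answer>",
    "4",
    "<answer> 4 </answer>",
]
rewards = [rule_based_reward(c, "4") for c in completions]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average receive a positive advantage and are reinforced, while below‑average ones are discouraged, all without training a separate value network.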

DeepSeek‑R1‑Zero training pipeline
DeepSeek‑R1 fine‑tuning stages

8.3 Industry Impact

DeepSeek‑R1’s inference cost has been reported at roughly 1/30 of that of comparable OpenAI reasoning models, and its open‑weight release has led major cloud providers (AWS, Azure, Google Cloud) to add the model to their offerings, accelerating democratized access to advanced LLM capabilities.

Market impact of DeepSeek‑R1

Conclusion

The evolution from the 2017 Transformer breakthrough to the 2025 DeepSeek‑R1 highlights four pivotal milestones: (1) the Transformer foundation enabling parallel, scalable language modeling; (2) the scaling surge exemplified by GPT‑3, proving that larger models trained on more data improve performance; (3) the democratizing effect of ChatGPT and instruction‑tuned models that brought conversational AI to a broad audience; and (4) the cost‑efficient, open‑weight paradigm of DeepSeek‑R1, which makes state‑of‑the‑art LLM capabilities affordable and widely accessible. Together these advances chart a path toward increasingly capable, multimodal, and accessible AI systems.
