From Transformers to DeepSeek‑R1: How LLMs Evolved to 2025

This article chronicles the evolution of large language models from the 2017 Transformer breakthrough through BERT, GPT series, multimodal models, and recent cost‑efficient innovations like DeepSeek‑R1, highlighting key architectures, training methods, alignment techniques, and their transformative impact on AI applications.

In early 2025, the Chinese company DeepSeek released DeepSeek‑R1, a groundbreaking and cost‑effective large language model (LLM) that triggered a major shift in the AI field.

This article reviews the history of LLMs, beginning with the revolutionary Transformer architecture introduced in 2017.

1. What Is a Language Model?

A language model is an AI system designed to process, understand, and generate human‑like language. It learns patterns and structures from massive datasets, enabling applications such as translation, summarization, chatbots, and content generation.

1.1 Large Language Models (LLMs)

LLMs are a subset of language models that are orders of magnitude larger, often containing billions of parameters (e.g., GPT‑3 with 175 billion). Their scale gives them superior performance across a wide range of tasks.

The term “LLM” gained attention between 2018‑2019 with the rise of Transformer‑based models such as BERT and GPT‑1, and became widespread after GPT‑3’s release in 2020.

1.2 Autoregressive Language Models

Most LLMs operate autoregressively: they predict the next token based on preceding text, learning complex language patterns and excelling at text generation.

Mathematically, an LLM defines a probability distribution over the next token \(x_n\) given the previous tokens \(x_{1:n-1}\):

\[ p(x_n \mid x_{1:n-1}) \]

During generation, decoding strategies such as greedy search or sampling from this distribution are used, producing diverse outputs similar to human language variability.
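
As a toy illustration (hard‑coded probabilities standing in for a real model's softmax output, so the vocabulary and numbers here are made up), the following sketch contrasts greedy decoding with sampling from \(p(x_n \mid x_{1:n-1})\):

```python
import random

# Toy next-token distribution p(x_n | x_{1:n-1}) for some fixed prefix.
# In a real LLM these probabilities come from a softmax over the full vocabulary.
next_token_probs = {"cat": 0.45, "dog": 0.30, "car": 0.15, "idea": 0.10}

# Greedy decoding: always pick the single most probable token (deterministic).
greedy_token = max(next_token_probs, key=next_token_probs.get)

# Sampling: draw a token in proportion to its probability (varied outputs).
tokens, probs = zip(*next_token_probs.items())
sampled_token = random.choices(tokens, weights=probs, k=1)[0]

print("greedy :", greedy_token)
print("sampled:", sampled_token)
```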

1.3 Generation Capability

LLMs generate text token‑by‑token from a prompt, iteratively adding predicted tokens until a stop condition is met, much like a “word‑chain” game.
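
A minimal sketch of that loop, with a hypothetical `next_token_distribution` function standing in for a real model forward pass:

```python
import random

def next_token_distribution(tokens):
    """Hypothetical stand-in for a model call: returns p(next token | tokens).
    A real LLM would run a Transformer forward pass here."""
    vocab = ["hello", "world", "!", "<eos>"]
    weights = [0.2, 0.4, 0.2, 0.2]
    return dict(zip(vocab, weights))

def generate(prompt_tokens, max_new_tokens=20, eos="<eos>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)
        choices, weights = zip(*probs.items())
        token = random.choices(choices, weights=weights, k=1)[0]
        if token == eos:               # stop condition: end-of-sequence token
            break
        tokens.append(token)           # append and continue the "word chain"
    return tokens

print(generate(["hello"]))
```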

This capability fuels applications such as creative writing, conversational AI, and automated customer support.

2. The Transformer Revolution (2017)

Vaswani et al. introduced the Transformer in the paper “Attention Is All You Need,” solving key limitations of earlier RNN and LSTM models, which struggled with long‑range dependencies and suffered from vanishing‑gradient problems.

2.1 Key Innovations of the Transformer

Self‑Attention: Allows each token to weigh its relevance to every other token, enabling parallel computation and global context understanding.

Q, K, and V matrices represent queries, keys, and values; the scaled dot‑product of Q and K determines the attention weights, which are then applied to V.
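
A minimal NumPy sketch of scaled dot‑product attention, \( \mathrm{softmax}(QK^T/\sqrt{d_k})\,V \), with Q, K, and V taken from the same input to give self‑attention (learned projection matrices are omitted for brevity):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # token-to-token relevance
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

# Tiny self-attention example: 3 tokens with 4-dimensional embeddings.
x = np.random.default_rng(0).normal(size=(3, 4))
print(scaled_dot_product_attention(x, x, x).shape)    # (3, 4)
```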

Multi‑Head Attention: Multiple attention heads operate in parallel, each focusing on different aspects of the input.
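
A simplified, self‑contained sketch of the idea: split the embedding into head‑sized slices, attend within each slice, and concatenate the results (real implementations also learn separate Q/K/V and output projections per head):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, num_heads):
    """x: (seq_len, d_model); d_model must be divisible by num_heads."""
    d_model = x.shape[1]
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        xh = x[:, h * d_head:(h + 1) * d_head]        # this head's slice
        scores = xh @ xh.T / np.sqrt(d_head)          # Q = K = V = xh
        heads.append(softmax(scores) @ xh)
    return np.concatenate(heads, axis=-1)             # (seq_len, d_model)

x = np.random.default_rng(1).normal(size=(3, 8))
print(multi_head_self_attention(x, num_heads=2).shape)  # (3, 8)
```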

Additional components include Feed‑Forward Networks, Layer Normalization, residual connections, and positional encodings that inject order information.
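
For instance, the original paper's sinusoidal positional encodings, \( PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}}) \) and \( PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}}) \), can be generated as follows (assuming an even \(d_{model}\)):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

print(positional_encoding(seq_len=4, d_model=8).shape)  # (4, 8)
```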

Impact on Language Modeling

Scalability: Full parallelism enables training on massive datasets.

Contextual Understanding: Self‑attention captures both local and global dependencies.

3. The Pre‑Training Era (2018‑2020)

The Transformer paved the way for pre‑trained models such as BERT and GPT, which demonstrated the power of large‑scale pre‑training followed by task‑specific fine‑tuning.

3.1 BERT – Bidirectional Context (2018)

Google’s BERT uses a bidirectional encoder to capture context from both directions, achieving state‑of‑the‑art results on tasks like classification, NER, and sentiment analysis.

Key innovations:

Masked Language Modeling (MLM): Randomly mask tokens and predict them using full sentence context (see the sketch after this list).

Next Sentence Prediction (NSP): Predict whether two sentences appear consecutively, improving coherence understanding.
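
A minimal sketch of the MLM masking step mentioned above (simplified: BERT's actual recipe also sometimes substitutes a random token or keeps the original instead of always inserting [MASK]):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly hide ~15% of tokens; the model must predict the originals."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)        # target the model should recover
        else:
            masked.append(tok)
            labels.append(None)       # no loss computed at this position
    return masked, labels

print(mask_tokens("the cat sat on the mat".split()))
```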

3.2 GPT – Autoregressive Generation (2018‑2020)

OpenAI’s GPT series focuses on autoregressive pre‑training using the Transformer decoder, excelling at text generation.

GPT‑2 (1.5 billion parameters) showed impressive zero‑shot abilities, while GPT‑3 (175 billion) demonstrated few‑shot and zero‑shot learning across diverse tasks, from creative writing to programming.

These models proved that scaling up data, parameters, and compute dramatically improves performance.

4. Alignment – Bridging AI and Human Values (2021‑2022)

GPT‑3’s near‑human text quality raised concerns about hallucinations and alignment. Researchers introduced Supervised Fine‑Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to improve instruction following and reduce hallucinations.

4.1 Supervised Fine‑Tuning (SFT)

SFT trains models on high‑quality input‑output pairs, teaching them to follow instructions.
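
Conceptually, each SFT example is an instruction paired with a desired response; a common practice (an assumption here, not a detail from this article) is to compute the training loss only on the response tokens, sketched below with a hypothetical whitespace tokenizer:

```python
# One SFT training example: an instruction and the desired output.
example = {
    "prompt": "Translate to French: Good morning",
    "response": "Bonjour",
}

def tokenize(text):
    # Hypothetical tokenizer; real setups use the model's own tokenizer.
    return text.split()

prompt_ids = tokenize(example["prompt"])
response_ids = tokenize(example["response"])

input_ids = prompt_ids + response_ids
# Loss mask: only response positions contribute to the training loss,
# so the model learns to produce the answer, not to echo the instruction.
loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
print(list(zip(input_ids, loss_mask)))
```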

Its limitations include scalability (human annotation is labor‑intensive) and limited generalization to unseen tasks.

4.2 Reinforcement Learning from Human Feedback (RLHF)

RLHF ranks multiple model outputs based on human preferences, trains a reward model, and then fine‑tunes the LLM with Proximal Policy Optimization (PPO), achieving better alignment and safety.
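
The reward‑model step can be summarized by a pairwise (Bradley‑Terry style) loss: the human‑preferred answer should receive a higher reward than the rejected one. A minimal sketch with plain numbers standing in for reward‑model outputs:

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred
    answer already receives the higher reward."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# The reward model is trained to minimize this loss on human rankings;
# the LLM is then fine-tuned (e.g. with PPO) to maximize the learned reward.
print(pairwise_reward_loss(2.0, 0.5))   # preference satisfied -> small loss
print(pairwise_reward_loss(0.5, 2.0))   # preference violated  -> large loss
```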

4.3 ChatGPT – Conversational AI (2022)

ChatGPT, built on GPT‑3.5 and InstructGPT, was fine‑tuned on massive dialogue data and further refined with RLHF, delivering coherent multi‑turn conversations and marking a “ChatGPT moment” for mainstream AI adoption.

5. Multimodal Models (2023‑2024)

Models such as GPT‑4V and GPT‑4o integrate text, images, audio, and video, enabling richer interaction and problem‑solving across domains like healthcare and education.

5.1 GPT‑4V – Vision Meets Language

GPT‑4V combines GPT‑4’s language abilities with advanced computer vision, allowing image captioning, visual question answering, and cross‑modal reasoning.

5.2 GPT‑4o – Full‑Modality Frontier

GPT‑4o adds audio and video inputs, enabling transcription, video description, and text‑to‑audio synthesis, supporting creative and design workflows.

6. Open‑Source and Open‑Weight Models (2023‑2024)

Open‑weight LLMs such as Meta’s LLaMA series and Mistral 7B democratize access to advanced AI: their weights are released for fine‑tuning and customization, even though training data and full training pipelines often remain closed.

Community platforms such as Hugging Face, together with parameter‑efficient fine‑tuning techniques like LoRA (available through the PEFT library), enable efficient fine‑tuning and spur innovation across medical, legal, and creative domains.

7. Reasoning‑Focused Models (2024)

2024 saw a shift toward reasoning‑oriented LLMs that emulate System 2 thinking, inspired by dual‑process theory.

7.1 OpenAI‑o1 – A Leap in Reasoning (2024)

OpenAI’s o1‑preview introduced “Long Chain‑of‑Thought” (Long CoT) reasoning, allowing the model to internally decompose problems, critique solutions, and explore alternatives before presenting a concise answer.

8. Cost‑Effective Inference Models: DeepSeek‑R1 (2025)

LLMs typically demand massive compute. DeepSeek‑R1 and its predecessor DeepSeek‑V3 provide high performance at a fraction of the cost.

8.1 DeepSeek‑V3 (Dec 2024)

DeepSeek‑V3, a 671‑billion‑parameter Mixture‑of‑Experts (MoE) model that activates roughly 37 billion parameters per token, uses Multi‑Head Latent Attention, DeepSeekMoE, and Multi‑Token Prediction to achieve state‑of‑the‑art performance while reducing memory usage and inference cost.
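
The MoE idea is that only a few expert sub‑networks are activated per token, which is how such a large model keeps per‑token compute low. The sketch below shows generic top‑k routing with made‑up dimensions, not DeepSeekMoE's actual routing scheme:

```python
import numpy as np

def top_k_moe(token_embedding, expert_fns, gate_weights, k=2):
    """Route one token to its top-k experts and mix their outputs.
    expert_fns: list of callables; gate_weights: (d_model, num_experts)."""
    logits = token_embedding @ gate_weights            # one score per expert
    top_experts = np.argsort(logits)[-k:]              # indices of top-k experts
    gates = np.exp(logits[top_experts])
    gates /= gates.sum()                               # normalize over chosen experts
    return sum(g * expert_fns[e](token_embedding) for g, e in zip(gates, top_experts))

rng = np.random.default_rng(0)
d, num_experts = 8, 4
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(num_experts)]
out = top_k_moe(rng.normal(size=d), experts, rng.normal(size=(d, num_experts)), k=2)
print(out.shape)  # (8,)
```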

8.2 DeepSeek‑R1‑Zero and DeepSeek‑R1 (Jan 2025)

DeepSeek‑R1‑Zero builds on DeepSeek‑V3 using Group Relative Policy Optimization (GRPO) with rule‑based rewards, skipping the usual supervised fine‑tuning stage.
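
In GRPO, several answers are sampled per prompt, scored with rule‑based rewards (e.g. whether the final answer is correct), and each answer's advantage is its reward normalized within the group, removing the need for a separate value model. A simplified sketch of that advantage computation (the full objective also includes a clipped policy ratio and a KL penalty, omitted here):

```python
import statistics

def group_relative_advantages(rewards):
    """Advantage of each sampled answer = (reward - group mean) / group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Rule-based rewards for 4 sampled answers to the same math prompt
# (1.0 = correct final answer, 0.0 = wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))   # correct answers get positive advantage
```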

DeepSeek‑R1 adds a small curated dataset and additional RL training to improve readability and alignment with human preferences.

Distilled versions ranging from 1.5 billion to 70 billion parameters enable deployment on weaker hardware while retaining strong reasoning abilities.

DeepSeek‑R1 achieves competitive scores on math, coding, commonsense, and writing benchmarks at 20‑50× lower cost than comparable closed‑source models.

Conclusion

From the 2017 Transformer to the 2025 DeepSeek‑R1, the evolution of large language models marks a revolutionary chapter in AI. Key milestones include Transformers (2017), GPT‑3 (2020), ChatGPT (2022), and DeepSeek‑R1 (2025), each pushing the boundaries of scale, accessibility, and alignment.

LLMs are transitioning into versatile, multimodal reasoning systems that serve both general users and specialized applications, driven by breakthroughs in architecture, training scale, and cost‑effective deployment.
