From Transformers to DeepSeek‑R1: How LLMs Evolved to 2025

This article chronicles the evolution of large language models from the 2017 Transformer breakthrough through BERT, GPT series, multimodal models, and recent cost‑efficient innovations like DeepSeek‑R1, highlighting key architectures, training methods, alignment techniques, and their transformative impact on AI applications.

In early 2025, the Chinese company DeepSeek released DeepSeek‑R1, a groundbreaking and cost‑effective large language model (LLM) that triggered a major shift in the AI field.

This article reviews the history of LLMs, beginning with the revolutionary Transformer architecture introduced in 2017.

1. What Is a Language Model?

A language model is an AI system designed to process, understand, and generate human‑like language. It learns patterns and structures from massive datasets, enabling applications such as translation, summarization, chatbots, and content generation.

1.1 Large Language Models (LLMs)

LLMs are a subset of language models that are orders of magnitude larger, often containing billions of parameters (e.g., GPT‑3 with 175 billion). Their scale gives them superior performance across a wide range of tasks.

The term “LLM” gained attention between 2018‑2019 with the rise of Transformer‑based models such as BERT and GPT‑1, and became widespread after GPT‑3’s release in 2020.

1.2 Autoregressive Language Models

Most LLMs operate autoregressively: they predict the next token based on preceding text, learning complex language patterns and excelling at text generation.

Mathematically, an LLM defines a probability distribution over the next token \(x_n\) given the previous tokens \(x_{1:n-1}\):

\[ p(x_n \mid x_{1:n-1}) \]

During generation, decoding strategies such as greedy search or sampling from this distribution are used, producing diverse outputs similar to human language variability.
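
As a toy illustration (hard‑coded probabilities standing in for a real model's softmax output, so the vocabulary and numbers here are made up), the following sketch contrasts greedy decoding with sampling from \(p(x_n \mid x_{1:n-1})\):

```python
import random

# Toy next-token distribution p(x_n | x_{1:n-1}) for some fixed prefix.
# In a real LLM these probabilities come from a softmax over the full vocabulary.
next_token_probs = {"cat": 0.45, "dog": 0.30, "car": 0.15, "idea": 0.10}

# Greedy decoding: always pick the single most probable token (deterministic).
greedy_token = max(next_token_probs, key=next_token_probs.get)

# Sampling: draw a token in proportion to its probability (varied outputs).
tokens, probs = zip(*next_token_probs.items())
sampled_token = random.choices(tokens, weights=probs, k=1)[0]

print("greedy :", greedy_token)
print("sampled:", sampled_token)
```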

1.3 Generation Capability

LLMs generate text token‑by‑token from a prompt, iteratively adding predicted tokens until a stop condition is met, much like a “word‑chain” game.
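
A minimal sketch of that loop, with a hypothetical `next_token_distribution` function standing in for a real model forward pass:

```python
import random

def next_token_distribution(tokens):
    """Hypothetical stand-in for a model call: returns p(next token | tokens).
    A real LLM would run a Transformer forward pass here."""
    vocab = ["hello", "world", "!", "<eos>"]
    weights = [0.2, 0.4, 0.2, 0.2]
    return dict(zip(vocab, weights))

def generate(prompt_tokens, max_new_tokens=20, eos="<eos>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)
        choices, weights = zip(*probs.items())
        token = random.choices(choices, weights=weights, k=1)[0]
        if token == eos:               # stop condition: end-of-sequence token
            break
        tokens.append(token)           # append and continue the "word chain"
    return tokens

print(generate(["hello"]))
```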

This capability fuels applications such as creative writing, conversational AI, and automated customer support.

2. The Transformer Revolution (2017)

Vaswani et al. introduced the Transformer in the paper “Attention Is All You Need,” solving key limitations of earlier RNN and LSTM models, which struggled with long‑range dependencies and suffered from vanishing‑gradient problems.

2.1 Key Innovations of the Transformer

Self‑Attention: Allows each token to weigh its relevance to every other token, enabling parallel computation and global context understanding.

Q, K, and V matrices represent queries, keys, and values; the scaled dot‑product of Q and K determines the attention weights, which are then applied to V.
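
A minimal NumPy sketch of scaled dot‑product attention, \( \mathrm{softmax}(QK^T/\sqrt{d_k})\,V \), with Q, K, and V taken from the same input to give self‑attention (learned projection matrices are omitted for brevity):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # token-to-token relevance
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

# Tiny self-attention example: 3 tokens with 4-dimensional embeddings.
x = np.random.default_rng(0).normal(size=(3, 4))
print(scaled_dot_product_attention(x, x, x).shape)    # (3, 4)
```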

Multi‑Head Attention: Multiple attention heads operate in parallel, each focusing on different aspects of the input.
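
A simplified, self‑contained sketch of the idea: split the embedding into head‑sized slices, attend within each slice, and concatenate the results (real implementations also learn separate Q/K/V and output projections per head):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, num_heads):
    """x: (seq_len, d_model); d_model must be divisible by num_heads."""
    d_model = x.shape[1]
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        xh = x[:, h * d_head:(h + 1) * d_head]        # this head's slice
        scores = xh @ xh.T / np.sqrt(d_head)          # Q = K = V = xh
        heads.append(softmax(scores) @ xh)
    return np.concatenate(heads, axis=-1)             # (seq_len, d_model)

x = np.random.default_rng(1).normal(size=(3, 8))
print(multi_head_self_attention(x, num_heads=2).shape)  # (3, 8)
```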

Additional components include Feed‑Forward Networks, Layer Normalization, residual connections, and positional encodings that inject order information.
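
For instance, the original paper's sinusoidal positional encodings, \( PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}}) \) and \( PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}}) \), can be generated as follows (assuming an even \(d_{model}\)):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

print(positional_encoding(seq_len=4, d_model=8).shape)  # (4, 8)
```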

Impact on Language Modeling

Scalability: Full parallelism enables training on massive datasets.

Contextual Understanding: Self‑attention captures both local and global dependencies.

3. The Pre‑Training Era (2018‑2020)

The Transformer paved the way for pre‑trained models such as BERT and GPT, which demonstrated the power of large‑scale pre‑training followed by task‑specific fine‑tuning.

3.1 BERT – Bidirectional Context (2018)

Google’s BERT uses a bidirectional encoder to capture context from both directions, achieving state‑of‑the‑art results on tasks like classification, NER, and sentiment analysis.

Key innovations:

Masked Language Modeling (MLM): Randomly mask tokens and predict them using full sentence context (see the sketch after this list).

Next Sentence Prediction (NSP): Predict whether two sentences appear consecutively, improving coherence understanding.
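
A minimal sketch of the MLM masking step mentioned above (simplified: BERT's actual recipe also sometimes substitutes a random token or keeps the original instead of always inserting [MASK]):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly hide ~15% of tokens; the model must predict the originals."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)        # target the model should recover
        else:
            masked.append(tok)
            labels.append(None)       # no loss computed at this position
    return masked, labels

print(mask_tokens("the cat sat on the mat".split()))
```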

3.2 GPT – Autoregressive Generation (2018‑2020)

OpenAI’s GPT series focuses on autoregressive pre‑training using the Transformer decoder, excelling at text generation.

GPT‑2 (1.5 billion parameters) showed impressive zero‑shot abilities, while GPT‑3 (175 billion) demonstrated few‑shot and zero‑shot learning across diverse tasks, from creative writing to programming.

These models proved that scaling up data, parameters, and compute dramatically improves performance.

4. Alignment – Bridging AI and Human Values (2021‑2022)

GPT‑3’s near‑human text quality raised concerns about hallucinations and alignment. Researchers introduced Supervised Fine‑Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to improve instruction following and reduce hallucinations.

4.1 Supervised Fine‑Tuning (SFT)

SFT trains models on high‑quality input‑output pairs, teaching them to follow instructions.
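
Conceptually, each SFT example is an instruction paired with a desired response; a common practice (an assumption here, not a detail from this article) is to compute the training loss only on the response tokens, sketched below with a hypothetical whitespace tokenizer:

```python
# One SFT training example: an instruction and the desired output.
example = {
    "prompt": "Translate to French: Good morning",
    "response": "Bonjour",
}

def tokenize(text):
    # Hypothetical tokenizer; real setups use the model's own tokenizer.
    return text.split()

prompt_ids = tokenize(example["prompt"])
response_ids = tokenize(example["response"])

input_ids = prompt_ids + response_ids
# Loss mask: only response positions contribute to the training loss,
# so the model learns to produce the answer, not to echo the instruction.
loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
print(list(zip(input_ids, loss_mask)))
```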

Its limitations include scalability (human annotation is labor‑intensive) and limited generalization to unseen tasks.

4.2 Reinforcement Learning from Human Feedback (RLHF)

RLHF ranks multiple model outputs based on human preferences, trains a reward model, and then fine‑tunes the LLM with Proximal Policy Optimization (PPO), achieving better alignment and safety.
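
The reward‑model step can be summarized by a pairwise (Bradley‑Terry style) loss: the human‑preferred answer should receive a higher reward than the rejected one. A minimal sketch with plain numbers standing in for reward‑model outputs:

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred
    answer already receives the higher reward."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# The reward model is trained to minimize this loss on human rankings;
# the LLM is then fine-tuned (e.g. with PPO) to maximize the learned reward.
print(pairwise_reward_loss(2.0, 0.5))   # preference satisfied -> small loss
print(pairwise_reward_loss(0.5, 2.0))   # preference violated  -> large loss
```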

4.3 ChatGPT – Conversational AI (2022)

ChatGPT, built on GPT‑3.5 and InstructGPT, was fine‑tuned on massive dialogue data and further refined with RLHF, delivering coherent multi‑turn conversations and marking a “ChatGPT moment” for mainstream AI adoption.

5. Multimodal Models (2023‑2024)

Models such as GPT‑4V and GPT‑4o integrate text, images, audio, and video, enabling richer interaction and problem‑solving across domains like healthcare and education.

5.1 GPT‑4V – Vision Meets Language

GPT‑4V combines GPT‑4’s language abilities with advanced computer vision, allowing image captioning, visual question answering, and cross‑modal reasoning.

5.2 GPT‑4o – Full‑Modality Frontier

GPT‑4o adds audio and video inputs, enabling transcription, video description, and text‑to‑audio synthesis, supporting creative and design workflows.

6. Open‑Source and Open‑Weight Models (2023‑2024)

Open‑weight LLMs such as Meta’s LLaMA series and Mistral 7B democratize access to advanced AI: their weights are released for fine‑tuning and customization, even though training data and full training pipelines often remain closed.

Community platforms such as Hugging Face, together with parameter‑efficient fine‑tuning techniques like LoRA (available through the PEFT library), enable efficient fine‑tuning and spur innovation across medical, legal, and creative domains.

7. Reasoning‑Focused Models (2024)

2024 saw a shift toward reasoning‑oriented LLMs that emulate System 2 thinking, inspired by dual‑process theory.

7.1 OpenAI‑o1 – A Leap in Reasoning (2024)

OpenAI’s o1‑preview introduced “Long Chain‑of‑Thought” (Long CoT) reasoning, allowing the model to internally decompose problems, critique solutions, and explore alternatives before presenting a concise answer.

8. Cost‑Effective Inference Models: DeepSeek‑R1 (2025)

LLMs typically demand massive compute. DeepSeek‑R1 and its predecessor DeepSeek‑V3 provide high performance at a fraction of the cost.

8.1 DeepSeek‑V3 (Dec 2024)

DeepSeek‑V3, a 671‑billion‑parameter Mixture‑of‑Experts (MoE) model that activates roughly 37 billion parameters per token, uses Multi‑Head Latent Attention, DeepSeekMoE, and Multi‑Token Prediction to achieve state‑of‑the‑art performance while reducing memory usage and inference cost.
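
The MoE idea is that only a few expert sub‑networks are activated per token, which is how such a large model keeps per‑token compute low. The sketch below shows generic top‑k routing with made‑up dimensions, not DeepSeekMoE's actual routing scheme:

```python
import numpy as np

def top_k_moe(token_embedding, expert_fns, gate_weights, k=2):
    """Route one token to its top-k experts and mix their outputs.
    expert_fns: list of callables; gate_weights: (d_model, num_experts)."""
    logits = token_embedding @ gate_weights            # one score per expert
    top_experts = np.argsort(logits)[-k:]              # indices of top-k experts
    gates = np.exp(logits[top_experts])
    gates /= gates.sum()                               # normalize over chosen experts
    return sum(g * expert_fns[e](token_embedding) for g, e in zip(gates, top_experts))

rng = np.random.default_rng(0)
d, num_experts = 8, 4
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(num_experts)]
out = top_k_moe(rng.normal(size=d), experts, rng.normal(size=(d, num_experts)), k=2)
print(out.shape)  # (8,)
```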

8.2 DeepSeek‑R1‑Zero and DeepSeek‑R1 (Jan 2025)

DeepSeek‑R1‑Zero builds on DeepSeek‑V3 using Group Relative Policy Optimization (GRPO) with rule‑based rewards, skipping the usual supervised fine‑tuning stage.
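
In GRPO, several answers are sampled per prompt, scored with rule‑based rewards (e.g. whether the final answer is correct), and each answer's advantage is its reward normalized within the group, removing the need for a separate value model. A simplified sketch of that advantage computation (the full objective also includes a clipped policy ratio and a KL penalty, omitted here):

```python
import statistics

def group_relative_advantages(rewards):
    """Advantage of each sampled answer = (reward - group mean) / group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Rule-based rewards for 4 sampled answers to the same math prompt
# (1.0 = correct final answer, 0.0 = wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))   # correct answers get positive advantage
```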

DeepSeek‑R1 adds a small curated dataset and additional RL training to improve readability and alignment with human preferences.

Distilled versions ranging from 1.5 billion to 70 billion parameters enable deployment on weaker hardware while retaining strong reasoning abilities.

DeepSeek‑R1 achieves competitive scores on math, coding, commonsense, and writing benchmarks at 20‑50× lower cost than comparable closed‑source models.

Conclusion

From the 2017 Transformer to the 2025 DeepSeek‑R1, the evolution of large language models marks a revolutionary chapter in AI. Key milestones include Transformers (2017), GPT‑3 (2020), ChatGPT (2022), and DeepSeek‑R1 (2025), each pushing the boundaries of scale, accessibility, and alignment.

LLMs are transitioning into versatile, multimodal reasoning systems that serve both general users and specialized applications, driven by breakthroughs in architecture, training scale, and cost‑effective deployment.
