From Transformers to DeepSeek-R1: Evolution of Large Language Models

This article chronicles the rapid development of large language models since the 2017 introduction of the Transformer architecture, covering BERT, the GPT series, multimodal systems, and the cost‑effective DeepSeek‑R1, and highlighting key innovations, scaling trends, alignment techniques, and their transformative impact across AI research and industry.

Introduction

In early 2025, the Chinese AI company DeepSeek released its cost‑effective large language model DeepSeek‑R1, sparking a major shift in the AI landscape. This article reviews the development of LLMs, beginning with the 2017 Transformer architecture that reshaped natural‑language processing through self‑attention.

Early Milestones (2017‑2020)

Transformer Revolution (2017)

Vaswani et al. introduced the Transformer in "Attention Is All You Need," replacing recurrent architectures (RNNs/LSTMs) with parallelizable self‑attention and enabling training at much larger scale.

Rise of BERT and GPT (2018‑2020)

BERT (2018) demonstrated superior bidirectional contextual understanding, while the GPT series advanced generative pretraining for text generation. GPT‑3 (2020), with 175 billion parameters, showcased few‑shot and zero‑shot learning, though hallucination remained a challenge.
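
Few‑shot learning means the task is specified entirely through examples placed in the prompt, with no gradient updates. The snippet below is a generic illustration of such a prompt; the sentiment‑classification task is an assumed example, not one from the article:

```python
# Illustrative few-shot prompt for sentiment classification (assumed task).
# The model infers the task from the in-context examples alone; no fine-tuning
# or gradient updates are involved.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: Positive

Review: "It stopped working after a week and support never replied."
Sentiment: Negative

Review: "Setup took five minutes and everything just worked."
Sentiment:"""

# Sending this prompt to a GPT-3-class model would typically yield "Positive".
print(few_shot_prompt)
```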

Alignment Techniques (2021‑2022)

OpenAI addressed hallucination with supervised fine‑tuning (SFT) and reinforcement learning from human feedback (RLHF), improving instruction following and reducing false outputs.
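
The reward‑modeling step of RLHF is commonly trained with a pairwise (Bradley‑Terry style) loss over human preference comparisons: the preferred response should receive a higher scalar reward than the rejected one. The PyTorch sketch below shows only that loss; the toy reward values are assumptions for illustration, not details from the article:

```python
import torch
import torch.nn.functional as F

def reward_pairwise_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the reward of the human-preferred
    response above the reward of the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: scalar rewards for a batch of 4 preference pairs.
reward_chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
reward_rejected = torch.tensor([0.5, 0.4, -0.1, 1.5])
print(reward_pairwise_loss(reward_chosen, reward_rejected))  # lower is better
```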

Multimodal and Reasoning Models (2022‑2025)

ChatGPT and GPT‑4 Series

ChatGPT (2022) refined dialogue via RLHF. GPT‑4V (2023) added visual understanding, while GPT‑4o (2024) incorporated audio and video, enabling richer multimodal interactions.

Reasoning Models

OpenAI's o1 (2024) made extended chain‑of‑thought reasoning a core, trained capability, achieving performance approaching doctoral‑level experts on scientific benchmarks. DeepSeek‑R1 (2025) combined comparable reasoning ability with cost‑efficiency, offering open‑weight, high‑performance LLMs.
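
Chain‑of‑thought prompting asks a model to produce intermediate reasoning steps before its final answer; reasoning models such as o1 generate similar traces internally at inference time. The prompt below is a generic illustration (the arithmetic word problem is an assumption, not an example from the article):

```python
# Illustrative chain-of-thought style prompt (assumed example).
# The model is asked to show intermediate steps before committing to an answer.
cot_prompt = (
    "A train travels 60 km in 45 minutes. What is its average speed in km/h?\n"
    "Think step by step, then give the final answer on its own line."
)
# A reasoning-style completion would first derive 45 min = 0.75 h,
# then 60 / 0.75 = 80, and finally answer: 80 km/h.
print(cot_prompt)
```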

Technical Foundations

Self‑Attention and Multi‑Head Attention

Self‑attention computes query (Q), key (K), and value (V) matrices, allowing each token to weigh all others. Multi‑head attention runs several attention heads in parallel, enriching contextual representation.
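
As a minimal sketch of the mechanics (omitting masking, dropout, and the final output projection used in full implementations), the NumPy code below computes scaled dot‑product attention for one head and then runs several randomly projected heads in parallel:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each token's query is compared against every key; the softmax weights
    then mix the value vectors. Shapes: (seq_len, d_k) for Q/K, (seq_len, d_v) for V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V

def multi_head_attention(x, num_heads, rng=np.random.default_rng(0)):
    """Run several attention heads in parallel on different projections of x,
    then concatenate. Random matrices stand in for learned weights."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
                      for _ in range(3))
        heads.append(scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv))
    return np.concatenate(heads, axis=-1)           # (seq_len, d_model)

x = np.random.default_rng(1).standard_normal((5, 64))   # 5 tokens, d_model = 64
print(multi_head_attention(x, num_heads=8).shape)        # (5, 64)
```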

Feed‑Forward Networks, Layer Normalization, and Positional Encoding

Each Transformer layer includes a feed‑forward network, layer‑norm, and residual connections. Positional encodings inject sequence order information.
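
The sketch below wires these pieces into one encoder layer: sinusoidal positional encodings added to token embeddings, then an attention sublayer and a position‑wise feed‑forward network, each wrapped in a residual connection plus layer normalization. The weights are random and an identity function stands in for attention (the multi‑head attention from the previous sketch could be plugged in), so this illustrates the data flow rather than a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sinusoidal_positional_encoding(seq_len, d_model):
    """Inject sequence-order information: even dimensions use sin, odd use cos,
    at geometrically spaced frequencies (Vaswani et al., 2017)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def layer_norm(x, eps=1e-5):
    """Normalize each token's feature vector to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(x, attn_fn, d_ff=256):
    """One encoder layer: attention and a position-wise feed-forward network,
    each followed by a residual connection and layer normalization."""
    d_model = x.shape[-1]
    W1 = rng.standard_normal((d_model, d_ff)) * 0.02
    W2 = rng.standard_normal((d_ff, d_model)) * 0.02
    x = layer_norm(x + attn_fn(x))              # attention sublayer + residual
    ffn = np.maximum(0.0, x @ W1) @ W2          # ReLU feed-forward (biases omitted)
    return layer_norm(x + ffn)                  # feed-forward sublayer + residual

# Toy forward pass; an identity function stands in for multi-head attention.
tokens = rng.standard_normal((5, 64)) + sinusoidal_positional_encoding(5, 64)
print(encoder_layer(tokens, attn_fn=lambda t: t).shape)  # (5, 64)
```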

Scaling Trends

Model size, dataset scale, and compute power have driven performance gains. Larger models (e.g., GPT‑3, GPT‑4) achieve better generalization, but require massive resources.
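
To make the scale concrete, a common back‑of‑the‑envelope estimate puts a Transformer's non‑embedding parameter count at roughly 12 × n_layers × d_model²; with GPT‑3's published configuration (96 layers, hidden size 12,288) it lands close to the reported 175 billion parameters. The snippet below is only this rough estimate, not an exact accounting:

```python
def approx_transformer_params(n_layers: int, d_model: int) -> float:
    """Rough non-embedding parameter count for a decoder-only Transformer:
    ~4*d_model^2 for attention projections plus ~8*d_model^2 for the
    feed-forward network (with the conventional 4x expansion), per layer."""
    return 12 * n_layers * d_model ** 2

# GPT-3's published configuration: 96 layers, hidden size 12288.
print(f"{approx_transformer_params(96, 12288) / 1e9:.0f}B parameters")  # ~174B
```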

Open‑Weight and Open‑Source Models

Open‑weight LLMs released from 2023 onward (e.g., LLaMA, Mistral), together with earlier open‑source models (e.g., BERT, OPT), democratized access, fostering community‑driven innovation and domain‑specific adaptations.

Cost‑Effective Reasoning Models

DeepSeek‑V3 and DeepSeek‑R1

DeepSeek‑V3 (2024) introduced multi‑head latent attention, a mixture‑of‑experts (MoE) architecture, and multi‑token prediction, dramatically reducing training and inference costs. DeepSeek‑R1‑Zero skipped SFT entirely, relying on reinforcement learning with rule‑based rewards via GRPO (Group Relative Policy Optimization). DeepSeek‑R1 added a limited amount of high‑quality cold‑start data and additional RL stages, achieving competitive benchmark results at 20‑50× lower cost.
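
GRPO drops PPO's learned value model: for each prompt, a group of responses is sampled, scored with rule‑based rewards (e.g., answer correctness and format checks), and each response's advantage is its reward normalized within the group. The sketch below shows only that group‑relative advantage step with toy rewards; it illustrates the idea rather than reproducing DeepSeek's implementation:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantages: normalize each sampled response's reward by the
    mean and standard deviation of its own group, avoiding a learned critic."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Toy example: 6 responses sampled for one prompt, scored by rule-based rewards
# (1.0 for a correct, well-formatted answer; 0.0 otherwise).
rewards = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))
# Positive advantages for correct responses, negative for the rest; these then
# weight a clipped PPO-like policy-gradient objective.
```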

Impact and Outlook

Advancements from Transformers to DeepSeek‑R1 illustrate a revolution in AI, with scaling, alignment, and openness shaping future research and industry adoption.

References

Original article: https://medium.com/@lmpo/大型语言模型简史-从transformer-2017-到deepseek-r1-2025-cc54d658fb43