From Transformers to DeepSeek-R1: Evolution of Large Language Models
This article chronicles the rapid development of large language models since the 2017 introduction of the Transformer architecture, covering BERT, the GPT series, multimodal systems, and the cost-effective DeepSeek-R1, and highlighting key innovations, scaling trends, alignment techniques, and their transformative impact across AI research and industry.
Introduction
In early 2025, the Chinese AI company DeepSeek released the cost-effective large language model DeepSeek-R1, prompting a major shift in the AI landscape. This article reviews the development of LLMs, beginning with the 2017 Transformer architecture that reshaped natural-language processing through self-attention.
Early Milestones (2017‑2020)
Transformer Revolution (2017)
Vaswani et al. introduced the Transformer in "Attention is All You Need," replacing RNNs/LSTMs with parallel self‑attention, enabling large‑scale training.
Rise of BERT and GPT (2018‑2020)
BERT and GPT demonstrated superior contextual understanding and text generation. GPT‑3 (2020) with 175 billion parameters showcased few‑shot and zero‑shot learning, though hallucination remained a challenge.
Alignment Techniques (2021‑2022)
OpenAI addressed hallucination with supervised fine‑tuning (SFT) and reinforcement learning from human feedback (RLHF), improving instruction following and reducing false outputs.
Multimodal and Reasoning Models (2022‑2025)
ChatGPT and GPT‑4 Series
ChatGPT (2022) refined dialogue via RLHF. GPT‑4V (2023) added visual understanding, while GPT‑4o (2024) incorporated audio and video, enabling richer multimodal interactions.
Reasoning Models
OpenAI‑o1 (2024) introduced chain‑of‑thought reasoning, achieving near‑doctoral performance on scientific benchmarks. DeepSeek‑R1 (2025) combined reasoning with cost‑efficiency, offering open‑weight, high‑performance LLMs.
Technical Foundations
Self‑Attention and Multi‑Head Attention
Self‑attention computes query (Q), key (K), and value (V) matrices, allowing each token to weigh all others. Multi‑head attention runs several attention heads in parallel, enriching contextual representation.
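The scaled dot-product attention described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation; the projection matrices Wq, Wk, and Wv are randomly initialized stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Project each token into query, key, and value spaces.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    # Every token's query is scored against every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    # Output is a weighted mix of value vectors: tokens attend to all others.
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Multi-head attention simply runs several such heads in parallel on lower-dimensional projections and concatenates their outputs.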
Feed‑Forward Networks, Layer Normalization, and Positional Encoding
Each Transformer layer includes a feed‑forward network, layer‑norm, and residual connections. Positional encodings inject sequence order information.
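The sinusoidal positional encoding from the original Transformer paper can be sketched as follows; it assigns each position a fixed pattern of sines and cosines at geometrically spaced frequencies, which is then added to the token embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Angle rates follow pos / 10000^(2i / d_model), per "Attention is All You Need".
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)
```

Because attention itself is order-invariant, adding these encodings is what lets the model distinguish "dog bites man" from "man bites dog".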
Scaling Trends
Model size, dataset scale, and compute power have driven performance gains. Larger models (e.g., GPT‑3, GPT‑4) achieve better generalization, but require massive resources.
Open‑Weight and Open‑Source Models
From 2023 onward, open‑weight LLMs (e.g., LLaMA, Mistral), building on earlier open releases such as BERT and OPT, democratized access, fostering community‑driven innovation and domain‑specific adaptations.
Cost‑Effective Reasoning Models
DeepSeek‑V3 and DeepSeek‑R1
DeepSeek‑V3 (2024) introduced multi‑head latent attention, a mixture‑of‑experts (MoE) architecture, and multi‑token prediction, dramatically reducing inference cost. DeepSeek‑R1‑Zero skipped supervised fine‑tuning entirely, relying on rule‑based reinforcement learning with Group Relative Policy Optimization (GRPO). DeepSeek‑R1 then added a small amount of high‑quality supervised data and further RL stages, achieving competitive benchmark results at 20‑50× lower cost.
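The cost savings of MoE come from sparse routing: a learned gate sends each token to only a few experts, so compute scales with the number of active experts rather than the total. The sketch below is a toy illustration of top-k routing, not DeepSeek's actual architecture; the gating matrix Wg and the linear "experts" are hypothetical stand-ins for learned feed-forward blocks.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(token, experts, Wg, k=2):
    # The router scores every expert, but only the top-k actually run,
    # so per-token compute is k experts instead of len(experts).
    logits = token @ Wg
    topk = np.argsort(logits)[-k:]
    gates = softmax(logits[topk])  # renormalize over the selected experts
    return sum(g * experts[i](token) for g, i in zip(gates, topk))

rng = np.random.default_rng(1)
d, n_experts = 8, 4
# Each toy "expert" is a linear map standing in for a feed-forward network.
weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda t, W=W: t @ W for W in weights]
Wg = rng.normal(size=(d, n_experts))
token = rng.normal(size=d)
out = moe_forward(token, experts, Wg)
print(out.shape)  # (8,)
```

With k=2 of 4 experts active, this toy model does half the expert compute of a dense equivalent; production MoE models push that ratio much further.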
Impact and Outlook
Advancements from Transformers to DeepSeek‑R1 illustrate a revolution in AI, with scaling, alignment, and openness shaping future research and industry adoption.
References
Original article: https://medium.com/@lmpo/大型语言模型简史-从transformer-2017-到deepseek-r1-2025-cc54d658fb43