From GPT‑4 to Agentic AI: How LLM Architecture Evolved (2023‑2025)

Since GPT‑4’s 2023 debut, large language models have shifted from sheer scale toward efficiency‑driven designs (Mixture‑of‑Experts, Multi‑Head Latent Attention, and other new attention mechanisms), chain‑of‑thought reasoning, and agentic tool use. These shifts are reshaping benchmarks, commercial strategies, and the trajectory of AI.

DaTaobao Tech

1. GPT‑4 and the Scaling Paradigm

GPT‑4, released in March 2023, demonstrated that greater parameter counts, longer context windows (8K to 32K tokens), and a dense Transformer architecture could achieve near‑human performance on professional benchmarks, reinforcing the “scale‑is‑all” belief.

2. Emerging Limitations of Pure Scaling

By 2024 the community recognized the inefficiency of dense models: quadratic attention cost, high inference expense and limited returns on further parameter growth.
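The inference expense is easy to see with back‑of‑envelope KV‑cache arithmetic. The helper below is a sketch (function name and the 7B‑class example configuration are illustrative assumptions, not from the article):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_scalar=2):
    """Bytes needed to cache keys and values for one sequence:
    2 (K and V) x layers x KV heads x head dim x tokens x bytes per scalar."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_scalar

# An assumed 7B-class dense model (32 layers, 32 KV heads, head dim 128)
# at a 4096-token context in fp16 needs 2 GiB of KV cache per sequence:
print(kv_cache_bytes(32, 32, 128, 4096) / 2**30)
```

Memory like this, multiplied across concurrent users and long contexts, is a big part of why the field turned to the KV‑compressing attention variants described below.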

3. Efficiency‑Driven Innovations

Mixture‑of‑Experts (MoE) sparsity (e.g., DeepSeek‑V2, DeepSeek‑R1, Qwen) reduces the parameters activated per token while retaining a huge total parameter count.

New attention mechanisms compress KV caches and cut complexity: Multi‑Head Latent Attention (MLA) projects keys and values into a low‑rank latent space, Grouped Query Attention (GQA) shares key/value heads across groups of query heads, and Lightning Attention approaches linear cost in sequence length.
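To make the GQA idea concrete, here is a toy forward pass in NumPy (a single layer, no causal mask; the function and shapes are invented for this sketch): many query heads share a handful of key/value heads, so the KV cache shrinks by the group factor.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_q_heads, n_kv_heads):
    """Toy GQA: n_q_heads query heads share n_kv_heads key/value heads.
    q: (seq, n_q_heads * d_head); k, v: (seq, n_kv_heads * d_head)."""
    seq, d_model = q.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads          # query heads per KV head
    qh = q.reshape(seq, n_q_heads, d_head)
    kh = k.reshape(seq, n_kv_heads, d_head)  # only n_kv_heads KV heads cached
    vh = v.reshape(seq, n_kv_heads, d_head)
    out = np.empty_like(qh)
    for h in range(n_q_heads):
        kv = h // group                      # query head h reads KV head kv
        scores = qh[:, h] @ kh[:, kv].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)   # row-wise softmax
        out[:, h] = w @ vh[:, kv]
    return out.reshape(seq, d_model)
```

With 4 query heads and 2 KV heads, the cached K/V tensors are half the size of standard multi‑head attention; MLA pushes further by caching a single low‑rank latent instead of per‑head K/V.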

4. Reasoning and Chain‑of‑Thought

Models like OpenAI’s o‑series and Anthropic’s Claude introduced explicit “thinking” phases, allocating extra compute at inference time to generate internal reasoning chains and dramatically improving performance on math and logic benchmarks (e.g., AIME, GPQA).
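One simple recipe for spending extra inference compute is self‑consistency decoding: sample several independent reasoning chains and majority‑vote on the final answer. A minimal sketch (the `sample_chain` callable is a hypothetical stand‑in for a model API, not any vendor’s interface):

```python
from collections import Counter

def self_consistency(sample_chain, n_samples=8):
    """Test-time scaling sketch: draw n_samples reasoning chains and
    return the most common final answer. `sample_chain` is assumed to
    return a (chain_text, final_answer) pair per call."""
    answers = [sample_chain()[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

More compute (larger `n_samples`) buys more reliability, which is the same cost‑for‑accuracy trade the reasoning models make internally.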

5. Agentic Tool Use

Recent models (OpenAI o3/o4‑mini, Claude 4, Gemini 2.5) can autonomously select and invoke external tools—search, code execution, image generation—turning reasoning into actionable plans.
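The core of such agentic behavior is a loop in which the model either emits a tool call or a final answer. A minimal sketch, assuming a JSON tool‑call format of my own invention (real vendor APIs differ in detail):

```python
import json

# Hypothetical tool registry; a real agent would expose search, code
# execution, image generation, etc.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def agent_step(model_output):
    """One turn of an agentic loop: if the model emitted a JSON tool call,
    execute it and return the observation to feed back; otherwise treat
    the output as the final answer."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return ("final", model_output)
    if not isinstance(call, dict) or "tool" not in call:
        return ("final", model_output)
    result = TOOLS[call["tool"]](call["args"])
    return ("observation", result)
```

In production systems this step runs repeatedly, with each observation appended to the context, until the model stops calling tools.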

6. Reinforcement Learning for Reasoning

RL pipelines (DeepSeek‑R1’s GRPO, MiniMax‑M1’s CISPO) train models to produce coherent reasoning steps and to self‑correct, reducing reliance on massive labeled datasets.
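GRPO’s key simplification is replacing a learned value critic with group‑relative advantages: sample a group of responses per prompt and normalize each reward against the group. A sketch of just that normalization step (the full algorithm adds a clipped policy‑ratio objective and a KL penalty, omitted here):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage sketch: for one prompt's group of sampled
    responses, advantage_i = (reward_i - group mean) / group std.
    No value network is needed; the group itself is the baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]
```

Responses that beat their group get positive advantage and are reinforced; the rest are pushed down, which is enough signal to shape long reasoning chains from a verifiable reward.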

7. Benchmark Shift

Traditional knowledge benchmarks (MMLU, GSM8K) are saturated; newer evaluations focus on complex reasoning (GPQA, AIME) and agentic tasks (SWE‑bench, Terminal‑bench), redefining SOTA per capability.

8. Competitive Landscape

OpenAI focuses on proprietary reasoning and agents; DeepSeek and Qwen pursue open‑source, MoE‑centric efficiency; Anthropic emphasizes safety‑driven reasoning; Google offers tiered Gemini models integrated with Cloud.

9. Future Directions

Research is now exploring post‑Transformer architectures, dynamic low‑rank projections, and world‑model integration for embodied AI, while efficiency remains the strategic moat.

(Figure: LLM evolution)

Key code component: DeepSeekMoE
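A toy DeepSeekMoE‑style forward pass for a single token, sketched in NumPy under simplifying assumptions (each expert is a single matrix rather than a full FFN, and the routing details are illustrative): a few shared experts always fire, while a softmax gate activates only the top‑k routed experts.

```python
import numpy as np

def deepseek_moe_layer(x, shared, routed, gate_w, top_k=2):
    """Toy DeepSeekMoE-style layer for one token vector x.
    shared: list of always-active expert matrices.
    routed: list of candidate expert matrices, of which only the
    top_k highest-gated ones are evaluated (sparse activation)."""
    # Shared experts: contribute unconditionally.
    out = sum(x @ w for w in shared)
    # Router: softmax over routed-expert logits.
    logits = x @ gate_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Evaluate only the top-k routed experts, weighted by gate probability.
    for i in np.argsort(probs)[-top_k:]:
        out = out + probs[i] * (x @ routed[i])
    return out
```

With, say, 64 routed experts and top_k = 6, most parameters sit idle on any given token, which is how MoE models keep a huge total size but a small activated footprint.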