Artificial Intelligence 27 min read

From GPT‑4 to Thinking Models: How LLM Architecture Evolved After 2023

This article traces the evolution of large language models from the GPT‑4 era through 2024‑2025, highlighting the shift from pure scaling to efficiency‑focused architectures, the rise of reasoning‑centric "thinking" models, and the emergence of agentic capabilities that enable tools and real‑world interaction.

Alibaba Cloud Developer

Jul 8, 2025

From GPT‑4 to Thinking Models: How LLM Architecture Evolved After 2023

1. 2023: The GPT‑4 Paradigm and the End of Parameter‑Scale‑Only Thinking

Since the release of GPT‑4 in early 2023, the LLM field followed a clear trajectory: larger parameter counts, more compute, and bigger data led to better performance, a trend known as Scaling Laws. GPT‑4 demonstrated this with an 8K/32K context window and top‑10% scores on professional benchmarks, surpassing GPT‑3.5.

1.1 2023 Baseline: GPT‑4 Paradigm

GPT‑4’s key improvements over GPT‑3.5 include a vastly expanded context window and higher reliability, creativity, and instruction following.

1.2 Cracks in the Scaling Paradigm

By late 2024, despite having the data, compute, and talent to build GPT‑5, the community began questioning the pure scaling approach, noting three intertwined pressures: the quadratic cost of dense Transformers, the need for deeper reasoning during inference, and the demand for actionable agents.

2. 2024‑Present: The Urgent Need for Efficiency

2.1 The Rise of Sparse MoE Architectures

Mixture‑of‑Experts (MoE) replaces dense feed‑forward layers with many small expert networks, activating only a subset per token. This allows total parameters to reach hundreds of billions while keeping per‑token compute low.

DeepSeek‑V2 introduced DeepSeekMoE , a 236B model that activates only 21B parameters per token, achieving a >10:1 total‑to‑active parameter ratio.

DeepSeek‑V2‑Lite (16B) uses two shared experts and 64 routed experts, activating six experts per token.

DeepSeek R1 (671B total, 37B active) proved MoE can scale to trillion‑parameter regimes economically.

Qwen’s product line combines dense (up to 32B) and MoE models (e.g., 30B‑A3B, 235B‑A22B) to serve both stable‑performance and cutting‑edge use cases.

2.2 Attention Mechanism Revolution

Standard self‑attention incurs O(L²) cost, limiting context length. New mechanisms include:

DeepSeek’s Multi‑Head Latent Attention (MLA) compresses KV caches into a low‑rank latent vector, reducing memory by 93% while supporting 128K context.

Minimax‑m1’s Lightning Attention offers linear‑time attention, interleaved with regular softmax blocks to preserve quality.

Qwen 2.5 adopts Grouped Query Attention (GQA) for more efficient KV reuse.

3. 2025: Reasoning (Thinking) Takes Center Stage

Models now allocate extra compute at inference time to generate internal “Chain‑of‑Thought” (CoT) sequences before producing the final answer, dramatically improving performance on complex logical, mathematical, and planning tasks.

3.1 OpenAI o‑Series (o1, o3, o4‑mini)

These models hide a long internal reasoning chain from users, achieving 83% accuracy on AIME problems, far surpassing GPT‑4o’s 13%.

3.2 Anthropic Claude Hybrid Reasoning

Claude 3.7 introduced a hybrid mode allowing a trade‑off between fast responses and deep “extended thinking”. Subsequent Claude 4 versions refined this with explicit modes for speed versus accuracy.

3.3 Agentic Tool Use

OpenAI’s o‑series and Anthropic’s Claude 4 provide APIs for autonomous tool usage—web search, Python execution, image generation—enabling multi‑step problem solving. Google’s Gemini 2.5 and Qwen 3 also expose a “thinking budget” parameter to control inference compute.

4. Current Landscape and Competition

4.1 Architectural Philosophies

OpenAI focuses on proprietary reasoning‑centric designs.

DeepSeek pursues open‑source MoE and MLA innovations.

Anthropic emphasizes safety and controllable hybrid reasoning.

Google integrates thinking models into its Cloud ecosystem (Gemini 2.5 Pro/Flash/Lite).

Qwen offers a flexible mix of dense and MoE models with ultra‑long context.

Minimax blends MoE, linear attention, and a novel RL algorithm (CISPO) for rapid training.

4.2 Benchmark Shifts

Traditional NLP benchmarks (MMLU, GSM8K) are saturating. New evaluation suites focus on complex reasoning (GPQA, AIME) and agentic execution (SWE‑bench, Terminal‑bench), revealing distinct SOTA leaders per task.

5. Future Trajectory and Conclusion

5.1 Toward Embodied AI

The convergence of efficiency, reasoning, and agency points toward embodied intelligence, where models predict physical trajectories (e.g., the Corki framework) and interact with the real world.

5.2 Post‑Transformer Exploration

Research continues on alternatives such as State‑Space Models and novel normalization (ResiDual), but most work augments rather than replaces Transformers.

5.3 Three Pillars of Modern AI Architecture

Efficiency – sparse MoE and advanced attention make massive models and ultra‑long context affordable.

Reasoning – allocating inference compute to “thinking” yields dramatic gains on hard tasks.

Agency – tool‑use and agentic APIs turn reasoning into actionable outcomes.

ps: This article was co‑authored by Gemini 2.5 Pro 0605.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

LLM Transformer reasoning Agents

Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.