From GPT‑4 to Thinking Models: How LLM Architecture Evolved After 2023
This article traces the evolution of large language models from the GPT‑4 era through 2024‑2025, highlighting the shift from pure scaling to efficiency‑focused architectures, the rise of reasoning‑centric "thinking" models, and the emergence of agentic capabilities that enable tools and real‑world interaction.
1. 2023: The GPT‑4 Paradigm and the End of Parameter‑Scale‑Only Thinking
Since the release of GPT‑4 in early 2023, the LLM field followed a clear trajectory: larger parameter counts, more compute, and bigger data led to better performance, a trend known as Scaling Laws. GPT‑4 demonstrated this with an 8K/32K context window and top‑10% scores on professional benchmarks, surpassing GPT‑3.5.
1.1 2023 Baseline: GPT‑4 Paradigm
GPT‑4’s key improvements over GPT‑3.5 include a vastly expanded context window and higher reliability, creativity, and instruction following.
1.2 Cracks in the Scaling Paradigm
By late 2024, despite having the data, compute, and talent to build GPT‑5, the community began questioning the pure scaling approach, noting three intertwined pressures: the quadratic cost of dense Transformers, the need for deeper reasoning during inference, and the demand for actionable agents.
2. 2024‑Present: The Urgent Need for Efficiency
2.1 The Rise of Sparse MoE Architectures
Mixture‑of‑Experts (MoE) replaces dense feed‑forward layers with many small expert networks, activating only a subset per token. This allows total parameters to reach hundreds of billions while keeping per‑token compute low.
DeepSeek‑V2 introduced DeepSeekMoE , a 236B model that activates only 21B parameters per token, achieving a >10:1 total‑to‑active parameter ratio.
DeepSeek‑V2‑Lite (16B) uses two shared experts and 64 routed experts, activating six experts per token.
DeepSeek R1 (671B total, 37B active) proved MoE can scale to trillion‑parameter regimes economically.
Qwen’s product line combines dense (up to 32B) and MoE models (e.g., 30B‑A3B, 235B‑A22B) to serve both stable‑performance and cutting‑edge use cases.
2.2 Attention Mechanism Revolution
Standard self‑attention incurs O(L²) cost, limiting context length. New mechanisms include:
DeepSeek’s Multi‑Head Latent Attention (MLA) compresses KV caches into a low‑rank latent vector, reducing memory by 93% while supporting 128K context.
Minimax‑m1’s Lightning Attention offers linear‑time attention, interleaved with regular softmax blocks to preserve quality.
Qwen 2.5 adopts Grouped Query Attention (GQA) for more efficient KV reuse.
3. 2025: Reasoning (Thinking) Takes Center Stage
Models now allocate extra compute at inference time to generate internal “Chain‑of‑Thought” (CoT) sequences before producing the final answer, dramatically improving performance on complex logical, mathematical, and planning tasks.
3.1 OpenAI o‑Series (o1, o3, o4‑mini)
These models hide a long internal reasoning chain from users, achieving 83% accuracy on AIME problems, far surpassing GPT‑4o’s 13%.
3.2 Anthropic Claude Hybrid Reasoning
Claude 3.7 introduced a hybrid mode allowing a trade‑off between fast responses and deep “extended thinking”. Subsequent Claude 4 versions refined this with explicit modes for speed versus accuracy.
3.3 Agentic Tool Use
OpenAI’s o‑series and Anthropic’s Claude 4 provide APIs for autonomous tool usage—web search, Python execution, image generation—enabling multi‑step problem solving. Google’s Gemini 2.5 and Qwen 3 also expose a “thinking budget” parameter to control inference compute.
4. Current Landscape and Competition
4.1 Architectural Philosophies
OpenAI focuses on proprietary reasoning‑centric designs.
DeepSeek pursues open‑source MoE and MLA innovations.
Anthropic emphasizes safety and controllable hybrid reasoning.
Google integrates thinking models into its Cloud ecosystem (Gemini 2.5 Pro/Flash/Lite).
Qwen offers a flexible mix of dense and MoE models with ultra‑long context.
Minimax blends MoE, linear attention, and a novel RL algorithm (CISPO) for rapid training.
4.2 Benchmark Shifts
Traditional NLP benchmarks (MMLU, GSM8K) are saturating. New evaluation suites focus on complex reasoning (GPQA, AIME) and agentic execution (SWE‑bench, Terminal‑bench), revealing distinct SOTA leaders per task.
5. Future Trajectory and Conclusion
5.1 Toward Embodied AI
The convergence of efficiency, reasoning, and agency points toward embodied intelligence, where models predict physical trajectories (e.g., the Corki framework) and interact with the real world.
5.2 Post‑Transformer Exploration
Research continues on alternatives such as State‑Space Models and novel normalization (ResiDual), but most work augments rather than replaces Transformers.
5.3 Three Pillars of Modern AI Architecture
Efficiency – sparse MoE and advanced attention make massive models and ultra‑long context affordable.
Reasoning – allocating inference compute to “thinking” yields dramatic gains on hard tasks.
Agency – tool‑use and agentic APIs turn reasoning into actionable outcomes.
ps: This article was co‑authored by Gemini 2.5 Pro 0605.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
