How Large Language Models Evolved in 2025: From DeepSeek to Kimi‑K2 and Beyond

This article maps the rapid evolution of open‑source large language models in 2025, explains the underlying architectural breakthroughs such as MLA, MoE, and NSA, compares dozens of models—including DeepSeek‑V3, OLMo2, Gemma3, Llama4, Qwen3, and Kimi‑K2—and highlights the emergence of powerful AI assistants like Dola, providing developers with a concise technical roadmap.


Large Model Evolution Overview

By the end of 2025, the open‑source LLM landscape has exploded: open‑weight models such as DeepSeek‑V3, Qwen3, and Kimi‑K2 deliver near‑human reasoning, multi‑million‑token context windows, and task‑level capabilities that were previously exclusive to closed‑source APIs such as Claude.

Data‑Driven AI Assistant – Dola

Dola, an Agentic‑AI data‑analysis assistant from Tencent PCG, lets users upload a table and receive a full analysis report without writing code. It can retrieve data, run SQL, perform Python‑based processing and visualization, and generate comprehensive reports through natural‑language dialogue.

1. Language Model Development Timeline

Statistical LM (1990s) : n‑gram models with limited context.

Neural LM (2013) : RNN/LSTM, word embeddings, million‑scale parameters.

Pre‑trained LM (2018) : Transformers, self‑supervised pre‑training (BERT, GPT‑1/2), billions of parameters.

Large‑Scale LLM (2020‑present) : Hundreds of billions of parameters, prompt engineering, emergent abilities.

The recent surge is driven along three dimensions: a qualitative shift in capability, an efficiency revolution, and a reconstruction of the model ecosystem.

2. Open‑Source Model Highlights

DeepSeek‑V3/R1 : Uses Multi‑Head Latent Attention (MLA) and Mixture‑of‑Experts (MoE) layers (671 B total parameters, 37 B active per token). MLA compresses the KV cache into a low‑dimensional latent before storage, offering better memory efficiency than Grouped Query Attention (GQA).

OLMo2 : Transparent training data, RMSNorm placed after attention and feed‑forward (post‑norm), and QK‑Norm inside attention for stable training.

Gemma3 : Employs sliding‑window attention (5:1 local‑to‑global ratio) and dual RMSNorm (pre‑ and post‑norm) around grouped‑query attention.
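
To make the local‑versus‑global distinction concrete, here is a minimal sketch of a sliding‑window causal mask next to a standard causal mask; the window size and function names are illustrative, not Gemma3's actual configuration.

import torch

def causal_mask(seq_len):
    # Global causal mask: every token attends to itself and all earlier tokens.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_mask(seq_len, window=4):
    # Local causal mask: each token attends to at most the `window` most recent tokens (itself included),
    # so per-layer KV cache and attention cost stay bounded regardless of sequence length.
    idx = torch.arange(seq_len)
    dist = idx.unsqueeze(1) - idx.unsqueeze(0)   # how far behind the query each key position lies
    return (dist >= 0) & (dist < window)

print(sliding_window_mask(6, window=3).int())    # banded lower-triangular pattern

In a Gemma3‑style stack, roughly five local‑attention layers alternate with one global layer, so only a fraction of layers pay the full‑context cost.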

Mistral Small 3.1 : 24 B parameters, optimized tokenizer and reduced KV cache for lower latency.

Llama4 : MoE architecture alternating dense and MoE blocks, similar to DeepSeek‑V3 but with fewer active experts.

Qwen3 : Offers dense and MoE variants and applies QK‑Norm inside attention for stable long‑context training.

SmolLM3 : 3 B parameters, NoPE (no positional encoding) and standard transformer blocks.

Kimi‑K2 : 1 T total parameters, combines MLA and MoE and is trained with the Muon optimizer, achieving top‑tier benchmark scores.
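
The distinctive step in Muon is orthogonalizing the momentum‑averaged gradient of each 2‑D weight matrix before applying it. Below is a heavily simplified sketch of that idea, assuming the commonly cited quintic Newton-Schulz iteration; the function names and hyperparameters are illustrative, not Kimi‑K2's production optimizer.

import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximate the nearest semi-orthogonal matrix to G with a quintic Newton-Schulz iteration.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)                     # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    # Simplified Muon-style update for one 2-D weight matrix: momentum, orthogonalize, apply.
    momentum_buf.mul_(beta).add_(grad)
    param.data.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)
    return momentum_buf

Published Muon recipes typically apply this update only to the network's 2‑D weight matrices and pair it with a conventional optimizer for embeddings and normalization parameters.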

GLM‑4.5/4.6 : Deep‑and‑narrow MoE design (355 B total, 32 B active parameters), GQA with 96 attention heads, RoPE with a base frequency of up to 1 M, and loss‑free balanced expert routing.

3. Closed‑Source Model Highlights (2025)

GPT‑5 : Unified routing architecture, strong performance on math, coding, multimodal, and health benchmarks; hallucination rate reduced by 60%.

Claude 4/4.5 : Extended context window to 1 M tokens, superior coding (SWE‑bench 82%), reasoning, and agentic capabilities.

4. Core Technical Innovations

MLA (Multi‑Head Latent Attention) : Compresses KV caches via low‑dimensional projection, reducing memory while preserving performance.
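
As a minimal sketch of the compression idea (ignoring the separate RoPE path and other details of the real design, and with illustrative names and dimensions): only a small latent vector per token is cached, and keys and values are re‑expanded from it at attention time.

import torch.nn as nn

class LatentKVCompression(nn.Module):
    # Illustrative MLA-style KV path: cache one small latent per token instead of full per-head K/V.
    def __init__(self, d_model=4096, kv_latent_dim=512, num_heads=32, head_dim=128):
        super().__init__()
        self.down_kv = nn.Linear(d_model, kv_latent_dim, bias=False)            # compressed latent (cached)
        self.up_k = nn.Linear(kv_latent_dim, num_heads * head_dim, bias=False)  # re-expand to keys
        self.up_v = nn.Linear(kv_latent_dim, num_heads * head_dim, bias=False)  # re-expand to values

    def forward(self, x):
        latent = self.down_kv(x)     # (batch, seq, kv_latent_dim): this is all the KV cache stores
        return latent, self.up_k(latent), self.up_v(latent)

Per token, the cache shrinks from 2 × num_heads × head_dim values for full K and V (8,192 in this toy configuration) to kv_latent_dim values (512).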

MoE (Mixture‑of‑Experts) : Activates only a small subset of experts per token, so models can scale to hundreds of billions or even a trillion total parameters while keeping per‑token inference cost modest.
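
A minimal top‑k routed MoE layer makes the total‑versus‑active parameter distinction concrete; the expert sizes, gate, and top_k value below are illustrative rather than any specific model's configuration.

import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, num_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                      # x: (num_tokens, d_model)
        scores = self.gate(x)                                  # route each token
        weights, idx = scores.topk(self.top_k, dim=-1)         # keep only top_k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e                        # tokens routed to expert e in this slot
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

With num_experts=8 and top_k=2, only a quarter of the expert parameters run per token; the same mechanism is what lets DeepSeek‑V3 keep roughly 37 B of its 671 B parameters active per token.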

NSA (Native Sparse Attention) : Block‑wise token compression, dynamic token selection, and sliding‑window refinement for ultra‑long contexts.
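
A much‑simplified, illustrative sketch of the compression‑then‑selection step: keys are pooled into per‑block summaries, the query scores those summaries, and only the tokens of the top‑scoring blocks are kept for dense attention. The real NSA uses a learned compression module, a parallel sliding‑window branch, and hardware‑aligned kernels; every name and size below is assumed for illustration.

import torch

def select_kv_blocks(q, k, block_size=64, top_blocks=4):
    # q: (head_dim,), k: (seq_len, head_dim). Returns indices of tokens kept for dense attention.
    seq_len, head_dim = k.shape
    num_blocks = seq_len // block_size
    blocks = k[: num_blocks * block_size].view(num_blocks, block_size, head_dim)
    summaries = blocks.mean(dim=1)                    # crude stand-in for NSA's learned compression
    block_scores = summaries @ q                      # score each block summary against the query
    keep = block_scores.topk(min(top_blocks, num_blocks)).indices
    token_idx = (keep.unsqueeze(1) * block_size + torch.arange(block_size)).flatten()
    return token_idx.sort().values                    # attend densely only over these positions

q = torch.randn(128)
k = torch.randn(4096, 128)
print(select_kv_blocks(q, k).shape)                   # 4 blocks * 64 tokens = 256 kept positions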

RMSNorm / QK‑Norm : Simplified normalization placed after the attention and feed‑forward blocks (OLMo2) or both before and after them (Gemma3) for training stability.
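
For reference, here is a minimal RMSNorm module of the kind these models use; the eps value and learned scale are the standard recipe, while the exact placement differs per model as noted above.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Root-mean-square normalization: rescale by the RMS of the last dimension, no mean subtraction, no bias.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.scale

QK‑Norm simply applies a module like this to the per‑head query and key vectors inside attention, as in the grouped‑query attention snippet in section 6.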

5. Training Strategies

Many models adopt staged training: dense warm‑up, sparse MoE training, expert distillation, and mixed reinforcement learning (e.g., DeepSeek‑V3.2‑Exp, GLM‑4.5). Token counts range from billions to tens of trillions.

6. Comparative Tables & Visuals

Referenced figures: model comparison table, DeepSeek‑V3 architecture diagram, and MLA vs GQA performance comparison.

The grouped‑query attention snippet below illustrates how QK‑Norm slots into the attention module:

import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_in, num_heads, num_kv_groups, head_dim=None, qk_norm=False, dtype=None):
        super().__init__()
        self.num_heads, self.num_kv_groups = num_heads, num_kv_groups
        self.head_dim = head_dim if head_dim is not None else d_in // num_heads
        self.W_query = nn.Linear(d_in, num_heads * self.head_dim, bias=False, dtype=dtype)
        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.out_proj = nn.Linear(num_heads * self.head_dim, d_in, bias=False, dtype=dtype)
        if qk_norm:
            # QK-Norm: RMSNorm (as sketched in section 4) applied to the per-head queries and keys
            self.q_norm = RMSNorm(self.head_dim, eps=1e-6)
            self.k_norm = RMSNorm(self.head_dim, eps=1e-6)
        else:
            self.q_norm = self.k_norm = None

    def forward(self, x, mask, cos, sin):
        # project to Q/K/V, apply optional QK-Norm, apply RoPE via (cos, sin),
        # repeat K/V across each query group, then run masked scaled-dot-product attention
        ...

This comprehensive review equips developers and researchers with a quick‑reference map of the latest LLM architectural trends and performance trade‑offs.

Tags: large language models, Mixture of Experts, AI Assistant, model architecture, LLM efficiency
Written by Tencent Technical Engineering

Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.
