Unlocking Large Model Secrets: Transformers, MoE, Fine‑Tuning, RAG & KV Caching

This article provides a comprehensive technical overview of today’s large‑model ecosystem, covering the Transformer architecture, Mixture‑of‑Experts extensions, five fine‑tuning methods, the evolution from traditional RAG to agentic RAG, classic agent design patterns, diverse text‑chunking strategies, and the KV‑cache optimization that accelerates inference.

Data Party THU

Transformer and Mixture‑of‑Experts (MoE)

The Transformer processes sequences in parallel using self‑attention, which captures long‑range dependencies directly. Its core components are multi‑head attention, feed‑forward networks (FFN), layer normalization, and residual connections, enabling efficient large‑scale pre‑training of models such as GPT and BERT.
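To make the attention computation concrete, here is a minimal sketch of scaled dot‑product self‑attention in PyTorch (single head, no masking); the shapes and toy inputs are illustrative assumptions, not taken from any particular model.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k). Returns the attended values, same shape."""
    d_k = q.size(-1)
    # Every token scores every other token in parallel, which is what lets the
    # Transformer capture long-range dependencies without recurrence.
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Toy self-attention call: 2 sequences, 5 tokens, 64-dim per head; q = k = v = x.
x = torch.randn(2, 5, 64)
print(scaled_dot_product_attention(x, x, x).shape)      # torch.Size([2, 5, 64])
```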

Mixture‑of‑Experts (MoE) replaces the standard FFN with multiple expert sub‑networks (typically FFNs). A gating (router) network activates only a small subset of experts for each token, providing sparse activation that dramatically increases model capacity while keeping the compute cost per token low. Notable implementations include Google’s Switch Transformer and Meta’s FairSeq‑MoE, which scale to trillion‑parameter models.
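The routing idea can be sketched as follows. This is an illustrative top‑k gated MoE layer in PyTorch, not the Switch Transformer or FairSeq implementation; the expert count and dimensions are arbitrary example values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse feed-forward layer with top-k expert routing (illustrative sketch)."""
    def __init__(self, d_model, d_ff, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)        # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                    # x: (num_tokens, d_model)
        gate_logits = self.router(x)                         # (tokens, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # keep only k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                     # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = MoELayer(d_model=512, d_ff=2048)
y = layer(torch.randn(10, 512))   # each of the 10 tokens activates only 2 of the 8 experts
```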

Fine‑tuning techniques for large models

LoRA (Low‑Rank Adaptation)

Core idea: Freeze pretrained weights and inject low‑rank matrices (rank r) into linear layers, reducing trainable parameters.

Advantages: Low VRAM usage and strong multi‑task adaptability.
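A minimal sketch of the mechanism, assuming a simple wrapper around nn.Linear (the class name and hyperparameters are illustrative): the frozen weight is bypassed by a trainable low‑rank update B·A scaled by alpha/r, and only A and B receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank bypass (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                 # pretrained weights stay frozen
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # up-projection, starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # Output = frozen path + scaled low-rank update (B @ A) applied to x.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2 * 8 * 768 = 12288 trainable values vs. 589824 frozen weight entries
```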

LoRA‑FA (LoRA with Frozen‑A)

Improvement: Keep the LoRA A‑matrix fixed (randomly initialized, not updated) and train only the B‑matrix, further cutting compute.

Use case: Extremely resource‑constrained environments where performance must be retained.

VeRA (Vector‑based Random Adaptation)

Core idea: All adapted layers share a single pair of frozen, randomly initialized low‑rank matrices; each layer learns only small scaling vectors that adjust their magnitude.

Advantages: Very high parameter efficiency, suitable for edge devices.
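A rough sketch of this parameterization (variable names are mine, not from the VeRA paper): one pair of frozen random matrices is shared by every adapted layer, and each layer trains only two small vectors.

```python
import torch
import torch.nn as nn

d_model, r = 768, 256
# One pair of frozen random projections, shared by every adapted layer in the model.
shared_A = torch.randn(r, d_model)       # down-projection, never trained
shared_B = torch.randn(d_model, r)       # up-projection, never trained

class VeRALinear(nn.Module):
    """Each layer trains only two small scaling vectors (illustrative sketch)."""
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.d = nn.Parameter(torch.full((r,), 0.1))    # scales the shared_A output
        self.b = nn.Parameter(torch.zeros(d_model))     # scales the shared_B output

    def forward(self, x):
        delta = ((x @ shared_A.T) * self.d) @ shared_B.T * self.b
        return self.base(x) + delta

layer = VeRALinear(nn.Linear(d_model, d_model))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))   # r + d_model = 1024
```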

Delta‑LoRA

Improvement: Besides training the low‑rank matrices, the pretrained weights themselves are updated with the delta (difference) of the low‑rank product between consecutive training steps, so the base weights also adapt while the memory overhead stays close to plain LoRA.

Advantages: Balances richer weight updates with LoRA‑level memory efficiency.

LoRA+

Core idea: Apply asymmetric learning rates to LoRA’s A and B matrices, typically giving B a much larger learning rate than A, to correct the gradient‑scale imbalance between the two.

Effect: Faster convergence and more stable fine‑tuning.
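In practice this can be wired up with ordinary optimizer parameter groups; the sketch below assumes standalone A/B parameters and uses a 16× ratio purely as an example value.

```python
import torch
import torch.nn as nn

d, r = 768, 8
lora_A = nn.Parameter(torch.randn(r, d) * 0.01)
lora_B = nn.Parameter(torch.zeros(d, r))

base_lr = 1e-4
optimizer = torch.optim.AdamW([
    {"params": [lora_A], "lr": base_lr},        # matrix A: base learning rate
    {"params": [lora_B], "lr": base_lr * 16},   # matrix B: larger rate; the ratio is a tunable hyperparameter
])
```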

Retrieval‑Augmented Generation (RAG) evolution

Traditional RAG

Process: Retrieve relevant document chunks from a static knowledge base (BM25 or vector similarity) and concatenate them with the query for generation.

Limitations: Retrieval and generation are decoupled, so the pipeline is static and one‑shot, limited to single‑hop reasoning, and performs no self‑verification of its answers.

Agentic RAG

Dynamic retrieval: The agent can rewrite queries and perform multi‑turn retrieval based on generated content.

Task awareness: Selects appropriate retrievers or generators per task.

Tool use: Invokes external APIs (e.g., calculators, search engines) to augment knowledge.

Self‑validation: Performs factual and logical consistency checks on its answers.

Empirical studies report retrieval accuracy improvements from ~45 % (traditional) to ~65 % (agentic) at the cost of higher latency.
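The dynamic loop can be sketched schematically as below; every helper here (retrieve, generate, is_supported, rewrite_query) is a hypothetical placeholder standing in for a real retriever, LLM call, and verifier.

```python
# All helpers below are hypothetical placeholders, not a real framework API.
def retrieve(query):                 return ["document about " + query]
def generate(question, docs):        return f"Answer to '{question}' grounded in {len(docs)} docs"
def is_supported(answer, docs):      return True          # real version: LLM or NLI fact check
def rewrite_query(question, answer): return question + " (rephrased)"

def agentic_rag(question, max_rounds=3):
    """Dynamic retrieval loop: retrieve, generate, self-check, rewrite, repeat (sketch)."""
    query, answer = question, ""
    for _ in range(max_rounds):
        docs = retrieve(query)                    # BM25 / vector retrieval
        answer = generate(question, docs)         # LLM call with the retrieved context
        if is_supported(answer, docs):            # self-validation against the evidence
            return answer
        query = rewrite_query(question, answer)   # reformulate based on what is missing
    return answer

print(agentic_rag("How does KV caching work?"))
```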

Classic agent design patterns

Reflection Pattern: The agent evaluates its own output, identifies errors, and iteratively refines the answer.

Tool Use Pattern: The agent calls external tools (APIs, calculators, search engines) and parses their results.

ReAct Pattern: Interleaved reasoning (Reason) and action (Act) steps loop until the goal is achieved.
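A compressed sketch of such a loop: the prompt format, the llm stub, and the single calculator tool are hypothetical stand‑ins used only to show the reason, act, observe cycle.

```python
# Toy stand-ins: a real agent would call an actual LLM and real tools.
def llm(prompt):
    if "Observation:" in prompt:
        return "Thought: I have the result.\nFinal Answer: 391"
    return "Thought: I need to compute 17 * 23.\nAction: calculator[17 * 23]"

TOOLS = {"calculator": lambda expr: str(eval(expr))}   # eval is acceptable only in this toy demo

def react(question, max_steps=5):
    """Interleave reasoning and tool calls until a final answer appears (sketch)."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                           # Reason: model thinks, picks an action
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:                            # Act: parse "tool[input]" and run it
            name, arg = step.split("Action:")[-1].strip().split("[", 1)
            observation = TOOLS[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"   # Observe: feed the result back
    return transcript

print(react("What is 17 * 23?"))   # -> 391
```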

Planning Pattern: The agent creates a multi‑step plan before execution and can adjust it on the fly.

Multi‑agent Pattern: Multiple specialized agents cooperate or compete, communicating via voting or debate to solve complex tasks.

Text chunking strategies

Fixed‑size Chunking: Split text into equal‑length blocks (e.g., 256 tokens) with optional overlap. Simple and fast but may break sentence boundaries.
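For concreteness, a minimal token‑level sketch (the 256‑token size comes from the example above; the 32‑token overlap is an assumed value):

```python
def fixed_size_chunks(tokens, size=256, overlap=32):
    """Split a token list into fixed-size windows with overlap (illustrative sketch)."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = list(range(1000))             # stand-in for a tokenized document
chunks = fixed_size_chunks(tokens)
print(len(chunks), len(chunks[0]))     # 5 chunks, the first one 256 tokens long
```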

Semantic Chunking: Detect semantic boundaries (paragraphs, topic shifts) using punctuation rules or embedding similarity (e.g., Sentence‑BERT). Preserves meaning at higher computational cost.
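A sketch of the embedding‑similarity variant: adjacent sentences stay in the same chunk until cosine similarity to the next sentence drops below a threshold. The encoder call is omitted; the function takes precomputed, unit‑normalized sentence embeddings (e.g., from Sentence‑BERT), and the orthogonal toy vectors in the usage line are placeholders.

```python
import numpy as np

def semantic_chunks(sentences, embeddings, threshold=0.7):
    """Close the current chunk when similarity to the next sentence drops (sketch).

    `embeddings` holds one unit-normalized vector per sentence, e.g. produced by a
    Sentence-BERT encoder; the encoder call itself is omitted here.
    """
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:              # likely topic shift: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sents = ["Cats purr.", "Dogs bark.", "GPUs accelerate training."]
print(semantic_chunks(sents, np.eye(3), threshold=0.5))   # toy orthogonal "embeddings"
```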

Recursive Chunking: Hierarchical splitting—first by paragraph, then by sentence—balancing length and semantics for multi‑level processing.

Document‑Structure Chunking: Leverage inherent markup (headings, sections, tables) to define chunks, aligning with human reading order; depends on well‑structured documents.

LLM‑based Chunking: Prompt a large model (e.g., GPT‑4) to propose chunk boundaries or guide a rule engine. Highly adaptable but expensive and adds latency.

KV caching

KV‑Cache stores the key and value matrices computed by each attention layer during inference. By reusing these cached tensors for every new token instead of recomputing keys and values for the whole prefix, the per‑token attention cost drops from quadratic O(n²) to linear O(n), delivering roughly 3–5× speed‑ups in generation at the cost of extra memory. KV‑Cache is a core optimization in modern inference engines such as vLLM and TGI, enabling long‑context generation and real‑time interaction.

Code example
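A minimal sketch of incremental decoding that reuses the KV cache via Hugging Face transformers; the choice of GPT‑2 and the greedy decoding loop are illustrative assumptions rather than a recommendation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

generated = tok("KV caching speeds up decoding because", return_tensors="pt")["input_ids"]
past_key_values = None

with torch.no_grad():
    for _ in range(20):
        # After the first step, only the newest token is fed; keys/values for all
        # earlier tokens are reused from past_key_values instead of being recomputed.
        step_input = generated if past_key_values is None else generated[:, -1:]
        out = model(step_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

print(tok.decode(generated[0]))
```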

Written by Data Party THU, the official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.