Mastering RAG and LLM Techniques: From Retrieval to Fine‑Tuning
This article provides a comprehensive technical guide on Retrieval‑Augmented Generation (RAG), open‑source large language models such as LLaMA, fine‑tuning methods, evaluation metrics, memory‑optimization tricks, and attention‑related optimizations for modern AI systems.
1. Overall RAG workflow
Data preprocessing → chunking (critical for model performance) → text embedding → query embedding → vector retrieval → re‑ranking → feed query plus retrieved content to the LLM → output.
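The workflow above can be sketched end-to-end as a toy pipeline. Everything here is an illustrative stand-in: the bag-of-bigrams `embed` replaces a real sentence encoder, and no actual LLM call is made, only the final prompt is assembled.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": character-bigram counts. A real system would use
    # a dense sentence encoder here; this just makes the sketch runnable.
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list, k: int = 2) -> list:
    # Vector retrieval step: rank chunks by similarity to the query embedding.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, contexts: list) -> str:
    # Final step: feed query plus retrieved content to the LLM.
    ctx = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using the context below.\nContext:\n{ctx}\nQuestion: {query}"

chunks = ["RAG retrieves documents before generation.",
          "LoRA freezes the base model weights.",
          "Paris is the capital of France."]
query = "What is RAG retrieval?"
prompt = build_prompt(query, retrieve(query, chunks))
```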
2. Why use external knowledge bases?
Mitigate forgetting problems.
Improve answer accuracy, authority, and timeliness.
Cover niche domains that generic models lack.
Increase controllability, interpretability, trustworthiness, and safety.
3. Evaluating RAG projects
Retrieval metrics:
MRR (Mean Reciprocal Rank): average of the reciprocal rank of the first relevant result across queries.
Hit Rate (Hits@k): proportion of queries whose correct item appears within the top‑k results.
NDCG.
Generation metrics:
Qualitative: completeness, correctness, relevance.
Quantitative: ROUGE‑L.
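The two rank-based retrieval metrics above can be computed in a few lines (the example rankings and gold sets are made up for illustration):

```python
def mrr(ranked_lists, relevant):
    # Mean Reciprocal Rank: average 1/rank of the first relevant item.
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant):
        for i, doc in enumerate(ranking, start=1):
            if doc in rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def hits_at_k(ranked_lists, relevant, k):
    # Hit rate: fraction of queries with a relevant doc in the top-k.
    hits = sum(1 for ranking, rel in zip(ranked_lists, relevant)
               if any(d in rel for d in ranking[:k]))
    return hits / len(ranked_lists)

# Two toy queries: correct answer at rank 2 and rank 3 respectively.
queries_ranked = [["a", "b", "c"], ["x", "y", "z"]]
gold = [{"b"}, {"z"}]
```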
4. Hallucination and repetition issues
Hallucination: generated content is meaningless or not faithful to source data.
Repetition ("echo" problem): the model repeatedly outputs the same phrase.
5. Mitigation strategies
For hallucination: integrate external knowledge bases, add correction rules, limit output length.
For repetition: increase dataset diversity, filter duplicate/meaningless texts during preprocessing, apply synonym‑based data augmentation, adjust temperature, use post‑processing filters.
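Two of the repetition fixes above, temperature adjustment and penalizing already-generated tokens, act directly on the logits at decode time. A minimal sketch (the penalty rule follows the common divide-positive/multiply-negative convention; the logits are made up):

```python
import numpy as np

def sample_probs(logits, generated_ids, temperature=0.7, rep_penalty=1.3):
    # Repetition penalty: damp logits of tokens already generated
    # (divide positive logits, multiply negative ones by the penalty).
    logits = np.array(logits, dtype=np.float64)
    for t in set(generated_ids):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    z = logits / temperature
    probs = np.exp(z - z.max())
    return probs / probs.sum()

base = sample_probs([2.0, 1.9, 0.5], [])        # nothing generated yet
penalized = sample_probs([2.0, 1.9, 0.5], [0])  # token 0 already emitted
```

After penalizing token 0, the runner-up token becomes the most likely continuation, which is exactly how the loop gets broken.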
6. Root causes of these problems
Hallucination arises from (a) mismatch between training data and source data, misaligned encoders/decoders, or user queries beyond the model’s knowledge, and (b) contradictory or incomplete training signals.
Repetition stems from low‑quality data containing many duplicate or overly long passages, and from greedy decoding where the model keeps predicting the highest‑probability token, leading to loops.
7. Popular open‑source LLM – LLaMA
LLaMA follows the Transformer architecture with several modifications:
RMSNorm (pre‑normalization) for training stability.
SwiGLU activation instead of ReLU (inspired by PaLM).
Rotary positional embeddings (instead of absolute positions).
Efficient causal multi‑head attention implementation to reduce memory and runtime.
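RMSNorm, the first modification listed, drops LayerNorm's mean subtraction and bias and rescales by the root mean square only. A numpy sketch:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: normalize by root-mean-square only (no mean subtraction,
    # no bias). LLaMA applies it *before* each sub-layer (pre-norm).
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

out = rms_norm(np.array([3.0, 4.0]), np.ones(2))
```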
8. Common SFT (Supervised Fine‑Tuning) methods
Full‑parameter fine‑tuning.
Adapter tuning.
Prefix tuning.
Prompt tuning.
P‑Tuning v1.
LoRA.
RLHF (Reinforcement Learning from Human Feedback).
Typical learning rate: 10 % of the pre‑training rate.
9. LoRA fine‑tuning
LoRA adds a low‑rank side‑branch to a frozen pre‑trained model. During training only the low‑rank matrices A (down‑projection) and B (up‑projection) are updated; the main model weights remain unchanged. A is initialized with a random Gaussian, B with zeros, ensuring the side‑branch starts as a zero matrix.
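The initialization described above has a useful consequence: at step 0 the side-branch contributes nothing, so training starts from exactly the pre-trained model. A minimal numpy sketch (dimensions and the `alpha` scaling hyperparameter are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4                              # hidden size, low rank (r << d)
W = rng.normal(size=(d, d))               # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d))   # down-projection: Gaussian init
B = np.zeros((d, r))                      # up-projection: zero init
alpha = 8                                 # scaling hyperparameter

def lora_forward(x):
    # Frozen path plus scaled low-rank side-branch; only A and B train.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# Because B starts at zero, the side-branch is initially a no-op.
```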
10. Vector retrieval models used in RAG
Common approaches:
Brute‑force (exact) search – the accurate but slow baseline that ANN methods approximate.
Product quantization (PQ).
HNSW graphs (e.g., the hnswlib library).
KD‑tree (effective mainly in low dimensions).
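Brute-force search is worth having as a reference implementation, since it defines the ground truth that the approximate methods above trade away for speed:

```python
import numpy as np

def topk_cosine(query, corpus, k=3):
    # Exact nearest-neighbour search by cosine similarity: normalize,
    # compute all dot products, take the k largest. O(N*d) per query.
    qn = query / np.linalg.norm(query)
    cn = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = cn @ qn
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]

corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
idx, sims = topk_cosine(np.array([1.0, 0.0]), corpus, k=2)
```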
11. Potential RAG improvements
Query‑side: error correction, rewriting, normalization, expansion.
Hierarchical indexing for vector databases to boost efficiency and precision.
Domain‑specific LLM fine‑tuning with knowledge bases for better relevance and timeliness.
Post‑processing of LLM outputs to filter unreasonable cases.
12. What is LangChain?
LangChain is a framework that simplifies building applications with large language models by providing modular components for data loading, chunking, embedding, vector storage, and retrieval‑augmented QA, analogous to how TensorFlow/PyTorch simplify neural network development.
13. Common LangChain modules
document_loaders – load raw documents.
text_splitter – split documents into chunks.
embeddings (e.g., embeddings.huggingface / HuggingFaceEmbeddings) – generate embeddings.
vectorstores – store embeddings.
chains.RetrievalQA – perform retrieval‑augmented question answering.
14. SFT vs. RLHF comparison
SFT advantages: simple setup, requires only QA pairs, low GPU memory consumption, fast convergence.
SFT drawbacks: performance limited by quality of fine‑tuning data; high‑quality annotation is costly.
RLHF advantages: aligns model outputs with human preferences, improves safety and factuality.
RLHF drawbacks: high GPU memory usage, unstable training (PPO), requires additional reward‑model data and complex labeling.
16. Tricks to alleviate OOM during large‑model training
Gradient accumulation.
Mixed‑precision (FP16) training.
Parameter reduction (e.g., pruning, quantization).
Distributed training across multiple GPUs.
Reduce batch size.
Upgrade hardware resources.
Optimize data pipeline to load data lazily and in parallel.
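Gradient accumulation, the first trick listed, deserves a sketch: sum gradients over several micro-batches, then take one optimizer step, so the effective batch is large while peak memory stays small. A framework-free toy on a linear model (the data, learning rate, and step counts are illustrative):

```python
import numpy as np

# Fit y = w*x with 4 micro-batches per optimizer step.
rng = np.random.default_rng(0)
x = rng.normal(size=32)
y = 3.0 * x                       # true weight is 3.0

w, lr, accum_steps = 0.0, 0.1, 4
micro_batches = np.array_split(np.arange(32), accum_steps)

for _ in range(200):
    grad = 0.0
    for idx in micro_batches:          # several small forward/backward passes
        err = w * x[idx] - y[idx]
        grad += (err * x[idx]).mean()  # accumulate, do not step yet
    w -= lr * grad / accum_steps       # one update with the averaged gradient
```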
17. Can LLaMA accept arbitrarily long inputs?
No. The usable input length is capped by the context window fixed during pre‑training, by attention compute and memory that grow quadratically with sequence length, and by quality degradation when positions extrapolate beyond the trained range.
18. Extending context length for LLMs
Chunk the text with overlapping windows to preserve continuity.
Increase model parameters or adopt more complex architectures to capture longer dependencies.
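The overlapping-window strategy is simple to implement; the overlap ensures text cut at a chunk boundary reappears at the start of the next chunk (window and overlap sizes here are illustrative):

```python
def chunk_with_overlap(tokens, window=512, overlap=64):
    # Slide a window of `window` tokens forward by (window - overlap)
    # each step, so consecutive chunks share `overlap` tokens.
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return chunks

chunks = chunk_with_overlap(list(range(10)), window=4, overlap=2)
```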
19. Memory components during inference
Model parameters.
Input data.
Intermediate activation results.
Memory‑management strategies (e.g., delayed deallocation) that keep buffers alive to reduce allocation overhead.
20. Overview of ChatGLM
ChatGLM builds on the GLM backbone, which can act as both encoder and decoder. It uses two mask tokens:
[MASK]: BERT‑style random short‑span masking.
[gMASK]: GPT‑style masking of a long span at the end (used for generation).
ChatGLM2 adopts [gMASK] exclusively for pre‑training.
21. GLU and SwiGLU activation functions
GLU introduces a gating mechanism to filter information, improving expressiveness and long‑range modeling.
SwiGLU combines the Swish function with GLU, essentially multiplying Swish by a gate.
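The "Swish multiplied by a gate" structure can be written directly. A numpy sketch of the two-branch form SwiGLU(x) = Swish(xW) ⊙ (xV), with toy weight matrices (real FFN layers add a third output projection):

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish (SiLU when beta=1): x * sigmoid(beta * x).
    return x / (1.0 + np.exp(-beta * x))

def swiglu(x, W, V):
    # The Swish branch gates the plain linear branch element-wise.
    return swish(x @ W) * (x @ V)

out = swiglu(np.ones((1, 2)), np.eye(2), np.eye(2))
```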
22. Differences between LLaMA 1 and LLaMA 2
Data: LLaMA 2 was trained on 2.0 trillion tokens vs 1.4 trillion for LLaMA 1.
Context length: 4 k tokens for LLaMA 2 vs 2 k for LLaMA 1.
Architecture tweaks: both use rotary embeddings, RMSNorm pre‑normalization, and SwiGLU activation; the main LLaMA 2 additions are the longer context and grouped‑query attention (GQA) in the 70B model.
23. GPU memory consumption
Training with Adam in mixed precision typically needs ~16 bytes per parameter (fp16 weights and gradients, plus fp32 master weights and two fp32 optimizer states). Inference needs roughly 2 bytes per parameter in fp16 (activations and KV cache are extra).
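The per-parameter byte counts turn into quick capacity estimates (the 7B example is illustrative):

```python
def train_memory_gb(n_params, bytes_per_param=16):
    # fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
    # + Adam momentum (4) + Adam variance (4) = 16 bytes/param.
    return n_params * bytes_per_param / 1e9

def infer_memory_gb(n_params, bytes_per_param=2):
    # fp16 inference: 2 bytes per weight (activations/KV cache extra).
    return n_params * bytes_per_param / 1e9

# A 7B model: ~112 GB to train with Adam, ~14 GB just for fp16 weights.
```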
24. DeepSpeed mechanisms
DeepSpeed implements data parallelism via ring all‑reduce, avoiding the bottleneck of a central parameter server.
ZeRO optimization has three stages:
ZeRO‑1: partition Adam optimizer states across GPUs.
ZeRO‑2: also partition gradients, so each GPU keeps only the gradient shard matching its optimizer partition (reduce‑scatter replaces all‑reduce).
ZeRO‑3: partition parameters themselves; each step requires additional all‑gather/scatter operations but drastically reduces memory per GPU.
ZeRO‑Offload moves optimizer states and gradients to CPU memory while keeping computation on GPUs.
25. Mixed‑precision training
FP16 halves memory usage compared to FP32, improves communication bandwidth, and can accelerate computation on AI accelerators.
Challenges include overflow and rounding errors.
Key techniques: weight backup, loss scaling, and precision‑aware accumulation.
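Loss scaling exists because small fp16 gradients underflow to zero. A quick numpy demonstration (the scale factor 1024 is an illustrative choice; real trainers adjust it dynamically):

```python
import numpy as np

# Gradients below fp16's representable range silently become zero.
# Loss scaling multiplies the loss (and hence every gradient) by a large
# factor before the backward pass; the fp32 master copy divides it back.
vanished = np.float16(1e-8)            # underflows: stored as 0.0
scale = 1024.0
scaled = np.float16(1e-8 * scale)      # scaled gradient survives in fp16
recovered = np.float32(scaled) / scale # unscaled again in fp32
```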
26. Prefix LM vs. Causal LM
Prefix LM: prefix tokens attend to each other bidirectionally (full attention), while generated tokens attend unidirectionally; examples include ChatGLM (GLM) and U‑PaLM.
Causal LM: strict left‑to‑right unidirectional attention; examples include LLaMA‑7B, Qwen.
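The difference between the two is just the attention mask. A numpy sketch building both (True means "may attend"):

```python
import numpy as np

def causal_mask(n):
    # Strict left-to-right: token i sees positions 0..i only.
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_mask(n, prefix_len):
    # Prefix tokens attend bidirectionally among themselves;
    # generated tokens stay causal.
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = True
    return mask
```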
27. Optimizations for Multi‑Head Attention
KV cache: store previously computed key/value pairs to avoid recomputation.
MQA (Multi‑Query Attention): share K and V across heads, reducing memory bandwidth.
GQA (Grouped‑Query Attention): intermediate between MQA and full MHA, grouping K/V.
FlashAttention: split Q/K/V into small blocks, load from HBM to SRAM, minimizing I/O bottlenecks.
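The KV-cache idea can be shown with single-head attention: each decode step appends one new key/value pair and reuses all cached ones, instead of recomputing K and V for the whole prefix (random vectors stand in for real projections):

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for one query vector against
    # all cached keys/values.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d, steps = 8, 5
K_cache, V_cache, outputs = [], [], []
for _ in range(steps):
    q, k, v = rng.normal(size=(3, d))  # this step's query/key/value
    K_cache.append(k)                  # cache the new key/value once...
    V_cache.append(v)
    # ...and attend over every cached entry; past K/V are never recomputed.
    outputs.append(attend(q, np.stack(K_cache), np.stack(V_cache)))
```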
28. Common attention computation variants
Self‑attention (standard).
DIN‑style attention: retains raw weight signals without softmax, using activation‑based scaling (e.g., sigmoid) to preserve weight differences.