Mastering RAG and LLM Techniques: From Retrieval to Fine‑Tuning
This article provides a comprehensive technical guide on Retrieval‑Augmented Generation (RAG), open‑source large language models such as LLaMA, fine‑tuning methods, evaluation metrics, memory‑optimization tricks, and attention‑related optimizations for modern AI systems.
1. Overall RAG workflow
Data preprocessing → chunking (critical for model performance) → text embedding → query embedding → vector retrieval → re‑ranking → feed query plus retrieved content to the LLM → output.
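The workflow above can be sketched end-to-end as a toy pipeline. Everything here is an illustrative stand-in: the bag-of-bigrams `embed` replaces a real sentence encoder, and no actual LLM call is made, only the final prompt is assembled.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": character-bigram counts. A real system would use
    # a dense sentence encoder here; this just makes the sketch runnable.
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list, k: int = 2) -> list:
    # Vector retrieval step: rank chunks by similarity to the query embedding.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, contexts: list) -> str:
    # Final step: feed query plus retrieved content to the LLM.
    ctx = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using the context below.\nContext:\n{ctx}\nQuestion: {query}"

chunks = ["RAG retrieves documents before generation.",
          "LoRA freezes the base model weights.",
          "Paris is the capital of France."]
query = "What is RAG retrieval?"
prompt = build_prompt(query, retrieve(query, chunks))
```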
2. Why use external knowledge bases?
Mitigate forgetting problems.
Improve answer accuracy, authority, and timeliness.
Cover niche domains that generic models lack.
Increase controllability, interpretability, trustworthiness, and safety.
3. Evaluating RAG projects
Retrieval metrics:
MRR (Mean Reciprocal Rank): average of the reciprocal rank of the first relevant result across queries.
Hit Rate (Hits@k): proportion of queries whose correct item appears within the top‑k results.
NDCG.
Generation metrics:
Qualitative: completeness, correctness, relevance.
Quantitative: ROUGE‑L.
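The two rank-based retrieval metrics above can be computed in a few lines (the example rankings and gold sets are made up for illustration):

```python
def mrr(ranked_lists, relevant):
    # Mean Reciprocal Rank: average 1/rank of the first relevant item.
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant):
        for i, doc in enumerate(ranking, start=1):
            if doc in rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def hits_at_k(ranked_lists, relevant, k):
    # Hit rate: fraction of queries with a relevant doc in the top-k.
    hits = sum(1 for ranking, rel in zip(ranked_lists, relevant)
               if any(d in rel for d in ranking[:k]))
    return hits / len(ranked_lists)

# Two toy queries: correct answer at rank 2 and rank 3 respectively.
queries_ranked = [["a", "b", "c"], ["x", "y", "z"]]
gold = [{"b"}, {"z"}]
```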
4. Hallucination and repetition issues
Hallucination: generated content is meaningless or not faithful to source data.
Repetition ("echo" problem): the model repeatedly outputs the same phrase.
5. Mitigation strategies
For hallucination: integrate external knowledge bases, add correction rules, limit output length.
For repetition: increase dataset diversity, filter duplicate/meaningless texts during preprocessing, apply synonym‑based data augmentation, adjust temperature, use post‑processing filters.
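Two of the repetition fixes above, temperature adjustment and penalizing already-generated tokens, act directly on the logits at decode time. A minimal sketch (the penalty rule follows the common divide-positive/multiply-negative convention; the logits are made up):

```python
import numpy as np

def sample_probs(logits, generated_ids, temperature=0.7, rep_penalty=1.3):
    # Repetition penalty: damp logits of tokens already generated
    # (divide positive logits, multiply negative ones by the penalty).
    logits = np.array(logits, dtype=np.float64)
    for t in set(generated_ids):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    z = logits / temperature
    probs = np.exp(z - z.max())
    return probs / probs.sum()

base = sample_probs([2.0, 1.9, 0.5], [])        # nothing generated yet
penalized = sample_probs([2.0, 1.9, 0.5], [0])  # token 0 already emitted
```

After penalizing token 0, the runner-up token becomes the most likely continuation, which is exactly how the loop gets broken.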
6. Root causes of these problems
Hallucination arises from (a) mismatch between training data and source data, misaligned encoders/decoders, or user queries beyond the model’s knowledge, and (b) contradictory or incomplete training signals.
Repetition stems from low‑quality data containing many duplicate or overly long passages, and from greedy decoding where the model keeps predicting the highest‑probability token, leading to loops.
7. Popular open‑source LLM – LLaMA
LLaMA follows the Transformer architecture with several modifications:
RMSNorm (pre‑normalization) for training stability.
SwiGLU activation instead of ReLU (inspired by PaLM).
Rotary positional embeddings (instead of absolute positions).
Efficient causal multi‑head attention implementation to reduce memory and runtime.
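RMSNorm, the first modification listed, drops LayerNorm's mean subtraction and bias and rescales by the root mean square only. A numpy sketch:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: normalize by root-mean-square only (no mean subtraction,
    # no bias). LLaMA applies it *before* each sub-layer (pre-norm).
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

out = rms_norm(np.array([3.0, 4.0]), np.ones(2))
```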
8. Common SFT (Supervised Fine‑Tuning) methods
Full‑parameter fine‑tuning.
Adapter tuning.
Prefix tuning.
Prompt tuning.
P‑Tuning v1.
LoRA.
RLHF (Reinforcement Learning from Human Feedback).
Typical learning rate: 10 % of the pre‑training rate.
9. LoRA fine‑tuning
LoRA adds a low‑rank side‑branch to a frozen pre‑trained model. During training only the low‑rank matrices A (down‑projection) and B (up‑projection) are updated; the main model weights remain unchanged. A is initialized with a random Gaussian, B with zeros, ensuring the side‑branch starts as a zero matrix.
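The initialization described above has a useful consequence: at step 0 the side-branch contributes nothing, so training starts from exactly the pre-trained model. A minimal numpy sketch (dimensions and the `alpha` scaling hyperparameter are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4                              # hidden size, low rank (r << d)
W = rng.normal(size=(d, d))               # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d))   # down-projection: Gaussian init
B = np.zeros((d, r))                      # up-projection: zero init
alpha = 8                                 # scaling hyperparameter

def lora_forward(x):
    # Frozen path plus scaled low-rank side-branch; only A and B train.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# Because B starts at zero, the side-branch is initially a no-op.
```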
10. Vector retrieval models used in RAG
Common approaches:
Brute‑force (exact) search – the accurate but slow baseline that ANN methods approximate.
Product quantization (PQ).
HNSW graphs (e.g., the hnswlib library).
KD‑tree (effective mainly in low dimensions).
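Brute-force search is worth having as a reference implementation, since it defines the ground truth that the approximate methods above trade away for speed:

```python
import numpy as np

def topk_cosine(query, corpus, k=3):
    # Exact nearest-neighbour search by cosine similarity: normalize,
    # compute all dot products, take the k largest. O(N*d) per query.
    qn = query / np.linalg.norm(query)
    cn = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = cn @ qn
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]

corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
idx, sims = topk_cosine(np.array([1.0, 0.0]), corpus, k=2)
```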
11. Potential RAG improvements
Query‑side: error correction, rewriting, normalization, expansion.
Hierarchical indexing for vector databases to boost efficiency and precision.
Domain‑specific LLM fine‑tuning with knowledge bases for better relevance and timeliness.
Post‑processing of LLM outputs to filter unreasonable cases.
12. What is LangChain?
LangChain is a framework that simplifies building applications with large language models by providing modular components for data loading, chunking, embedding, vector storage, and retrieval‑augmented QA, analogous to how TensorFlow/PyTorch simplify neural network development.
13. Common LangChain modules
document_loaders – load raw documents.
text_splitter – split documents into chunks.
embeddings (e.g., embeddings.huggingface / HuggingFaceEmbeddings) – generate embeddings.
vectorstores – store embeddings.
chains.RetrievalQA – perform retrieval‑augmented question answering.
14. SFT vs. RLHF comparison
SFT advantages: simple setup, requires only QA pairs, low GPU memory consumption, fast convergence.
SFT drawbacks: performance limited by quality of fine‑tuning data; high‑quality annotation is costly.
RLHF advantages: aligns model outputs with human preferences, improves safety and factuality.
RLHF drawbacks: high GPU memory usage, unstable training (PPO), requires additional reward‑model data and complex labeling.
16. Tricks to alleviate OOM during large‑model training
Gradient accumulation.
Mixed‑precision (FP16) training.
Parameter reduction (e.g., pruning, quantization).
Distributed training across multiple GPUs.
Reduce batch size.
Upgrade hardware resources.
Optimize data pipeline to load data lazily and in parallel.
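Gradient accumulation, the first trick listed, deserves a sketch: sum gradients over several micro-batches, then take one optimizer step, so the effective batch is large while peak memory stays small. A framework-free toy on a linear model (the data, learning rate, and step counts are illustrative):

```python
import numpy as np

# Fit y = w*x with 4 micro-batches per optimizer step.
rng = np.random.default_rng(0)
x = rng.normal(size=32)
y = 3.0 * x                       # true weight is 3.0

w, lr, accum_steps = 0.0, 0.1, 4
micro_batches = np.array_split(np.arange(32), accum_steps)

for _ in range(200):
    grad = 0.0
    for idx in micro_batches:          # several small forward/backward passes
        err = w * x[idx] - y[idx]
        grad += (err * x[idx]).mean()  # accumulate, do not step yet
    w -= lr * grad / accum_steps       # one update with the averaged gradient
```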
17. Can LLaMA accept arbitrarily long inputs?
No. The usable input length is capped by the context window fixed during pre‑training, by attention compute and memory that grow quadratically with sequence length, and by quality degradation when positions extrapolate beyond the trained range.
18. Extending context length for LLMs
Chunk the text with overlapping windows to preserve continuity.
Increase model parameters or adopt more complex architectures to capture longer dependencies.
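The overlapping-window strategy is simple to implement; the overlap ensures text cut at a chunk boundary reappears at the start of the next chunk (window and overlap sizes here are illustrative):

```python
def chunk_with_overlap(tokens, window=512, overlap=64):
    # Slide a window of `window` tokens forward by (window - overlap)
    # each step, so consecutive chunks share `overlap` tokens.
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return chunks

chunks = chunk_with_overlap(list(range(10)), window=4, overlap=2)
```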
19. Memory components during inference
Model parameters.
Input data.
Intermediate activation results.
Memory‑management strategies (e.g., delayed deallocation) that keep buffers alive to reduce allocation overhead.
20. Overview of ChatGLM
ChatGLM builds on the GLM backbone, which can act as both encoder and decoder. It uses two mask tokens:
[MASK]: BERT‑style random short‑span masking.
[gMASK]: GPT‑style masking of a long span at the end (used for generation).
ChatGLM2 adopts [gMASK] exclusively for pre‑training.
21. GLU and SwiGLU activation functions
GLU introduces a gating mechanism to filter information, improving expressiveness and long‑range modeling.
SwiGLU combines the Swish function with GLU, essentially multiplying Swish by a gate.
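The "Swish multiplied by a gate" structure can be written directly. A numpy sketch of the two-branch form SwiGLU(x) = Swish(xW) ⊙ (xV), with toy weight matrices (real FFN layers add a third output projection):

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish (SiLU when beta=1): x * sigmoid(beta * x).
    return x / (1.0 + np.exp(-beta * x))

def swiglu(x, W, V):
    # The Swish branch gates the plain linear branch element-wise.
    return swish(x @ W) * (x @ V)

out = swiglu(np.ones((1, 2)), np.eye(2), np.eye(2))
```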
22. Differences between LLaMA 1 and LLaMA 2
Data: LLaMA 2 was trained on 2.0 trillion tokens vs 1.4 trillion for LLaMA 1.
Context length: 4 k tokens for LLaMA 2 vs 2 k for LLaMA 1.
Architecture tweaks: both use rotary embeddings, RMSNorm pre‑normalization, and SwiGLU activation; the main LLaMA 2 additions are the longer context and grouped‑query attention (GQA) in the 70B model.
23. GPU memory consumption
Training with Adam in mixed precision typically needs ~16 bytes per parameter (fp16 weights and gradients, plus fp32 master weights and two fp32 optimizer states). Inference needs roughly 2 bytes per parameter in fp16 (activations and KV cache are extra).
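The per-parameter byte counts turn into quick capacity estimates (the 7B example is illustrative):

```python
def train_memory_gb(n_params, bytes_per_param=16):
    # fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
    # + Adam momentum (4) + Adam variance (4) = 16 bytes/param.
    return n_params * bytes_per_param / 1e9

def infer_memory_gb(n_params, bytes_per_param=2):
    # fp16 inference: 2 bytes per weight (activations/KV cache extra).
    return n_params * bytes_per_param / 1e9

# A 7B model: ~112 GB to train with Adam, ~14 GB just for fp16 weights.
```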
24. DeepSpeed mechanisms
DeepSpeed implements data parallelism via ring all‑reduce, avoiding the bottleneck of a central parameter server.
ZeRO optimization has three stages:
ZeRO‑1: partition Adam optimizer states across GPUs.
ZeRO‑2: also partition gradients, so each GPU keeps only the gradient shard matching its optimizer partition (reduce‑scatter replaces all‑reduce).
ZeRO‑3: partition parameters themselves; each step requires additional all‑gather/scatter operations but drastically reduces memory per GPU.
ZeRO‑Offload moves optimizer states and gradients to CPU memory while keeping computation on GPUs.
25. Mixed‑precision training
FP16 halves memory usage compared to FP32, improves communication bandwidth, and can accelerate computation on AI accelerators.
Challenges include overflow and rounding errors.
Key techniques: weight backup, loss scaling, and precision‑aware accumulation.
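Loss scaling exists because small fp16 gradients underflow to zero. A quick numpy demonstration (the scale factor 1024 is an illustrative choice; real trainers adjust it dynamically):

```python
import numpy as np

# Gradients below fp16's representable range silently become zero.
# Loss scaling multiplies the loss (and hence every gradient) by a large
# factor before the backward pass; the fp32 master copy divides it back.
vanished = np.float16(1e-8)            # underflows: stored as 0.0
scale = 1024.0
scaled = np.float16(1e-8 * scale)      # scaled gradient survives in fp16
recovered = np.float32(scaled) / scale # unscaled again in fp32
```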
26. Prefix LM vs. Causal LM
Prefix LM: prefix tokens attend to each other bidirectionally (full attention), while generated tokens attend unidirectionally; examples include ChatGLM (GLM) and U‑PaLM.
Causal LM: strict left‑to‑right unidirectional attention; examples include LLaMA‑7B, Qwen.
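The difference between the two is just the attention mask. A numpy sketch building both (True means "may attend"):

```python
import numpy as np

def causal_mask(n):
    # Strict left-to-right: token i sees positions 0..i only.
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_mask(n, prefix_len):
    # Prefix tokens attend bidirectionally among themselves;
    # generated tokens stay causal.
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = True
    return mask
```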
27. Optimizations for Multi‑Head Attention
KV cache: store previously computed key/value pairs to avoid recomputation.
MQA (Multi‑Query Attention): share K and V across heads, reducing memory bandwidth.
GQA (Grouped‑Query Attention): intermediate between MQA and full MHA, grouping K/V.
FlashAttention: split Q/K/V into small blocks, load from HBM to SRAM, minimizing I/O bottlenecks.
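The KV-cache idea can be shown with single-head attention: each decode step appends one new key/value pair and reuses all cached ones, instead of recomputing K and V for the whole prefix (random vectors stand in for real projections):

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for one query vector against
    # all cached keys/values.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d, steps = 8, 5
K_cache, V_cache, outputs = [], [], []
for _ in range(steps):
    q, k, v = rng.normal(size=(3, d))  # this step's query/key/value
    K_cache.append(k)                  # cache the new key/value once...
    V_cache.append(v)
    # ...and attend over every cached entry; past K/V are never recomputed.
    outputs.append(attend(q, np.stack(K_cache), np.stack(V_cache)))
```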
28. Common attention computation variants
Self‑attention (standard).
DIN‑style attention: retains raw weight signals without softmax, using activation‑based scaling (e.g., sigmoid) to preserve weight differences.