Build a Minimal Retrieval‑Augmented Generation (Tiny‑RAG) from Scratch
This step‑by‑step guide shows how to implement a lightweight Retrieval‑Augmented Generation system, Tiny‑RAG: defining embedding classes, loading and chunking documents, building a simple vector store, running similarity search, and calling a large language model to generate answers, with Python code for each part.
1. What is RAG?
Large language models often produce hallucinations, rely on outdated information, and lack domain‑specific insight. Retrieval‑Augmented Generation (RAG) mitigates these issues by first retrieving relevant passages from a document store and then feeding them to the generator, improving accuracy, freshness, and traceability.
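The retrieve-then-generate idea can be illustrated without any model. The snippet below is only a sketch: `retrieve` ranks passages by naive word overlap (a stand-in for the embedding search built later), and `build_prompt` numbers the retrieved passages so that an answer can be traced back to its source. Both function names and the sample passages are assumptions made for this illustration, not part of Tiny‑RAG.

```python
from typing import List


def retrieve(question: str, passages: List[str], k: int = 2) -> List[str]:
    """Rank passages by how many question words they share (toy retriever)."""
    q_words = set(question.lower().split())
    scored = sorted(
        passages,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return scored[:k]


def build_prompt(question: str, context: List[str]) -> str:
    """Number each retrieved passage so the answer can cite its source."""
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(context, start=1))
    return f"Answer using only the sources below.\n{sources}\nQuestion: {question}"


passages = [
    "Git stores snapshots of a project as commits.",
    "Cosine similarity compares the direction of two vectors.",
    "A commit in Git points to a tree of file contents.",
]
question = "how does git store a commit"
top = retrieve(question, passages, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

Only the two Git passages reach the prompt; the unrelated passage is filtered out before the generator ever sees it, which is the effect RAG relies on.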
2. Core Modules of Tiny‑RAG
Embedding (vectorization) module
Document loading and splitting module
Vector database for storing embeddings
Retrieval module that finds relevant chunks
LLM module that generates answers from retrieved context
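To make the data flow between these five modules concrete, here is a toy sketch in which every stage is replaced by a stand-in. `toy_embed` counts letters instead of calling an embedding model, `split_chunks` splits on blank lines, and `fake_llm` only echoes its context. All of these names are hypothetical, and a letter-frequency vector is far too crude for real retrieval; the point is only how the pieces connect.

```python
from typing import List

import numpy as np


def toy_embed(text: str) -> np.ndarray:
    """Stand-in embedding: a 26-dimensional letter-frequency vector."""
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec


def split_chunks(document: str) -> List[str]:
    """Stand-in splitter: one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def retrieve(query: str, chunks: List[str], vectors: List[np.ndarray], k: int = 1) -> List[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vec = toy_embed(query)
    scores = np.array([cosine(query_vec, v) for v in vectors])
    top = scores.argsort()[-k:][::-1]  # indices of the k highest scores, best first
    return [chunks[i] for i in top]


def fake_llm(question: str, context: str) -> str:
    """Stand-in generator: a real system would call a language model here."""
    return f"Q: {question} | context: {context}"


document = "git git commit commit branch\n\nzzz qqq xxx jjj"
chunks = split_chunks(document)                 # load and split
vectors = [toy_embed(c) for c in chunks]        # embed and store
best = retrieve("git commit", chunks, vectors)  # retrieve
answer = fake_llm("git commit", best[0])        # generate
print(answer)
```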
3. Embedding Base Class
A generic BaseEmbeddings class defines the interface:
from typing import List

import numpy as np


class BaseEmbeddings:
    """Base class for embeddings"""

    def __init__(self, path: str, is_api: bool) -> None:
        self.path = path      # local model path (unused for API-backed models)
        self.is_api = is_api  # True when embeddings come from a remote API

    def get_embedding(self, text: str, model: str) -> List[float]:
        raise NotImplementedError

    @classmethod
    def cosine_similarity(cls, vector1: List[float], vector2: List[float]) -> float:
        """calculate cosine similarity between two vectors"""
        dot_product = np.dot(vector1, vector2)
        magnitude = np.linalg.norm(vector1) * np.linalg.norm(vector2)
        if not magnitude:
            # A zero vector has no direction; treat it as dissimilar to everything.
            return 0.0
        return float(dot_product / magnitude)

An OpenAIEmbedding subclass shows how to call the OpenAI API:
import os
from typing import List


class OpenAIEmbedding(BaseEmbeddings):
    """class for OpenAI embeddings"""

    def __init__(self, path: str = '', is_api: bool = True) -> None:
        super().__init__(path, is_api)
        if self.is_api:
            from openai import OpenAI
            # Credentials and endpoint are read from the environment.
            self.client = OpenAI(
                api_key=os.getenv("OPENAI_API_KEY"),
                base_url=os.getenv("OPENAI_BASE_URL"),
            )

    def get_embedding(self, text: str, model: str = "text-embedding-3-large") -> List[float]:
        if self.is_api:
            # Collapse newlines so the text is embedded as a single line.
            text = text.replace("\n", " ")
            return self.client.embeddings.create(input=[text], model=model).data[0].embedding
        else:
            raise NotImplementedError

4. Document Loading and Chunking
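Overlapping chunks are easier to picture on a simplified splitter first. The sketch below is not the Tiny‑RAG utility: `sliding_chunks` is a hypothetical helper that counts characters instead of tokens and ignores line boundaries. It does show how each chunk repeats the tail of the previous one, so text that straddles a boundary is less likely to be lost to retrieval.

```python
from typing import List


def sliding_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With `size=4` and `overlap=2`, every chunk begins with the last two characters of the one before it.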
The utility reads files based on extension and splits them into token‑length chunks with overlap:
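import tiktoken

# Assumption: the excerpt does not show how `enc` is created; a tiktoken encoder is one option.
enc = tiktoken.get_encoding("cl100k_base")


class ReadFiles:
    # Assumed to be methods of the ReadFiles utility used in the demo; the excerpt omits the class header.
    # read_pdf, read_markdown and read_text are defined elsewhere in the class.

    @classmethod
    def read_file_content(cls, file_path: str):
        # Dispatch on the file extension.
        if file_path.endswith('.pdf'):
            return cls.read_pdf(file_path)
        elif file_path.endswith('.md'):
            return cls.read_markdown(file_path)
        elif file_path.endswith('.txt'):
            return cls.read_text(file_path)
        else:
            raise ValueError("Unsupported file type")

    @classmethod
    def get_chunk(cls, text: str, max_token_len: int = 600, cover_content: int = 150):
        chunk_text = []
        curr_len = 0
        curr_chunk = ''
        lines = text.split('\n')
        for line in lines:
            # Removes every space, which suits Chinese text but would merge English words.
            line = line.replace(' ', '')
            line_len = len(enc.encode(line))
            if line_len > max_token_len:
                print('warning line_len =', line_len)
            if curr_len + line_len <= max_token_len:
                curr_chunk += line + '\n'
                curr_len += line_len + 1
            else:
                chunk_text.append(curr_chunk)
                # Start the next chunk with the tail of the previous one.
                # Note: the overlap is sliced in characters, not tokens.
                curr_chunk = curr_chunk[-cover_content:] + line
                curr_len = line_len + cover_content
        if curr_chunk:
            chunk_text.append(curr_chunk)
        return chunk_text

5. Simple Vector Store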
The VectorStore class holds document chunks and their embeddings and provides persistence and similarity search:
from typing import List, Optional


class VectorStore:
    def __init__(self, document: Optional[List[str]] = None) -> None:
        # Avoid a shared mutable default argument.
        self.document = document if document is not None else ['']
        # Filled by get_vector or load_vector; read by query.
        self.vectors: List[List[float]] = []

    def get_vector(self, EmbeddingModel: BaseEmbeddings) -> List[List[float]]:
        # obtain vector representations for each document chunk
        pass

    def persist(self, path: str = 'storage'):
        # save vectors locally
        pass

    def load_vector(self, path: str = 'storage'):
        # load vectors from disk
        pass

    def query(self, query: str, EmbeddingModel: BaseEmbeddings, k: int = 1) -> List[str]:
        # retrieve top‑k relevant chunks
        pass

The query method computes the query embedding, measures its cosine similarity against every stored vector with NumPy, and returns the k most similar chunks:
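def query(self, query: str, EmbeddingModel: BaseEmbeddings, k: int = 1) -> List[str]:
    query_vector = EmbeddingModel.get_embedding(query)
    # Score every stored vector against the query.
    result = np.array([EmbeddingModel.cosine_similarity(query_vector, vector) for vector in self.vectors])
    # argsort is ascending: take the last k indices and reverse them so the best match comes first.
    return np.array(self.document)[result.argsort()[-k:][::-1]].tolist()

6. LLM Interface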
A generic BaseModel defines chat and load_model. An example InternLMChat subclass loads a local transformer model and formats prompts using a dictionary of templates:
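from typing import List, Optional

# The prompt is kept in Chinese because it targets InternLM. English translation:
#   "First summarize the context, then use the context to answer the user's question.
#    If you don't know the answer, say you don't know. Always answer in Chinese.
#    Question: {question}
#    Context for reference: {context}
#    If the given context does not let you answer, reply that the database does not
#    contain this content and that you don't know.
#    Useful answer:"
PROMPT_TEMPLATE = dict(
    InternLM_PROMPT_TEMPLATE="""先对上下文进行内容总结,再使用上下文来回答用户的问题。如果你不知道答案,就说你不知道。总是使用中文回答。
问题: {question}
可参考的上下文:
…
{context}
…
如果给定的上下文无法让你做出回答,请回答数据库中没有这个内容,你不知道。
有用的回答:"""
)


class BaseModel:
    def __init__(self, path: str = '') -> None:
        self.path = path

    def chat(self, prompt: str, history: List[dict], content: str) -> str:
        pass

    def load_model(self):
        pass


class InternLMChat(BaseModel):
    def __init__(self, path: str = ''):
        super().__init__(path)
        self.load_model()

    def chat(self, prompt: str, history: Optional[List] = None, content: str = '') -> str:
        history = history or []  # avoid a shared mutable default argument
        # Fill the template with the user question and the retrieved context.
        prompt = PROMPT_TEMPLATE['InternLM_PROMPT_TEMPLATE'].format(question=prompt, context=content)
        response, history = self.model.chat(self.tokenizer, prompt, history)
        return response

    def load_model(self):
        import torch
        from transformers import AutoTokenizer, AutoModelForCausalLM
        self.tokenizer = AutoTokenizer.from_pretrained(self.path, trust_remote_code=True)
        # Load in half precision and move the model to the GPU.
        self.model = AutoModelForCausalLM.from_pretrained(self.path, torch_dtype=torch.float16, trust_remote_code=True).cuda()

7. Tiny‑RAG Demo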
Putting everything together:
from RAG.VectorBase import VectorStore
from RAG.utils import ReadFiles
from RAG.LLM import OpenAIChat, InternLMChat
from RAG.Embeddings import JinaEmbedding, ZhipuEmbedding

# Load and split documents
docs = ReadFiles('./data').get_content(max_token_len=600, cover_content=150)
vector = VectorStore(docs)

# Embed every chunk and save the vectors to disk
embedding = ZhipuEmbedding()
vector.get_vector(EmbeddingModel=embedding)
vector.persist(path='storage')

# Retrieve the most relevant chunk and pass it to the LLM as context
question = 'What is the principle of Git?'
content = vector.query(question, EmbeddingModel=embedding, k=1)[0]
chat = InternLMChat(path='model_path')
print(chat.chat(question, [], content))

The same workflow can load a previously persisted store:
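from RAG.VectorBase import VectorStore
from RAG.LLM import InternLMChat
from RAG.Embeddings import ZhipuEmbedding

# Reuse the vectors saved earlier instead of embedding the documents again
vector = VectorStore()
vector.load_vector('./storage')

question = 'What is the principle of Git?'
embedding = ZhipuEmbedding()
content = vector.query(question, EmbeddingModel=embedding, k=1)[0]
chat = InternLMChat(path='model_path')
print(chat.chat(question, [], content))

8. Summary of Required Components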
Embedding (vectorization) module
Document loading and splitting module
Vector database
Retrieval (similarity search) module
Large‑model (LLM) module
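Of these components, the vector database is the one whose `persist` and `load_vector` methods are left as stubs in the VectorStore skeleton. One possible way to fill them in, shown here as standalone functions, is to write the chunks and their vectors to two JSON files. The names `save_store` and `load_store` and the file names are assumptions made for this sketch, not the article's implementation.

```python
import json
import os
import tempfile
from typing import List, Tuple


def save_store(path: str, chunks: List[str], vectors: List[List[float]]) -> None:
    """Write chunks and vectors to two JSON files inside `path`."""
    if len(chunks) != len(vectors):
        raise ValueError("each chunk needs exactly one vector")
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "document.json"), "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False)  # keep non-ASCII text readable
    with open(os.path.join(path, "vectors.json"), "w", encoding="utf-8") as f:
        json.dump(vectors, f)


def load_store(path: str) -> Tuple[List[str], List[List[float]]]:
    """Read back what save_store wrote."""
    with open(os.path.join(path, "document.json"), encoding="utf-8") as f:
        chunks = json.load(f)
    with open(os.path.join(path, "vectors.json"), encoding="utf-8") as f:
        vectors = json.load(f)
    return chunks, vectors


# Round trip through a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    store_dir = os.path.join(tmp, "storage")
    save_store(store_dir, ["chunk one", "chunk two"], [[0.1, 0.2], [0.3, 0.4]])
    loaded_chunks, loaded_vectors = load_store(store_dir)
```

Keeping chunks and vectors in index-aligned lists means position `i` in one file always corresponds to position `i` in the other, which is what the argsort-based query relies on.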
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
