Build a Minimal Retrieval‑Augmented Generation (Tiny‑RAG) from Scratch

This step‑by‑step guide builds Tiny‑RAG, a lightweight Retrieval‑Augmented Generation system, in five stages: defining embedding classes, loading and chunking documents, building a simple vector store, running similarity search, and wiring in a large language model to generate answers. Each stage comes with Python code.

Sohu Tech Products

1. What is RAG?

Large language models often produce hallucinations, rely on outdated information, and lack domain‑specific insight. Retrieval‑Augmented Generation (RAG) mitigates these issues by first retrieving relevant passages from a document store and then feeding them to the generator, improving accuracy, freshness, and traceability.

2. Core Modules of Tiny‑RAG

Embedding (vectorization) module

Document loading and splitting module

Vector database for storing embeddings

Retrieval module that finds relevant chunks

LLM module that generates answers from retrieved context
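
As a rough illustration of how these modules hand data to one another, the sketch below wires a retriever and a generator together. The names answer, toy_retrieve, toy_generate and CHUNKS are hypothetical stand-ins invented for this example; the word-overlap ranking is only a placeholder for the embedding-based retrieval described later.

```python
from typing import Callable, List


def answer(question: str,
           retrieve: Callable[[str, int], List[str]],
           generate: Callable[[str, str], str],
           k: int = 1) -> str:
    """Retrieve k chunks, join them, and pass them to the generator."""
    context = "\n".join(retrieve(question, k))
    return generate(question, context)


# Toy stand-ins (hypothetical) so the flow runs without any model.
CHUNKS = ["Git stores snapshots of a project.", "RAG retrieves before generating."]


def toy_retrieve(question: str, k: int) -> List[str]:
    # Rank chunks by how many lowercase words they share with the question.
    q_words = set(question.lower().split())
    ranked = sorted(CHUNKS,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return ranked[:k]


def toy_generate(question: str, context: str) -> str:
    # A real LLM call would go here; this just echoes its inputs.
    return f"Q: {question} | context: {context}"


result = answer("What does Git store?", toy_retrieve, toy_generate)
```

Passing the retriever and generator in as callables keeps each module replaceable, which matches the way the later sections swap embedding and LLM backends.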

3. Embedding Base Class

A generic BaseEmbeddings class defines the interface:

from typing import List

import numpy as np


class BaseEmbeddings:
    """Base class for embeddings"""
    def __init__(self, path: str, is_api: bool) -> None:
        self.path = path
        self.is_api = is_api  # True when embeddings come from a remote API

    def get_embedding(self, text: str, model: str) -> List[float]:
        # subclasses return the embedding vector for `text`
        raise NotImplementedError

    @classmethod
    def cosine_similarity(cls, vector1: List[float], vector2: List[float]) -> float:
        """calculate cosine similarity between two vectors"""
        dot_product = np.dot(vector1, vector2)
        magnitude = np.linalg.norm(vector1) * np.linalg.norm(vector2)
        if not magnitude:
            # a zero vector has no direction, so report no similarity
            return 0.0
        return float(dot_product / magnitude)
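
To make the formula concrete, here is a small worked example of the same calculation (dot product divided by the product of the norms). It is a standalone sketch written with the standard library only so that it runs without NumPy; the input vectors are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Dot product divided by the product of the two Euclidean norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    magnitude = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / magnitude if magnitude else 0.0


parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # same direction, close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # no shared direction, 0
degenerate = cosine_similarity([0.0, 0.0], [1.0, 1.0])  # zero vector, guarded to 0
```

Because the score depends only on direction, scaling a vector does not change it, which is why the first pair scores as a near-perfect match.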

An OpenAIEmbedding subclass shows how to call the OpenAI API:

import os


class OpenAIEmbedding(BaseEmbeddings):
    """class for OpenAI embeddings"""
    def __init__(self, path: str = '', is_api: bool = True) -> None:
        super().__init__(path, is_api)
        if self.is_api:
            from openai import OpenAI
            # read the credentials and endpoint from the environment
            self.client = OpenAI(
                api_key=os.getenv("OPENAI_API_KEY"),
                base_url=os.getenv("OPENAI_BASE_URL"),
            )

    def get_embedding(self, text: str, model: str = "text-embedding-3-large") -> List[float]:
        if self.is_api:
            # replace newlines with spaces before sending the text
            text = text.replace("\n", " ")
            return self.client.embeddings.create(input=[text], model=model).data[0].embedding
        else:
            raise NotImplementedError
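
For trying out the rest of the pipeline without an API key, an offline stand-in can be useful. The class below is a hypothetical sketch, not part of the original project: HashingEmbedding exposes the same get_embedding method but builds a normalized bag-of-words vector by hashing each word into a fixed number of buckets. It captures word overlap only, not meaning.

```python
import hashlib
import math
from typing import List


class HashingEmbedding:
    """Hypothetical offline stand-in exposing the same get_embedding method."""

    def __init__(self, dim: int = 64) -> None:
        self.dim = dim

    def get_embedding(self, text: str, model: str = "hashing") -> List[float]:
        vec = [0.0] * self.dim
        for token in text.lower().split():
            # md5 gives a hash that is stable across runs, unlike built-in hash().
            idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % self.dim
            vec[idx] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        # Normalize to unit length; an empty text stays the zero vector.
        return [x / norm for x in vec] if norm else vec


emb = HashingEmbedding()
vector = emb.get_embedding("git stores snapshots")
```

The model argument is accepted only to keep the call signature compatible and is ignored.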

4. Document Loading and Chunking

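The utility picks a reader based on the file extension, then splits the text line by line into chunks of at most max_token_len tokens. Each new chunk starts with the last cover_content characters of the previous one as overlap: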

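import tiktoken

# tokenizer used to measure chunk length (assumed here: the cl100k_base encoding)
enc = tiktoken.get_encoding("cl100k_base")


class ReadFiles:
    # assumed to be the ReadFiles class imported in the demo;
    # read_pdf, read_markdown and read_text are format-specific readers (not shown)

    @classmethod
    def read_file_content(cls, file_path: str):
        # dispatch on the file extension
        if file_path.endswith('.pdf'):
            return cls.read_pdf(file_path)
        elif file_path.endswith('.md'):
            return cls.read_markdown(file_path)
        elif file_path.endswith('.txt'):
            return cls.read_text(file_path)
        else:
            raise ValueError("Unsupported file type")

    @classmethod
    def get_chunk(cls, text: str, max_token_len: int = 600, cover_content: int = 150):
        chunk_text = []
        curr_len = 0
        curr_chunk = ''
        lines = text.split('\n')
        for line in lines:
            # removes every space: fine for Chinese text, but it merges English words
            line = line.replace(' ', '')
            line_len = len(enc.encode(line))
            if line_len > max_token_len:
                print('warning line_len =', line_len)
            if curr_len + line_len <= max_token_len:
                curr_chunk += line + '\n'
                curr_len += line_len + 1
            else:
                chunk_text.append(curr_chunk)
                # start the next chunk with the tail of the previous one as overlap
                curr_chunk = curr_chunk[-cover_content:] + line
                curr_len = line_len + cover_content
        if curr_chunk:
            chunk_text.append(curr_chunk)
        return chunk_text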
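
The overlap idea is easier to see on a tiny input. The sketch below is a simplified variant, not the function above: it slides a fixed-size window over a list of whitespace-separated tokens instead of accumulating lines, and the name chunk_tokens and its parameters are hypothetical.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_len: int = 5, overlap: int = 2) -> List[List[str]]:
    """Split tokens into windows of max_len that share `overlap` tokens."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        # Stop once the current window already reaches the end of the input.
        if start + max_len >= len(tokens):
            break
    return chunks


tokens = "a b c d e f g h".split()
chunks = chunk_tokens(tokens)
# chunks == [['a', 'b', 'c', 'd', 'e'], ['d', 'e', 'f', 'g', 'h']]
```

The tokens d and e appear in both windows, so a sentence that straddles a chunk boundary is still retrievable from at least one chunk.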

5. Simple Vector Store

The VectorStore class holds document chunks and their embeddings and provides persistence and similarity search:

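from typing import List, Optional


class VectorStore:
    def __init__(self, document: Optional[List[str]] = None):
        # avoid a shared mutable default; fall back to a single empty chunk
        self.document = document if document is not None else ['']
        self.vectors: List[List[float]] = []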

    def get_vector(self, EmbeddingModel: BaseEmbeddings) -> List[List[float]]:
        # obtain vector representations for each document chunk
        pass

    def persist(self, path: str = 'storage'):
        # save vectors locally
        pass

    def load_vector(self, path: str = 'storage'):
        # load vectors from disk
        pass

    def query(self, query: str, EmbeddingModel: BaseEmbeddings, k: int = 1) -> List[str]:
        # retrieve top‑k relevant chunks
        pass
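
One possible way to fill in the storage methods is sketched below. It is not the original implementation: TinyStore, toy_embed and the JSON file names are assumptions made for this example, and the embedding is passed in as a plain callable so the block runs on its own.

```python
import json
import os
import tempfile
from typing import Callable, List, Optional


class TinyStore:
    """Hypothetical in-memory store mirroring the skeleton's method names."""

    def __init__(self, document: Optional[List[str]] = None) -> None:
        self.document: List[str] = document or []
        self.vectors: List[List[float]] = []

    def get_vector(self, embed: Callable[[str], List[float]]) -> List[List[float]]:
        # Embed every chunk once; vectors stay aligned with self.document by index.
        self.vectors = [embed(chunk) for chunk in self.document]
        return self.vectors

    def persist(self, path: str = "storage") -> None:
        os.makedirs(path, exist_ok=True)
        # File names are assumptions made for this sketch.
        with open(os.path.join(path, "document.json"), "w", encoding="utf-8") as f:
            json.dump(self.document, f, ensure_ascii=False)
        with open(os.path.join(path, "vectors.json"), "w", encoding="utf-8") as f:
            json.dump(self.vectors, f)

    def load_vector(self, path: str = "storage") -> None:
        with open(os.path.join(path, "document.json"), encoding="utf-8") as f:
            self.document = json.load(f)
        with open(os.path.join(path, "vectors.json"), encoding="utf-8") as f:
            self.vectors = json.load(f)


def toy_embed(text: str) -> List[float]:
    # Toy embedding (hypothetical): character count and word count.
    return [float(len(text)), float(len(text.split()))]


with tempfile.TemporaryDirectory() as tmp:
    store = TinyStore(["Git stores snapshots.", "RAG retrieves first."])
    store.get_vector(toy_embed)
    store.persist(tmp)

    restored = TinyStore()
    restored.load_vector(tmp)
```

Persisting chunks and vectors side by side means a later session can answer queries without paying for the embeddings again.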

The query method computes the query embedding, measures cosine similarity with stored vectors using NumPy, and returns the most similar chunks:

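def query(self, query: str, EmbeddingModel: BaseEmbeddings, k: int = 1) -> List[str]:
    # embed the query with the same model that produced the stored vectors
    query_vector = EmbeddingModel.get_embedding(query)
    # score every stored vector against the query
    result = np.array([EmbeddingModel.cosine_similarity(query_vector, vector) for vector in self.vectors])
    # argsort is ascending: take the last k indices and reverse them so the best match comes first
    return np.array(self.document)[result.argsort()[-k:][::-1]].tolist()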
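
The indexing expression in the last line is compact, so here it is unpacked on made-up similarity scores. The chunk names and score values are illustrative only.

```python
import numpy as np

documents = ["chunk A", "chunk B", "chunk C", "chunk D"]
scores = np.array([0.10, 0.80, 0.30, 0.55])  # one similarity score per chunk
k = 2

order = scores.argsort()   # indices from lowest to highest score: [0, 2, 3, 1]
top_k = order[-k:][::-1]   # last k indices, reversed so the best comes first: [1, 3]
best = np.array(documents)[top_k].tolist()
# best == ['chunk B', 'chunk D']
```

A full sort is fine for a small store like this one; for large collections, np.argpartition or a dedicated vector index would avoid sorting every score.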

6. LLM Interface

A generic BaseModel defines chat and load_model. An example InternLMChat subclass loads a local transformer model and formats prompts using a dictionary of templates:

class BaseModel:
    def __init__(self, path: str = '') -> None:
        self.path = path
    def chat(self, prompt: str, history: List[dict], content: str) -> str:
        pass
    def load_model(self):
        pass

class InternLMChat(BaseModel):
    def __init__(self, path: str = ''):
        super().__init__(path)
        self.load_model()
    def chat(self, prompt: str, history: List = [], content: str = '') -> str:
        prompt = PROMPT_TEMPLATE['InternLM_PROMPT_TEMPALTE'].format(question=prompt, context=content)
        response, history = self.model.chat(self.tokenizer, prompt, history)
        return response
    def load_model(self):
        import torch
        from transformers import AutoTokenizer, AutoModelForCausalLM
        self.tokenizer = AutoTokenizer.from_pretrained(self.path, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(self.path, torch_dtype=torch.float16, trust_remote_code=True).cuda()

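# English gloss of the template below: "First summarize the context, then use it to
# answer the user's question. If you don't know the answer, say you don't know. Always
# answer in Chinese. Question: {question} / Reference context: {context} / If the given
# context does not let you answer, reply that the database does not contain this content
# and that you don't know. Helpful answer:"
PROMPT_TEMPLATE = dict(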
    InternLM_PROMPT_TEMPALTE="""先对上下文进行内容总结,再使用上下文来回答用户的问题。如果你不知道答案,就说你不知道。总是使用中文回答。
        问题: {question}
        可参考的上下文:
        …
        {context}
        …
        如果给定的上下文无法让你做出回答,请回答数据库中没有这个内容,你不知道。
        有用的回答:"""
)
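
The way a template and retrieved chunks combine into the final prompt can be sketched as follows. The English TEMPLATE and the build_prompt helper are hypothetical and written for this example; they follow the same {question} and {context} placeholders as the template above.

```python
from typing import List

# Hypothetical English template using the same two placeholders.
TEMPLATE = (
    "Use the context to answer the question. "
    "If the context is not enough, say you do not know.\n"
    "Question: {question}\n"
    "Context:\n{context}\n"
    "Answer:"
)


def build_prompt(question: str, chunks: List[str]) -> str:
    """Join the retrieved chunks and substitute them into the template."""
    return TEMPLATE.format(question=question, context="\n".join(chunks))


prompt = build_prompt("What is Git?", ["Git is a version control system.",
                                       "It stores snapshots."])
```

Ending the prompt on an answer cue, as the original template does, nudges the model to continue with the answer rather than restate the question.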

7. Tiny‑RAG Demo

Putting everything together:

from RAG.VectorBase import VectorStore
from RAG.utils import ReadFiles
from RAG.LLM import OpenAIChat, InternLMChat
from RAG.Embeddings import JinaEmbedding, ZhipuEmbedding

# Load and split documents
docs = ReadFiles('./data').get_content(max_token_len=600, cover_content=150)
vector = VectorStore(docs)
embedding = ZhipuEmbedding()
vector.get_vector(EmbeddingModel=embedding)
vector.persist(path='storage')

question = 'What is the principle of Git?'
content = vector.query(question, EmbeddingModel=embedding, k=1)[0]
chat = InternLMChat(path='model_path')
print(chat.chat(question, [], content))

The same workflow can load a previously persisted store:

vector = VectorStore()
vector.load_vector('./storage')
question = 'What is the principle of Git?'
embedding = ZhipuEmbedding()
content = vector.query(question, EmbeddingModel=embedding, k=1)[0]
chat = InternLMChat(path='model_path')
print(chat.chat(question, [], content))

8. Summary of Required Components

Embedding (vectorization) module

Document loading and splitting module

Vector database

Retrieval (similarity search) module

Large‑model (LLM) module

RAG architecture diagram