Build an Enterprise RAG Vector Search System from Scratch with LangChain, Easysearch, and MiMo

This article walks through the complete end‑to‑end pipeline for building a production‑grade RAG system—including document chunking, embedding generation via MiMo, vector storage and kNN retrieval in Easysearch, hybrid search configuration, prompt engineering, answer generation, interactive chat, and a detailed list of common pitfalls and fixes.

Mingyi World Elasticsearch
Mingyi World Elasticsearch
Mingyi World Elasticsearch
Build an Enterprise RAG Vector Search System from Scratch with LangChain, Easysearch, and MiMo

Background

Enterprise knowledge‑base Q&A requires three decisions: where to store vectors, how to retrieve them, and which LLM generates answers. The chosen stack combines Easysearch (vector store with Elasticsearch‑compatible API and kNN plugin), MiMo (OpenAI‑compatible chat model that can be prompted for embeddings), and a lightweight Python orchestration layer that calls the REST APIs directly.

Architecture

Stage 1 – Offline indexing

Original document (easysearch_docs.txt)
    → RecursiveCharacterTextSplitter (500 chars per chunk, 50‑char overlap)
    → MiMo LLM semantic encoding (256‑dim vector per chunk)
    → Easysearch kNN index (knn_dense_float_vector)

Stage 2 – Online retrieval & generation

User query
    → MiMo LLM vectorisation (same model, same embedding space)
    → Easysearch knn_nearest_neighbors (Top‑K retrieval)
    → Concatenate retrieved passages + user question
    → MiMo LLM generates answer

All vectors share the same semantic space; mismatches break retrieval quality.

Step 1 – Environment preparation

Prerequisites

Running Easysearch cluster with HTTPS enabled and kNN plugin loaded.

MiMo API key (Base URL https://token-plan-cn.xiaomimimo.com/v1).

Python ≥ 3.10.

Project layout

vectorPrj/
├── .env                # API keys, connection params
├── config.py           # config loader
├── es_client.py        # Easysearch REST wrapper (requests)
├── mimo_embeddings.py   # MiMo embedding wrapper
├── indexing.py         # document chunking & vector write‑back
├── retriever.py        # kNN vector retrieval
├── rag_qa.py           # RAG chain
├── search_test.py      # retrieval verification
├── main.py             # CLI entry point
├── easysearch_docs.txt # knowledge‑base documents
└── requirements.txt    # dependencies

Install dependencies

pip install -r requirements.txt

Core packages: requests, python-dotenv, langchain, langchain‑community.

Step 2 – Generate embeddings with MiMo

Key insight

MiMo v2.5‑pro is an inference model without a dedicated embedding endpoint. Embeddings are obtained by prompting the Chat API to return a JSON array of 256‑dimensional floats.

Implementation

Text → Prompt: "You are a text semantic encoder. Map the following text to a 256‑dimensional vector. Output JSON only."
    → MiMo Chat API
    → Parse JSON → vector

Core code (mimo_embeddings.py)

class MiMoLLMEmbeddings(Embeddings):
    """Generate semantic vectors using MiMo Chat API"""
    def _batch_to_vectors(self, texts: List[str]) -> List[List[float]]:
        prompt = (
            f"You are a text semantic encoder. Please map the following {len(texts)} pieces of text to "
            f"{self.dims}-dimensional vectors. Values must be between -1.0 and 1.0.
"
            "Only output JSON, no explanations.
"
            "Format: {\"vectors\": [[...], [...], ...]}
"
        )
        msg = self._call_chat(prompt)
        content = msg.get("content", "") or msg.get("reasoning_content", "")
        return self._extract_vectors(content, len(texts), self.dims)

Gotchas & tuning

max_tokens too low : 256‑dim vectors need ~3000 tokens. max_tokens=4096 caused truncation to zeros; max_tokens=8192 stabilises output.

reasoning_content vs. content : MiMo returns both fields; either may contain the vector JSON.

Batch size : BATCH_SIZE=3 sometimes produced mis‑aligned vectors. Setting BATCH_SIZE=1 (process one chunk at a time) is slower but reliable.

Vector parsing strategies

# Strategy 1: parse JSON {"vectors": [[...], ...]}
# Strategy 2: regex‑extract numeric arrays [0.1, -0.2, ...]
# Pad or truncate to required dimension

Alternative backend

If MiMo is unavailable, set EMBEDDING_BACKEND=bge in .env to switch to the local BAAI/bge‑base‑zh‑v1.5 model without code changes.

Step 3 – Write documents to Easysearch vector index

Why not use LangChain ElasticsearchStore?

The official Elasticsearch Python client validates that the server is a genuine Elasticsearch node and rejects Easysearch, which is ES‑compatible but not Elasticsearch. Direct requests calls avoid this validation.

Custom HTTP wrapper (es_client.py)

def es_request(method, path, body=None):
    url = f"{Config.ES_HOST}/{path}"
    resp = session.request(method, url, data=json.dumps(body))
    return resp.json()

Create kNN index

mapping = {
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "source": {"type": "keyword"},
            "content_vector": {
                "type": "knn_dense_float_vector",
                "knn": {"dims": 256}
            }
        }
    }
}
es_request("PUT", "rag-easysearch-docs", body=mapping)

Key parameters: knn_dense_float_vector – required vector field type for the Easysearch kNN plugin. dims=256 – must match MiMo output; mismatched dimensions raise indexing errors. content (text) and source (keyword) enable hybrid BM25 retrieval.

Document chunking & bulk write

def index_documents():
    docs = TextLoader("easysearch_docs.txt").load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=["

", "
", "。", "!", "?", ",", " ", ""]
    )
    chunks = splitter.split_documents(docs)
    embeddings = MiMoLLMEmbeddings()
    vectors = embeddings.embed_documents([c.page_content for c in chunks])
    actions = []
    for chunk, vec in zip(chunks, vectors):
        actions.append({
            "_index": "rag-easysearch-docs",
            "_source": {
                "content": chunk.page_content,
                "source": chunk.metadata["source"],
                "content_vector": vec
            }
        })
    es_bulk(actions)

Run indexing:

python main.py index

Step 4 – Verify vector retrieval

kNN query syntax (Easysearch native)

body = {
    "query": {
        "knn_nearest_neighbors": {
            "field": "content_vector",
            "vec": {"values": query_vector},
            "model": "exact",          # exact scan, best for <100k vectors
            "similarity": "cosine",
            "candidates": 100
        }
    }
}

Model parameter options

exact

– precise computation, scans all vectors; suitable when data < 100 k and accuracy is priority. lsh – locality‑sensitive hashing, approximate search; suitable for large data when speed is priority (requires a compatible mapping).

Hybrid search (vector + BM25)

{
  "bool": {
    "should": [
      {"knn_nearest_neighbors": {...}},   // vector similarity (score boost)
      {"match": {"content": query}}    // BM25 keyword match (score boost)
    ]
  }
}

Toggle hybrid mode with HYBRID_SEARCH=true/false in .env.

Verification command

python main.py search "security features"

Expected output shows the top‑3 documents with relevance scores. If the top results are unrelated, the embedding model or language mismatch is likely the cause.

Step 5 – Build the RAG QA chain

Full pipeline

User question → MiMo vectorisation → Easysearch kNN Top‑K → Prompt concatenation → MiMo LLM generates answer

Custom prompt (to avoid hallucination)

PROMPT = """You are an Easysearch technical expert. Answer the question strictly based on the following document content. If the document does not contain the answer, say \"No relevant information found in the documents\". Do not fabricate.

Document content:
{context}

User question: {question}

Answer:"""

QA execution

python main.py ask "What security features does Easysearch provide?"

Sample answer includes a bullet list of features and the source documents with relevance scores.

Step 6 – Interactive chat

python main.py chat

After launch, each user turn performs automatic retrieval and generation, demonstrating the end‑to‑end RAG dialogue.

Common pitfalls (four most frequent)

Pitfall 01 – kNN plugin installed but not loaded

Symptom : Index creation fails with "No handler for type knn_dense_float_vector".

Cause : Plugin appears in easysearch‑plugin list but not in _nodes/plugins.

Fix : Restart Easysearch; the plugin becomes active only after a restart.

Pitfall 02 – Vector dimension mismatch

Symptom : Mapping error during indexing.

Cause : MiMo outputs 256‑dim vectors while the index mapping defines a different dims value.

Fix : Ensure VECTOR_DIMS in config.py matches the actual embedding dimension.

Pitfall 03 – lsh model incompatible with mapping

Symptom : kNN query returns 400 error "query is not compatible with mapping".

Cause : model=lsh requires a special mapping that the default does not provide.

Fix : Switch to model=exact or create an index with the required lsh parameters.

Pitfall 04 – All vectors become zero

Symptom : Retrieval results are unrelated to the query.

Cause : MiMo output truncated because max_tokens was too low, leading to zero‑filled vectors.

Fix : Increase max_tokens=8192, set BATCH_SIZE=1, or parse reasoning_content instead of content.

Production checklist

Easysearch kNN plugin installed and loaded ( _nodes/plugins verified).

MiMo LLM wrapped as a LangChain Embeddings interface.

Documents chunked (500 chars, 50‑char overlap) and indexed in Easysearch.

Retrieval quality verified before connecting the LLM.

Custom prompt enforces document‑based answers.

Hybrid search (vector + BM25) enabled for better proper‑noun recall.

Source documents displayed with each answer for traceability.

Interactive chat mode available for end‑user experience.

Conclusion

The article demonstrates a complete RAG system built from scratch using Easysearch for storage/retrieval, MiMo for embedding and generation, and a minimal Python orchestration layer. Core decisions include bypassing the Elasticsearch client, leveraging the LLM chat API for embeddings, validating retrieval before generation, and adopting hybrid search as the default strategy. Following the provided steps, code, and pitfall mitigations enables rapid deployment of a reliable, enterprise‑grade RAG solution.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

PythonLangChainRAGVector SearchkNNMiMoEasysearch
Mingyi World Elasticsearch
Written by

Mingyi World Elasticsearch

The leading WeChat public account for Elasticsearch fundamentals, advanced topics, and hands‑on practice. Join us to dive deep into the ELK Stack (Elasticsearch, Logstash, Kibana, Beats).

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.