Is RAG Dead? Meet Sirchmunk – an Embedding‑Free Search Engine that Ditches Vector Databases

Sirchmunk, an open‑source search engine from Alibaba's ModelScope team, eliminates the need for embeddings and vector databases with a multi‑stage, Monte‑Carlo‑based pipeline that builds self‑evolving knowledge clusters. It offers zero‑setup indexing, real‑time freshness, and flexible integration options.

Old Zhang's AI Learning

Sirchmunk overview

Sirchmunk is an open‑source search engine that operates without building embeddings or a vector store. It reads raw files directly and stores each search result as a structured, self‑evolving knowledge cluster.

Key capabilities

Embedding‑free search: No vector database or ETL pipeline; supports 100+ file formats (PDF, code, Markdown, etc.).

Self‑evolving knowledge clusters: Search results are retained, merged into reusable clusters, and improve with repeated use.

Monte Carlo Evidence Sampling: An explore‑exploit strategy extracts relevant evidence from large documents without processing the entire text.

ReAct agent fallback: If standard retrieval fails, a ReAct agent iteratively explores until an answer is found.

Multiple integration methods: MCP protocol (Claude Desktop, Cursor IDE), REST API, WebSocket, CLI, and Web UI.

Comparison with traditional RAG

Setup cost: Traditional RAG requires a vector DB, graph DB, parsers, etc.; Sirchmunk requires zero infrastructure.

Data freshness: Traditional RAG relies on batch re‑indexing; Sirchmunk updates in real time via self‑evolving clusters.

Scalability: Traditional RAG incurs linear cost growth; Sirchmunk has minimal RAM/CPU usage.

Accuracy: Traditional RAG uses approximate vector matching; Sirchmunk provides deterministic, context‑aware retrieval.

Workflow complexity: Traditional RAG needs a complex ETL pipeline; Sirchmunk works by dropping files and searching with zero configuration.

Multi‑stage search pipeline

Phase 0 – Knowledge‑cluster reuse: If a past query is semantically similar (cosine similarity ≥ 0.85), the cached cluster is returned in sub‑second time and the new query is appended to the cluster history.

Phase 1 – Parallel probing: Four independent probes run concurrently – LLM keyword extraction, directory scan, knowledge cache lookup, and path‑context loading – to maximize speed.

Phase 2 – Retrieval & ranking: IDF‑weighted keyword search retrieves candidates; an LLM re‑ranks them using metadata (Phases 0–2 are sketched in code after this list).

Phase 3 – Knowledge‑cluster construction: Results are merged, deduplicated, processed by Monte Carlo Evidence Sampling, and the LLM synthesizes a structured knowledge cluster.

Phase 4 – Summarization or ReAct refinement: If evidence is found, a structured summary is generated; otherwise a ReAct agent is activated to iteratively explore until an answer emerges.

Phase 5 – Persistence: Valuable clusters, together with their embeddings, are stored for future reuse.
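
To make the flow concrete, below is a toy, self‑contained Python sketch of Phases 0–2. Everything in it is a stand‑in: the probe bodies, the file list, and the helper names (extract_keywords, idf_rank, and so on) are illustrative assumptions, not Sirchmunk's internal API. Only the overall shape – a cached‑answer gate, four concurrent probes, and IDF‑weighted candidate ranking – follows the description above.

import asyncio
import math


async def extract_keywords(query: str) -> list[str]:
    # Stand-in for the LLM keyword-extraction probe.
    return [w.strip("?").lower() for w in query.split() if len(w) > 3]


async def scan_directory() -> list[str]:
    # Stand-in for the directory-scan probe.
    return [
        "auth.py: token validation and session middleware",
        "db.py: connection pooling and schema migrations",
        "api.py: endpoint routing with authentication checks",
    ]


async def lookup_cache(query: str) -> str | None:
    # Stand-in for the knowledge-cache probe; a Phase 0 hit
    # would short-circuit the rest of the pipeline.
    return None


async def load_path_context() -> dict:
    # Stand-in for the path-context probe.
    return {"root": "."}


def idf_rank(keywords: list[str], docs: list[str]) -> list[str]:
    # Phase 2: rarer keywords weigh more, idf = log(N / df).
    n = len(docs)
    df = {k: sum(k in d.lower() for d in docs) for k in keywords}

    def score(doc: str) -> float:
        return sum(math.log(n / df[k]) for k in keywords
                   if df[k] and k in doc.lower())

    return sorted(docs, key=score, reverse=True)


async def main() -> None:
    query = "How does authentication work?"
    # Phase 1: the four probes run concurrently.
    keywords, docs, cached, _ctx = await asyncio.gather(
        extract_keywords(query),
        scan_directory(),
        lookup_cache(query),
        load_path_context(),
    )
    if cached is not None:
        print(cached)  # Phase 0: sub-second cached answer
        return
    print(idf_rank(keywords, docs))  # candidates for LLM re-ranking


asyncio.run(main())

In the real pipeline the ranked candidates would then be handed to the LLM together with their metadata for re‑ranking.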

Monte Carlo Evidence Sampling

Scatter (explore): Fuzzy anchor matching plus hierarchical random sampling identifies promising seed regions while ensuring coverage.

Focus (exploit): Gaussian importance sampling concentrates on high‑score seeds to densely extract likely relevant areas.

Synthesize: Top‑K fragments are handed to the LLM, which composes a coherent region‑of‑interest summary (a toy sketch of the scatter and focus steps follows this list).

Document‑agnostic: Works equally well on a 2‑page memo and a 500‑page technical manual.

Token‑efficient: Only the most relevant regions are sent to the LLM, drastically reducing token usage.

Explore‑exploit balance: Random exploration avoids tunnel vision; importance sampling ensures depth where needed.
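
The following toy sketch illustrates the scatter‑then‑focus idea under stated assumptions: a plain substring score stands in for fuzzy anchor matching, and a coarse stride stands in for hierarchical sampling. The names and parameters (chunk_score, sigma, n_draws) are hypothetical, not taken from Sirchmunk's code.

import random


def chunk_score(chunk: str, keywords: list[str]) -> float:
    # Cheap relevance proxy: fraction of keywords present in the chunk.
    return sum(k in chunk.lower() for k in keywords) / len(keywords)


def scatter(chunks: list[str], keywords: list[str],
            n_random: int = 6, stride: int = 10) -> list[tuple[int, float]]:
    # Explore: strided probes plus uniform random seeds give coverage
    # without reading the whole document.
    probes = set(range(0, len(chunks), stride))
    probes |= set(random.sample(range(len(chunks)), n_random))
    return [(i, chunk_score(chunks[i], keywords)) for i in probes]


def focus(chunks: list[str], seeds: list[tuple[int, float]],
          keywords: list[str], sigma: float = 2.0,
          n_draws: int = 16) -> list[tuple[int, float]]:
    # Exploit: Gaussian draws around the best seeds densely sample the
    # neighborhoods most likely to contain relevant evidence.
    best = [i for i, _ in sorted(seeds, key=lambda s: -s[1])[:3]]
    hits: dict[int, float] = {}
    for center in best:
        for _ in range(n_draws):
            j = int(random.gauss(center, sigma))
            if 0 <= j < len(chunks):
                hits[j] = chunk_score(chunks[j], keywords)
    return sorted(hits.items(), key=lambda h: -h[1])


# Synthesize: in Sirchmunk the top-K fragments would now go to the LLM.
chunks = [f"section {i}: unrelated prose" for i in range(200)]
chunks[90] = "section 90: attention uses query-key dot products"
chunks[91] = "section 91: softmax normalizes the attention weights"
keywords = ["attention", "softmax"]
top_k = focus(chunks, scatter(chunks, keywords), keywords)[:5]
print(top_k)

Because only sampled chunks are ever scored, the cost scales with the number of draws rather than the document length, which is what makes the approach document‑agnostic and token‑efficient.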

Self‑evolving Knowledge Clusters

Queries are embedded and compared to stored clusters (cosine similarity ≥ 0.85 triggers a hit).

On hit: the new query is appended to the cluster's history (FIFO, max 5 queries), the heat score increases by 0.1 (capped at 1.0), and the embedding is recomputed to broaden semantic coverage (sketched in code below).

On miss: the full pipeline runs to create a new cluster.

Storage uses DuckDB + Parquet with atomic writes and multi‑process safety.

Zero‑cost acceleration: Repeated or semantically similar queries bypass LLM inference, yielding near‑instant responses.

Query‑driven embeddings: Embeddings are derived from the user query, aligning with actual information needs.

Semantic widening: Reusing a cluster causes its embedding to drift, covering a broader semantic neighborhood.
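
Here is a minimal sketch of the hit‑path bookkeeping described above, assuming a hypothetical KnowledgeCluster class. Only the stated behavior is modeled: the 0.85 cosine gate, the FIFO history capped at 5 queries, heat increments of 0.1 capped at 1.0, and an embedding recomputed toward the new query. The blending rule used here is a guess to illustrate the drift; the project does not document its exact formula.

from collections import deque
from dataclasses import dataclass, field


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)


@dataclass
class KnowledgeCluster:
    embedding: list[float]
    queries: deque = field(default_factory=lambda: deque(maxlen=5))  # FIFO
    heat: float = 0.1

    def register_hit(self, query: str, query_emb: list[float]) -> None:
        self.queries.append(query)              # FIFO history, max 5
        self.heat = min(1.0, self.heat + 0.1)   # heat capped at 1.0
        # Blending the embedding with the new query makes it drift,
        # so the cluster covers a wider semantic neighborhood.
        self.embedding = [(e + q) / 2
                          for e, q in zip(self.embedding, query_emb)]


def route(query_emb: list[float],
          clusters: list[KnowledgeCluster]) -> KnowledgeCluster | None:
    # Hit: return the best cluster if it clears the 0.85 cosine gate;
    # miss: return None, and the full pipeline builds a new cluster.
    best = max(clusters, key=lambda c: cosine(query_emb, c.embedding),
               default=None)
    if best is not None and cosine(query_emb, best.embedding) >= 0.85:
        return best
    return None


cluster = KnowledgeCluster(embedding=[0.9, 0.1, 0.4])
hit = route([0.88, 0.12, 0.42], [cluster])
if hit is not None:
    hit.register_hit("how does auth work?", [0.88, 0.12, 0.42])
    print(hit.heat, list(hit.queries))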

Installation

# Create a virtual environment (recommended)
conda create -n sirchmunk python=3.13 -y && conda activate sirchmunk

# Install from PyPI
pip install sirchmunk

# Optional extras
pip install "sirchmunk[web]"   # Web UI
pip install "sirchmunk[mcp]"   # MCP support
pip install "sirchmunk[all]"   # All features

Requirements: Python 3.10+, an OpenAI‑compatible LLM API key, and optionally Node.js 18+ for the Web UI.

CLI usage

# Search the current directory
sirchmunk search "How does authentication work?"

# Search specific paths
sirchmunk search "find all API endpoints" ./src ./docs

# Filename‑only mode (no LLM)
sirchmunk search "config" --mode FILENAME_ONLY

# Output JSON
sirchmunk search "database schema" --output json

Python SDK usage

import asyncio
from sirchmunk import AgenticSearch
from sirchmunk.llm import OpenAIChat

llm = OpenAIChat(
    api_key="your-api-key",
    base_url="your-base-url",
    model="your-model-name",
)

async def main():
    searcher = AgenticSearch(llm=llm)
    result = await searcher.search(
        query="How does transformer attention work?",
        paths=["/path/to/documents"],
    )
    print(result)

asyncio.run(main())

MCP integration (Claude Desktop / Cursor IDE)

{
  "mcpServers": {
    "sirchmunk": {
      "command": "sirchmunk",
      "args": ["mcp", "serve"],
      "env": {
        "SIRCHMUNK_SEARCH_PATHS": "/path/to/your_docs,/another/path"
      }
    }
  }
}

Web UI

Build the frontend (requires Node.js 18+) and serve the API and UI on a single port:

# Build frontend
sirchmunk web init

# Serve API + Web UI
sirchmunk web serve

Access the UI at http://localhost:8584. The interface provides a chat view with streamed output and source citations, a knowledge page for visualizing clusters, and a monitor page for health metrics, token usage, and cluster growth curves.

Advantages

Embedding‑free design eliminates the need for vector stores and ETL pipelines, ideal for large, heterogeneous, frequently updated corpora.

Monte Carlo Evidence Sampling balances exploration and exploitation, reducing token consumption.

Self‑evolving clusters provide zero‑cost acceleration for repeated queries.

Rich integration options (MCP, REST, WebSocket, CLI, Web UI) enable seamless use in IDEs and other tools.

Lightweight storage using DuckDB + Parquet avoids external database dependencies.

Limitations

The project is at an early stage (v0.0.3) with limited community adoption (44 stars on GitHub).

Heavy reliance on LLMs; deep search modes still incur token costs.

Embedding‑free does not mean LLM‑free – keyword extraction, ranking, and evidence synthesis still call the LLM.

Stability in large‑scale production environments remains unproven.

Typical use cases

Intelligent code‑base search, especially when combined with MCP in IDEs.

Personal knowledge or document repositories with many file types and frequent updates.

Rapid prototyping when building a full RAG pipeline is undesirable.

[Figure: Sirchmunk architecture diagram]
[Figure: Monte Carlo Evidence Sampling diagram]
[Figure: Sirchmunk Home UI]