How Chocolate Factory’s Codebase AI Assistant Boosts Code Search with RAG
This article explains the design and implementation of the Codebase AI Assistant in the Chocolate Factory framework, covering its problem‑solving DSL, retrieval‑augmented generation pipeline, indexing and querying stages, prompt strategies, and code‑splitting rules that together enable efficient semantic code search.
The Chocolate Factory (CF) framework now includes a Codebase AI Assistant, an intelligent tool that lets developers query large codebases using natural language. It analyzes questions, retrieves relevant code fragments, and generates concise answers or suggested fixes.
Example Interaction
Question: How is Semantic Workflow implemented?
Answer (generated by the assistant): The workflow creates an ElasticsearchStore, extracts the last user message as the query, uses a SemanticProblemAnalyzer with an LLM provider to parse the problem, executes a semantic code search via SemanticSolutionExecutor, converts the result to a Flowable<Answer>, and returns it.
The execution flow can be visualized with PlantUML:
@startuml
start
:Create ElasticsearchStore object;
:Get last user message as query;
:Analyze problem with SemanticProblemAnalyzer;
:Execute semantic code search with SemanticSolutionExecutor;
:Convert result to Flowable<Answer>;
stop
@enduml
Design Overview
The assistant follows Domain‑Driven Design and operates in two main phases.
Problem Solving (DSL Construction): An LLM analyses the user question and produces three retrieval conditions:
englishQuery – translate non‑English input to English before searching.
originLanguageQuery – search using the original language when translation may lose meaning.
hypotheticalDocument – generate a short (5‑10 line) code snippet based on the request and search for similar code.
Retrieval‑Augmented Generation (RAG): The DSL is vectorised and used to retrieve relevant code fragments from a vector database.
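The three retrieval conditions can be pictured as a small data structure. The sketch below is illustrative only: the field names come from the list above, while the class name SemanticQuery and the helper searchTexts are assumptions, not the framework's actual types.

```kotlin
// Hypothetical shape of the retrieval DSL. The field names follow the three
// conditions described in the article; the class and helper names are assumptions.
data class SemanticQuery(
    val englishQuery: String,
    val originLanguageQuery: String,
    val hypotheticalDocument: String
)

// Collect the distinct, non-blank texts that would each be embedded and searched.
fun SemanticQuery.searchTexts(): List<String> =
    listOf(englishQuery, originLanguageQuery, hypotheticalDocument)
        .map { it.trim() }
        .filter { it.isNotEmpty() }
        .distinct()
```

When the question is already in English, englishQuery and originLanguageQuery coincide, so removing duplicates avoids running the same search twice.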
Indexing Stage
During indexing the framework:
Splits source files into small chunks (≈1500 characters or 40 lines).
Vectorises each chunk with a local SentenceTransformer model (~22 MB).
Stores the vectors in a fast similarity‑search vector database.
The lightweight model runs on CPU, allowing indexing to be integrated into CI/CD pipelines.
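To make the indexing steps concrete, here is a minimal in-memory sketch. It rests on loud assumptions: toyEmbed is a hashed bag-of-words stand-in for the SentenceTransformer model, InMemoryStore and Match are hypothetical names, and the real framework stores vectors in a vector database instead of a list. The findRelevant parameters are read here as a result limit and a minimum score, which is also an assumption.

```kotlin
import kotlin.math.sqrt

// Toy stand-in for the embedding model: hashes each word into a fixed-size
// bag-of-words vector. It only captures shared vocabulary, not meaning.
fun toyEmbed(text: String, dims: Int = 64): DoubleArray {
    val vector = DoubleArray(dims)
    Regex("[A-Za-z0-9]+").findAll(text.lowercase()).forEach { match ->
        val h = match.value.hashCode()
        vector[((h % dims) + dims) % dims] += 1.0
    }
    return vector
}

// Cosine similarity; returns 0.0 when either vector is all zeros.
fun cosine(a: DoubleArray, b: DoubleArray): Double {
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return if (normA == 0.0 || normB == 0.0) 0.0 else dot / (sqrt(normA) * sqrt(normB))
}

data class Match(val score: Double, val text: String)

// Hypothetical in-memory store used only for illustration.
class InMemoryStore(private val embed: (String) -> DoubleArray = { toyEmbed(it) }) {
    private val entries = mutableListOf<Pair<DoubleArray, String>>()

    // Index one chunk: embed it and keep the vector next to the text.
    fun add(chunk: String) {
        entries.add(embed(chunk) to chunk)
    }

    // Return up to maxResults chunks scoring at least minScore, best first.
    fun findRelevant(query: String, maxResults: Int, minScore: Double): List<Match> {
        val queryVector = embed(query)
        return entries
            .map { (vector, text) -> Match(cosine(queryVector, vector), text) }
            .filter { it.score >= minScore }
            .sortedByDescending { it.score }
            .take(maxResults)
    }
}
```

Swapping the embed function for a real model would leave the store logic unchanged, which is why it is passed in as a parameter.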
Querying Stage
In the querying stage the DSL generated in phase 1 is vectorised and three candidate lists are retrieved:
// English‑based relevant code list
val list = store.findRelevant(query, 15, 0.6)
// Original‑language relevant code list
val originLangList = store.findRelevant(originQuery, 15, 0.6)
// Hypothetical document list
val hydeDocs = store.findRelevant(hypotheticalDocument, 15, 0.6)
The results are then re‑ranked by relevance score; each entry carries its score (e.g., 0.78) and the canonical name of the class or method it came from:
0.7847863 // canonicalName: cc.unitmesh.cf.domains.semantic.CodeSemanticWorkflowTest
0.76635444 // canonicalName: cc.unitmesh.cf.domains.semantic.CodeSemanticDecl
...
Prompt Strategy 3 – Code Splitting
CF adopts the ArchGuard Scanner approach for chunking, aiming for about 1500 characters (≈40 lines) per chunk while preserving semantic coherence.
Average token‑to‑character ratio ≈ 1:5 (≈300 tokens per chunk).
Chunk size target: 1500 characters ≈ 40 lines ≈ a small‑to‑medium function or class.
Overlap between consecutive chunks: 15 lines (configurable).
CodeSplitter Implementation
class CodeSplitter(
private val comment: String = "//",
private val chunkLines: Int = 40,
private val maxChars: Int = 1500,
private val chunkLinesOverlap: Int = 15
) {
// split(source: CodeDataStruct): List<Document> implementation omitted for brevity
}
The split function returns a list of Document objects. If the source is no longer than maxChars, it returns a single document; otherwise it divides the code into chunks of chunkLines lines and truncates any chunk that exceeds maxChars. Consecutive chunks overlap by chunkLinesOverlap lines, which preserves context across chunk boundaries.
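The splitting rule described above can be sketched on plain text as follows. This is not the framework's implementation: the real CodeSplitter takes a CodeDataStruct and returns Document objects, neither of which is modelled here, and the function name splitLines is an assumption. The defaults mirror the constructor parameters shown earlier.

```kotlin
// Sketch of the chunking rule on a plain String (an assumption made for
// illustration; the real splitter works on CodeDataStruct and Document).
fun splitLines(
    source: String,
    chunkLines: Int = 40,
    maxChars: Int = 1500,
    chunkLinesOverlap: Int = 15
): List<String> {
    require(chunkLines > chunkLinesOverlap) { "overlap must be smaller than the chunk size" }
    // Short sources are returned as a single chunk.
    if (source.length <= maxChars) return listOf(source)

    val lines = source.lines()
    val step = chunkLines - chunkLinesOverlap // how far the window advances each time
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < lines.size) {
        val end = minOf(start + chunkLines, lines.size)
        // Truncate a chunk whose lines add up to more than maxChars.
        chunks.add(lines.subList(start, end).joinToString("\n").take(maxChars))
        if (end == lines.size) break
        start += step
    }
    return chunks
}
```

With the defaults, each window advances by 25 lines, so the last 15 lines of one chunk reappear at the start of the next.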
Repository and Documentation
Source code: https://github.com/unit-mesh/chocolate-factory
Documentation site: https://framework.unitmesh.cc/
phodal
A prolific open-source contributor who constantly starts new projects. Passionate about sharing software development insights to help developers improve their KPIs. Currently active in IDEs, graphics engines, and compiler technologies.
