How Chocolate Factory’s Codebase AI Assistant Boosts Code Search with RAG

This article explains the design and implementation of the Codebase AI Assistant in the Chocolate Factory framework, covering its problem‑solving DSL, retrieval‑augmented generation pipeline, indexing and querying stages, prompt strategies, and code‑splitting rules that together enable efficient semantic code search.

phodal

Sep 17, 2023

The Chocolate Factory (CF) framework now includes a Codebase AI Assistant, an intelligent tool that lets developers query large codebases using natural language. It analyzes questions, retrieves relevant code fragments, and generates concise answers or suggested fixes.

Example Interaction

Question: How is Semantic Workflow implemented?

Answer (generated by the assistant): The workflow creates an ElasticsearchStore, extracts the last user message as the query, uses a SemanticProblemAnalyzer with an LLM provider to parse the problem, executes a semantic code search via SemanticSolutionExecutor, converts the result to a Flowable<Answer>, and returns it.

The execution flow can be visualized with PlantUML:

@startuml
start
:Create ElasticsearchStore object;
:Get last user message as query;
:Analyze problem with SemanticProblemAnalyzer;
:Execute semantic code search with SemanticSolutionExecutor;
:Convert result to Flowable<Answer>;
stop
@enduml
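
The same flow can be sketched in Kotlin. This is a minimal illustration, not CF's actual code: `Message`, `SearchPlan`, `ProblemAnalyzer`, `SolutionExecutor`, and `SemanticWorkflowSketch` are hypothetical stand-ins, the store is assumed to live inside the executor, and `Sequence<Answer>` replaces `Flowable<Answer>` so the sketch has no RxJava dependency.

```kotlin
// Hypothetical stand-ins; the real CF classes have richer signatures.
data class Message(val role: String, val content: String)
data class Answer(val content: String)
data class SearchPlan(val englishQuery: String)

fun interface ProblemAnalyzer { fun analyze(question: String): SearchPlan }
fun interface SolutionExecutor { fun execute(plan: SearchPlan): String }

class SemanticWorkflowSketch(
    private val analyzer: ProblemAnalyzer,
    private val executor: SolutionExecutor,
) {
    // Sequence<Answer> stands in for RxJava's Flowable<Answer>.
    fun execute(messages: List<Message>): Sequence<Answer> {
        // 1. Take the last user message as the query.
        val question = messages.last { it.role == "user" }.content
        // 2. Let the analyzer (LLM-backed in CF) turn it into a search plan.
        val plan = analyzer.analyze(question)
        // 3. Run the semantic code search and wrap the result as a stream.
        return sequenceOf(Answer(executor.execute(plan)))
    }
}
```

Passing the analyzer and executor in as interfaces keeps the workflow testable without a live LLM or search backend.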

Design Overview

The assistant follows Domain‑Driven Design and operates in two main phases.

Problem Solving (DSL Construction): An LLM analyses the user question and produces three retrieval conditions:

englishQuery – translate non‑English input to English before searching.

originLanguageQuery – search using the original language when translation may lose meaning.

hypotheticalDocument – generate a short (5‑10 line) code snippet based on the request and search for similar code.

Retrieval‑Augmented Generation (RAG): The DSL is vectorised and used to retrieve relevant code fragments from a vector database.
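
One way to picture the DSL is as a small data class carrying the three conditions. The field names come from the article; the class name `SemanticQueryDsl` and the `searchTexts` helper are assumptions made for illustration, not CF's API.

```kotlin
// Hypothetical shape of the retrieval DSL; field names follow the article.
data class SemanticQueryDsl(
    val englishQuery: String,
    val originLanguageQuery: String,
    val hypotheticalDocument: String,
) {
    // Distinct, non-blank search texts, in the order they would be embedded.
    fun searchTexts(): List<String> =
        listOf(englishQuery, originLanguageQuery, hypotheticalDocument)
            .map { it.trim() }
            .filter { it.isNotEmpty() }
            .distinct()
}
```

Dropping duplicates is a design choice of this sketch: when the question is already in English, the first two conditions coincide and one search can be saved.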

Indexing Stage

During indexing the framework:

Splits source files into small chunks (≈1500 characters or 40 lines).

Vectorises each chunk with a local SentenceTransformer model (~22 MB).

Stores the vectors in a fast similarity‑search vector database.

The lightweight model runs on CPU, allowing indexing to be integrated into CI/CD pipelines.
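
The three indexing steps can be sketched as follows. This is an illustration under loud assumptions: `toyEmbed` is a hashed bag-of-words vector that only stands in for the SentenceTransformer model (it is not a semantic embedding), `InMemoryIndex` stands in for the vector database, and the chunking is a plain 40-line window without overlap.

```kotlin
import kotlin.math.sqrt

// Toy embedding: hashed bag-of-words, L2-normalised. A stand-in for the
// SentenceTransformer model, NOT a semantic embedding.
fun toyEmbed(text: String, dim: Int = 64): DoubleArray {
    val vec = DoubleArray(dim)
    Regex("[A-Za-z0-9_]+").findAll(text.lowercase()).forEach { m ->
        vec[Math.floorMod(m.value.hashCode(), dim)] += 1.0
    }
    val norm = sqrt(vec.sumOf { it * it })
    if (norm > 0.0) for (i in vec.indices) vec[i] /= norm
    return vec
}

data class IndexedChunk(val source: String, val text: String, val vector: DoubleArray)

// Hypothetical in-memory index; CF stores vectors in a real vector database.
class InMemoryIndex {
    private val chunks = mutableListOf<IndexedChunk>()
    val size: Int get() = chunks.size

    // Split a file into fixed-size line windows, embed each, and store it.
    fun indexFile(path: String, content: String, chunkLines: Int = 40) {
        content.lines().chunked(chunkLines).forEach { lines ->
            val text = lines.joinToString("\n")
            if (text.isNotBlank()) chunks += IndexedChunk(path, text, toyEmbed(text))
        }
    }

    fun all(): List<IndexedChunk> = chunks.toList()
}
```

Because nothing here needs a GPU or a network call, a loop over the repository's files calling `indexFile` could run as an ordinary CI step.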

Querying Stage

In the querying stage the DSL generated in phase 1 is vectorised and three candidate lists are retrieved:

// English‑based relevant code list
val list = store.findRelevant(query, 15, 0.6)
// Original‑language relevant code list
val originLangList = store.findRelevant(originQuery, 15, 0.6)
// Hypothetical document list
val hydeDocs = store.findRelevant(hypotheticalDocument, 15, 0.6)
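
The article does not spell out what the two numeric arguments mean. A plausible reading, assumed here, is that 15 is the maximum number of results and 0.6 is a minimum similarity score. The sketch below implements that reading with cosine similarity over plain arrays; `Scored`, `cosine`, and this standalone `findRelevant` are hypothetical and not the store's real implementation.

```kotlin
import kotlin.math.sqrt

data class Scored(val id: String, val score: Double)

// Cosine similarity between two vectors of equal dimension.
fun cosine(a: DoubleArray, b: DoubleArray): Double {
    require(a.size == b.size) { "vectors must have the same dimension" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return if (normA == 0.0 || normB == 0.0) 0.0 else dot / (sqrt(normA) * sqrt(normB))
}

// Assumed semantics of store.findRelevant(query, maxResults, minScore):
// keep hits scoring at least minScore, best first, at most maxResults.
fun findRelevant(
    query: DoubleArray,
    stored: Map<String, DoubleArray>,
    maxResults: Int = 15,
    minScore: Double = 0.6,
): List<Scored> =
    stored.map { (id, vec) -> Scored(id, cosine(query, vec)) }
        .filter { it.score >= minScore }
        .sortedByDescending { it.score }
        .take(maxResults)
```

A real vector database would use an approximate nearest-neighbour index rather than this linear scan, and may normalise scores differently.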

Results are re‑ranked by relevance score; each entry pairs its score (e.g., 0.78) with the canonical name of the matching class or method:

0.7847863 // canonicalName: cc.unitmesh.cf.domains.semantic.CodeSemanticWorkflowTest
0.76635444 // canonicalName: cc.unitmesh.cf.domains.semantic.CodeSemanticDecl
...
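
A minimal sketch of how the three candidate lists might be combined before re-ranking. The article only states that results are re-ranked by score; merging the lists and keeping the best score per canonical name are assumptions of this example, as are the names `Candidate` and `mergeAndRerank`.

```kotlin
data class Candidate(val canonicalName: String, val score: Double)

// Merge several candidate lists, keep the best score per canonical name,
// and order the survivors from most to least relevant.
fun mergeAndRerank(vararg lists: List<Candidate>, limit: Int = 15): List<Candidate> =
    lists.toList()
        .flatten()
        .groupBy { it.canonicalName }
        .map { (_, hits) -> hits.maxByOrNull { it.score }!! } // groups are never empty
        .sortedByDescending { it.score }
        .take(limit)
```

Deduplicating by canonical name keeps the same class from filling several slots of the prompt when it is found by more than one query.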

Prompt Strategy 3 – Code Splitting

CF adopts the ArchGuard Scanner approach for chunking, aiming for about 1500 characters (≈40 lines) per chunk while preserving semantic coherence.

Average token‑to‑character ratio ≈ 1:5 (≈300 tokens per chunk).

Chunk size target: 1500 characters ≈ 40 lines ≈ a small‑to‑medium function or class.

Overlap between consecutive chunks: 15 lines (configurable).
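
The arithmetic behind these figures can be written down directly. The helpers below are illustrative only: `estimateTokens` applies the article's average ratio of 5 characters per token, and `chunkCount` assumes each new chunk advances by `chunkLines - overlap` lines (25 with the defaults).

```kotlin
// Rough sizing helpers derived from the figures above; the 5 chars/token
// ratio is the article's average, not a tokenizer guarantee.
fun estimateTokens(chars: Int, charsPerToken: Int = 5): Int = chars / charsPerToken

// Number of overlapping windows needed to cover totalLines lines.
fun chunkCount(totalLines: Int, chunkLines: Int = 40, overlap: Int = 15): Int {
    require(overlap in 0 until chunkLines) { "overlap must be smaller than chunkLines" }
    if (totalLines <= chunkLines) return if (totalLines > 0) 1 else 0
    val step = chunkLines - overlap // new lines contributed by each later chunk
    return 1 + (totalLines - chunkLines + step - 1) / step // ceiling division
}
```

With the defaults, a 1500-character chunk comes to roughly 300 tokens, and a 100-line file needs four chunks starting at lines 0, 25, 50, and 75.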

CodeSplitter Implementation

class CodeSplitter(
    private val comment: String = "//",
    private val chunkLines: Int = 40,
    private val maxChars: Int = 1500,
    private val chunkLinesOverlap: Int = 15
) {
    // split(source: CodeDataStruct): List<Document> implementation omitted for brevity
}

The split function returns a list of Document objects. If the source code length ≤ maxChars, a single document is returned; otherwise the code is divided into chunks of chunkLines lines, truncating any chunk that exceeds maxChars. Overlap helps preserve context across chunks.
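
The behaviour described above can be sketched on plain text. This is not CF's implementation: the hypothetical `splitCode` takes a raw `String` and returns strings, whereas the real `split` takes a `CodeDataStruct` and returns `Document` objects, and the exact window-advance rule is an assumption.

```kotlin
// Sketch only: works on raw text and returns strings, whereas CF's
// CodeSplitter takes a CodeDataStruct and returns Document objects.
fun splitCode(
    source: String,
    chunkLines: Int = 40,
    maxChars: Int = 1500,
    chunkLinesOverlap: Int = 15,
): List<String> {
    require(chunkLinesOverlap in 0 until chunkLines) { "overlap must be smaller than chunkLines" }
    // Short sources are returned as a single chunk.
    if (source.length <= maxChars) return listOf(source)

    val lines = source.lines()
    val step = chunkLines - chunkLinesOverlap
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < lines.size) {
        val end = minOf(start + chunkLines, lines.size)
        // Join the window and truncate it if it is still too long.
        chunks += lines.subList(start, end).joinToString("\n").take(maxChars)
        if (end == lines.size) break // last window reached the end of the file
        start += step
    }
    return chunks
}
```

Truncating by characters can cut a line in half; a production splitter would more likely cut at a line or syntax boundary, which is the kind of semantic coherence the ArchGuard-based approach aims for.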

Repository and Documentation

Source code: https://github.com/unit-mesh/chocolate-factory

Documentation site: https://framework.unitmesh.cc/
