Building an Internal Code Knowledge Base with Embedding and AST Interpreter
The author builds an internal code knowledge base for the TDesign Vue-Next library by scraping its documentation, chunking and embedding the text with OpenAI's text-embedding-ada-002 model into a vector store, and retrieving relevant chunks at query time to ground LLM answers. Context continuity across chunks is preserved with a JavaScript AST interpreter, yielding up to 90% query accuracy and roughly a 20% productivity gain.
ChatGPT's September 2021 knowledge cutoff leaves it unable to answer questions about newer technologies such as the TDesign component library. To work around this limitation, the author experiments with building an internal knowledge base that supports natural-language queries over code documentation.
The article compares two main approaches: (1) fine‑tuning a large language model (LLM) with domain data, which requires heavy GPU resources and long debugging cycles; and (2) using embedding techniques to convert documents into vectors stored in a vector database, then retrieving relevant chunks during query time. The author chooses the embedding approach.
Knowledge-base construction involves three steps, followed by a retrieval stage:
Data format – the author adopts a JSON schema inspired by the MrRanedeer‑AI‑Tutor project to describe component usage scenarios and associated code.
Data preparation – source data is scraped from the TDesign Vue‑Next documentation (compatible with Vue 3) and cleaned.
Data vectorization – LangChain's RecursiveCharacterTextSplitter is used to split texts into chunks, which are then embedded with OpenAI's text-embedding-ada-002 model and stored in a vector store.
Retrieval and generation – When a user asks a question, the system performs a similarity search in the vector store, retrieves the top‑K relevant chunks, and feeds them together with a system prompt to the LLM to generate an answer.
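The chunk → embed → retrieve flow described above can be sketched in a few lines of TypeScript. Note the stand-ins: `embed` here just hashes characters into a small vector so the example runs offline (a real system would call text-embedding-ada-002), and `splitText` is a simplified substitute for LangChain's RecursiveCharacterTextSplitter; the helper names are illustrative, not the author's code.

```typescript
// Stand-in for a real embedding call: hash characters into a fixed-size vector.
function embed(text: string, dims = 8): number[] {
  const v = new Array(dims).fill(0);
  for (let i = 0; i < text.length; i++) {
    v[text.charCodeAt(i) % dims] += 1;
  }
  return v;
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Simplified splitter: fixed-size chunks with overlap, so that context
// straddling a boundary appears in two neighboring chunks.
function splitText(text: string, chunkSize = 200, overlap = 20): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

interface StoredChunk { text: string; vector: number[]; }

// "Vector store": every chunk of every document, paired with its embedding.
function buildStore(docs: string[]): StoredChunk[] {
  return docs
    .flatMap(d => splitText(d))
    .map(text => ({ text, vector: embed(text) }));
}

// Top-K retrieval: rank all stored chunks by similarity to the query.
function topK(store: StoredChunk[], query: string, k = 3): StoredChunk[] {
  const q = embed(query);
  return [...store]
    .sort((a, b) => cosine(b.vector, q) - cosine(a.vector, q))
    .slice(0, k);
}
```

At query time, the `topK` results would be concatenated with a system prompt and sent to the LLM, exactly as the retrieval-and-generation step describes.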
The author also addresses a common problem: chunking can break semantic continuity, causing loss of context (e.g., the "Xiao Ming's self-introduction" example). To mitigate this, the author proposes using a JavaScript AST interpreter to record the scope of each chunk, enabling reconstruction of hierarchical context.
Example of the scope‑tracking code snippet:
```
>>> startScopeStr: "Xiao Ming's self-introduction" <<<, my hobbies are football and drawing >>> endScopeStr: "" <<<
```

Results – The embedding solution achieved a query accuracy of up to 90% and improved developer efficiency by about 20%. After the AST-based optimizations, the correct-answer rate rose to 83.3% (25/30) with a usable rate of 90% (27/30), reducing bad cases from 7 to 3.
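The marker format in the snippet above can be reconstructed as a small helper: each chunk is wrapped with the name of the scope that opens inside it (`startScopeStr`) and the one that closes inside it (`endScopeStr`), so a retriever can rebuild hierarchical context after splitting. The article shows only the marker format; `annotateChunk` and the `ScopedChunk` shape below are a hypothetical illustration, not the author's actual AST-interpreter implementation.

```typescript
// Hypothetical shape for a chunk whose enclosing scopes were recorded
// (in the article, by walking the JavaScript AST) before splitting.
interface ScopedChunk {
  startScopeStr: string; // scope entered within this chunk ("" if none)
  endScopeStr: string;   // scope exited within this chunk ("" if none)
  body: string;          // the raw chunk text
}

// Wrap a chunk with the article's >>> ... <<< scope markers so the
// scope boundaries survive embedding and retrieval.
function annotateChunk(c: ScopedChunk): string {
  return `>>> startScopeStr: "${c.startScopeStr}" <<<` +
         c.body +
         `>>> endScopeStr: "${c.endScopeStr}" <<<`;
}
```

With this, the snippet above corresponds to a chunk that opens the "Xiao Ming's self-introduction" scope but does not close any scope, which is how a later retrieval step would know the hobby sentence still belongs to the introduction.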
Future considerations include improving data quality, standardizing evaluation metrics for embeddings, handling multi‑dimensional and long‑form knowledge, maintaining model performance as the knowledge base grows, and ensuring data security.
The article concludes with a call for developers to try building their own knowledge bases and a reminder to avoid using sensitive data.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.