CR Copilot: An Open‑Source LLM‑Based Code Review Assistant with Private Knowledge Base
This article describes the design and implementation of a code‑review assistant powered by open‑source large language models and a privately hosted knowledge base, covering background, pain points, system architecture, model selection, vector‑store integration, prompt engineering, diff parsing, and practical reflections.
Background
The idea originated during a Code Review in which the author asked Claude which of two code styles was more elegant, which sparked the question of whether AI could assist with Code Review.
Pain Points
Information security compliance: sending internal code directly to ChatGPT/Claude raises security concerns, and desensitizing code first is time‑consuming.
Low‑quality code consumes review time: with a daily MR volume of 10–20, reviewers must still check logic and business context manually; automating part of the review can greatly improve efficiency.
Team Code Review standards lack enforcement: most teams only have documented guidelines that no tool strictly enforces.
Introduction
In one sentence: a Code Review practice based on open‑source LLMs + knowledge base, acting as a “CR Copilot”.
Features
Compliant with company security policies – all code data stays within the internal network and inference runs on‑premise.
🌈 Plug‑and‑play: integrated via GitLab CI with only a few lines of configuration.
🔒 Data security: private deployment of open‑source LLMs, isolated from external networks.
♾ No call limits: the only cost is GPU rental on the internal platform.
📚 Custom knowledge base: uses Feishu documents as context, improving review accuracy and aligning with team standards.
🎯 Comments attached to changed lines: results are posted as line‑specific comments via GitLab CI.
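The "few lines of configuration" might look like the following GitLab CI job. This is a sketch only: the job name, image, and CLI command are illustrative assumptions; the article confirms only that the reviewer is triggered from a merge‑request pipeline.

```yaml
# .gitlab-ci.yml — hypothetical CR Copilot job (names are assumptions)
cr-copilot:
  stage: test
  image: registry.example.internal/cr-copilot:latest  # assumed internal image
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    # assumed CLI; it would fetch the MR diff, query the LLM, and post comments
    - cr-copilot review --project "$CI_PROJECT_ID" --mr "$CI_MERGE_REQUEST_IID"
```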
Glossary
| Term | Definition |
| --- | --- |
| CR / Code Review | A process to ensure code quality and promote knowledge sharing among developers. |
| LLM / Large Language Model | Neural network models trained on massive text corpora (e.g., GPT, BERT), capable of generating and understanding natural language. |
| AIGC | Artificial‑Intelligence‑Generated Content, covering text, images, video, etc. |
| LLaMA | Meta's open‑source large language model family. |
| ChatGLM | An open‑source bilingual dialogue model based on GLM. |
| Baichuan | Baichuan 2, an open‑source LLM trained on 2.6 trillion tokens. |
| Prompt | Text that guides a model to produce a desired output. |
| LangChain | A Python library for building LLM‑centric applications. |
| Embedding | A mapping from text to a fixed‑dimensional vector for semantic similarity. |
| Vector Database | A database storing vector embeddings for similarity search (e.g., Milvus, Qdrant). |
| Similarity Search | Finding the vectors closest to a query vector. |
| In‑Context Learning | Providing task‑specific information in the prompt without fine‑tuning. |
| Finetune | Training a pretrained model further on domain‑specific data. |
Implementation Approach
Workflow Diagram
System Architecture
To complete a CR cycle, the following technical modules are required:
LLM Selection
The core of CR Copilot is the large language model. The chosen model must:
Understand code.
Have good Chinese support.
Possess strong in‑context learning ability.
Evaluation based on FlagEval’s August leaderboard shows the following candidates:
The model size suffix -{n}b means n billion parameters (e.g., 13b = 13 billion).
Initial candidates were Llama2‑Chinese‑13b‑Chat , chatglm2‑6b , and Baichuan2‑13B‑Chat . Subjective testing favored Llama2 for code‑review tasks, while chatglm2 excelled in Chinese AIGC.
Due to compliance requirements, the default model is ChatGLM2‑6B; using Llama2 requires submitting a download request to Meta.
chatglm2‑6b (default)
Llama2‑Chinese‑13b‑Chat (recommended)
Baichuan2‑13B‑Chat
Knowledge‑Base Design
Why a Knowledge Base?
Base models only contain public internet data and lack internal framework knowledge. A knowledge base lets the model understand company‑specific concepts (e.g., the “Lynx” framework).
Finding Relevant Knowledge
Three steps are used:
Text Embeddings
Vector Stores
Similarity Search
Text Embeddings
Semantic matching replaces keyword‑based fuzzy search. The team uses the Chinese model bge-large-zh deployed privately; each embedding takes milliseconds.
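The idea can be illustrated with toy vectors: semantically similar texts map to nearby vectors, and cosine similarity scores that closeness. This is a minimal sketch only; real bge-large-zh embeddings are 1024‑dimensional, and the numbers below are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (illustrative values, not model output).
query = [0.9, 0.1, 0.0]
doc_about_react = [0.8, 0.2, 0.1]    # semantically close to the query
doc_about_cooking = [0.0, 0.1, 0.9]  # unrelated topic

# The related document scores higher than the unrelated one.
assert cosine_similarity(query, doc_about_react) > cosine_similarity(query, doc_about_cooking)
```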
Vector Stores
Embeddings are stored in a vector database; the chosen store is Qdrant (Rust implementation, fast).
| Vector DB | URL | GitHub Stars | Language | Cloud |
| --- | --- | --- | --- | --- |
| chroma | https://github.com/chroma-core/chroma | 8.5K | Python | ❌ |
| milvus | https://github.com/milvus-io/milvus | 22.8K | Go/Python/C++ | ✅ |
| pinecone | https://www.pinecone.io/ | ❌ | ❌ | ✅ |
| qdrant | https://github.com/qdrant/qdrant | 12.7K | Rust | ✅ |
| typesense | https://github.com/typesense/typesense | 14.4K | C++ | ❌ |
| weaviate | https://github.com/weaviate/weaviate | 7.4K | Go | ✅ |
Similarity Search
Similarity is determined by vector distance; matching the query vector against stored vectors yields the most relevant knowledge.
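The search step can be sketched as a brute-force scan over the stored vectors. In production Qdrant's index does this efficiently; the snippets and vectors below are made up for illustration.

```python
import math

def top_k(query, store, k=2):
    """Brute-force nearest-neighbor search by cosine similarity.

    `store` maps a knowledge snippet to its (hypothetical) embedding;
    a real deployment delegates this scan to Qdrant's ANN index.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    # Rank all stored snippets by similarity to the query vector.
    ranked = sorted(store.items(), key=lambda kv: cos(query, kv[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Illustrative knowledge snippets with toy embeddings.
store = {
    "Lynx framework routing conventions": [0.9, 0.1, 0.0],
    "Go error-handling guidelines": [0.1, 0.9, 0.1],
    "React Hooks guide": [0.8, 0.3, 0.1],
}
print(top_k([0.85, 0.2, 0.05], store, k=2))
```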
Loading the Knowledge Base
Two knowledge bases are maintained:
Built‑in official docs (React, TypeScript, Rspack, Garfish, internal Go/Python/Rust guidelines).
Custom Feishu docs loaded via LangChain’s LarkSuite loader, split into chunks using chunk_size and chunk_overlap parameters.
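The chunking step can be sketched as a sliding window. The `chunk_size` and `chunk_overlap` names below mirror the parameters LangChain's text splitters expose, though this simplified version cuts on raw characters only.

```python
def split_text(text, chunk_size=200, chunk_overlap=50):
    """Sliding-window splitter: consecutive chunks share `chunk_overlap`
    characters so knowledge spanning a chunk boundary is not lost.
    (Simplified sketch; LangChain's splitters also respect separators.)"""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 450 characters -> windows starting at 0, 150, 300.
chunks = split_text("a" * 450, chunk_size=200, chunk_overlap=50)
print(len(chunks), [len(c) for c in chunks])
```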
Prompt Design
Because LLMs need clear instructions, the following prompts are used.
Code Summary Prompt

````python
prefix = "user: " if model == "chatglm2" else "\nHuman: "
suffix = (
    "assistant(用中文): let's think step by step."
    if model == "chatglm2"
    else "\nAssistant(用中文): let's think step by step."
)
# The Chinese instruction asks the model to list the tool libraries and
# module packages the code uses: items must be specific, non-duplicated,
# closely related to the code, between 3 and 6 in number, listed without
# explanations, and output in Chinese.
return f"""{prefix}根据这段 {language} 代码,列出关于这段 {language} 代码用到的工具库、模块包。\n{language} 代码:\n```{language}\n{source_code}\n```\n请注意:\n- 知识列表中的每一项都不要有类似或者重复的内容\n- 列出的内容要和代码密切相关\n- 最少列出 3 个, 最多不要超过 6 个\n- 知识列表中的每一项要具体\n- 列出列表,不要对工具库、模块做解释\n- 输出中文\n{suffix}"""
````

Variables:
language: file language (TypeScript, Python, Rust, Go, etc.)
source_code: full content of the changed file
CR Prompt

Two variants are used: one for Llama2 (English instructions, Chinese output) and one for ChatGLM2 (Chinese instructions, Chinese output):

```python
# llama2 (English instructions, Chinese output)
f"""Human: please briefly review the {language} code changes using the provided context, and give short code review feedback and suggestions; point out any bug risks and improvements (no more than six).\n{context}\n\n{diff_code}\n\nAssistant: """

# chatglm2 (Chinese instructions, Chinese output). The instruction says:
# "briefly review the {language} code changes using the provided context;
# point out any bugs and improvements (no more than 6)".
# 【已知信息】 = known context, 【变更代码】 = changed code.
f"""user: 【指令】请根据所提供的上下文信息来简要审查{language} 变更代码,进行简短的代码审查和建议,变更代码有任何 bug 缺陷和改进建议请指出(不超过 6 条)。\n【已知信息】:{context}\n\n【变更代码】:{diff_code}\n\nassistant: """
```

Commenting on Changed Lines
A diff‑parsing function extracts changed line numbers so that the assistant can post line‑specific comments. The implementation is shown below:
```python
import re

def parse_diff(input):
    """Parse a unified diff string into files, chunks, and line-level changes."""
    if not input:
        return []
    if not isinstance(input, str) or re.match(r"^\s+$", input):
        return []
    lines = input.split("\n")
    if not lines:
        return []
    result = []
    current_file = None
    current_chunk = None
    deleted_line_counter = 0
    added_line_counter = 0
    current_file_changes = None

    def normal(line):
        nonlocal deleted_line_counter, added_line_counter
        current_chunk["changes"].append({
            "type": "normal",
            "normal": True,
            "ln1": deleted_line_counter,
            "ln2": added_line_counter,
            "content": line,
        })
        deleted_line_counter += 1
        added_line_counter += 1
        current_file_changes["old_lines"] -= 1
        current_file_changes["new_lines"] -= 1

    def start(line):
        nonlocal current_file
        current_file = {"chunks": [], "deletions": 0, "additions": 0}
        result.append(current_file)

    def to_num_of_lines(number):
        return int(number) if number else 1

    def chunk(line, match):
        nonlocal current_file, current_chunk, deleted_line_counter, added_line_counter, current_file_changes
        if not current_file:
            start(line)
        old_start, old_num_lines, new_start, new_num_lines = match.groups()
        deleted_line_counter = int(old_start)
        added_line_counter = int(new_start)
        current_chunk = {
            "content": line,
            "changes": [],
            "old_start": int(old_start),
            "old_lines": to_num_of_lines(old_num_lines),
            "new_start": int(new_start),
            "new_lines": to_num_of_lines(new_num_lines),
        }
        current_file_changes = {
            "old_lines": to_num_of_lines(old_num_lines),
            "new_lines": to_num_of_lines(new_num_lines),
        }
        current_file["chunks"].append(current_chunk)

    def delete(line):
        nonlocal deleted_line_counter
        if not current_chunk:
            return
        current_chunk["changes"].append({
            "type": "del",
            "del": True,
            "ln": deleted_line_counter,
            "content": line,
        })
        deleted_line_counter += 1
        current_file["deletions"] += 1
        current_file_changes["old_lines"] -= 1

    def add(line):
        nonlocal added_line_counter
        if not current_chunk:
            return
        current_chunk["changes"].append({
            "type": "add",
            "add": True,
            "ln": added_line_counter,
            "content": line,
        })
        added_line_counter += 1
        current_file["additions"] += 1
        current_file_changes["new_lines"] -= 1

    def eof(line):
        if not current_chunk:
            return
        most_recent_change = current_chunk["changes"][-1]
        # Copy only the line-number keys the previous change actually has:
        # "normal" changes carry ln1/ln2, while "add"/"del" changes carry ln.
        change = {
            "type": most_recent_change["type"],
            most_recent_change["type"]: True,
            "content": line,
        }
        for key in ("ln1", "ln2", "ln"):
            if key in most_recent_change:
                change[key] = most_recent_change[key]
        current_chunk["changes"].append(change)

    # Hunk header: "@@ -a,b +c,d @@" (the line counts b and d are optional).
    header_patterns = [
        (re.compile(r"^@@\s+-(\d+)(?:,(\d+))?\s+\+(\d+)(?:,(\d+))?\s+@@"), chunk),
    ]
    content_patterns = [
        (re.compile(r"^\\\s+No\s+newline\s+at\s+end\s+of\s+file$"), eof),
        (re.compile(r"^-"), delete),
        (re.compile(r"^\+"), add),
        (re.compile(r"^\s+"), normal),
    ]

    def parse_content_line(line):
        nonlocal current_file_changes
        for pattern, handler in content_patterns:
            if re.search(pattern, line):
                handler(line)
                break
        # Once all of a hunk's lines are consumed, expect the next header.
        if current_file_changes["old_lines"] == 0 and current_file_changes["new_lines"] == 0:
            current_file_changes = None

    def parse_header_line(line):
        for pattern, handler in header_patterns:
            match = re.search(pattern, line)
            if match:
                handler(line, match)
                break

    def parse_line(line):
        if current_file_changes:
            parse_content_line(line)
        else:
            parse_header_line(line)

    for line in lines:
        parse_line(line)
    return result
```

The bot posts comments via the GitLab API and marks them as Resolved to avoid manual cleanup.
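As a sketch of the posting step: a line-anchored MR comment goes through GitLab's discussions API (`POST /projects/:id/merge_requests/:iid/discussions`). The helper below only builds the request payload; the helper name, the comment text, and the sha placeholders are illustrative, not the article's actual code.

```python
def build_line_comment(body, new_path, new_line, base_sha, start_sha, head_sha):
    """Payload for POST /projects/:id/merge_requests/:iid/discussions,
    anchoring the comment to a changed line on the new side of the diff.
    (Hypothetical helper; field names follow the public GitLab API.)"""
    return {
        "body": body,
        "position": {
            "position_type": "text",
            "base_sha": base_sha,    # diff refs of the merge request
            "start_sha": start_sha,
            "head_sha": head_sha,
            "new_path": new_path,    # file path after the change
            "new_line": new_line,    # line number from the parsed diff
        },
    }

payload = build_line_comment(
    body="Consider adding error handling to this function",  # LLM feedback
    new_path="src/app.ts",
    new_line=42,
    base_sha="<base>", start_sha="<start>", head_sha="<head>",
)
print(payload["position"]["new_line"])
```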
Reflections
Everything is probabilistic : LLM outputs are based on the highest‑probability token sequence, even for seemingly deterministic questions.
Open‑source LLMs + domain knowledge base + private deployment is a viable enterprise practice: multiple models can be combined, internal knowledge is essential for relevance, and private deployment alleviates data‑security concerns. LLMs are often showcased as chat products, but real‑world applications require concrete scenario analysis.
AI+ is just beginning : CR Copilot is one of many LLM‑enabled engineering tools the team plans to share.
Join Us
If you are interested in LLMs, front‑end React, back‑end Golang, etc., feel free to apply: https://job.toutiao.com/s/ieD4KuyR
References
[1] Samsung leaked chip secrets after adopting ChatGPT – https://n.news.naver.com/article/243/0000042639
[2] Llama2‑Chinese‑13b‑Chat – https://huggingface.co/FlagAlpha/Llama2-Chinese-13b-Chat
[3] chatglm2‑6b – https://huggingface.co/THUDM/chatglm2-6b
[4] Baichuan2‑13B‑Chat – https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat
[5] Meta model download request – https://ai.meta.com/resources/models-and-libraries/llama-downloads/
[6] Model scores – https://opencompass.org.cn/model-compare/ChatGLM2-6B,LLaMA-2-Chinese-13B,Baichuan2-13B-Chat
[10] bge-large-zh – https://huggingface.co/BAAI/bge-large-zh
[11] LarkSuite loader – https://python.langchain.com/docs/integrations/document_loaders/larksuite
ByteFE
Cutting‑edge tech, article sharing, and practical insights from the ByteDance frontend team.