CR Copilot: An Open‑Source LLM‑Based Code Review Assistant with Private Knowledge Base
This article describes the design and implementation of a code‑review assistant powered by open‑source large language models and a privately hosted knowledge base, covering background, pain points, system architecture, model selection, vector‑store integration, prompt engineering, diff parsing, and practical reflections.
Background
The idea originated during a Code Review in which the author asked Claude which of two code styles was more elegant, which sparked the question of whether AI could assist with Code Review.
Pain Points
Information security compliance: sending internal code directly to ChatGPT/Claude raises security concerns, and desensitizing code first is time‑consuming.
Low‑quality code consumes review time: with a daily MR volume of 10–20, reviewers must still check logic and business context manually; automating part of the review can greatly improve efficiency.
Team Code Review standards lack enforcement: most teams only have documented guidelines that no tool strictly enforces.
Introduction
In one sentence: a Code Review practice based on open‑source LLMs + knowledge base, acting as a “CR Copilot”.
Features
Compliant with company security policies – all code data stays within the internal network and inference runs on‑premise.
🌈 Plug‑and‑play: integrated via GitLab CI with only a few lines of configuration.
🔒 Data security: private deployment of open‑source LLMs, isolated from external networks.
♾ No call limits: the only cost is GPU rental on the internal platform.
📚 Custom knowledge base: uses Feishu documents as context, improving review accuracy and aligning with team standards.
🎯 Comments attached to changed lines: results are posted as line‑specific comments via GitLab CI.
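The "few lines of configuration" might look like the following GitLab CI job. This is a sketch only: the job name, image, and CLI command are illustrative assumptions; the article confirms only that the reviewer is triggered from a merge‑request pipeline.

```yaml
# .gitlab-ci.yml — hypothetical CR Copilot job (names are assumptions)
cr-copilot:
  stage: test
  image: registry.example.internal/cr-copilot:latest  # assumed internal image
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    # assumed CLI; it would fetch the MR diff, query the LLM, and post comments
    - cr-copilot review --project "$CI_PROJECT_ID" --mr "$CI_MERGE_REQUEST_IID"
```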
Glossary
| Term | Definition |
| --- | --- |
| CR / Code Review | A process to ensure code quality and promote knowledge sharing among developers. |
| LLM / Large Language Model | Neural network models trained on massive text corpora (e.g., GPT, BERT), capable of generating and understanding natural language. |
| AIGC | Artificial‑Intelligence‑Generated Content, covering text, images, video, etc. |
| LLaMA | Meta's open‑source large language model family. |
| ChatGLM | An open‑source bilingual dialogue model based on GLM. |
| Baichuan | Baichuan 2, an open‑source LLM trained on 2.6 trillion tokens. |
| Prompt | Text that guides a model to produce a desired output. |
| LangChain | A Python library for building LLM‑centric applications. |
| Embedding | A mapping from text to a fixed‑dimensional vector for semantic similarity. |
| Vector Database | A database storing vector embeddings for similarity search (e.g., Milvus, Qdrant). |
| Similarity Search | Finding the vectors closest to a query vector. |
| In‑Context Learning | Providing task‑specific information in the prompt without fine‑tuning. |
| Finetune | Training a pretrained model further on domain‑specific data. |
Implementation Approach
Workflow Diagram
System Architecture
To complete a CR cycle, the following technical modules are required:
LLM Selection
The core of CR Copilot is the large language model. The chosen model must:
Understand code.
Have good Chinese support.
Possess strong in‑context learning ability.
Evaluation based on FlagEval’s August leaderboard shows the following candidates:
The model size suffix -{n}b means n billion parameters (e.g., 13b = 13 billion).
Initial candidates were Llama2‑Chinese‑13b‑Chat , chatglm2‑6b , and Baichuan2‑13B‑Chat . Subjective testing favored Llama2 for code‑review tasks, while chatglm2 excelled in Chinese AIGC.
Due to compliance requirements, the default model is ChatGLM2‑6B; using Llama2 requires submitting a download request to Meta.
chatglm2‑6b (default)
Llama2‑Chinese‑13b‑Chat (recommended)
Baichuan2‑13B‑Chat
Knowledge‑Base Design
Why a Knowledge Base?
Base models only contain public internet data and lack internal framework knowledge. A knowledge base lets the model understand company‑specific concepts (e.g., the “Lynx” framework).
Finding Relevant Knowledge
Three steps are used:
Text Embeddings
Vector Stores
Similarity Search
Text Embeddings
Semantic matching replaces keyword‑based fuzzy search. The team uses the Chinese model bge-large-zh deployed privately; each embedding takes milliseconds.
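The idea can be illustrated with toy vectors: semantically similar texts map to nearby vectors, and cosine similarity scores that closeness. This is a minimal sketch only; real bge-large-zh embeddings are 1024‑dimensional, and the numbers below are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (illustrative values, not model output).
query = [0.9, 0.1, 0.0]
doc_about_react = [0.8, 0.2, 0.1]    # semantically close to the query
doc_about_cooking = [0.0, 0.1, 0.9]  # unrelated topic

# The related document scores higher than the unrelated one.
assert cosine_similarity(query, doc_about_react) > cosine_similarity(query, doc_about_cooking)
```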
Vector Stores
Embeddings are stored in a vector database; the chosen store is Qdrant (Rust implementation, fast).
| Vector DB | URL | GitHub Stars | Language | Cloud |
| --- | --- | --- | --- | --- |
| chroma | https://github.com/chroma-core/chroma | 8.5K | Python | ❌ |
| milvus | https://github.com/milvus-io/milvus | 22.8K | Go/Python/C++ | ✅ |
| pinecone | https://www.pinecone.io/ | ❌ | ❌ | ✅ |
| qdrant | https://github.com/qdrant/qdrant | 12.7K | Rust | ✅ |
| typesense | https://github.com/typesense/typesense | 14.4K | C++ | ❌ |
| weaviate | https://github.com/weaviate/weaviate | 7.4K | Go | ✅ |
Similarity Search
Similarity is determined by vector distance; matching the query vector against stored vectors yields the most relevant knowledge.
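The search step can be sketched as a brute-force scan over the stored vectors. In production Qdrant's index does this efficiently; the snippets and vectors below are made up for illustration.

```python
import math

def top_k(query, store, k=2):
    """Brute-force nearest-neighbor search by cosine similarity.

    `store` maps a knowledge snippet to its (hypothetical) embedding;
    a real deployment delegates this scan to Qdrant's ANN index.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    # Rank all stored snippets by similarity to the query vector.
    ranked = sorted(store.items(), key=lambda kv: cos(query, kv[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Illustrative knowledge snippets with toy embeddings.
store = {
    "Lynx framework routing conventions": [0.9, 0.1, 0.0],
    "Go error-handling guidelines": [0.1, 0.9, 0.1],
    "React Hooks guide": [0.8, 0.3, 0.1],
}
print(top_k([0.85, 0.2, 0.05], store, k=2))
```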
Loading the Knowledge Base
Two knowledge bases are maintained:
Built‑in official docs (React, TypeScript, Rspack, Garfish, internal Go/Python/Rust guidelines).
Custom Feishu docs loaded via LangChain’s LarkSuite loader, split into chunks using chunk_size and chunk_overlap parameters.
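The chunking step can be sketched as a sliding window. The `chunk_size` and `chunk_overlap` names below mirror the parameters LangChain's text splitters expose, though this simplified version cuts on raw characters only.

```python
def split_text(text, chunk_size=200, chunk_overlap=50):
    """Sliding-window splitter: consecutive chunks share `chunk_overlap`
    characters so knowledge spanning a chunk boundary is not lost.
    (Simplified sketch; LangChain's splitters also respect separators.)"""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 450 characters -> windows starting at 0, 150, 300.
chunks = split_text("a" * 450, chunk_size=200, chunk_overlap=50)
print(len(chunks), [len(c) for c in chunks])
```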
Prompt Design
Because LLMs need clear instructions, the following prompts are used.
Code Summary Prompt

````python
prefix = "user: " if model == "chatglm2" else "\nHuman: "
suffix = (
    "assistant(用中文): let's think step by step."
    if model == "chatglm2"
    else "\nAssistant(用中文): let's think step by step."
)
# The Chinese instruction asks the model to list the tool libraries and
# module packages the code uses: items must be specific, non-duplicated,
# closely related to the code, between 3 and 6 in number, listed without
# explanations, and output in Chinese.
return f"""{prefix}根据这段 {language} 代码,列出关于这段 {language} 代码用到的工具库、模块包。\n{language} 代码:\n```{language}\n{source_code}\n```\n请注意:\n- 知识列表中的每一项都不要有类似或者重复的内容\n- 列出的内容要和代码密切相关\n- 最少列出 3 个, 最多不要超过 6 个\n- 知识列表中的每一项要具体\n- 列出列表,不要对工具库、模块做解释\n- 输出中文\n{suffix}"""
````

Variables:
language: file language (TypeScript, Python, Rust, Go, etc.)
source_code: full content of the changed file
CR Prompt

Two variants are used: one for Llama2 (English instructions, Chinese output) and one for ChatGLM2 (Chinese instructions, Chinese output):

```python
# llama2 (English instructions, Chinese output)
f"""Human: please briefly review the {language} code changes using the provided context, and give short code review feedback and suggestions; point out any bug risks and improvements (no more than six).\n{context}\n\n{diff_code}\n\nAssistant: """

# chatglm2 (Chinese instructions, Chinese output). The instruction says:
# "briefly review the {language} code changes using the provided context;
# point out any bugs and improvements (no more than 6)".
# 【已知信息】 = known context, 【变更代码】 = changed code.
f"""user: 【指令】请根据所提供的上下文信息来简要审查{language} 变更代码,进行简短的代码审查和建议,变更代码有任何 bug 缺陷和改进建议请指出(不超过 6 条)。\n【已知信息】:{context}\n\n【变更代码】:{diff_code}\n\nassistant: """
```

Commenting on Changed Lines
A diff‑parsing function extracts changed line numbers so that the assistant can post line‑specific comments. The implementation is shown below:
```python
import re

def parse_diff(input):
    """Parse a unified diff string into files, chunks, and line-level changes."""
    if not input:
        return []
    if not isinstance(input, str) or re.match(r"^\s+$", input):
        return []
    lines = input.split("\n")
    if not lines:
        return []
    result = []
    current_file = None
    current_chunk = None
    deleted_line_counter = 0
    added_line_counter = 0
    current_file_changes = None

    def normal(line):
        nonlocal deleted_line_counter, added_line_counter
        current_chunk["changes"].append({
            "type": "normal",
            "normal": True,
            "ln1": deleted_line_counter,
            "ln2": added_line_counter,
            "content": line,
        })
        deleted_line_counter += 1
        added_line_counter += 1
        current_file_changes["old_lines"] -= 1
        current_file_changes["new_lines"] -= 1

    def start(line):
        nonlocal current_file
        current_file = {"chunks": [], "deletions": 0, "additions": 0}
        result.append(current_file)

    def to_num_of_lines(number):
        return int(number) if number else 1

    def chunk(line, match):
        nonlocal current_file, current_chunk, deleted_line_counter, added_line_counter, current_file_changes
        if not current_file:
            start(line)
        old_start, old_num_lines, new_start, new_num_lines = match.groups()
        deleted_line_counter = int(old_start)
        added_line_counter = int(new_start)
        current_chunk = {
            "content": line,
            "changes": [],
            "old_start": int(old_start),
            "old_lines": to_num_of_lines(old_num_lines),
            "new_start": int(new_start),
            "new_lines": to_num_of_lines(new_num_lines),
        }
        current_file_changes = {
            "old_lines": to_num_of_lines(old_num_lines),
            "new_lines": to_num_of_lines(new_num_lines),
        }
        current_file["chunks"].append(current_chunk)

    def delete(line):
        nonlocal deleted_line_counter
        if not current_chunk:
            return
        current_chunk["changes"].append({
            "type": "del",
            "del": True,
            "ln": deleted_line_counter,
            "content": line,
        })
        deleted_line_counter += 1
        current_file["deletions"] += 1
        current_file_changes["old_lines"] -= 1

    def add(line):
        nonlocal added_line_counter
        if not current_chunk:
            return
        current_chunk["changes"].append({
            "type": "add",
            "add": True,
            "ln": added_line_counter,
            "content": line,
        })
        added_line_counter += 1
        current_file["additions"] += 1
        current_file_changes["new_lines"] -= 1

    def eof(line):
        if not current_chunk:
            return
        most_recent_change = current_chunk["changes"][-1]
        # Copy only the line-number keys the previous change actually has:
        # "normal" changes carry ln1/ln2, while "add"/"del" changes carry ln.
        change = {
            "type": most_recent_change["type"],
            most_recent_change["type"]: True,
            "content": line,
        }
        for key in ("ln1", "ln2", "ln"):
            if key in most_recent_change:
                change[key] = most_recent_change[key]
        current_chunk["changes"].append(change)

    # Hunk header: "@@ -a,b +c,d @@" (the line counts b and d are optional).
    header_patterns = [
        (re.compile(r"^@@\s+-(\d+)(?:,(\d+))?\s+\+(\d+)(?:,(\d+))?\s+@@"), chunk),
    ]
    content_patterns = [
        (re.compile(r"^\\\s+No\s+newline\s+at\s+end\s+of\s+file$"), eof),
        (re.compile(r"^-"), delete),
        (re.compile(r"^\+"), add),
        (re.compile(r"^\s+"), normal),
    ]

    def parse_content_line(line):
        nonlocal current_file_changes
        for pattern, handler in content_patterns:
            if re.search(pattern, line):
                handler(line)
                break
        # Once all of a hunk's lines are consumed, expect the next header.
        if current_file_changes["old_lines"] == 0 and current_file_changes["new_lines"] == 0:
            current_file_changes = None

    def parse_header_line(line):
        for pattern, handler in header_patterns:
            match = re.search(pattern, line)
            if match:
                handler(line, match)
                break

    def parse_line(line):
        if current_file_changes:
            parse_content_line(line)
        else:
            parse_header_line(line)

    for line in lines:
        parse_line(line)
    return result
```

The bot posts comments via the GitLab API and marks them as Resolved to avoid manual cleanup.
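As a sketch of the posting step: a line-anchored MR comment goes through GitLab's discussions API (`POST /projects/:id/merge_requests/:iid/discussions`). The helper below only builds the request payload; the helper name, the comment text, and the sha placeholders are illustrative, not the article's actual code.

```python
def build_line_comment(body, new_path, new_line, base_sha, start_sha, head_sha):
    """Payload for POST /projects/:id/merge_requests/:iid/discussions,
    anchoring the comment to a changed line on the new side of the diff.
    (Hypothetical helper; field names follow the public GitLab API.)"""
    return {
        "body": body,
        "position": {
            "position_type": "text",
            "base_sha": base_sha,    # diff refs of the merge request
            "start_sha": start_sha,
            "head_sha": head_sha,
            "new_path": new_path,    # file path after the change
            "new_line": new_line,    # line number from the parsed diff
        },
    }

payload = build_line_comment(
    body="Consider adding error handling to this function",  # LLM feedback
    new_path="src/app.ts",
    new_line=42,
    base_sha="<base>", start_sha="<start>", head_sha="<head>",
)
print(payload["position"]["new_line"])
```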
Reflections
Everything is probabilistic : LLM outputs are based on the highest‑probability token sequence, even for seemingly deterministic questions.
Open‑source LLMs + domain knowledge base + private deployment is a viable enterprise practice: multiple models can be combined, internal knowledge is essential for relevance, and private deployment alleviates data‑security concerns. LLMs are often showcased as chat products, but real‑world applications require concrete scenario analysis.
AI+ is just beginning : CR Copilot is one of many LLM‑enabled engineering tools the team plans to share.
Join Us
If you are interested in LLMs, front‑end React, back‑end Golang, etc., feel free to apply: https://job.toutiao.com/s/ieD4KuyR
References
[1] Samsung leaked chip secrets after adopting ChatGPT – https://n.news.naver.com/article/243/0000042639
[2] Llama2‑Chinese‑13b‑Chat – https://huggingface.co/FlagAlpha/Llama2-Chinese-13b-Chat
[3] chatglm2‑6b – https://huggingface.co/THUDM/chatglm2-6b
[4] Baichuan2‑13B‑Chat – https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat
[5] Meta model download request – https://ai.meta.com/resources/models-and-libraries/llama-downloads/
[6] Model scores – https://opencompass.org.cn/model-compare/ChatGLM2-6B,LLaMA-2-Chinese-13B,Baichuan2-13B-Chat
[10] bge-large-zh – https://huggingface.co/BAAI/bge-large-zh
[11] LarkSuite loader – https://python.langchain.com/docs/integrations/document_loaders/larksuite
ByteFE
Cutting‑edge tech, article sharing, and practical insights from the ByteDance frontend team.