How Code Graph Model (CGM) Redefines Repository‑Level Code Understanding

The Code Graph Model (CGM), introduced by Ant's multimodal code team, integrates repository-level graph structures into open-source LLMs. Through multi-granular graph construction, structure-semantic dual-modal alignment, and a lightweight GraphRAG framework, it achieves a 44.00% solve rate on SWE-bench Lite without relying on agents.

PaperAgent

Code Graph Model (CGM) Overview

At NeurIPS 2025, the Code Graph Model (CGM) was introduced as a novel architecture that integrates repository-level graph structures into open-source large language models (LLMs). By aligning code graphs with LLM inputs, CGM achieves a 44.00% solve rate on the SWE-bench Lite benchmark, the highest among open-weight models.

Research Motivation

Repository‑level issue‑fix tasks remain challenging for existing solutions, which typically rely on closed‑source LLM agents. These agents suffer from:

Unpredictability: multi-step planning and tool calls lead to error accumulation.

High computational cost: repeated inference incurs large time and compute overhead.

Privacy and deployment constraints: dependence on proprietary models limits private-cloud use.

CGM addresses these limitations by fusing repository structure with code semantics, enabling open‑source LLMs to reason about complex software systems.

Core Technology

1️⃣ Multi‑Granular Code Graph Construction

Static program analysis converts an entire code repository into a Code Graph composed of:

Node types (7): REPO, PACKAGE, FILE, TEXTFILE, CLASS, FUNCTION, ATTRIBUTE.

Edge types (5): contains (hierarchical), calls (function calls), imports (module imports), extends (class inheritance), implements (interface implementation).

The graph supports multiple inheritance via Class Hierarchy Analysis (CHA) and conservatively resolves dynamic calls to preserve semantic dependencies.
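As an illustration, the node and edge taxonomy above can be captured in a minimal in-memory graph. This is only a sketch: the `CodeGraph` class and the sample repository contents are hypothetical, and a real implementation would populate the graph from static program analysis.

```python
# Node and edge type names follow the paper's taxonomy; the rest is illustrative.
NODE_TYPES = {"REPO", "PACKAGE", "FILE", "TEXTFILE", "CLASS", "FUNCTION", "ATTRIBUTE"}
EDGE_TYPES = {"contains", "calls", "imports", "extends", "implements"}

class CodeGraph:
    def __init__(self):
        self.nodes = {}   # node_id -> (node_type, textual content)
        self.edges = []   # (src_id, edge_type, dst_id)

    def add_node(self, node_id, node_type, text=""):
        assert node_type in NODE_TYPES, f"unknown node type: {node_type}"
        self.nodes[node_id] = (node_type, text)

    def add_edge(self, src, edge_type, dst):
        assert edge_type in EDGE_TYPES, f"unknown edge type: {edge_type}"
        assert src in self.nodes and dst in self.nodes
        self.edges.append((src, edge_type, dst))

# Tiny hypothetical repository: repo -> file -> class -> method
g = CodeGraph()
g.add_node("repo", "REPO")
g.add_node("pkg/app.py", "FILE")
g.add_node("pkg/app.py::Server", "CLASS", "class Server(Base): ...")
g.add_node("pkg/app.py::Server.run", "FUNCTION", "def run(self): ...")
g.add_edge("repo", "contains", "pkg/app.py")
g.add_edge("pkg/app.py", "contains", "pkg/app.py::Server")
g.add_edge("pkg/app.py::Server", "contains", "pkg/app.py::Server.run")
```

The typed edges are what later stages rely on: `contains` edges give the hierarchy, while `calls`/`imports`/`extends`/`implements` carry the semantic dependencies.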

2️⃣ Structure‑Semantic Dual‑Modal Alignment

Semantic Integration (512× context expansion)

Encode each node’s textual content with a pretrained CodeT5+ encoder.

Compress up to 512 tokens per node into a single node token.

Map the node token into the LLM’s embedding space using a dedicated Adapter, effectively extending the usable context length by a factor of 512.
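The three steps above can be sketched as a tiny encoder-plus-adapter pipeline. Everything here is a stand-in: a random embedding table plays the role of the CodeT5+ encoder, the dimensions are illustrative, and mean pooling is one plausible way to collapse a node's tokens into a single vector.

```python
import torch
import torch.nn as nn

ENC_DIM, LLM_DIM, MAX_NODE_TOKENS = 256, 1024, 512

encoder = nn.Embedding(32000, ENC_DIM)   # stand-in for a pretrained CodeT5+ encoder
adapter = nn.Sequential(                 # maps encoder space into the LLM's embedding space
    nn.Linear(ENC_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

def node_token(token_ids: torch.Tensor) -> torch.Tensor:
    """Compress up to 512 token embeddings into a single LLM-space node token."""
    token_ids = token_ids[:MAX_NODE_TOKENS]   # truncate overly long node content
    pooled = encoder(token_ids).mean(dim=0)   # pool to one ENC_DIM vector
    return adapter(pooled)                    # one token in the LLM embedding space

vec = node_token(torch.randint(0, 32000, (512,)))
```

Because each node contributes one token regardless of how many source tokens it held, a graph of N nodes occupies N positions of LLM context instead of up to 512·N, which is where the 512× expansion comes from.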

Structural Integration (Graph‑aware Attention Mask)

The adjacency matrix of the code graph is transformed into a graph‑aware attention mask that replaces the standard causal mask. Attention is computed only between nodes that share a direct dependency edge, mimicking message passing in graph neural networks.
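A minimal sketch of deriving such a mask from the adjacency matrix, using plain Python lists. Treating edges as bidirectional for attention and letting each node attend to itself are assumptions of this sketch, not details confirmed by the text.

```python
def graph_attention_mask(adj):
    """Turn a 0/1 adjacency matrix into an additive attention mask.

    Pairs connected by a direct edge (in either direction) and self-pairs
    get 0.0; all other pairs get -inf, so softmax zeroes them out.
    """
    n = len(adj)
    mask = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            connected = bool(adj[i][j]) or bool(adj[j][i]) or i == j
            if not connected:
                mask[i][j] = float("-inf")
    return mask

# 3 nodes: 0 -contains-> 1, 1 -calls-> 2, no direct edge between 0 and 2
adj = [[0, 1, 0],
       [0, 0, 1],
       [0, 0, 0]]
mask = graph_attention_mask(adj)
```

Adding this mask to the attention logits before softmax restricts information flow to graph neighbors, which is what makes a transformer layer behave like one round of message passing.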

3️⃣ Two‑Stage Training Strategy

Stage 1 – Subgraph Reconstruction (Graph-to-Code): sample subgraphs that retain only node types and edge structure (with the code text withheld), and train the model to reconstruct the original code snippets, forcing a deep understanding of graph-code correspondence.

Stage 2 – Noise-augmented Fine-tuning: fine-tune on real GitHub Issue-PR data, injecting 10% noise (irrelevant or missing files) into prompts to improve robustness and generalization.
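The Stage 2 noise-injection step could look like the following sketch. The 10% rate comes from the text; the 50/50 split between dropping a relevant file and adding an irrelevant one, and the function itself, are illustrative assumptions.

```python
import random

def inject_noise(files, distractors, noise_rate=0.1, rng=random):
    """With probability noise_rate, perturb a prompt's file context:
    either drop one relevant file or append one irrelevant distractor."""
    files = list(files)
    if rng.random() < noise_rate and files:
        if rng.random() < 0.5:
            files.pop(rng.randrange(len(files)))   # simulate a missing file
        elif distractors:
            files.append(rng.choice(distractors))  # add an irrelevant file
    return files

rng = random.Random(0)
noisy = inject_noise(["a.py", "b.py"], ["unrelated.py"], rng=rng)
```

Training on prompts perturbed this way teaches the model not to assume the retrieved context is perfectly relevant or complete, which matches the imperfect retrieval it sees at inference time.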

Agentless GraphRAG Framework

To streamline inference, CGM is paired with a lightweight GraphRAG pipeline consisting of four modules:

Rewriter: rewrites user issues to extract key file names and keywords.

Retriever: combines lexical and semantic search to locate highly relevant subgraphs from the full code graph.

Reranker: ranks files within the retrieved subgraph and selects the top-K critical files.

Reader/CGM: consumes the subgraph structure and selected file contents to generate a high-quality fix patch in a single pass.
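The four modules chain into a single forward pass. The sketch below wires them together with toy stand-ins: all function bodies, the string-matching "retrieval", and the fake patch output are hypothetical; a real pipeline would call a search index and the CGM model itself.

```python
def rewriter(issue: str) -> dict:
    # Toy extraction of candidate file names and keywords from the issue text.
    words = issue.replace(",", " ").split()
    return {
        "files": [w for w in words if w.endswith(".py")],
        "keywords": [w.lower() for w in words if w.isalpha()],
    }

def retriever(query: dict, code_graph: dict) -> dict:
    # Toy lexical match: keep nodes whose id or text mentions a query term.
    return {nid: txt for nid, txt in code_graph.items()
            if any(f in nid for f in query["files"])
            or any(k in txt.lower() for k in query["keywords"])}

def reranker(subgraph: dict, top_k: int = 2) -> list:
    # Toy ranking: prefer nodes with more matched content, keep the top-K.
    return sorted(subgraph, key=lambda nid: len(subgraph[nid]), reverse=True)[:top_k]

def reader(files: list, subgraph: dict) -> str:
    # Stand-in for CGM: consume structure + file contents, emit a patch.
    return "PATCH for: " + ", ".join(files)

graph = {"app/server.py": "def run(): crash()", "app/util.py": "def helper(): pass"}
query = rewriter("Fix crash in app/server.py")
hits = retriever(query, graph)
patch = reader(reranker(hits), hits)
```

Because every module is a single deterministic call, the whole fix is produced in one pass, with none of the multi-step planning loops that make agent-based pipelines unpredictable.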

Experimental Results

Repository‑Level Code Fix (SWE‑bench)

CGM achieves a 44.00% solve rate on SWE‑bench Lite, ranking first among open‑weight models.

Repository‑Level Code Completion

On the CrossCodeEval and ComplexCodeEval benchmarks, CGM outperforms same‑size baselines on complex structure completion tasks, demonstrating the generality of the graph‑enhanced architecture.

Resources

Paper: https://arxiv.org/abs/2505.16901

Model checkpoint (72B): https://huggingface.co/codefuse-ai/CodeFuse-CGM-72B

Dataset (CodeGraph): https://huggingface.co/datasets/codefuse-ai/CodeGraph

Source code: https://github.com/codefuse-ai/CodeFuse-CGM

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: AI, LLM, software engineering, open-source, GraphRAG, Code Graph, Repository-level
Written by PaperAgent

Daily updates, analyzing cutting-edge AI research papers