How Code Graph Model (CGM) Redefines Repository‑Level Code Understanding
The Code Graph Model (CGM), introduced by Ant Group's multimodal code team, integrates repository‑level graph structures into open‑source LLMs. It achieves a 44% solve rate on SWE‑bench Lite without relying on agents, combining multi‑granular graph construction, structure‑semantic dual‑modal alignment, and a lightweight GraphRAG framework.
Code Graph Model (CGM) Overview
At NeurIPS 2025 the Code Graph Model (CGM) was introduced as a novel architecture that integrates repository‑level graph structures into open‑source large language models (LLMs). By aligning code graphs with LLM inputs, CGM achieves a 44.00% solve rate on the SWE‑bench Lite benchmark, the highest among open‑weight models.
Research Motivation
Repository‑level issue‑fix tasks remain challenging for existing solutions, which typically rely on closed‑source LLM agents. These agents suffer from:
Unpredictability: multi‑step planning and tool calls lead to error accumulation.
High computational cost: repeated inference incurs large time and compute overhead.
Privacy and deployment constraints: dependence on proprietary models limits private‑cloud use.
CGM addresses these limitations by fusing repository structure with code semantics, enabling open‑source LLMs to reason about complex software systems.
Core Technology
1️⃣ Multi‑Granular Code Graph Construction
Static program analysis converts an entire code repository into a Code Graph composed of:
Node types (7): REPO, PACKAGE, FILE, TEXTFILE, CLASS, FUNCTION, ATTRIBUTE.
Edge types (5): contains (hierarchical), calls (function calls), imports (module imports), extends (class inheritance), implements (interface implementation).
The graph supports multiple inheritance via Class Hierarchy Analysis (CHA) and conservatively resolves dynamic calls to preserve semantic dependencies.
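The node and edge taxonomy above can be sketched as a small data structure. This is an illustrative reconstruction, not CGM's actual implementation; the class and method names are assumptions.

```python
# Hedged sketch of a multi-granular code graph using the seven node types
# and five edge types described in the article; names are illustrative.
NODE_TYPES = {"REPO", "PACKAGE", "FILE", "TEXTFILE", "CLASS", "FUNCTION", "ATTRIBUTE"}
EDGE_TYPES = {"contains", "calls", "imports", "extends", "implements"}

class CodeGraph:
    def __init__(self):
        self.nodes = {}   # node name -> node type
        self.edges = []   # (src, dst, edge type) triples

    def add_node(self, name, ntype):
        assert ntype in NODE_TYPES, f"unknown node type {ntype}"
        self.nodes[name] = ntype

    def add_edge(self, src, dst, etype):
        assert etype in EDGE_TYPES, f"unknown edge type {etype}"
        assert src in self.nodes and dst in self.nodes
        self.edges.append((src, dst, etype))

# Build a tiny hierarchy: repo -> file -> class -> function
g = CodeGraph()
g.add_node("repo", "REPO")
g.add_node("pkg/app.py", "FILE")
g.add_node("App", "CLASS")
g.add_node("App.run", "FUNCTION")
g.add_edge("repo", "pkg/app.py", "contains")
g.add_edge("pkg/app.py", "App", "contains")
g.add_edge("App", "App.run", "contains")
```

In the real system these nodes and edges are produced by static program analysis over the whole repository, not added by hand.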
2️⃣ Structure‑Semantic Dual‑Modal Alignment
Semantic Integration (512× context expansion)
Encode each node’s textual content with a pretrained CodeT5+ encoder.
Compress up to 512 tokens per node into a single node token.
Map the node token into the LLM’s embedding space using a dedicated Adapter, effectively extending the usable context length by a factor of 512.
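The three steps above can be sketched numerically. The pooling and adapter below are stand‑ins (mean pooling and a random linear projection); the real system uses a pretrained CodeT5+ encoder and a learned Adapter, and the dimensions here are made up for illustration.

```python
import numpy as np

ENC_DIM, LLM_DIM = 256, 1024                      # assumed dimensions, not from the paper
rng = np.random.default_rng(0)
adapter_W = rng.normal(scale=0.02, size=(ENC_DIM, LLM_DIM))  # stand-in for the learned Adapter

def node_token(token_embeddings: np.ndarray) -> np.ndarray:
    """Compress up to 512 encoder token embeddings into one node token,
    then project it into the LLM embedding space."""
    pooled = token_embeddings.mean(axis=0)        # (ENC_DIM,) - pooling stand-in for CodeT5+
    return pooled @ adapter_W                     # (LLM_DIM,) - one token in LLM space

emb = rng.normal(size=(512, ENC_DIM))             # one node's encoded text, 512 tokens
tok = node_token(emb)                             # 512 tokens -> 1 node token
```

This is where the 512× context expansion comes from: each graph node occupies a single position in the LLM's input regardless of how much text it originally contained.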
Structural Integration (Graph‑aware Attention Mask)
The adjacency matrix of the code graph is transformed into a graph‑aware attention mask that replaces the standard causal mask. Attention is computed only between nodes that share a direct dependency edge, mimicking message passing in graph neural networks.
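A minimal sketch of this mask construction, assuming an additive attention mask (0 where attention is allowed, −∞ where it is blocked) and treating edges as bidirectional for attention; both are my assumptions about the details:

```python
import numpy as np

def graph_attention_mask(adj: np.ndarray) -> np.ndarray:
    """Turn a code-graph adjacency matrix into an attention mask that only
    allows attention between directly connected nodes (plus self-attention)."""
    n = adj.shape[0]
    allowed = (adj | adj.T) | np.eye(n, dtype=bool)  # symmetrize edges, add self-loops
    return np.where(allowed, 0.0, -np.inf)           # additive mask on attention logits

# Three nodes: 0 -> 1 -> 2, no direct edge between 0 and 2
adj = np.array([[0, 1, 0],
                [0, 0, 1],
                [0, 0, 0]], dtype=bool)
mask = graph_attention_mask(adj)
```

Adding this mask to the attention logits before the softmax zeroes out attention between unconnected nodes, which is what makes the layer behave like message passing in a graph neural network.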
3️⃣ Two‑Stage Training Strategy
Stage 1 – Subgraph Reconstruction (Graph‑to‑Code): Sample subgraphs (exposing only node types and edge structure, not the full code text) and train the model to reconstruct the original code snippets, forcing deep understanding of graph‑code correspondence.
Stage 2 – Noise‑augmented Fine‑tuning : Fine‑tune on real GitHub Issue‑PR data, injecting 10% noise (irrelevant or missing files) into prompts to improve robustness and generalization.
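The Stage‑2 noise injection can be sketched as a data‑pipeline step. The exact corruption scheme (how often files are added versus dropped) is not specified in the summary, so the 50/50 split below is an assumption; only the 10% overall noise rate comes from the text.

```python
import random

NOISE_RATE = 0.10  # 10% of prompts are corrupted, per the described setup

def maybe_corrupt(files, distractor_pool, rng):
    """With probability NOISE_RATE, inject an irrelevant file into the
    prompt context or drop a relevant one (assumed 50/50 between the two)."""
    files = list(files)
    if rng.random() >= NOISE_RATE:
        return files                               # clean sample, unchanged
    if rng.random() < 0.5 and distractor_pool:
        files.append(rng.choice(distractor_pool))  # add an irrelevant file
    elif len(files) > 1:
        files.pop(rng.randrange(len(files)))       # drop a relevant file
    return files

rng = random.Random(0)
clean = ["pkg/auth.py", "pkg/db.py"]
noisy_count = sum(
    maybe_corrupt(clean, ["unrelated.py"], rng) != clean for _ in range(1000)
)
```

Training on these corrupted contexts teaches the model to tolerate imperfect retrieval at inference time, when the GraphRAG pipeline may surface extra or miss relevant files.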
Agentless GraphRAG Framework
To streamline inference, CGM is paired with a lightweight GraphRAG pipeline consisting of four modules:
Rewriter : Rewrites user issues to extract key file names and keywords.
Retriever : Combines lexical and semantic search to locate highly relevant subgraphs from the full code graph.
Reranker : Ranks files within the retrieved subgraph and selects the top‑K critical files.
Reader/CGM : Consumes the subgraph structure and selected file contents to generate a high‑quality fix patch in a single pass.
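The four modules above can be sketched as a toy end‑to‑end pipeline. Everything here is illustrative: the real Rewriter, Reranker, and Reader are LLM‑based, and the Retriever combines lexical and semantic search, whereas this sketch uses keyword matching and a crude length heuristic.

```python
import re

def rewriter(issue: str) -> set:
    # Extract candidate file names and keywords from the issue text.
    return set(re.findall(r"[\w./]+\.py|\w{4,}", issue.lower()))

def retriever(keywords, code_graph):
    # Lexical stand-in for combined lexical + semantic subgraph search.
    return {f: src for f, src in code_graph.items()
            if any(k in f.lower() or k in src.lower() for k in keywords)}

def reranker(subgraph, top_k=2):
    # Keep the top-K files (here ranked by a crude length heuristic).
    ranked = sorted(subgraph.items(), key=lambda kv: len(kv[1]), reverse=True)
    return dict(ranked[:top_k])

def reader(issue, files):
    # The real Reader is CGM itself; emit a placeholder patch header per file.
    return [f"--- a/{f}\n+++ b/{f}" for f in files]

code_graph = {
    "pkg/auth.py": "def login(user): ...",
    "pkg/db.py": "def connect(): ...",
    "README.md": "project docs",
}
issue = "Fix crash in login flow in pkg/auth.py"
patches = reader(issue, reranker(retriever(rewriter(issue), code_graph)))
```

The key design point survives even in this toy version: the pipeline narrows the whole repository down to a small relevant subgraph before the model generates a patch in a single pass, with no multi‑step agent loop.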
Experimental Results
Repository‑Level Code Fix (SWE‑bench)
CGM achieves a 44.00% solve rate on SWE‑bench Lite, ranking first among open‑weight models.
Repository‑Level Code Completion
On the CrossCodeEval and ComplexCodeEval benchmarks, CGM outperforms same‑size baselines on complex structure completion tasks, demonstrating the generality of the graph‑enhanced architecture.
Resources
Paper: https://arxiv.org/abs/2505.16901
Model checkpoint (72B): https://huggingface.co/codefuse-ai/CodeFuse-CGM-72B
Dataset (CodeGraph): https://huggingface.co/datasets/codefuse-ai/CodeGraph
Source code: https://github.com/codefuse-ai/CodeFuse-CGM