How C2LLM Redefines Code Retrieval with Attention‑Based Pooling

C2LLM is a contrastive code LLM series that replaces mean and EOS pooling with a multi‑head attention pooling module. It achieves top scores across the 12 tasks of the MTEB‑Code benchmark and targets cost‑effective, high‑precision code retrieval for both production systems and AI agent applications.

PaperAgent

Core Challenge: Pooling Bottleneck in Code Semantics

Embedding models are the foundation of Code Retrieval‑Augmented Generation (Code RAG). Traditional text‑embedding pooling methods—mean pooling and end‑of‑sequence (EOS) pooling—lose critical information when applied to long, highly structured code sequences.

Mean Pooling : usually paired with bidirectional attention, which conflicts with the causal pre‑training of large code models.

EOS Pooling : compresses the entire sequence into the final token, causing severe information loss for functions or long algorithms.
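
To make the two baselines concrete, here is a minimal NumPy sketch of mean pooling and EOS (last-token) pooling over a batch of token representations. The array shapes, the right-padding convention, and the function names are assumptions chosen for illustration; they are not taken from the C2LLM code.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average the hidden states of real (non-padding) tokens."""
    mask = mask[:, :, None].astype(hidden.dtype)      # (batch, seq, 1)
    return (hidden * mask).sum(axis=1) / mask.sum(axis=1)

def eos_pool(hidden, mask):
    """Take the hidden state of the last real token (assumes right padding)."""
    last = mask.sum(axis=1).astype(int) - 1           # index of final real token
    return hidden[np.arange(hidden.shape[0]), last]

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 8))   # hypothetical (batch, seq_len, hidden_dim)
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0]])    # second sequence ends with two padding tokens

mean_vec = mean_pool(hidden, mask)    # every real token contributes equally
eos_vec = eos_pool(hidden, mask)      # only one position contributes
```

Mean pooling weights every token equally, while EOS pooling reads a single position; neither can learn which tokens matter most for retrieval.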

Architecture: Cross‑Attention Adaptive Pooling (PMA)

C2LLM builds on the causal attention backbone of Qwen2.5‑Coder and inserts a lightweight Pooling by Multi‑head Attention (PMA) layer at the top of the model.

Technical Principle

Feature Aggregation : a learnable query vector attends to all token representations, automatically weighting semantically important parts such as function signatures or core loops.

Breaking the Bottleneck : unlike EOS pooling, which reads a single position, PMA can focus on multiple key positions simultaneously, improving representation quality for long code.

Dimensional Flexibility : adjusting the PMA projection matrix changes the output embedding dimension without requiring complex training objectives.
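
The three properties above can be sketched with a simplified attention-pooling function in NumPy. This is an illustrative sketch, not the C2LLM implementation: it uses one learnable query, omits any residual, normalization, or feed-forward sublayers a full PMA block may have, and all names (`pma_pool`, `w_q`, `w_k`, `w_v`, `w_o`) and sizes are hypothetical. The weights are random here, whereas in a real model they would be trained.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pma_pool(hidden, mask, query, w_q, w_k, w_v, w_o, num_heads):
    """Pool (B, T, D) token states into (B, D_out) with one learnable query."""
    B, T, _ = hidden.shape
    d_attn = w_q.shape[1]
    d_head = d_attn // num_heads
    q = (query @ w_q).reshape(num_heads, d_head)              # (H, d_head)
    k = (hidden @ w_k).reshape(B, T, num_heads, d_head)       # (B, T, H, d_head)
    v = (hidden @ w_v).reshape(B, T, num_heads, d_head)
    scores = np.einsum("hd,bthd->bht", q, k) / np.sqrt(d_head)
    scores = np.where(mask[:, None, :] > 0, scores, -1e9)     # hide padding tokens
    weights = softmax(scores, axis=-1)                        # (B, H, T)
    heads = np.einsum("bht,bthd->bhd", weights, v)            # (B, H, d_head)
    pooled = heads.reshape(B, d_attn) @ w_o                   # (B, D_out)
    return pooled, weights

rng = np.random.default_rng(0)
B, T, D, D_attn, D_out, H = 2, 6, 16, 16, 8, 4                # hypothetical sizes
hidden = rng.normal(size=(B, T, D))
mask = np.array([[1] * 6, [1] * 4 + [0] * 2])
query = rng.normal(size=D)                                    # the learnable query
w_q, w_k, w_v = (rng.normal(size=(D, D_attn)) * 0.1 for _ in range(3))
w_o = rng.normal(size=(D_attn, D_out)) * 0.1                  # sets the output dimension

embedding, weights = pma_pool(hidden, mask, query, w_q, w_k, w_v, w_o, H)
```

Each head produces its own weight distribution over tokens, so several positions can be emphasized at once, and the shape of `w_o` alone determines the embedding dimension (8 in this sketch).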

Figure: PMA module architecture diagram

Comprehensive Evaluation on MTEB‑Code (12 Tasks)

C2LLM‑7B achieves a total score of 80.75, ranking first on the leaderboard, while C2LLM‑0.5B sets a new record for sub‑1B models.

Figure: MTEB‑Code performance comparison

Task Categories

Text‑to‑Code : APPS, CosQA, CodeSearchNet – PMA enables precise matching between natural‑language intent and programming‑language implementation, showing robustness to informal queries.

Code QA : Single‑turn (StackOverflowQA, CodeFeedbackST) and multi‑turn (CodeFeedbackMT) – PMA extracts key technical entities from dialogue history, ensuring highly relevant code suggestions.

Code‑to‑Code & Cross‑Language Retrieval : CodeTransOceanContest, CodeTransOceanDL, CodeSearchNetCCR – C2LLM captures cross‑language logical equivalence beyond keyword matching.

Structured Data & Version Control : SyntheticText2SQL (text‑to‑SQL), CodeEditSearch (text‑to‑code edit), COIRCodeSearchNet (code‑to‑text) – demonstrates retrieval precision for constrained syntax, diff‑style intent, and code‑to‑documentation matching.
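
All of these task categories reduce to the same operation at query time: embed the query, embed the candidates, and rank by similarity. The sketch below shows that ranking step with cosine similarity in NumPy. The tiny hand-written vectors stand in for model embeddings, and the `top_k` helper is a hypothetical name, not part of any C2LLM or MTEB API.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def top_k(query_emb, corpus_emb, k=3):
    """Return indices and cosine scores of the k most similar corpus items."""
    sims = l2_normalize(query_emb) @ l2_normalize(corpus_emb).T   # (Q, N)
    order = np.argsort(-sims, axis=1)[:, :k]                      # best first
    return order, np.take_along_axis(sims, order, axis=1)

# Toy stand-ins for embeddings of four code snippets and one query.
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0],
                   [0.0, 0.0, 1.0]])
queries = np.array([[0.9, 0.1, 0.0]])

idx, scores = top_k(queries, corpus, k=2)
print(idx.tolist())   # [[0, 2]]
```

Only the content of the query and the corpus changes between text-to-code, code-to-code, and text-to-SQL retrieval; the ranking step stays the same.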

Training Strategy

~3 million high‑quality samples covering diverse code‑related tasks.

Contrastive learning with LoRA fine‑tuning, global batch synchronization, and hard negative mining to improve discriminative ability.

Weighted checkpoint averaging to enhance cross‑language generalization.
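
Two of the ingredients above can be sketched in a few lines of NumPy: a contrastive (InfoNCE-style) loss that uses in-batch positives plus one mined hard negative per query, and a weighted average of checkpoint parameters. The exact loss, temperature, and averaging weights used for C2LLM are not given here, so every value and name below (`info_nce`, `average_checkpoints`, `temperature=0.05`) is an assumption for illustration. LoRA and cross-device batch gathering are left out.

```python
import numpy as np

def info_nce(q, p, hard_neg, temperature=0.05):
    """q, p, hard_neg: (B, D) L2-normalized; row i of p is the positive for q[i]."""
    candidates = np.concatenate([p, hard_neg], axis=0)     # (2B, D)
    logits = q @ candidates.T / temperature                # (B, 2B)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    target = np.arange(q.shape[0])                         # positive sits on the diagonal
    return -log_prob[target, target].mean()

def average_checkpoints(checkpoints, weights):
    """Weighted average of parameter dicts that share the same keys and shapes."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return {name: sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
            for name in checkpoints[0]}

def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = unit(rng.normal(size=(4, 8)))          # toy query embeddings
hard = unit(rng.normal(size=(4, 8)))       # toy mined hard negatives
loss_aligned = info_nce(q, q, hard)        # positives identical to queries

ckpts = [{"w": np.array([1.0, 2.0])}, {"w": np.array([3.0, 4.0])}]
merged = average_checkpoints(ckpts, weights=[1, 3])   # {"w": array([2.5, 3.5])}
```

Every other positive in the batch and every hard negative acts as a distractor for a given query, which is why gathering embeddings across devices (one plausible reading of "global batch synchronization") enlarges the pool of negatives.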

Conclusions & Outlook

Improving the pooling mechanism unlocks substantial performance gains for code retrieval without dramatically increasing model size. The 0.5B model offers a cost‑effective alternative that rivals many 7B models, while the 7B model provides high‑precision retrieval for complex agent systems.

Resources

Paper/Technical Report: https://arxiv.org/abs/2512.21332

GitHub repository: codefuse-ai/CodeFuse-Embeddings

Hugging Face model cards: codefuse-ai/C2LLM-7B, codefuse-ai/C2LLM-0.5B