How C2LLM Redefines Code Retrieval with Attention‑Based Pooling
C2LLM is a series of contrastive code LLMs that replaces mean and EOS pooling with a multi‑head attention pooling module. It achieves top scores on the 12‑task MTEB‑Code benchmark and offers cost‑effective, high‑precision code retrieval for both production systems and AI agent applications.
Core Challenge: Pooling Bottleneck in Code Semantics
Embedding models are the foundation of Code Retrieval‑Augmented Generation (Code RAG). Traditional text‑embedding pooling methods—mean pooling and end‑of‑sequence (EOS) pooling—lose critical information when applied to long, highly structured code sequences.
Mean Pooling: usually paired with bidirectional attention, which conflicts with the causal pre‑training of large code models.
EOS Pooling: compresses the entire sequence into the final token, causing severe information loss for long functions or algorithms.
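To make the two baselines concrete, here is a minimal NumPy sketch of both. The random array stands in for a model's final hidden states, and right padding is assumed; the function names are illustrative and not taken from the C2LLM code base.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average the hidden states of the non-padding tokens."""
    mask = mask[:, :, None].astype(hidden.dtype)       # (batch, seq, 1)
    return (hidden * mask).sum(axis=1) / mask.sum(axis=1)

def eos_pool(hidden, mask):
    """Keep only the hidden state of the last non-padding token (right padding assumed)."""
    last = mask.sum(axis=1).astype(int) - 1             # index of the final real token
    return hidden[np.arange(hidden.shape[0]), last]

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 8))     # toy batch: 2 sequences, 5 tokens, hidden size 8
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0]])      # second sequence ends with two padding tokens

mean_emb = mean_pool(hidden, mask)      # every real token contributes equally
eos_emb = eos_pool(hidden, mask)        # a single token must carry the whole sequence
```

Mean pooling gives every token the same weight, while EOS pooling relies on one position; neither lets the model decide which tokens matter most.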
Architecture: Cross‑Attention Adaptive Pooling (PMA)
C2LLM builds on the causal attention backbone of Qwen2.5‑Coder and inserts a lightweight Pooling by Multi‑head Attention (PMA) layer at the top of the model.
Technical Principle
Feature Aggregation: a learnable query vector attends to all token representations, automatically weighting semantically important parts such as function signatures or core loops.
Breaking the Bottleneck: unlike EOS pooling, PMA can attend to multiple key positions at once, improving representation quality for long code.
Dimensional Flexibility: resizing the PMA projection matrix changes the output embedding dimension without requiring additional, more complex training objectives.
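The three properties above can be illustrated with a simplified attention‑pooling sketch in NumPy. It is not the C2LLM implementation: the class name, the random initialisation, and the omission of any residual, normalisation, or feed‑forward layers are assumptions made to keep the example short. A single query attends over all token states with several heads, and the width of the key/value projections sets the embedding size.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class AttentionPooling:
    """Hypothetical single-query, multi-head pooling layer (illustration only)."""

    def __init__(self, hidden_dim, out_dim, num_heads, seed=0):
        assert out_dim % num_heads == 0
        rng = np.random.default_rng(seed)
        self.num_heads = num_heads
        self.head_dim = out_dim // num_heads
        scale = hidden_dim ** -0.5
        # In a real model these would be trained; here they are random.
        self.query = rng.normal(size=(out_dim,))
        self.w_k = rng.normal(size=(hidden_dim, out_dim)) * scale
        self.w_v = rng.normal(size=(hidden_dim, out_dim)) * scale

    def __call__(self, hidden, mask):
        batch, seq, _ = hidden.shape
        h, d = self.num_heads, self.head_dim
        q = self.query.reshape(h, d)                            # (heads, head_dim)
        k = (hidden @ self.w_k).reshape(batch, seq, h, d)
        v = (hidden @ self.w_v).reshape(batch, seq, h, d)
        scores = np.einsum("hd,bshd->bhs", q, k) / np.sqrt(d)   # query vs. every token
        scores = np.where(mask[:, None, :] > 0, scores, -1e9)   # ignore padding
        weights = softmax(scores, axis=-1)                      # (batch, heads, seq)
        pooled = np.einsum("bhs,bshd->bhd", weights, v)
        return pooled.reshape(batch, h * d), weights

rng = np.random.default_rng(1)
hidden = rng.normal(size=(2, 6, 16))    # stand-in for the backbone's token states
mask = np.array([[1, 1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 0, 0]])

pool = AttentionPooling(hidden_dim=16, out_dim=8, num_heads=2)
emb, weights = pool(hidden, mask)       # emb: (2, 8); weights: (2, 2, 6)
```

Each head produces its own weight distribution over the sequence, which is how the pooled vector can draw on several positions at once. Changing `out_dim` changes the embedding size without touching the backbone.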
Comprehensive Evaluation on MTEB‑Code (12 Tasks)
C2LLM‑7B achieves an overall score of 80.75, ranking first on the leaderboard, while C2LLM‑0.5B sets a new record for sub‑1B models.
Task Categories
Text‑to‑Code: APPS, CosQA, CodeSearchNet – PMA enables precise matching between natural‑language intent and programming‑language implementation, showing robustness to informal queries.
Code QA: single‑turn (StackOverflowQA, CodeFeedbackST) and multi‑turn (CodeFeedbackMT) – PMA extracts key technical entities from dialogue history, ensuring highly relevant code suggestions.
Code‑to‑Code & Cross‑Language Retrieval: CodeTransOceanContest, CodeTransOceanDL, CodeSearchNetCCR – C2LLM captures cross‑language logical equivalence beyond keyword matching.
Structured Data & Version Control: SyntheticText2SQL (text‑to‑SQL), CodeEditSearch (text‑to‑code edit), COIRCodeSearchNet (code‑to‑text) – these tasks demonstrate retrieval precision for constrained syntax, diff‑style intent, and documentation retrieval.
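Whatever the task category, retrieval with an embedding model typically comes down to ranking candidates by similarity to the query embedding. The sketch below assumes cosine similarity and uses hand‑written toy vectors in place of real model outputs; `top_k` is an illustrative helper, not part of any released API.

```python
import numpy as np

def top_k(query_emb, corpus_emb, k=2):
    """Rank corpus rows by cosine similarity to the query and return the best k."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy embeddings: in practice these would come from the embedding model.
corpus_emb = np.array([[1.0, 0.0, 0.0],    # snippet 0
                       [0.0, 1.0, 0.0],    # snippet 1
                       [0.7, 0.7, 0.0]])   # snippet 2
query_emb = np.array([0.9, 0.1, 0.0])

hits = top_k(query_emb, corpus_emb)        # snippet 0 ranks first, then snippet 2
```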
Training Strategy
~3 million high‑quality samples covering diverse code‑related tasks.
Contrastive learning with LoRA fine‑tuning, global batch synchronization, and hard negative mining to improve discriminative ability.
Weighted checkpoint averaging to enhance cross‑language generalization.
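Two of these ingredients can be sketched in a few lines of NumPy. The loss below is a generic InfoNCE formulation with in‑batch negatives plus one mined hard negative per query, and the averaging function is a plain weighted mean over parameter dictionaries. The exact loss, temperature, and averaging weights used for C2LLM are not given in this article, so the value 0.05, the parameter name `pma.query`, and the function names are assumptions. In multi‑device training, the candidate matrix would additionally be gathered across devices so that every query sees the global batch.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(query, positive, hard_negative, temperature=0.05):
    """Contrastive loss; inputs are (batch, dim), one hard negative per query."""
    q, p, n = map(l2_normalize, (query, positive, hard_negative))
    candidates = np.concatenate([p, n], axis=0)          # (2 * batch, dim)
    logits = q @ candidates.T / temperature              # (batch, 2 * batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(q.shape[0])
    return -log_prob[idx, idx].mean()                    # matching positive sits at column i

def average_checkpoints(checkpoints, weights):
    """Weighted average of parameter dicts that share the same keys and shapes."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return {key: sum(w * ckpt[key] for w, ckpt in zip(weights, checkpoints))
            for key in checkpoints[0]}

rng = np.random.default_rng(0)
query = rng.normal(size=(4, 8))
positive = query + 0.01 * rng.normal(size=(4, 8))        # near-duplicates of the queries
hard_negative = rng.normal(size=(4, 8))
loss = info_nce(query, positive, hard_negative)

merged = average_checkpoints(
    [{"pma.query": np.array([1.0, 2.0])}, {"pma.query": np.array([3.0, 6.0])}],
    weights=[1.0, 3.0],
)
# merged["pma.query"] is [2.5, 5.0]
```

The loss falls when each query scores its own positive above every other candidate in the batch, which is what hard negatives make more demanding.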
Conclusions & Outlook
Improving the pooling mechanism unlocks substantial performance gains for code retrieval without dramatically increasing model size. The 0.5B model offers a cost‑effective alternative that rivals many 7B models, while the 7B model provides high‑precision retrieval for complex agent systems.
Resources
Paper/Technical Report: https://arxiv.org/abs/2512.21332
GitHub repository: codefuse-ai/CodeFuse-Embeddings
Hugging Face model cards: codefuse-ai/C2LLM-7B, codefuse-ai/C2LLM-0.5B