
How OASIS Achieves State‑of‑the‑Art Code Search with Just 5M Tokens

Kuaishou's Kwaipilot team unveiled OASIS, a 1.3B‑parameter code‑embedding model that, trained on only 5 million tokens, outperforms larger OpenAI embedding models across the CodeSearchNet, CoSQA, and AdvTest benchmarks, thanks to repository‑level program analysis, synthetic data generation, and a fused loss function.


What is Code Representation?

As codebases grow, developers rely on efficient code‑retrieval systems. Code Embedding converts code snippets into vector representations, enabling machines to understand code semantics for tasks such as search, repository‑level Q&A, and completion.
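To make this concrete, here is a minimal sketch of embedding‑based retrieval. The `embed` function below is a toy bag‑of‑words stand‑in invented for illustration; in a real pipeline the vectors would come from a model like OASIS, but the cosine‑similarity search on top of them is the same idea:

```python
import math
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def embed(text, vocab):
    # Toy stand-in for a code-embedding model: an L2-normalized
    # bag-of-words vector. A real system would call OASIS here.
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def search(query, snippets):
    # Rank snippets by cosine similarity (dot product of unit vectors).
    tokens = sorted({t for s in snippets + [query] for t in tokenize(s)})
    vocab = {t: i for i, t in enumerate(tokens)}
    q = embed(query, vocab)
    sims = [sum(a * b for a, b in zip(q, embed(s, vocab))) for s in snippets]
    return snippets[sims.index(max(sims))]

snippets = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
]
best = search("read a file from path", snippets)  # matches the read_file snippet
```

The query never appears verbatim in the winning snippet; retrieval works because the query and snippet vectors point in similar directions, which is exactly what a learned embedding model does at a far finer semantic granularity.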

What Innovations Does OASIS Use?

OASIS is trained on only 5M tokens yet surpasses state‑of‑the‑art models by combining several novel techniques:

Repository‑level program analysis: Leveraging function‑call graphs and dependency structures (from the Arise lab) to provide contextual information beyond isolated snippets.

OASIS‑instruct data synthesis: An algorithm that automatically generates high‑quality code‑natural‑language pairs for fine‑grained semantic learning.

Fusion loss function: A multi‑objective loss that simultaneously distinguishes similar samples and captures subtle semantic differences.
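The article does not spell out the exact formulation of the fused loss, but a multi‑objective loss of this kind can be sketched as a contrastive InfoNCE term (distinguishing similar samples) plus a margin term (capturing subtle differences). Everything below, including the weights and temperature, is an illustrative assumption, not the actual OASIS loss:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(q, pos, negs, temperature=0.05):
    # Contrastive term: pull the query embedding toward its paired code
    # snippet and push it away from negatives (softmax over similarities).
    sims = [dot(q, pos) / temperature] + [dot(q, n) / temperature for n in negs]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    return -math.log(exps[0] / sum(exps))

def margin_term(q, pos, negs, margin=0.2):
    # Fine-grained term: the positive must beat every negative by `margin`,
    # forcing the model to separate subtly different samples.
    return sum(max(0.0, margin - (dot(q, pos) - dot(q, n))) for n in negs)

def fused_loss(q, pos, negs, alpha=0.5):
    # Illustrative fusion of the two objectives; the weighting is assumed.
    return info_nce(q, pos, negs) + alpha * margin_term(q, pos, negs)

# Unit-length toy embeddings: pos is close to q, neg is orthogonal.
q, pos, neg = [1.0, 0.0], [0.9, 0.436], [0.0, 1.0]
good = fused_loss(q, pos, [neg])  # small: positive already well separated
bad = fused_loss(q, neg, [pos])   # large: the "positive" is the wrong match
```

Combining both terms lets one training signal handle coarse retrieval (rank the right snippet first) and fine discrimination (keep near‑duplicates apart by a margin) at the same time.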

How Strong Is OASIS?

Without using any test‑set data for training, OASIS outperforms existing models on the major benchmarks (CSN, CoSQA, AdvTest). On average it exceeds OpenAI's text‑embedding‑ada‑002 and even beats the larger CodeFuse‑CGE‑Small model despite having roughly one‑third the parameters.

Benchmark highlights:

CodeSearchNet (CSN): Over 200k code–documentation pairs across six languages; OASIS achieves the highest retrieval accuracy.

CoSQA: 20k+ human‑annotated query–code pairs reflecting real‑world search intent; OASIS leads the leaderboard.

AdvTest: A challenging set of ~20k samples designed to stress code understanding; OASIS again tops performance.
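Code‑search benchmarks like these are commonly scored with Mean Reciprocal Rank (MRR): for each query, take the reciprocal of the rank at which the correct snippet appears, then average. A minimal scorer for the standard setup, where query *i*'s ground‑truth code is candidate *i*, looks like this (a sketch, not the official evaluation harness):

```python
def mean_reciprocal_rank(sim_matrix):
    # sim_matrix[i][j] = similarity(query_i, code_j); the correct code
    # for query i sits at column i. MRR = mean of 1/rank of that column.
    reciprocal_ranks = []
    for i, row in enumerate(sim_matrix):
        order = sorted(range(len(row)), key=lambda j: -row[j])
        rank = order.index(i) + 1  # 1-based rank of the correct candidate
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

sims = [
    [0.9, 0.2, 0.1],  # query 0: correct code ranked 1st -> 1
    [0.8, 0.3, 0.1],  # query 1: correct code ranked 2nd -> 1/2
    [0.1, 0.2, 0.7],  # query 2: correct code ranked 1st -> 1
]
mrr = mean_reciprocal_rank(sims)  # (1 + 0.5 + 1) / 3 = 0.8333...
```

MRR rewards putting the right answer at rank 1, which is why small gains on it translate directly into fewer wasted clicks for developers searching a codebase.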

Application Scenarios

OASIS excels in several intelligent‑coding tasks:

Code search: Accurately matches developer queries to relevant snippets, prioritizing code that matches the project’s tech stack.

Code recommendation: Predicts API call sequences and full implementations, improving private‑dialect completion quality.

Intelligent code review: Detects functionally similar but differently implemented code, helping spot potential issues.

Semantic code understanding: Powers Kwaipilot’s RepoChat to extract key logic from legacy or third‑party libraries and generate concise descriptions and call‑graph visualizations.
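The code‑review scenario above can be sketched as a pairwise similarity scan: embed every function, then flag pairs whose vectors are nearly parallel as "same logic, different implementation" candidates. The function names, vectors, and threshold below are illustrative assumptions:

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def flag_similar_pairs(embeddings, threshold=0.9):
    # Flag function pairs whose embeddings point in nearly the same
    # direction: candidates for duplicated or divergent logic.
    names = list(embeddings)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine(embeddings[names[i]], embeddings[names[j]]) >= threshold:
                flagged.append((names[i], names[j]))
    return flagged

# Toy vectors standing in for OASIS embeddings of three functions.
embeddings = {
    "sum_loop": [0.9, 0.1],
    "sum_builtin": [0.88, 0.15],
    "parse_json": [0.1, 0.95],
}
pairs = flag_similar_pairs(embeddings)  # only the two sum variants are flagged
```

In a real review bot the quadratic scan would be replaced by an approximate‑nearest‑neighbor index, but the flagging logic is the same.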

Open Source and Future Outlook

The OASIS model and its code are fully open‑sourced on Hugging Face, inviting the community to fine‑tune or extend it.

Future plans include:

Release even stronger code‑embedding models.

Publish detailed technical reports and research findings.

Expand model applicability to more downstream scenarios.

Visit the Hugging Face repository at https://huggingface.co/Kwaipilot/OASIS-code-1.3B for downloads and further information.

Tags: open source, benchmark, AI model, code search, code embedding
Written by Kuaishou Large Model, Official Kuaishou Account