
PolyCoder: An Open‑Source 2.7B‑Parameter Code Generation Model Excelling in C Language

Carnegie Mellon researchers introduced PolyCoder, a 2.7‑billion‑parameter open‑source code generation model built on the GPT‑2 architecture, trained on 249 GB of multi‑language code. It outperforms Codex in C while remaining competitive across eleven other programming languages.


Carnegie Mellon University researchers released PolyCoder, an open‑source automatic code generation model with 2.7 billion parameters based on the GPT‑2 architecture. Trained on 249 GB of code spanning twelve programming languages, it outperforms all evaluated models, including Codex, in the C language.

In a quoted passage, the authors note that while large code language models have shown great promise, the most advanced ones, such as Codex, are not publicly available, leaving many design decisions opaque. They aim to fill this gap by systematically evaluating Codex alongside existing open‑source models such as GPT‑J, GPT‑Neo, GPT‑NeoX‑20B, and CodeParrot.
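Comparisons of code language models like these are typically reported in terms of perplexity on held‑out code: lower perplexity means the model assigns higher probability to the tokens that actually occur. A minimal sketch of the metric, computed with NumPy on made‑up per‑token log‑probabilities (the function name and the numbers are illustrative, not taken from the paper):

```python
import numpy as np

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return float(np.exp(-np.mean(token_log_probs)))

# Illustrative: a model that assigns probability 0.5 to every token
# in a sequence has a perplexity of exactly 2.
log_probs = np.log(np.full(10, 0.5))
print(round(perplexity(log_probs), 6))  # → 2.0
```

In practice the log-probabilities come from the model's softmax outputs over each file in an unseen evaluation set, so the metric rewards models that have genuinely learned a language's token statistics.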

The article also highlights that OpenAI’s Codex, released in August 2021, is accessible only via a non‑free API (notably powering GitHub Copilot), and that DeepMind’s AlphaCode required massive computational resources on the order of Google’s data centers.

Because the strongest models remain closed, their use is limited to well‑funded companies, restricting research opportunities for resource‑constrained organizations.

To address this, the team built PolyCoder from GitHub repositories covering C, C#, C++, Go, Java, JavaScript, PHP, Python, Ruby, Rust, Scala, and TypeScript. The unfiltered dataset totals 631 GB across 38.9 million files, reduced to the 249 GB training set after filtering. GPT‑2 was chosen as the backbone architecture due to budget constraints.
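As a GPT‑2‑style causal language model, PolyCoder generates code one token at a time by sampling from a softmax over its output logits, usually with a temperature knob that trades diversity against determinism. A minimal sketch of temperature sampling (the four‑token vocabulary and logit values are invented for illustration):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token id from softmax(logits / temperature).

    Lower temperature sharpens the distribution toward the argmax;
    higher temperature flattens it toward uniform.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                       # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

# Illustrative 4-token vocabulary; token 2 has the highest logit,
# so at low temperature it is chosen almost every time.
logits = [1.0, 0.5, 3.0, 0.1]
print(sample_next_token(logits, temperature=0.1))
```

At generation time this loop repeats: the sampled token is appended to the prompt and fed back in, which is how Codex‑style models complete a function from its signature and docstring.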

Although PolyCoder outperforms all evaluated models, including Codex, in C, Codex still leads in the other eleven languages. PolyCoder also beats the similarly sized GPT‑Neo 2.7B in JavaScript, Rust, Scala, and TypeScript, while all open‑source models lag behind Codex in the remaining languages.

For full details, see the arXiv paper: https://arxiv.org/pdf/2202.13169.pdf .

Amused netizens joked that, given the model’s strength in C, “first we have to kill C”.

Tags: code generation, AI, open-source, large language model, C programming, GPT-2, PolyCoder
Written by

IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.
