
CODEI/O: Leveraging Code to Train Large Language Models for Enhanced Reasoning

The DeepSeek team introduced CODEI/O, a massive dataset that converts code into natural‑language reasoning chains, and demonstrated that training large language models on this data markedly improves performance across diverse reasoning tasks, including non‑code domains, through a two‑stage training strategy.

DataFunTalk

The DeepSeek research team proposed a novel approach that transforms source code into explicit reasoning processes, constructing the CODEI/O dataset with over 3 million training instances and using it to train models such as Qwen and Llama.

To build the dataset, they gathered more than 800K code files (primarily Python) from sources such as CodeMix and PyEdu‑R, standardized them with DeepSeek‑V2.5, added unified entry functions, rule‑based input generators, and concise problem statements, ultimately producing around 400K structured code documents and 3.5M input‑output‑reasoning samples.
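The standardized structure described above can be sketched as follows. This is a minimal, hypothetical example of what one transformed file might look like — the function, generator, and query shown are invented for illustration, not taken from the actual dataset:

```python
import random

# Hypothetical example of the unified structure each transformed file follows:
# a main entry function, a rule-based input generator, and a concise query.

def main_solution(numbers: list[int]) -> int:
    """Return the length of the longest strictly increasing run in `numbers`."""
    best = run = 1
    for prev, cur in zip(numbers, numbers[1:]):
        run = run + 1 if cur > prev else 1
        best = max(best, run)
    return best

def input_generator(seed: int) -> dict:
    """Rule-based generator producing random valid inputs for main_solution."""
    rng = random.Random(seed)
    return {"numbers": [rng.randint(0, 9) for _ in range(rng.randint(1, 10))]}

query = ("Given a list of integers, what is the length of its "
         "longest strictly increasing run?")

if __name__ == "__main__":
    args = input_generator(seed=0)
    print(args, "->", main_solution(**args))
```

Because every file shares this shape, arbitrary input-output pairs can be produced automatically by sampling the generator and executing the entry function.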

Each sample combines the formatted function definition, a natural‑language description of its purpose, reference code, and the relevant input or output, which are fed to DeepSeek‑V2.5 to generate a natural‑language chain‑of‑thought (CoT) that explains how the output is derived from the input.
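A rough sketch of how such a sample could be assembled into a prompt for CoT generation is below. The template wording and field names are assumptions for illustration; the paper's exact prompt format is not reproduced here:

```python
import json

# Hypothetical assembly of one CODEI/O-style sample before it is sent to the
# CoT-generating model. `direction` selects between the two task variants:
# predicting the output from a given input, or inferring a feasible input
# from a given output.

def build_cot_prompt(query: str, reference_code: str,
                     given: dict, direction: str) -> str:
    given_label = "Input" if direction == "predict_output" else "Output"
    task = ("predict the function's output" if direction == "predict_output"
            else "infer a feasible input")
    return "\n".join([
        f"Problem: {query}",
        f"Reference code:\n{reference_code}",
        f"{given_label}: {json.dumps(given)}",
        f"Explain step by step, in natural language, how to {task}.",
    ])
```

The key point is that the model must articulate the reasoning in natural language rather than merely emit code, which is what makes the resulting chains transferable to non-code tasks.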

Building on CODEI/O, the team created CODEI/O++ by adding a verification and revision loop: generated responses are re‑executed, incorrect outputs trigger feedback that is incorporated into a second generation round, and the final response consists of the first answer, first‑round feedback, second answer, and second‑round feedback, yielding higher‑quality data.
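The verify-and-revise loop can be sketched minimally as follows, with `generate` standing in for the LLM call (a hypothetical stub) and verification done by comparing the predicted output against the ground truth obtained from re-execution:

```python
# Minimal sketch of the CODEI/O++ verification-and-revision loop.
# `generate(feedback=...)` is a placeholder for the model call; each call
# returns {"text": <full response>, "prediction": <predicted output>}.

def verify(predicted, ground_truth) -> bool:
    """Check a prediction against the re-executed ground-truth output."""
    return predicted == ground_truth

def revise_loop(generate, ground_truth) -> list:
    """Return the final multi-turn training response described in the article:
    first answer, first-round feedback, second answer, second-round feedback."""
    first = generate(feedback=None)
    if verify(first["prediction"], ground_truth):
        return [first["text"], "correct"]
    feedback1 = f"Your predicted output {first['prediction']!r} is wrong."
    second = generate(feedback=feedback1)
    verdict = ("correct" if verify(second["prediction"], ground_truth)
               else "still incorrect")
    return [first["text"], feedback1, second["text"], verdict]
```

Keeping the failed first attempt and its feedback in the final response, rather than discarding them, is what turns execution errors into additional training signal.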

Model training follows a two‑stage strategy: first, models are trained on CODEI/O or CODEI/O++ to acquire strong reasoning abilities; second, they are fine‑tuned on a general instruction dataset to follow natural‑language commands and perform varied tasks.
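The two-stage schedule can be expressed as a simple sequential plan. The dataset file names and the single training function below are assumptions made for the sketch, not details from the paper:

```python
# Hypothetical sketch of the two-stage training schedule: stage 1 trains on
# CODEI/O or CODEI/O++ reasoning data, stage 2 fine-tunes on general
# instruction data. Dataset names are placeholders.

STAGES = [
    {"name": "stage1_reasoning",   "dataset": "codeio_plus.jsonl", "epochs": 1},
    {"name": "stage2_instruction", "dataset": "general_sft.jsonl", "epochs": 1},
]

def run_schedule(train_fn, model):
    """Apply each stage in order; `train_fn(model, dataset, epochs)` is a
    placeholder for one supervised fine-tuning pass."""
    for stage in STAGES:
        model = train_fn(model, stage["dataset"], epochs=stage["epochs"])
    return model
```

Separating the stages lets the reasoning ability be instilled first, so the later instruction tuning does not have to trade capability against format-following.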

Four models—Qwen 2.5‑7B‑Coder, DeepSeek v2‑Lite‑Coder, Llama 3.1‑8B, and Gemma 2‑27B—were evaluated on more than ten benchmarks covering commonsense, mathematics, code, physics, and engineering. All models showed comprehensive performance gains, confirming that reasoning skills learned from code transfer effectively to non‑code tasks, with notable improvements such as a ~150% boost for Llama on LeetCode‑O.

The first author, Junlong Li, is a master's student at Shanghai Jiao Tong University interning at DeepSeek, supervised by Prof. Junxian He of the Hong Kong University of Science and Technology; core DeepSeek researcher Daya Guo also contributed. The paper is available on arXiv, with code and dataset links provided.

Tags: large language models, model evaluation, dataset, AI training, code reasoning, CODEI/O
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
