CODEI/O: Leveraging Code to Train Large Language Models for Enhanced Reasoning
The DeepSeek team introduced CODEI/O, a large-scale dataset that converts code into natural-language reasoning chains, and demonstrated that training large language models on this data, via a two-stage training strategy, markedly improves their performance on diverse reasoning tasks, including non-code domains.
The DeepSeek research team proposed a novel approach that transforms source code into explicit reasoning processes, constructing the CODEI/O dataset with over 3 million training instances and using it to train models such as Qwen and Llama.
To build the dataset, they gathered more than 800K code files (primarily Python) from sources such as CodeMix and PyEdu-R, then standardized them with DeepSeek-V2.5: each file was rewritten around a unified entry function, paired with a rule-based input generator, and given a concise problem statement. This ultimately produced around 400K structured code documents and 3.5M input-output-reasoning samples.
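The standardized form described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact schema: the function name `main_solution`, the generator name, and the toy task are all assumptions.

```python
import random

def main_solution(numbers):
    """Reference code behind a unified entry function:
    return the running maximum of a list of integers."""
    result, best = [], float("-inf")
    for n in numbers:
        best = max(best, n)
        result.append(best)
    return result

def input_generator(rng=random):
    """Rule-based generator that produces valid keyword
    arguments for the entry function."""
    length = rng.randint(1, 8)
    return {"numbers": [rng.randint(-50, 50) for _ in range(length)]}

# Executing the entry function on a generated input yields one
# input-output pair for the dataset.
sample_input = input_generator()
sample_output = main_solution(**sample_input)
```

Because every file exposes the same entry-function-plus-generator shape, input-output pairs can be harvested automatically at scale.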
Each sample combines the formatted function definition, a natural‑language description of its purpose, reference code, and the relevant input or output, which are fed to DeepSeek‑V2.5 to generate a natural‑language chain‑of‑thought (CoT) that explains how the output is derived from the input.
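A prompt for this CoT-generation step might be assembled roughly as below. The field names and wording are hypothetical; the point is only that the model sees the description, the reference code, and one side of the input-output pair, and must reason its way to the other side.

```python
def build_prompt(description, reference_code, given_kind, given_value, predict):
    """Assemble one CODEI/O-style query: given either the input or the
    output of the reference code, ask for the other with reasoning."""
    return (
        f"Problem: {description}\n"
        f"Reference code:\n{reference_code}\n"
        f"Given {given_kind}: {given_value}\n"
        f"Predict the {predict} and explain your reasoning step by step."
    )

prompt = build_prompt(
    description="Return the running maximum of a list of integers.",
    reference_code="def main_solution(numbers): ...",
    given_kind="input",
    given_value="[3, 1, 4]",
    predict="output",
)
```

Swapping `given_kind` and `predict` yields the reverse task (output-to-input), so each pair can be used in both directions.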
Building on CODEI/O, the team created CODEI/O++ by adding a verification and revision loop: generated responses are re‑executed, incorrect outputs trigger feedback that is incorporated into a second generation round, and the final response consists of the first answer, first‑round feedback, second answer, and second‑round feedback, yielding higher‑quality data.
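The verify-and-revise loop above can be sketched as follows, assuming a `generate(prompt)` call standing in for the LLM and string comparison standing in for the paper's actual answer checking, both of which are simplifications.

```python
def execute_check(reference_fn, given_input, predicted_output):
    """Re-execute the reference code and compare with the prediction."""
    actual = reference_fn(**given_input)
    ok = str(actual) == str(predicted_output)
    feedback = "correct" if ok else f"wrong: expected {actual}"
    return ok, feedback

def verify_and_revise(generate, reference_fn, prompt, given_input):
    """One CODEI/O++-style round of generation, execution feedback,
    and optional regeneration."""
    first = generate(prompt)
    ok, fb1 = execute_check(reference_fn, given_input, first)
    if ok:
        return [first, fb1]
    # Second round: fold the execution feedback back into the prompt.
    second = generate(prompt + "\nPrevious attempt failed: " + fb1)
    _, fb2 = execute_check(reference_fn, given_input, second)
    # The final CODEI/O++ response keeps all four parts:
    # first answer, first feedback, second answer, second feedback.
    return [first, fb1, second, fb2]
```

Keeping the failed attempt and its feedback in the final sample is what distinguishes CODEI/O++ from simply filtering out wrong answers.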
Model training follows a two‑stage strategy: first, models are trained on CODEI/O or CODEI/O++ to acquire strong reasoning abilities; second, they are fine‑tuned on a general instruction dataset to follow natural‑language commands and perform varied tasks.
Four models (Qwen 2.5-7B-Coder, DeepSeek-V2-Lite-Coder, Llama 3.1-8B, and Gemma 2-27B) were evaluated on more than ten benchmarks covering commonsense, mathematics, code, physics, and engineering. All models showed comprehensive performance gains, confirming that reasoning skills learned from code transfer effectively to non-code tasks, with notable improvements such as a ~150% boost for Llama 3.1-8B on LeetCode-O.
The first author, Junlong Li, is a master's student at Shanghai Jiao Tong University interning at DeepSeek, supervised by Prof. Junxian He of the Hong Kong University of Science and Technology; DeepSeek core researcher Daya Guo also contributed. The paper is available on arXiv, with code and dataset links provided.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.