How OpenCoder’s RefineCode Dataset Powers Next‑Gen Code LLMs
The OpenCoder technical report details the creation of the RefineCode dataset, its multi‑stage preprocessing, filtering, and sampling pipelines, the pre‑training and fine‑tuning schedules for 1.5B and 8B models, and the autonomous data selection methods that together achieve performance comparable to Qwen2.5‑Coder.
Overview
OpenCoder released two model sizes (1.5B and 8B) in both base and instruction‑tuned variants. The models are hosted on HuggingFace (https://huggingface.co/OpenCoder-LLM) and aim for performance comparable to Qwen2.5‑Coder.
1. Pre‑training Data – RefineCode
RefineCode combines raw code crawled from GitHub (up to November 2023) with non-GitHub code from The Stack v2 and additional code-related web data.
1.1 Raw code processing
Preprocessing – remove non-text files, keep only files with programming-language extensions (see https://github.com/github-linguist/linguist/blob/main/lib/linguist/languages.yml), and drop over-sized or low-quality files.
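This screen can be expressed as a short filter; the extension subset and size cap below are illustrative assumptions, not the report's exact values:

```python
from pathlib import Path

# Illustrative subset; the full list in the report is derived from
# github-linguist's languages.yml.
CODE_EXTENSIONS = {".py", ".java", ".c", ".cpp", ".h", ".go", ".rs", ".js", ".ts"}
MAX_FILE_BYTES = 8 * 1024 * 1024  # assumed size cap for over-sized files

def keep_file(path: Path) -> bool:
    """Return True if a file passes the extension, size, and text screens."""
    if path.suffix.lower() not in CODE_EXTENSIONS:
        return False
    if path.stat().st_size > MAX_FILE_BYTES:
        return False
    try:
        path.read_text(encoding="utf-8")  # drop non-text (undecodable) files
    except (UnicodeDecodeError, OSError):
        return False
    return True
```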
Deduplication – exact duplicate removal via SHA-256 hashing, followed by fuzzy deduplication using MinHash (5-gram shingles, 2048 hash functions) with LSH (16 bands × 128 rows). Repository-level deduplication retains roughly three times as many tokens as file-level deduplication, though file-level deduplication gave slightly better performance for the 1.5B model.
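A minimal sketch of the two deduplication stages using the `datasketch` library, with the parameters quoted above (SHA-256, 5-gram MinHash with 2048 permutations, 16 bands × 128 rows); treat it as an illustration, not the report's implementation:

```python
import hashlib
from datasketch import MinHash, MinHashLSH

def sha256_key(content: str) -> str:
    """Exact-duplicate key: SHA-256 over the raw file content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def minhash_of(content: str, num_perm: int = 2048) -> MinHash:
    """MinHash over 5-gram word shingles, as described above."""
    m = MinHash(num_perm=num_perm)
    tokens = content.split()
    for i in range(max(len(tokens) - 4, 1)):
        m.update(" ".join(tokens[i:i + 5]).encode("utf-8"))
    return m

# 16 bands x 128 rows = 2048 permutations.
lsh = MinHashLSH(num_perm=2048, params=(16, 128))
seen_exact = set()

def is_duplicate(doc_id: str, content: str) -> bool:
    key = sha256_key(content)
    if key in seen_exact:          # stage 1: exact duplicate
        return True
    seen_exact.add(key)
    m = minhash_of(content)
    if lsh.query(m):               # stage 2: near duplicate via LSH
        return True
    lsh.insert(doc_id, m)
    return False
```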
Transformation – detect and rewrite problematic patterns such as copyright headers.
Filtering – apply heuristic rules inspired by Textbooks Are All You Need to drop non‑self‑contained, poorly structured, or non‑standard snippets.
Filtering rules are grouped into:
Natural Language Filtering Rules (size, line count, etc.).
General Code Filtering Rules (variable count, function length, etc.).
Language‑Specific Filtering Rules (e.g., the frequency of `pass` statements in Python, `goto` usage in C).
Thresholds are initially set heuristically and then refined by comparing model perplexity (PPL) on filtered versus retained data.
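The three rule tiers might be expressed roughly as follows; every threshold below is a placeholder, not one of the report's tuned values:

```python
def passes_filters(text: str, language: str) -> bool:
    lines = text.splitlines()

    # Natural-language rules: overall size and line counts (placeholder bounds).
    if not (1 <= len(lines) <= 100_000):
        return False
    if len(text) > 5_000_000:
        return False

    # General code rules: e.g., average line length as a structure proxy.
    avg_line_len = len(text) / max(len(lines), 1)
    if avg_line_len > 100:
        return False

    # Language-specific rules (placeholder thresholds).
    if language == "python":
        pass_ratio = sum(l.strip() == "pass" for l in lines) / max(len(lines), 1)
        if pass_ratio > 0.05:
            return False
    if language == "c":
        if sum("goto" in l for l in lines) > 10:
            return False
    return True
```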
After deduplication and filtering, over-represented languages are down-sampled (e.g., Java from 409 GB to 200 GB, HTML from 213 GB to 64 GB), yielding ~730 B tokens of raw code. Together with the code-related web data described next, the full RefineCode corpus contains about 960 B tokens with a balanced language distribution.
1.2 Code‑related web data
Inspired by DeepSeekMath, OpenCoder harvested code‑related web data using an AutoDS‑style pipeline:
Select 500 k high‑quality code‑like snippets from CommonCrawl.
Train a fastText classifier (on the hqcode dataset) to score web pages (see the sketch after this list).
Domain-level recall – domains where more than roughly 10 % of pages score as code-related are treated as code domains, and their remaining pages are recalled as well.
Iterative manual labeling of recalled pages and classifier re-training expanded the set to 220 GB; applying the same pipeline to FineWeb, Skypile, and AutoMathText extended it to 330 GB, plus 178 GB of code-related text from GitHub.
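A sketch of the fastText scoring step using the `fasttext` Python package; the training-file path, hyperparameters, and score threshold are assumptions:

```python
import fasttext

# Training file format: one page per line, labeled
# "__label__code ..." or "__label__other ...".
# "train.txt" is a placeholder path for the labeled seed data.
model = fasttext.train_supervised(input="train.txt", epoch=5, wordNgrams=2)

def is_code_related(page_text: str, threshold: float = 0.5) -> bool:
    """Score a web page; fastText's predict expects single-line input."""
    labels, probs = model.predict(page_text.replace("\n", " "))
    return labels[0] == "__label__code" and probs[0] >= threshold
```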
1.3 Annealing data
During the annealing phase, 84 % of data preserves the original RefineCode distribution to avoid catastrophic forgetting, while 16 % adds high‑quality algorithmic and synthetic data:
Algorithmic corpus sampled from pre-training data containing keywords such as `leetcode` or `def solution`.
Synthetic data includes:
High-Quality Code Snippets generated by prompting LLMs (e.g., the Phi series) to create self-contained functions with test cases; only snippets whose tests pass execution are retained (see the execution-filtering sketch after this list).
Code Textbooks generated by Qwen2‑72B‑Instruct on the hqcode dataset, where the model analyses code and explains related knowledge.
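The execution-based filter for the synthetic snippets might look like the following sketch; the timeout and subprocess setup are assumptions, since the report does not specify its sandbox:

```python
import os
import subprocess
import sys
import tempfile

def passes_execution(snippet_with_tests: str, timeout_s: int = 10) -> bool:
    """Run a self-contained snippet (function plus its test cases) in a
    subprocess and keep it only if it exits cleanly; a real pipeline
    would sandbox this step."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(snippet_with_tests)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```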
2. Pre‑training Schedule
Learning-rate schedule: WSD (warmup-stable-decay); see the sketch after this list.
Warmup steps: 2000.
Maximum sequence length: 8192 (first 130 k steps used 4096).
Global batch size: 1024 sequences (≈8 M tokens per batch; the earlier 4096-length steps used batch size 2048).
Learning rate: 3e‑4.
Training was performed on 512 H100 GPUs for 187.5 hours.
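A sketch of a WSD schedule with the numbers listed above (2000 warmup steps, peak LR 3e-4); the total-step count, decay window, and LR floor are assumptions, since the report's exact decay settings are not quoted here:

```python
def wsd_lr(step: int,
           peak_lr: float = 3e-4,
           warmup_steps: int = 2000,
           total_steps: int = 250_000,      # assumed
           decay_steps: int = 20_000,       # assumed annealing window
           min_lr: float = 3e-5) -> float:  # assumed floor
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr                       # stable plateau
    frac = (step - decay_start) / decay_steps
    return peak_lr + (min_lr - peak_lr) * min(frac, 1.0)
```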
3. Post‑training (Instruction Tuning) Data
3.1 Instruction data
The instruction-tuning corpus consists of four components:
Open‑source instruction data (Evol‑Instruct, Infinity‑Instruct, McEval) plus a binary classifier to extract code‑related entries from Infinity‑Instruct.
Educational instruction synthesis: a scoring model selects high‑quality seed tasks; prompts generate Python tasks, analyses, solutions, and test cases.
Package‑related instruction synthesis: prompts create up‑to‑date problems and solutions for the latest library APIs, requiring self‑contained problem statements.
Large-scale diverse instruction synthesis (MAmmoTH2-style approach): seed sentences from web data, random task specification (language, difficulty, type), LLM generation, execution-based filtering, and code annotation; the task-specification step is sketched below.
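The random task-specification step in the last item might look like this sketch; the attribute lists and prompt wording are illustrative, not the report's:

```python
import random

LANGUAGES = ["python", "java", "cpp", "go"]        # illustrative
DIFFICULTIES = ["easy", "medium", "hard"]
TASK_TYPES = ["bug fixing", "algorithm design", "API usage", "refactoring"]

def build_synthesis_prompt(seed_sentence: str) -> str:
    """Turn a web-data seed sentence into a randomized instruction-generation
    prompt for an LLM; generated outputs are then execution-filtered."""
    spec = {
        "language": random.choice(LANGUAGES),
        "difficulty": random.choice(DIFFICULTIES),
        "type": random.choice(TASK_TYPES),
    }
    return (
        f"Based on the following sentence, write a {spec['difficulty']} "
        f"{spec['type']} task in {spec['language']}, with a solution and "
        f"test cases.\nSeed: {seed_sentence}"
    )
```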
3.2 Fine‑tuning Procedure
Fine‑tuning is split into two stages:
Stage 1 focuses on theoretical knowledge (computer fundamentals, algorithms, data structures).
Stage 2 emphasizes practical coding tasks.
Data composition for each stage is illustrated in the original report (image omitted).
4. Autonomous Data Selection (AutoDS)
AutoDS scores data without human labels by prompting an LLM to answer binary (YES/NO) questions about each sample. Multiple binary scores are multiplied to obtain a continuous relevance score.
<system>
You are ChatGPT, equipped with extensive expertise in mathematics and coding, and skilled in complex reasoning and problem‑solving. In the following task, I will present a text excerpt from a website. Your role is to evaluate whether this text exhibits mathematical intelligence and if it is suitable for educational purposes. Please respond with only YES or NO.
</system>
User: {"url": "{url}", "text": "{text}"}
1. Does the text exhibit elements of mathematical intelligence? Respond with YES or NO.
2. Is the text suitable for educational purposes? Respond with YES or NO.
Assistant: 1.
The assistant turn is prefilled with "1." so the model completes both answers in one pass. Qwen‑72B‑base is used as the scoring model. Applying AutoDS to multiple source datasets improves the quality of the final training corpus, as shown by downstream performance gains (image omitted).
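Following the AutoMathText-style formulation, each question's score is the softmax of the YES logit against the NO logit at the answer position, and the per-question scores are multiplied. A sketch with HuggingFace transformers; the checkpoint (a small stand-in for Qwen-72B-base), the single-token YES/NO handling, and the second-question continuation are simplifying assumptions:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the report uses Qwen-72B-base as the scorer.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B", torch_dtype=torch.bfloat16
)

# First sub-token of each answer word (a simplification).
YES_ID = tok.encode("YES", add_special_tokens=False)[0]
NO_ID = tok.encode("NO", add_special_tokens=False)[0]

@torch.no_grad()
def lm_score(prompt: str) -> float:
    """P(YES) / (P(YES) + P(NO)) at the next-token position."""
    ids = tok(prompt, return_tensors="pt").input_ids
    logits = model(ids).logits[0, -1]
    z_yes, z_no = logits[YES_ID].item(), logits[NO_ID].item()
    return 1.0 / (1.0 + math.exp(z_no - z_yes))  # stable two-way softmax

def autods_score(base_prompt: str) -> float:
    """Multiply the per-question binary scores into one relevance score.
    base_prompt ends with the prefilled "Assistant: 1."; the second question
    is scored after appending an answer and "2." (simplified here)."""
    s1 = lm_score(base_prompt)
    s2 = lm_score(base_prompt + " YES\n2.")
    return s1 * s2
```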
5. Summary
OpenCoder introduces a detailed data cleaning, deduplication, and synthesis pipeline that yields higher‑quality pre‑training data and improves LLM performance on code tasks. The RefineCode dataset and associated pipelines are intended to be released openly for the research community.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.