How Continual Pre‑Training Boosts Llama‑3’s Chinese and Scientific Reasoning
This report presents a continual pre‑training (CPT) approach that significantly enhances Llama‑3 (8B)'s Chinese language proficiency and scientific reasoning using a carefully mixed corpus of existing and synthetic data. It details the bilingual adaptation and synthetic‑enhancement stages, the data‑mixing and curriculum strategies, and strong results across multilingual and scientific benchmarks, all achieved without sacrificing the model's original capabilities.
Introduction
Large language models have driven major AI progress, yet they often suffer knowledge gaps in specific domains. Llama‑3, pre‑trained mainly on English data, performs poorly on Chinese tasks and shows limited multidisciplinary scientific knowledge. Continual pre‑training (CPT) can address these gaps, but catastrophic forgetting—loss of previously learned abilities—remains a key challenge, especially under limited training budgets.
This technical report describes a CPT pipeline that markedly improves Llama‑3 (8B)’s Chinese language ability and scientific reasoning while preserving its original capabilities. The authors create high‑quality synthetic data, design a data‑mixing strategy, and introduce a curriculum based on perplexity. The resulting model, named Llama‑3‑SynE (Synthetic‑data‑Enhanced Llama‑3), is evaluated on numerous benchmarks, showing large gains (e.g., +8.81 on C‑Eval, +6.31 on CMMLU, +12.00 on MATH, +4.13 on SciEval) without degrading English performance.
Method
The CPT process consists of two stages: a bilingual adaptation stage and a synthetic‑enhancement stage.
Bilingual adaptation stage
Goal: improve Chinese ability while maintaining or improving existing skills.
Data ratio: Chinese‑to‑English corpus proportion set at 2:8.
Data strategies:
Topic‑based data mixing – manually label topics, train a classifier to identify the topic of web data, and dynamically adjust topic proportions to keep competence balanced (see the first sketch after this list).
Perplexity‑based curriculum – gradually increase data complexity according to the model's perplexity scores (see the second sketch after this list).
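The report does not include reference code, but the mixing step can be pictured as weighted sampling over labeled data pools. Below is a minimal Python sketch, assuming documents have already been tagged with a language and a topic by an upstream classifier; the pool structure, the `TOPIC_WEIGHTS` table, and the fixed weights are illustrative placeholders, not the authors' actual configuration.

```python
import random
from collections import defaultdict

# Illustrative proportions only; the paper's exact topic weights are not public.
LANG_WEIGHTS = {"zh": 0.2, "en": 0.8}                       # bilingual-adaptation ratio (2:8)
TOPIC_WEIGHTS = {"science": 0.3, "web": 0.5, "code": 0.2}   # hypothetical topic mix

def build_pools(docs):
    """Group documents by (language, topic) tags produced by an upstream classifier."""
    pools = defaultdict(list)
    for doc in docs:
        pools[(doc["lang"], doc["topic"])].append(doc["text"])
    return pools

def sample_batch(pools, batch_size):
    """Draw a training batch whose composition follows the language and topic weights."""
    batch = []
    for _ in range(batch_size):
        lang = random.choices(list(LANG_WEIGHTS), weights=LANG_WEIGHTS.values())[0]
        topic = random.choices(list(TOPIC_WEIGHTS), weights=TOPIC_WEIGHTS.values())[0]
        pool = pools.get((lang, topic))
        if pool:  # skip silently if a (lang, topic) pool is empty
            batch.append(random.choice(pool))
    return batch
```

In the actual pipeline, the topic proportions would be adjusted dynamically during training rather than held fixed as here.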
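Likewise, the perplexity‑based curriculum can be approximated by scoring each document with the current model and feeding lower‑perplexity (easier) data first. The sketch below uses the Hugging Face `transformers` API; the three‑stage split and the 2048‑token truncation are assumptions for illustration, not details taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def doc_perplexity(model, tokenizer, text, device="cuda", max_len=2048):
    """Score one document: exp of the mean next-token loss under the current model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_len).to(device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

def build_curriculum(docs, model, tokenizer, n_stages=3):
    """Order documents from low to high perplexity and split into curriculum stages."""
    scored = sorted(docs, key=lambda d: doc_perplexity(model, tokenizer, d))
    stage = max(1, len(scored) // n_stages)
    return [scored[i:i + stage] for i in range(0, len(scored), stage)]

# model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B").cuda()
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# stages = build_curriculum(corpus, model, tokenizer)  # train on stages[0], then stages[1], ...
```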
Synthetic‑enhancement stage
Goal: boost multidisciplinary scientific reasoning and code ability.
Data ratio: Chinese : English : synthetic adjusted to 1:7:2.
Data synthesis:
Scientific QA data – generate question‑answer pairs from scientific web content using LLMs.
Code QA data – create new programming problems and solutions from existing code‑question banks (a combined sketch follows this list).
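The report gives only the high‑level recipe for data synthesis, but both QA sources follow the same pattern: prompt an LLM with a seed (a scientific passage or an existing coding question) and collect the generated pair. A hedged sketch, where `llm_generate` and both prompt templates are hypothetical stand‑ins, not the authors' actual prompts or backend:

```python
SCIENCE_PROMPT = (
    "Read the following scientific passage and write one exam-style question "
    "about it, followed by a correct, step-by-step answer.\n\nPassage:\n{passage}"
)
CODE_PROMPT = (
    "Here is an existing programming problem:\n{problem}\n\n"
    "Write a new, related problem of similar difficulty and provide a working solution."
)

def llm_generate(prompt: str) -> str:
    """Hypothetical wrapper around an LLM backend (API call or local model)."""
    raise NotImplementedError("plug in your generation backend here")

def synthesize_science_qa(passages):
    """Turn scientific web passages into synthetic QA pairs for CPT."""
    return [llm_generate(SCIENCE_PROMPT.format(passage=p)) for p in passages]

def synthesize_code_qa(problems):
    """Derive new programming problems and solutions from an existing question bank."""
    return [llm_generate(CODE_PROMPT.format(problem=q)) for q in problems]
```

Generated pairs would still need quality filtering (e.g., deduplication and answer verification) before entering the 1:7:2 mixture.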
Details of the CPT pipeline are provided in the accompanying paper.
Experiments and Results
Initial exploratory experiments were conducted on the smaller TinyLlama model to validate the effectiveness of synthetic data and to study data quality, proportion, and curriculum strategies. Findings from TinyLlama guided the CPT of Llama‑3.
Comprehensive evaluation of Llama‑3‑SynE on over a dozen standard and scientific benchmarks demonstrates:
Main benchmarks
On Chinese benchmarks (C‑Eval, CMMLU) Llama‑3‑SynE outperforms the base Llama‑3, confirming strong gains in Chinese language ability.
On English benchmarks (MMLU, MATH, code tasks), performance is comparable to or better than that of the base model, indicating effective mitigation of catastrophic forgetting.
Scientific benchmarks
On scientific tests (SciEval, GaoKao, ARC), Llama‑3‑SynE shows notable improvements, especially on Chinese science tasks (e.g., GaoKao biology sub‑score +25.71).
Figures in the original paper illustrate these performance gains.
It is worth noting that early experiments revealed difficulty in preserving English and code performance when adapting the model to Chinese, likely due to distribution shifts between pre‑training and CPT data. The proposed method balances new and existing abilities effectively.
Conclusion
The study introduces an efficient CPT method that, through carefully designed data selection, mixing, and curriculum strategies, substantially enhances Llama‑3 (8B)’s Chinese language proficiency and scientific reasoning while retaining its original capabilities. Experimental results validate the method’s effectiveness and efficiency, offering valuable guidance for CPT under limited training budgets.
Paper: https://arxiv.org/abs/2407.18743
GitHub repository: https://github.com/RUC-GSAI/Llama-3-SynE