How Continual Pre‑Training Boosts Llama‑3’s Chinese and Scientific Reasoning
This report presents a continual pre‑training (CPT) approach that significantly enhances Llama‑3 (8B)'s Chinese language proficiency and scientific reasoning using a carefully mixed corpus of existing and synthetic data. It details the bilingual adaptation and synthetic‑enhancement stages, the data‑mixing and curriculum strategies, and strong results across multilingual and scientific benchmarks, all achieved without sacrificing the model's original capabilities.
Introduction
Large language models have driven major AI progress, yet they often suffer knowledge gaps in specific domains. Llama‑3, pre‑trained mainly on English data, performs poorly on Chinese tasks and shows limited multidisciplinary scientific knowledge. Continual pre‑training (CPT) can address these gaps, but catastrophic forgetting—loss of previously learned abilities—remains a key challenge, especially under limited training budgets.
This technical report describes a CPT pipeline that markedly improves Llama‑3 (8B)’s Chinese language ability and scientific reasoning while preserving its original capabilities. The authors create high‑quality synthetic data, design a data‑mixing strategy, and introduce a curriculum based on perplexity. The resulting model, named Llama‑3‑SynE (Synthetic‑data‑Enhanced Llama‑3), is evaluated on numerous benchmarks, showing large gains (e.g., +8.81 on C‑Eval, +6.31 on CMMLU, +12.00 on MATH, +4.13 on SciEval) without degrading English performance.
Method
The CPT process consists of two stages: a bilingual adaptation stage and a synthetic‑enhancement stage.
Bilingual adaptation stage
Goal: improve Chinese ability while maintaining or improving existing skills.
Data ratio: Chinese‑to‑English corpus proportion set at 2:8.
Data strategies:
Topic‑based data mixing – manually label topics, train a classifier to identify the topic of web data, and dynamically adjust topic proportions to keep competence balanced (see the first sketch after this list).
Perplexity‑based curriculum – gradually increase data complexity according to the model's perplexity scores (see the second sketch after this list).
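The report does not include reference code, but the mixing step can be pictured as weighted sampling over labeled data pools. Below is a minimal Python sketch, assuming documents have already been tagged with a language and a topic by an upstream classifier; the pool structure, the `TOPIC_WEIGHTS` table, and the fixed weights are illustrative placeholders, not the authors' actual configuration.

```python
import random
from collections import defaultdict

# Illustrative proportions only; the paper's exact topic weights are not public.
LANG_WEIGHTS = {"zh": 0.2, "en": 0.8}                       # bilingual-adaptation ratio (2:8)
TOPIC_WEIGHTS = {"science": 0.3, "web": 0.5, "code": 0.2}   # hypothetical topic mix

def build_pools(docs):
    """Group documents by (language, topic) tags produced by an upstream classifier."""
    pools = defaultdict(list)
    for doc in docs:
        pools[(doc["lang"], doc["topic"])].append(doc["text"])
    return pools

def sample_batch(pools, batch_size):
    """Draw a training batch whose composition follows the language and topic weights."""
    batch = []
    for _ in range(batch_size):
        lang = random.choices(list(LANG_WEIGHTS), weights=LANG_WEIGHTS.values())[0]
        topic = random.choices(list(TOPIC_WEIGHTS), weights=TOPIC_WEIGHTS.values())[0]
        pool = pools.get((lang, topic))
        if pool:  # skip silently if a (lang, topic) pool is empty
            batch.append(random.choice(pool))
    return batch
```

In the actual pipeline, the topic proportions would be adjusted dynamically during training rather than held fixed as here.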
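Likewise, the perplexity‑based curriculum can be approximated by scoring each document with the current model and feeding lower‑perplexity (easier) data first. The sketch below uses the Hugging Face `transformers` API; the three‑stage split and the 2048‑token truncation are assumptions for illustration, not details taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def doc_perplexity(model, tokenizer, text, device="cuda", max_len=2048):
    """Score one document: exp of the mean next-token loss under the current model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_len).to(device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

def build_curriculum(docs, model, tokenizer, n_stages=3):
    """Order documents from low to high perplexity and split into curriculum stages."""
    scored = sorted(docs, key=lambda d: doc_perplexity(model, tokenizer, d))
    stage = max(1, len(scored) // n_stages)
    return [scored[i:i + stage] for i in range(0, len(scored), stage)]

# model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B").cuda()
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# stages = build_curriculum(corpus, model, tokenizer)  # train on stages[0], then stages[1], ...
```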
Synthetic‑enhancement stage
Goal: boost multidisciplinary scientific reasoning and code ability.
Data ratio: Chinese : English : synthetic adjusted to 1:7:2.
Data synthesis:
Scientific QA data – generate question‑answer pairs from scientific web content using LLMs.
Code QA data – create new programming problems and solutions from existing code‑question banks (a combined sketch follows this list).
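The report gives only the high‑level recipe for data synthesis, but both QA sources follow the same pattern: prompt an LLM with a seed (a scientific passage or an existing coding question) and collect the generated pair. A hedged sketch, where `llm_generate` and both prompt templates are hypothetical stand‑ins, not the authors' actual prompts or backend:

```python
SCIENCE_PROMPT = (
    "Read the following scientific passage and write one exam-style question "
    "about it, followed by a correct, step-by-step answer.\n\nPassage:\n{passage}"
)
CODE_PROMPT = (
    "Here is an existing programming problem:\n{problem}\n\n"
    "Write a new, related problem of similar difficulty and provide a working solution."
)

def llm_generate(prompt: str) -> str:
    """Hypothetical wrapper around an LLM backend (API call or local model)."""
    raise NotImplementedError("plug in your generation backend here")

def synthesize_science_qa(passages):
    """Turn scientific web passages into synthetic QA pairs for CPT."""
    return [llm_generate(SCIENCE_PROMPT.format(passage=p)) for p in passages]

def synthesize_code_qa(problems):
    """Derive new programming problems and solutions from an existing question bank."""
    return [llm_generate(CODE_PROMPT.format(problem=q)) for q in problems]
```

Generated pairs would still need quality filtering (e.g., deduplication and answer verification) before entering the 1:7:2 mixture.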
Details of the CPT pipeline are provided in the accompanying paper.
Experiments and Results
Initial exploratory experiments were conducted on the smaller TinyLlama model to validate the effectiveness of synthetic data and to study data quality, proportion, and curriculum strategies. Findings from TinyLlama guided the CPT of Llama‑3.
Comprehensive evaluation of Llama‑3‑SynE on over a dozen standard and scientific benchmarks demonstrates:
Main benchmarks
On Chinese benchmarks (C‑Eval, CMMLU) Llama‑3‑SynE outperforms the base Llama‑3, confirming strong gains in Chinese language ability.
On English benchmarks (MMLU, MATH, code tasks), performance is comparable to or better than that of the base model, indicating effective mitigation of catastrophic forgetting.
Scientific benchmarks
On scientific tests (SciEval, GaoKao, ARC), Llama‑3‑SynE shows notable improvements, especially on Chinese science tasks (e.g., GaoKao biology sub‑score +25.71).
Figures in the original paper illustrate these performance gains.
It is worth noting that early experiments revealed difficulty in preserving English and code performance when adapting the model to Chinese, likely due to distribution shifts between pre‑training and CPT data. The proposed method balances new and existing abilities effectively.
Conclusion
The study introduces an efficient CPT method that, through carefully designed data selection, mixing, and curriculum strategies, substantially enhances Llama‑3 (8B)’s Chinese language proficiency and scientific reasoning while retaining its original capabilities. Experimental results validate the method’s effectiveness and efficiency, offering valuable guidance for CPT under limited training budgets.
Paper: https://arxiv.org/abs/2407.18743
GitHub repository: https://github.com/RUC-GSAI/Llama-3-SynE