How InternLM 3.0 Achieves High Performance with Just 4 Trillion Training Tokens
Shanghai AI Laboratory's InternLM 3.0 upgrade demonstrates that a refined 4-trillion-token dataset can push a large-language model's performance beyond that of open-source peers trained on roughly 18 trillion tokens, cutting training cost by over 75% while unifying regular dialogue and deep reasoning in a single model.
Scaling‑law Regime and Data Efficiency
Large‑model performance follows a scaling‑law relationship where compute and data volume are primary drivers. Shanghai AI Laboratory introduces Intelligence Quality per Token (IQPT) as a metric that captures the ratio of average model performance to the amount of training data. Higher IQPT indicates that each token contributes more learning signal, implying that improving data quality can yield larger gains than merely increasing data size.
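The IQPT idea can be made concrete with a small sketch. The function name and the illustrative benchmark numbers below are assumptions for demonstration, not the lab's published formula or results; the point is only that dividing average benchmark performance by training-token count makes data efficiency directly comparable across models.

```python
# Hypothetical illustration of an IQPT-style ratio: average benchmark
# score divided by training tokens, expressed per trillion tokens so the
# numbers stay readable. Scores here are invented for illustration.

def iqpt(avg_benchmark_score: float, training_tokens: float) -> float:
    """Performance delivered per trillion training tokens."""
    return avg_benchmark_score / (training_tokens / 1e12)

# Illustrative comparison: a model scoring 70 on a refined 4T-token corpus
# versus one scoring 68 on an unrefined 18T-token corpus.
refined = iqpt(70.0, 4e12)    # 17.5 score points per trillion tokens
baseline = iqpt(68.0, 18e12)  # ~3.78 score points per trillion tokens
print(refined / baseline)     # the refined corpus yields a ~4.6x ratio
```

Under these made-up numbers the refined model ends up with an IQPT several times higher despite a similar raw score, which is the shape of the claim made for InternLM 3.0 versus Llama 3.1.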
InternLM 3.0 (InternLM3‑8B‑Instruct)
Released on 15 January, InternLM 3.0 is trained on a refined corpus of only 4 trillion tokens. Using this data-refinement pipeline, the model attains effectiveness comparable to open-source models trained on roughly 18 trillion tokens, saving more than 75% of training cost. IQPT measured on InternLM 3.0 is over four times higher than that of Llama 3.1, demonstrating a superior performance-per-token ratio.
Data‑Refinement Framework
Intelligent Data Processing : The raw corpus is partitioned into millions of domains. Autonomous agents perform large‑scale quality inspection, learn from error cases, and apply domain‑specific handling, enabling fine‑grained filtering without manual effort.
High‑Value Data Synthesis : A general model generates candidate data. Candidates are filtered through a tree‑search strategy and multi‑dimensional quality checks (e.g., logical consistency, factual correctness, diversity). The vetted data are then used to fine‑tune a specialist model.
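The two stages above can be sketched as a simple filter pipeline. Everything below is an illustrative stand-in: the `Candidate` class, the check names, the thresholds, and the scoring rules are invented for demonstration; the real pipeline uses autonomous agents and model-based, domain-specific checks rather than these toy heuristics.

```python
# Minimal sketch of domain-tagged candidates passing multi-dimensional
# quality checks before being kept for specialist fine-tuning. All
# thresholds and check functions here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Candidate:
    text: str
    domain: str
    scores: dict = field(default_factory=dict)

def quality_checks(c: Candidate) -> Candidate:
    # Stand-ins for checks such as logical consistency, factual
    # correctness, and diversity; real checks would be model-driven.
    words = c.text.split()
    c.scores["length_ok"] = 1.0 if 20 <= len(c.text) <= 2000 else 0.0
    c.scores["non_repetitive"] = (
        1.0 if len(set(words)) / max(len(words), 1) > 0.5 else 0.0
    )
    return c

def refine(corpus: list[Candidate], threshold: float = 1.5) -> list[Candidate]:
    """Keep only candidates whose combined quality score clears the bar."""
    return [c for c in map(quality_checks, corpus)
            if sum(c.scores.values()) >= threshold]
```

Per-domain handling would plug in here by dispatching on `c.domain` to domain-specific check sets, which is the fine-grained filtering the article describes.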
Post‑Training Instruction Synthesis
A world‑knowledge‑tree‑driven pipeline creates tens of thousands of high‑quality instruction examples. Multi‑agent generation extracts real‑world user intents, classifies them into fine‑grained task categories, and synthesizes instruction‑response pairs. These examples are employed for further fine‑tuning, markedly improving conversational fluency.
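The knowledge-tree-driven generation can be pictured as a tree walk that emits categorized instruction stubs. The tree contents, category naming, and instruction template below are invented for illustration; the actual pipeline is multi-agent and synthesizes full instruction-response pairs from real user intents.

```python
# Sketch: traverse a (toy) world-knowledge tree, emitting one instruction
# stub per leaf topic, tagged with its fine-grained task category path.
# Tree contents and the template string are illustrative assumptions.
knowledge_tree = {
    "science": {"physics": ["Newton's laws"], "chemistry": ["acids and bases"]},
    "daily life": {"cooking": ["boiling eggs"]},
}

def synthesize(tree: dict, path: tuple = ()):
    for key, value in tree.items():
        if isinstance(value, dict):
            # Internal node: descend, extending the category path.
            yield from synthesize(value, path + (key,))
        else:
            # Leaf node: emit one instruction stub per topic.
            for topic in value:
                yield {
                    "category": "/".join(path + (key,)),
                    "instruction": f"Explain {topic} to a beginner.",
                }

examples = list(synthesize(knowledge_tree))
```

Scaling the same traversal over a large knowledge tree, with model-generated responses and quality filtering on top, yields the tens of thousands of fine-tuning examples the article mentions.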
Evaluation Methodology and Results
Using the open‑source OpenCompass benchmark suite, InternLM 3.0 was evaluated on more than ten authoritative test sets, including CMMLU, GPQA, mathematics, coding, instruction following, long‑text handling, and dialogue. The evaluation protocol follows the reproducible procedures documented by OpenCompass.
Across the majority of benchmarks, InternLM 3.0 outperforms peer open-source models of similar scale and approaches the performance of GPT-4o-mini. For example, on GPQA the model scores 3.2 points higher than Llama 3.1, while on CMMLU it exceeds the baseline by 4.5 percentage points of accuracy.
General‑Specialist Integration (通专融合)
InternLM 3.0 adopts a "general-specialist integration" architecture: a single model switches between a regular dialogue mode and a deep-thinking mode via the system prompt, eliminating the need for separate specialist models. The earlier InternThinker model excelled at deep reasoning but lacked conversational fluency; InternLM 3.0 resolves this trade-off by jointly training on both data streams and selecting the mode at inference time.
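Mode selection via the system prompt can be sketched at the message-construction level. The system-prompt strings below are placeholders: the official deep-thinking prompt shipped with InternLM3-8B-Instruct is documented in its model card, and the real prompt differs from these stand-ins.

```python
# Sketch of prompt-level mode switching: the same checkpoint serves both
# modes, and only the system message changes. The prompt texts here are
# placeholders, not the official InternLM3 system prompts.
def build_messages(user_query: str, deep_thinking: bool) -> list[dict]:
    system = (
        "You are an expert reasoner; think step by step before answering."
        if deep_thinking
        else "You are a helpful conversational assistant."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_query},
    ]

chat_msgs = build_messages("Recommend a weekend hike.", deep_thinking=False)
think_msgs = build_messages("Solve this arrow-maze puzzle.", deep_thinking=True)
```

In practice these message lists would be passed through the model's chat template (e.g. a tokenizer's `apply_chat_template`) before generation; the key point is that no model swap occurs between modes.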
Demo Scenarios
Solving an arrow‑maze path‑finding puzzle that requires spatial reasoning and algorithmic planning.
Playing classic number‑guessing games, demonstrating multi‑turn logical deduction.
Executing multi‑step web‑browsing tasks with more than 20 navigation steps, showcasing integrated browsing capability.
Release Artifacts
Model code, checkpoints, and training scripts are publicly available:
GitHub: https://github.com/InternLM/InternLM
HuggingFace: https://huggingface.co/internlm
ModelScope: https://www.modelscope.cn/models/Shanghai_AI_Laboratory/internlm3-8b-instruct