InternLM 3.0: Boosting Model Performance with Only 4 TB of Training Data

Shanghai AI Laboratory’s InternLM 3.0 upgrade demonstrates that refining data quality—measured as intelligence‑per‑token—can replace massive datasets, achieving higher reasoning and dialogue capabilities with just 4 TB of tokens, cutting training cost by over 75 % while approaching GPT‑4‑level performance.

AIWalker
AIWalker
AIWalker
InternLM 3.0: Boosting Model Performance with Only 4 TB of Training Data

On January 15, Shanghai AI Laboratory released InternLM 3.0 (InternLM3‑8B‑Instruct), a major version upgrade that leverages a refined data‑processing framework to dramatically improve data efficiency. By training on only 4 TB of tokens, the model matches or exceeds the performance of open‑source models that use up to 18 TB, reducing training cost by more than 75 %.

The research team introduced the concept of “thinking density” (IQPT – Intelligence Quality per Token) as a metric for data quality. IQPT is defined as the ratio of average model performance to the amount of training data, capturing logical richness, complexity, and insightfulness of the data. Compared with leading open‑source models, InternLM 3.0’s IQPT is over four times higher than the baseline Llama 3.1.

The data‑refinement framework consists of two core components:

Intelligent data processing: The corpus is split into tens of millions of domains. An autonomous agent performs large‑scale automatic quality inspection, learns from error cases, and applies domain‑specific customizations that would be infeasible for human annotators.

High‑value data synthesis: Using a “general‑specialist fusion” approach, a general model iteratively generates synthetic data, which is then filtered and used to train a specialist model. The pipeline employs tree‑search strategies and multi‑dimensional quality verification to produce abundant, reliable high‑value samples.

Evaluation was carried out with the open‑source OpenCompass benchmark, applying a reproducible protocol across more than ten authoritative test suites such as CMMLU and GPQA. The assessment covered reasoning, mathematics, coding, instruction following, long‑text handling, dialogue, and overall performance. InternLM 3.0 consistently outperformed same‑size open‑source competitors and approached the performance of GPT‑4o‑mini.

A key research goal was the fusion of deep reasoning and ordinary dialogue within a single model. InternLM 3.0 is the first general‑purpose model to combine these capabilities, enabling one‑click switching via system prompts. This was achieved by merging data from both reasoning‑heavy and conversational domains, and by fine‑tuning on a massive, multi‑task instruction set derived from a World Knowledge Tree and multi‑agent generated responses.

In the post‑training stage, the team built a task‑driven synthetic data pipeline that extracts real‑world instructions, augments them with generated variants, and classifies them into dozens of fine‑grained scenarios, resulting in hundreds of thousands of high‑quality instruction examples that further improve dialogue experience.

Demonstrations show the model solving complex puzzles such as an arrow‑maze navigation task—requiring spatial reasoning and algorithmic planning—where even OpenAI’s o1 model struggles. The model also handles classic number‑guessing games and performs multi‑step web‑browsing tasks, completing over 20 page transitions to recommend real‑estate listings.

Beyond algorithmic advances, the laboratory emphasizes open‑source collaboration. InternLM’s code, model checkpoints, and evaluation scripts are hosted on GitHub (https://github.com/InternLM/InternLM), Hugging Face (https://huggingface.co/internlm), and ModelScope (https://www.modelscope.cn/models/Shanghai_AI_Laboratory/internlm3-8b-instruct). Partnerships with hardware vendors such as Ascend, Cambricon, and Muxi enable efficient fine‑tuning and inference on emerging accelerators.

Evaluation chart
Evaluation chart

Overall, InternLM 3.0 showcases how a data‑centric approach—focusing on quality rather than sheer quantity—can break the scaling‑law bottleneck, delivering high‑performance, versatile AI with substantially lower resource consumption.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

large language modelModel EvaluationAI researchInternLMdata efficiency
AIWalker
Written by

AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.