Why Phi‑4’s 14B Model Outperforms GPT‑4 on STEM and Reasoning Tasks

Microsoft Research’s Phi‑4 model, a 14‑billion‑parameter LLM, leverages extensive synthetic data, advanced tokenization, and a two‑stage training pipeline to achieve superior performance on STEM question answering, long‑context reasoning, and safety benchmarks, rivaling larger models like GPT‑4.

Baobao Algorithm Notes

Overview

Phi‑4 is a 14‑billion‑parameter decoder‑only Transformer released by Microsoft Research. It achieves stronger STEM question‑answering and reasoning than GPT‑4 despite its smaller size, primarily due to high‑quality synthetic data and a two‑stage training pipeline.

References

https://arxiv.org/abs/2412.08905
https://huggingface.co/microsoft/phi-4
https://ollama.com/library/phi4

Data Processing

Pre‑training relies on 50 synthetic datasets (≈400 B unweighted tokens) generated via multi‑agent prompting, self‑revision, and instruction reversal. Synthetic data provides:

Structured progressive learning: challenges are ordered for efficient acquisition.

Alignment with inference format: training examples match the format used at inference time.

High‑quality organic Q&A and web sources are filtered and used as seeds for synthetic generation.
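The self‑revision step in this pipeline can be pictured as a simple loop. In the sketch below, `critique` and `revise` are stubs standing in for LLM calls, and the round count is illustrative; the real pipeline also includes multi‑agent prompting and instruction reversal, which are not modeled here.

```python
def self_revise(draft, critique, revise, rounds=2):
    """Self-revision loop used when generating synthetic training data:
    a model critiques its own draft, then rewrites it in light of the
    critique. `critique` and `revise` are stubs standing in for LLM
    calls; the number of rounds is illustrative."""
    for _ in range(rounds):
        feedback = critique(draft)
        draft = revise(draft, feedback)
    return draft


# Stubbed usage: a real pipeline would call a language model here.
final = self_revise(
    "draft answer",
    critique=lambda d: "add a worked example",
    revise=lambda d, f: d + " [revised: " + f + "]",
)
```

Seeding the generator with filtered organic Q&A, as described above, amounts to choosing the initial `draft` (or prompt) from curated sources rather than from scratch.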

Pre‑training

Model architecture

Decoder‑only Transformer, 14 B parameters, default context 4 096 tokens, extended to 16 K during mid‑training. Tokenizer: tiktoken (enhanced multilingual support). Full attention up to 4 K tokens (no sliding window).

Training schedule

~10 trillion tokens processed with linear warm‑up and decay. Peak learning rate = 0.0003, weight decay = 0.1, global batch size = 5 760. Mid‑training expands context to 16 K, consumes 250 B tokens, raises RoPE base frequency to 250 K, and reduces learning rate by a factor of ten.
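The warm‑up/decay schedule can be sketched as a step‑to‑learning‑rate function. The peak learning rate (0.0003) is from the text; the warm‑up length and the zero decay floor are assumptions of this sketch.

```python
def lr_at(step, total_steps, warmup_steps, peak_lr=3e-4):
    """Linear warm-up to peak_lr, then linear decay toward zero.
    Peak LR (0.0003) is from the article; the warm-up length and a
    zero decay floor are assumptions of this sketch."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * (1.0 - frac)
```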

Data composition

Token‑level mix:

Synthetic data: 40 %
Web‑rewritten data: 15 %
Filtered web data: 15 %
Targeted organic data (papers, books, forums): 10 %
Code data: 20 %
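A minimal sketch of drawing training examples according to this token‑level mix; the source‑name keys are illustrative labels, not the paper's internal dataset names.

```python
import random

# Token-level mixture weights from the article (keys are illustrative).
MIX = {
    "synthetic": 0.40,
    "web_rewritten": 0.15,
    "filtered_web": 0.15,
    "targeted_organic": 0.10,
    "code": 0.20,
}

def sample_source(rng):
    """Pick a data source with probability proportional to its token share."""
    r = rng.random()
    cum = 0.0
    for name, weight in MIX.items():
        cum += weight
        if r < cum:
            return name
    return name  # guard against floating-point round-off
```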

Mid‑training for long context

Newly filtered long‑context samples (30 %) are mixed with the original pre‑training data (70 %). Natural long‑context sources outperform artificially created ones. The RoPE base frequency is increased to 250 K and the learning rate reduced tenfold.
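The effect of raising the RoPE base frequency can be seen from the standard inverse‑frequency formula; a larger base slows the rotation of the low‑frequency dimensions, so positions remain distinguishable over the longer 16 K context. A small sketch, assuming the conventional formulation of rotary embeddings:

```python
def rope_inv_freq(head_dim, base):
    """Per-dimension inverse frequencies for rotary position embeddings
    (RoPE): inv_freq[i] = base**(-2*i / head_dim). Raising the base from
    the common 10_000 to 250_000 stretches the lowest frequencies,
    which is what enables the 16 K context in mid-training."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


short_ctx = rope_inv_freq(64, 10_000.0)    # conventional base
long_ctx = rope_inv_freq(64, 250_000.0)    # mid-training base
```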

Long‑context evaluation

Evaluated with the HELMET suite at 8 K and 16 K context lengths on recall, RAG, re‑ranking, in‑context learning (ICL), QA, and summarization. Phi‑4 excels on recall and RAG tasks.

Post‑training

Supervised fine‑tuning (SFT)

~8 B tokens covering mathematics, coding, reasoning, dialogue, model identity, and safety; multilingual data in 40 languages.

Direct Preference Optimization (DPO)

Two rounds of DPO:

Round 1 (pivotal‑token‑driven): generates high‑impact preference pairs using the Pivotal Token Search (PTS) algorithm.

Round 2 (standard DPO): broader scenarios and safety data.
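For reference, the standard DPO objective used in both rounds can be written per preference pair as follows; `beta` here is a typical value, not one reported in the article.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given the summed
    log-probabilities of the chosen and rejected responses under the
    policy and the frozen reference model:
        -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
    The loss falls as the policy favors the chosen response more than
    the reference does."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```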

Pivotal Token Search (PTS)

Identifies tokens that strongly influence output correctness by sampling completions, recursively splitting sequences to isolate the responsible tokens, and building DPO preference pairs around them. This yields precise optimization with reduced noise.
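The recursive-splitting idea can be sketched as a bisection over prefix lengths. This is a structural sketch only: `p_success(prefix_len)` stands in for Monte‑Carlo rollouts from that prefix (an expensive LLM‑sampling step in the real method), and the paper's procedure has details not modeled here.

```python
def pivotal_tokens(tokens, p_success, threshold=0.2):
    """Sketch of Pivotal Token Search: recursively bisect a generated
    sequence to find single tokens whose inclusion shifts the estimated
    probability of a correct final answer by more than `threshold`.
    `p_success(n)` estimates success probability when generation is
    resumed after the first n tokens (stubbed here; rollouts in the
    real algorithm)."""
    found = []

    def split(lo, hi, p_lo, p_hi):
        if abs(p_hi - p_lo) <= threshold:
            return                      # no pivotal token in this span
        if hi - lo == 1:                # one token explains the shift
            found.append((hi - 1, tokens[hi - 1], p_hi - p_lo))
            return
        mid = (lo + hi) // 2
        p_mid = p_success(mid)
        split(lo, mid, p_lo, p_mid)
        split(mid, hi, p_mid, p_hi)

    split(0, len(tokens), p_success(0), p_success(len(tokens)))
    return found
```

Each discovered token (with its probability shift) can then anchor a chosen/rejected pair for DPO.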

Hallucination mitigation

SFT and DPO data teach the model to refuse answering uncertain questions rather than fabricate answers. Seed questions come from TriviaQA, GPT‑4‑generated correct/incorrect answers, and refusal examples.

Ablation results

Key findings:

SFT improves most benchmarks, especially math and coding.

First‑round DPO (PTS) boosts reasoning‑heavy tasks (GPQA, MATH).

Second‑round DPO improves GPT‑4‑scored tasks (ArenaHard).

Combined DPO rounds provide complementary gains.

Metric improvements: MMLU +2.0, GPQA +8.8, MATH +3.3, HumanEval +3.1.

Benchmark considerations

Standard academic benchmarks have known limitations; an internal PhiBench suite is maintained to probe critical reasoning and knowledge skills.

Safety

Phi‑4 follows Microsoft’s responsible AI principles, with post‑training safety alignment, red‑team testing, and automated evaluations. It performs strongly on RAI safety benchmarks.

Limitations

Remaining issues include occasional factual hallucinations and difficulty adhering to strict formatting instructions.

Conclusion

The combination of large‑scale synthetic data, mid‑training for extended context, and targeted post‑training (SFT, DPO, PTS) demonstrates that compact LLMs can rival much larger models on reasoning, coding, and safety tasks. Future work will focus on further reducing hallucinations and improving format compliance.

Tags: benchmarking, AI safety, synthetic data, Phi‑4, training techniques
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
