Why Phi‑4’s 14B Model Outperforms GPT‑4 on STEM and Reasoning Tasks
Microsoft Research's Phi-4, a 14-billion-parameter LLM, leverages extensive synthetic data, the tiktoken tokenizer, and a multi-stage pipeline of pre-training, mid-training, and post-training to achieve strong performance on STEM question answering, long-context reasoning, and safety benchmarks, rivaling much larger models such as GPT-4.
Overview
Phi-4 is a 14-billion-parameter decoder-only Transformer released by Microsoft Research. Despite its smaller size, it achieves stronger STEM question answering and reasoning than its teacher model GPT-4 on benchmarks such as GPQA and MATH, primarily due to high-quality synthetic data and a carefully staged training pipeline.
References
https://arxiv.org/abs/2412.08905
https://huggingface.co/microsoft/phi-4
https://ollama.com/library/phi4
Data Processing
Pre-training relies on 50 broad types of synthetic datasets (≈400 B unweighted tokens) generated via techniques such as multi-agent prompting, self-revision, and instruction reversal (see the sketch below). Synthetic data provides:
Structured, progressive learning: challenges are ordered so skills are acquired efficiently.
Alignment with inference format: training examples match the format the model sees at inference time.
High-quality organic Q&A and web sources are filtered and used as seeds for synthetic generation.
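To make two of these generation techniques concrete, here is a minimal sketch of a self-revision loop and of instruction reversal. The `generate` function, the prompts, and the helper names are illustrative assumptions, not Phi-4's actual pipeline.

```python
# Hypothetical sketch of synthetic-data generation via self-revision and
# instruction reversal. `generate` stands in for any LLM completion call.

def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call (e.g., an API client)."""
    raise NotImplementedError

def self_revise(task: str, rounds: int = 2) -> str:
    """Draft an answer, then iteratively critique and rewrite it."""
    draft = generate(f"Solve the following problem step by step:\n{task}")
    for _ in range(rounds):
        critique = generate(
            f"Problem:\n{task}\n\nDraft answer:\n{draft}\n\n"
            "List any errors or unclear steps in the draft."
        )
        draft = generate(
            f"Problem:\n{task}\n\nDraft answer:\n{draft}\n\n"
            f"Critique:\n{critique}\n\nRewrite the answer, fixing the issues."
        )
    return draft

def reverse_instruction(code_snippet: str) -> tuple[str, str]:
    """Instruction reversal: synthesize a prompt for which an existing
    artifact (here, code) is a good answer, yielding a (prompt, answer) pair."""
    prompt = generate(
        "Write a programming exercise whose ideal solution is exactly "
        f"this code:\n{code_snippet}"
    )
    return prompt, code_snippet
```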
Pre‑training
Model architecture
Decoder-only Transformer with 14 B parameters; default context length 4 096 tokens, extended to 16 K during mid-training. Tokenizer: tiktoken, chosen for improved multilingual support. Attention is full over the 4 K pre-training context (no sliding window).
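The released checkpoint can be loaded from the Hugging Face repository referenced above. A minimal usage sketch with the `transformers` library; the dtype and device settings are suggestions, not requirements:

```python
# Loading the released Phi-4 checkpoint with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # spread layers across available devices
)

messages = [{"role": "user", "content": "What is the derivative of x^2?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```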
Training schedule
~10 trillion tokens processed with linear warm‑up and decay. Peak learning rate = 0.0003, weight decay = 0.1, global batch size = 5 760. Mid‑training expands context to 16 K, consumes 250 B tokens, raises RoPE base frequency to 250 K, and reduces learning rate by a factor of ten.
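A minimal sketch of a linear warm-up/decay schedule using the quoted peak learning rate; the warm-up fraction and step counts below are illustrative assumptions, since the exact values are not given here.

```python
# Linear warm-up to the peak LR, then linear decay to zero.
PEAK_LR = 3e-4  # peak learning rate quoted above

def lr_at(step: int, total_steps: int, warmup_steps: int) -> float:
    if step < warmup_steps:                 # ramp up linearly to the peak
        return PEAK_LR * step / warmup_steps
    remaining = total_steps - warmup_steps  # decay linearly to zero
    return PEAK_LR * max(0.0, (total_steps - step) / remaining)

# Example: 10k-step schedule with 5% warm-up (illustrative numbers)
print(lr_at(250, 10_000, 500))    # mid warm-up: 1.5e-4
print(lr_at(5_250, 10_000, 500))  # mid decay: ~1.5e-4
```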
Data composition
Token‑level mix:
Synthetic data 40 %
Web‑rewritten data 15 %
Filtered web data 15 %
Targeted organic data (papers, books, forums) 10 %
Code data 20 %
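As a sketch, the token-level mix can be read as sampling weights over data sources. The sampler below is illustrative, not the paper's actual data loader:

```python
# Sampling pre-training documents according to the token-level mixture above.
import random

MIXTURE = {
    "synthetic": 0.40,
    "web_rewrites": 0.15,
    "filtered_web": 0.15,
    "targeted_organic": 0.10,
    "code": 0.20,
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source with probability proportional to its token share."""
    return rng.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(100_000):
    counts[sample_source(rng)] += 1
print(counts)  # proportions approach 40/15/15/10/20
```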
Mid‑training for long context
The mid-training mixture combines 30 % newly filtered long-context samples with 70 % of the original pre-training data. Natural long-context sources outperform artificially lengthened ones. As noted above, the RoPE base frequency is increased to 250 K and the learning rate reduced tenfold.
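To see why raising the RoPE base frequency helps at 16 K, the sketch below compares rotation angles at a distant position under the common 10 K default versus 250 K. The head dimension and positions are illustrative assumptions:

```python
# How the RoPE base frequency changes per-dimension rotation angles.
# A larger base slows the slowest-rotating dimensions, so positions far
# beyond the original 4K window remain distinguishable.
def rope_angles(position: int, head_dim: int, base: float) -> list[float]:
    """Rotation angle for each (even) dimension pair at a given position."""
    return [
        position / base ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]

# Compare the slowest-rotating pair at position 16_000:
short = rope_angles(16_000, head_dim=128, base=10_000.0)[-1]
long = rope_angles(16_000, head_dim=128, base=250_000.0)[-1]
print(f"base 10k: {short:.3f} rad, base 250k: {long:.3f} rad")
```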
Long‑context evaluation
Evaluated with the HELMET suite at 8 K and 16 K context lengths on recall, RAG, re‑ranking, in‑context learning (ICL), QA, and summarization. Phi‑4 excels on recall and RAG tasks.
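For intuition about the recall task, here is an illustrative needle-in-a-haystack probe. This is not HELMET's actual harness; `ask_model` and the needle format are placeholders.

```python
# Bury a key-value "needle" at a relative depth in long filler text and
# check whether the model retrieves it.

def ask_model(prompt: str) -> str:
    """Placeholder for a long-context model call."""
    raise NotImplementedError

def recall_probe(filler: str, depth: float, total_chars: int) -> bool:
    needle = "The secret code is 7F3K9."
    haystack = (filler * (total_chars // len(filler) + 1))[:total_chars]
    cut = int(len(haystack) * depth)          # insertion point by depth
    context = haystack[:cut] + "\n" + needle + "\n" + haystack[cut:]
    answer = ask_model(context + "\n\nWhat is the secret code?")
    return "7F3K9" in answer
```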
Post‑training
Supervised fine‑tuning (SFT)
~8 B tokens covering mathematics, coding, reasoning, dialogue, model identity, and safety; multilingual data in 40 languages.
Direct Preference Optimization (DPO)
Two rounds of DPO (a minimal loss sketch follows the list):
Round 1 (pivotal-token driven): generates targeted preference pairs using the Pivotal Token Search (PTS) algorithm.
Round 2 (standard DPO): broader scenarios plus additional safety data.
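For reference, a minimal sketch of the standard DPO objective on a single preference pair; `beta` and the example log-probabilities are illustrative:

```python
# Per-pair DPO loss: -log sigmoid(beta * margin), where the margin compares
# how much more the policy prefers the chosen response than the reference does.
import math

def dpo_loss(
    logp_chosen: float, logp_rejected: float,          # policy log-probs
    ref_logp_chosen: float, ref_logp_rejected: float,  # frozen reference log-probs
    beta: float = 0.1,
) -> float:
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Numerically this is -log sigmoid(beta * margin); small when the
    # policy already prefers the chosen answer.
    return math.log(1.0 + math.exp(-beta * margin))

print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # policy prefers chosen: low loss
```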
Pivotal Token Search (PTS)
PTS identifies tokens that pivotally influence the probability of producing a correct answer: it estimates the success probability of each prefix by sampling completions, recursively subdivides the sequence to locate tokens where that probability shifts sharply, and builds DPO preference pairs around the discovered tokens. This yields a precise optimization signal with reduced noise.
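A simplified sketch of the subdivision step, assuming a `p_success` estimator (in the paper this comes from sampling completions and grading them); the threshold and pair-construction details are simplified:

```python
# Pivotal Token Search by recursive binary subdivision.
from typing import Callable, List, Tuple

def pivotal_tokens(
    tokens: List[str],
    p_success: Callable[[List[str]], float],  # P(correct answer | prefix)
    threshold: float = 0.2,
) -> List[Tuple[int, float]]:
    """Return (index, delta) for tokens that shift the success probability
    by more than `threshold`."""
    found: List[Tuple[int, float]] = []

    def search(lo: int, hi: int, p_lo: float, p_hi: float) -> None:
        if abs(p_hi - p_lo) < threshold:
            return                               # no pivotal token in this span
        if hi - lo == 1:
            found.append((lo, p_hi - p_lo))      # single token caused the shift
            return
        mid = (lo + hi) // 2
        p_mid = p_success(tokens[:mid])          # probability given half the span
        search(lo, mid, p_lo, p_mid)
        search(mid, hi, p_mid, p_hi)

    search(0, len(tokens), p_success([]), p_success(tokens))
    return found
```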
Hallucination mitigation
SFT and DPO data teach the model to refuse to answer questions it is uncertain about rather than fabricate answers. Seed questions are drawn from TriviaQA, paired with GPT-4-generated correct and incorrect answers as well as refusal examples.
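A hedged sketch of how such preference pairs might be assembled; `judge_correct`, the refusal string, and the field names are illustrative assumptions:

```python
# Build a hallucination-mitigation DPO pair: when the model's answer is
# judged wrong, an honest refusal is preferred over the fabrication.

def judge_correct(question: str, answer: str, gold: str) -> bool:
    """Placeholder for an LLM-as-judge correctness check (e.g., GPT-4)."""
    raise NotImplementedError

REFUSAL = "I'm not sure about that, and I'd rather not guess."

def build_pair(question: str, model_answer: str, gold: str) -> dict:
    if judge_correct(question, model_answer, gold):
        # Correct answers should be preferred over refusing.
        return {"prompt": question, "chosen": model_answer, "rejected": REFUSAL}
    # Fabricated answers lose to an honest refusal.
    return {"prompt": question, "chosen": REFUSAL, "rejected": model_answer}
```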
Ablation results
Key findings:
SFT improves most benchmarks, especially math and coding.
First‑round DPO (PTS) boosts reasoning‑heavy tasks (GPQA, MATH).
Second‑round DPO improves GPT‑4‑scored tasks (ArenaHard).
Combined DPO rounds provide complementary gains.
Cumulative metric improvements from post-training: MMLU +2.0, GPQA +8.8, MATH +3.3, HumanEval +3.1.
Benchmark considerations
Standard academic benchmarks have known limitations; an internal PhiBench suite is maintained to probe critical reasoning and knowledge skills.
Safety
Phi‑4 follows Microsoft’s responsible AI principles, with post‑training safety alignment, red‑team testing, and automated evaluations. It performs strongly on RAI safety benchmarks.
Limitations
Remaining issues include occasional factual hallucinations and difficulty adhering to strict formatting instructions.
Conclusion
The combination of large‑scale synthetic data, mid‑training for extended context, and targeted post‑training (SFT, DPO, PTS) demonstrates that compact LLMs can rival much larger models on reasoning, coding, and safety tasks. Future work will focus on further reducing hallucinations and improving format compliance.