Training a Reasoning Model Rivaling OpenAI o1 and DeepSeek R1 for Under $50 in 26 Minutes

Researchers from Stanford and the University of Washington trained the s1 reasoning model in just 26 minutes using under $50 of cloud credits. By building a curated 1,000‑sample dataset and a budget‑enforced test‑time scaling algorithm, they achieved performance close to OpenAI's o1 and DeepSeek's R1.

Software Engineering 3.0 Era

Paper Overview

AI teams from Stanford and the University of Washington, with Fei-Fei Li among the authors, released a paper describing how they trained a reasoning model (named s1) in 26 minutes with less than $50 of cloud compute credits, achieving performance close to OpenAI’s o1 and DeepSeek’s R1.

Research Idea

The team started from the open‑source Qwen2.5‑32B‑Instruct model, which already excels on mathematical tasks, and sought the simplest path to strong test‑time scaling and reasoning performance. Their approach was to assemble high‑quality, diverse, and challenging data and to apply a simple yet effective test‑time scaling method.

Solution: s1K Dataset Construction

They collected 59,029 questions from 16 sources covering math competitions, science queries, riddles, and other domains. The dataset was refined through three key filtering stages:

Quality filtering removed API errors and low‑quality samples.

Difficulty filtering checked whether Qwen2.5‑7B‑Instruct or Qwen2.5‑32B‑Instruct could already answer each question correctly and used reasoning‑trace length as a further difficulty signal, discarding overly easy items.

Diversity filtering applied a mathematical‑topic taxonomy to select questions across 50 fields. The result is the final s1K dataset of 1,000 high‑quality, challenging problems, each paired with reasoning traces distilled from Gemini 2.0 Flash Thinking Experimental.
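The three filtering stages can be sketched as a small pipeline. This is only an illustration under stated assumptions: the field names (`api_error`, `solved_by_7b`, `solved_by_32b`, `domain`, `trace_tokens`) are hypothetical, and the diversity step simply picks a random domain and takes its longest remaining trace, which is a simplification rather than the authors' exact sampling rule.

```python
import random
from collections import defaultdict


def filter_quality(samples):
    """Stage 1: drop samples with API errors or empty question/trace text."""
    return [
        s for s in samples
        if not s.get("api_error") and s["question"].strip() and s["trace"].strip()
    ]


def filter_difficulty(samples):
    """Stage 2: keep only questions that neither reference model solved."""
    return [s for s in samples if not s["solved_by_7b"] and not s["solved_by_32b"]]


def select_diverse(samples, k, seed=0):
    """Stage 3: pick up to k samples, spreading picks across domains.

    Assumption: within a domain, longer reasoning traces are preferred.
    """
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[s["domain"]].append(s)
    for pool in by_domain.values():
        pool.sort(key=lambda s: s["trace_tokens"], reverse=True)  # longest first

    domains = sorted(by_domain)
    selected = []
    while len(selected) < k and domains:
        domain = rng.choice(domains)              # uniform over remaining domains
        selected.append(by_domain[domain].pop(0))
        if not by_domain[domain]:                 # domain exhausted
            domains.remove(domain)
    return selected


def build_subset(samples, k=1000):
    """Run the three stages in order."""
    return select_diverse(filter_difficulty(filter_quality(samples)), k)
```

Calling `build_subset(samples, k=1000)` on a list of dictionaries with the fields above returns at most 1,000 items; fewer come back if the filters leave less than `k` candidates.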

Budget‑Enforced Test‑Time Scaling Algorithm

The authors introduced a sequential test‑time scaling method, which the paper calls budget forcing, that enforces a maximum or a minimum number of thinking tokens. When the maximum is reached, an end‑of‑thinking delimiter is appended so that the model stops reasoning and gives its answer. If more computation is desired, the end‑of‑thinking delimiter is suppressed and the word “Wait” is appended to the reasoning trace to encourage further reasoning. Experiments showed that appending “Wait” often leads the model to re‑check its work and improves answer accuracy.
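The control flow described above can be sketched without a real model. In the sketch below, `generate` is a hypothetical callable that returns the next thinking tokens and stops early when the model would have ended its reasoning; `END_THINK`, `max_waits`, and the toy model are assumptions made for illustration, not the authors' implementation.

```python
END_THINK = "<end_of_thinking>"  # hypothetical delimiter string
WAIT = "Wait"


def budget_forced_thinking(generate, prompt, max_tokens, min_tokens=0, max_waits=2):
    """Return a thinking trace that respects a token cap and an optional floor.

    generate(prompt, thinking_so_far, max_new_tokens) -> list of tokens.
    A result shorter than max_new_tokens means the model tried to stop.
    """
    thinking = []
    waits = 0
    while len(thinking) < max_tokens:
        remaining = max_tokens - len(thinking)
        new_tokens = generate(prompt, thinking, remaining)
        thinking.extend(new_tokens)

        if len(new_tokens) >= remaining:
            break  # cap reached: force the end of thinking
        if len(thinking) >= min_tokens or waits >= max_waits:
            break  # model stopped on its own and the floor is satisfied
        # Model tried to stop too early: suppress the stop and nudge it on.
        thinking.append(WAIT)
        waits += 1

    # Slice guards against a generate() that ignores max_new_tokens.
    return thinking[:max_tokens] + [END_THINK]


def make_toy_model(chunks):
    """Fake generate() that emits one scripted chunk of tokens per call."""
    scripted = iter(chunks)

    def generate(prompt, thinking_so_far, max_new_tokens):
        return next(scripted, [])[:max_new_tokens]

    return generate
```

With a toy model scripted to stop after two tokens, a call such as `budget_forced_thinking(make_toy_model([["a", "b"], ["c", "d"]]), "q", max_tokens=10, min_tokens=4)` inserts one "Wait" before the model continues, while a small `max_tokens` truncates the trace and appends the delimiter immediately.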

Training Procedure

Using Qwen2.5‑32B‑Instruct as the base, they fine‑tuned on s1K on 16 NVIDIA H100 GPUs for 5 epochs with a batch size of 16, for a total of 315 gradient steps. Delimiter tokens separated the thinking phase from the answer phase, and loss was computed only on the reasoning trace and solution, not on the question. The entire training run finished in 26 minutes.
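Two details of this setup are easy to check with a few lines of code: the step count and the loss masking. The sketch below assumes that the last partial batch of each epoch still counts as a step, which is consistent with the reported 315, and it uses hypothetical helper names and plain Python lists in place of a real training framework.

```python
import math


def gradient_steps(num_samples, batch_size, epochs):
    """Steps per epoch (rounding the partial batch up) times epochs."""
    return math.ceil(num_samples / batch_size) * epochs


def loss_mask(num_tokens, response_start):
    """1 for tokens that contribute to the loss, 0 for question tokens.

    response_start is the index of the first reasoning-trace token.
    """
    return [0 if i < response_start else 1 for i in range(num_tokens)]


def masked_mean_loss(per_token_loss, mask):
    """Average the per-token loss over unmasked positions only."""
    total = sum(loss * m for loss, m in zip(per_token_loss, mask))
    return total / max(sum(mask), 1)


# 1,000 samples / batch size 16 = 62.5, rounded up to 63 steps per epoch.
steps = gradient_steps(num_samples=1000, batch_size=16, epochs=5)  # 63 * 5 = 315
```

In a real fine-tuning stack the same masking is usually expressed by setting the label of each question token to an ignore value so that the loss function skips it.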

Evaluation Setup and Results

Evaluation employed three reasoning benchmarks: AIME24, MATH500, and GPQA Diamond. s1‑32B was compared against the OpenAI o1 series, the DeepSeek R1 series, and other state‑of‑the‑art models, and test‑time scaling methods were measured on controllability, scalability, and raw performance. s1‑32B demonstrated strong test‑time scaling, exceeding o1‑preview on competition math questions (MATH and AIME24) by up to 27%, and showed high sample efficiency given its 1,000 training examples. The budget‑enforcement algorithm outperformed alternative test‑time scaling methods at controlling the amount of compute used.
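The article does not give formulas for the three measures, so the following is one plausible formalization, offered as an assumption: controllability as the share of runs whose thinking length stays inside the requested budget, scalability as the average slope of accuracy against compute over all pairs of evaluation points, and raw performance as the best accuracy reached. The numbers in the usage lines are invented for illustration and are not results from the paper.

```python
from itertools import combinations


def control(token_counts, min_budget, max_budget):
    """Fraction of runs whose thinking-token count lies within the budget."""
    within = sum(min_budget <= n <= max_budget for n in token_counts)
    return within / len(token_counts)


def scaling(points):
    """Average slope of accuracy vs. compute over all pairs of points.

    points is a list of (thinking_tokens, accuracy) tuples.
    A positive value means accuracy tends to rise with more compute.
    """
    pts = sorted(points)
    slopes = [
        (acc_b - acc_a) / (b - a)
        for (a, acc_a), (b, acc_b) in combinations(pts, 2)
        if b != a
    ]
    return sum(slopes) / len(slopes)


def performance(points):
    """Best accuracy reached at any evaluated compute level."""
    return max(acc for _, acc in points)


# Made-up example values, only to show how the functions are called.
toy_points = [(512, 0.2), (1024, 0.4), (2048, 0.5)]
toy_control = control([100, 200, 900], min_budget=0, max_budget=512)
toy_scaling = scaling(toy_points)
toy_performance = performance(toy_points)
```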

Conclusions

Training on just 1,000 curated samples, combined with the budget‑enforced algorithm, effectively boosts reasoning performance and test‑time scalability. Careful data selection that balances quality, difficulty, and diversity is crucial. The work points to a direction for future research on simple reasoning methods and suggests that language models can improve their reasoning ability and resource efficiency without massive compute budgets.

Written by

Software Engineering 3.0 Era
