Decoding OpenAI o1: Test‑Time Scaling, PRM Search & Inference Strategies
This article analyses the training tricks behind OpenAI's o1 model, explaining test/inference‑time scaling laws, post‑training techniques, process‑supervised reward models (PRM), various inference‑time search methods, data‑collection pipelines, and the trade‑offs between allocating compute to pre‑training versus inference.
OpenAI's recent release of the o1 model has sparked intense discussion about its training methodology, especially the test/inference-time scaling law, which allocates compute to the inference phase rather than pre-training.
1. What Is Test/Inference‑Time Scaling Law?
When a base generator model lacks strong reasoning ability, one can improve it either by expanding pre-training data and parameters or by spending compute after pre-training. The o1 report shows that this additional compute is split between two stages:
Post-training (RLHF): fine-tune the model after pre-training, e.g., with reinforcement learning from human feedback.
Inference (test time): spend compute while the model generates its answer.
The scaling law suggests that, similar to pre‑training scaling, inference performance improves with more compute, but the influencing factors differ.
2. Framework vs. Detail Variants
The author categorises related works into two groups:
Framework-level research: introduces concepts such as test/inference-time scaling and provides generic practical recipes.
Detail-level variants: build on the framework with specific algorithmic tweaks.
This article focuses on the framework, selecting two representative papers:
"Let's Verify Step by Step" (OpenAI)
"Scaling LLM Test‑Time Compute Optimally can be More Effective than Scaling Model Parameters" (DeepMind)
2.1 Optimising Inference Input – Prompt Engineering
Providing detailed prompts or multi‑turn instructions (e.g., Chain‑of‑Thought prompting) encourages the model to generate reasoning steps before the final answer, which consumes more tokens and thus more inference compute.
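To make this concrete, here is a minimal Chain-of-Thought prompting sketch in Python. The client, model name, and prompt wording are illustrative assumptions on my part, not details from the o1 report:

```python
# Minimal Chain-of-Thought prompting sketch. The model name and prompt
# wording are placeholders, not o1 specifics.
from openai import OpenAI

client = OpenAI()

question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model; placeholder choice
    messages=[
        {"role": "system",
         "content": "Reason step by step, then give the final answer "
                    "on a line starting with 'Answer:'."},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```

The extra reasoning tokens are exactly where the additional inference compute goes.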
2.2 Optimising Inference Output – Revising the Output Distribution
Two approaches are discussed:
Directly train the model with supervised fine‑tuning (SFT) on high‑quality "attempt" data, treating the whole pipeline as post‑training.
Use a verifier (reward model) during inference to guide the generator toward better intermediate steps.
Both can be viewed as post‑training, but the second also spends compute during inference.
3. Method 1 – PRM‑Guided Search
The Process‑supervised Reward Model (PRM) evaluates intermediate reasoning steps. The workflow consists of:
Format training: SFT the model to output a "step + answer" format.
Train PRM: collect labelled step data (via human annotation or automated soft labels) and train a binary classifier that predicts step quality.
Inference with PRM: use the trained PRM to score candidate reasoning chains and guide search.
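A minimal sketch of the scoring step, assuming the PRM is a binary classifier over (prefix, step) pairs; `prm_model` and `encode` are hypothetical stand-ins for a fine-tuned scorer and its tokenizer:

```python
import torch

# PRM scoring sketch: a binary classifier assigns each reasoning step a
# probability of being correct given the prefix so far. `prm_model` and
# `encode` are hypothetical stand-ins, not a specific library API.
def score_steps(prm_model, encode, question: str, steps: list[str]) -> list[float]:
    scores = []
    prefix = question
    for step in steps:
        inputs = encode(prefix, step)        # tokenize (prefix, candidate step)
        with torch.no_grad():
            logit = prm_model(inputs)        # scalar "step is good" logit
        scores.append(torch.sigmoid(logit).item())
        prefix = prefix + "\n" + step        # commit the step to the prefix
    return scores
```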
3.1 Data Collection for PRM
Three budget tiers are described:
Super-rich: generate massive amounts of step data and label them manually.
Rich: generate a moderate amount of data, filter obvious failures, then label.
Average: generate data, estimate step quality via Monte-Carlo rollouts, and use these soft labels for training (sketched below).
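For the average tier, a minimal sketch of Monte-Carlo soft labelling: the quality of a step prefix is estimated by the fraction of rollouts from that prefix that reach the known correct answer. `generate` and `extract_answer` are hypothetical helpers for the generator and an answer parser:

```python
# Monte-Carlo soft labelling sketch for the "average" budget tier.
# `generate(prefix)` samples the rest of a solution; `extract_answer`
# parses the final answer out of it. Both are hypothetical helpers.
def mc_soft_label(generate, extract_answer, question: str,
                  steps_so_far: list[str], gold_answer: str,
                  n_rollouts: int = 16) -> float:
    prefix = "\n".join([question] + steps_so_far)
    hits = sum(
        extract_answer(generate(prefix)) == gold_answer
        for _ in range(n_rollouts)
    )
    return hits / n_rollouts                 # soft label in [0, 1]
```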
3.2 Search Strategies Using PRM
Three common search methods are compared:
Best-of-N: sample N candidate chains, score each with the PRM (using prod, min, or last-step aggregation; see the sketch after this list), and pick the highest.
Beam Search: iteratively sample a small set of next steps, keep the top-M according to the PRM, and continue until a stopping condition.
Lookahead Search: similar to beam search but evaluates K steps ahead before pruning; essentially a variant of Monte-Carlo Tree Search (MCTS) in which the PRM replaces the exploration component.
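A minimal best-of-N sketch with the three aggregation modes; `sample_chain` and `score_steps` are hypothetical hooks, e.g. the PRM scorer above with its model and tokenizer already bound via `functools.partial`:

```python
import math

# Best-of-N with a PRM. `sample_chain(question) -> list[str]` returns one
# full reasoning chain; `score_steps(question, steps) -> list[float]`
# returns per-step probabilities. Both are hypothetical hooks.
def aggregate(step_scores: list[float], mode: str = "prod") -> float:
    if mode == "prod":            # product of per-step probabilities
        return math.prod(step_scores)
    if mode == "min":             # the weakest step dominates
        return min(step_scores)
    if mode == "last":            # only the final step's score
        return step_scores[-1]
    raise ValueError(f"unknown mode: {mode}")

def best_of_n(sample_chain, score_steps, question: str,
              n: int = 16, mode: str = "prod") -> list[str]:
    candidates = [sample_chain(question) for _ in range(n)]  # N full chains
    return max(candidates,
               key=lambda steps: aggregate(score_steps(question, steps), mode))
```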
Experimental results show that with limited search budget, beam search outperforms best‑of‑N, while with large budgets best‑of‑N catches up. For easy problems, simple best‑of‑N suffices; for harder problems, beam search or combined sequential‑parallel methods work better. Complex lookahead often underperforms when PRM is already strong.
4. Method 2 – Directly Adjusting the Model’s Output Distribution
Instead of a separate verifier, one can train a model to generate high‑quality reasoning steps directly. The key challenge is constructing high‑quality SFT data:
Start from supervised "question‑answer" pairs.
Fine‑tune the model to output "step + answer" format (format‑only SFT).
Sample many attempts per question, filter out invalid outputs.
Identify correct attempts (those whose final answer is correct) and pair each with several similar-but-incorrect attempts to form a trajectory: (question, wrong-attempt-1, …, wrong-attempt-k, correct-attempt), as sketched after this list.
Train the model on these trajectories so it learns to correct its own mistakes.
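A minimal sketch of the trajectory construction above; `is_correct` and `similarity` are hypothetical helpers (e.g., final-answer matching and embedding cosine similarity):

```python
# Trajectory construction sketch for self-correction SFT.
# `is_correct(question, attempt) -> bool` checks the final answer;
# `similarity(a, b) -> float` ranks attempts, higher = more similar.
def build_trajectory(question: str, attempts: list[str],
                     is_correct, similarity, k: int = 3) -> str | None:
    correct = [a for a in attempts if is_correct(question, a)]
    wrong = [a for a in attempts if not is_correct(question, a)]
    if not correct or not wrong:
        return None                          # need both kinds of attempts
    target = correct[0]
    # keep the k wrong attempts most similar to the correct one
    wrong.sort(key=lambda a: similarity(a, target), reverse=True)
    return "\n\n".join([question] + wrong[:k] + [target])
```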
Training is stopped based on validation-loss trends; however, because the validation set is off-policy, the loss can rise even while downstream performance still improves, so early stopping is applied only after clear signs of over-fitting appear.
4.1 Choosing the Best Generation Method
Even a well‑trained SFT model may produce incorrect final attempts. Therefore, combining the model with a verifier and a search method (e.g., sequential best‑of‑N, parallel best‑of‑N, or sequential + parallel hybrid) yields better results. Experiments indicate:
For easy questions, sequential selection of the best attempt works best.
For hard questions, a hybrid sequential + parallel approach with tuned hyper‑parameters performs better.
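A minimal sketch of the hybrid: run several independent chains in parallel, let each revise itself sequentially, then pick the best with a verifier. `sample_chain`, `revise`, and `verifier_score` are hypothetical hooks for initial sampling, self-revision, and verifier scoring:

```python
# Sequential + parallel hybrid search sketch. The three callables are
# hypothetical: `sample_chain` drafts an attempt, `revise` improves the
# previous attempt, `verifier_score` rates a finished attempt.
def hybrid_search(sample_chain, revise, verifier_score, question: str,
                  n_parallel: int = 4, n_sequential: int = 3) -> str:
    finalists = []
    for _ in range(n_parallel):              # parallel: independent restarts
        attempt = sample_chain(question)
        for _ in range(n_sequential):        # sequential: revise the last attempt
            attempt = revise(question, attempt)
        finalists.append(attempt)
    return max(finalists, key=lambda a: verifier_score(question, a))
```

The two hyper-parameters trade breadth (`n_parallel`) against depth (`n_sequential`), which is what gets tuned per difficulty level.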
5. Pre‑training vs. Inference Allocation
The article compares allocating the same FLOPs budget to pre-training versus inference. The accompanying plots (omitted here) show that for simple problems, dedicating all the extra compute to inference yields higher accuracy, while for complex problems, allocating some of it to further pre-training improves performance.
Key take‑away: allocate compute to inference for easy tasks, but retain pre‑training investment for harder reasoning problems.
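As a rough, back-of-the-envelope illustration of the trade-off, using the standard approximations (about 6ND FLOPs for pre-training and 2N FLOPs per decoded token); this is my sketch, not the paper's full accounting:

```latex
% Rough FLOPs sketch: N = parameters, D = pre-training tokens,
% T = tokens generated at inference time.
C_{\text{pretrain}} \approx 6\,N\,D, \qquad C_{\text{infer}} \approx 2\,N\,T
% At a fixed budget, extra pre-training tokens trade against inference tokens:
6\,N\,\Delta D = 2\,N\,T \;\Longrightarrow\; T = 3\,\Delta D
```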
Reference repository for additional materials: https://github.com/hijkzzz/Awesome-LLM-Strawberry