Boost 7B LLM Math Reasoning Beyond GPT‑4o with a Simple Pass@k Reward

By replacing the traditional Pass@1 reward with a Pass@k formulation and a lightweight advantage computation, a 7B language model can dramatically improve on math reasoning benchmarks and surpass GPT‑4o, at the cost of only a few extra lines of code and minimal training overhead.


Background

When fine‑tuning large language models with reinforcement learning, the standard Pass@1 reward gives +1 only if the single generated answer is correct and 0 otherwise. Under uncertainty this pushes the model toward the shortest or most frequent answer, producing conservative behavior and locally optimal policies that miss rarer but better solutions.

Pass@k Reward

The core idea is to let the model generate a group of k answers per query and count the task as solved if any one of them is correct. The reward becomes the expected maximum over the k outcomes:

Pass@k = E[max(R1, R2, …, Rk)]

This change provides three benefits: (1) it encourages exploration, because a single correct answer among many wrong ones still earns reward; (2) it creates a natural curriculum, since k can be annealed from large to small over training; (3) it requires no additional annotation, because the verifier logic stays unchanged.
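As a quick sanity check, assume (purely for illustration) that the k samples are independent with per-sample success probability p. The expected reward then works out to

Pass@k = 1 − (1 − p)^k

so a problem the model solves only 20% of the time still yields 1 − 0.8^8 ≈ 0.83 expected reward at k = 8, versus 0.2 under Pass@1.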

Analytical Advantage Computation

The paper derives an analytical solution that avoids high‑variance Monte‑Carlo sampling: sample n > k rollouts per prompt, estimate Pass@k in closed form from the counts of correct (pos_mask) and incorrect (neg_mask) rollouts, and assign each rollout its group‑relative advantage.

import torch


def compute_advantage(k: int, pos_mask: torch.Tensor, neg_mask: torch.Tensor):
    """pos_mask: [B, n] 1 if rollout correct
    neg_mask: [B, n] 1 if rollout wrong
    returns: advantage for each rollout [B, n]
    Assumes n > k rollouts are sampled per prompt."""
    n_pos = pos_mask.sum(-1).float()
    n_neg = neg_mask.sum(-1).float()
    n_roll = n_pos + n_neg
    k_t = torch.full_like(n_roll, float(k))

    def log_comb(n, r):
        # log C(n, r); lgamma is +inf at non-positive integers, so the
        # ratio below correctly vanishes whenever r > n.
        return torch.lgamma(n + 1) - torch.lgamma(r + 1) - torch.lgamma(n - r + 1)

    # Unbiased Pass@k estimate: P(a random size-k group is all wrong).
    p_no_pos = torch.exp(log_comb(n_neg, k_t) - log_comb(n_roll, k_t))
    r_bar = 1 - p_no_pos                      # expected Pass@k (group reward)
    std = torch.sqrt(r_bar * (1 - r_bar) + 1e-8)

    # A correct rollout always sits in a group whose max reward is 1.
    adv_pos = (1 - r_bar) / std
    # A wrong rollout's group succeeds only through its other k-1 members:
    # P(all wrong | group contains this rollout)
    #   = C(n_neg-1, k-1) / C(n_roll-1, k-1) = p_no_pos * n_roll / n_neg
    p_fail_neg = p_no_pos * n_roll / n_neg.clamp(min=1)
    adv_neg = (1 - p_fail_neg - r_bar) / std

    return adv_pos.unsqueeze(-1) * pos_mask.float() + adv_neg.unsqueeze(-1) * neg_mask.float()

In the training loop, replace the original 0/1 REINFORCE reward with advantage = compute_advantage(...) while keeping the rest of the RL framework (e.g., OpenRLHF, TRL) unchanged.
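
A minimal sketch of that swap, assuming logprobs holds the summed token log‑probabilities of each rollout and rewards holds the 0/1 verifier outputs, both shaped [B, n] (the variable names are illustrative, not from any particular framework):

rewards = rewards.float()
advantage = compute_advantage(k, rewards, 1 - rewards)  # masks from 0/1 rewards
loss = -(advantage.detach() * logprobs).mean()          # REINFORCE-style objective
loss.backward()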

Task‑Specific Verifier Example

For a maze‑navigation task, the verifier simply checks whether the trajectory reaches the target location.

from typing import List

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
TARGET = (3, 3)  # illustrative goal cell; in practice set by the environment

def verify_maze(traj: List[str]) -> int:
    # traj: ["up", "right", ...]
    x, y = 0, 0
    for act in traj:
        dx, dy = MOVES[act]
        x, y = x + dx, y + dy
    return int((x, y) == TARGET)
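
For instance, with the illustrative TARGET = (3, 3) above:

assert verify_maze(["up", "right"] * 3) == 1   # ends at (3, 3)
assert verify_maze(["up", "up"]) == 0          # ends at (0, 2)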

Register verify_maze as the reward function; the Pass@k pipeline then works out‑of‑the‑box.

Training Procedure

Steps to upgrade an existing RLHF/RLVR project:

1. Replace the scalar reward (reward = float(is_correct)) with advantage = compute_advantage(...).

2. Schedule k (e.g., start at 8 and decay to 1). A simple linear anneal can be implemented as:

k = max(1, 8 - int(global_step / 1000))  # linear decay

3. Run all n rollouts per prompt in parallel as one enlarged generation batch; no extra communication is required (see the sketch below).
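
One way to realize step 3, assuming a Hugging Face causal LM (model, input_ids, and n_rollouts are illustrative names; n_rollouts plays the role of n > k in compute_advantage):

outputs = model.generate(
    input_ids,                        # [B, prompt_len], left-padded prompts
    do_sample=True,
    temperature=1.0,
    num_return_sequences=n_rollouts,  # n rollouts per prompt in one call
    max_new_tokens=512,
)                                     # -> [B * n_rollouts, total_len]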

Experimental Results

Using the same number of training steps, Qwen‑7B was trained with k annealed from 8 to 1. The increase in training time on a single A100 GPU was only ~6 %.

GSM8K: Pass@1 = 58.4 → Pass@k = 71.2

MATH: Pass@1 = 28.9 → Pass@k = 42.7

Maze: Pass@1 = 65.1 → Pass@k = 83.5

Pitfalls and Recommendations

Setting k too large (e.g., > 16) can cause variance explosion; keep k ≤ 16.

The verifier must be deterministic and noise‑free; otherwise the analytical advantage formula becomes invalid.

Keep prompt length fixed across rollouts to avoid GPU idle time (one padding option is sketched below).
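
For the padding point, one option is to pad every prompt to a fixed length, e.g. with a Hugging Face tokenizer (the max_length of 512 is an arbitrary example):

batch = tokenizer(
    prompts,
    padding="max_length",   # identical shape for every rollout batch
    max_length=512,
    truncation=True,
    return_tensors="pt",
)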

Repository

Open‑source implementation: https://github.com/RUCAIBox/Passk_Training

Tags: Python, RLHF, reward engineering
Written by Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.