Boost 7B LLM Math Reasoning Beyond GPT‑4o with a Simple Pass@k Reward
Replacing the traditional Pass@1 reward with a Pass@k formulation and a lightweight advantage computation lets a 7B language model dramatically improve on math reasoning benchmarks, surpassing GPT‑4o while adding only a few lines of code and minimal training overhead.
Background
When fine‑tuning large language models with reinforcement learning, the common Pass@1 reward gives +1 if the single generated answer is correct and 0 otherwise. Under uncertainty this encourages the model to output the shortest or most frequent answer, leading to conservative behavior and locally optimal solutions that miss rarer, better answers.
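For reference, a minimal sketch of this baseline reward; is_correct stands in for whatever task-specific verifier the project already uses (the name is illustrative):

def pass_at_1_reward(answer: str, is_correct) -> float:
    # is_correct: task-specific verifier callable, assumed to return True/False
    return 1.0 if is_correct(answer) else 0.0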
Pass@k Reward
The core idea is to let the model generate k answers per query and count the task as solved if any answer is correct. The reward becomes the expected maximum over the k outcomes:

Pass@k = E[max(R1, R2, …, Rk)]

This change provides three benefits:
1. It encourages exploration, because a single correct answer among many wrong ones still yields reward.
2. It creates a natural curriculum by annealing k from large to small.
3. It requires no additional annotation, because the verifier logic stays unchanged.
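Before the analytical version below, a naive Monte‑Carlo sketch makes the definition concrete (the [B, k] shape is an assumption: 0/1 verifier outcomes for k sampled answers per query):

import torch

def pass_at_k_reward(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: [B, k] 0/1 verifier outcomes, one row per query
    # the max over the k rollouts is 1 iff at least one answer is correct
    return rewards.max(dim=-1).values

Averaging this over many sampled groups approximates the expectation; the analytical computation in the next section removes that sampling noise.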
Analytical Advantage Computation
The paper derives an analytical solution that avoids high‑variance Monte‑Carlo sampling. The key function computes an advantage for each rollout from the number of correct (pos_mask) and incorrect (neg_mask) outcomes.
import torch

def compute_advantage(k: int, pos_mask: torch.Tensor, neg_mask: torch.Tensor):
    """pos_mask: [B, n] 1 if rollout correct
    neg_mask: [B, n] 1 if rollout wrong
    (n rollouts per query, n >= k)
    returns: advantage for each rollout [B, n]"""
    n_pos = pos_mask.float().sum(-1)   # correct rollouts per query [B]
    n_neg = neg_mask.float().sum(-1)   # incorrect rollouts per query [B]
    n_roll = n_pos + n_neg             # total rollouts per query [B]
    k_t = torch.full_like(n_roll, float(k))
    # log C(n_roll, k) and log C(n_neg, k); lgamma is +inf at non-positive integers,
    # so C(n_neg, k) correctly comes out as 0 whenever n_neg < k
    log_c_all = torch.lgamma(n_roll + 1) - torch.lgamma(k_t + 1) - torch.lgamma(n_roll - k_t + 1)
    log_c_neg = torch.lgamma(n_neg + 1) - torch.lgamma(k_t + 1) - torch.lgamma(n_neg - k_t + 1)
    p_no_pos = torch.exp(log_c_neg - log_c_all)  # probability all k drawn rollouts are wrong
    r_bar = 1 - p_no_pos                         # expected Pass@k
    var = r_bar * (1 - r_bar)
    adv_pos = (1 - r_bar) / (var + 1e-8)
    adv_neg = -(r_bar - (n_pos >= 1).float()) / (var + 1e-8)
    advantage = torch.zeros_like(pos_mask, dtype=torch.float)
    advantage[pos_mask.bool()] = adv_pos.unsqueeze(-1).expand_as(advantage)[pos_mask.bool()]
    advantage[neg_mask.bool()] = adv_neg.unsqueeze(-1).expand_as(advantage)[neg_mask.bool()]
    return advantage

In the training loop, replace the original REINFORCE reward reward = 0/1 with advantage = compute_advantage(...) while keeping the rest of the RL framework (e.g., OpenRLHF, TRL) unchanged.
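A quick shape check on dummy data (the masks and k here are illustrative):

# 2 queries, 6 rollouts each; advantages for the expected Pass@4 reward
pos = torch.tensor([[1, 0, 0, 1, 0, 0],   # query 1: 2 of 6 rollouts correct
                    [0, 0, 0, 0, 0, 0]])  # query 2: all rollouts wrong
adv = compute_advantage(k=4, pos_mask=pos, neg_mask=1 - pos)
print(adv.shape)  # torch.Size([2, 6])

The returned tensor then takes the place of the scalar reward in a REINFORCE-style loss, e.g. loss = -(adv * rollout_logprobs).mean(), where rollout_logprobs (a [B, n] tensor of per-rollout summed log-probabilities) is whatever your framework already computes.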
Task‑Specific Verifier Example
For a maze‑navigation task, the verifier simply checks whether the trajectory reaches the target location (the MOVES table and TARGET below are illustrative stand-ins for the task's actual transition rules and goal):
from typing import List

TARGET = (3, 3)  # example goal cell; the real task supplies its own target
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def verify_maze(traj: List[str]) -> int:
    # traj: ["up", "right", ...]
    x, y = 0, 0
    for act in traj:
        dx, dy = MOVES[act]
        x, y = x + dx, y + dy
    return int((x, y) == TARGET)

Register verify_maze as the reward function; the Pass@k pipeline then works out of the box.
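For instance, with the example TARGET above:

print(verify_maze(["right", "right", "right", "up", "up", "up"]))  # 1: ends at (3, 3)
print(verify_maze(["up", "up"]))                                   # 0: ends at (0, 2)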
Training Procedure
Steps to upgrade an existing RLHF/RLVR project:
1. Replace the scalar reward (reward = float(is_correct)) with advantage = compute_advantage(...).
2. Schedule k (e.g., start at 8 and decay to 1). A simple linear anneal can be implemented as:

k = max(1, 8 - int(global_step / 1000))  # linear decay

3. Run the k rollouts in parallel within the same forward pass by increasing the batch size; no extra communication is required (a sketch follows this list).
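A minimal sketch of step 3, assuming a Hugging Face-style tokenizer and generate API (rollout_k and its arguments are illustrative; the exact entry points depend on your RL framework):

import torch

def rollout_k(model, tokenizer, prompts, k: int, max_new_tokens: int = 256):
    # Duplicate each prompt k times so all k rollouts share one batched forward pass
    batch = tokenizer(prompts, return_tensors="pt", padding=True)
    input_ids = batch["input_ids"].repeat_interleave(k, dim=0)       # [B, T] -> [B*k, T]
    attn = batch["attention_mask"].repeat_interleave(k, dim=0)
    out = model.generate(input_ids=input_ids, attention_mask=attn,
                         do_sample=True, max_new_tokens=max_new_tokens)
    # Regroup to [B, k, T_out] so verifier outcomes line up with compute_advantage's masks
    return out.view(len(prompts), k, -1)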
Experimental Results
Using the same number of training steps as the Pass@1 baseline, Qwen‑7B was trained with k annealed from 8 to 1. The increase in training time on a single A100 GPU was only ~6%.
GSM8K: Pass@1 = 58.4 → Pass@k = 71.2
MATH: Pass@1 = 28.9 → Pass@k = 42.7
Maze: Pass@1 = 65.1 → Pass@k = 83.5
Pitfalls and Recommendations
Setting k too large (e.g., > 16) can cause variance explosion; keep k ≤ 16.
The verifier must be deterministic and noise‑free; otherwise the analytical advantage formula becomes invalid.
Prompt length should be fixed across rollouts (e.g., padded to a common length) to avoid GPU idle time; one way to do this is shown below.
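With a Hugging Face tokenizer, for instance, fixed-length padding is a one-liner (the max_length value is an arbitrary example):

batch = tokenizer(prompts, padding="max_length", max_length=512,
                  truncation=True, return_tensors="pt")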
Repository
Open‑source implementation: https://github.com/RUCAIBox/Passk_Training