Dual Engine for Training and Inference: How Princeton’s SD‑ZERO and AggAgent Redefine Complex Reasoning

This article reviews two recent Princeton papers. SD‑ZERO combines self‑revision training with on‑policy self‑distillation to turn a model’s own error traces into dense supervision, and AggAgent actively aggregates parallel long‑horizon trajectories at inference time. Together they show how mining a model’s internal trajectories can cut compute costs and boost accuracy on challenging math and code benchmarks.

Machine Learning Algorithms & Natural Language Processing

Training Engine: SD‑ZERO Implements Self‑Revision Supervision

Large language models often rely on external high‑quality labels or powerful models such as GPT‑4 for knowledge distillation, which is costly and unsustainable. SD‑ZERO tackles this bottleneck by letting a single model act as both generator and reviser. In the Self‑Revision Training (SRT) stage, the model samples multiple initial answers to a problem, uses an external validator to label them as correct or incorrect, and then receives distinct prompts: correct answers are re‑phrased, while incorrect ones trigger a “start over” revision. The filtered high‑quality correction traces become the supervision data for subsequent fine‑tuning.
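The SRT data-collection loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, the `sample`, `revise`, and `is_correct` callables, the `RevisionTrace` record, and the rule of keeping only traces whose revision passes the validator are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical prompt templates; the article does not give the actual wording.
REPHRASE_PROMPT = "Your answer is correct. Restate the solution."
RESTART_PROMPT = "Your answer is incorrect. Start over and solve the problem again."


@dataclass
class RevisionTrace:
    problem: str
    first_attempt: str
    instruction: str
    revision: str


def build_srt_traces(
    problem: str,
    sample: Callable[[str], str],
    revise: Callable[[str, str, str], str],
    is_correct: Callable[[str, str], bool],
    n_samples: int = 4,
) -> List[RevisionTrace]:
    """Collect revision traces for one problem using a single model.

    `sample` and `revise` stand in for calls to the same model acting as
    generator and reviser; `is_correct` stands in for the external validator.
    """
    traces: List[RevisionTrace] = []
    for _ in range(n_samples):
        attempt = sample(problem)
        # Correct attempts get a rephrase prompt, incorrect ones a restart prompt.
        if is_correct(problem, attempt):
            instruction = REPHRASE_PROMPT
        else:
            instruction = RESTART_PROMPT
        revision = revise(problem, attempt, instruction)
        # Assumed filter: keep only traces whose revision passes the validator.
        if is_correct(problem, revision):
            traces.append(RevisionTrace(problem, attempt, instruction, revision))
    return traces
```

The returned traces would then serve as fine-tuning examples. How the paper actually filters and formats them is not specified in this summary.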

The second stage, on‑policy self‑distillation, freezes the SRT model as a teacher and trains a student generator to match the teacher’s probability distribution via a token‑level KL loss. As a result, the model can produce a complete reasoning chain when given only the input prompt. In experiments on Qwen3‑4B‑Instruct and Olmo‑3‑7B‑Instruct with a 15K‑sample budget, SD‑ZERO outperforms standard supervised fine‑tuning (SFT), rejection‑sampling fine‑tuning (RFT), GRPO, and self‑distillation fine‑tuning (SDFT) on AIME, MATH, and code benchmarks, while reducing output length by roughly 50% and trimming redundant text.
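To make the token‑level KL objective concrete, here is a small pure‑Python sketch that averages the per‑position KL divergence between teacher and student next‑token distributions. The direction KL(teacher || student), the mean over positions, and the function names are assumptions for illustration. In an on‑policy setup the positions would come from sequences sampled by the student itself; this sketch only computes the loss once both sets of logits are available.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_level_kl(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
) -> float:
    """Mean over token positions of KL(teacher || student).

    Each argument has one row of vocabulary logits per token position.
    """
    if len(teacher_logits) != len(student_logits) or not teacher_logits:
        raise ValueError("need the same non-zero number of positions")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        # KL(p || q) = sum_v p(v) * (log p(v) - log q(v))
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)
```

In a real training loop the teacher logits would be computed without gradients, and only the student's parameters would be updated to reduce this quantity.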

Iterative self‑evolution, in which the updated model serves as the teacher for the next round, yields an additional accuracy gain of more than 3%.
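The control flow of this iteration is simple enough to sketch. The `train_round` callable is a hypothetical stand‑in for one full round of self‑revision training plus distillation against the given teacher. The article does not say how many rounds the paper runs.

```python
from typing import Callable, List, TypeVar

M = TypeVar("M")


def self_evolve(initial: M, train_round: Callable[[M], M], rounds: int) -> List[M]:
    """Run `rounds` training rounds, promoting each student to the next teacher.

    Returns every model in order, starting with the initial one.
    """
    teacher = initial
    history = [teacher]
    for _ in range(rounds):
        student = train_round(teacher)  # train a student against the frozen teacher
        teacher = student               # the student becomes the next teacher
        history.append(teacher)
    return history
```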

Inference Engine: AggAgent Reconstructs Long‑Trajectory Aggregation

For deep agentic tasks that involve hundreds of interaction steps (e.g., web search or extensive research), traditional majority voting only checks the final answer and discards valuable intermediate information. Summarization‑based aggregation (SummAgg) preserves more detail but incurs heavy computational overhead. AggAgent proposes a different paradigm: trajectories generated in parallel are treated as a searchable virtual environment. A dedicated aggregation agent uses four tools (get_solution, search_trajectory, get_segment, and finish) to retrieve stage conclusions, perform keyword searches within a trajectory, extract key steps, and synthesize a final solution.
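One way to picture the "searchable virtual environment" is as a small in‑memory store that exposes the four tools. The tool names come from the text above, but the signatures, the `Trajectory` record, and the case‑insensitive substring search are assumptions for illustration, not the paper's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trajectory:
    steps: List[str]  # one string per interaction step
    solution: str     # the conclusion this trajectory reached


class TrajectoryEnv:
    """Hypothetical in-memory environment over parallel trajectories."""

    def __init__(self, trajectories: List[Trajectory]) -> None:
        self.trajectories = trajectories
        self.final_answer: Optional[str] = None

    def get_solution(self, traj_id: int) -> str:
        """Return the conclusion recorded for one trajectory."""
        return self.trajectories[traj_id].solution

    def search_trajectory(self, traj_id: int, keyword: str) -> List[int]:
        """Return indices of steps containing the keyword (case-insensitive)."""
        kw = keyword.lower()
        steps = self.trajectories[traj_id].steps
        return [i for i, step in enumerate(steps) if kw in step.lower()]

    def get_segment(self, traj_id: int, start: int, end: int) -> List[str]:
        """Return steps in the half-open range [start, end)."""
        return self.trajectories[traj_id].steps[start:end]

    def finish(self, answer: str) -> str:
        """Record the synthesized final answer and end the session."""
        self.final_answer = answer
        return answer
```

An aggregation agent would call these methods as tools, for example searching two trajectories for the same keyword and pulling the surrounding segments to reconcile a disagreement before calling finish.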

This design lets AggAgent identify minority‑correct answers, resolve contradictions across trajectories, and combine fragmented evidence. Because all operations run on local data structures, the aggregation cost is only 5.7% of the base runtime, far lower than the 41% overhead of summarization methods. Empirical results with base models such as GLM‑4.7‑Flash and Qwen3.5‑122B show an average absolute performance improvement of up to 5.3% over existing aggregation strategies, and a Pareto‑efficient trade‑off between cost and latency.
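For contrast, the majority‑voting baseline fits in a few lines, which makes its blind spot easy to see: it looks only at final answers, so a correct answer held by a minority of trajectories can never win. The function name and the toy answers are illustrative only.

```python
from collections import Counter
from typing import List


def majority_vote(final_answers: List[str]) -> str:
    """Return the most frequent final answer.

    Ties go to whichever answer appears first in the input.
    """
    if not final_answers:
        raise ValueError("no answers to vote on")
    return Counter(final_answers).most_common(1)[0][0]


# Two trajectories agree on "12" and one says "15". If "15" is the correct
# answer, voting still returns "12", because intermediate evidence is ignored.
winner = majority_vote(["12", "12", "15"])
```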

Further analysis demonstrates that a “synthesis” strategy (rewriting and integrating information) outperforms a simple “selection” of the best candidate, especially on open‑ended deep‑research tasks.

Conclusion

Both SD‑ZERO and AggAgent address the same core challenge: reducing brute‑force compute by mining the value hidden in a model’s intermediate reasoning trajectories. Teaching models to self‑revise and actively aggregate their own outputs offers a more direct and efficient path toward complex reasoning than merely scaling parameters or sampling.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Machine Learning Algorithms & Natural Language Processing