How AP2O‑Coder Cuts LLM Code Errors by Up to 3% with Adaptive Preference Optimization

The paper introduces AP2O‑Coder, an adaptive progressive preference optimization framework that systematically captures error types, progressively refines LLM code generation, and dynamically adapts training data, achieving up to a 3% pass@k improvement across multiple open‑source models while reducing data requirements.


01 Core Challenges of Existing Methods and AP2O‑Coder’s Targeted Design

Current offline preference‑optimization methods (e.g., DPO) for LLM code correction face three main challenges:

Lack of error‑type awareness: binary pass/fail signals from unit tests do not reveal specific error categories such as KeyError or ValueError, making it hard for the model to locate the root cause (a toy illustration follows this list).

Insufficient training focus: random shuffling of training data forces the model to switch frequently among many error types, reducing the specificity of learning.

Weak dynamic adaptation: a static training set cannot keep up with the model's evolving capabilities during fine-tuning, leading to catastrophic forgetting or wasted resources.
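To see the first problem concretely, consider a minimal hypothetical test harness (not the paper's code): every distinct exception collapses into the same fail label, so the training signal never says which error category was hit.

```python
# Hypothetical binary test harness: every distinct failure
# collapses into the same False verdict.
def binary_verdict(candidate_code: str, test_code: str) -> bool:
    try:
        exec(candidate_code + "\n" + test_code, {})
        return True
    except Exception:
        # KeyError, ValueError, AssertionError ... all look identical here.
        return False

print(binary_verdict("d = {}", "d['missing']"))  # False (the KeyError is hidden)
print(binary_verdict("x = int('abc')", ""))      # False (the ValueError is hidden)
```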

To address these issues, AP2O‑Coder adopts a systematic “exam‑analysis‑correction‑quiz” workflow inspired by human problem‑solving strategies.

02 AP2O‑Coder’s Core Technical Framework and Workflow

The framework consists of four stages:

2.1 Code Generation Evaluation (Exam)

The target LLM generates N candidate solutions for each of M programming tasks at temperature 1.0, exploring the breadth of its current capabilities. Unit tests label each candidate as pass or fail, producing the raw material for error analysis.
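A rough sketch of this stage, assuming a `sample_solution(task, temperature)` wrapper around the target LLM and a `run_unit_tests(code, task)` harness; both names are illustrative, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    task_id: str
    code: str
    passed: bool

def exam(tasks, sample_solution, run_unit_tests, n_candidates=20):
    """Sample N solutions per task at temperature 1.0 and label
    each one pass/fail with that task's unit tests."""
    candidates = []
    for task in tasks:  # tasks assumed to be dicts with a "task_id" key
        for _ in range(n_candidates):
            code = sample_solution(task, temperature=1.0)  # explore the ability space
            passed = run_unit_tests(code, task)
            candidates.append(Candidate(task["task_id"], code, passed))
    return candidates
```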

2.2 Error Diagnosis Analysis (Analysis)

Using language-specific analysis tools (e.g., the Python interpreter), all failing solutions are analyzed, their error types annotated, and per-type frequencies counted, yielding a structured "error notebook" organized by error category.
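Continuing the sketch above, the diagnosis step can be approximated by re-running each failing candidate with its unit tests and recording the raised exception's class name as the error type; `build_test_script` is an assumed helper that joins a candidate's code with its tests:

```python
from collections import defaultdict

def diagnose(failing_candidates, build_test_script):
    """Annotate each failing candidate with an error type and group
    them into an 'error notebook' keyed by error category."""
    notebook = defaultdict(list)
    for cand in failing_candidates:
        try:
            exec(build_test_script(cand), {"__name__": "__main__"})
            continue  # should not happen: these candidates already failed
        except AssertionError:
            error_type = "WrongResult"  # runs, but returns a wrong answer
        except Exception as e:
            error_type = type(e).__name__  # e.g. KeyError, ValueError, IndexError
        notebook[error_type].append(cand)
    freq = {err: len(cands) for err, cands in notebook.items()}
    return notebook, freq
```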

2.3 Progressive Preference Optimization (Correction)

Based on the error notebook, the method orders its optimization steps by error frequency, with the direction depending on model size: small models follow a low-frequency-to-high-frequency (L2H) path, while large models (≥ 7 B) use a high-frequency-to-low-frequency (H2L) path (Section 04 reports the size thresholds observed). A sliding window over the ordered error types then generates, for each step, DPO preference pairs ⟨prompt, correct answer, erroneous answer of type E⟩.
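A minimal sketch of how such a curriculum could be assembled from the notebook; here `passing` maps each task to a known unit-test-passing solution, and the window width and pairing details are assumptions rather than the paper's exact recipe:

```python
def build_curriculum(notebook, freq, passing, order="H2L", window=2):
    """Order error types by frequency (H2L for large models, L2H for
    small ones) and emit DPO preference pairs per sliding-window step."""
    ranked = sorted(freq, key=freq.get, reverse=(order == "H2L"))
    steps = []
    for i in range(max(1, len(ranked) - window + 1)):  # slide over error types
        pairs = [
            {
                "prompt": cand.task_id,           # stand-in for the full task prompt
                "chosen": passing[cand.task_id],  # correct answer
                "rejected": cand.code,            # erroneous answer of this type
            }
            for err in ranked[i:i + window]
            for cand in notebook[err]
            if cand.task_id in passing
        ]
        steps.append(pairs)
    return steps
```

Each step's pairs feed one standard DPO update, so the model concentrates on a narrow band of error types at a time rather than shuffling across all of them.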

2.4 Adaptive Error Replay (Quiz)

During training, the model is periodically evaluated on a small validation set to capture current high‑frequency error types. Those error types that persist are re‑introduced into the training loop, dynamically adjusting data distribution to focus on the model’s present weaknesses and mitigating forgetting.
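Sketched under the same assumptions, with `dpo_step`, `quiz` (returning a collections.Counter of error frequencies on the validation set), and `pairs_for` as hypothetical stand-ins:

```python
def train_with_replay(model, steps, dpo_step, quiz, pairs_for, top_k=2):
    """After each progressive step, re-quiz the model and replay the
    error types that still dominate its failures."""
    for step_pairs in steps:
        model = dpo_step(model, step_pairs)  # one progressive correction step
        freq_now = quiz(model)               # current errors on the validation set
        persistent = [err for err, _ in freq_now.most_common(top_k)]
        replay_pairs = [p for err in persistent for p in pairs_for(err)]
        if replay_pairs:                     # refocus on lingering weaknesses
            model = dpo_step(model, replay_pairs)
    return model
```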

Figure: AP2O-Coder framework flowchart

03 Experimental Validation and Results

The authors evaluated six mainstream LLM families: CodeLlama, DeepSeek-Coder, Qwen2.5-Coder, Llama 3, Qwen2.5, and Qwen3, spanning parameter scales from 0.5 B to 34 B. Benchmarks comprised EvalPlus (HumanEval/MBPP) and LiveCodeBench v6.

3.1 Effectiveness of Performance Gains

AP2O-Coder consistently improved pass@k scores. On HumanEval, for example, the H2L variant achieved a 2.8%–3.4% gain for models larger than 30 B, while avoiding the degradation observed with existing post-training methods.

Figure: Pass@1 performance on EvalPlus (HumanEval)

3.2 Error Suppression and Generalization

Compared with SFT and DPO baselines, AP2O‑Coder reduced the frequency of all error types and introduced no new errors. In Qwen2.5‑Coder‑7B experiments, high‑frequency “WrongResult” errors dropped sharply, and rare errors such as IndexError were eliminated in later training stages. Pass@5 and pass@10 also showed stable improvements.

Figure: Error statistics during the quiz phase for Qwen2.5-Coder-7B

3.3 Sample‑Efficiency Optimization

By focusing on specific error types, AP2O‑Coder required only 4 %–60 % of the preference data needed by conventional DPO to reach comparable performance. The reduction was most pronounced for 32 B models, demonstrating suitability for low‑resource scenarios.

Figure: Preference-data demand on MBPP for different model sizes

3.4 General LLM Adaptation

The method also proved effective for adapting general-purpose LLMs (Qwen2.5, Qwen3, Llama 3) to code generation tasks, yielding significant pass@1 improvements on MBPP.

Figure: Pass@1 performance of general LLMs on MBPP after adaptation

04 Research Findings and Method Characteristics

Key observations include:

For Qwen2.5‑Coder, small models (≤ 3 B) benefit from a low‑frequency‑to‑high‑frequency (L2H) optimization order, which prevents early over‑fitting to common errors.

Large models (≥ 7 B) achieve better results with a high‑frequency‑to‑low‑frequency (H2L) order, leveraging their stronger learning capacity to quickly reduce overall error rates.

Paper title: AP2O-Coder: Adaptively Progressive Preference Optimization for Reducing Compilation and Runtime Errors in LLM-Generated Code

Paper link: https://arxiv.org/pdf/2510.02393

Open-source code: https://github.com/TsingZ0/AP2O
Tags: code generation, machine learning, LLM, software engineering, preference optimization, error reduction, AP2O-Coder