How JoyCode Agent Reached 74.6% Pass@1 on SWE‑bench Verified with a Patch‑Test Co‑Generation Loop
This technical report details JoyCode Agent’s end‑to‑end pipeline, which couples patch generation with fail‑to‑pass and pass‑to‑pass test creation and combines trajectory compression, CSR similarity retrieval, and multi‑agent iterative retries. The result is a 74.6% Pass@1 score on the SWE‑bench Verified benchmark at 30‑50% lower compute cost.
Background
SWE‑bench Verified is a benchmark that evaluates AI agents on real‑world software‑engineering tasks such as bug fixing and feature implementation in popular Python projects. The metric of interest is Pass@1 – the ability to produce a patch that passes the full test suite on the first attempt.
Key Challenges in Repository‑Level Repair
Understanding the entire code base and reasoning across multiple files.
Exploring a huge candidate‑patch space efficiently.
Limited diversity of reasoning trajectories, leading to convergent (often sub‑optimal) solutions.
Immature automated verification and feedback loops.
Token‑budget explosion caused by repeated blind retries.
Proposed Framework: Patch–Test Co‑Generation & Iterative Verification
The core idea is to generate two complementary unit‑test families together with the patch:
Fail2Pass – a test that must fail on the buggy code and pass after the patch.
Pass2Pass – a test that must pass both before and after the patch (regression protection).
If the patch passes all generated tests, it is emitted as the final solution. Otherwise a systematic validation‑and‑retry loop determines whether the failure originates from the test or the patch and triggers a targeted regeneration.
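As a concrete illustration, the two test families might look like the sketch below for an issue reporting that a dictionary helper silently overwrites duplicate keys. Everything here is invented for illustration – the function, the bug, and the tests do not come from the benchmark or the released code:

```python
# Hypothetical example: the issue reports that adding a duplicate key
# silently overwrites instead of raising. `buggy_add` models the
# pre-patch behaviour, `patched_add` the post-patch behaviour.

def buggy_add(d, key, value):
    d[key] = value                  # silently overwrites duplicates (the bug)
    return d

def patched_add(d, key, value):
    if key in d:
        raise KeyError(f"duplicate key: {key}")  # fixed behaviour
    d[key] = value
    return d

def test_fail2pass(add=patched_add):
    """Fail2Pass: must fail on the buggy code and pass after the patch."""
    try:
        add({"a": 1}, "a", 2)
    except KeyError:
        return True                 # expected once the patch lands
    return False                    # the buggy code reaches here

def test_pass2pass(add=patched_add):
    """Pass2Pass: regression guard that holds before AND after the patch."""
    return add({}, "a", 1) == {"a": 1}
```

In the real pipeline these checks would be emitted as pytest files and executed against the repository both before and after the candidate patch, which is what distinguishes the two families.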
System Architecture
The pipeline consists of four cooperating agents:
Testing Agent: analyses the issue, generates one Fail2Pass and two Pass2Pass tests, pre‑validates them on the buggy repository, and supplies them to the Patch Agent.
Patch Agent: follows an observe‑think‑act loop, uses Bash tools for repository inspection, a code‑editing tool for precise modifications, and a reasoning chain to plan changes. It produces an initial patch and runs it inside an isolated Docker container against the generated tests.
CSR Agent: when the initial patch fails, compresses the execution trajectory into a concise “strategy / key change / insight” record, classifies the failure (test vs. patch), and performs similarity‑based retrieval (CSR) against a pool of successful trajectories to provide experience for the next retry.
Decision Agent: receives the original patch and the experience‑driven retry, evaluates them on correctness, minimality, risk, and test coverage, and votes for the optimal solution.
Key Techniques
Fail2Pass / Pass2Pass Test Generation: guarantees that a correct patch must turn a failing test into a passing one while preserving existing functionality.
Trajectory Compression: summarises the full reasoning and tool‑call log into a compact record (strategy, key change, insight) for storage.
CSR Similarity Retrieval: searches the compressed trajectory pool for the most similar successful case and returns its strategy as a prior.
Experience‑Driven Retry: combines the failure trajectory with the retrieved successful trajectory to guide a second patch generation.
Voting Arbitration: the Decision Agent selects the best patch based on logical correctness, alignment with the issue, code quality, and test coverage.
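The report does not specify the CSR similarity metric, so the sketch below uses simple Jaccard token overlap as a stand‑in; the record structure mirrors the strategy / key change / insight triple described above, but the field names and pool format are assumptions:

```python
from dataclasses import dataclass

@dataclass
class CompressedTrajectory:
    strategy: str     # high-level plan that was followed
    key_change: str   # the decisive edit
    insight: str      # lesson extracted from the run

def _tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    """Token-overlap similarity (stand-in for the unpublished CSR metric)."""
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve_similar(failure_summary, pool):
    """Return the pooled successful trajectory most similar to the failure."""
    return max(
        pool,
        key=lambda t: jaccard(
            failure_summary, f"{t.strategy} {t.key_change} {t.insight}"
        ),
    )
```

The retrieved record’s strategy is then injected into the retry prompt as a prior, which is what “experience‑driven” means in the steps below.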
Workflow Details
1. Test Generation & Pre‑validation – The Testing Agent creates three tests, runs them on the original buggy code, and ensures the expected pattern (1 Fail2Pass, 2 Pass2Pass). Only validated tests are passed forward.
2. First Patch Generation & Containerized Verification – The Patch Agent launches a Docker image that mirrors the SWE‑bench environment, inspects the repository, plans a modification sequence, applies edits with the code‑editing tool, and immediately runs the generated tests. Success ends the workflow; failure proceeds to step 3.
3. Failure Attribution & Experience Retrieval – The CSR Agent compresses the raw execution log, classifies the root cause (test‑side or patch‑side), and if the cause is patch‑side performs a similarity search over the trajectory pool. The most similar successful trajectory (strategy, key change, insight) is returned.
4. Experience‑Driven Retry – The Patch Agent receives both its own failure summary and the retrieved successful strategy, performs a “reflect‑and‑reuse” reasoning step, and generates a second patch.
5. Decision Voting – The Decision Agent compares the two candidate patches on multiple criteria (correctness, minimality, risk, code quality, test coverage) and selects the final patch for submission.
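The five steps above reduce to a single control loop. The sketch below stubs out the agent internals; the function signatures and return shapes are assumptions for illustration, not the released API:

```python
def run_pipeline(issue, repo, testing_agent, patch_agent, csr_agent, decision_agent):
    """Control flow of the five workflow steps (agent internals stubbed out)."""
    tests = testing_agent(issue, repo)                    # step 1: pre-validated tests
    patch_a, passed = patch_agent(issue, repo, tests)     # step 2: first attempt
    if passed:
        return patch_a                                    # success ends the workflow
    cause, experience = csr_agent(issue, tests, patch_a)  # step 3: attribution + retrieval
    if cause == "TEST":
        tests = testing_agent(issue, repo)                # regenerate faulty tests
    patch_b, _ = patch_agent(issue, repo, tests, experience)  # step 4: guided retry
    return decision_agent(issue, patch_a, patch_b)        # step 5: voting
```

Note that the Decision Agent votes between both candidates even if the retry passes its tests, since the criteria (minimality, risk, coverage) go beyond raw test results.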
Agent Tooling
Code Editing Tool: atomic insert/replace/delete operations on files, producing a standard diff output.
Bash Tool: executes shell commands for file discovery, dependency installation (pip install), and test execution (pytest), capturing exit codes.
Reasoning Chain Tool: structures high‑level goals into ordered action sequences, allowing interleaved tool calls and dynamic plan adjustments.
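The Bash tool’s contract – command in, exit code and captured output out – can be sketched with the standard library. The real tool’s interface is not published, so this is an assumed minimal stand‑in:

```python
import subprocess

def run_bash(cmd, cwd=None, timeout=300):
    """Minimal stand-in for the Bash tool: run a shell command and
    capture its stdout, stderr, and exit code."""
    result = subprocess.run(
        cmd,
        shell=True,
        cwd=cwd,
        timeout=timeout,
        capture_output=True,
        text=True,
    )
    return {
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }
```

Capturing the exit code is what lets the pipeline distinguish a passing test run (`pytest` returns 0) from a failing one without parsing the output.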
Input/Output Definitions
Testing Agent
Input: Issue (text description) and Location (path to repository root).
Output: three test files – Test_Fail2Pass.py, Test_Pass2Pass_A.py, Test_Pass2Pass_B.py.
Patch Agent
Input: Issue, Location, and the three test files.
Output: Patch (standard diff) and immediate test results.
CSR Agent
Input: raw trajectory log, Issue, generated tests, and the failing patch.
Output: Compressed_Trajectory, Failure_Cause (TEST or PATCH), and, if needed, Similar_Trajectory from the pool.
Decision Agent
Input: Issue, Patch_A (first attempt), Patch_B (retry).
Output: solution_index (selected patch) and basis_of_reasoning (short justification).
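These contracts can be captured as lightweight dataclasses. The field names below are inferred from the input/output lists above; the actual types in the released code may differ:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TestSuite:
    """Testing Agent output: one Fail2Pass and two Pass2Pass test files."""
    fail2pass: str        # e.g. "Test_Fail2Pass.py"
    pass2pass: List[str]  # e.g. ["Test_Pass2Pass_A.py", "Test_Pass2Pass_B.py"]

@dataclass
class PatchResult:
    """Patch Agent output: a standard diff plus immediate test results."""
    patch: str
    tests_passed: bool

@dataclass
class CSRResult:
    """CSR Agent output: compressed record, attributed cause, optional prior."""
    compressed_trajectory: str
    failure_cause: str                    # "TEST" or "PATCH"
    similar_trajectory: Optional[str] = None

@dataclass
class Decision:
    """Decision Agent output: chosen patch index and a short justification."""
    solution_index: int
    basis_of_reasoning: str
```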
Experimental Results
On the official SWE‑bench Verified Pass@1 leaderboard, the JoyCode implementation achieved a 74.6% pass rate, placing in the global Top 3. Compared with leading baselines, it reduced compute consumption by 30‑50% while delivering comparable or better patch quality.
Resource Usage
Total LLM calls per instance: 7 (testing, patch generation, trajectory compression, root‑cause decision, similarity retrieval, voting). Testing calls dominate (~70 % of calls).
Open‑Source Release
The full implementation, including Docker images, agent definitions, and the trajectory pool, is available at https://github.com/jd-opensource/joycode-agent and mirrored on Gitee at https://gitee.com/JD-opensource/joycode-agent.
Conclusion
The tightly coupled patch‑test generation loop, enriched with trajectory compression and experience‑driven retries, substantially improves both success rate and efficiency for repository‑level automated repair. Future work includes expanding the trajectory pool, refining the CSR similarity metric, and further open‑sourcing components to foster community collaboration.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.
