How JoyCode Agent Reached 74.6% Pass@1 on SWE‑bench Verified with a Patch‑Test Co‑Generation Loop

This technical report details JoyCode Agent's end‑to‑end pipeline, which couples patch generation with fail‑to‑pass and pass‑to‑pass test creation and combines trajectory compression, CSR similarity retrieval, and multi‑agent iterative retries to achieve 74.6% Pass@1 on the SWE‑bench Verified benchmark while cutting compute costs by 30‑50%.

Background

SWE‑bench Verified is a benchmark that evaluates AI agents on real‑world software‑engineering tasks such as bug fixing and feature implementation in popular Python projects. The metric of interest is Pass@1 – the fraction of instances for which the first (and only) submitted patch passes the full test suite.
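
Since each instance gets exactly one attempt, Pass@1 reduces to the resolved fraction of the benchmark:

```latex
% Pass@1 over N benchmark instances; r_i = 1 if the single submitted
% patch for instance i passes the full test suite, and 0 otherwise.
\mathrm{Pass@1} = \frac{1}{N} \sum_{i=1}^{N} r_i
```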

Key Challenges in Repository‑Level Repair

Understanding the entire code base and reasoning across multiple files.

Exploring a huge candidate‑patch space efficiently.

Limited diversity of reasoning trajectories, leading to convergent (often sub‑optimal) solutions.

Immature automated verification and feedback loops.

Token‑budget explosion caused by repeated blind retries.

Proposed Framework: Patch–Test Co‑Generation & Iterative Verification

The core idea is to generate two complementary unit‑test families together with the patch:

Fail2Pass – a test that must fail on the buggy code and pass after the patch.

Pass2Pass – a test that must pass both before and after the patch (regression protection).

If the patch passes all generated tests, it is emitted as the final solution. Otherwise a systematic validation‑and‑retry loop determines whether the failure originates from the test or the patch and triggers a targeted regeneration.
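
For illustration, a Fail2Pass/Pass2Pass set for a hypothetical Python issue might look like the sketch below; the mylib.text module, the slugify function, and the issue itself are invented for the example, not taken from the JoyCode release.

```python
# Hypothetical tests for an invented issue: "slugify drops non-ASCII input".
# The module and function names are illustrative only.
from mylib.text import slugify


def test_fail2pass_unicode_preserved():
    # Fail2Pass: fails on the buggy code, must pass once the patch lands.
    assert slugify("Crème brûlée") == "creme-brulee"


def test_pass2pass_ascii_unchanged():
    # Pass2Pass: passes both before and after the patch (regression guard).
    assert slugify("Hello World") == "hello-world"


def test_pass2pass_empty_string():
    # Pass2Pass: existing edge-case behaviour must be preserved.
    assert slugify("") == ""
```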

System Architecture

The pipeline consists of four cooperating agents (a control‑flow sketch follows the list):

Testing Agent: analyses the issue, generates one Fail2Pass and two Pass2Pass tests, pre‑validates them on the buggy repository, and supplies them to the Patch Agent.

Patch Agent: follows an observe‑think‑act loop, uses Bash tools for repository inspection, a code‑editing tool for precise modifications, and a reasoning chain to plan changes. It produces an initial patch and runs it inside an isolated Docker container against the generated tests.

CSR Agent: when the initial patch fails, compresses the execution trajectory into a concise “strategy / key change / insight” record, classifies the failure (test vs. patch), and performs similarity‑based retrieval (CSR) against a pool of successful trajectories to provide experience for the next retry.

Decision Agent: receives the original patch and the experience‑driven retry, evaluates them on correctness, minimality, risk, and test coverage, and votes for the optimal solution.
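
Assuming each agent exposes a simple programmatic interface, the overall control flow can be sketched as follows; every method name here is illustrative, not the actual JoyCode API.

```python
# Control-flow sketch of the four-agent loop. The agent objects are
# assumed to expose the methods used here; all names are illustrative.

def run_instance(issue: str, repo: str, testing, patcher, csr, decider) -> str:
    # Testing Agent: one Fail2Pass + two Pass2Pass tests, pre-validated
    # against the buggy repository.
    tests = testing.generate_tests(issue, repo)

    # Patch Agent: first patch, verified inside an isolated container.
    patch_a, passed = patcher.generate_and_run(issue, repo, tests)
    if passed:
        return patch_a  # all generated tests pass: emit as the final solution

    # CSR Agent: compress the trajectory and attribute the failure.
    record = csr.compress(patcher.last_trajectory)
    if csr.attribute(record) == "TEST":
        tests = testing.generate_tests(issue, repo)  # test-side failure
        prior = None
    else:
        prior = csr.retrieve_similar(record)         # patch-side failure

    # Experience-driven retry, then the Decision Agent votes.
    patch_b, _ = patcher.generate_and_run(issue, repo, tests, experience=prior)
    return decider.vote(issue, patch_a, patch_b)
```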

Key Techniques

Fail2Pass / Pass2Pass Test Generation: guarantees that a correct patch must turn a failing test into a passing one while preserving existing functionality.

Trajectory Compression: summarises the full reasoning and tool‑call log into a compact record (strategy, key change, insight) for storage.

CSR Similarity Retrieval: searches the compressed trajectory pool for the most similar successful case and returns its strategy as a prior (see the sketch after this list).

Experience‑Driven Retry: combines the failure trajectory with the retrieved successful trajectory to guide a second patch generation.

Voting Arbitration: the Decision Agent selects the best patch based on logical correctness, alignment with the issue, code quality, and test coverage.
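
The report does not specify how CSR similarity is computed; the sketch below shows the three‑field compressed record and stands in a bag‑of‑words cosine similarity for the retrieval step.

```python
# Sketch of the compressed-trajectory record and a similarity lookup over
# the pool. The similarity measure is an assumption: cosine similarity
# over bag-of-words vectors stands in for the unspecified CSR metric.
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class CompressedTrajectory:
    strategy: str    # e.g. "narrow the fix to the tokenizer's escape handling"
    key_change: str  # e.g. "guard against an empty match in the regex loop"
    insight: str     # e.g. "the failing test exercised an unhandled edge case"


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_similar(query: CompressedTrajectory,
                     pool: list[CompressedTrajectory]) -> CompressedTrajectory:
    """Return the pool entry most similar to the failing trajectory."""
    def vec(t: CompressedTrajectory) -> Counter:
        return Counter(f"{t.strategy} {t.key_change} {t.insight}".lower().split())
    q = vec(query)
    return max(pool, key=lambda t: _cosine(q, vec(t)))
```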

Workflow Details

1. Test Generation & Pre‑validation – The Testing Agent creates three tests, runs them on the original buggy code, and confirms the expected pre‑patch pattern: the Fail2Pass test fails and both Pass2Pass tests pass (a pre‑validation sketch follows this list). Only validated tests are passed forward.

2. First Patch Generation & Containerized Verification – The Patch Agent launches a Docker image that mirrors the SWE‑bench environment, inspects the repository, plans a modification sequence, applies edits with the code‑editing tool, and immediately runs the generated tests. Success ends the workflow; failure proceeds to step 3.

3. Failure Attribution & Experience Retrieval – The CSR Agent compresses the raw execution log, classifies the root cause (test‑side or patch‑side), and if the cause is patch‑side performs a similarity search over the trajectory pool. The most similar successful trajectory (strategy, key change, insight) is returned.

4. Experience‑Driven Retry – The Patch Agent receives both its own failure summary and the retrieved successful strategy, performs a “reflect‑and‑reuse” reasoning step, and generates a second patch.

5. Decision Voting – The Decision Agent compares the two candidate patches on multiple criteria (correctness, minimality, risk, code quality, test coverage) and selects the final patch for submission.
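
A minimal sketch of the step‑1 pre‑validation, assuming the generated tests are plain pytest files and that pytest's exit code (0 = all tests passed) is the validation signal:

```python
# Pre-validation: run each generated test against the buggy checkout and
# confirm the expected pattern (Fail2Pass fails, both Pass2Pass pass).
import subprocess


def prevalidate(repo: str, fail2pass: str, pass2pass: list[str]) -> bool:
    def pytest_exit_code(test_file: str) -> int:
        # pytest returns 0 when all tests pass, non-zero otherwise.
        return subprocess.run(
            ["python", "-m", "pytest", test_file, "-q"],
            cwd=repo, capture_output=True,
        ).returncode

    fail2pass_fails = pytest_exit_code(fail2pass) != 0
    pass2pass_pass = all(pytest_exit_code(t) == 0 for t in pass2pass)
    return fail2pass_fails and pass2pass_pass
```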

Agent Tooling

Code Editing Tool: atomic insert/replace/delete operations on files, producing a standard diff output.

Bash Tool: executes shell commands for file discovery, dependency installation (pip install), and test execution (pytest), capturing exit codes.

Reasoning Chain Tool: structures high‑level goals into ordered action sequences, allowing interleaved tool calls and dynamic plan adjustments.
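
A Bash tool of this kind can be approximated with a thin subprocess wrapper; the sketch below is an assumption about its shape, not the released implementation.

```python
# Minimal Bash-tool sketch: run a shell command and capture stdout,
# stderr, and the exit code for the agent to reason over.
import subprocess
from dataclasses import dataclass


@dataclass
class ShellResult:
    command: str
    exit_code: int
    stdout: str
    stderr: str


def run_shell(command: str, cwd: str, timeout: int = 300) -> ShellResult:
    proc = subprocess.run(
        command, shell=True, cwd=cwd, timeout=timeout,
        capture_output=True, text=True,
    )
    return ShellResult(command, proc.returncode, proc.stdout, proc.stderr)


# Example: run_shell("python -m pytest Test_Fail2Pass.py -q", cwd="/repo")
```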

Input/Output Definitions

Testing Agent

Input: Issue (text description) and Location (path to repository root).

Output: three test files – Test_Fail2Pass.py, Test_Pass2Pass_A.py, Test_Pass2Pass_B.py.

Patch Agent

Input: Issue, Location, and the three test files.

Output: Patch (standard diff) and immediate test results.

CSR Agent

Input: raw trajectory log, Issue, generated tests, and the failing patch.

Output: Compressed_Trajectory, Failure_Cause (TEST or PATCH), and, if needed, Similar_Trajectory from the pool.

Decision Agent

Input: Issue, Patch_A (first attempt), Patch_B (retry).

Output: solution_index (selected patch) and basis_of_reasoning (short justification).
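
These interfaces map naturally onto typed records. The field names below follow the article; the types and the dataclass packaging are assumptions.

```python
# Typed sketch of the agent I/O listed above. Field names follow the
# article; types and container classes are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class TestingInput:
    issue: str      # text description of the issue
    location: str   # path to the repository root


@dataclass
class TestingOutput:
    test_fail2pass: str     # path to Test_Fail2Pass.py
    test_pass2pass_a: str   # path to Test_Pass2Pass_A.py
    test_pass2pass_b: str   # path to Test_Pass2Pass_B.py


@dataclass
class CSROutput:
    compressed_trajectory: str
    failure_cause: str               # "TEST" or "PATCH"
    similar_trajectory: str | None   # retrieved from the pool when patch-side


@dataclass
class DecisionOutput:
    solution_index: int        # selected patch (first attempt or retry)
    basis_of_reasoning: str    # short justification
```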

Experimental Results

On the official SWE‑bench Verified Pass@1 leaderboard, the JoyCode implementation achieved a 74.6% pass rate, placing in the global Top‑3. Compared with leading baselines, it reduced compute consumption by 30‑50% while delivering comparable or better patch quality.

Resource Usage

Total LLM calls per instance: 7 (testing, patch generation, trajectory compression, root‑cause decision, similarity retrieval, voting). Testing calls dominate (~70% of calls).

Open‑Source Release

The full implementation, including Docker images, agent definitions, and the trajectory pool, is available at https://github.com/jd-opensource/joycode-agent and mirrored on Gitee at https://gitee.com/JD-opensource/joycode-agent.

Conclusion

The tightly coupled patch‑test generation loop, enriched with trajectory compression and experience‑driven retries, substantially improves both success rate and efficiency for repository‑level automated repair. Future work includes expanding the trajectory pool, refining the CSR similarity metric, and further open‑sourcing components to foster community collaboration.

