How JoyCode Agent Achieves 74.6% Pass@1 on SWE‑bench Verified with Patch‑Test Co‑generation
JoyCode Agent reaches a 74.6% pass rate on the authoritative SWE‑bench Verified benchmark, ranking in the global top‑3, and is now open‑source, showcasing a high‑efficiency, test‑driven, iterative approach to automated code repair that dramatically reduces token consumption while improving success rates.
GitHub open‑source address: https://github.com/jd-opensource/joycode-agent
Gitee open‑source address: https://gitee.com/JD-opensource/joycode-agent
JoyCode Agent demonstrates outstanding ability to solve complex programming problems. Compared with advanced solutions on the leaderboard, JoyCode Agent achieves similar performance while reducing computational resource consumption by 30%-50%. This not only shows its efficient handling of complex coding challenges but also highlights its high cost‑effectiveness and commercial value in practical applications.
Abstract
SWE‑bench is a representative benchmark for automated software‑engineering repair, requiring agents to efficiently generate and verify patches. As large language models advance, they can solve an increasing share of real‑world software engineering tasks, but prompt‑engineering methods no longer handle repository‑level repair well. To address this dilemma, this report proposes an automatic repair framework centered on “patch‑test co‑generation and iterative verification”.
Specifically, we first generate initial patches for a specific code repository in a Docker‑isolated environment and simultaneously generate two types of tests, Fail2Pass and Pass2Pass. The generated patches and tests are then validated; when all tests pass, the patch is output as the final repair result. Experiments show that this framework significantly improves patch correctness on the SWE‑bench Verified benchmark.
1. Project Background and Goals
1.1 SWE‑bench Task Overview
SWE‑bench Verified, developed by Princeton University and others, is a software‑engineering benchmark for evaluating AI systems on real‑world software problems. It collects real GitHub Issues from well‑known Python projects such as scikit‑learn, matplotlib, and requests, requiring AI models to understand the problem description, analyze the codebase structure, and generate patches that fix bugs or implement new features. Unlike traditional code‑generation tasks, SWE‑bench Verified tests comprehensive programming ability in complex environments, including multi‑file coordination, context understanding, and business‑logic handling, with evaluation based on whether the generated code passes the full test suite on the first try.
1.2 Project Challenges and Score Overview
Code‑base level understanding and cross‑file reasoning: Unlike function‑level tasks, SWE‑bench involves real projects that require global understanding of the repository and cross‑module inference and repair.
Large solution space and candidate management: The possible repair paths are numerous; efficiently generating, filtering, and integrating candidates is a core technical bottleneck.
Diverse reasoning trajectories and iteration: Single‑turn LLM inference often converges to similar solutions, lacking diversity and leading to limited exploration of correct repairs.
Automated verification and feedback loop: Repair patches need automated testing and verification to ensure they truly solve the problem without introducing new faults.
In the Pass@1 official evaluation of SWE‑bench Verified, JoyCode Agent currently achieves a pass rate of 74.6%. This score comes from the complete “patch‑test co‑generation and iterative verification” workflow: first generate Pass2Pass and Fail2Pass tests, then generate the initial patch, execute tests in a Docker environment, and for failed cases apply “trajectory compression + CSR retrieval + second‑round retry”.
2. Industry Status and Optimization Ideas
2.1 Industry Status and Pain Points
Prompt‑engineering “one‑shot generation” fails on repository‑level tasks because a single inference cannot cover code dependencies, cross‑file semantics, and historical context, leading to “semantic drift” and “fragile consistency”.
Typical symptoms: patches pass a few test cases but fail on the full suite; issue‑specific “over‑fitting” repairs.
Direct impact: large success‑rate fluctuations, poor reproducibility, and difficulty achieving stable Pass@1 improvements.
Failure only reports errors without attribution, leading to directionless retries.
Typical symptoms: the same errors recur across retries, a high proportion of retries are ineffective, and tokens are spent mainly on undirected exploration of the same error space.
Lack of experience reuse; long‑tail difficult cases have low convergence and limited coverage.
Typical symptoms: solutions to similar problems in the same repository cannot be transferred, and retries may introduce new issues.
Token consumption explosion, cost‑benefit imbalance.
Typical symptoms: token usage and latency increase rapidly, with diminishing returns per token.
Error accumulation and path dependency across multi‑turn agent runs.
Typical symptoms: early errors propagate, leading to final patches unrelated to the problem or introducing side effects.
2.2 Our Optimization Ideas and Advantages
Patch‑test co‑generation and iterative verification forms a closed loop: generate tests first, then generate patches, and iterate based on test feedback. This ensures that patches are produced with clear verification criteria.
Key advantages:
Where “one‑shot generation” fails, we treat repository‑level repair as a “generate‑verify‑correct” loop, coupling patch generation with test generation and validation.
Where failures only report errors, we introduce fine‑grained failure attribution to distinguish whether the problem lies in the patch, the test, or the environment, enabling targeted retries.
Where experience is not reused, we compress execution trajectories and retrieve similar successful cases (CSR) to provide prior knowledge for the second‑round retry.
Where token consumption explodes, we integrate test co‑generation, failure attribution, and experience transfer into a unified strategy stack, using voting only as the final arbiter.
3. Overall System Architecture
The system implements an end‑to‑end “patch‑test co‑generation and iterative verification” pipeline with the following key designs:
Test co‑generation : For a given issue, generate Fail2Pass and Pass2Pass tests, where a Fail2Pass test must fail before the repair and pass after it, while Pass2Pass tests must pass both before and after.
First‑round patch generation and container verification : In an isolated Docker environment, generate a patch based on the issue and run the generated tests; if all tests pass, the patch is accepted.
Trajectory compression + CSR retrieval : Compress the execution trajectory of the first round into a structured summary, store it in a trajectory pool, and retrieve the most similar successful trajectory for the failed case.
Second‑round retry : Combine the original failed trajectory and the retrieved successful trajectory as prior knowledge for a new patch generation; if no similar case exists, perform a “no‑experience” retry.
Decision voting : Use a Decision Agent to vote between the first‑round and second‑round patches and output the final patch.
The overall loop “Fail2Pass & Pass2Pass test constraints → first‑round patch generation & container verification → trajectory compression & CSR retrieval → second‑round retry” yields a high‑quality patch pool and reproducible engineering pipeline.
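To make this loop concrete, here is a minimal orchestration sketch in Python. It is an illustration only, not the open‑source implementation: the agent callables (testing_agent, patch_agent, csr_agent, decision_agent) and run_in_docker are hypothetical stand‑ins for the components described in Section 4.

```python
from dataclasses import dataclass

@dataclass
class PatchResult:
    diff: str          # unified-diff patch text
    trajectory: str    # raw execution log of the patch-generation run

def repair(issue, repo_path, testing_agent, patch_agent, run_in_docker,
           csr_agent, decision_agent):
    """Orchestrate one repair: tests -> patch -> verify -> attribute -> retry -> vote."""
    tests = testing_agent(issue, repo_path)                     # Fail2Pass + Pass2Pass tests
    patch_a = patch_agent(issue, repo_path, priors=None)        # first-round patch (PatchResult)
    passed, report = run_in_docker(repo_path, patch_a.diff, tests)
    if passed:
        return patch_a.diff                                     # accepted on the first round

    cause = csr_agent.attribute(issue, tests, patch_a.diff, report)  # "PATCH" or "TEST"
    if cause == "TEST":
        patch_b = patch_agent(issue, repo_path, priors=None)    # basic retry, same conditions
    else:
        priors = [csr_agent.compress(patch_a.trajectory),       # compressed failed trajectory
                  csr_agent.retrieve_similar(issue)]            # similar case (None -> no-experience retry)
        patch_b = patch_agent(issue, repo_path, priors=priors)  # experience-driven retry
    return decision_agent(issue, patch_a.diff, patch_b.diff)    # Decision Agent votes on the final patch
```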
4. Agent Structural Design
4.1 Patch Agent
Function Overview : Patch Agent follows a reactive‑agent architecture that mimics a human developer’s workflow through a continuous “observe‑think‑act” loop, dynamically adjusting its strategy based on real‑time feedback.
Observe : Parse the issue description, extract the core problem, and explore the codebase using Bash commands (ls, grep, find) to locate relevant files and reproduce the bug.
Think : Build a reasoning chain that outlines step‑by‑step actions (e.g., add a parameter, update call sites, run tests). The plan can be interleaved with tool execution, allowing dynamic adjustments.
Action : Execute the plan using Bash for environment setup and a code‑editing tool for precise modifications (insert, replace, delete). After each change, run the original test suite to verify correctness; failures feed back into the observation stage for another iteration.
When the patch passes all required validations, Patch Agent outputs a standard diff‑format patch file.
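The sketch below illustrates one way such an observe‑think‑act loop can be wired up. It is a simplified assumption rather than the actual Patch Agent code; llm_step, bash, and edit stand in for the model and the tools described next.

```python
def patch_agent_loop(issue, repo_path, llm_step, bash, edit, max_turns=30):
    """Run the observe-think-act loop until the model declares the fix finished."""
    history = [f"Issue:\n{issue}\nRepository root: {repo_path}"]
    for _ in range(max_turns):
        action = llm_step(history)                        # think: propose the next action
        if action.kind == "finish":
            break                                         # model believes the required tests now pass
        if action.kind == "bash":
            obs = bash.run(action.command)                # act: explore the repo or run tests (ls, grep, pytest)
        elif action.kind == "edit":
            obs = edit.apply(action.path, action.change)  # act: atomic insert/replace/delete
        else:
            obs = f"unknown action: {action.kind}"
        history.append(f"Action: {action}\nObservation: {obs}")  # observe: feed the result back
    return bash.run("git diff")                           # emit a standard diff (bash assumed to run in repo root)
```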
Tool Design :
Code Editing Tool : Provides atomic operations (insert, replace, delete) on files based on contextual information, ensuring minimal and clear changes.
Bash Tool : Executes shell commands for file creation, dependency installation, and test execution, enabling file operations, environment management, and test feedback collection.
Reasoning Chain Tool : Structures high‑level strategies into actionable steps and supports dynamic plan adjustments.
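As an illustration of what the atomic editing operations might look like, here is a small, self‑contained sketch. The real tool’s interface may differ (for example, it may anchor edits on surrounding context rather than line numbers); this version uses 1‑indexed line ranges purely for simplicity.

```python
from pathlib import Path

class CodeEditTool:
    """Hypothetical sketch of atomic insert/replace/delete operations on a file."""

    def _lines(self, path):
        return Path(path).read_text().splitlines(keepends=True)

    def _write(self, path, lines):
        Path(path).write_text("".join(lines))

    def insert(self, path, after_line, text):
        """Insert text after 1-indexed line `after_line` (0 inserts at the top of the file)."""
        lines = self._lines(path)
        lines[after_line:after_line] = [text if text.endswith("\n") else text + "\n"]
        self._write(path, lines)

    def replace(self, path, start, end, text):
        """Replace 1-indexed lines start..end (inclusive) with text."""
        lines = self._lines(path)
        lines[start - 1:end] = [text if text.endswith("\n") else text + "\n"]
        self._write(path, lines)

    def delete(self, path, start, end):
        """Delete 1-indexed lines start..end (inclusive)."""
        lines = self._lines(path)
        del lines[start - 1:end]
        self._write(path, lines)
```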
Input/Output :
Input: Issue (detailed problem description) and Location (path to the repository root).
Output: Patch (standard diff string).
4.2 Testing Agent
Function Overview : Testing Agent automatically generates targeted test cases for evaluating patches, creating three complementary tests:
Fail‑to‑Pass (Error Reproduction) : Should fail on the buggy code and pass after the patch.
Pass‑to‑Pass (Regression Protection) : Must pass both before and after the patch.
Pass‑to‑Pass (Edge Detection) : Also passes before and after, focusing on boundary conditions.
Generated tests undergo a “pre‑validation” step: all three tests run on the original buggy code; only if Fail‑to‑Pass fails and the two Pass‑to‑Pass tests pass are the tests considered valid.
After validation, the tests are used to evaluate patches. If a patch passes all tests, it is marked successful; otherwise, the system triggers either a basic retry (if the test is faulty) or an experience‑driven retry (if the patch is faulty).
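A minimal sketch of this pre‑validation check is shown below. The test file names and the direct pytest invocation are assumptions for illustration; the real system generates and runs the tests inside the Docker environment.

```python
import subprocess

def run_pytest(test_file: str, cwd: str) -> bool:
    """Return True if the test file passes (pytest exit code 0)."""
    result = subprocess.run(["python", "-m", "pytest", test_file, "-q"],
                            cwd=cwd, capture_output=True)
    return result.returncode == 0

def prevalidate(repo_path: str) -> bool:
    """Tests are valid only if Fail-to-Pass fails and both Pass-to-Pass tests pass on the buggy code."""
    fail_to_pass_ok = not run_pytest("test_fail_to_pass.py", repo_path)   # must FAIL before the patch
    regression_ok   = run_pytest("test_pass_to_pass.py", repo_path)       # must pass before the patch
    edge_ok         = run_pytest("test_edge_cases.py", repo_path)         # must pass before the patch
    return fail_to_pass_ok and regression_ok and edge_ok
```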
Tool Calls :
File operations: create test files with mkdir and cat.
Dependency management: install pytest if missing.
Test execution: run pytest and capture exit codes.
Design Purpose : Provides a precise, executable definition of “repair success”, enabling the system to reason about failures and guide subsequent retries.
4.3 CSR Agent
Function Overview : When a patch fails the generated tests, CSR Agent performs failure attribution, trajectory compression, similarity retrieval, and experience‑driven retry.
Trajectory Compression : After each Patch Agent run, the full execution log (observations, thoughts, tool calls) is compressed into a structured summary and stored in a trajectory pool.
Root‑Cause Decision : Using a large model, the system analyzes the issue, test cases, failed patch, and test results to decide whether the failure originates from the test or the patch.
If the test is at fault, a basic retry is triggered.
If the patch is at fault, an experience retry is launched.
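A hedged sketch of this root‑cause decision is shown below; the prompt wording and the call_llm helper are assumptions rather than the actual JoyCode Agent prompts, but the output matches the PATCH/TEST label described here.

```python
def attribute_failure(call_llm, issue, tests, patch, test_report) -> str:
    """Ask the model whether the failure is caused by the patch or by the generated tests."""
    prompt = (
        "You are reviewing an automated repair attempt.\n"
        f"Issue:\n{issue}\n\nGenerated tests:\n{tests}\n\n"
        f"Candidate patch:\n{patch}\n\nTest results:\n{test_report}\n\n"
        "Decide whether the failure is caused by the PATCH (the fix is wrong or incomplete) "
        "or by the TEST (the generated test itself is invalid). "
        "Answer with exactly one word: PATCH or TEST."
    )
    answer = call_llm(prompt).strip().upper()
    return "TEST" if "TEST" in answer else "PATCH"   # default to blaming the patch
```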
CSR Similarity Retrieval : For patch‑related failures, the agent queries the trajectory pool with the current issue to find the most similar successful case and retrieves its compressed trajectory.
Experience Retry : Patch Agent receives both its own failed compressed trajectory and the retrieved successful trajectory, performs a “dialectical learning” process (analyze failures, adopt successful strategies), and generates a new, improved patch. The two patches are then voted on by the Decision Agent.
Input/Output :
Input: Raw trajectory of the failed run, issue description, test cases, failed patch, and test results.
Output: Failure cause label (“PATCH” or “TEST”), compressed original trajectory, compressed similar successful trajectory.
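The report does not specify how trajectory similarity is computed; the sketch below assumes an embedding‑based cosine similarity over issue descriptions, with embed as a hypothetical embedding function, purely to illustrate the retrieval flow.

```python
import math

class TrajectoryPool:
    """Stores compressed successful trajectories and retrieves the most similar one for a new issue."""

    def __init__(self, embed):
        self.embed = embed            # hypothetical: text -> list[float]
        self.entries = []             # list of (issue_vector, compressed_trajectory)

    def add(self, issue: str, compressed_trajectory: str):
        self.entries.append((self.embed(issue), compressed_trajectory))

    def retrieve_similar(self, issue: str):
        """Return the compressed trajectory of the most similar successful case, or None if the pool is empty."""
        if not self.entries:
            return None               # no experience available: caller falls back to a plain retry
        query = self.embed(issue)

        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        return max(self.entries, key=lambda e: cosine(query, e[0]))[1]
```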
4.4 Decision Agent
Function Overview : Decision Agent acts as the arbiter when two candidate patches are available, selecting the optimal one based on code quality, correctness, minimality, and risk.
Scenarios :
Basic Retry : Triggered when the Testing Agent cannot generate valid tests. The system runs Patch Agent a second time under identical conditions, producing Patch B, and then votes between Patch A and Patch B.
Experience Retry : Triggered when tests are valid but the first patch fails. CSR Agent retrieves a similar successful trajectory; Patch Agent uses both the failed and successful trajectories to generate Patch B, after which Decision Agent votes between Patch A and Patch B.
Input/Output :
Input: Issue, Patch A, Patch B.
Output: solution_index (selected patch index) and basis_of_reasoning (a concise justification for the choice).
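A minimal sketch of the vote is shown below. The prompt and the JSON response contract are assumptions, but the output fields mirror the solution_index and basis_of_reasoning described above.

```python
import json

def decide(call_llm, issue, patch_a, patch_b):
    """Ask the model to pick the better patch and justify the choice."""
    prompt = (
        "Two candidate patches address the same issue. Pick the better one based on "
        "correctness, minimality of change, code quality, and risk of side effects.\n"
        f"Issue:\n{issue}\n\nPatch 0:\n{patch_a}\n\nPatch 1:\n{patch_b}\n\n"
        'Respond with JSON: {"solution_index": 0 or 1, "basis_of_reasoning": "..."}'
    )
    reply = json.loads(call_llm(prompt))            # assumes the model returns raw JSON
    chosen = [patch_a, patch_b][reply["solution_index"]]
    return chosen, reply["basis_of_reasoning"]
```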
5. Typical Workflow Example
Using the django‑16454 instance, the full workflow proceeds through four stages:
Automated Test Generation : The system initializes a Docker environment, installs dependencies, analyzes the code, and generates three valid tests (Fail‑to‑Pass, Pass‑to‑Pass, Edge). Pre‑validation confirms the test suite meets the required structure.
Patch Generation and Verification : Patch Agent explores the repository, reproduces the bug, plans a fix, applies precise code edits, and produces an initial patch. The patch fails the test suite, indicating a regression.
Retry Strategy Selection and Execution : The system compresses the failed trajectory, attributes the failure to the patch, retrieves a similar successful case via CSR, and performs an experience‑driven retry to generate a second patch.
Voting for the Optimal Patch : Decision Agent evaluates both patches and selects the second, higher‑quality patch as the final solution.
Conclusion
This technical report systematically outlines JoyCode Agent’s core architecture, innovative algorithms, and engineering optimizations for automated software repair. From deep repository‑level code understanding to a “patch‑test co‑generation and iterative verification” closed loop, and from multi‑agent collaboration to fine‑grained failure attribution and experience reuse, JoyCode Agent achieves high pass rates and significant resource savings on the SWE‑bench Verified benchmark, demonstrating the team’s deep insight and technical maturity in AI‑driven software engineering.
Looking ahead, we will continue to iterate on JoyCode Agent’s capabilities, enhance multi‑agent coordination, improve experience reuse mechanisms, and expand repository‑level repair solutions. In the next phase, we plan to open‑source the core technologies (currently undergoing external approval), deepen community collaboration, and publish related patents to further strengthen our innovation leadership.
Join our open‑source community at: https://github.com/jd-opensource/joycode-agent and https://gitee.com/JD-opensource/joycode-agent