How JoyCode Agent Scored 74.6% Pass@1 on SWE‑bench Verified with a Patch‑Test Co‑generation Loop

JoyCode Agent achieves a 74.6% Pass@1 score on the SWE‑bench Verified benchmark with a patch‑test co‑generation and iterative‑verification framework: a closed‑loop multi‑agent pipeline that integrates test generation, patch generation, trajectory compression, similarity retrieval, and decision arbitration, while reducing resource consumption by 30‑50% compared with leading baselines.


Background and Goal

SWE‑bench Verified is a widely used benchmark for evaluating AI systems on real‑world software‑engineering tasks. It requires agents to understand issue descriptions, analyse codebases, and generate patches that pass the full test suite. Existing prompt‑engineering approaches struggle with repository‑level repairs, prompting the authors to propose a new framework centered on "patch–test co‑generation and iterative verification".

Project Challenges and Scores

Codebase-level understanding and cross-file reasoning are required, going far beyond function-level tasks.

The search space of candidate patches is huge, making efficient generation, filtering and integration difficult.

LLM reasoning tends to converge to similar solutions, limiting exploration of diverse fixes.

Automated verification and feedback loops are still immature, leaving much room for improvement.

Using the proposed framework, JoyCode Agent achieved a 74.6% Pass@1 rate on the official SWE‑bench Verified evaluation, a result that comes from the full "patch–test co‑generation → first‑round patch generation & container verification → trajectory compression & CSR retrieval → second‑round retry" pipeline. Compared with leading baselines, the approach reduces compute consumption by 30‑50% while delivering higher correctness.

Industry Pain Points

2.1 Prompt‑engineering "one‑shot" generation fails on repository‑level tasks

Single‑round LLM inference cannot cover complex dependencies, cross‑file semantics, and historical context, leading to semantic drift and fragile consistency.

Typical symptoms include patches that pass a few tests but fail on the full suite, or that overfit to the specific issue.

Consequences are unstable success rates and poor reproducibility.

2.2 Failure reports say only "error", with no attribution

Lack of systematic failure modelling makes it impossible to distinguish between patch logic errors, tool errors, or environment issues, resulting in blind retries.

Symptoms include repeated identical errors and a high proportion of ineffective re-runs, with tokens wasted exploring within the same error distribution.

2.3 Missing experience reuse for long‑tail cases

Most systems treat each instance as independent, without building transferable representations of successful strategies or failure patterns.

This leads to repeated exploration of similar dead‑ends in large projects such as Django or scikit‑learn.

2.4 Token explosion and cost‑benefit imbalance

Repeated independent runs, exhaustive sampling, and unfiltered voting cause token usage to grow rapidly, reducing marginal returns.

2.5 Multi‑round agent error accumulation and candidate selection

Long chains of agent actions amplify early mistakes, making later voting ineffective if the candidate pool is already low‑quality.

Our Optimization Approach

2.2.1 Solving the "one‑shot" failure

Recognise repository‑level repair as a "generate‑verify‑revise" closed loop rather than a single generation step.

Couple patch generation with Fail2Pass and Pass2Pass test creation, turning the process into a test‑centric feedback loop.

Replace naïve single‑shot sampling with an iterative, evidence‑driven pipeline.

2.2.2 Precise failure attribution

Introduce a failure‑attribution module that classifies the root cause as either test‑related or patch‑related.

Based on the attribution, trigger either a basic retry (test issue) or an experience‑driven retry (patch issue), dramatically improving convergence and token efficiency.
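As a minimal illustration of this dispatch (not the actual JoyCode implementation), the mapping from attributed cause to retry mode can be expressed as follows; in the described system the TEST/PATCH label comes from the CSR Agent's LLM-based root-cause decision, so the classification itself is not shown here:

```python
# Minimal sketch of the retry dispatch. The TEST / PATCH labels follow the
# article; the function name and return values are hypothetical.

def dispatch_retry(cause: str) -> str:
    """Map the attributed root cause of a failed run to a retry mode."""
    if cause == "TEST":
        # Generated tests were invalid: regenerate the patch with identical inputs.
        return "basic_retry"
    if cause == "PATCH":
        # Tests were sound but the patch logic was wrong: CSR retrieval + guided retry.
        return "experience_retry"
    raise ValueError(f"unknown failure cause: {cause}")
```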

2.2.3 Experience reuse via trajectory compression and CSR retrieval

Compress the execution trajectory of each first‑round run into a structured summary (strategy, key changes, highlights).

Store compressed trajectories in a global pool.

When a patch fails, retrieve the most similar successful trajectory from the pool (CSR – case‑based similarity retrieval) and inject its strategy as prior knowledge for the second‑round retry.
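A possible shape for such a compressed trajectory is sketched below. The field names mirror the "strategy, key changes, highlights" description in the article; the exact schema is an assumption.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one entry in the global trajectory pool.
@dataclass
class CompressedTrajectory:
    issue_summary: str                                     # what the instance was about
    strategy: str                                          # repair approach that was taken
    key_changes: list[str] = field(default_factory=list)   # files/functions that were touched
    highlights: list[str] = field(default_factory=list)    # pitfalls and decisive checks
    succeeded: bool = True                                  # only successful runs are retrieved

# Global pool queried by CSR similarity retrieval during second-round retries.
trajectory_pool: list[CompressedTrajectory] = []
```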

2.2.4 Controlling token consumption

Guide retries with targeted test generation and failure attribution, preventing blind sampling.

Integrate test‑co‑generation, failure attribution, and experience reuse into a single strategy stack, using voting only as a final safeguard.

2.2.5 Structured voting after verification

Place voting after the full verification‑attribution‑retry cycle, ensuring that only high‑quality candidates are considered, reducing wasted resources.

System Architecture

3.1 Overall Pipeline

The end‑to‑end pipeline consists of four core stages:

Test co‑generation: For each issue, generate Fail2Pass and Pass2Pass unit tests and pre‑validate them on the original buggy code.

First‑round patch generation & container verification: Run the generated tests inside an isolated Docker environment; if all tests pass, output the patch.

Trajectory compression & CSR retrieval: Compress the first‑round execution trace, store it, and retrieve a similar successful trajectory when needed.

Second‑round retry: Use the retrieved experience (or a basic retry) to guide a new patch generation, then let the Decision Agent vote between the two patches.

The pipeline yields a high‑quality patch pool that is reproducible and observable.
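The sketch below wires the four stages together to make the control flow concrete. It is illustrative only: the agent interfaces (generate_tests, generate_patch, run_in_container, compress, retrieve_similar, vote) are hypothetical callables standing in for the Testing, Patch, CSR, and Decision Agents, not the actual JoyCode API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestResult:
    all_passed: bool
    failure_log: str = ""

def repair_issue(issue: str, repo_path: str, trajectory_pool: list,
                 generate_tests: Callable, generate_patch: Callable,
                 run_in_container: Callable, compress: Callable,
                 retrieve_similar: Callable, vote: Callable) -> str:
    """One pass of the patch-test co-generation loop (illustrative control flow)."""
    # Stage 1: generate FAIL2PASS / PASS2PASS tests, pre-validated on the buggy code.
    tests = generate_tests(issue, repo_path)

    # Stage 2: first-round patch generation and Docker-based verification.
    patch, trajectory = generate_patch(issue, repo_path)
    result: TestResult = run_in_container(repo_path, patch, tests)
    if result.all_passed:
        trajectory_pool.append(compress(trajectory))        # store reusable experience
        return patch

    # Stage 3: compress the failed run and fetch the closest successful case.
    failed_summary = compress(trajectory)
    similar_case = retrieve_similar(issue, trajectory_pool)

    # Stage 4: experience-guided retry, then Decision Agent arbitration.
    retry_patch, _ = generate_patch(issue, repo_path,
                                    hints=(failed_summary, similar_case))
    return vote(issue, candidates=[patch, retry_patch])
```

In practice the retry mode also depends on the failure attribution described earlier (basic retry for test issues, experience retry for patch issues); the sketch shows only the experience-driven path.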

3.2 Interaction Strategy and Workflow

Four specialized agents cooperate:

Testing Agent: Builds a three-test matrix (FAIL2PASS, PASS2PASS, edge) and pre-validates every test on the original buggy code.

Patch Agent: Generates patches using an observe‑think‑act loop, leveraging code‑editing and Bash tools.

CSR Agent: Performs trajectory compression, root‑cause decision, similarity retrieval, and supplies experience for retries.

Decision Agent: Arbitrates between the initial and retried patches, outputting the selected index and a concise reasoning summary.

The agents exchange Issue, Location, generated tests, and patches, forming a fully automated repair loop.

Agent Designs

4.1 Patch Agent

Function: Implements a reactive‑agent architecture that continuously cycles through observation (issue parsing, code exploration), thinking (strategy planning via a reasoning chain), and action (code editing, Bash execution). It produces a standard diff‑format patch.

Tools:

Code‑editing tool for precise insert/replace/delete operations.

Bash tool for filesystem queries, dependency installation, and test execution.

Reasoning‑chain tool that decomposes high‑level goals into concrete action sequences.

Input/Output: Inputs are Issue (text) and Location (repo root). Output is a diff‑style Patch string ready for version‑control integration.
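A rough sketch of that observe-think-act cycle is shown below, assuming a hypothetical planner object (llm.next_action) and simplified in-place file editing; the real tool interfaces are not published in this form.

```python
import subprocess

def run_bash(command: str, cwd: str) -> str:
    """Execute a shell command inside the repository and return its output."""
    proc = subprocess.run(command, shell=True, cwd=cwd, capture_output=True, text=True)
    return proc.stdout + proc.stderr

def patch_agent(issue: str, repo_root: str, llm, max_steps: int = 20) -> str:
    """Cycle observe -> think -> act until the planner decides the fix is complete."""
    observations = [issue]
    for _ in range(max_steps):
        # Think: the reasoning chain turns the goal plus observations into one action.
        action = llm.next_action(observations)

        if action.kind == "bash":
            # Observe: filesystem queries, dependency installs, test runs.
            observations.append(run_bash(action.command, cwd=repo_root))
        elif action.kind == "edit":
            # Act: precise replacement within a single file.
            path = f"{repo_root}/{action.file}"
            with open(path, encoding="utf-8") as fh:
                source = fh.read()
            with open(path, "w", encoding="utf-8") as fh:
                fh.write(source.replace(action.old_text, action.new_text))
        elif action.kind == "finish":
            break

    # Emit the accumulated changes as a standard diff-format patch.
    return run_bash("git diff", cwd=repo_root)
```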

4.2 Testing Agent

Function: Automatically creates three complementary unit tests for each issue:

FAIL2PASS: Must fail on the buggy code and pass after the patch.

PASS2PASS (regression protection): Must pass both before and after the patch.

PASS2PASS (edge detection): Focuses on boundary conditions to ensure robustness.

All tests are pre‑validated on the original code; only when the matrix satisfies the required pattern (one FAIL2PASS, two PASS2PASS) are they considered valid for patch evaluation.

Tool usage: Uses Bash to create files, install dependencies, and run pytest, capturing exit codes for automated feedback.
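A minimal version of that pre-validation check, assuming each generated test lives in its own pytest file and using the pytest exit code as the pass/fail signal, might look like this:

```python
import subprocess

def run_pytest(test_path: str, repo_root: str) -> int:
    """Run one generated test file on the current (still buggy) code; return the exit code."""
    return subprocess.run(
        ["python", "-m", "pytest", test_path, "-x", "-q"],
        cwd=repo_root, capture_output=True,
    ).returncode

def matrix_is_valid(fail2pass: str, regression: str, edge: str, repo_root: str) -> bool:
    """Valid only if FAIL2PASS reproduces the bug and both PASS2PASS tests already pass."""
    return (
        run_pytest(fail2pass, repo_root) != 0       # must fail on the buggy code
        and run_pytest(regression, repo_root) == 0  # regression guard passes before the patch
        and run_pytest(edge, repo_root) == 0        # edge-case test passes before the patch
    )
```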

4.3 CSR Agent

Functions:

Trajectory compression: Summarises the raw execution log of a first‑round Patch Agent run into a concise, structured abstract (strategy, key changes, highlights) and stores it in a global pool.

Root‑cause decision: Analyses Issue, generated tests, patch, and test results to label the failure as either TEST or PATCH.

CSR similarity retrieval: When the failure is PATCH‑related, queries the trajectory pool with the current Issue to fetch the most similar successful compressed trajectory.

Experience retry: Supplies both the failed trajectory and the retrieved successful trajectory to the Patch Agent, which then performs a guided re‑generation.

Outputs include the compressed trajectory, failure cause label, and the retrieved similar trajectory.
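The article does not specify how similarity is computed, so the sketch below uses simple token overlap between the current Issue and stored summaries as a stand-in; a production system would more plausibly use an embedding index.

```python
# Toy stand-in for CSR similarity retrieval over the global trajectory pool.
# Each pool entry is assumed to be a dict with "issue_summary", "strategy",
# and "succeeded" keys.

def _tokens(text: str) -> set:
    return set(text.lower().split())

def retrieve_similar(issue: str, pool: list) -> dict | None:
    """Return the most similar successful trajectory summary, or None if none exist."""
    candidates = [t for t in pool if t.get("succeeded")]
    if not candidates:
        return None
    issue_tokens = _tokens(issue)

    def jaccard(entry: dict) -> float:
        entry_tokens = _tokens(entry["issue_summary"] + " " + entry["strategy"])
        union = issue_tokens | entry_tokens
        return len(issue_tokens & entry_tokens) / len(union) if union else 0.0

    return max(candidates, key=jaccard)
```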

4.4 Decision Agent

Function: Acts as the final arbitrator. When two candidate patches exist (initial and retried), it evaluates them on criteria such as logical correctness, alignment with the issue, minimality, code quality, and boundary coverage, then votes for the superior one.

Modes:

Basic retry: Triggered when test generation fails or is invalid; the system re-runs the Patch Agent with identical inputs and votes between the two patches.

Experience retry: Triggered when tests are valid but the first patch fails; the system invokes CSR retrieval and experience‑driven re‑generation before voting.

Output: Returns solution_index (selected patch) and basis_of_reasoning (a concise justification).
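A hedged sketch of that arbitration call is below. The judging criteria and the output fields (solution_index, basis_of_reasoning) follow the article; the prompt wording and the llm_complete helper are assumptions.

```python
import json

# Illustrative prompt; the criteria mirror the article's list, the wording is ours.
DECISION_PROMPT = """Given the issue and two candidate patches, pick the better one.
Judge by: logical correctness, alignment with the issue, minimality,
code quality, and boundary coverage.
Respond as JSON: {{"solution_index": 0 or 1, "basis_of_reasoning": "..."}}

Issue:
{issue}

Patch 0:
{patch_0}

Patch 1:
{patch_1}
"""

def decide(issue: str, patch_0: str, patch_1: str, llm_complete) -> dict:
    """Ask an LLM (via the hypothetical llm_complete callable) to arbitrate."""
    raw = llm_complete(DECISION_PROMPT.format(issue=issue, patch_0=patch_0, patch_1=patch_1))
    verdict = json.loads(raw)
    assert verdict["solution_index"] in (0, 1)
    return verdict
```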

Typical Workflow Example (django‑16454)

The authors walk through a real instance from the SWE‑bench dataset:

Test generation: Issue text is extracted, a Docker container is prepared, and three tests (FAIL2PASS, PASS2PASS, edge) are generated and pre‑validated.

First patch attempt: Patch Agent analyses the code, creates a patch, and runs the tests; the patch fails both the original bug test and a regression test.

Failure attribution & CSR: The system compresses the execution trace, determines the failure is PATCH‑related, and retrieves a similar successful trajectory from the pool.

Experience retry: Using both the failed and successful trajectories, Patch Agent generates a refined patch.

Voting: Decision Agent compares the two patches and selects the second, higher‑quality patch as the final solution.

This example demonstrates the full closed‑loop: test co‑generation, first‑round verification, trajectory‑based knowledge reuse, guided retry, and final arbitration.

Conclusion

By combining patch‑test co‑generation, trajectory compression, CSR‑based experience reuse, and a dedicated decision arbitrator, JoyCode Agent delivers a 74.6% Pass@1 rate on SWE‑bench Verified while cutting compute usage by 30‑50%. The system is fully open‑source, with repositories on GitHub and Gitee, and the authors plan further open‑source releases, community collaborations, and patent filings to advance AI‑driven software repair.
