After Testing 225 Real Java Tasks, JetBrains Makes Codex the Default AI Agent

JetBrains evaluated 225 authentic Java development tasks across multiple repositories, comparing Codex with other AI coding agents on completion rate, latency, cost, and real user behavior, and concluded that Codex should become the default agent for Java developers.

LuTiao Programming

JetBrains recently conducted a large‑scale evaluation of AI coding agents by running 225 real‑world Java tasks drawn from 17 repositories and five organizations. Each task required the agent to modify the codebase, compile the project, and pass the associated automated tests, mirroring everyday development work.

Evaluation methodology and metrics

The assessment focused on three practical questions rather than raw model scores: can the task be completed, how long does it take, and what does it cost? For Codex (using GPT‑5.4 mini), the Java task completion rate was 43.9%, with a median latency of 124 seconds and a median cost of $0.129 per task. Another candidate posted slightly better completion and cost figures, but JetBrains chose Codex for two reasons: it performed consistently across Java, C#, and Python, and it did well in an online A/B test that measured users' willingness to keep using the agent, how often they switched agents, failure rates, and how often they fell back to ordinary chat.
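The three headline numbers can be reproduced from per-task run logs. The following is a minimal sketch, not JetBrains' actual tooling: the `TaskRun` record, its field names, and the choice to take medians over all runs (rather than only successful ones) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical record of one benchmark run; the field names are illustrative.
record TaskRun(String taskId, boolean passed, double latencySeconds, double costUsd) {}

final class AgentMetrics {
    // Share of runs whose change compiled and passed the task's tests.
    static double completionRate(List<TaskRun> runs) {
        if (runs.isEmpty()) {
            return 0.0;
        }
        long passed = runs.stream().filter(TaskRun::passed).count();
        return (double) passed / runs.size();
    }

    // Median of the values; averages the two middle values for even sizes.
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Assumption: medians are taken over all runs, including failed ones.
    static double medianLatency(List<TaskRun> runs) {
        return median(runs.stream().mapToDouble(TaskRun::latencySeconds).toArray());
    }

    static double medianCost(List<TaskRun> runs) {
        return median(runs.stream().mapToDouble(TaskRun::costUsd).toArray());
    }
}
```

Medians are less sensitive than means to a few runaway tasks, which is presumably why the evaluation reports them for latency and cost.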

Why completion rate alone is insufficient

The article illustrates the point with two hypothetical agents handling a Spring Boot interface task. Agent A finishes in ten minutes, modifies six files, and produces clean code that requires minimal review. Agent B finishes faster (eight minutes) but changes 18 files, introduces unrelated refactoring, and forces a half‑hour manual review. Both pass the tests, yet Agent A delivers higher real‑world productivity because it leaves far less work after generation.

Building a team‑specific agent test suite

Teams are encouraged to create a small internal benchmark of 20–30 representative tasks drawn from historical issues, bugs, and merged PRs. Example task specifications (shown below) include clear background, goals, constraints, and acceptance criteria, enabling consistent measurement of agent performance.

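Task background: InventoryConsumer may deduct inventory for the same order more than once when a message is delivered repeatedly.
Goal: Fix the duplicate-consumption problem so that a given order message deducts inventory successfully only once.
Constraints:
1. Do not modify the message Topic or the consumer group.
2. Do not modify the existing fields of the inventory table.
3. Do not disable MQ retries.
4. Do not make the tests pass by catching and ignoring exceptions.
5. Keep the existing interfaces compatible.
Acceptance requirements:
1. Normal messages deduct inventory successfully.
2. Repeated delivery of the same business message does not deduct inventory again.
3. MQ retries are still allowed when processing fails.
4. A duplicate-consumption test must be added.
5. Run the tests related to inventory-service.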
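To make the acceptance requirements of this sample task concrete, here is a rough sketch of behavior that would satisfy the first three. The source does not show the real InventoryConsumer, so the class below is a hypothetical stand-in: the class name, the `onMessage` signature, and the in-memory set of processed keys are all assumptions. A real service would more likely persist the deduplication key in the same transaction as the stock update.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the InventoryConsumer named in the task.
final class IdempotentInventoryConsumer {
    private final Map<String, Integer> stockBySku = new HashMap<>();
    // Assumption: keys are kept in memory; production code would persist them.
    private final Set<String> processedMessageKeys = new HashSet<>();

    IdempotentInventoryConsumer(Map<String, Integer> initialStock) {
        stockBySku.putAll(initialStock);
    }

    // Returns true if stock was deducted, false if the message was a duplicate.
    // Throws on failure WITHOUT recording the key, so a retry can still succeed.
    synchronized boolean onMessage(String messageKey, String sku, int quantity) {
        if (processedMessageKeys.contains(messageKey)) {
            return false; // duplicate delivery: do not deduct again
        }
        int available = stockBySku.getOrDefault(sku, 0);
        if (quantity <= 0 || available < quantity) {
            // The failure is surfaced rather than swallowed (constraint 4).
            throw new IllegalStateException("cannot deduct " + quantity + " of " + sku);
        }
        stockBySku.put(sku, available - quantity);
        processedMessageKeys.add(messageKey); // mark as done only after success
        return true;
    }

    synchronized int stockOf(String sku) {
        return stockBySku.getOrDefault(sku, 0);
    }
}
```

A duplicate-consumption test (requirement 4) would deliver the same message key twice and check that the stock changes only once, then deliver a failing message and check that a later retry with the same key is still processed.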

Acceptance criteria beyond "tests pass"

Successful completion must satisfy several hard rules: the project must compile, all specified tests must pass, modifications should stay within the task scope, no production configuration or secret changes are allowed, and the agent must output a summary of changes and potential risks.
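These hard rules lend themselves to an automated gate. The sketch below is one possible encoding under stated assumptions: the `RunOutcome` record, its fields, and the use of path prefixes to express task scope and protected configuration are hypothetical choices, not something the source prescribes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical summary of one agent run; all field names are illustrative.
record RunOutcome(
        boolean compiled,
        boolean specifiedTestsPassed,
        List<String> changedFiles,
        List<String> allowedPathPrefixes,   // assumption: scope expressed as path prefixes
        List<String> protectedPathPrefixes, // e.g. production config or secrets
        String changeSummary) {}

final class AcceptanceGate {
    // Lists every rule the run breaks; an empty list means the run is accepted.
    static List<String> violations(RunOutcome outcome) {
        List<String> found = new ArrayList<>();
        if (!outcome.compiled()) {
            found.add("project does not compile");
        }
        if (!outcome.specifiedTestsPassed()) {
            found.add("specified tests failed");
        }
        for (String file : outcome.changedFiles()) {
            if (outcome.protectedPathPrefixes().stream().anyMatch(file::startsWith)) {
                found.add("touches protected path: " + file);
            } else if (outcome.allowedPathPrefixes().stream().noneMatch(file::startsWith)) {
                found.add("outside task scope: " + file);
            }
        }
        String summary = outcome.changeSummary();
        if (summary == null || summary.isBlank()) {
            found.add("missing summary of changes and risks");
        }
        return found;
    }

    static boolean accepted(RunOutcome outcome) {
        return violations(outcome).isEmpty();
    }
}
```

Returning the full list of violations, rather than a single boolean, makes it easier to see why an agent run was rejected when comparing agents.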

Measuring manual effort

After an agent generates a solution, teams should record the initial completion time, number of iterations, files changed, unrelated modifications, developer correction time, whether the PR can be merged directly, and the average cost per successful task. This data reveals hidden labor that can outweigh raw speed gains.
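A small record type is enough to capture these fields and derive the figures that matter. This is a sketch under assumptions: the `EffortRecord` fields mirror the list above but their names are invented here, and "cost per successful task" is interpreted as total spend divided by the number of successes, so that failed attempts still count as cost.

```java
import java.util.List;

// Hypothetical per-task record mirroring the fields listed above.
record EffortRecord(
        String taskId,
        boolean success,
        double agentMinutes,       // initial completion time
        int iterations,
        int filesChanged,
        int unrelatedChanges,
        double correctionMinutes,  // developer time spent fixing or reviewing
        boolean mergedDirectly,
        double costUsd) {

    // Agent time plus human follow-up: the time the team actually pays.
    double totalMinutes() {
        return agentMinutes + correctionMinutes;
    }
}

final class EffortReport {
    // Assumption: total spend (including failed runs) divided by successes.
    static double costPerSuccessfulTask(List<EffortRecord> records) {
        long successes = records.stream().filter(EffortRecord::success).count();
        if (successes == 0) {
            return Double.NaN;
        }
        double totalCost = records.stream().mapToDouble(EffortRecord::costUsd).sum();
        return totalCost / successes;
    }

    // Share of all recorded tasks whose PR could be merged without changes.
    static double directMergeRate(List<EffortRecord> records) {
        if (records.isEmpty()) {
            return 0.0;
        }
        long merged = records.stream().filter(EffortRecord::mergedDirectly).count();
        return (double) merged / records.size();
    }
}
```

Applied to the earlier Agent A and Agent B scenario, and assuming purely for illustration that "minimal review" means five minutes, `totalMinutes()` gives 15 for Agent A (10 + 5) and 38 for Agent B (8 + 30), which reverses the ranking suggested by generation time alone.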

Limitations of public benchmarks

Public leaderboards (SWE‑bench, Terminal‑Bench, etc.) provide a quick view of model capabilities but often miss domain‑specific nuances such as Spring Boot, MyBatis, or Kafka usage patterns, and they cannot capture implicit business knowledge embedded in codebases.

Documenting project context with AGENTS.md

Including a concise AGENTS.md file that lists module boundaries, build commands, component responsibilities, exception handling standards, and prohibited modifications helps agents understand the project's conventions and reduces context‑related errors.

Continuous re‑evaluation

Since AI models evolve rapidly, teams should periodically rerun their internal test suite to verify that newer agent versions truly improve on completion rate, latency, cost, and developer experience. The default agent is not a permanent choice; it should be revisited as tools advance.
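A rerun is only useful if it is compared against the previous baseline. The sketch below shows one hedged way to do that: the `SuiteSummary` record and the rule "switch only if nothing regresses and completion rate improves" are assumptions for illustration, and developer experience is left out because it cannot be reduced to a single number here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical aggregate of one full run of the internal test suite.
record SuiteSummary(
        String agentVersion,
        double completionRate,
        double medianLatencySeconds,
        double medianCostUsd) {}

final class ReEvaluation {
    // Names the metrics on which the candidate is worse than the baseline.
    static List<String> regressions(SuiteSummary baseline, SuiteSummary candidate) {
        List<String> worse = new ArrayList<>();
        if (candidate.completionRate() < baseline.completionRate()) {
            worse.add("completionRate");
        }
        if (candidate.medianLatencySeconds() > baseline.medianLatencySeconds()) {
            worse.add("medianLatencySeconds");
        }
        if (candidate.medianCostUsd() > baseline.medianCostUsd()) {
            worse.add("medianCostUsd");
        }
        return worse;
    }

    // Assumed policy: no metric regresses and completion rate strictly improves.
    static boolean shouldSwitchDefault(SuiteSummary baseline, SuiteSummary candidate) {
        return regressions(baseline, candidate).isEmpty()
                && candidate.completionRate() > baseline.completionRate();
    }
}
```

A team might relax this policy, for example by accepting a small latency regression in exchange for a clear gain in completion rate; the point is to make the trade-off explicit rather than switching agents on impression alone.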

Implications for Java developers

Evaluating AI agents becomes a core engineering skill, akin to assessing databases or middleware. Developers must consider not only code generation quality but also respect for module boundaries, test reliability, change scope, and review effort.

In summary, JetBrains' decision to set Codex as the default Java agent underscores a shift from demo‑centric hype to rigorous engineering‑level evaluation, and it provides a clear roadmap for teams to build their own systematic assessments.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
