Do AI Code Generators Really Speed Up Development? RCT Shows 19% Slowdown

A randomized controlled trial with 16 veteran developers working on 246 real open-source issues found that using generative AI tools such as Cursor Pro with Claude actually increased average task time by 19%, contrary to participants' expectations of a 24% efficiency gain, highlighting the limitations of AI in complex, multi-module development contexts.


Research Background

In recent years, code-generation AI tools such as GitHub Copilot, ChatGPT, Claude, and Cursor have proliferated, finding use in code completion, documentation generation, test-case creation, and other development tasks. While vendors claim large efficiency gains, most evaluations rely on static benchmarks that ignore the complexity of real development scenarios.

Study Objective

This study focuses on how effective AI tools actually are in maintaining real open-source projects, in particular their impact on development-cycle acceleration and the associated risk of AI outpacing human developers before safety mechanisms mature.

Experimental Design

We conducted a randomized controlled trial (RCT) with 16 senior developers, each with years of experience in large open-source projects (22k stars on average, over one million lines of code). A total of 246 real issues were selected, covering feature development, bug fixing, and architectural refactoring.

Each issue was randomly assigned to either allow AI assistance or require manual development. When AI was permitted, developers could freely use generative AI (e.g., Cursor Pro with Claude 3.5/3.7 Sonnet). Tasks lasted about two hours; time was recorded via screen capture and participants were paid $150 per hour. Pre‑ and post‑experiment questionnaires assessed perceived difficulty and AI’s subjective efficiency impact.
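To make the assignment concrete, here is a minimal sketch of issue-level randomization under the design described above; the Issue fields, the fixed seed, and the 50/50 per-issue coin flip are illustrative assumptions, not details taken from the study.

```python
import random
from dataclasses import dataclass

@dataclass
class Issue:
    issue_id: int   # hypothetical fields, for illustration only
    title: str

def assign_conditions(issues: list[Issue], seed: int = 42) -> dict[int, str]:
    """Randomly assign each issue to 'ai_allowed' or 'manual'.

    A simple per-issue coin flip; the study's actual randomization
    scheme (e.g., any blocking by developer or difficulty) is not
    specified here and may differ.
    """
    rng = random.Random(seed)
    return {i.issue_id: rng.choice(["ai_allowed", "manual"]) for i in issues}

issues = [Issue(1, "Fix connection-pool leak"), Issue(2, "Refactor logging module")]
print(assign_conditions(issues))  # e.g., {1: 'ai_allowed', 2: 'manual'}
```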

The experiment simulated typical developer work, such as fixing a distributed‑system bug that involves cross‑module debugging, log analysis and performance optimization.

Main Findings

Using AI tools increased average task time by 19%: efficiency declined rather than improved. Participants did not notice the slowdown; before the trial they expected a 24% gain, and even afterward they still reported an approximate 20% improvement, indicating a significant perception bias.
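To see what a 19% slowdown means in practice against a 24% expected speedup, here is the arithmetic on a hypothetical two-hour task (the 120-minute baseline is an assumption for illustration, not a figure from the study):

```python
baseline_min = 120                    # assumed manual completion time
observed = baseline_min * (1 + 0.19)  # measured: +19% task time with AI
expected = baseline_min * (1 - 0.24)  # participants' pre-trial expectation

print(f"observed with AI: {observed:.1f} min")              # 142.8 min
print(f"expected with AI: {expected:.1f} min")              # 91.2 min
print(f"expectation gap:  {observed - expected:.1f} min")   # 51.6 min
```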

AI performed poorly on large-scale tasks with multiple module dependencies and complex contexts. For example, when fixing a database connection-pool leak, AI-generated code appeared reasonable but ignored concurrency considerations, leading to longer debugging cycles (see the sketch below).
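The following sketch illustrates that failure mode; it is hypothetical code, not from the study, and the class names are invented. Each method of the naive pool looks reasonable in isolation, but nothing protects the shared lists from concurrent access.

```python
import threading

class NaivePool:
    """Hypothetical connection pool with a concurrency flaw: the
    free/in_use lists are mutated without any lock."""

    def __init__(self, conns):
        self.free = list(conns)
        self.in_use = []

    def acquire(self):
        if not self.free:
            raise RuntimeError("pool exhausted")
        conn = self.free[-1]      # race: two threads can read the same
        self.free.remove(conn)    # element here before either removes it,
        self.in_use.append(conn)  # double-issuing or crashing the pool
        return conn

    def release(self, conn):
        self.in_use.remove(conn)
        self.free.append(conn)


class SafePool(NaivePool):
    """Same interface, with every state transition guarded by a lock."""

    def __init__(self, conns):
        super().__init__(conns)
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            return super().acquire()

    def release(self, conn):
        with self._lock:
            super().release(conn)
```

The acquire/release pair reads cleanly line by line; the defect only surfaces under concurrent load, which is exactly the kind of verification cost the study observed.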

Scope of Applicability

The study subjects were senior developers; results may differ for junior developers or non‑technical users, who might benefit more from simple code‑completion features.

The findings apply to large-scale open-source projects; smaller tasks or teaching scenarios may be better suited to AI assistance.

Current AI models have limited context memory and reasoning ability; future model upgrades could change these conclusions.

Analysis of Efficiency Drag

Unstable code quality: AI often produces syntactically correct but semantically flawed code, requiring extensive verification (see the sketch after this list).

Insufficient context accumulation: AI struggles to maintain understanding across files and modules, especially in micro‑service architectures.

Poor prompt engineering: Vague prompts like “optimize this code” yield irrelevant suggestions.

High debugging and validation cost: AI‑generated code may bypass established frameworks, increasing review burden.

Over‑reliance on tools: Some developers slow their own analysis by excessively tweaking prompts.
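As an illustration of the first point, here is a hypothetical example of syntactically correct but semantically flawed code: a pagination helper that parses and runs fine yet silently drops the final partial page. Both functions are invented for illustration.

```python
def paginate_flawed(items, page_size):
    """Split items into pages. Syntactically fine, semantically wrong:
    integer division truncates, so a final partial page is dropped."""
    pages = []
    for i in range(len(items) // page_size):
        pages.append(items[i * page_size:(i + 1) * page_size])
    return pages

def paginate_fixed(items, page_size):
    """Correct version: step through the list and keep the remainder."""
    return [items[i:i + page_size] for i in range(0, len(items), page_size)]

assert paginate_fixed([1, 2, 3, 4, 5], 2) == [[1, 2], [3, 4], [5]]
assert paginate_flawed([1, 2, 3, 4, 5], 2) == [[1, 2], [3, 4]]  # 5 is lost
```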

Benchmark vs. Real‑World Gap

Popular AI programming benchmarks (e.g., SWE-Bench, RE-Bench) report high success rates because they assume clear problem statements and complete context, evaluating only the resulting code diffs. Real development tasks involve ambiguous issue descriptions, cross-module collaboration, iterative debugging, and strict quality requirements, exposing AI's weaknesses in understanding problems, acting quickly, and modifying code accurately.
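A minimal sketch of two scoring styles makes the gap visible; the function names are hypothetical, and real benchmark harnesses (including SWE-Bench's) are more elaborate than this:

```python
import subprocess

def score_by_diff(candidate_patch: str, reference_patch: str) -> bool:
    """Narrow scoring: does the generated patch textually match a reference?

    Rewards mimicry of a known fix and says nothing about behavior.
    """
    return candidate_patch.strip() == reference_patch.strip()

def score_by_execution(repo_dir: str) -> bool:
    """Broader scoring: run the project's test suite on the patched tree.

    Catches behavioral regressions a diff comparison cannot see, yet
    still ignores review burden, debugging time, and code-quality norms,
    which is where the RCT found the real costs.
    """
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir,
                            capture_output=True, text=True)
    return result.returncode == 0
```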

Insights and Future Plans

The study provides a realistic perspective on AI’s contribution to development and informs risk assessment of AI‑driven R&D acceleration. Future work will:

Track AI model version evolution and reassess performance in real scenarios.

Expand participants to include junior developers and additional task types such as front‑end development and algorithm optimization.

Build an automated evaluation framework based on real pull‑request data to improve result traceability.

Explore efficient prompt writing and AI‑assisted code review strategies to better embed AI into development workflows.

Continued research aims to offer clearer guidance for the responsible use of AI tools, helping developers unlock AI’s potential in complex engineering projects.

Tags: AI Code Generation, software engineering, development efficiency, RCT, open-source projects