Who Tests When AI Generates 99% of Code? Inside a Self‑Repairing Agent Harness

The article explains how a self‑repairing Agent Harness replaces traditional QA with a loop of evaluation, triage, automated fixing, verification, and AI‑gated canary release, using a three‑judge reviewer, model‑based sampling, and six daily engineering tasks to keep AI‑driven products reliable.

AI Tech Publishing

Core Argument: Evaluation and QA Are the Same Loop

In traditional SaaS companies, model evaluation and QA are separate; the former checks answer quality, the latter ensures production stability. For AI Agent platforms these concerns converge: a bad Agent reply is both a quality metric and a bug that must be fixed.

Component 1 – Reviewer: A Real‑Traffic Three‑Judge Panel

The reviewer receives every Agent reply via an asynchronous endpoint, attaching the messageId, the threadId, and the model that served the reply (including any degraded fallback). Sampling is model‑based: the dominant production model (Sonnet 4.6) is sampled at 10 %, while all minority or experimental models (Opus, GPT, Gemini, etc.) are sampled at 100 %.
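
As a sketch, the sampling decision might be keyed on the serving model rather than raw traffic; the 10 %/100 % rates come from the article, while the function and constant names are illustrative:

```python
import random

DOMINANT_MODEL = "sonnet-4.6"   # dominant production model per the article
DOMINANT_RATE = 0.10            # 10% sampling for the dominant model
MINORITY_RATE = 1.00            # 100% sampling for all other models

def should_review(model_id: str) -> bool:
    """Decide whether a reply enters the reviewer queue.

    Sampling is keyed on the serving model, not on raw traffic, so
    minority and experimental models still get full coverage.
    """
    rate = DOMINANT_RATE if model_id == DOMINANT_MODEL else MINORITY_RATE
    return random.random() < rate
```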

Before scoring, a lightweight classifier (Task 0) routes each interaction to one of twelve domains (coding, research, data analysis, automation, etc.) so that each judge applies domain‑specific rules.
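
A minimal sketch of such a router, assuming the classifier is any cheap completion callable; the article names only four of the twelve domains, so the list below is deliberately incomplete:

```python
# Four of the twelve domains are named in the article; the rest are elided.
DOMAINS = ["coding", "research", "data_analysis", "automation"]  # + 8 unspecified

ROUTER_PROMPT = (
    "Classify the user request into exactly one of these domains: {domains}. "
    "Reply with the domain name only.\n\nRequest:\n{request}"
)

def route_domain(user_message: str, classify) -> str:
    """Task 0: route an interaction to a domain before judging.

    `classify` stands in for a lightweight model call and simply
    returns the completion text.
    """
    prompt = ROUTER_PROMPT.format(domains=", ".join(DOMAINS), request=user_message)
    return classify(prompt).strip()
```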

Three judges—one from Anthropic, one from OpenAI, and one from Google—run in parallel to mitigate self‑bias. Each judge returns a structured output with five fields: reasoning, category, quality (excellent, good, acceptable, poor), issues (nine possible categories), and confidence (0‑1).
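
The five‑field verdict might be modeled as a schema like the following; the field names and value ranges come from the article, while the concrete types are assumptions:

```python
from dataclasses import dataclass
from typing import Literal

Quality = Literal["excellent", "good", "acceptable", "poor"]

@dataclass
class JudgeVerdict:
    reasoning: str      # free-text justification for the verdict
    category: str       # domain assigned by the Task 0 router
    quality: Quality    # one of the four quality labels
    issues: list[str]   # zero or more of the nine issue categories
    confidence: float   # judge's self-reported confidence, 0.0-1.0
```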

Quality scores are mapped to a 1‑4 scale and averaged across judges, producing a continuous metric (e.g., 3.33 vs. 2.67). The averaged score is attached to the originating messageId and fed downstream.
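
The mapping and averaging step, in sketch form; assigning poor=1 through excellent=4 is an assumption consistent with the 1‑4 scale described:

```python
QUALITY_SCALE = {"poor": 1, "acceptable": 2, "good": 3, "excellent": 4}

def average_score(labels: list[str]) -> float:
    """Map each judge's quality label onto the 1-4 scale and average."""
    return sum(QUALITY_SCALE[label] for label in labels) / len(labels)

# Three judges voting ["excellent", "good", "good"] -> (4 + 3 + 3) / 3 = 3.33
assert round(average_score(["excellent", "good", "good"]), 2) == 3.33
```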

[Figure: Reviewer throughput]

Component 2 – Engineering Pipeline: Six Daily Tasks

The pipeline consumes reviewer scores, turning low‑score signals into bug reports that flow through six sequential tasks, which together replace manual QA triage, investigation, fixing, regression testing, and approval.

Task 1 – Detection & Triage: An Agent clusters low‑quality judgments, scores each cluster with a nine‑dimensional severity engine (user impact, latency, resource pressure, etc.), and forwards clusters exceeding an emergency threshold.
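
In sketch form, the severity engine could be a weighted sum over normalized dimension scores; the article names only three of the nine dimensions, and the weights and threshold below are illustrative assumptions:

```python
SEVERITY_WEIGHTS = {
    "user_impact": 0.30,
    "latency": 0.20,
    "resource_pressure": 0.10,
    # ... six more dimensions, unspecified in the source
}
EMERGENCY_THRESHOLD = 0.7  # assumed; the article gives no concrete value

def severity(dimension_scores: dict[str, float]) -> float:
    """Weighted sum over per-dimension scores normalized to 0-1."""
    return sum(SEVERITY_WEIGHTS.get(dim, 0.0) * score
               for dim, score in dimension_scores.items())

def is_emergency(dimension_scores: dict[str, float]) -> bool:
    return severity(dimension_scores) >= EMERGENCY_THRESHOLD
```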

Task 2 – Investigation: For the top three clusters, an Agent traces the monorepo, pulls CloudWatch logs, checks recent deployments and database replicas, assigns a root cause, and creates a Linear ticket with a full evidence package.

Task 3 – Automated Fix: High‑confidence, urgent issues trigger branch creation, a code change, verification, and a draft PR on GitHub. Safeguards limit the process to three PRs per run, block any diff touching .env, .github/ or IAM policies, and reject type errors or test failures.
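
A hypothetical guard for those safeguards might look like this; the path heuristics and function names are assumptions, and the type‑check and test gates are assumed to run separately:

```python
MAX_PRS_PER_RUN = 3
BLOCKED_PREFIXES = (".env", ".github/")

def diff_is_allowed(changed_paths: list[str], prs_opened_this_run: int) -> bool:
    """Gate an automated fix before a draft PR is opened."""
    if prs_opened_this_run >= MAX_PRS_PER_RUN:
        return False  # PR budget for this run is exhausted
    for path in changed_paths:
        if path.startswith(BLOCKED_PREFIXES):
            return False  # sensitive config or CI files
        if "iam" in path.lower():
            return False  # crude heuristic for IAM policy files (assumption)
    return True
```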

Task 4 – Verification: For tickets marked "in review", the system queries the past six hours of CloudWatch logs. If no failures are observed, it posts telemetry evidence as a PR comment and closes the ticket; otherwise it updates the error count and loops back.
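
A sketch of the verification query using boto3; the log group name and filter pattern are illustrative, since the article only specifies the six‑hour window and the close‑if‑clean rule:

```python
import time
import boto3

def fix_is_verified(log_group: str, error_pattern: str) -> bool:
    """Return True if the past six hours of CloudWatch logs are clean."""
    logs = boto3.client("logs")
    six_hours_ago_ms = int((time.time() - 6 * 3600) * 1000)
    response = logs.filter_log_events(
        logGroupName=log_group,
        startTime=six_hours_ago_ms,
        filterPattern=error_pattern,
    )
    # No matching events -> post telemetry evidence and close the ticket;
    # otherwise update the error count and loop back.
    return len(response.get("events", [])) == 0
```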

Task 5 – Re‑review: Within 24 hours the reviewer re‑samples the closed clusters at 100 % to ensure the problem does not recur, re‑triggering the pipeline if it does.

Task 6 – Reporting: A nightly summary is posted to Linear and the team channel, listing detected clusters, opened/closed PRs, score changes per category and a model leaderboard.

[Figure: Engineering pipeline]

Component 3 – Bridge Layer: AI‑Gated Canary Release

The bridge layer uses reviewer scores as one of the release gates. When a major Agent change lands, a small traffic slice (typically 10 %) is routed to the new variant. The reviewer scores the new variant against the production baseline in real time.

Failure: If the average score drops by ≥0.15 (p < 0.05 over ≥200 interactions) or a deterministic bug hunter detects a surge in new error clusters, the release is halted, traffic reverts to the stable version, and a Linear ticket is opened for further triage.

Maintain or Improve: Traffic is gradually expanded (5 % → 20 % → 50 % → 100 %) with the same statistical checks at each step.
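
The halt condition above might be implemented as a two‑sample test like the sketch below; the 0.15 drop, p < 0.05, and 200‑interaction floor come from the article, while Welch's t‑test is an assumption (the article does not name a test):

```python
from statistics import mean
from scipy import stats

MIN_INTERACTIONS = 200
MAX_SCORE_DROP = 0.15
ALPHA = 0.05

def canary_should_halt(baseline: list[float], candidate: list[float]) -> bool:
    """Halt the rollout if the candidate scores significantly worse."""
    if len(candidate) < MIN_INTERACTIONS:
        return False  # not enough evidence yet; keep sampling
    drop = mean(baseline) - mean(candidate)
    _, p_value = stats.ttest_ind(baseline, candidate, equal_var=False)
    return drop >= MAX_SCORE_DROP and p_value < ALPHA
```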

This approach eliminates pre‑release environments, manual approvals, and subjective PR comments, letting data‑driven signals control safety.

Harsh Truths About Running the Harness

Focus on results, not paths. Early attempts penalized unnecessary tool calls, but research shows that AI often finds non‑linear, highly effective solutions that appear odd to humans. Scoring the final output proved more robust than micromanaging the execution path.

Sample by model, not by traffic. Uniform traffic sampling would drown out minority models; model‑based sampling ensures each model receives statistically significant exposure.

A score without a ticket is meaningless. Reviewer scores must feed an engineering pipeline; otherwise they are just dashboard noise, and the pipeline without scores has no actionable signal.

New Standards

The self‑repair system is not a single feature but a loop—review, triage, fix, verify—where each component runs on model outputs. The reviewer replaces subjective human QA, the engineering pipeline replaces manual bug triage and regression testing, and the bridge layer removes the anxiety of large‑scale releases.

Founders who keep the old CI/CD + manual QA workflow while using AI to write code are still “AI‑assisted” rather than “AI‑first.” Competitive advantage belongs to teams that fuse evaluation and QA into a single harness.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: AI agents, automated testing, continuous deployment, self-repair, evaluation pipeline, AI-driven QA
Written by AI Tech Publishing

In the fast-evolving AI era, we thoroughly explain stable technical foundations.
