ml-evolve: Multi‑Agent Self‑Evolving System Built on Real‑World ML Pitfalls

ml-evolve addresses the shortcomings of generic agent‑search frameworks for machine‑learning pipelines by introducing four specialized agents, staged data gating, and cost‑saving mechanisms, and demonstrates its advantages with a two‑tower retrieval case study and concrete performance metrics.

DeepHub IMBA

Jun 5, 2026

Why ML needs its own agent system

Generic agent‑search frameworks (AlphaEvolve, OpenEvolve, AutoResearch) work well for code‑level tasks where each candidate is evaluated in seconds. ML training is different: a run takes minutes to hours, must be evaluated on real data, spans architecture, feature pipelines, and sampling strategies, and consumes a large compute budget. A generic loop therefore breaks down quickly. For example, an LLM can write a sorting function in a few tries but cannot improve a recommendation‑system architecture even after many iterations.

Four pain points and corresponding agents

We enumerate failure modes and design a minimal topology:

1. A single LLM cannot handle both architecture reasoning and numeric optimization. Solution: split the work between a Mutation Agent, which rewrites the EVOLVE block and exposes numeric knobs as parameters (without picking their values), and a Param‑Search Agent, which runs Optuna TPE on the frozen architecture (e.g., 12 trials, 5 of them startup trials, on pilot data).

2. Search collapses into shallow fine‑tuning. Solution: add a Plan Agent that controls breadth. It reads the leaderboard, can fetch information from the web, assigns a hypothesis to each slot, and is woken every K rounds to rewrite the research plan.

3. Training cost dwarfs every other stage (by 100 to 10,000×). Solution: three‑stage gating, with search on ~10% pilot data, promotion on full data, and a final stage used only for reporting. Example: a 3‑node × 15‑iteration × 12‑trial job is 540 trials. At about 40 minutes per full‑data trial (the rate implied by the 360 GPU‑hour total) it would need 360 GPU‑hours on full data; after gating the search uses ~36 GPU‑hours, a ten‑fold saving.

4. A noisy pilot leaderboard misleads the loop. Solution: enforce explicit rules. Param‑search uses tpe_startup_trials to warm‑start TPE, the controller promotes only the top‑K pilot candidates and re‑evaluates them on full data, the plan agent monitors branch health, and every score is annotated with the stage that produced it.

Self‑evolution loop

The agents are arranged in a directed graph:

┌──────────────────────────────────────────┐
│ PLAN AGENT (algorithm breadth)           │
│ reads leaderboard + branch health        │
│ + optional web search; each node         │
│ gets a hypothesis                        │
└─────────────────────┬────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────┐
│ MUTATION AGENT (architecture)            │
│ rewrites EVOLVE block; parameterizes     │
│ numeric knobs (does not select values)   │
└─────────────────────┬────────────────────┘
                      │ frozen architecture
                      ▼
┌──────────────────────────────────────────┐
│ PARAM‑SEARCH AGENT (numeric)             │
│ Optuna TPE, N trials on PILOT (10% data) │
└─────────────────────┬────────────────────┘
                      │ best score
                      ▼
┌──────────────────────────────────────────┐
│ PROMOTION (every K rounds)               │
│ top‑K pilot → full data re‑eval;         │
│ noise filtering, leakage barrier         │
└─────────────────────┬────────────────────┘
                      │ update leaderboard
                      └─► back to PLAN AGENT

In each round the plan agent reads the leaderboard and branch health, optionally searches the web, and issues a hypothesis for each node. The mutation agent rewrites the EVOLVE block, exposing numeric knobs as parameters without picking their values. The param‑search agent then runs Optuna TPE trials on the pilot dataset. Every K rounds, the promotion step re‑evaluates the top‑K pilot candidates on full data, filters out noise, and updates the leaderboard, which becomes the plan agent's input for the next round.
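The control flow of this loop can be sketched in plain Python. Everything here is a hypothetical stand-in rather than ml-evolve's implementation: `plan`, `mutate`, `param_search`, and `evaluate` are stubs, the metric is synthetic, and the numeric step uses plain random search in place of Optuna TPE so the sketch needs only the standard library.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    branch: str
    params: dict
    score: float
    stage: str  # "pilot" or "full"; scores from different stages are kept apart


def plan(leaderboard, branches):
    """Plan step stand-in: one hypothesis per branch."""
    return {branch: f"explore {branch}" for branch in branches}


def mutate(hypothesis):
    """Mutation step stand-in: expose knob ranges, never concrete values."""
    return {"temperature": (0.01, 0.5)}


def evaluate(params, stage, rng):
    """Synthetic metric; the pilot stage is noisier than the full stage."""
    noise = 0.005 if stage == "pilot" else 0.001
    return 0.11 - (params["temperature"] - 0.05) ** 2 + rng.gauss(0, noise)


def param_search(branch, space, rng, n_trials=12):
    """Numeric step stand-in: random search on pilot data (not TPE)."""
    best = None
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = evaluate(params, "pilot", rng)
        if best is None or score > best.score:
            best = Candidate(branch, params, score, "pilot")
    return best


def run(rounds=15, promote_every=5, top_k=1, seed=0):
    rng = random.Random(seed)
    branches = ["in_batch_neg_tuning", "hard_negative_mining",
                "multi_interest_user_tower"]
    pilot_pool, leaderboard = [], []
    for rnd in range(1, rounds + 1):
        for branch, hypothesis in plan(leaderboard, branches).items():
            pilot_pool.append(param_search(branch, mutate(hypothesis), rng))
        if rnd % promote_every == 0:
            # Promotion: re-evaluate only the top-K pilot candidates on full data.
            pilot_pool.sort(key=lambda c: c.score, reverse=True)
            for cand in pilot_pool[:top_k]:
                full_score = evaluate(cand.params, "full", rng)
                leaderboard.append(
                    Candidate(cand.branch, cand.params, full_score, "full"))
            pilot_pool.clear()
    return leaderboard


for entry in run():
    print(entry.branch, entry.stage, round(entry.score, 4))
```

With 15 rounds and promotion every 5 rounds, only three full-data evaluations happen in total; every other evaluation stays on the cheaper pilot stage.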

Comparison with existing self‑evolution algorithms

AlphaEvolve, OpenEvolve and the broader AK‑style AutoResearch share a simple loop: propose → evaluate → learn → repeat, driven by a leaderboard. ml‑evolve differs in three ways: it redesigns the loop for production‑grade ML economics, adds explicit stage gating, and introduces role‑specific agents to avoid compute waste and evaluation leakage.

Two‑tower retrieval case study

Scenario: improve a two‑tower retrieval pipeline (baseline TwoTower). Three structural hypotheses were explored: in‑batch negative tuning, ANN‑based hard‑negative mining, and a multi‑interest user tower. In round 6 (of 15) the system executed:

[Stage]  search  (10% data pilot)

[Branch] hard_negative_mining
 Plan agent     → "Round 5 converged at temperature 0.05. Meta two‑tower notes recommend BM25‑retrieved hard negatives + curriculum schedule. Direction: add hard_neg_ratio and pool_size."
 Mutation agent → edits EVOLVE block, parameterizes the two knobs.
 Param‑search   → 12 Optuna TPE trials, 5 startup.
               best → Recall@50 = 0.1112 (pilot)
               cost: 12 × 4 min = 48 min on one GPU

[Branch] in_batch_neg_tuning           best → 0.1071
[Branch] multi_interest_user_tower     best → 0.1058 (2/3 kill criterion)

[Promotion] every=5 → round‑5 top‑1 promoted; full‑stage check
               → Recall@50 = 0.1098 (consistent with pilot — not noise)

[Replan]   replan_every=5 fires → multi_interest_user_tower flagged for hypothesis refresh next round.
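The promotion step in the log compares a pilot score with its full-data re-evaluation. A minimal sketch of such a check follows, assuming a relative-gap rule: the 5% `max_rel_gap` threshold and both function names are hypothetical and not taken from ml-evolve. Only the two Recall@50 values come from the log.

```python
def pilot_full_gap(pilot: float, full: float) -> float:
    """Relative drop from the pilot score to the full-data score."""
    return (pilot - full) / pilot


def confirmed(pilot: float, full: float, max_rel_gap: float = 0.05) -> bool:
    """Treat a pilot leader as confirmed when the full-data score stays
    within max_rel_gap of the pilot score (assumed threshold)."""
    return abs(pilot_full_gap(pilot, full)) <= max_rel_gap


gap = pilot_full_gap(0.1112, 0.1098)
print(f"relative gap: {gap:.2%}")
print("confirmed:", confirmed(0.1112, 0.1098))
```

For the values in the log the relative gap is a little over 1%, which is why a rule of this kind would label the pilot result as consistent rather than noise.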

Key observations:

Search decoupling – the plan agent never selects temperature; Optuna handles it.

Breadth control – three parallel branches, one pruned early.

Multi‑stage training – the 12 pilot trials took 48 min versus ≈ 8 h for the same trials on full data, a 90% cost reduction at the same search depth.

Noise filtering – promotion uses full data to confirm pilot leaders, preventing noisy candidates from proceeding.
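The cost figures quoted in this article can be reproduced with a few lines of arithmetic. The sketch assumes 4 minutes per pilot trial (from the log) and 40 minutes per full-data trial (implied by ≈ 8 h for 12 trials). The function name is hypothetical, and the cost of promotion re-evaluations is ignored.

```python
def gpu_hours(nodes: int, rounds: int, trials_per_round: int,
              minutes_per_trial: float) -> float:
    """Total single-GPU hours for a search of nodes x rounds x trials."""
    total_trials = nodes * rounds * trials_per_round
    return total_trials * minutes_per_trial / 60


PILOT_MIN, FULL_MIN = 4, 40  # minutes per trial; FULL_MIN is inferred

# One branch, one round: 12 trials.
pilot_round_min = 12 * PILOT_MIN        # 48 minutes
full_round_hours = 12 * FULL_MIN / 60   # 8 hours

# Whole job: 3 nodes x 15 iterations x 12 trials.
full_total = gpu_hours(3, 15, 12, FULL_MIN)
pilot_total = gpu_hours(3, 15, 12, PILOT_MIN)
saving = 1 - pilot_total / full_total

print(pilot_round_min, full_round_hours)
print(full_total, pilot_total, f"{saving:.0%}")
```

Both the per-round comparison (48 min versus 8 h) and the whole-job comparison (36 versus 360 GPU-hours) follow from the same 10× ratio between pilot and full-data trial time.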

Artifacts produced include leaderboard.md (best per branch + history), research_plan.md (current plan instructions), param_trials/*.json (each Optuna trial), and report.md (final audit report). These files record the decisions and outcomes for later review.
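As an illustration of what a stage-annotated leaderboard file might contain, the sketch below renders the best entry per branch and stage as a table. The `Entry` fields and the table layout are assumptions, since the article does not show the actual format of leaderboard.md. The three scores are the round-6 pilot results from the log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    branch: str
    round_no: int
    recall_at_50: float
    stage: str  # "pilot" or "full"


def best_per_branch(entries):
    """Keep the best entry per (branch, stage) so stages are never mixed."""
    best = {}
    for e in entries:
        key = (e.branch, e.stage)
        if key not in best or e.recall_at_50 > best[key].recall_at_50:
            best[key] = e
    return sorted(best.values(), key=lambda e: e.recall_at_50, reverse=True)


def render(entries) -> str:
    """Render the leaderboard as a pipe-delimited table."""
    lines = ["| branch | stage | round | Recall@50 |", "|---|---|---|---|"]
    for e in best_per_branch(entries):
        lines.append(
            f"| {e.branch} | {e.stage} | {e.round_no} | {e.recall_at_50:.4f} |")
    return "\n".join(lines)


entries = [
    Entry("hard_negative_mining", 6, 0.1112, "pilot"),
    Entry("in_batch_neg_tuning", 6, 0.1071, "pilot"),
    Entry("multi_interest_user_tower", 6, 0.1058, "pilot"),
]
print(render(entries))
```

Keying on both branch and stage means a noisy pilot score can never displace a full-data score for the same branch.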

When to use ml‑evolve

Suitable scenarios:

Single‑scalar objectives (Recall@K, NDCG@K, AUC, profit, etc.).

Training tasks long enough that each candidate costs ≥ 10 minutes, allowing stage‑wise amortization.

Multiple genuinely different research directions that merit parallel exploration.

Need for quarterly auditability of what was tried and why.

Not suitable for pure hyper‑parameter grid search (Optuna suffices), online or human‑in‑the‑loop evaluation (this is an offline loop), or when training is cheap enough that the overhead of staging outweighs the savings.

Conclusion

ML‑specific agent systems are evolving toward more open and capable designs. The remaining challenge is to build a multi‑agent system that recovers its compute cost within a single quarter on real‑world ML pipelines. The answer lies not in larger LLMs but in better‑designed agent roles, staged computation, and separating breadth from depth.

Project repository: https://github.com/roylist/ml-evolve
