
MiniAppBench Reveals Only 1 in 6 AI‑Generated Apps Meet Real User Needs

MiniAppBench, the first benchmark to evaluate whether large language models can generate fully functional interactive HTML applications, reports an average pass rate of just 17% across 16 top models. Even the strongest model, GPT‑5.2, reaches only 45%, which highlights a substantial gap between current capabilities and real‑world user requirements.

Machine Heart

Jun 10, 2026

From Text to Interactive Apps: A New AI Paradigm

The paper defines a MiniApp as a customized HTML interactive application generated on‑the‑fly from a single user query, moving beyond static text or markdown outputs.

Why HTML?

HTML provides visual appeal, rich interaction logic, cross‑platform immediacy, and no installation overhead, making it a natural target for direct model output rather than an intermediate artifact.

Core Requirements of a MiniApp

Principle Adherence: the model must capture implicit real‑world principles in the query (e.g., a weekly diet tracker must reflect seven days and three meals per day).

Customized Interaction: the app’s structure and behavior must be dynamically synthesized to match user intent, not merely assembled from a fixed template.

Why Existing Benchmarks Fall Short

Code benchmarks (HumanEval, MBPP, SWE‑Bench) test algorithmic correctness but ignore execution environment and user interaction. Web generation benchmarks (Pix2Code, FullFront, WebGenBench) assess visual fidelity only. Agent benchmarks (ArtifactsBench, WebDevJudge) rely on fixed A/B comparisons or reference implementations, which cannot capture the open‑ended nature of MiniApps.

MiniAppBench Construction

Starting from over 10 million real interaction requests, the authors filtered for tasks that require principle‑driven interaction, yielding 500 high‑quality tasks covering six domains, 25 sub‑categories, and three difficulty levels (30% Easy, 40% Mid, 30% Hard). The pipeline consists of four stages:

Identify principle‑driven interactive demands.

Expand coverage while preserving original intent.

Generate structured evaluation references (Eval‑Ref) that list intent, static checks, and dynamic checkpoints.

Balance difficulty and domain distribution.
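
The article does not reproduce the exact Eval‑Ref schema, so the following is only a minimal sketch of how such a reference might be represented. The class name `EvalRef`, its field names, and the sample diet-tracker checks are all hypothetical; only the three-part split (intent, static checks, dynamic checkpoints) comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class EvalRef:
    """Hypothetical container for one task's evaluation reference."""
    query: str    # the original user request
    intent: str   # what the generated app must accomplish
    # Items that can be verified from the source code or the rendered DOM.
    static_checks: list[str] = field(default_factory=list)
    # Items that can only be verified by interacting with the running app.
    dynamic_checkpoints: list[str] = field(default_factory=list)

    def checklist(self) -> list[str]:
        """Flatten the reference into labeled items an evaluator can walk through."""
        items = [f"[intent] {self.intent}"]
        items += [f"[static] {c}" for c in self.static_checks]
        items += [f"[dynamic] {c}" for c in self.dynamic_checkpoints]
        return items


# Illustrative values only, loosely based on the diet-tracker example in the text.
ref = EvalRef(
    query="Build me a weekly diet tracker",
    intent="Track meals across a full week",
    static_checks=["Seven day columns are present", "Three meal slots per day"],
    dynamic_checkpoints=["Adding a meal updates that day's entry"],
)

for item in ref.checklist():
    print(item)
```

Keeping static and dynamic items in separate lists mirrors the split between checks that need only the source and checks that need a live browser session.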

MiniAppEval: LLM Agent as Human Tester

MiniAppEval drives a headless browser with Playwright, letting an LLM‑based agent click, type, and drag while capturing DOM state, console logs, and source code. Evaluation spans three dimensions, each scored 0–1:

Intention: does the app satisfy the user’s goal?

Static: is the HTML structure correct, syntactically valid, and accessible?

Dynamic: do multi‑step interactions behave consistently and respect real‑world constraints?

An app passes only if all three scores are ≥ 0.8. Ablation studies show that removing any component (Eval‑Ref, code review, or dynamic testing) drastically reduces recall or precision, confirming their necessity.
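
The pass rule is simple enough to state in a few lines of code. This is a sketch under stated assumptions, not the benchmark's own implementation: the `Scores` class and the `passes` and `pass_rate` helpers are hypothetical names, and the sample scores are made up. Only the three dimensions and the 0.8 threshold come from the text.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.8  # from the article: every dimension must score >= 0.8


@dataclass(frozen=True)
class Scores:
    """Hypothetical record of one app's three dimension scores, each in [0, 1]."""
    intention: float
    static: float
    dynamic: float


def passes(s: Scores, threshold: float = PASS_THRESHOLD) -> bool:
    """An app passes only if all three dimension scores reach the threshold."""
    return min(s.intention, s.static, s.dynamic) >= threshold


def pass_rate(results: list[Scores]) -> float:
    """Fraction of apps that pass; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(passes(s) for s in results) / len(results)


# Made-up scores for illustration, not benchmark data.
examples = [
    Scores(0.9, 0.85, 0.8),  # all three at or above 0.8
    Scores(0.95, 0.9, 0.7),  # dynamic is below 0.8
    Scores(1.0, 0.79, 1.0),  # static is just below 0.8
]
print([passes(s) for s in examples])
print(pass_rate(examples))
```

Because the rule takes the minimum of the three scores, a strong result on two dimensions cannot compensate for a weak third, which is one reason aggregate pass rates can be much lower than any single dimension's average.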

Evaluation Results

Sixteen models were tested on the 500 tasks. The overall average pass rate is 17.05%.

Best closed‑source model GPT‑5.2 achieves 45.46%.

Best open‑source model GLM‑4.7 reaches 18.31%.

Hard‑level tasks cause most models to collapse to single‑digit pass rates (e.g., GPT‑5.1 drops to 3.49%).

Visualization and Lifestyle domains have higher pass rates (>30%) than Science and Tools, which remain challenging.

Token consumption correlates strongly with pass rate (r = 0.84), but models with similar performance can differ by several times in token usage, indicating that raw compute alone does not explain the gap.
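
For readers who want to run the same kind of check on their own results, the sketch below computes a Pearson correlation coefficient from scratch. The `pearson_r` helper and both sample lists are hypothetical; the numbers are invented for illustration and are not the paper's per-model data, so the printed value will not match the reported r = 0.84.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model figures, NOT taken from the paper.
avg_tokens = [4_000.0, 12_000.0, 5_000.0, 15_000.0, 9_000.0]
pass_rates = [0.05, 0.10, 0.12, 0.40, 0.18]

print(round(pearson_r(avg_tokens, pass_rates), 2))
```

In this toy data the second and third models have similar pass rates (0.10 and 0.12) while differing in token usage by more than a factor of two, the kind of spread that a single correlation coefficient hides.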

Implications

MiniAppBench demonstrates that current LLMs are far from reliably generating usable interactive applications. The benchmark also introduces an evaluation methodology, LLM agents combined with static code review and dynamic browser testing, that can be adapted to other open‑ended generation tasks.

Getting Started

MiniAppBench is fully open‑source (https://github.com/MiniAppBench/miniappbench). With an OpenAI‑compatible API key, the evaluation can be run locally in about five minutes.

# Install dependencies
pip install -r requirements.txt
playwright install chromium
# Run the benchmark
python -m examples.pipeline --query-file data/query_validation_100.json --index 1

Paper title: MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM‑Powered Assistants (OpenReview: https://openreview.net/pdf?id=pwbLmew1aq).
