LongHorizonUI: A Unified Robust Framework for Long‑Horizon GUI Agent Automation

LongHorizonUI tackles the steep success‑rate drop of GUI agents on tasks longer than 10‑15 steps by introducing three tightly coupled modules—enhanced perception, deep reflective decision, and compensatory execution—and validates the approach on the new LongGUIBench benchmark with consistent performance gains across both app and game scenarios.


In everyday mobile and desktop use, many workflows (booking meetings, purchasing items in game stores, or chaining actions across multiple apps) require dozens of consecutive interactions. Although multimodal large language model (MLLM)-based GUI agents have progressed, their success rate collapses sharply once a task exceeds 10–15 steps.

To quantify this problem, the authors evaluated several state‑of‑the‑art methods on the AndroidControl benchmark, segmenting performance by step length. Methods achieve over 90% average success for sequences up to 5 steps, but success falls below 75% for >10 steps and drops to roughly 60% for >15 steps, revealing an inability to capture cross‑step state dependencies.
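This kind of step-length segmentation is straightforward to reproduce. Below is a minimal sketch, assuming per-episode records with hypothetical `steps` and `success` fields; the disjoint bucket edges are illustrative choices loosely mirroring the ≤5 / >10 / >15 analysis, not the paper's exact protocol:

```python
from collections import defaultdict

def success_by_step_bucket(episodes):
    """Aggregate episode outcomes into step-length buckets.

    `episodes` is a list of dicts with hypothetical fields
    {"steps": int, "success": bool}; the bucket edges below are
    illustrative assumptions, not the paper's exact segmentation.
    """
    def bucket(steps):
        if steps <= 5:
            return "<=5"
        if steps <= 10:
            return "6-10"
        if steps <= 15:
            return "11-15"
        return ">15"

    totals = defaultdict(int)
    wins = defaultdict(int)
    for ep in episodes:
        b = bucket(ep["steps"])
        totals[b] += 1
        wins[b] += int(ep["success"])
    # Per-bucket success rate in [0, 1].
    return {b: wins[b] / totals[b] for b in totals}
```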

The central research question is: How can a GUI agent maintain contextual consistency and decision accuracy throughout long‑step operation sequences?

LongGUIBench: A Benchmark for Long‑Horizon Scenarios

To enable systematic evaluation, the team built LongGUIBench, a benchmark where every task contains at least 15 steps (average 22.1 steps). The dataset comprises two major categories:

General application scenarios: 15 mainstream apps (e.g., Gmail, YouTube) with 147 end‑to‑end task chains, average 19.5 steps, covering multi‑level menu navigation and real‑time input validation.

Game scenarios: 13 popular game apps recorded by professional testers, yielding 207 high‑complexity chains, average 23.7 steps, up to 37 steps, covering equipment management and event participation.

Each task provides two levels of instruction: a high‑level command describing the macro goal (e.g., “buy item X in the game store”) and a low‑level command decomposing it into atomic UI actions (e.g., “click store button → select purchase”). All steps are annotated with fine‑grained UI semantics, including widget type, bounding‑box coordinates, and state attributes. The benchmark contains 4,508 screenshots, aligned across modalities by six professionals.
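To make the annotation format concrete, here is a hypothetical sketch of a single annotated step as a Python dict; the field names are illustrative assumptions, not the benchmark's published schema:

```python
# Hypothetical shape of one annotated LongGUIBench step. Field names are
# illustrative assumptions, not the benchmark's published schema.
example_step = {
    "high_level_instruction": "buy item X in the game store",
    "low_level_instruction": "click store button",
    "screenshot": "task_0042/step_03.png",
    "ui_elements": [
        {
            "widget_type": "button",                        # annotated widget type
            "bbox": [412, 1088, 668, 1152],                 # x1, y1, x2, y2 in pixels
            "text": "Store",                                # OCR / label text
            "state": {"enabled": True, "selected": False},  # state attributes
        },
    ],
    "ground_truth_action": {"type": "click", "target_index": 0},
}
```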

LongHorizonUI: Three‑Module Unified Framework

The framework follows a perception‑decision‑execution loop, consisting of:

Multimodal Enhanced Perception (MEP): Runs a UI component detector and an OCR module in parallel, assigning a unique spatial index ID to every UI element. An IoU‑based semantic binding mechanism links icon detections with OCR text boxes when their overlap exceeds a threshold, resolving ambiguities in composite widgets. A template‑matching repair strategy focuses on high‑priority regions to recover missed critical elements such as dialog close buttons.
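A minimal sketch of the binding step follows, assuming axis‑aligned boxes in (x1, y1, x2, y2) form; the 0.5 threshold is an illustrative assumption, not the paper's value:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def bind_icons_to_text(icon_boxes, ocr_boxes, threshold=0.5):
    """Attach the best-overlapping OCR text box to each detected icon.

    `icon_boxes` and `ocr_boxes` are lists of (x1, y1, x2, y2) tuples;
    the 0.5 threshold is an illustrative assumption.
    """
    bindings = []
    for i, ibox in enumerate(icon_boxes):
        best_j, best_iou = None, threshold
        for j, tbox in enumerate(ocr_boxes):
            score = iou(ibox, tbox)
            if score >= best_iou:
                best_j, best_iou = j, score
        # best_j stays None when no text box overlaps enough.
        bindings.append((i, best_j))
    return bindings
```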

Deep Reflective Decision (DRD): Enforces a strict JSON‑Schema output that obliges the model to perform three levels of reasoning. First, it verifies the previous action’s success and expected UI state transition. Second, it extracts key information from the current screen and checks consistency with the task goal. Third, it generates an explainable action rationale, stating the observed screen state, the grounding reference, and the reason for the chosen operation. Before execution, DRD validates the existence of target elements and semantic alignment with the task description, rejecting any mismatched actions for correction.
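A hedged sketch of what such an output contract might look like, using the `jsonschema` package for validation; the field names are illustrative assumptions rather than the paper's exact schema:

```python
import json
from jsonschema import validate  # pip install jsonschema

# Hypothetical schema mirroring DRD's three reasoning levels; field names
# are illustrative assumptions, not the paper's published schema.
DRD_SCHEMA = {
    "type": "object",
    "required": ["previous_action_check", "screen_analysis", "action"],
    "properties": {
        "previous_action_check": {      # level 1: verify the last action
            "type": "object",
            "required": ["succeeded", "observed_transition"],
            "properties": {
                "succeeded": {"type": "boolean"},
                "observed_transition": {"type": "string"},
            },
        },
        "screen_analysis": {            # level 2: current screen vs. goal
            "type": "object",
            "required": ["key_information", "consistent_with_goal"],
            "properties": {
                "key_information": {"type": "string"},
                "consistent_with_goal": {"type": "boolean"},
            },
        },
        "action": {                     # level 3: explainable action rationale
            "type": "object",
            "required": ["target_index", "operation", "rationale"],
            "properties": {
                "target_index": {"type": "integer"},
                "operation": {"type": "string"},
                "rationale": {"type": "string"},
            },
        },
    },
}

def parse_decision(raw_model_output: str) -> dict:
    """Parse and validate the model's JSON decision; a validation error
    corresponds to a rejected action that must be corrected."""
    decision = json.loads(raw_model_output)
    validate(decision, DRD_SCHEMA)
    return decision
```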

Compensatory Actuator (CAE): Translates the decision‑layer command into physical screen coordinates using a prioritized three‑stage locating strategy: (1) click the centroid of the indexed element; (2) if that fails, randomly sample a point inside the detection box for relative clicking; (3) as a last resort, click an absolute coordinate with a slight perturbation to handle edge occlusions. After each click, the MLLM re‑examines the new screenshot to confirm success. If all strategies fail, the system triggers local replanning; persistent failure leads to rollback to the last successful snapshot.
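A minimal sketch of the three‑stage fallback, assuming a `tap(x, y)` driver call and an `action_succeeded()` callback that asks the MLLM to verify the fresh screenshot; both callables are hypothetical placeholders:

```python
import random

def locate_and_click(element, tap, action_succeeded, max_perturb=4):
    """Three-stage compensatory click, sketched after the CAE description.

    `element` carries a detection box (x1, y1, x2, y2); `tap` performs the
    physical click and `action_succeeded` re-checks the new screenshot via
    the MLLM. Both callables are hypothetical placeholders.
    """
    x1, y1, x2, y2 = element["bbox"]
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2

    candidates = [
        # Stage 1: centroid of the indexed element.
        (cx, cy),
        # Stage 2: a randomly sampled point inside the detection box.
        (random.randint(x1, x2), random.randint(y1, y2)),
        # Stage 3: absolute coordinate with a slight perturbation,
        # to cope with partially occluded edges.
        (cx + random.randint(-max_perturb, max_perturb),
         cy + random.randint(-max_perturb, max_perturb)),
    ]

    for x, y in candidates:
        tap(x, y)
        if action_succeeded():
            return True
    return False  # caller escalates to local replanning / snapshot rollback
```

When every stage fails, the surrounding controller would then trigger the local replanning and snapshot rollback described above.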

Experimental Results

On LongGUIBench, LongHorizonUI shows clear advantages on long‑horizon tasks. In general‑app scenarios, low‑level instruction success reaches 85.3% and high‑level instruction success 52.3%, improving over UI‑TARS‑1.5 by 6.1% and 30.5%, respectively. In game scenarios, the framework attains 83.9% low‑level and 52.1% high‑level success, yielding an overall average of 77.3%.

On the ScreenSpot cross‑platform UI element localization benchmark, LongHorizonUI achieves a 90.4% average accuracy across Mobile, Desktop, and Web platforms, with a pronounced edge on icon elements, confirming the effectiveness of the IoU‑based semantic binding.

Ablation studies demonstrate the necessity of each module: removing the component detector reduces step‑completion rate by 6.1%; removing OCR causes a 2.3% drop and frequent errors on composite widgets; using only index‑based locating yields 81.4% success, which climbs to 85.3% when the compensatory strategies are added.

In the OSWorld 50‑step long‑chain setting, LongHorizonUI reaches a 29.4% success rate, surpassing UI‑TARS‑72B’s 24.6% by 4.8 percentage points, further validating robustness on ultra‑long sequences.

Conclusion

LongHorizonUI provides a comprehensive solution for long‑horizon GUI automation by integrating indexed perception, structured reflective decision making, and multi‑level compensatory execution. This design mitigates error accumulation across many steps and delivers consistent performance improvements on multiple benchmarks. The accompanying LongGUIBench benchmark also offers a standardized platform for future research on robust GUI agents.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: benchmark, multimodal LLM, GUI automation, ICLR 2026, Long-Horizon Tasks