How to Systematically Build More Realistic Mobile Agent Environments for Large‑Scale Training

PhoneWorld reconstructs mock Android apps from real-world usage traces, creating scalable, resettable, and verifiable environments in which mobile agents can train on realistic page structures, navigation paths, and state changes. The paper reports substantial gains across four mobile benchmarks.

Machine Learning Algorithms & Natural Language Processing

Motivation

Mobile agents have progressed rapidly, but further scaling is limited not only by model capacity but also by the training environment, which determines data sources, executable actions, verifiable outcomes, and reproducibility.

Why real apps are unsuitable for large‑scale training

State is difficult to reset after actions such as adding to cart, sending messages, or changing settings; reproducing the same task requires costly restoration of data, cache, and account state.

Task completion cannot be automatically verified because internal app state is usually hidden, making it hard to confirm that a message was sent, an item was added to the cart, or a setting was changed.

Real apps introduce noisy, unstable factors (login requirements, risk controls, permission dialogs, ads, network variability, and version updates) that cause the same task to follow different execution paths over time.

PhoneWorld approach

PhoneWorld extracts the usage structure from screenshots and interaction traces of real apps and converts it into a mock Android app that is runnable, resettable, and verifiable.

Construction pipeline

Analyze real user trajectories to identify visited pages, navigation links, and state‑changing actions.

Generate page‑level product requirement documents (PRDs) and data schemas that serve as blueprints.

Use a coding agent to automatically generate a Kotlin/Jetpack Compose project, which is then compiled into an Android APK.
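The first pipeline step can be illustrated with a minimal Python sketch. The trace format (a list of `(page_before, action, page_after)` tuples), the `STATE_CHANGING` set, and the function name `summarize_traces` are all assumptions made for illustration; the article does not describe how PhoneWorld actually represents or parses trajectories.

```python
from collections import Counter

# Hypothetical trace format: each step is (page_before, action, page_after).
# Which actions count as state-changing is an assumption for this sketch.
STATE_CHANGING = {"add_to_cart", "send_message", "toggle_setting"}

def summarize_traces(traces):
    """Collect visited pages, navigation edges, and state-changing actions."""
    page_visits = Counter()
    nav_edges = Counter()
    state_actions = Counter()
    for trace in traces:
        for src, action, dst in trace:
            page_visits[src] += 1
            if action in STATE_CHANGING:
                state_actions[(src, action)] += 1
            if dst != src:
                nav_edges[(src, dst)] += 1
        if trace:
            page_visits[trace[-1][2]] += 1  # count the final page as visited
    return page_visits, nav_edges, state_actions

traces = [
    [("home", "tap_search", "search"),
     ("search", "tap_result", "detail"),
     ("detail", "add_to_cart", "detail")],
    [("home", "tap_search", "search"),
     ("search", "tap_result", "detail")],
]
pages, edges, actions = summarize_traces(traces)
```

The three counters correspond to the three things the step is said to identify: pages, links between them, and actions that change state. They would be natural inputs to the PRD and schema generation that follows.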

Mock app design

The mock app retains only the most frequently traversed core paths (home, search, detail, chat, and order screens) rather than reproducing the entire original app. For each key page, a structured PRD describes the layout, interactive elements, navigation logic, and visual attributes, telling the coding agent how the page should look and behave.
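As a rough illustration of what a page-level PRD and the "keep only core paths" rule might look like in data form, here is a small Python sketch. The field names (`layout`, `elements`, `navigates_to`, `mutates`, `visual`), the helper names, and the visit counts are hypothetical; the article does not give the actual PRD schema or the selection threshold.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Element:
    element_id: str
    kind: str                           # e.g. "button", "text_field", "list_item"
    navigates_to: Optional[str] = None  # target page, if tapping navigates
    mutates: Optional[str] = None       # mutable table touched, if any

@dataclass
class PagePRD:
    page: str
    layout: str
    elements: List[Element] = field(default_factory=list)
    visual: Dict[str, str] = field(default_factory=dict)

def core_pages(visit_counts: Dict[str, int], k: int) -> List[str]:
    """Keep the k most visited pages (ties broken alphabetically)."""
    ranked = sorted(visit_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page for page, _ in ranked[:k]]

def dangling_links(prds: List[PagePRD]) -> List[str]:
    """Navigation targets that no PRD defines yet."""
    known = {p.page for p in prds}
    return sorted({e.navigates_to for p in prds for e in p.elements
                   if e.navigates_to and e.navigates_to not in known})

# Made-up visit counts for illustration only.
visits = {"home": 120, "search": 95, "detail": 80, "settings_about": 2}
kept = core_pages(visits, k=3)

prds = [
    PagePRD("home", "search bar above a feed",
            [Element("search_bar", "text_field", navigates_to="search")]),
    PagePRD("search", "result list",
            [Element("result_item", "list_item", navigates_to="detail")]),
]
missing = dangling_links(prds)
```

A check like `dangling_links` is one plausible way to catch a blueprint that links to a page nobody has specified before any code is generated.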

Controllable data layer

Read‑only portion (e.g., product listings, contacts, videos) supports browsing, search, and information queries.

Mutable portion (e.g., favorites, shopping cart, messages, comments, orders) records agent actions in a local database and can be reset after each episode.
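The split between read-only and mutable data can be sketched with Python's built-in `sqlite3` module. This is only an analogy for the mock app's local database: the table names, the `MUTABLE_TABLES` tuple, and the helper functions are assumptions, and the real apps are Kotlin projects whose storage layer the article does not detail.

```python
import sqlite3

# Tables that record agent actions; everything else is treated as read-only.
MUTABLE_TABLES = ("cart",)

def build_store():
    """Create an in-memory store with seeded read-only data and an empty cart."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE cart (
            product_id INTEGER NOT NULL REFERENCES products(id),
            qty INTEGER NOT NULL
        );
    """)
    conn.executemany("INSERT INTO products (id, name) VALUES (?, ?)",
                     [(1, "kettle"), (2, "lamp")])
    conn.commit()
    return conn

def add_to_cart(conn, product_id, qty=1):
    """Record an agent action in the mutable portion."""
    conn.execute("INSERT INTO cart (product_id, qty) VALUES (?, ?)",
                 (product_id, qty))
    conn.commit()

def cart_contains(conn, product_id):
    """State-based check: did the action actually land in the database?"""
    row = conn.execute("SELECT COUNT(*) FROM cart WHERE product_id = ?",
                       (product_id,)).fetchone()
    return row[0] > 0

def reset_episode(conn):
    """Clear only the mutable tables; seeded read-only data is untouched."""
    for table in MUTABLE_TABLES:
        conn.execute(f"DELETE FROM {table}")
    conn.commit()

conn = build_store()
add_to_cart(conn, 1)
verified = cart_contains(conn, 1)
reset_episode(conn)
after_reset = cart_contains(conn, 1)
product_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

Because the agent's actions are written to an inspectable store, the same query that verifies task completion also confirms that a reset returned the environment to its starting state.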

Verification

Each mock app is installed on an emulator and subjected to automated tests that verify core flows run without errors, followed by human audits that compare page structures, interaction paths, and state changes with the real counterpart.
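The automated part of this check can be pictured as replaying scripted core flows and flagging any that fail to reach the expected page. The sketch below runs against a toy transition table rather than a real emulator; the `NAV` table, the flow names, and the deliberately broken flow are all hypothetical, and the article does not say which test framework PhoneWorld uses.

```python
# Toy navigation model: (current_page, action) -> next_page.
NAV = {
    ("home", "tap_search"): "search",
    ("search", "tap_result"): "detail",
    ("detail", "tap_buy"): "order",
    ("home", "tap_chat"): "chat",
}

def replay(flow, start="home"):
    """Return (ok, page); ok is False at the first action with no transition."""
    page = start
    for action in flow:
        nxt = NAV.get((page, action))
        if nxt is None:
            return False, page
        page = nxt
    return True, page

# Each core flow: (actions to perform, page expected at the end).
CORE_FLOWS = {
    "purchase": (["tap_search", "tap_result", "tap_buy"], "order"),
    "open_chat": (["tap_chat"], "chat"),
    # Deliberately invalid: "tap_buy" is not available on the home page.
    "broken_example": (["tap_buy"], "order"),
}

def run_smoke_tests():
    """Names of flows that error out or end on the wrong page."""
    failures = []
    for name, (flow, expected) in CORE_FLOWS.items():
        ok, page = replay(flow)
        if not ok or page != expected:
            failures.append(name)
    return failures

failures = run_smoke_tests()
```

In a real setup the `replay` step would drive the installed APK instead of a dictionary, but the pass/fail logic would have the same shape. Fidelity to the real app would still need the human audit.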

Experimental evaluation

Replacing a portion of the auxiliary AndroidWorld data with 10K PhoneWorld steps yields consistent improvements on all four benchmarks: HYMobileBench +17.7, AndroidControl +6.0, AndroidWorld +14.7, and PhoneWorld +52.5.

Scaling experiments show that increasing PhoneWorld supervision from 0 to 10K, 20K, and 36K steps raises the task success rate from 14.2% to 64.2%, 70.0%, and 73.3%, respectively, demonstrating continued benefit from more verifiable trajectories.
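To make the shape of this scaling curve concrete, the short Python sketch below computes the marginal gain per 1,000 additional steps between consecutive settings. It uses only the four success rates quoted above; the function name `marginal_gains` is an illustrative choice, and this is simple arithmetic on the reported figures rather than an analysis from the paper.

```python
# (PhoneWorld training steps, task success rate in %) as reported above.
points = [(0, 14.2), (10_000, 64.2), (20_000, 70.0), (36_000, 73.3)]

def marginal_gains(points):
    """Percentage points gained per 1,000 extra steps, per interval."""
    gains = []
    for (s0, r0), (s1, r1) in zip(points, points[1:]):
        gains.append(round((r1 - r0) / (s1 - s0) * 1000, 2))
    return gains

gains = marginal_gains(points)
# gains == [5.0, 0.58, 0.21]
```

The first 10K steps account for most of the improvement (about 5 points per 1K steps). Later additions still help but at a much lower rate, which is the usual diminishing-returns pattern.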

Increasing the number of distinct mock apps from 5 to 34 further boosts performance on all benchmarks, confirming that environment diversity yields additional gains.

When PhoneWorld data fully replaces the auxiliary AndroidWorld data, results on PhoneWorld's own benchmark continue to improve, and HYMobileBench and AndroidControl retain significant gains, but AndroidWorld performance drops. This indicates that PhoneWorld data complements rather than outright replaces real-app data.

Conclusion

PhoneWorld provides a middle-ground solution: it extracts realistic usage structures from real apps and converts them into mock environments that are runnable, resettable, verifiable, and scalable. Empirically, training on these environments substantially improves mobile agent performance across multiple benchmarks.

[Figures: PhoneWorld overview; Construction pipeline; Scaling results]