Building More Realistic Mobile Agent Worlds for Large‑Scale Training
The article examines the PhoneWorld project, which reconstructs realistic Android app environments from user interaction traces to produce mock apps that are scalable, resettable, and verifiable. These mock apps support large‑scale training and evaluation of Mobile Agents, and the paper reports performance gains across multiple benchmarks.
Background
Mobile Agents have progressed rapidly in the past year, moving from screen reading and button clicking to completing cross‑app tasks. However, further scaling is limited not only by model capacity but also by the training environment, which determines data sources, executable actions, result verification, and reproducibility.
Why Real Apps Are Not Sufficient
Real Android apps are closest to the target user scenario, yet they pose three major challenges for large‑scale training:
State Reset Difficulty: Operations such as adding to a cart, sending messages, or changing settings permanently alter account and app state, making repeated task execution costly.
Result Verification Difficulty: Determining whether an Agent truly completed a task requires reliable verifiers, but internal app states are often inaccessible.
Unstable Noise: Login status, risk controls, CAPTCHAs, permission dialogs, ads, network fluctuations, and version updates introduce unpredictable variations.
Consequently, while real apps provide the most authentic environment, they are not ideal for reproducible, scalable training and evaluation.
PhoneWorld’s Approach
PhoneWorld, a collaboration led by Tencent Hunyuan with Hong Kong‑Shenzhen, Renmin University‑Gaoling, and Wuhan University, proposes a middle‑ground solution: reconstructing the essential usage structure of real apps and converting it into a mock Android app that is runnable, resettable, and verifiable.
The pipeline consists of three steps:
Analyze screenshots and interaction trajectories to recover page hierarchy, navigation paths, and state‑changing actions.
Generate page‑level PRDs, data schemas, and reusable components that describe layout, interactive elements, navigation logic, and visual attributes.
Use a coding agent to automatically implement the mock app in Kotlin/Jetpack Compose, compile it into an APK, and subject it to automated testing and human audit.
This process preserves the most important interaction paths while avoiding a full replica of every app feature.
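The first step of the pipeline can be pictured as building a navigation graph from recorded trajectories. The following is a minimal sketch, not PhoneWorld's actual implementation: the `Step` record, its field names, and the assumption that each step is already labeled as state-changing or not are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded interaction; every field name here is hypothetical."""
    screen: str          # page the user was on
    action: str          # e.g. "tap:add_to_cart"
    next_screen: str     # page shown after the action
    mutates_state: bool  # assumed label: did the action write app state?


def build_navigation_graph(trajectories):
    """Return (edges, state_changing_actions) recovered from the traces."""
    edges = defaultdict(set)  # screen -> {(action, next_screen)}
    state_changing = set()
    for trajectory in trajectories:
        for step in trajectory:
            edges[step.screen].add((step.action, step.next_screen))
            if step.mutates_state:
                state_changing.add(step.action)
    return dict(edges), state_changing


# Two toy trajectories through an imaginary shopping app.
traces = [
    [
        Step("home", "tap:search", "search", False),
        Step("search", "tap:result_1", "product", False),
        Step("product", "tap:add_to_cart", "product", True),
    ],
    [Step("home", "tap:cart_tab", "cart", False)],
]

graph, writes = build_navigation_graph(traces)
print(sorted(graph["home"]))  # outgoing transitions from the home page
print(writes)                 # actions that change app state
```

A graph of this kind gives the later steps a page list to write PRDs for and a set of write actions that the mock app has to persist.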
Mock App Construction Details
PhoneWorld creates two data layers within each mock app:
Read‑only content: Items such as products, posts, contacts, locations, videos, and music that support browsing, search, and information retrieval.
Mutable state: Entities like favorites, shopping carts, messages, comments, and orders that can be written to a local database by the Agent.
The mutable layer enables the environment to record actions, reset to an initial state after each task, and provide verifiable outcomes.
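The split between the two layers can be illustrated with a small SQLite sketch. The table names, columns, and helper functions below are assumptions chosen for illustration; the source does not describe PhoneWorld's actual schema.

```python
import sqlite3


def create_environment():
    """Build an in-memory database with one read-only and one mutable table."""
    db = sqlite3.connect(":memory:")
    # Read-only content: seeded once and never written by the agent.
    db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
    db.executemany(
        "INSERT INTO products VALUES (?, ?, ?)",
        [(1, "Wireless earbuds", 29.9), (2, "Phone case", 9.5)],
    )
    # Mutable state: starts empty and records what the agent does.
    db.execute("CREATE TABLE cart (product_id INTEGER, quantity INTEGER)")
    db.commit()
    return db


def add_to_cart(db, product_id, quantity=1):
    """Simulate the state-changing action an agent would trigger in the UI."""
    db.execute("INSERT INTO cart VALUES (?, ?)", (product_id, quantity))
    db.commit()


def reset(db):
    """Restore the initial state: clear mutable tables, leave content untouched."""
    db.execute("DELETE FROM cart")
    db.commit()


db = create_environment()
add_to_cart(db, 1, 2)
rows_after_action = db.execute("SELECT product_id, quantity FROM cart").fetchall()
reset(db)
rows_after_reset = db.execute("SELECT COUNT(*) FROM cart").fetchone()[0]
product_count = db.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(rows_after_action, rows_after_reset, product_count)
```

Keeping content and state in separate tables makes the reset cheap: only the mutable tables are cleared, so each task starts from the same catalog.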
Verification Mechanism
Each task is paired with a verifier:
For information‑retrieval tasks, the system checks whether the final answer contains the correct value.
For state‑changing tasks, the system queries the local database to confirm that the expected changes (e.g., a message sent or an item added to the cart) have been persisted.
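Both verifier types can be sketched in a few lines. This is an illustrative example under stated assumptions: the function names, the `messages` table, and the case-insensitive substring rule are hypothetical, not the checks used in the paper.

```python
import sqlite3


def verify_answer(final_answer: str, expected_value: str) -> bool:
    """Information-retrieval check: the answer must contain the expected value."""
    return expected_value.strip().lower() in final_answer.strip().lower()


def verify_state(db: sqlite3.Connection, query: str, params: tuple = ()) -> bool:
    """State-change check: the query must match at least one persisted row."""
    return db.execute(query, params).fetchone() is not None


# Hypothetical local database of a mock messaging app.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (recipient TEXT, body TEXT)")
# Stand-in for the write the agent's actions would cause.
db.execute("INSERT INTO messages VALUES (?, ?)", ("Alice", "Running late"))
db.commit()

check = "SELECT 1 FROM messages WHERE recipient = ? AND body = ?"
sent_to_alice = verify_state(db, check, ("Alice", "Running late"))
sent_to_bob = verify_state(db, check, ("Bob", "Running late"))
answer_ok = verify_answer("The earbuds cost 29.9 yuan.", "29.9")
print(sent_to_alice, sent_to_bob, answer_ok)
```

A plain substring test is lenient (an answer of "129.9" would also contain "29.9"), so a production verifier would likely need stricter matching.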
Mock apps are installed on emulators, run through automated tests, and undergo manual audits that compare them with the original real apps to ensure fidelity of pages, interaction paths, and state transitions.
Experimental Results
The PhoneWorld paper (https://arxiv.org/abs/2605.29486) reports four key experiments:
Replacing part of the original AndroidWorld auxiliary data with 10K PhoneWorld steps improves results on four benchmarks: HYMobileBench +17.7, AndroidControl +6.0, AndroidWorld +14.7, PhoneWorld +52.5.
Using only PhoneWorld data (full replacement) continues to boost PhoneWorld benchmark performance while maintaining gains on HYMobileBench and AndroidControl, though AndroidWorld performance drops, indicating complementarity rather than outright replacement.
Scaling step data: increasing PhoneWorld supervision from 0 to 10K, 20K, and 36K steps raises the task success rate from 14.2% to 64.2%, 70.0%, and 73.3% respectively, showing that more verified trajectories continue to help.
Scaling app diversity: with a fixed 10K training budget, increasing the number of apps from 5 to 34 further improves all four benchmarks, demonstrating that greater environment diversity also contributes to performance gains.
PhoneWorld has built 34 mock Android apps covering 16 consumer‑level mobile domains, with 120 manually audited evaluation tasks, 3,354 successful trajectories, and 36,193 interaction steps.
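A little arithmetic on the figures above shows how the returns from additional steps taper off. The sketch below uses only the numbers quoted in this article; the helper name `marginal_gains` is hypothetical, and treating 36K as exactly 36,000 steps is an approximation.

```python
# (training steps, task success rate in %) as quoted in the scaling experiment
scaling = [(0, 14.2), (10_000, 64.2), (20_000, 70.0), (36_000, 73.3)]


def marginal_gains(points):
    """Success-rate gain per 1,000 additional steps between consecutive points."""
    gains = []
    for (steps0, rate0), (steps1, rate1) in zip(points, points[1:]):
        gains.append(round((rate1 - rate0) / (steps1 - steps0) * 1000, 2))
    return gains


gains = marginal_gains(scaling)

# Average trajectory length from the dataset totals quoted above.
steps_per_trajectory = round(36_193 / 3_354, 2)

print(gains)
print(steps_per_trajectory)
```

Under these assumptions the gain per 1,000 steps falls from about 5 points in the first 10K to roughly 0.2 points in the last segment, and a successful trajectory averages close to 11 steps.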
Conclusions
PhoneWorld demonstrates that mock environments reconstructed from real usage traces can provide scalable, verifiable training data that complements real‑app data, leading to consistent improvements across multiple Mobile Agent benchmarks. The next frontier for Mobile Agents is not merely better screen‑clicking ability but the availability of sufficiently realistic, large‑scale worlds for training.
