Inside the AI Phone: How Bean Bag’s Device Bypasses Android Security and Captures Screens

An in‑depth analysis finds that the Bean Bag AI phone bypasses Android’s Accessibility Service entirely: it reads raw image data from GPU buffers, injects input events through hidden system APIs, runs a headless virtual display, and streams low‑resolution frames to the cloud for inference. The design raises significant privacy and security concerns.

Sohu Tech Products

Background

The Bean Bag AI phone has attracted attention because major platforms such as WeChat and Taobao actively resist third‑party automation agents. A detailed video analysis by the Bilibili creator "老戴" (Lao Dai) breaks down how the phone’s AI assistant reads the screen, captures data, and simulates user actions.

Low‑level Screen Capture and Input Injection

Unlike typical Android automation, which relies on the Accessibility Service, the device uses a deeper, system‑level implementation: it reads raw image data directly from the GPU buffer and injects input events through hidden system APIs. The autoaction APK holds the INJECT_EVENTS permission, allowing it to call injectInputEvent and simulate clicks with higher privileges than the standard accessibility APIs provide.

The Bean Bag phone leverages low‑level system permissions, fetching raw image data directly from the GPU buffer and injecting input events rather than relying on screenshots or accessibility services.
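Conceptually, simulating a click through injectInputEvent means synthesizing a matched press/release pair at the target coordinates. The following Python sketch models that event sequence; the MotionEvent class and synthesize_tap helper are illustrative stand-ins, not the Android API (only the ACTION_DOWN/ACTION_UP constant values mirror android.view.MotionEvent):

```python
import time
from dataclasses import dataclass

# These values mirror android.view.MotionEvent's constants.
ACTION_DOWN, ACTION_UP = 0, 1

@dataclass
class MotionEvent:
    """Stand-in for the Android MotionEvent an injector would construct."""
    action: int
    x: float
    y: float
    timestamp_ms: float

def synthesize_tap(x: float, y: float, hold_ms: float = 50.0) -> list[MotionEvent]:
    """Build the DOWN/UP pair a privileged injector would feed, one at a
    time, to something like InputManager.injectInputEvent."""
    now = time.monotonic() * 1000.0
    return [
        MotionEvent(ACTION_DOWN, x, y, now),
        MotionEvent(ACTION_UP, x, y, now + hold_ms),
    ]

events = synthesize_tap(540, 1200)
```

Because the events are injected below the app layer, the receiving app cannot distinguish them from a real finger press, which is exactly why the INJECT_EVENTS permission is normally reserved for system processes.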

Key Processes

The analysis identifies a process named aikernel as the core AI engine on the device. Its native heap can occupy up to 160 MB, suggesting a local AI inference framework. aikernel also shows an unusually high number of Binder connections, indicating that many external processes invoke it via RPC and reinforcing its role as a system‑level service.

Virtual Screen and Cloud Inference

When performing automated actions, the phone creates a headless virtual screen matching the physical display’s resolution. This virtual screen runs in the background with its own input focus, so interaction on the foreground display is unaffected. The GPU‑composited frames are consumed directly by autoaction, without invoking any screenshot API.

Every 3–5 seconds, the device captures a ~250 KB frame and sends it to the cloud service at obriccloud.com (ByteDance’s backend). The cloud performs inference and returns a ~1 KB command indicating one of seven possible actions (open app, click, input, swipe, etc.). This architecture offloads most reasoning to the cloud, keeping the on‑device workload lightweight.
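The loop described above can be sketched as follows. The article discloses only the frame size (~250 KB), the reply size (~1 KB), and that there are seven action types, four of which it names; the JSON wire format, the three remaining action names (back, home, wait), and the send callback standing in for the HTTPS upload are assumptions for illustration.

```python
import json

# Four of the seven action types are named in the analysis (open app, click,
# input, swipe); "back", "home", and "wait" are guesses for illustration.
ACTIONS = {"open_app", "click", "input", "swipe", "back", "home", "wait"}

def parse_command(reply: bytes) -> dict:
    """Decode a ~1 KB cloud reply into a validated action command."""
    cmd = json.loads(reply)
    if cmd.get("action") not in ACTIONS:
        raise ValueError(f"unknown action: {cmd.get('action')!r}")
    return cmd

def agent_step(frame: bytes, send) -> dict:
    """One round of the loop: upload a small frame, act on the reply.
    `send` stands in for the HTTPS call to the cloud backend."""
    assert len(frame) <= 250_000, "frames are kept small (~250 KB) before upload"
    return parse_command(send(frame))

# Simulated round trip: a fake cloud that always answers "click at (540, 1200)".
fake_cloud = lambda frame: json.dumps(
    {"action": "click", "x": 540, "y": 1200}
).encode()
cmd = agent_step(b"\x00" * 200_000, fake_cloud)
```

The asymmetry is the point of the design: a few hundred kilobytes go up every few seconds, roughly a kilobyte comes back, and all heavyweight reasoning stays in the cloud.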

The AI phone streams small frames to the cloud at low frequency for inference, receiving concise commands that drive local actions.

Security and Privacy Implications

Because the AI agent bypasses both the Accessibility Service and DRM protections, it can capture protected video output and potentially circumvent anti‑screenshot measures (such as Android’s FLAG_SECURE) in banking and other sensitive apps. This raises serious privacy concerns, since raw screen content is transmitted to remote servers, and broader questions about Android’s security model, in which hidden APIs can be abused for high‑privilege automation.

Future Outlook and Regulation

The emergence of AI agents on mobile devices challenges the existing attention‑economy model and may reshape the commercial logic of the mobile internet. As these agents gain capabilities to harvest user time and data, there is a growing call for unified standards, regulatory frameworks, and possibly new Android APIs (e.g., the upcoming AppFunction API in Android 16) that explicitly define what AI‑accessible functionalities an app may expose.
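The idea behind such an API can be sketched as an explicit allowlist: apps declare which functions an agent may invoke, and anything undeclared is refused. None of the names below come from the real AppFunction API; this is a minimal conceptual model of the contract it aims to formalize.

```python
class FunctionRegistry:
    """Toy allowlist: an agent may only invoke functions an app has declared."""

    def __init__(self):
        self._declared = {}  # (app, name) -> callable

    def declare(self, app: str, name: str, fn) -> None:
        """An app opts a specific function in to agent access."""
        self._declared[(app, name)] = fn

    def invoke(self, app: str, name: str, **kwargs):
        """An agent calls through the registry; undeclared functions fail."""
        fn = self._declared.get((app, name))
        if fn is None:
            raise PermissionError(f"{app} does not expose {name!r} to agents")
        return fn(**kwargs)

registry = FunctionRegistry()
registry.declare("shop_app", "search", lambda query: f"results for {query}")
result = registry.invoke("shop_app", "search", query="coffee")
```

Compared with raw screen capture and input injection, this inverts the power relationship: the app, not the agent, decides what is automatable.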

AI agents could destabilize the mobile ecosystem’s underlying business logic, concentrating power and creating new monopolistic dynamics.

Overall, the Bean Bag AI phone demonstrates a sophisticated, low‑level automation pipeline that combines direct GPU access, hidden input injection, a headless virtual display, and cloud‑based inference, highlighting both technical ingenuity and significant privacy‑security challenges.

AI · Mobile Automation · Android Security · Cloud Inference · GPU Buffer
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
