
AI‑Driven Visual Automation Testing Frameworks: Challenges, Opportunities, and the Aion Solution

The article examines shortcomings of traditional visual automation frameworks—weak cross‑platform support, ID dependence, and fragile screenshot matching—and shows how Aion’s hybrid approach, merging image‑processing segmentation with deep‑learning classification and OCR, delivers a more stable, cross‑platform, “visible‑to‑obtain” testing solution while acknowledging remaining accuracy challenges.

iQIYI Technical Product Team

When evaluating an automated testing framework, we usually consider criteria such as ease of developing and maintaining test code, high stability, and sufficient execution efficiency. Once those basic requirements are met, we also look for cross-platform capability, quick-app support, and support for hybrid applications. Traditional testing frameworks each have their own strengths and weaknesses.

The majority of platform‑provided frameworks lack cross‑platform ability. Although Appium supports cross‑platform testing, it is essentially a wrapper over different platforms and, because it relies on a protocol‑based approach, its test code is harder to maintain than that of other frameworks. In summary, traditional frameworks suffer from the following shortcomings:

Weak cross‑platform capability

Weak cross‑application capability

Stability heavily dependent on element IDs

High cost of UI element capture

Unreliable dump of system view trees

In this context, testers desire a “visible‑to‑obtain” automation framework that makes test code resemble the way users interact with the app. Earlier attempts such as Sikuli and NetEase’s AirTest rely on image‑based screenshot matching for clicks and verification.

Sikuli dates from the PC era, while AirTest is a more recent NetEase offering. Both are useful, but screenshot-matching approaches share several drawbacks: insufficient matching accuracy, no hierarchical structure, and instability under color shifts or background changes, all of which reduce code maintainability.

Ideally, a test script could translate a natural-language instruction like "click the 'Member' tab" into a call such as find('tab').find('Member').click(). To achieve this, the following basic capabilities are required:

Image segmentation

Image classification

OCR (optical character recognition)

Image similarity matching

Pixel‑level operations

Some of these capabilities already exist in traditional techniques, while others have become feasible thanks to recent advances in deep learning.
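To make the target concrete, here is a minimal sketch of how such a fluent finder could sit on top of a recognized view tree. The `Node` fields and `Finder` class are hypothetical illustrations, not Aion's actual API; in a real framework the tree would be produced by the segmentation, classification, and OCR stages listed above.

```python
# Hypothetical sketch of a fluent "visible-to-obtain" finder API.
# Node fields and class names are illustrative, not Aion's real API.

class Node:
    def __init__(self, kind, text="", bounds=(0, 0, 0, 0), children=None):
        self.kind = kind          # category from image classification, e.g. "tab"
        self.text = text          # text recovered by OCR
        self.bounds = bounds      # (x, y, width, height) in screen pixels
        self.children = children or []

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()

class Finder:
    def __init__(self, nodes):
        self.nodes = list(nodes)

    def find(self, query):
        # Match either the classified kind or the OCR text.
        hits = [n for root in self.nodes for n in root.walk()
                if n.kind == query or n.text == query]
        return Finder(hits)

    def click(self):
        # Return the center point of the first match as the tap target.
        x, y, w, h = self.nodes[0].bounds
        return (x + w // 2, y + h // 2)

# Example: a tiny recognized tree containing a "Member" tab.
root = Node("screen", children=[
    Node("tab", text="Home",   bounds=(0,   1800, 180, 120)),
    Node("tab", text="Member", bounds=(180, 1800, 180, 120)),
])
tap = Finder([root]).find("tab").find("Member").click()  # -> (270, 1860)
```

The chained `find` calls narrow the candidate set exactly as the natural-language instruction reads, which is what makes this style easy to develop and maintain.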

Opportunities Brought by Deep Learning

The ImageNet ILSVRC results illustrate the shift. Before 2012, entries relied on hand-crafted features and top-5 classification accuracy plateaued below 75%. After deep learning was introduced, accuracy rose dramatically, reaching 98-99% in recent years and surpassing reported human performance on the benchmark.

OCR technology has also progressed: sentence-level recognition can achieve about 93% accuracy, and single-character recognition can reach 98%. Recall is not perfect, but because our workflow crops element images first and only then applies OCR, per-region accuracy rather than recall is the primary concern.

With these foundational abilities, we began building a "visible-to-obtain" testing framework. The industry concept of UI2Code, reverse-generating a view tree from a screenshot, aligns with our technical goals.

PixelToApp

PixelToApp is a project based on traditional image‑processing techniques. It converts a screenshot into a layout by first using OCR to locate text regions, then extracting element positions via image processing, merging the two sources to obtain a more accurate layout, and finally applying algorithms to detect lists, grids, etc. This approach suffers from limited complex image‑processing capability and poor generalization of the algorithms.
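The merge step described above can be sketched as follows. The box formats and the `merge_layout` function are assumptions for illustration, not PixelToApp's real interfaces; a real pipeline would get OCR boxes from an OCR engine and element boxes from contour or edge detection.

```python
# Illustrative sketch of a PixelToApp-style merge: combine text boxes
# from OCR with element boxes from image processing into one layout.
# Box formats and function names are assumptions, not the project's API.

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter) if inter else 0.0

def merge_layout(ocr_boxes, element_boxes, threshold=0.5):
    # Attach each OCR result to the element box it overlaps; OCR boxes
    # with no matching element become text-only elements, so both
    # sources contribute to the final layout.
    layout = [{"box": b, "text": None} for b in element_boxes]
    for box, text in ocr_boxes:
        for el in layout:
            if iou(box, el["box"]) >= threshold:
                el["text"] = text
                break
        else:
            layout.append({"box": box, "text": text})
    return layout

elements = [(10, 10, 110, 50), (10, 60, 110, 100)]   # from image processing
ocr = [((12, 12, 108, 48), "Play"), ((200, 10, 260, 40), "VIP")]  # from OCR
merged = merge_layout(ocr, elements)
```

After the merge, list- and grid-detection heuristics would run over `merged`; it is those hand-tuned heuristics that generalize poorly, as noted above.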

Pix2Code

Pix2Code is an end-to-end deep-learning solution that gained attention on GitHub. Its model passes a screenshot through a CNN to obtain a feature vector p, while an LSTM encodes the corresponding layout-description tokens into a vector q. The two vectors are concatenated and fed into a second LSTM followed by a softmax layer that emits the next token. Reported accuracy is only 60-70% even on simple interfaces, and more recent end-to-end attempts still struggle with low overall accuracy.
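A toy numpy illustration of that data flow, under loud assumptions: the "CNN" and "LSTM" below are stand-in functions with random weights, not trained networks, and exist only to show the shapes moving through the p/q concatenation and softmax step.

```python
# Toy illustration of the Pix2Code data flow: image -> feature p,
# token history -> state q, then concat(p, q) -> decoder -> softmax
# over the next token. All "networks" here are untrained stand-ins.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<pad>", "row", "btn", "label", "<end>"]
D = 8  # feature size for both p and q

def cnn_features(image):
    # Stand-in for the CNN: global average pool + fixed projection.
    w = rng.standard_normal((image.shape[-1], D))
    return image.mean(axis=(0, 1)) @ w           # p: shape (D,)

def encode_tokens(token_ids):
    # Stand-in for the encoder LSTM: tanh recurrence over embeddings.
    emb = rng.standard_normal((len(VOCAB), D))
    h = np.zeros(D)
    for t in token_ids:
        h = np.tanh(h + emb[t])
    return h                                     # q: shape (D,)

def next_token_probs(image, token_ids):
    p, q = cnn_features(image), encode_tokens(token_ids)
    w_out = rng.standard_normal((2 * D, len(VOCAB)))
    logits = np.concatenate([p, q]) @ w_out      # decoder step
    e = np.exp(logits - logits.max())
    return e / e.sum()                           # softmax over next token

probs = next_token_probs(rng.random((32, 32, 3)), token_ids=[1, 2])
```

Decoding repeats this step, appending the sampled token to the history until `<end>` is produced; training fits all three components jointly, which is precisely why failures are hard to localize.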

The main issue with pure deep‑learning end‑to‑end methods is the lack of intervenable steps; poor results often require redesigning or retraining the model, which is time‑consuming and incurs high annotation costs.

Aion’s Hybrid Solution

After comparing traditional image‑processing and pure deep‑learning approaches, we adopted a hybrid solution that combines both. First, image‑processing techniques perform segmentation; then image classification identifies segment categories; sub‑elements are extracted; finally OCR and classification recognize attributes to populate a secondary view tree. This secondary tree enables the desired click and verification actions.
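The staged pipeline above can be condensed into the following sketch. Every stage function is a stub standing in for a real component (image-processing segmentation, the deep-learning classifier, the OCR engine); only the control flow and the shape of the secondary view tree reflect the description above.

```python
# Condensed sketch of the hybrid pipeline: segmentation, then
# classification, then OCR, each feeding a secondary view tree.
# Stage functions are stubs standing in for real CV/DL components.

def segment(screenshot):
    # Stand-in for image-processing segmentation: candidate boxes.
    return [(0, 0, 100, 40), (0, 50, 100, 90)]

def classify(screenshot, box):
    # Stand-in for the deep-learning classifier (e.g. MobileNetV2).
    return "tab" if box[1] == 0 else "button"

def ocr(screenshot, box):
    # Stand-in for the OCR stage.
    return {"tab": "Member", "button": "Play"}[classify(screenshot, box)]

def build_view_tree(screenshot):
    # Populate the secondary view tree stage by stage.
    tree = {"kind": "screen", "children": []}
    for box in segment(screenshot):
        tree["children"].append({
            "kind": classify(screenshot, box),
            "text": ocr(screenshot, box),     # attribute filling
            "bounds": box,
        })
    return tree

tree = build_view_tree(screenshot=None)
```

Because each stage is a separate, inspectable step, a wrong classification or a missed OCR hit can be diagnosed and fixed in isolation, which is the key advantage over end-to-end models.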

The overall test‑case execution flow (illustrated in the original slides) applies AI for scene judgment, classification, and attribute filling, while seamlessly supporting legacy frameworks to ease migration from traditional test cases.

Aion’s Deep‑Learning Optimizations

We selected MobileNetV2 for its high inference speed at acceptable accuracy. To address imbalanced training data, we generated additional scripts and collected more app screenshots (e.g., from iQIYI) to create a more balanced dataset. Accuracy improvements were pursued by switching from top‑layer transfer learning to full fine‑tuning, and by employing multi‑model ensembles.
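One of the optimizations named above, the multi-model ensemble, can be sketched as simple probability averaging. The hard-coded model outputs below are illustrative; in practice they would come from several fine-tuned MobileNetV2 variants.

```python
# Hedged sketch of a multi-model ensemble: average each class's
# probability across models and pick the argmax. Model outputs are
# hard-coded for illustration only.

def ensemble_predict(prob_lists):
    # prob_lists: one probability vector per model, same class order.
    n_classes = len(prob_lists[0])
    avg = [sum(p[i] for p in prob_lists) / len(prob_lists)
           for i in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

# Three models disagree on a 3-class problem (tab / button / label):
model_outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
label, avg = ensemble_predict(model_outputs)  # label 0 ("tab") wins
```

Averaging smooths over individual-model mistakes: here the second model alone would misclassify, but the ensemble still selects class 0.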

Advantages and Limitations of Aion

Compared with traditional frameworks, Aion offers clear benefits:

Visible‑to‑obtain; easy to understand and develop

Weak dependency on system frameworks; cross‑platform

Strong stability; no ID‑confusion issues

Shallow hierarchy; simple view capture

Seamless support for legacy frameworks

Remaining challenges include further improving recognition accuracy, especially for elements with minimal visual features (e.g., thin input lines or cursors), and handling scenes where foreground and background blend heavily, making sub‑element extraction difficult.

Conclusion

AI techniques have indeed solved problems that were previously intractable for our team. The Aion framework overall outperforms existing testing frameworks. Moreover, mobile testing still presents many scenarios where AI can add value, such as page anomaly detection, user‑behavior prediction, and page pre‑loading.
