Improving Small Object Detection for UI2CODE via Data Augmentation and Model Optimization
The study improves UI2CODE's detection of tiny UI components in three ways: augmenting the training data with copied small objects, upgrading the detector from Faster RCNN to FPN and then Cascade FPN, and refining box positions with smoothing and projection. Together these raise small‑object mAP/mAR above the Faster RCNN baseline and lay the groundwork for broader UI parsing applications.
Background : In computer vision, detecting "small objects" (e.g., traffic lights in autonomous driving or early lesions in medical images) is critical for user experience and automation. UI2CODE, a tool from Xianyu Tech, parses UI elements from screenshots to generate code. Small UI components such as price tags or icons are often missed, leading to inaccurate code generation.
Challenges : COCO defines small objects as those with an area below 32×32 pixels. The main difficulties are (1) class imbalance: small objects are scarce, so the loss is dominated by larger objects; (2) feature loss: pooling layers in deep networks discard fine‑grained details; (3) localization precision: a shift of a few pixels sharply reduces IoU for a small box while barely affecting a large one.
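To make the localization point concrete, the following minimal sketch compares the IoU of a box against a copy of itself shifted horizontally by a few pixels. The box sizes (16 and 128 pixels) and the 4-pixel shift are illustrative assumptions, not figures from the study, and the helper names are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Overlap along each axis, clamped at zero when the boxes are disjoint.
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def shifted_iou(size, shift):
    """IoU between a size x size box and the same box moved right by `shift` pixels."""
    box = (0, 0, size, size)
    moved = (shift, 0, size + shift, size)
    return iou(box, moved)


if __name__ == "__main__":
    # Assumed example values: a 16 px "small" box and a 128 px "large" box.
    print(shifted_iou(16, 4))    # 0.6
    print(shifted_iou(128, 4))   # roughly 0.94
```

With the same 4-pixel error, the 16-pixel box falls to an IoU of 0.6 while the 128-pixel box stays near 0.94, so a small prediction can fail a strict IoU threshold that a large one passes comfortably.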
Proposed Solutions :
1) Data Augmentation : Randomly copy and paste small objects into non‑overlapping positions, ensuring a 5‑pixel margin from image borders and applying up to 5% scale variation.
2) Model Optimization : Progressively upgrade the detector from Faster RCNN → Feature Pyramid Network (FPN) → Cascade FPN, leveraging multi‑scale features and staged IoU thresholds to improve recall and precision for small targets.
3) Position Correction : Apply Gaussian smoothing, adaptive local binarization, and horizontal/vertical projection on the binary mask to refine bounding boxes to pixel‑level accuracy.
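The projection step in the position correction stage can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the Gaussian smoothing and adaptive binarization have already produced a binary mask (a list of rows holding 0 or 1), that boxes are half-open `(x1, y1, x2, y2)` pixel ranges, and that the coarse box only needs to shrink. The function name `refine_box` is hypothetical.

```python
def refine_box(mask, box):
    """Shrink a coarse box to the tight extent of foreground pixels inside it.

    mask: binary image as a list of rows, each a list of 0/1 values.
    box:  (x1, y1, x2, y2) with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = box

    # Horizontal projection: foreground count per row inside the box.
    rows = [sum(mask[y][x1:x2]) for y in range(y1, y2)]
    # Vertical projection: foreground count per column inside the box.
    cols = [sum(mask[y][x] for y in range(y1, y2)) for x in range(x1, x2)]

    ys = [i for i, count in enumerate(rows) if count > 0]
    xs = [i for i, count in enumerate(cols) if count > 0]

    # No foreground found: keep the detector's box unchanged.
    if not ys or not xs:
        return box

    # First and last non-empty row/column give the refined edges.
    return (x1 + xs[0], y1 + ys[0], x1 + xs[-1] + 1, y1 + ys[-1] + 1)


if __name__ == "__main__":
    # Toy 8x6 mask with a component occupying columns 3..5 and rows 2..3.
    mask = [[1 if 2 <= y <= 3 and 3 <= x <= 5 else 0 for x in range(8)]
            for y in range(6)]
    print(refine_box(mask, (1, 1, 7, 5)))  # (3, 2, 6, 4)
```

A fuller version would likely search a padded region around the detection so the box can also grow, and would need to guard against neighbouring components that leak into the window.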
Results : Experiments show that FPN and Cascade FPN outperform Faster RCNN on small‑object mAP/mAR across confidence thresholds (0.5–0.95). Cascade FPN yields the best overall metrics when the confidence threshold is 0.5.
Outlook : The small‑object detection pipeline forms the basis for UI element parsing and can be extended to other complex‑background analyses, generative up‑sampling, or class‑aware position refinement for higher precision.
Xianyu Technology
Official account of the Xianyu technology team