How Frontend Code Is Automatically Generated: Inside Alibaba’s AI‑Powered D2C Pipeline
This article explains Alibaba's front‑end intelligent project that automatically generated 79.34% of the Double‑11 UI code, detailing why images are used as input, the layered image‑processing pipeline, background and foreground analysis, traditional versus deep‑learning methods, fusion techniques, evaluation results, and real‑world deployments.
Overview
As one of the four technical directions of Alibaba’s Front‑End Committee, the intelligent front‑end project passed the 2019 Double‑11 test, automatically generating 79.34% of the code for the Tmall‑Taobao Double‑11 venue. This article shares the technical ideas and challenges behind the automatic front‑end code generation.
Why Use Images as Input
Converting design drafts to UI code by hand is laborious. Images exported from Sketch or Photoshop carry deterministic information about the final rendering and keep the pipeline independent of the upstream design tool. Working from images also makes it possible to handle layout types that do not exist in the design file (e.g., listview, gridview) and to support broader scenarios such as automated testing.
Layer Processing Layer
The D2C layer‑processing capability identifies element categories and extracts styles, feeding the subsequent layout algorithm layer.
Layout Analysis
Layout analysis separates foreground UI fragments from background using machine‑vision algorithms. Background analysis extracts color, gradient direction and connected regions; foreground analysis uses deep‑learning to merge and recognize GUI fragments.
Background Analysis
Step 1: Detect background blocks with the Sobel, Laplacian, and Canny edge detectors and compute the gradient direction. The original article follows this step with the discrete Laplacian template, which is not reproduced here.
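As a rough illustration of Step 1, the sketch below applies the textbook 4-neighbour discrete Laplacian template to a tiny grayscale image and derives a gradient direction from Sobel responses. The kernel is the common textbook form and is an assumption here, since the article's own template is not available and may differ. The helper names `convolve3x3` and `gradient_direction` are hypothetical; a real pipeline would more likely call `cv2.Laplacian`, `cv2.Sobel`, and `cv2.Canny`.

```python
import math

# Textbook 4-neighbour discrete Laplacian template (assumed, not taken from the article).
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]
# Sobel templates for horizontal (x) and vertical (y) intensity change.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def convolve3x3(img, kernel):
    """Correlate a 2D list with a 3x3 kernel; border pixels are left at 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(kernel[j][i] * img[y + j - 1][x + i - 1]
                            for j in range(3) for i in range(3))
    return out


def gradient_direction(img):
    """Dominant gradient angle in degrees (0 = left to right, 90 = top to bottom)."""
    gx = convolve3x3(img, SOBEL_X)
    gy = convolve3x3(img, SOBEL_Y)
    return math.degrees(math.atan2(sum(map(sum, gy)), sum(map(sum, gx))))


# A 5x5 background whose brightness increases linearly from top to bottom.
vertical_gradient = [[10 * row] * 5 for row in range(5)]
# A flat background with one bright pixel that the Laplacian should flag.
flat_with_dot = [[0] * 5 for _ in range(5)]
flat_with_dot[2][2] = 100

lap = convolve3x3(flat_with_dot, LAPLACIAN)
angle = gradient_direction(vertical_gradient)
print(lap[2][2])   # strong negative response at the bright pixel
print(angle)       # close to 90 for a top-to-bottom gradient
```

The Laplacian of a linear gradient is zero, so a smooth gradient background produces no response while a sharp foreground edge does; the Sobel responses are what reveal the gradient direction.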
Step 2: Use flood‑fill (water‑fill) to remove noise from gradient backgrounds.
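import cv2
import numpy as np

def fill_color_diffuse_water_from_img(task_out_dir, image, x, y, thres_up=(10, 10, 10), thres_down=(10, 10, 10), fill_color=(255, 255, 255)):
    # Get the image height and width.
    h, w = image.shape[:2]
    # cv2.floodFill requires a mask 2 pixels wider and taller than the image.
    mask = np.zeros([h + 2, w + 2], np.uint8)
    # Fill from the seed (x, y). With FLOODFILL_FIXED_RANGE each pixel is compared
    # with the seed pixel; thres_down / thres_up are the lower / upper tolerances.
    cv2.floodFill(image, mask, (x, y), fill_color, thres_down, thres_up, cv2.FLOODFILL_FIXED_RANGE)
    # The output directory task_out_dir/ui must already exist.
    cv2.imwrite(task_out_dir + "/ui/tmp2.png", image)
    return image, mask

The original article shows the resulting image after background processing at this point.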
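To make the fixed-range behaviour of the flood-fill call concrete, here is a minimal pure-Python sketch on a single-channel image stored as nested lists. It is an illustration of the idea, not a replacement for `cv2.floodFill`; the function name `flood_fill_fixed_range` and the toy image are hypothetical.

```python
from collections import deque


def flood_fill_fixed_range(img, seed, fill, lo_diff=10, up_diff=10):
    """Fill the 4-connected region around `seed` whose values lie within
    [seed_value - lo_diff, seed_value + up_diff]; return the filled pixel count."""
    h, w = len(img), len(img[0])
    sx, sy = seed
    seed_value = img[sy][sx]
    visited = [[False] * w for _ in range(h)]
    visited[sy][sx] = True
    queue = deque([(sx, sy)])
    filled = 0
    while queue:
        x, y = queue.popleft()
        img[y][x] = fill
        filled += 1
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < w and 0 <= ny < h and not visited[ny][nx]:
                # Fixed range: always compare against the seed, not the neighbour.
                if seed_value - lo_diff <= img[ny][nx] <= seed_value + up_diff:
                    visited[ny][nx] = True
                    queue.append((nx, ny))
    return filled


# Slowly varying background (100..108) around a dark 2x2 foreground block (20).
image = [
    [100, 102, 104, 106, 108],
    [100,  20,  20, 106, 108],
    [100,  20,  20, 106, 108],
    [100, 102, 104, 106, 108],
]
count = flood_fill_fixed_range(image, seed=(0, 0), fill=255)
print(count)  # 16 background pixels are flattened to 255; the block keeps value 20
```

Because every background pixel stays within the tolerance of the seed, the whole gradient is replaced by one flat colour, while the foreground block is left untouched.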
Foreground Analysis
Foreground analysis first applies connected‑component analysis so that components are not fragmented, then uses machine learning to classify component types and merges fragments iteratively until no small features remain. The original article illustrates this with a complete item extracted from a waterfall‑flow layout.
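The iterative merge can be sketched as follows. This is a simplified stand-in: the article's learned classifier is replaced by a plain proximity rule, and the fragment boxes are assumed to come from an earlier connected-component step. The names `boxes_close` and `merge_fragments`, the `gap` threshold, and the coordinates are all hypothetical.

```python
def boxes_close(a, b, gap):
    """True if boxes (x1, y1, x2, y2) overlap or lie within `gap` pixels of each other."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2] and
            a[1] - gap <= b[3] and b[1] - gap <= a[3])


def merge_fragments(boxes, gap=8):
    """Merge nearby boxes pairwise, restarting after each merge, until nothing changes."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if boxes_close(boxes[i], boxes[j], gap):
                    a, b = boxes[i], boxes[j]
                    # Replace the pair with their common bounding box.
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return sorted(boxes)


# Fragments of two cards in a two-column waterfall layout (made-up coordinates).
fragments = [
    (0, 0, 100, 100),     # card A image
    (0, 105, 100, 120),   # card A title
    (0, 124, 40, 136),    # card A price
    (120, 0, 220, 140),   # card B image
    (120, 146, 220, 160), # card B title
]
cards = merge_fragments(fragments, gap=8)
print(cards)  # [(0, 0, 100, 136), (120, 0, 220, 160)]
```

The 20-pixel gutter between the columns exceeds the gap threshold, so the two cards stay separate while the fragments inside each card collapse into one box.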
Traditional vs. Deep‑Learning Methods
Traditional edge‑gradient or connected‑component methods have high precision and speed but low recall. One‑stage detectors (YOLO, SSD) have high recall but lower localization accuracy; two‑stage detectors (Faster R‑CNN) achieve higher mAP at the cost of speed. A fusion of both methods can obtain high precision, recall and localization.
Step 1: Run the traditional and deep‑learning pipelines in parallel to obtain trbox and dlbox.
Step 2: Filter trbox by IOU with dlbox > 0.8.
Step 3: Filter dlbox by IOU with the filtered trbox > 0.8.
Step 4: Adjust dlbox edges toward the nearest straight line within a pixel threshold, without crossing trbox edges.
Step 5: Output the fused boxes.
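The fusion steps above can be sketched as below, under one assumed reading of the filters: a trbox is kept only when some dlbox confirms it (IOU above the threshold), and a dlbox is dropped when it duplicates a kept trbox, so the output is the union of the two remaining sets. The source does not state whether "filter" means keep or drop, so treat this as an interpretation. The function names, the line coordinates, and the sample boxes are hypothetical, and the "without crossing trbox edges" constraint is left out for brevity.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def snap(value, lines, max_shift):
    """Move `value` to the nearest coordinate in `lines` if it is within `max_shift` pixels."""
    best = min(lines, key=lambda line: abs(line - value), default=value)
    return best if abs(best - value) <= max_shift else value


def fuse(trboxes, dlboxes, v_lines, h_lines, thresh=0.8, max_shift=3):
    # Assumed reading of the trbox filter: keep a trbox only if a dlbox confirms it.
    kept_tr = [t for t in trboxes if any(iou(t, d) > thresh for d in dlboxes)]
    # Assumed reading of the dlbox filter: drop a dlbox that duplicates a kept trbox.
    kept_dl = [d for d in dlboxes if not any(iou(d, t) > thresh for t in kept_tr)]
    # Snap the remaining dlbox edges to nearby straight lines (non-crossing check omitted).
    snapped = [(snap(x1, v_lines, max_shift), snap(y1, h_lines, max_shift),
                snap(x2, v_lines, max_shift), snap(y2, h_lines, max_shift))
               for x1, y1, x2, y2 in kept_dl]
    return kept_tr + snapped


trboxes = [(10, 10, 110, 160), (300, 300, 320, 320)]  # second box has no DL support
dlboxes = [(12, 11, 109, 158), (130, 12, 228, 161)]   # first box duplicates a trbox
fused = fuse(trboxes, dlboxes, v_lines=[10, 110, 128, 230], h_lines=[10, 160])
print(fused)  # [(10, 10, 110, 160), (128, 10, 230, 160)]
```

Under this reading, the precisely located traditional box wins wherever both methods agree, and the deep-learning boxes fill in the cards the traditional method missed, with their edges tightened against detected lines.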
Evaluation
On 50 Xianyu waterfall‑flow screenshots (96 cards), traditional methods detected 65 cards, deep‑learning 97, and the fused approach 98, achieving higher precision, recall and IOU.
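For reference, the three metrics named here have the following standard definitions, where $A$ is a predicted box, $B$ is the ground-truth box, $TP$ counts predictions matched to a real card, $FP$ counts unmatched predictions, and $FN$ counts missed cards. The IOU threshold used to decide a match is not stated in the source.

```latex
\[
\mathrm{IOU}(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}
\]
```

Because a method can report more boxes than there are cards (97 and 98 against 96 here), the raw detection count alone does not determine either precision or recall.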
Complex Background Content Extraction
Complex background extraction aims to retrieve specific content (text, overlay layers) from noisy backgrounds. Traditional image processing struggles with accuracy and recall; semantic segmentation cannot recover occluded pixels. The proposed solution combines object‑detection for content recall and a SR‑GAN to restore foreground elements.
Why Use GAN?
SR‑GAN preserves high‑frequency details via a feature‑map loss, reduces false detections with adversarial loss, and can restore pixel values of transparent overlays—something semantic segmentation cannot do.
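For context, the SRGAN paper (Ledig et al., 2017) trains the generator with a perceptual loss that sums a content term computed on VGG feature maps and a weighted adversarial term. The formulas below are that published baseline; whether the network described here keeps the same feature layers or weighting is not stated, so read them as an assumption about the general shape of the objective. $I^{LR}$ is the generator input, $I^{HR}$ the target image, $G_{\theta_G}$ the generator, $D_{\theta_D}$ the discriminator, and $\phi_{i,j}$ a VGG feature map of size $W_{i,j} \times H_{i,j}$.

```latex
\[
l^{SR} = l^{SR}_{VGG/i,j} + 10^{-3}\, l^{SR}_{Gen}
\]
\[
l^{SR}_{VGG/i,j} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}}
\Bigl( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}\bigl(G_{\theta_G}(I^{LR})\bigr)_{x,y} \Bigr)^{2}
\]
\[
l^{SR}_{Gen} = \sum_{n=1}^{N} -\log D_{\theta_D}\bigl(G_{\theta_G}(I^{LR})\bigr)
\]
```

The feature-map term penalizes differences in texture and edges rather than raw pixels, which is what preserves high-frequency detail, while the adversarial term pushes outputs toward images the discriminator accepts as real.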
Training Pipeline
Business Applications
The method is deployed in the imgcook image pipeline (73%‑92% accuracy) and Alibaba’s automated testing for Double‑11 modules (over 97% accuracy and recall).
Future Work
Plans include richer layout recognition (listview, gridview, waterfall), improving accuracy for small objects with FPN and Cascade, expanding to more pages, and building an image‑sample generator to lower integration cost.
