
Unsupervised Domain Adaptation with Pixel-level Discriminator for Image-aware Layout Generation

Alimama's team introduces PDA‑GAN, an unsupervised domain‑adaptation framework employing a lightweight pixel‑level discriminator to align repaired and original image features, enabling image‑aware layout generation that outperforms prior methods on visual‑quality and layout metrics for advertising creatives.

Alimama Tech

This article shares the exploration and practice of Alimama's content platform and intelligent creation team on automatic element layout for image‑text advertising, leveraging unsupervised domain adaptation to improve image‑aware layout generation. The work has been accepted at CVPR 2023.

Paper: Unsupervised Domain Adaption with Pixel-level Discriminator for Image-aware Layout Generation

Link: https://arxiv.org/abs/2303.14377

1. Background

In advertising, the visual appeal of a creative is positively correlated with its click‑through rate. Existing automated creative solutions rely on fixed‑template element replacement, which often occludes the product image, integrates poorly with the visual, and lacks uniqueness. Academic layout‑generation methods mainly model relationships among layout elements while ignoring image content, so they cannot solve these issues.

To address this, we propose an image‑aware layout generation approach. We collect rendered poster images with annotated element positions, remove the elements via inpainting, and apply Gaussian blur to reduce the domain gap between clean product images and repaired posters. This pipeline works, but inpainting introduces artifacts and the blur degrades color and texture detail.
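The Gaussian‑blur step can be sketched as a separable filter applied to both domains; the kernel size and sigma below are illustrative assumptions, not the values used in the paper (PyTorch is used here only for convenience):

```python
import torch
import torch.nn.functional as F

def gaussian_blur(img, kernel_size=9, sigma=3.0):
    """Blur an image batch (B, C, H, W) with a separable Gaussian kernel.
    kernel_size and sigma are illustrative, not the paper's values."""
    ax = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
    g1d = torch.exp(-ax ** 2 / (2 * sigma ** 2))
    g1d = g1d / g1d.sum()
    kernel2d = torch.outer(g1d, g1d)                 # normalized 2D kernel
    c = img.size(1)
    # One copy of the kernel per channel, applied depthwise via groups=c.
    weight = kernel2d.view(1, 1, kernel_size, kernel_size).repeat(c, 1, 1, 1)
    return F.conv2d(img, weight, padding=kernel_size // 2, groups=c)
```

Blurring smooths away high‑frequency inpainting artifacts, at the cost of fine color and texture detail, which is exactly the trade‑off the pixel‑level discriminator is meant to remove.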

We further introduce unsupervised domain adaptation with a pixel‑level discriminator (PD) in a GAN (named PDA‑GAN) to align the domains more finely. PD consists of only three convolutional layers, making it lightweight while enabling the model to perceive fine image details for more accurate layout generation.

2. Method

As shown in Figure 1, the network comprises two sub‑networks: (1) a layout generator that takes an image and its saliency map as input to produce a graphic layout, and (2) a pixel‑level domain discriminator PD.

The layout generator follows the architecture of [4] and includes a multi‑scale CNN for image feature extraction, a transformer encoder‑decoder to model relationships between layout elements and the image, and two fully‑connected layers to predict element categories and bounding boxes.
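A minimal PyTorch sketch of this generator structure follows; the channel widths, query count, layer counts, and number of element classes are illustrative assumptions, not the configuration of [4]:

```python
import torch
import torch.nn as nn

class LayoutGenerator(nn.Module):
    """Sketch: CNN image features feed a transformer encoder-decoder;
    two FC heads predict element categories and bounding boxes.
    All sizes here are illustrative."""

    def __init__(self, num_classes=4, num_elements=8, d_model=256):
        super().__init__()
        # Toy CNN standing in for the multi-scale backbone; input is the
        # image concatenated with its saliency map (3 + 1 channels).
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True,
        )
        # One learned query per layout-element slot.
        self.queries = nn.Parameter(torch.randn(num_elements, d_model))
        self.cls_head = nn.Linear(d_model, num_classes)  # element category
        self.box_head = nn.Linear(d_model, 4)            # (cx, cy, w, h)

    def forward(self, image_with_saliency):
        b = image_with_saliency.size(0)
        feat = self.backbone(image_with_saliency)           # (B, C, H', W')
        tokens = feat.flatten(2).transpose(1, 2)            # (B, H'*W', C)
        tgt = self.queries.unsqueeze(0).expand(b, -1, -1)   # (B, N, C)
        out = self.transformer(tokens, tgt)                 # (B, N, C)
        return self.cls_head(out), self.box_head(out).sigmoid()
```

The decoder queries let each element slot attend both to the image tokens and to the other elements, which is how the model couples layout decisions to image content.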

PD is built from three 3×3 deconvolution layers with stride 2, taking features from the first residual block of the CNN and outputting a map of the same spatial size as the input image. During training, the discriminator learns to detect repaired pixels, while the generator is guided to produce shallow feature maps that fool the discriminator, thereby aligning source and target domain feature spaces.
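The PD described above might be sketched as follows; the intermediate channel widths and the assumed 1/8‑resolution input (implied by three stride‑2 upsamplings) are guesses:

```python
import torch
import torch.nn as nn

class PixelDiscriminator(nn.Module):
    """Sketch of PD: three stride-2 3x3 transposed convolutions that
    upsample shallow CNN features back to input resolution and emit a
    per-pixel 'repaired' logit. Channel widths are illustrative."""

    def __init__(self, in_channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 128, 3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),
            # Single output channel: per-pixel repaired/original logit.
            nn.ConvTranspose2d(64, 1, 3, stride=2,
                               padding=1, output_padding=1),
        )

    def forward(self, shallow_feat):
        return self.net(shallow_feat)  # apply sigmoid for probabilities
```

With only three layers, PD adds little training cost, yet its per‑pixel output provides a much finer alignment signal than an image‑level discriminator.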

Losses are computed using a binary mask indicating repaired pixels (1 for source domain, 0 for target domain). We apply one‑sided label smoothing only to non‑repaired regions (target value 0.2 instead of 0). The generator is penalized wherever the discriminator outputs 1, encouraging it to produce shallow feature maps whose repaired regions the discriminator cannot detect.
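A sketch of the two adversarial losses, assuming binary cross‑entropy over per‑pixel logits (the BCE form and the exact smoothing implementation are assumptions; only the 0.2 smoothing value and the mask convention come from the text above):

```python
import torch
import torch.nn.functional as F

def pd_losses(pd_logits, repaired_mask):
    """Sketch of the adversarial losses.
    pd_logits:     (B, 1, H, W) per-pixel discriminator logits.
    repaired_mask: (B, 1, H, W), 1 where a pixel was inpainted, 0 elsewhere.
    """
    # One-sided label smoothing: non-repaired pixels get target 0.2, not 0.
    target = torch.where(repaired_mask > 0,
                         torch.ones_like(repaired_mask),
                         torch.full_like(repaired_mask, 0.2))
    d_loss = F.binary_cross_entropy_with_logits(pd_logits, target)
    # Generator wants every pixel classified as "not repaired" (target 0),
    # so it is penalized wherever the discriminator outputs high values.
    g_adv_loss = F.binary_cross_entropy_with_logits(
        pd_logits, torch.zeros_like(pd_logits))
    return d_loss, g_adv_loss
```

In training, `d_loss` would update only PD while `g_adv_loss` backpropagates through the shared shallow features to update the generator, alternating as in a standard GAN.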

The overall generator loss also includes a reconstruction loss between generated and ground‑truth layouts, identical to that in [4].

3. Experiments

We compare PDA‑GAN with state‑of‑the‑art methods on both image‑related metrics (background complexity, occlusion of main subject, occlusion of product) and layout‑only metrics (element overlap, baseline usage, alignment). PDA‑GAN achieves the best performance on most metrics, especially the image‑related ones. Ablation studies further validate the effectiveness of the proposed architecture and loss design.

About Us

We are Alimama's content platform and intelligent creation team, focusing on AI‑driven creative production for images, videos, and copy, as well as multi‑channel short‑video ad delivery. We welcome collaborations and talent with backgrounds in CV, NLP, or recommendation systems. Contact: [email protected]

Tags: computer vision, GAN, layout generation, pixel-level discriminator, unsupervised domain adaptation