Can AI Fully Automate Advertising Poster Creation and Video Outpainting?

This article reviews four ACM MM 2023 papers that introduce AI‑driven systems for automatic advertising poster generation, multimodal text‑image creation, few‑shot style‑guided visual captioning, and hierarchical 3D diffusion models for video outpainting, detailing their methods, datasets, and experimental results.

Alimama Tech

AutoPoster: Automatic Content‑Aware Advertising Poster Generation

AutoPoster is a highly automated system that creates advertising posters from a single product image and description. It performs four key steps: image re‑targeting, layout generation, slogan generation and placement on the image, and visual attribute prediction. Two content‑aware generative models handle layout and slogan creation, while a multi‑task style attribute predictor (SAP) jointly estimates the poster's visual style attributes. The authors also release a dataset of over 76,000 poster images with annotated visual attributes. User studies and quantitative experiments demonstrate that AutoPoster produces aesthetically superior posters compared with existing methods.
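The four stages above can be sketched as a simple sequential pipeline. This is a hypothetical skeleton for illustration only: the stage functions are placeholder stubs standing in for the paper's trained models, and all names (`PosterDraft`, `retarget_image`, etc.) are invented here, not taken from the authors' code.

```python
from dataclasses import dataclass, field

@dataclass
class PosterDraft:
    image: str                                   # re-targeted product image (path placeholder)
    layout: list = field(default_factory=list)   # element bounding boxes
    slogans: list = field(default_factory=list)  # generated taglines
    style: dict = field(default_factory=dict)    # predicted visual attributes

def retarget_image(product_image: str, canvas: tuple) -> str:
    # Stage 1: crop/resize the product image to the poster canvas (stub).
    return f"{product_image}@{canvas[0]}x{canvas[1]}"

def generate_layout(image: str) -> list:
    # Stage 2: a content-aware layout model proposes element boxes (stub).
    return [{"type": "slogan", "box": (40, 40, 600, 120)}]

def generate_slogans(description: str, layout: list) -> list:
    # Stage 3: a slogan generator fills each text slot (stub: truncation).
    return [description[:20] for el in layout if el["type"] == "slogan"]

def predict_style(image: str, layout: list) -> dict:
    # Stage 4: the multi-task SAP jointly predicts style attributes (stub).
    return {"font": "sans", "color": "#ffffff"}

def autoposter(product_image: str, description: str,
               canvas: tuple = (750, 1000)) -> PosterDraft:
    # Run the four stages in order; each consumes the previous stage's output.
    img = retarget_image(product_image, canvas)
    layout = generate_layout(img)
    slogans = generate_slogans(description, layout)
    style = predict_style(img, layout)
    return PosterDraft(img, layout, slogans, style)
```

The point of the sketch is the data flow: layout conditions on the re‑targeted image, slogans condition on the layout, and style prediction sees both, which is what makes the system "content‑aware" end to end.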

TextPainter: Multimodal Text‑Image Generation for Poster Design

TextPainter tackles pixel‑level text image generation on poster backgrounds, requiring harmony between visual and textual semantics. The model uses global and local views of the background image as style cues and incorporates visual harmony to guide generation. A text‑comprehension module leverages a language model to provide both sentence‑level and word‑level style variations. The authors construct the PosterT80K dataset with roughly 80,000 annotated posters containing sentence‑level bounding boxes and text content. Experiments show that TextPainter produces text images that are both visually harmonious and semantically consistent with the surrounding poster.
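The global‑local conditioning idea can be made concrete with a little geometry: the model sees the whole background (global context) plus a patch around the target text box (local context). The helper below is an illustrative sketch assuming a margin‑based local crop; the function name, margin value, and box convention are assumptions, not TextPainter's actual implementation.

```python
def global_local_crops(img_size: tuple, text_box: tuple,
                       local_margin: int = 32) -> tuple:
    """Return (global_box, local_box) crop rectangles for style conditioning.

    img_size is (W, H); boxes are (x0, y0, x1, y1). The local crop expands
    the text box by a margin, clamped to the image bounds, so the model
    sees the background pixels the rendered text must harmonize with.
    """
    W, H = img_size
    x0, y0, x1, y1 = text_box
    local = (max(0, x0 - local_margin), max(0, y0 - local_margin),
             min(W, x1 + local_margin), min(H, y1 + local_margin))
    return (0, 0, W, H), local
```

The design intuition is that the global view sets the overall palette and mood, while the local patch determines contrast and legibility directly behind the glyphs.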

FS‑StyleCap: Few‑Shot Style‑Guided Visual Captioning

The FS‑StyleCap framework generates image or video captions in arbitrary styles using only a few style examples, without additional training. It consists of a conditional encoder‑decoder language model and a visual mapping module. Training proceeds in two stages: first, a style extractor is trained on unlabeled text corpora to learn style representations via denoising reconstruction, noisy back‑translation, and style classification; second, the extractor is frozen while the content extractor, generator, and visual projection module are trained on visual‑caption pairs to achieve cross‑modal alignment. During inference, users provide style exemplars, and the model produces captions that match the desired style. Evaluations show FS‑StyleCap outperforms strong baselines and rivals models trained on large labeled style corpora.
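The few‑shot inference step can be illustrated with a toy version of the style‑conditioning math: embed each user exemplar with the frozen style extractor, mean‑pool the embeddings into one style vector, and condition generation on it, with no gradient updates. Everything below is a stand‑in sketch: the hash‑style embedding is a deterministic toy, not the trained extractor, and the function names are invented for illustration.

```python
import math

def toy_style_embed(sentence: str, dim: int = 8) -> list:
    # Stand-in for the frozen style extractor: a bag-of-tokens vector
    # bucketed by character sum, then L2-normalized. Purely illustrative.
    vec = [0.0] * dim
    for tok in sentence.lower().split():
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def few_shot_style_vector(exemplars: list) -> list:
    # Mean-pool the exemplar embeddings into a single style condition.
    # Because the extractor is frozen, this needs no training at all.
    embeds = [toy_style_embed(s) for s in exemplars]
    return [sum(col) / len(embeds) for col in zip(*embeds)]
```

A generator conditioned on `few_shot_style_vector(user_examples)` then decodes visual content in the exemplars' style, which is what lets FS‑StyleCap handle arbitrary styles without a labeled style corpus.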

Hierarchical Masked 3D Diffusion Model for Video Outpainting

Video outpainting extends video borders to fit target aspect ratios, a common need in e‑commerce advertising. Challenges include maintaining temporal consistency across segmented inference and mitigating error accumulation in long videos. The proposed solution builds on a 2D diffusion prior (Stable Diffusion) to initialize a 3D video diffusion model for faster convergence. A guided‑frame strategy connects consecutive video segments, and a novel mask‑prediction method trains the 3D diffusion model. Global frame information is injected into cross‑attention layers to preserve temporal coherence. A coarse‑to‑fine inference pipeline reduces error buildup. The method has been deployed in Alibaba’s advertising platform for one‑click video size adjustment.
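Before any diffusion step runs, outpainting to a target aspect ratio is a geometry problem: how large must the padded canvas be, and which margins does the model need to fill? The helper below captures just that arithmetic; it is a self‑contained sketch of the setup stage, independent of the paper's 3D diffusion model, and the `(left, right, top, bottom)` margin convention is an assumption.

```python
def outpaint_geometry(w: int, h: int, target_ar: float) -> tuple:
    """Compute the padded canvas and per-side margins needed to extend
    a (w, h) frame to target_ar = width / height.

    Returns ((new_w, new_h), (left, right, top, bottom)); the margins
    are the regions the outpainting model must synthesize.
    """
    if w / h < target_ar:
        # Frame is too narrow: pad left and right.
        new_w, new_h = round(h * target_ar), h
        pad = new_w - w
        return (new_w, new_h), (pad // 2, pad - pad // 2, 0, 0)
    else:
        # Frame is too wide (or already matches): pad top and bottom.
        new_w, new_h = w, round(w / target_ar)
        pad = new_h - h
        return (new_w, new_h), (0, 0, pad // 2, pad - pad // 2)
```

Applied per frame, these margins define the mask region the diffusion model fills, e.g. converting a landscape clip into the vertical formats common in e‑commerce placements.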

Tags: Diffusion Models, video outpainting, multimodal generation, AI‑generated design, poster automation, style‑guided captioning
Written by

Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.
