Bilinear Residual Layers: Boosting Text‑Guided Image Editing
This article looks at multimodal representation learning for the two most common modalities, image and text. It introduces a Bilinear Residual Layer (BRL) that learns to fuse image and text features and outperforms traditional concatenation and FiLM fusion on text-guided image editing, then describes a fine-grained cross-modal attention approach to fashion synthesis, reporting state-of-the-art results on several benchmark datasets.
Abstract
Recent advances in deep learning have dramatically improved performance in vision, speech, and language tasks. Multimodal representation learning has entered the deep learning era, and many fusion strategies have been proposed. This work focuses on the two most common modalities—image and text—introducing a Bilinear Residual Layer (BRL) that learns a superior fusion of their features and achieves state‑of‑the‑art results on two multimodal tasks.
Text‑Guided Image Editing
Traditional image editing software such as Photoshop requires expert knowledge and is time‑consuming. Text‑guided image editing aims to modify an input image according to a natural‑language description, e.g., changing a garment’s color or style. Existing methods typically use a conditional GAN where the text embedding is concatenated with image features, which limits the expressive power of the fusion.
Existing Methods
Dong et al. use a conditional GAN in which the text embedding is concatenated with the image features.
Günel et al. replace concatenation with Feature-wise Linear Modulation (FiLM), which reduces the parameter count and improves representation learning; a sketch of FiLM conditioning follows below.
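For reference, FiLM conditioning is a per-channel affine transform of the image features whose scale and shift are predicted from the text embedding. Here is a minimal PyTorch sketch of the mechanism; the class layout and dimension names are illustrative, not code from the cited paper.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: the text embedding predicts a
    per-channel scale (gamma) and shift (beta) for the image features.
    A generic sketch of the mechanism, not the cited paper's code."""

    def __init__(self, txt_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(txt_dim, num_channels)
        self.to_beta = nn.Linear(txt_dim, num_channels)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W); txt_feat: (B, T)
        gamma = self.to_gamma(txt_feat)[:, :, None, None]
        beta = self.to_beta(txt_feat)[:, :, None, None]
        return gamma * img_feat + beta
```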
Our Contribution
We theoretically analyze the representational capacity of concatenation and FiLM, and generalize them to a bilinear form. Based on this analysis we propose the Bilinear Residual Layer, which automatically learns an optimal fusion of image and text features.
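To make the generalization concrete, the three fusion forms can be written in one notation, with $x$ the image feature vector, $y$ the text embedding, and all weights learned; the symbols below are ours, chosen for illustration:

```latex
% Concatenation followed by a linear layer: purely additive,
% no multiplicative image-text interaction.
z = W_x x + W_y y + b

% FiLM: a diagonal multiplicative interaction -- element-wise scaling
% of x by a text-dependent gate, plus a text-dependent shift.
z = (W_\gamma y) \odot x + W_\beta y

% Full bilinear form: each output unit z_k has its own interaction
% matrix W_k.
z_k = x^\top W_k \, y + b_k
```

Concatenation corresponds to dropping the interaction term entirely, while FiLM restricts each $W_k$ so that $z_k$ depends only on the $k$-th image coordinate; the full bilinear form subsumes both.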
Figure 1 illustrates the original conditioning form, while Figures 2 and 3 show the conditional-GAN and FiLM-based implementations, respectively.
Our Method
We model the fusion as a bilinear representation: the output tensor is obtained by a bilinear interaction between image and text feature vectors, followed by a residual connection. This formulation subsumes FiLM as a special case and provides greater expressive power.
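A full bilinear interaction is quadratic in the feature dimensions, so a low-rank factorization is the natural way to keep it tractable. Below is a minimal PyTorch sketch of one plausible bilinear residual fusion; the factorized form, the tanh nonlinearities, and all names and dimensions are our assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class BilinearResidualLayer(nn.Module):
    """Low-rank bilinear fusion of image and text features with a
    residual connection (an illustrative reconstruction)."""

    def __init__(self, img_dim: int, txt_dim: int, rank: int = 256):
        super().__init__()
        # Factorize the bilinear interaction x^T W y as P(Ux * Vy),
        # so parameter count grows with the rank, not img_dim * txt_dim.
        self.U = nn.Linear(img_dim, rank, bias=False)
        self.V = nn.Linear(txt_dim, rank, bias=False)
        self.P = nn.Linear(rank, img_dim)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        joint = torch.tanh(self.U(img_feat)) * torch.tanh(self.V(txt_feat))
        return img_feat + self.P(joint)  # residual connection

# Usage: fuse a 512-d image feature with a 300-d text embedding.
layer = BilinearResidualLayer(img_dim=512, txt_dim=300, rank=256)
fused = layer(torch.randn(8, 512), torch.randn(8, 300))  # -> (8, 512)
```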
Figure 4 shows the overall network architecture: a generator composed of an encoder (pre‑trained text encoder and VGG‑16 image feature extractor), four BRL fusion modules, and a decoder that upsamples the fused features to produce the edited image.
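Read as code, the pipeline of Figure 4 might look roughly like the sketch below, which reuses BilinearResidualLayer from the previous snippet and applies it at every spatial location of the VGG feature map; the decoder layout and channel widths are placeholders.

```python
import torch
import torch.nn as nn

class EditingGenerator(nn.Module):
    """Encoder features -> 4 BRL fusion blocks -> upsampling decoder,
    loosely following Figure 4. Channel widths and the decoder design
    are illustrative assumptions."""

    def __init__(self, txt_dim: int = 300, feat_dim: int = 512, rank: int = 256):
        super().__init__()
        self.fusions = nn.ModuleList(
            BilinearResidualLayer(feat_dim, txt_dim, rank) for _ in range(4)
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(feat_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(128, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W) from a pre-trained VGG-16; txt_feat: (B, T)
        b, c, h, w = img_feat.shape
        x = img_feat.permute(0, 2, 3, 1).reshape(b * h * w, c)
        t = txt_feat.repeat_interleave(h * w, dim=0)
        for fusion in self.fusions:
            x = fusion(x, t)  # inject text into every spatial location
        x = x.reshape(b, h, w, c).permute(0, 3, 1, 2)
        return self.decoder(x)
```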
Experiments
We evaluate our method on three datasets: Caltech-UCSD Birds-200, Oxford-102 Flowers, and Fashion Synthesis. Qualitative results (Figure 5) show that the baseline conditional-GAN and FiLM methods break down on complex images, while our BRL-based approach produces high-quality edits.
Quantitatively, we adopt the Inception Score (IS), a standard proxy for the quality and diversity of generated images. Table 6 shows that our method achieves the highest IS, with the best results when the rank of the bilinear factorization is set to 256.
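For context, the Inception Score feeds generated images through a pre-trained Inception-v3 classifier and rewards predictions that are both confident per image and diverse across images. The snippet below is the standard computation from pre-computed class logits, not necessarily the paper's exact evaluation script.

```python
import torch
import torch.nn.functional as F

def inception_score(logits: torch.Tensor, splits: int = 10) -> float:
    """IS = exp(E_x[ KL(p(y|x) || p(y)) ]), averaged over splits.
    logits: (N, 1000) Inception-v3 class logits for generated images."""
    probs = F.softmax(logits, dim=1)
    scores = []
    for chunk in probs.chunk(splits, dim=0):
        marginal = chunk.mean(dim=0, keepdim=True)            # p(y)
        kl = (chunk * (chunk.log() - marginal.log())).sum(1)  # per-image KL
        scores.append(kl.mean().exp().item())
    return float(torch.tensor(scores).mean())
```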
Fashion Image Generation
In fashion synthesis, the goal is to generate high‑resolution images that match a textual description. We adopt a fine‑grained cross‑modal attention mechanism that aligns each word with relevant image regions, enabling detailed control over garment attributes.
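One generic way to realize word-to-region attention is a scaled dot product between word embeddings and image region features, as sketched below; the competition model's exact scoring function is not spelled out here, so treat this as the general mechanism only.

```python
import torch
import torch.nn.functional as F

def word_region_attention(words: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
    """Attend each word over image regions.
    words:   (B, L, D) word embeddings
    regions: (B, R, D) region features
    returns: (B, L, D) per-word region context"""
    scores = torch.bmm(words, regions.transpose(1, 2))        # (B, L, R)
    attn = F.softmax(scores / words.size(-1) ** 0.5, dim=-1)  # over regions
    return torch.bmm(attn, regions)
```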
The generator operates in two stages: the first learns a coarse global structure from the sentence embedding, and the second refines local details using word-level semantics (sketched below). This approach won first place in the FashionGEN competition, achieving the best objective and subjective scores.
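Sketched in PyTorch, the coarse-to-fine split might look as follows; the 8x8 starting resolution, the single word-context vector, and all layer shapes are simplifying assumptions rather than the competition model.

```python
import torch
import torch.nn as nn

class CoarseStage(nn.Module):
    """Stage 1: noise + sentence embedding -> coarse feature map."""
    def __init__(self, sent_dim: int = 256, z_dim: int = 100, ch: int = 64):
        super().__init__()
        self.ch = ch
        self.fc = nn.Linear(sent_dim + z_dim, ch * 8 * 8)

    def forward(self, z: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        return self.fc(torch.cat([z, sent], dim=1)).view(-1, self.ch, 8, 8)

class RefineStage(nn.Module):
    """Stage 2: refine the coarse map with word-level context
    (summarized here as a single context vector for brevity)."""
    def __init__(self, ch: int = 64, word_dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(ch + word_dim, ch, kernel_size=1)
        self.to_rgb = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, coarse: torch.Tensor, word_ctx: torch.Tensor) -> torch.Tensor:
        b, _, h, w = coarse.shape
        ctx = word_ctx[:, :, None, None].expand(b, -1, h, w)
        return self.to_rgb(self.proj(torch.cat([coarse, ctx], dim=1)))
```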
Conclusion
We investigated multimodal fusion for image-text tasks. For text-guided image editing we introduced the Bilinear Residual Layer, which outperforms concatenation and FiLM; this work was published at ICASSP 2019. For fashion image synthesis we employed a fine-grained cross-modal attention strategy that won the FashionGEN challenge.