StyTr²: A Transformer‑Based Approach for Image Style Transfer

The paper proposes StyTr², a Transformer‑based image style transfer method that uses content‑aware positional encoding to preserve details and improve feature representation, achieving high‑quality stylization with better content structure and style patterns.

Kuaishou Tech

Image style transfer renders a content image in the style of a reference image. The task has been studied extensively in academia and is widely deployed in industry, for example in short-video applications.

Traditional texture-synthesis methods can produce vivid results, but they are computationally expensive because they model brush strokes and the painting process. Later CNN-based neural style transfer methods optimize an encoder-decoder pipeline. Because a convolution has a limited receptive field, these methods need deep networks to capture global context, and fine-grained details are lost along the way. The content representation also tends to leak: after repeated rounds of stylization, the original structure gradually disappears.

Transformer architectures address these limitations. Self-attention lets every layer capture global information, and modeling the relations between image patches helps preserve structural detail, which gives a stronger feature representation without the detail loss seen in deep CNNs.
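To make the "global information in every layer" point concrete, here is a minimal single-head self-attention sketch in NumPy. It is an illustration only, not the authors' implementation: the patch count, embedding size, and random projection matrices (`w_q`, `w_k`, `w_v`) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """tokens: (N, d) patch embeddings. Returns (output, attention weights)."""
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    # Every patch is scored against every other patch: an (N, N) matrix.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
num_patches, dim = 16, 8  # hypothetical: a 4x4 grid of patches
tokens = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

out, weights = self_attention(tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of `weights` is a distribution over all patches, so a single layer already mixes information from the whole image, whereas a convolution only sees a local window.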

Building on this, the authors propose StyTr², an image style transfer algorithm that mitigates the biased content representation of CNN-based methods. The model consists of a content Transformer encoder, a style Transformer encoder, and a Transformer decoder.

The content and style encoders capture long-range dependencies in the content and style images, respectively, which helps avoid detail loss. The decoder then transforms the content features into a stylized output infused with the characteristics of the style image.
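The data flow through the two encoders and the decoder can be sketched as follows. This is a heavily simplified illustration under stated assumptions, not the published architecture: learned projections, multiple heads, feed-forward layers, and normalization are all omitted, and the sketch assumes that content features act as queries while style features supply keys and values in the decoder. The function names `encode` and `decode` are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(queries, keys, values):
    # Scaled dot-product attention without learned projections.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores) @ values

def encode(tokens):
    # Stand-in for one encoder layer: self-attention plus a residual.
    return tokens + attention(tokens, tokens, tokens)

def decode(content_feat, style_feat):
    # Assumption: content tokens query the style tokens (cross-attention).
    return content_feat + attention(content_feat, style_feat, style_feat)

rng = np.random.default_rng(0)
content_tokens = rng.normal(size=(16, 8))  # hypothetical 16 content patches
style_tokens = rng.normal(size=(25, 8))    # hypothetical 25 style patches

stylized = decode(encode(content_tokens), encode(style_tokens))
print(stylized.shape)
```

The output keeps one token per content patch regardless of how many style patches there are, which matches the idea that the content layout is preserved while style information is mixed in.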

To better suit vision tasks, the paper introduces Content-Aware Positional Encoding (CAPE), which is scale-invariant and semantically aware. Traditional sinusoidal positional encoding was designed for the ordered words of a sentence and does not match the spatial semantics of image patches; CAPE addresses that mismatch.
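One way to picture a content-aware, scale-invariant positional encoding is "pool, project, resize": average the patch embeddings down to a fixed grid, map each cell to a positional vector, then interpolate back to the patch grid. The sketch below follows that idea as an assumption about the general mechanism; the grid size `n=4`, the projection `w_pos`, and all function names are hypothetical, and the paper should be consulted for the exact formulation.

```python
import numpy as np

def bin_edges(size, n):
    # Split `size` positions into n bins: floor for start, ceil for end.
    return [(i * size // n, -(-(i + 1) * size // n)) for i in range(n)]

def adaptive_avg_pool(feat, n):
    """feat: (H, W, C) -> (n, n, C) by averaging inside each bin."""
    h, w, c = feat.shape
    out = np.empty((n, n, c))
    for i, (r0, r1) in enumerate(bin_edges(h, n)):
        for j, (c0, c1) in enumerate(bin_edges(w, n)):
            out[i, j] = feat[r0:r1, c0:c1].mean(axis=(0, 1))
    return out

def bilinear_resize(grid, out_h, out_w):
    """grid: (n, m, C) -> (out_h, out_w, C), corner-aligned bilinear."""
    n, m, _ = grid.shape
    ys = np.linspace(0, n - 1, out_h)
    xs = np.linspace(0, m - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, n - 1)
    x1 = np.minimum(x0 + 1, m - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def content_aware_pos(patch_embed, w_pos, n=4):
    pooled = adaptive_avg_pool(patch_embed, n)  # fixed (n, n, C) grid
    pos = pooled @ w_pos                        # per-cell linear map
    h, w, _ = patch_embed.shape
    return bilinear_resize(pos, h, w)           # back to the patch grid

rng = np.random.default_rng(0)
small = rng.normal(size=(8, 8, 6))                      # 8x8 patch grid
large = np.repeat(np.repeat(small, 2, axis=0), 2, axis=1)  # same content, 16x16
w_pos = rng.normal(size=(6, 6)) * 0.1

pe_small = content_aware_pos(small, w_pos)
pe_large = content_aware_pos(large, w_pos)
print(pe_small.shape, pe_large.shape)
```

Because the pooled grid has a fixed size and is computed from the content, the same image at two resolutions produces the same `n x n` positional grid before resizing, which is the scale-invariance property described above. A sinusoidal encoding, by contrast, depends only on token index.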

Experimental comparisons show that StyTr² achieves higher‑quality stylization than state‑of‑the‑art approaches, preserving both content structure and rich style patterns, and remains stable across multiple stylization iterations.

The work advances the frontier of image stylization and demonstrates the effectiveness of Transformers in vision tasks; the paper is available at https://arxiv.org/abs/2105.14576 and the implementation at https://github.com/diyiiyiii/StyTR-2.
