ArtCrafter: A Controllable, Diverse Style Transfer Framework from Tsinghua

ArtCrafter introduces a novel text‑image style transfer framework that leverages attention‑based style extraction, text‑image alignment enhancement, and explicit modulation to achieve controllable, diverse, and high‑fidelity visual results, outperforming existing methods in both qualitative and quantitative evaluations.

AIWalker

Jan 13, 2025

Overview

Recent diffusion‑based text‑to‑image generation has made great strides in personalization, identity protection, object customization, and style transfer. However, existing approaches often rely on fine‑tuning entire models, which is costly. ArtCrafter proposes a lightweight, controllable style‑transfer framework that adds only a few trainable modules to a pretrained diffusion model.

Key Components

1. Attention‑Based Style Extraction

The framework introduces a multi‑layer style extractor built on perceiver attention and a feed‑forward network (FFN). Given a reference image, CLIP encodes it into an embedding x. A latent tensor z of shape (1, N, D) is initialized and duplicated across the batch dimension to shape (B, N, D). Perceiver attention (named P‑Attn) updates z by attending to both x and the duplicated latent variables, allowing the model to capture fine‑grained style cues such as texture, color, and composition.
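The extraction step described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the single attention head, the residual connections, the ReLU inside the FFN, the random weights, and all names (`perceiver_attention`, `ffn`, `w_q`, and so on) are assumptions made for the example, and a random tensor stands in for the CLIP embedding x.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_attention(x, z, w_q, w_k, w_v):
    """Single-head P-Attn step: latents z attend to both x and z."""
    q = z @ w_q                                   # queries from latents, (B, N, D)
    kv_in = np.concatenate([x, z], axis=1)        # keys/values from [x; z], (B, M+N, D)
    k = kv_in @ w_k
    v = kv_in @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (B, N, M+N)
    return softmax(scores, axis=-1) @ v           # (B, N, D)

def ffn(z, w1, w2):
    # Two-layer feed-forward network; ReLU is an assumed activation.
    return np.maximum(z @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
B, M, N, D = 2, 5, 4, 8                 # batch, image tokens, latent tokens, width
x = rng.normal(size=(B, M, D))          # stand-in for the CLIP image embedding
latent = rng.normal(size=(1, N, D))     # latent tensor of shape (1, N, D)
z = np.repeat(latent, B, axis=0)        # duplicated across the batch: (B, N, D)

w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
w1 = rng.normal(size=(D, 4 * D)) * 0.1
w2 = rng.normal(size=(4 * D, D)) * 0.1

# One extractor layer: attention then FFN, each with an assumed residual connection.
z = z + perceiver_attention(x, z, w_q, w_k, w_v)
z = z + ffn(z, w1, w2)
print(z.shape)
```

Stacking several such layers, each with its own weights, would give the multi-layer extractor; the number of latent tokens N fixes the size of the style representation regardless of how many image tokens CLIP produces.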

2. Text‑Image Alignment Enhancement

This module converts image and text embeddings into query, key, and value matrices via linear layers. Scaled dot‑product attention computes similarity between image queries and text keys, followed by a softmax to obtain normalized weights. The weighted sum of the value matrix yields a multimodal embedding that dynamically balances the importance of different textual prompts, enabling more precise alignment between text semantics and visual output.
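A rough NumPy sketch of this alignment step follows. It assumes the values are projected from the text embedding (the description only states that queries come from the image and keys from the text), uses a single head, and uses random matrices in place of learned linear layers; the function name `align_text_image` is hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def align_text_image(img, txt, w_q, w_k, w_v):
    """Scaled dot-product attention with image queries and text keys."""
    q = img @ w_q                                  # (B, M, D)
    k = txt @ w_k                                  # (B, T, D)
    v = txt @ w_v                                  # (B, T, D), assumed to come from text
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (B, M, T)
    weights = softmax(scores, axis=-1)             # each row sums to 1 over text tokens
    return weights @ v, weights                    # multimodal embedding, (B, M, D)

rng = np.random.default_rng(0)
B, M, T, D = 2, 4, 6, 8                 # batch, image tokens, text tokens, width
img = rng.normal(size=(B, M, D))        # stand-in for the image (style) embedding
txt = rng.normal(size=(B, T, D))        # stand-in for the text prompt embedding
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

multimodal, weights = align_text_image(img, txt, w_q, w_k, w_v)
print(multimodal.shape, weights.shape)
```

Each row of `weights` shows how strongly one image token draws on each text token, which is the sense in which the module balances the importance of different parts of the prompt.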

3. Explicit Modulation

To fuse the original image embedding with the enhanced multimodal embedding, ArtCrafter uses linear interpolation controlled by a constant α. The blended embedding is then concatenated with the text prompt embedding, forming the final conditioning vector for the diffusion model. This explicit modulation provides flexible control over the contribution of each modality, improving robustness and diversity.
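The fusion can be illustrated with a few lines of NumPy. The direction of the interpolation (α weighting the multimodal embedding), the concatenation along the token axis, and the order of the concatenated parts are assumptions for this sketch; the function name `explicit_modulation` is hypothetical.

```python
import numpy as np

def explicit_modulation(img_emb, mm_emb, txt_emb, alpha):
    """Blend image and multimodal embeddings, then append to the text embedding."""
    # Linear interpolation: alpha = 0 keeps the image embedding unchanged,
    # alpha = 1 uses only the enhanced multimodal embedding.
    blended = (1.0 - alpha) * img_emb + alpha * mm_emb      # (B, M, D)
    # Concatenate along the token axis to form the conditioning sequence.
    return np.concatenate([txt_emb, blended], axis=1)       # (B, T + M, D)

rng = np.random.default_rng(0)
B, M, T, D = 2, 4, 6, 8
img_emb = rng.normal(size=(B, M, D))    # stand-in for the original image embedding
mm_emb = rng.normal(size=(B, M, D))     # stand-in for the multimodal embedding
txt_emb = rng.normal(size=(B, T, D))    # stand-in for the text prompt embedding

cond = explicit_modulation(img_emb, mm_emb, txt_emb, alpha=0.5)
print(cond.shape)
```

Because α is a plain constant, it can be adjusted at inference time to trade the raw style signal against the text-aligned one without retraining.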

Experiments and Results

Qualitative Evaluation

ArtCrafter was compared against several state‑of‑the‑art text‑guided style transfer methods (StyleShot, Style Aligned, VSP, InstantStyle, IP‑Adapter, CSGO, StyleCrafter). For prompts such as “fashionable shoes,” ArtCrafter generated multiple coherent shoe designs aligned with the text, whereas competing methods showed inconsistencies or limited variety.

Quantitative Evaluation

Four metrics were used: CLIP‑Text, CLIP‑Image, DINO‑v2, and LPIPS. ArtCrafter achieved the best results on all four, indicating closer alignment with textual descriptions, better preservation of content details, and higher visual quality.

User Study

Professional artists rated generated images on text consistency, image consistency, and overall visual appeal. ArtCrafter received the highest average scores, confirming its practical advantage in creative workflows.

Ablation Study

Removing any of the three core modules—attention‑based style extraction, text‑image alignment enhancement, or explicit modulation—degraded performance in consistency, guidance strength, and diversity, demonstrating the necessity of each component.

Conclusion

ArtCrafter is a text‑image aligned style‑transfer framework that integrates attention‑driven style extraction, dynamic multimodal alignment, and explicit modulation. Comprehensive evaluations show that it delivers controllable, diverse, and high‑quality stylized images, advancing the state of text‑guided visual synthesis.
