While diffusion models have achieved remarkable progress in style transfer tasks, existing methods typically rely on fine-tuning or optimizing pre-trained models during inference, leading to high computational costs and challenges in balancing content preservation with style integration. To address these limitations, we introduce AttenST, a training-free attention-driven style transfer framework. Specifically, we propose a style-guided self-attention mechanism that conditions self-attention on the reference style by retaining the query of the content image while substituting its key and value with those from the style image, enabling effective style feature integration. To mitigate style information loss during inversion, we introduce a style-preserving inversion strategy that refines inversion accuracy through multiple resampling steps. Additionally, we propose a content-aware adaptive instance normalization, which integrates content statistics into the normalization process to optimize style fusion while mitigating content degradation. Furthermore, we introduce a dual-feature cross-attention mechanism to fuse content and style features, ensuring a harmonious synthesis of structural fidelity and stylistic expression. Extensive experiments demonstrate that AttenST outperforms existing methods, achieving state-of-the-art performance on style transfer benchmarks.
Pipeline of the AttenST: We start with the style-preserving inversion to invert the content image \( x^c_0 \) and style image \( x^s_0 \), obtaining their respective latent noise representations, denoted as \( x^c_T \) and \( x^s_T \). During this process, the query of the content image \( Q^{c} \) and the key-value pairs of the style image \( (K^{s}, V^{s}) \) are extracted. Subsequently, the proposed CA-AdaIN mechanism is employed to refine the latent representation of the content, producing \( x^{cs}_T \), which serves as the initial noise input for the UNet denoising process. Throughout denoising, the key and value derived from the self-attention of the style image are injected into the designated self-attention layers, facilitating the integration of style features, as sketched below. Simultaneously, the features of the style and content images are processed through the DF-CA and incorporated into the corresponding blocks via cross-attention. This strategy constrains the generation process, ensuring effective style integration while preserving the original content, thereby balancing style and content fidelity.
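To make the key/value substitution concrete, below is a minimal PyTorch-style sketch of one style-guided self-attention step. The function name, tensor layout, and the use of `scaled_dot_product_attention` are our own illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def style_guided_self_attention(q_c, k_s, v_s, num_heads=8):
    """Attend with the content query against the style key/value (illustrative sketch).

    q_c: (batch, tokens_c, dim) query projected from content features.
    k_s, v_s: (batch, tokens_s, dim) key/value projected from style features.
    """
    b, n_c, d = q_c.shape
    head_dim = d // num_heads
    # Split into heads: (batch, heads, tokens, head_dim).
    q = q_c.view(b, n_c, num_heads, head_dim).transpose(1, 2)
    k = k_s.view(b, -1, num_heads, head_dim).transpose(1, 2)
    v = v_s.view(b, -1, num_heads, head_dim).transpose(1, 2)
    # Standard scaled dot-product attention, except K/V come from the style image.
    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2).reshape(b, n_c, d)
```

In the pipeline, this replaces the self-attention computation only in the designated UNet layers; the remaining layers keep their original content-derived key/value pairs.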
Style-Preserving Inversion: We introduce a novel style-preserving inversion technique designed to enhance the accuracy of style transfer. The core innovation lies in refining the inversion direction during the inversion process. Unlike previous approaches that approximate the inversion direction by reversing the denoising trajectory, we start from a linear assumption to obtain an initial inversion estimate and then iteratively refine the inversion trajectory through multiple resampling steps. This ensures better style integration and content preservation, achieving a balanced fusion of both.
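A minimal sketch of what such resampling refinement could look like is given below, framed as a fixed-point correction of a single DDIM inversion step. The function names, the `eps_model` and `alpha_bar` interfaces, and the number of resampling iterations are assumptions for illustration; the exact update used in AttenST may differ.

```python
import torch

@torch.no_grad()
def refined_inversion_step(x_t, t, t_next, eps_model, alpha_bar, n_resample=3):
    """One inversion step x_t -> x_{t_next} (t_next > t) with resampling refinement.

    eps_model(x, t) predicts the noise; alpha_bar(t) returns the cumulative
    alpha-bar at timestep t as a tensor. Both interfaces are hypothetical.
    """
    a_t, a_next = alpha_bar(t), alpha_bar(t_next)

    def ddim_forward(eps):
        # Deterministic DDIM update rewritten in the inversion direction.
        x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        return a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps

    # Linear assumption: reuse the noise predicted at the current point.
    eps = eps_model(x_t, t)
    x_next = ddim_forward(eps)
    # Resampling: re-estimate the noise at the candidate point and redo the step.
    for _ in range(n_resample):
        eps = eps_model(x_next, t_next)
        x_next = ddim_forward(eps)
    return x_next
```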
Content-Aware AdaIN: Extensive research has demonstrated that precise initialization noise control substantially enhances generation quality. Building upon these insights, we implement adaptive instance normalization (AdaIN) to modulate the content image's latent noise representation by aligning its statistical properties (mean and variance) with style features, enabling early-stage style integration. Nevertheless, this approach tends to compromise content fidelity. To overcome this challenge, we introduce Content-Aware AdaIN (CA-AdaIN), an enhanced normalization technique that integrates content statistics during denoising initialization and employs dual modulation parameters (\( \alpha_s \) for style intensity and \( \alpha_c \) for content preservation) to achieve optimal balance between stylistic expression and content integrity.
$$ x_{T}^{cs} = \left(\alpha_s \sigma(x^s_T) + \alpha_c \sigma(x^c_T)\right) \left( \frac{x^c_T - \mu(x^c_T)}{\sigma(x^c_T)} \right) + \left(\alpha_s \mu(x^s_T) + \alpha_c \mu(x^c_T)\right) $$
where \( \alpha_c \) and \( \alpha_s \) are parameters controlling the strength of the content and style features, with \( \alpha_c + \alpha_s = 1 \). The introduction of the content weight \( \alpha_c \) enables CA-AdaIN to retain a portion of the content feature statistics during normalization. By adjusting the ratio of \( \alpha_c \) and \( \alpha_s \), CA-AdaIN dynamically balances the representation of content and style, effectively mitigating the loss of content information during the style transfer process.
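The equation above translates directly into code. Here is a minimal sketch of CA-AdaIN over the inverted latents, assuming per-channel statistics as in standard AdaIN; the default weights and the epsilon guard are illustrative choices, not values from the paper.

```python
import torch

def ca_adain(x_c, x_s, alpha_s=0.6, alpha_c=0.4, eps=1e-5):
    """Content-aware AdaIN on latent noise, assuming alpha_c + alpha_s = 1.

    x_c, x_s: (batch, channels, h, w) inverted latents of the content and style images.
    Statistics are computed per channel over the spatial dimensions.
    """
    dims = (2, 3)
    mu_c = x_c.mean(dims, keepdim=True)
    sigma_c = x_c.std(dims, keepdim=True) + eps
    mu_s = x_s.mean(dims, keepdim=True)
    sigma_s = x_s.std(dims, keepdim=True) + eps
    # Normalize the content latent, then re-scale and shift with mixed statistics.
    normalized = (x_c - mu_c) / sigma_c
    return (alpha_s * sigma_s + alpha_c * sigma_c) * normalized \
        + (alpha_s * mu_s + alpha_c * mu_c)
```

The result \( x^{cs}_T \) is then used as the initial noise for denoising, as described in the pipeline above.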
Dual-Feature Cross-Attention: We propose a dual-feature cross-attention (DF-CA) mechanism that simultaneously embeds both content and style features into the generation process through cross-attention. We employ pre-trained CLIP image encoders to extract semantically aligned feature embeddings from the content and style images. These embeddings capture the intrinsic semantic relationships and visual characteristics, providing robust representations of both content structure and stylistic elements. We then compute the cross-attention for the content and style features as
$$ \phi^c = \text{Attention}(Q, K^c, V^c) = \text{Softmax}\left(\frac{Q K^{cT}}{\sqrt{d}}\right) V^c $$
$$ \phi^s = \text{Attention}(Q, K^s, V^s) = \text{Softmax}\left(\frac{Q K^{sT}}{\sqrt{d}}\right) V^s $$
The extracted image features are then integrated into the UNet via decoupled cross-attention. The final cross-attention output is given by:
$$ \phi^{final} = \phi^{text} + \phi^{c} + \phi^{s} $$
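The decoupled formulation can be sketched as follows: each condition (text, content, style) keeps its own key/value projections, each attention output is computed independently against the same query, and the results are summed. The tensor layout and the omission of the projection layers are simplifications for illustration.

```python
import torch
import torch.nn.functional as F

def dual_feature_cross_attention(q, kv_text, kv_content, kv_style):
    """Decoupled cross-attention: one attention pass per condition, summed at the end.

    q: (batch, heads, tokens, head_dim) query from the UNet hidden states.
    Each kv_* is a (key, value) pair of shape (batch, heads, cond_tokens, head_dim);
    in practice the content/style pairs come from CLIP image embeddings passed
    through their own key/value projections (omitted here for brevity).
    """
    phi_text = F.scaled_dot_product_attention(q, *kv_text)
    phi_c = F.scaled_dot_product_attention(q, *kv_content)
    phi_s = F.scaled_dot_product_attention(q, *kv_style)
    # Final hidden state is the sum of the three attention outputs.
    return phi_text + phi_c + phi_s
```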
Additional comparison results between AttenST and traditional methods, highlighting the superiority of AttenST in terms of style transfer quality, content preservation, and overall generation performance.
We conduct comprehensive ablation studies to systematically evaluate the contribution of each component in our framework: (1) w/o SG-SA: removal of the style-guided self-attention mechanism; (2) w/o SPI: replacement of our style-preserving inversion with standard DDIM inversion; (3) w/o CA-AdaIN: substitution of our content-aware AdaIN with the original AdaIN; and (4) w/o DF-CA: elimination of the dual-feature cross-attention mechanism.
Additional Results of AttenST
@misc{huang2025attensttrainingfreeattentiondrivenstyle,
  title={AttenST: A Training-Free Attention-Driven Style Transfer Framework with Pre-Trained Diffusion Models},
  author={Bo Huang and Wenlun Xu and Qizhuo Han and Haodong Jing and Ying Li},
  year={2025},
  eprint={2503.07307},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.07307},
}