AQUA: Aligned Query Fusion for Reference-Unbiased and Temporally Consistent Video Motion Transfer

Anonymous submission




We present AQUA, a training-free motion transfer method for text-to-video diffusion models that enhances semantic reflection and temporal consistency. Direct query injection introduces unintended reference content, such as a mountain background, a backpack artifact (top), and a car (bottom), along with abrupt visual artifacts like a green object (top) and a purple region (bottom). In contrast, AQUA generates temporally coherent videos, faithfully aligned with the target prompt.




Additional results on the MTBench_HQ dataset further demonstrate that our method consistently performs effective motion transfer across a wide range of videos.



Abstract


Diffusion models have enabled motion transfer that reflects motion from a reference video while aligning with a given target prompt. Prior methods typically require model training or fine-tuning, limiting their flexibility and broad applicability. Recently, training-free methods have received attention as a more generalizable alternative. Among these, the approach utilizing self-attention query features from the reference video provides a simple yet effective strategy. However, direct use of reference query features often generates videos with unwanted visual details from the reference video and frame-level flickering. In this paper, we analyze query features in terms of motion transfer and propose a query fusion method that modulates the reference query features to address these problems. Specifically, our method, AQUA, adaptively fuses the reference and target query features to remain faithful to the target prompt while following the reference motion. Furthermore, AQUA employs multi-frame guidance to ensure temporal consistency. Extensive experiments demonstrate that our method outperforms existing methods without additional training or optimization.




Video




Analysis of Query Features in Self-Attention


Query analysis. Visualization of the reference video, PCA-projected query features, and their t-x slices. The query features maintain consistent spatial representations across time, aligned with the reference video.
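To make the visualization concrete, the snippet below is a minimal sketch of how query features could be captured from a self-attention layer and PCA-projected to three channels for display. It assumes a PyTorch backbone whose attention modules expose a `to_q` projection and queries of shape (frames, height × width, dim); these names and shapes are not specified on this page.

```python
# Minimal sketch (not the paper's code): capture self-attention queries and
# PCA-project them to 3 channels for visualization. The module name `to_q`
# and the (frames, h*w, dim) layout are assumptions about the backbone.
import torch
from sklearn.decomposition import PCA


def capture_queries(attn_module, hidden_states):
    """Apply the attention module's query projection to the hidden states."""
    with torch.no_grad():
        return attn_module.to_q(hidden_states)            # (frames, h*w, dim)


def queries_to_rgb(queries, height, width):
    """PCA-project per-token query features to one RGB map per frame."""
    frames, tokens, dim = queries.shape
    flat = queries.reshape(frames * tokens, dim).cpu().numpy()
    rgb = PCA(n_components=3).fit_transform(flat)          # (frames*tokens, 3)
    rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min() + 1e-8)
    return rgb.reshape(frames, height, width, 3)
```

A t-x slice like the one in the figure can then be obtained by fixing a single row of each per-frame map and stacking it over time.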




Query-induced issues. (a) Failure example. Reference bias is visible in the leaked mountain background and the backpack-like artifact (green box), while temporal inconsistency appears as an abrupt green object (yellow box). (b) Our result. The generated video aligns with the target prompt and exhibits improved temporal consistency.



Method


Overall framework of AQUA. (a) Given a reference video, we perform DDPM inversion to extract query features. The noise obtained from inversion is then used to generate a video via DAWN and RTA, which utilize the query features. (b) The reference and target queries are adaptively fused using a distribution-aware weight function D to guide motion transfer. (c) The multi-frame attention output A_i is incorporated to enhance temporal consistency by utilizing contextual information from the current frame (i), the previous frame (i − 1), and the first frame (1).
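The snippet below is a minimal sketch of the two ideas in panels (b) and (c). The exact form of the distribution-aware weight function D and of the multi-frame combination is not given on this page, so the statistics-based sigmoid weight and the fixed weighting over frames {1, i − 1, i} used here are illustrative assumptions rather than the paper's definition.

```python
# Illustrative sketch (not the paper's definition): adaptive query fusion
# (panel b) and multi-frame attention guidance (panel c).
import torch
import torch.nn.functional as F


def fuse_queries(q_ref, q_tgt):
    """Blend reference and target queries with a per-channel weight.

    The weight below is a simple stand-in for the distribution-aware
    function D: channels whose reference/target statistics differ strongly
    lean toward the target query so that reference bias is suppressed.
    """
    diff = (q_ref.mean(dim=1) - q_tgt.mean(dim=1)).abs()    # (B, dim)
    w = torch.sigmoid(-diff).unsqueeze(1)                   # (B, 1, dim)
    return w * q_ref + (1.0 - w) * q_tgt


def multi_frame_attention(q_i, keys, values, i, weights=(0.5, 0.3, 0.2)):
    """Weight attention outputs from the current, previous, and first frame."""
    out = 0.0
    for w, j in zip(weights, (i, max(i - 1, 0), 0)):
        out = out + w * F.scaled_dot_product_attention(q_i, keys[j], values[j])
    return out
```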



Qualitative Results

We provide qualitative results for our method and baseline models, where videos are generated from the reference video and guided by the target prompt. Our method produces videos that are both temporally consistent and semantically aligned with the given prompt.





Quantitative Comparison




We conduct our experiments on the MTBench_HQ dataset. The evaluation covers three aspects: (1) automatic metrics, i.e., algorithmically computed measures that assess prompt alignment, temporal consistency, and motion fidelity; (2) human evaluation, providing subjective assessment of the same criteria; and (3) efficiency metrics, including computation time and peak memory usage for generating a motion-transferred video from a reference video.
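As a rough illustration of the automatic metrics, the sketch below computes CLIP-based prompt alignment and adjacent-frame CLIP similarity as a temporal-consistency proxy. The page does not specify the metric implementations, so the model choice (`openai/clip-vit-base-patch32`) and these particular formulations are assumptions, not the exact evaluation protocol used here.

```python
# Sketch of common automatic metrics for motion transfer evaluation:
# CLIP frame-prompt similarity and adjacent-frame CLIP similarity.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def prompt_alignment(frames, prompt):
    """Mean CLIP similarity between each frame (PIL image) and the prompt."""
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()


@torch.no_grad()
def temporal_consistency(frames):
    """Mean cosine similarity of CLIP embeddings between adjacent frames."""
    inputs = processor(images=frames, return_tensors="pt")
    emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return (emb[:-1] * emb[1:]).sum(dim=-1).mean().item()
```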