VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping

CUHK MMLab, SenseTime Research, CPII under InnoHK

Introduction

Video face swapping is becoming increasingly popular across various applications, yet existing methods primarily focus on static images and struggle with video face swapping due to the challenges of maintaining temporal consistency and handling complex scenarios. In this paper, we present the first diffusion-based framework specifically designed for video face swapping. Our approach introduces a novel image-video hybrid training framework that leverages both abundant static image data and temporal video sequences, addressing the inherent limitations of video-only training. The framework incorporates a specially designed diffusion model coupled with a VidFaceVAE that effectively processes both types of data, better maintaining the temporal coherence of the generated videos.
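
To make the hybrid scheme concrete, the sketch below (PyTorch-style Python) treats a static image as a one-frame clip so that image batches and video batches can flow through the same spatio-temporal backbone. Names such as image_loader, video_loader, and hybrid_batch are illustrative assumptions, not part of the released code.

    import random

    def hybrid_batch(image_loader, video_loader, p_image=0.5):
        """Draw either a static-image batch or a video-clip batch for one training step."""
        if random.random() < p_image:
            frames = next(image_loader)   # (B, C, H, W) static images
            frames = frames.unsqueeze(1)  # -> (B, T=1, C, H, W): image as a one-frame clip
        else:
            frames = next(video_loader)   # (B, T, C, H, W) video sequences
        return frames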

To further disentangle identity and pose features, we construct the Attribute-Identity Disentanglement Triplet (AIDT) Dataset, in which each triplet consists of three face images: two sharing the same pose and two sharing the same identity. Enhanced with comprehensive occlusion augmentation, this dataset also improves robustness to occlusions. Additionally, we integrate 3D reconstruction techniques as input conditioning to our network to handle large pose variations.

Extensive experiments demonstrate that our framework achieves superior performance in identity preservation, temporal consistency, and visual quality compared to existing methods, while requiring fewer inference steps. Our approach effectively addresses key challenges in video face swapping, including temporal flickering, identity preservation, and robustness to occlusions and pose variations.

VividFace Framework

Overview of the proposed framework. During training, our framework randomly chooses static images or video sequences as the training data. In addition to the noise z_t, three other types of inputs are integrated to guide the generation process: (1) a face region mask, which controls the generation of facial imagery; (2) a 3D reconstructed face, which helps guide the pose and expression, especially in cases of large pose variations; and (3) masked source images, which supply background information. These inputs are processed through the Backbone Network, which performs the denoising operation. Within the Backbone Network, we employ cross-attention and temporal attention mechanisms. The temporal attention module ensures temporal continuity and consistency across frames. Our face encoder extracts identity and texture features from the target face, as well as pose and expression details from the source face, and injects these features through cross-attention to produce realistic and high-fidelity results.
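
As a rough illustration of how these inputs come together, the sketch below assembles one denoising step. The names backbone and face_encoder, the argument order, and the (B, T, C, H, W) shapes are assumptions for illustration, not the actual VividFace interface.

    import torch

    def denoise_step(backbone, face_encoder, z_t, t,
                     face_mask, face_3d, masked_source,
                     target_face, source_face):
        """One denoising step under the three spatial conditions described above."""
        # Concatenate the noisy latent z_t with the face-region mask, the
        # 3D-reconstructed face, and the masked source frames along channels.
        # All tensors are assumed to be (B, T, C, H, W); static images use T = 1.
        x = torch.cat([z_t, face_mask, face_3d, masked_source], dim=2)

        # Identity/texture features from the target face and pose/expression
        # features from the source face are injected via cross-attention, while
        # temporal attention inside the backbone keeps frames consistent.
        id_feats, attr_feats = face_encoder(target_face, source_face)
        context = torch.cat([id_feats, attr_feats], dim=1)

        # The backbone predicts the noise for timestep t.
        return backbone(x, t, context=context)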

VidFaceVAE


Overview of the proposed VidFaceVAE, which can encode and decode both image and video data within a single model. Modules designed specifically for video inputs are bypassed when the input is a static image.
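
A minimal sketch of this routing idea, assuming a PyTorch-style building block: the video-only (temporal) module is skipped whenever the input contains a single frame. The layer choices and kernel shapes are illustrative, not the actual VidFaceVAE layers.

    import torch.nn as nn

    class HybridVAEBlock(nn.Module):
        """Spatial processing for all inputs; temporal processing only for video clips."""
        def __init__(self, channels):
            super().__init__()
            # Spatial convolution applied to every frame (images and videos alike).
            self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
            # Video-only temporal convolution across the frame dimension.
            self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

        def forward(self, x):          # x: (B, C, T, H, W)
            x = self.spatial(x)
            if x.shape[2] > 1:         # video clip: apply the video-only module
                x = x + self.temporal(x)
            return x                   # static image (T = 1): temporal module bypassed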

AIDT Dataset

We construct triplets for our AIDT (Attribute-Identity Disentanglement Triplet) dataset as shown in the figure. For video facial data, we present only the target and decoupling faces, as the source faces can be derived from any other frame within the same video clip.


The AIDT dataset enables the face encoder to disentangle and fuse distinct facial components: identity and texture features from the source face, and attribute features from the decoupling face. This enhances generalization, especially when the source and target faces belong to different individuals during inference.
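
The sketch below shows how such a triplet could be assembled from a video clip following the description above. The helpers same_pose_lookup and occlude are hypothetical and passed in by the caller; where exactly occlusion augmentation is applied is likewise shown only for illustration.

    import random

    def sample_triplet(clip_frames, clip_id, same_pose_lookup, occlude):
        """Return (target, source, decoupling) faces for one AIDT-style training triplet."""
        t_idx = random.randrange(len(clip_frames))
        target = clip_frames[t_idx]

        # Source face: any other frame of the same clip, so it shares the target's identity.
        s_idx = random.choice([i for i in range(len(clip_frames)) if i != t_idx])
        source = clip_frames[s_idx]

        # Decoupling face: shares the target's pose but comes from a different identity.
        decoupling = same_pose_lookup(target, exclude_id=clip_id)

        # Occlusion augmentation (placement is illustrative) improves robustness to occluded faces.
        return target, occlude(source), occlude(decoupling)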

Demos


Original videos shown in the manuscript

Additional video demo comparison




Occlusion and large pose scenarios

Note: Other methods (such as FSGAN, DiffFace, DiffSwap, and REFace) tend to produce errors in these occlusion and large-pose scenarios and cannot reliably generate videos.


Occlusion ablation columns: Source | Target | Ours (Without Occlusion Augmentation) | Ours (With Occlusion Augmentation) | SimSwap
Large-pose ablation columns: Target | Ours (Without 3D Reconstruction) | Ours (With 3D Reconstruction) | SimSwap


Comparison with other methods

Comparison columns (two example groups): Source | Target | VividFace | DiffSwap | FSGAN | REFace | SimSwap

BibTeX


      @misc{shao2024vividfacediffusionbasedhybridframework,
        title={VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping},
        author={Hao Shao and Shulun Wang and Yang Zhou and Guanglu Song and Dailan He and Shuo Qin and Zhuofan Zong and Bingqi Ma and Yu Liu and Hongsheng Li},
        year={2024},
        eprint={2412.11279},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2412.11279}
      }