[Official] Exploring Temporal Coherence for More General Video Face Forgery Detection(ICCV 2021)

Exploring Temporal Coherence for More General Video Face Forgery Detection(FTCN)

Yinglin Zheng, Jianmin Bao, Dong Chen, Ming Zeng, Fang Wen

Accepted by ICCV 2021



Although current face manipulation techniques achieve impressive quality and controllability, they still struggle to generate temporally coherent face videos. In this work, we explore how to take full advantage of temporal coherence for video face forgery detection. To this end, we propose a novel end-to-end framework consisting of two major stages. The first stage is a fully temporal convolution network (FTCN). The key insight of FTCN is to reduce the spatial convolution kernel size to 1 while keeping the temporal convolution kernel size unchanged. Surprisingly, we find that this design helps the model extract temporal features and also improves its generalization capability. The second stage is a Temporal Transformer network, which aims to explore long-term temporal coherence. The proposed framework is general and flexible, and can be trained directly from scratch without any pre-trained models or external datasets. Extensive experiments show that our framework outperforms existing methods and remains effective when applied to detect new kinds of face forgery videos.
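To make the "spatial kernel size 1" idea concrete, here is a minimal pure-Python sketch of a (k, 1, 1) convolution on a single-channel video. It is an illustration only, not the authors' implementation: the function name temporal_conv, the nested-list video layout video[t][y][x], the zero padding and the fixed weights are all assumptions made for this example.

```python
def temporal_conv(video, weights):
    """Apply a (k, 1, 1) convolution to a single-channel video stored as video[t][y][x].

    Zero padding of k // 2 frames on each side keeps the frame count unchanged.
    Hypothetical helper for illustration; a real model would use learned weights
    and many channels.
    """
    k = len(weights)
    pad = k // 2
    T, H, W = len(video), len(video[0]), len(video[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for dt, w in enumerate(weights):
            src = t + dt - pad  # neighbouring frame index
            if 0 <= src < T:
                for y in range(H):
                    for x in range(W):
                        # Only the SAME pixel (y, x) in other frames contributes.
                        out[t][y][x] += w * video[src][y][x]
    return out


# Three frames, each 1x2 pixels; the two pixels live on different scales.
video = [[[1, 10]], [[2, 20]], [[3, 30]]]
result = temporal_conv(video, [1, 1, 1])
```

Because the spatial extent of the kernel is 1, the left pixel (values 1, 2, 3) is never mixed with the right pixel (values 10, 20, 30); each output value depends only on the same location in neighbouring frames.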


First, set up a Python environment with PyTorch 1.4.0 installed. Using the Docker image pytorch/pytorch:1.4-cuda10.1-cudnn7-devel is highly recommended, as the pretrained model and the code might be incompatible with later PyTorch versions.

Then install the dependencies for the experiment:

pip install -r requirements.txt


Inference Using Pretrained Model on Raw Video

Download the FTCN+TT model trained on FF++ from here and place it under the ./checkpoints folder.

python test_on_raw_video.py examples/shining.mp4 output

The output will be a video named shining.avi under the folder output.
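When several videos need to be processed, the command above can be wrapped in a small helper. The sketch below is an assumption-laden convenience, not part of the repository: the names build_command, expected_output and run_inference are hypothetical, and the rule "input stem plus .avi" is generalized from the single shining.mp4 example above.

```python
import subprocess
import sys
from pathlib import Path


def build_command(video, out_dir):
    """Command line matching the README usage: test_on_raw_video.py <video> <out_dir>."""
    return [sys.executable, "test_on_raw_video.py", str(video), str(out_dir)]


def expected_output(video, out_dir):
    """Assumed output location: <out_dir>/<input stem>.avi (inferred from the example)."""
    return Path(out_dir) / (Path(video).stem + ".avi")


def run_inference(video, out_dir="output"):
    """Run the script from the repository root and return the assumed output path."""
    subprocess.run(build_command(video, out_dir), check=True)
    return expected_output(video, out_dir)
```

Calling run_inference("examples/shining.mp4") from the repository root would run the same command as above; check=True makes a failed run raise instead of passing silently.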


  • Release inference code.
  • Release training code.
  • Code cleaning.


This code borrows heavily from SlowFast.

The face detection network comes from biubug6/Pytorch_Retinaface.

The face alignment network comes from cunjian/pytorch_face_landmark.


If you use this code for your research, please cite our paper.

  title={Exploring Temporal Coherence for More General Video Face Forgery Detection},
  author={Zheng, Yinglin and Bao, Jianmin and Chen, Dong and Zeng, Ming and Wen, Fang},
  journal={arXiv preprint arXiv:2108.06693},
  • Question about the structure of ResNet3D

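    Hello, in the code conv1 has kernel size [5,7,7] and stride [1,2,2], whereas in the paper the kernel size is [5,1,1] and the stride is [1,1,1]. Could you provide the complete model structure that was actually used in the paper?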

    temp_kernel[0][0] = [5]
    self.s1 = stem_helper.VideoModelStem(
        kernel=[temp_kernel[0][0] + [7, 7]],
        stride=[[1, 2, 2]],
        padding=[[temp_kernel[0][0][0] // 2, 3, 3]],
    opened by crywang 2
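To see what the two stem configurations in this question imply for the feature-map size, the standard convolution output-size formula can be evaluated directly. This is a sketch under stated assumptions: the input clip shape (32, 224, 224) is hypothetical, and the padding [2, 0, 0] for the [5,1,1] variant is assumed by analogy with the k // 2 pattern in the quoted code.

```python
def conv_out_size(n, kernel, stride, padding):
    """Output length of a convolution along one axis (no dilation)."""
    return (n + 2 * padding - kernel) // stride + 1


def conv3d_out_shape(shape, kernel, stride, padding):
    """Output (T, H, W) of a 3D convolution applied to an input of the given (T, H, W)."""
    return tuple(
        conv_out_size(n, k, s, p)
        for n, k, s, p in zip(shape, kernel, stride, padding)
    )


clip = (32, 224, 224)  # hypothetical (frames, height, width)

# Stem as quoted from the code: kernel [5,7,7], stride [1,2,2], padding [2,3,3].
code_stem = conv3d_out_shape(clip, [5, 7, 7], [1, 2, 2], [2, 3, 3])

# Stem as described in the paper: kernel [5,1,1], stride [1,1,1]; padding assumed.
paper_stem = conv3d_out_shape(clip, [5, 1, 1], [1, 1, 1], [2, 0, 0])
```

Under these assumptions the quoted code halves the spatial resolution in the stem, while the paper's variant would keep it unchanged, so the two are not interchangeable without adjusting later stages.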
  • Question about the model structure


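    According to the structure described in the paper, the kernel sizes of a, b and c in each ResBlock should be [1,1,1], [3,1,1] and [1,1,1], respectively. However, the structure printed by the code does not match the paper (see below). Perhaps I have misunderstood something; could you please clarify?

    res2: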

      (s2): ResStage(
        (pathway0_res0): ResBlock(
          (branch1): Conv3d(64, 256, kernel_size=(1, 1, 1), stride=[1, 1, 1], bias=False)
          (branch1_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (branch2): BottleneckTransform(
            (a): Conv3d(64, 64, kernel_size=[3, 1, 1], stride=[1, 1, 1], padding=[1, 0, 0], bias=False)
            (a_bn): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(64, 64, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(64, 256, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        (pathway0_res1): ResBlock(
          (branch2): BottleneckTransform(
            (a): Conv3d(256, 64, kernel_size=[3, 1, 1], stride=[1, 1, 1], padding=[1, 0, 0], bias=False)
            (a_bn): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(64, 64, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(64, 256, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        (pathway0_res2): ResBlock(
          (branch2): BottleneckTransform(
            (a): Conv3d(256, 64, kernel_size=[3, 1, 1], stride=[1, 1, 1], padding=[1, 0, 0], bias=False)
            (a_bn): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(64, 64, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(64, 256, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)


    (s3): ResStage(
        (pathway0_res0): ResBlock(
          (branch1): Conv3d(256, 512, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
          (branch1_bn): Sequential(
            (0): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (1): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
          (branch2): BottleneckTransform(
            (a): Conv3d(256, 128, kernel_size=[3, 1, 1], stride=[1, 1, 1], padding=[1, 0, 0], bias=False)
            (a_bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(128, 128, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): Sequential(
              (0): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (1): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(128, 512, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        (pathway0_res1): ResBlock(
          (branch2): BottleneckTransform(
            (a): Conv3d(512, 128, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (a_bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(128, 128, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(128, 512, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        (pathway0_res2): ResBlock(
          (branch2): BottleneckTransform(
            (a): Conv3d(512, 128, kernel_size=[3, 1, 1], stride=[1, 1, 1], padding=[1, 0, 0], bias=False)
            (a_bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(128, 128, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(128, 512, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        (pathway0_res3): ResBlock(
          (branch2): BottleneckTransform(
            (a): Conv3d(512, 128, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (a_bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(128, 128, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(128, 512, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)


    (s4): ResStage(
        (pathway0_res0): ResBlock(
          (branch1): Conv3d(512, 1024, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
          (branch1_bn): Sequential(
            (0): BatchNorm3d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (1): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
          (branch2): BottleneckTransform(
            (a): Conv3d(512, 256, kernel_size=[3, 1, 1], stride=[1, 1, 1], padding=[1, 0, 0], bias=False)
            (a_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(256, 256, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): Sequential(
              (0): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (1): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(256, 1024, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        (pathway0_res1): ResBlock(
          (branch2): BottleneckTransform(
            (a): Conv3d(1024, 256, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (a_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(256, 256, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(256, 1024, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        (pathway0_res2): ResBlock(
          (branch2): BottleneckTransform(
            (a): Conv3d(1024, 256, kernel_size=[3, 1, 1], stride=[1, 1, 1], padding=[1, 0, 0], bias=False)
            (a_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(256, 256, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(256, 1024, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        (pathway0_res3): ResBlock(
          (branch2): BottleneckTransform(
            (a): Conv3d(1024, 256, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (a_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(256, 256, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(256, 1024, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        (pathway0_res4): ResBlock(
          (branch2): BottleneckTransform(
            (a): Conv3d(1024, 256, kernel_size=[3, 1, 1], stride=[1, 1, 1], padding=[1, 0, 0], bias=False)
            (a_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(256, 256, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(256, 1024, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        (pathway0_res5): ResBlock(
          (branch2): BottleneckTransform(
            (a): Conv3d(1024, 256, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (a_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(256, 256, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(256, 1024, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)


    (s5): ResStage(
        (pathway0_res0): ResBlock(
          (branch1): Conv3d(1024, 2048, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
          (branch1_bn): Sequential(
            (0): BatchNorm3d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (1): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
          (branch2): BottleneckTransform(
            (a): Conv3d(1024, 512, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (a_bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(512, 512, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): Sequential(
              (0): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (1): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(512, 2048, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        (pathway0_res1): ResBlock(
          (branch2): BottleneckTransform(
            (a): Conv3d(2048, 512, kernel_size=[3, 1, 1], stride=[1, 1, 1], padding=[1, 0, 0], bias=False)
            (a_bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(512, 512, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(512, 2048, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        (pathway0_res2): ResBlock(
          (branch2): BottleneckTransform(
            (a): Conv3d(2048, 512, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (a_bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (a_relu): ReLU(inplace=True)
            (b): Conv3d(512, 512, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
            (b_bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (b_relu): ReLU(inplace=True)
            (c): Conv3d(512, 2048, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
            (c_bn): BatchNorm3d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
    opened by crywang 1
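A long printout like the one above is easier to compare against the paper once the kernel sizes are pulled out per layer. The following sketch does that with a regular expression; the names CONV_LINE and kernel_sizes are hypothetical, and the pattern only assumes the Conv3d line format visible in the printout (square or round brackets around kernel_size).

```python
import re

# Matches lines such as "(a): Conv3d(64, 64, kernel_size=[3, 1, 1], ..." from print(model).
CONV_LINE = re.compile(
    r"\((\w+)\): Conv3d\(\d+, \d+, kernel_size=[\[(](\d+), (\d+), (\d+)[\])]"
)


def kernel_sizes(printout):
    """Return (layer_name, (t, h, w)) for every Conv3d line found in a model printout."""
    found = []
    for line in printout.splitlines():
        match = CONV_LINE.search(line)
        if match:
            name = match.group(1)
            kernel = tuple(int(g) for g in match.groups()[1:])
            found.append((name, kernel))
    return found


# Three lines copied from the first bottleneck of the printout above.
sample = """
(a): Conv3d(64, 64, kernel_size=[3, 1, 1], stride=[1, 1, 1], padding=[1, 0, 0], bias=False)
(b): Conv3d(64, 64, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], bias=False)
(c): Conv3d(64, 256, kernel_size=[1, 1, 1], stride=[1, 1, 1], padding=[0, 0, 0], bias=False)
"""
summary = kernel_sizes(sample)
```

For this sample the temporal kernel of size 3 sits in layer a rather than layer b, which is the mismatch the question points out.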
  • Bug in test_on_raw_video


                l_post = len(post_module)
                post_module = post_module * (pad_length // l_post + 1)
                post_module = post_module[:pad_length]
                assert len(post_module) == pad_length
                pre_module = inner_index + inner_index[1:-1][::-1]
                l_pre = len(post_module)
                pre_module = pre_module * (pad_length // l_pre + 1)
                pre_module = pre_module[-pad_length:]
                assert len(pre_module) == pad_length

    The line

     l_pre = len(post_module)

    should be replaced by

     l_pre = len(pre_module)

    Is that right?

    opened by LOOKCC 0
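The effect of the proposed change can be checked on a toy input. The sketch below reproduces the quoted logic in a standalone function; it is not the repository code. The function name mirrored_padding and the use_fix switch are hypothetical, the initial value of post_module is an assumption (the quoted snippet starts after that line), and pad_length is assumed to be at least 1.

```python
def mirrored_padding(inner_index, pad_length, use_fix=True):
    """Build pre/post padding index lists by tiling a mirrored cycle of inner_index."""
    # Assumption: the quoted snippet is preceded by a line like this one.
    post_module = inner_index[1:-1][::-1] + inner_index
    l_post = len(post_module)
    post_module = post_module * (pad_length // l_post + 1)
    post_module = post_module[:pad_length]

    pre_module = inner_index + inner_index[1:-1][::-1]
    # The original line measures post_module, which is already cut to pad_length;
    # the proposed fix measures pre_module instead.
    l_pre = len(pre_module) if use_fix else len(post_module)
    pre_module = pre_module * (pad_length // l_pre + 1)
    pre_module = pre_module[-pad_length:]
    return pre_module, post_module


pre_fixed, post = mirrored_padding([0, 1, 2], 10, use_fix=True)
pre_orig, _ = mirrored_padding([0, 1, 2], 10, use_fix=False)
```

With the original line the mirrored cycle is always tiled exactly twice, so in this example pre_orig has only 8 entries and the length assertion from the quoted code would fail; with the proposed line it has the requested 10. The two variants agree whenever pad_length is no larger than twice the cycle length, which may be why the issue does not show up for typical inputs.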
