Official pytorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)

Overview

Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion

This repository contains a pytorch implementation of "Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion"

report

This codebase provides:

  • train code
  • test code
  • dataset
  • pretrained motion models

The main sections are:

  • Overview
  • Instalation
  • Download Data and Models
  • Training from Scratch
  • Testing with Pretrained Models

Please note, we will not be providing visualization code for the photorealistic rendering.

Overview:

We provide models and code to train and test our listener motion models.

See below for sections:

  • Installation: environment setup and installation for visualization
  • Download data and models: download annotations and pre-trained models
  • Training from scratch: scripts to get the training pipeline running from scratch
  • Testing with pretrianed models: scripts to test pretrained models and save output motion parameters

Installation:

Tested with cuda/9.0, cudnn/v7.0-cuda.9.0, and python 3.6.11

git clone [email protected]:evonneng/learning2listen.git

cd learning2listen/src/
conda create -n venv_l2l python=3.6
conda activate venv_l2l
pip install -r requirements.txt

export L2L_PATH=`pwd`

IMPORTANT: After installing torch, please make sure to modify the site-packages/torch/nn/modules/conv.py file by commenting out the self.padding_mode != 'zeros' line to allow for replicated padding for ConvTranspose1d as shown here.

Download Data and Models:

Download Data:

Please first download the dataset for the corresponding individual with google drive.

Make sure all downloaded .tar files are moved to the directory $L2L_PATH/data/ (e.g. $L2L_PATH/data/conan_data.tar)

Then run the following script.

./scripts/unpack_data.sh

The downloaded data will unpack into the following directory structure as viewed from $L2L_PATH:

|-- data/
    |-- conan/
        |-- test/
            |-- p0_list_faces_clean_deca.npy
            |-- p0_speak_audio_clean_deca.npy
            |-- p0_speak_faces_clean_deca.npy
            |-- p0_speak_files_clean_deca.npy
            |-- p1_list_faces_clean_deca.npy
            |-- p1_speak_audio_clean_deca.npy
            |-- p1_speak_faces_clean_deca.npy
            |-- p1_speak_files_clean_deca.npy
        |-- train/
    |-- devi2/
    |-- fallon/
    |-- kimmel/
    |-- stephen/
    |-- trevor/

Our dataset consists of 6 different youtube channels named accordingly. Please see comments in $L2L_PATH/scripts/download_models.sh for more details.

Data Format:

The data format is as described below:

We denote p0 as the person on the left side of the video, and p1 as the right side.

  • p0_list_faces_clean_deca.npy - face features (N x 64 x 184) for when p0 is listener
    • N sequences of length 64. Features of size 184, which includes the deca parameter set of expression (50D), pose (6D), and details (128D).
  • p0_speak_audio_clean_deca.npy - audio features (N x 256 x 128) for when p0 is speaking
    • N sequences of length 256. Features of size 128 mel features
  • p0_speak_faces_clean_deca.npy - face features (N x 64 x 184) for when p0 is speaking
  • p0_speak_files_clean_deca.npy - file names of the format (N x 64 x 3) for when p0 is speaking

Using Your Own Data:

To train and test on your own videos, please follow this process to convert your data into a compatible format:

(Optional) In our paper, we ran preprocessing to figure out when a each person is speaking or listening. We used this information to segment/chunk up our data. We then extracted speaker-only audio by removing listener back-channels.

  1. Run SyncNet on the video to determine who is speaking when.
  2. Then run Multi Sensory to obtain speaker's audio with all the listener backchannels removed.

For the main processing, we assuming there are 2 people in the video - one speaker and one listener...

  1. Run DECA to extract the facial expression and pose details of the two faces for each frame in the video. For each person combine the extracted features across the video into a (1 x T x (50+6)) matrix and save to p0_list_faces_clean_deca.npy or p0_speak_faces_clean_deca.npy files respectively. Note, in concatenating the features, expression comes first.

  2. Use librosa.feature.melspectrogram(...) to process the speaker's audio into a (1 x 4T x 128) feature. Save to p0_speak_audio_clean_deca.npy.

Download Model:

Please first download the models for the corresponding individual with google drive.

Make sure all downloaded .tar files are moved to the directory $L2L_PATH/models/ (e.g. $L2L_PATH/models/conan_models.tar)

Once downloaded, you can run the follow script to unpack all of the models.

cd $L2L_PATH
./scripts/unpack_models.sh

We provide person-specific models trained for Conan, Fallon, Stephen, and Trevor. Each person-specific model consists of 2 models: 1) VQ-VAE pre-trained codebook of motion in $L2L_PATH/vqgan/models/ and 2) predictor model for listener motion prediction in $L2L_PATH/models/. It is important that the models are paired correctly during test time.

In addition to the models, we also provide the corresponding config files that were used to define the models/listener training setup.

Please see comments in $L2L_PATH/scripts/unpack_models.sh for more details.

Training from Scratch:

Training a model from scratch follows a 2-step process.

  1. Train the VQ-VAE codebook of listener motion:
# --config: the config file associated with training the codebook
# Includes network setup information and listener information
# See provided config: configs/l2_32_smoothSS.json

cd $L2L_PATH/vqgan/
python train_vq_transformer.py --config <path_to_config_file>

Please note, during training of the codebook, it is normal for the loss to increase before decreasing. Typical training was ~2 days on 4 GPUs.

  1. After training of the VQ-VAE has converged, we can begin training the predictor model that uses this codebook.
# --config: the config file associated with training the predictor
# Includes network setup information and codebook information
# Note, you will have to update this config to point to the correct codebook.
# See provided config: configs/vq/delta_v6.json

cd $L2L_PATH
python -u train_vq_decoder.py --config <path_to_config_file>

Training the predictor model should have a much faster convergance. Typical training was ~half a day on 4 GPUs.

Testing with Pretrained Models:

# --config: the config file associated with training the predictor 
# --checkpoint: the path to the pretrained model
# --speaker: can specify which speaker you want to test on (conan, trevor, stephen, fallon, kimmel)

cd $L2L_PATH
python test_vq_decoder.py --config <path_to_config> --checkpoint <path_to_pretrained_model> --speaker <optional>

For our provided models and configs you can run:

python test_vq_decoder.py --config configs/vq/delta_v6.json --checkpoint models/delta_v6_er2er_best.pth --speaker 'conan'

Visualization

As part of responsible practices, we will not be releasing code for the photorealistic visualization pipeline. However, the raw 3D meshes can be rendered using the DECA renderer.

Potentially Coming Soon

  • Visualization of 3D meshes code from saved output
Official implementation of "Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision" ECCV2020

XDVioDet Official implementation of "Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision" ECCV2020. The proj

peng 56 Jun 13, 2022
Web service for facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation based on OpenFace 2.0

OpenGaze: Web Service for OpenFace Facial Behaviour Analysis Toolkit Overview OpenFace is a fantastic tool intended for computer vision and machine le

Sayom Shakib 3 May 23, 2022
OpenFace – a state-of-the art tool intended for facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation.

OpenFace 2.2.0: a facial behavior analysis toolkit Over the past few years, there has been an increased interest in automatic facial behavior analysis

Tadas Baltrusaitis 5.7k Jul 4, 2022
Automatically measure the facial Width-To-Height ratio and get facial analysis results provided by Microsoft Azure

fwhr-calc-website This project is to automatically measure the facial Width-To-Height ratio and get facial analysis results provided by Microsoft Azur

SoohyunPark 1 Feb 7, 2022
A Pytorch implementation of the multi agent deep deterministic policy gradients (MADDPG) algorithm

Multi-Agent-Deep-Deterministic-Policy-Gradients A Pytorch implementation of the multi agent deep deterministic policy gradients(MADDPG) algorithm This

Phil Tabor 117 Jun 21, 2022
[CVPR 2022] TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing

TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing (CVPR 2022) This repository provides the official PyTorch impleme

Billy XU 103 Jun 25, 2022
[CVPR 2022] CoTTA Code for our CVPR 2022 paper Continual Test-Time Domain Adaptation

CoTTA Code for our CVPR 2022 paper Continual Test-Time Domain Adaptation Prerequisite Please create and activate the following conda envrionment. To r

Qin Wang 42 Jun 24, 2022
Pop-Out Motion: 3D-Aware Image Deformation via Learning the Shape Laplacian (CVPR 2022)

Pop-Out Motion Pop-Out Motion: 3D-Aware Image Deformation via Learning the Shape Laplacian (CVPR 2022) Jihyun Lee*, Minhyuk Sung*, Hyunjin Kim, Tae-Ky

Jihyun Lee 82 Jun 24, 2022
A variational Bayesian method for similarity learning in non-rigid image registration (CVPR 2022)

A variational Bayesian method for similarity learning in non-rigid image registration We provide the source code and the trained models used in the re

daniel grzech 11 Jun 19, 2022
The official repo for OC-SORT: Observation-Centric SORT on video Multi-Object Tracking. OC-SORT is simple, online and robust to occlusion/non-linear motion.

OC-SORT Observation-Centric SORT (OC-SORT) is a pure motion-model-based multi-object tracker. It aims to improve tracking robustness in crowded scenes

Jinkun Cao 186 Jun 29, 2022
Pytorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Parallel Tacotron2 Pytorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Keon Lee 152 Jun 22, 2022
MINIROCKET: A Very Fast (Almost) Deterministic Transform for Time Series Classification

MINIROCKET: A Very Fast (Almost) Deterministic Transform for Time Series Classification

null 159 Jun 23, 2022
Code for Deterministic Neural Networks with Appropriate Inductive Biases Capture Epistemic and Aleatoric Uncertainty

Deep Deterministic Uncertainty This repository contains the code for Deterministic Neural Networks with Appropriate Inductive Biases Capture Epistemic

Jishnu Mukhoti 54 Jun 28, 2022
This tool converts a Nondeterministic Finite Automata (NFA) into a Deterministic Finite Automata (DFA)

This tool converts a Nondeterministic Finite Automata (NFA) into a Deterministic Finite Automata (DFA)

Quinn Herden 1 Feb 4, 2022
Official Pytorch implementation of "Learning to Estimate Robust 3D Human Mesh from In-the-Wild Crowded Scenes", CVPR 2022

Learning to Estimate Robust 3D Human Mesh from In-the-Wild Crowded Scenes / 3DCrowdNet News ?? 3DCrowdNet achieves the state-of-the-art accuracy on 3D

Hongsuk Choi 74 Jun 27, 2022
Code for "Modeling Indirect Illumination for Inverse Rendering", CVPR 2022

Modeling Indirect Illumination for Inverse Rendering Project Page | Paper | Data Preparation Set up the python environment conda create -n invrender p

ZJU3DV 77 Jun 29, 2022
Official PyTorch implementation of Retrieve in Style: Unsupervised Facial Feature Transfer and Retrieval.

Retrieve in Style: Unsupervised Facial Feature Transfer and Retrieval PyTorch This is the PyTorch implementation of Retrieve in Style: Unsupervised Fa

null 58 Jun 18, 2022
This repository contains the code for the paper "Hierarchical Motion Understanding via Motion Programs"

Hierarchical Motion Understanding via Motion Programs (CVPR 2021) This repository contains the official implementation of: Hierarchical Motion Underst

Sumith Kulal 31 Jun 20, 2022
Exploring Versatile Prior for Human Motion via Motion Frequency Guidance (3DV2021)

Exploring Versatile Prior for Human Motion via Motion Frequency Guidance This is the codebase for video-based human motion reconstruction in human-mot

Jiachen Xu 4 Jan 18, 2022