PyTorch implementation of Lip to Speech Synthesis with Visual Context Attentional GAN (NeurIPS2021)

Overview

Lip to Speech Synthesis with Visual Context Attentional GAN

This repository contains the PyTorch implementation of the following paper:

Lip to Speech Synthesis with Visual Context Attentional GAN
Minsu Kim, Joanna Hong, and Yong Man Ro
[Paper] [Demo Video]

Preparation

Requirements

  • python 3.7
  • pytorch 1.6 ~ 1.8
  • torchvision
  • torchaudio
  • ffmpeg
  • av
  • tensorboard
  • scikit-image
  • pillow
  • librosa
  • pystoi
  • pesq
  • scipy

Datasets

Download

GRID dataset (video normal) can be downloaded from the below link.

For data preprocessing, download the face landmark of GRID from the below link.

Preprocessing

After download the dataset, preprocess the dataset with the following scripts in ./preprocess.
It supposes the data directory is constructed as

Data_dir
├── subject
|   ├── video
|   |   └── xxx.mpg
  1. Extract frames
    Extract_frames.py extract images and audio from the video.
python Extract_frames.py --Grid_dir "Data dir of GRID_corpus" --Out_dir "Output dir of images and audio of GRID_corpus"
  1. Align faces and audio processing
    Preprocess.py aligns faces and generates videos, which enables cropping the video lip-centered during training.
python Preprocess.py \
--Data_dir "Data dir of extracted images and audio of GRID_corpus" \
--Landmark "Downloaded landmark dir of GRID" \
--Output_dir "Output dir of processed data"

Training the Model

The speaker setting (different subject) can be selected by subject argument. Please refer to below examples.
To train the model, run following command:

# Data Parallel training example using 4 GPUs for multi-speaker setting in GRID
python train.py \
--grid 'enter_the_processed_data_path' \
--checkpoint_dir 'enter_the_path_to_save' \
--batch_size 88 \
--epochs 500 \
--subject 'overlap' \
--eval_step 720 \
--dataparallel \
--gpu 0,1,2,3
# 1 GPU training example for GRID for unseen-speaker setting in GRID
python train.py \
--grid 'enter_the_processed_data_path' \
--checkpoint_dir 'enter_the_path_to_save' \
--batch_size 22 \
--epochs 500 \
--subject 'unseen' \
--eval_step 1000 \
--gpu 0

Descriptions of training parameters are as follows:

  • --grid: Dataset location (grid)
  • --checkpoint_dir: directory for saving checkpoints
  • --checkpoint : saved checkpoint where the training is resumed from
  • --batch_size: batch size
  • --epochs: number of epochs
  • --augmentations: whether performing augmentation
  • --dataparallel: Use DataParallel
  • --subject: different speaker settings, s# is speaker specific training, overlap for multi-speaker setting, unseen for unseen-speaker setting, four for four speaker training
  • --gpu: gpu number for training
  • --lr: learning rate
  • --eval_step: steps for performing evaluation
  • --window_size: number of frames to be used for training
  • Refer to train.py for the other training parameters

The evaluation during training is performed for a subset of the validation dataset due to the heavy time costs of waveform conversion (griffin-lim).
In order to evaluate the entire performance of the trained model run the test code (refer to "Testing the Model" section).

check the training logs

tensorboard --logdir='./runs/logs to watch' --host='ip address of the server'

The tensorboard shows the training and validation loss, evaluation metrics, generated mel-spectrogram, and audio

Testing the Model

To test the model, run following command:

# Dataparallel test example for multi-speaker setting in GRID
python test.py \
--grid 'enter_the_processed_data_path' \
--checkpoint 'enter_the_checkpoint_path' \
--batch_size 100 \
--subject 'overlap' \
--save_mel \
--save_wav \
--dataparallel \
--gpu 0,1

Descriptions of training parameters are as follows:

  • --grid: Dataset location (grid)
  • --checkpoint : saved checkpoint where the training is resumed from
  • --batch_size: batch size
  • --dataparallel: Use DataParallel
  • --subject: different speaker settings, s# is speaker specific training, overlap for multi-speaker setting, unseen for unseen-speaker setting, four for four speaker training
  • --save_mel: whether to save the 'mel_spectrogram' and 'spectrogram' in .npz format
  • --save_wav: whether to save the 'waveform' in .wav format
  • --gpu: gpu number for training
  • Refer to test.py for the other parameters

Test Automatic Speech Recognition (ASR) results of generated results: WER

Transcription (Ground-truth) of GRID dataset can be downloaded from the below link.

move to the ASR_model directory

cd ASR_model/GRID

To evaluate the WER, run following command:

# test example for multi-speaker setting in GRID
python test.py \
--data 'enter_the_generated_data_dir (mel or wav) (ex. ./../../test/spec_mel)' \
--gtpath 'enter_the_downloaded_transcription_path' \
--subject 'overlap' \
--gpu 0

Descriptions of training parameters are as follows:

  • --data: Data for evaluation (wav or mel(.npz))
  • --wav : whether the data is waveform or not
  • --batch_size: batch size
  • --subject: different speaker settings, s# is speaker specific training, overlap for multi-speaker setting, unseen for unseen-speaker setting, four for four speaker training
  • --gpu: gpu number for training
  • Refer to ./ASR_model/GRID/test.py for the other parameters

Pre-trained ASR model checkpoint

Below lists are the pre-trained ASR model to evaluate the generated speech.
WER shows the original performances of the model on ground-truth audio.

Setting WER
GRID (constrained-speaker) 0.83 %
GRID (multi-speaker) 1.67 %
GRID (unseen-speaker) 0.37 %
LRW 1.54 %

Put the checkpoints in ./ASR_model/GRID/data for GRID, and in ./ASR_model/LRW/data for LRW.

Citation

If you find this work useful in your research, please cite the paper:

@article{kim2021vcagan,
  title={Lip to Speech Synthesis with Visual Context Attentional GAN},
  author={Kim, Minsu and Hong, Joanna and Ro, Yong Man},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}
You might also like...
PyTorch Implementation of NCSOFT's FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis
PyTorch Implementation of NCSOFT's FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis

FastPitchFormant - PyTorch Implementation PyTorch Implementation of FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis. Qu

PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.
PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

VAENAR-TTS - PyTorch Implementation PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis
PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

Daft-Exprt - PyTorch Implementation PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis The

Official implementation of deep Gaussian process (DGP)-based multi-speaker speech synthesis with PyTorch.

Multi-speaker DGP This repository provides official implementation of deep Gaussian process (DGP)-based multi-speaker speech synthesis with PyTorch. O

PyTorch implementation of Tacotron speech synthesis model.

tacotron_pytorch PyTorch implementation of Tacotron speech synthesis model. Inspired from keithito/tacotron. Currently not as much good speech quality

PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models
PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

Deepvoice3_pytorch PyTorch implementation of convolutional networks-based text-to-speech synthesis models: arXiv:1710.07654: Deep Voice 3: Scaling Tex

A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis

WaveGlow A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis Quick Start: Install requirements: pip install

Official implementation of the paper Chunked Autoregressive GAN for Conditional Waveform Synthesis

Chunked Autoregressive GAN (CARGAN) Official implementation of the paper Chunked Autoregressive GAN for Conditional Waveform Synthesis [paper] [compan

This codebase is the official implementation of Test-Time Classifier Adjustment Module for Model-Agnostic Domain Generalization (NeurIPS2021, Spotlight)

Test-Time Classifier Adjustment Module for Model-Agnostic Domain Generalization This codebase is the official implementation of Test-Time Classifier A

Comments
  • Something wrong in train.py line 396

    Something wrong in train.py line 396 "wav_spec = val_data.inverse_spec(gs[:, :, :, :mel_len[0]].detach(), stft)"

    Traceback (most recent call last): File "train.py", line 479, in train_net(args) File "train.py", line 122, in train_net _ = validate(v_front, gen, post, fast_validate=True) File "train.py", line 396, in validate wav_spec = val_data.inverse_spec(gs[:, :, :, :mel_len[0]].detach(), stft) File "XXXX/vid_aud_grid.py", line 216, in inverse_spec wav = griffin_lim(spec.squeeze(1), stft.stft_fn, 60).squeeze(1) # B,L File "XXXX/audio_processing.py", line 63, in griffin_lim signal = stft_fn.inverse(magnitudes, angles).squeeze(1) File "XXXX/Visual-Context-Attentional-GAN/src/data/stft.py", line 108, in inverse padding=0) RuntimeError: Given transposed=1, weight of size [1026, 1, 1024], expected input[44, 642, 298] to have 1026 channels, but got 642 channels instead

    opened by Zhengyan-Sheng 1
Owner
null
Revisiting Discriminator in GAN Compression: A Generator-discriminator Cooperative Compression Scheme (NeurIPS2021)

Revisiting Discriminator in GAN Compression: A Generator-discriminator Cooperative Compression Scheme (NeurIPS2021) Overview Prerequisites Linux Pytho

Shaojie Li 34 Mar 31, 2022
Parallel and High-Fidelity Text-to-Lip Generation; AAAI 2022 ; Official code

Parallel and High-Fidelity Text-to-Lip Generation This repository is the official PyTorch implementation of our AAAI-2022 paper, in which we propose P

Zhying 77 Dec 21, 2022
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis Jungil Kong, Jaehyeon Kim, Jaekyoung Bae In our paper, we p

Rishikesh (ऋषिकेश) 31 Dec 8, 2022
ERISHA is a mulitilingual multispeaker expressive speech synthesis framework. It can transfer the expressivity to the speaker's voice for which no expressive speech corpus is available.

ERISHA: Multilingual Multispeaker Expressive Text-to-Speech Library ERISHA is a multilingual multispeaker expressive speech synthesis framework. It ca

Ajinkya Kulkarni 43 Nov 27, 2022
PyTorch implementation of "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context" (INTERSPEECH 2020)

ContextNet ContextNet has CNN-RNN-transducer architecture and features a fully convolutional encoder that incorporates global context information into

Sangchun Ha 24 Nov 24, 2022
CoaT: Co-Scale Conv-Attentional Image Transformers

CoaT: Co-Scale Conv-Attentional Image Transformers Introduction This repository contains the official code and pretrained models for CoaT: Co-Scale Co

mlpc-ucsd 191 Dec 3, 2022
Attentional Focus Modulates Automatic Finger‑tapping Movements

"Attentional Focus Modulates Automatic Finger‑tapping Movements", in Scientific Reports

Xingxun Jiang 1 Dec 2, 2021
Super Pix Adv - Offical implemention of Robust Superpixel-Guided Attentional Adversarial Attack (CVPR2020)

Super_Pix_Adv Offical implemention of Robust Superpixel-Guided Attentional Adver

DLight 8 Oct 26, 2022
Official PyTorch Implementation of GAN-Supervised Dense Visual Alignment

GAN-Supervised Dense Visual Alignment — Official PyTorch Implementation Paper | Project Page | Video This repo contains training, evaluation and visua

null 944 Jan 7, 2023
PyTorch Implementation of Google Brain's WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

WaveGrad2 - PyTorch Implementation PyTorch Implementation of Google Brain's WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis. Status (202

Keon Lee 59 Dec 6, 2022