Pseudo-Visual Speech Denoising

Overview

This repository contains the code for our paper Visual Speech Enhancement Without A Real Visual Stream, published at WACV 2021.
Authors: Sindhu Hegde*, K R Prajwal*, Rudrabha Mukhopadhyay*, Vinay Namboodiri, C.V. Jawahar


๐Ÿ“ Paper ๐Ÿ“‘ Project Page ๐Ÿ›  Demo Video ๐Ÿ—ƒ Real-World Test Set
Paper Website Video Real-World Test Set (coming soon)


Features

  • Denoise any real-world audio/video and obtain the clean speech.
  • Works in unconstrained settings for any speaker in any language.
  • Takes only audio as input, yet reaps the benefits of lip movements by generating a synthetic visual stream.
  • Complete training and inference code is available.

Prerequisites

  • Python 3.7.4 (Code has been tested with this version)
  • ffmpeg: sudo apt-get install ffmpeg
  • Install necessary packages using pip install -r requirements.txt
  • The pre-trained face-detection model should be downloaded to face_detection/detection/sfd/s3fd.pth
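A minimal setup sketch (assuming a Debian/Ubuntu machine; the s3fd.pth download link is available from the face_alignment/Wav2Lip repositories and is not reproduced here):

sudo apt-get install ffmpeg
pip install -r requirements.txt
# place the downloaded S3FD face-detection weights at the path the code expects
mkdir -p face_detection/detection/sfd
mv /path/to/downloaded/s3fd.pth face_detection/detection/sfd/s3fd.pth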

Getting the weights

Model | Description | Link to the model
Denoising model | Weights of the denoising model (needed for inference) | Link
Lipsync student | Weights of the student lipsync model used to generate the visual stream for noisy audio inputs (needed for inference) | Link
Wav2Lip teacher | Weights of the teacher lipsync model (only needed if you want to train the network from scratch) | Link

Denoising any audio/video using the pre-trained model (Inference)

You can denoise any noisy audio/video and obtain the clean speech of the target speaker using:

python inference.py --lipsync_student_model_path=<path_to_student_model> --checkpoint_path=<path_to_denoising_model> --input=<path_to_input_file>

The result is saved (by default) in results/result.mp4. The result directory, like several other options, can be specified via command-line arguments. The input can be any audio file (*.wav, *.mp3) or even a video file, from which the code automatically extracts the audio and generates the clean speech. Note that the noise should not be human speech: this work tackles denoising, not speaker separation.
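For example, a hypothetical run (the checkpoint and input paths below are placeholders for wherever you saved the downloaded weights and your noisy file):

python inference.py --lipsync_student_model_path=checkpoints/lipsync_student.pth --checkpoint_path=checkpoints/denoising.pt --input=samples/noisy_clip.mp4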

Generating only the lip-movements for any given noisy audio/video

The synthetic visual stream (lip-movements) can be generated for any noisy audio/video using:

cd lipsync
python inference.py --checkpoint_path=<path_to_student_model> --audio=<path_to_input_file>

The result is saved (by default) in results/result_voice.mp4. The result directory, like several other options, can be specified via command-line arguments. The input can be any audio file (*.wav, *.mp3) or even a video file, from which the code automatically extracts the audio and generates the visual stream.
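For example (paths are placeholders; note that the command is run from inside the lipsync folder):

cd lipsync
python inference.py --checkpoint_path=../checkpoints/lipsync_student.pth --audio=../samples/noisy_clip.wav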

Training

We illustrate the training process using the LRS3 and VGGSound datasets. Adapting to other datasets requires only small modifications to the code.

Preprocess the dataset

LRS3 train-val/pre-train dataset folder structure
data_root (we use both the train-val and pre-train sets of the LRS3 dataset in this work)
โ”œโ”€โ”€ list of folders
โ”‚   โ”œโ”€โ”€ five-digit numbered video IDs ending with (.mp4)
Preprocess the dataset
python preprocess.py --data_root=<dataset_path> --preprocessed_root=<path_to_save_preprocessed_data>

Additional options such as batch_size and the number of GPUs to use in parallel can also be set.
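A sample invocation (directory names are illustrative, and the batch-size/GPU flag names are assumed to follow the Wav2Lip-style preprocessing script this code builds on; check python preprocess.py --help for the exact names):

python preprocess.py --data_root=lrs3/pretrain --preprocessed_root=lrs3_preprocessed --batch_size=32 --ngpu=2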

Preprocessed LRS3 folder structure
preprocessed_root (lrs3_preprocessed)
โ”œโ”€โ”€ list of folders
│   ├── Folders with five-digit numbered video IDs
│   │   ├── *.jpg (extracted face crops from each frame)
VGGSound folder structure

We use the VGGSound dataset as the source of noise, which is mixed with clean speech from the LRS3 dataset. Download the audio files (*.wav) from here.

data_root (vgg_sound)
โ”œโ”€โ”€ *.wav (audio files)

Train!

There are two major steps: (i) train the student lipsync model, and (ii) train the denoising model.

Train the Student-Lipsync model

Navigate to the lipsync folder: cd lipsync

The lipsync model can be trained using:

python train_student.py --data_root_lrs3_pretrain=<path_to_preprocessed_LRS3_pretrain_set> --data_root_lrs3_train=<path_to_preprocessed_LRS3_train_set> --noise_data_root=<path_to_VGGSound_data> --wav2lip_checkpoint_path=<path_to_wav2lip_teacher_weights> --checkpoint_dir=<path_to_save_checkpoints>

Note: The pre-trained Wav2Lip teacher model must be downloaded (wav2lip weights) before training the student model.
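Putting it together, a hypothetical invocation with illustrative paths (run from inside the lipsync folder):

python train_student.py --data_root_lrs3_pretrain=lrs3_preprocessed/pretrain --data_root_lrs3_train=lrs3_preprocessed/trainval --noise_data_root=vgg_sound --wav2lip_checkpoint_path=checkpoints/wav2lip.pth --checkpoint_dir=checkpoints/lipsync_student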

Train the Denoising model!

Navigate to the main directory: cd ..

The denoising model can be trained using:

python train.py --data_root_lrs3_pretrain=<path_to_preprocessed_LRS3_pretrain_set> --data_root_lrs3_train=<path_to_preprocessed_LRS3_train_set> --noise_data_root=<path_to_VGGSound_data> --lipsync_student_model_path=<path_to_trained_student_model> --checkpoint_dir=<path_to_save_checkpoints>

Training can also be resumed from a saved checkpoint; see python train.py --help for details. Additional, less commonly used hyper-parameters can be set at the bottom of the audio/hparams.py file.
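For example, a hypothetical invocation with illustrative paths (the student checkpoint is the one produced in the previous step):

python train.py --data_root_lrs3_pretrain=lrs3_preprocessed/pretrain --data_root_lrs3_train=lrs3_preprocessed/trainval --noise_data_root=vgg_sound --lipsync_student_model_path=lipsync/checkpoints/lipsync_student/checkpoint.pth --checkpoint_dir=checkpoints/denoising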


Evaluation

To be updated soon!


Licence and Citation

The software is licensed under the MIT License. Please cite the following paper if you use this code:

@InProceedings{Hegde_2021_WACV,
    author    = {Hegde, Sindhu B. and Prajwal, K.R. and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
    title     = {Visual Speech Enhancement Without a Real Visual Stream},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2021},
    pages     = {1926-1935}
}

Acknowledgements

Parts of the lipsync code have been adapted from our Wav2Lip repository. The audio functions and parameters are taken from this TTS repository. The face-detection code has been taken from the face_alignment repository. We thank all the authors for releasing their code and models.

