Pseudo-Visual Speech Denoising
This code accompanies our paper "Visual Speech Enhancement Without A Real Visual Stream", published at WACV 2021.
Authors: Sindhu Hegde*, K R Prajwal*, Rudrabha Mukhopadhyay*, Vinay Namboodiri, C.V. Jawahar
Paper | Website | Video | Real-World Test Set (coming soon)
Features
- Denoise any real-world audio/video and obtain the clean speech.
- Works in unconstrained settings for any speaker in any language.
- Takes only audio as input, but leverages the benefits of lip movements by generating a synthetic visual stream.
- Complete training and inference code is available.
Prerequisites
- Python 3.7.4 (the code has been tested with this version)
- ffmpeg: `sudo apt-get install ffmpeg`
- Install the necessary packages using `pip install -r requirements.txt`
- The face detection pre-trained model should be downloaded to `face_detection/detection/sfd/s3fd.pth`
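Optionally, a quick sanity check of the setup (assumes a Unix-like shell, and that PyTorch is installed through the requirements file):
```
ffmpeg -version | head -n 1
python -c "import torch; print(torch.__version__)"
ls face_detection/detection/sfd/s3fd.pth
```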
Getting the weights
Model | Description | Link to the model |
---|---|---|
Denoising model | Weights of the denoising model (needed for inference) | Link |
Lipsync student | Weights of the student lipsync model to generate the visual stream for noisy audio inputs (needed for inference) | Link |
Wav2Lip teacher | Weights of the teacher lipsync model (only needed if you want to train the network from scratch) | Link |
Denoising any audio/video using the pre-trained model (Inference)
You can denoise any noisy audio/video and obtain the clean speech of the target speaker using:
```
python inference.py --lipsync_student_model_path=<path_to_lipsync_student_checkpoint> --checkpoint_path=<path_to_denoising_checkpoint> --input=<noisy_audio_or_video_file>
```
The result is saved (by default) in `results/result.mp4`. The result directory can be specified in the arguments, similar to several other available options. The input file can be any audio file (`*.wav`, `*.mp3`) or even a video file, from which the code will automatically extract the audio and generate the clean speech. Note that the noise should not be human speech, as this work only tackles the denoising task, not speaker separation.
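For example, a hypothetical invocation (the checkpoint and input file names below are placeholders, not files shipped with this repo):
```
python inference.py \
    --lipsync_student_model_path=checkpoints/lipsync_student.pth \
    --checkpoint_path=checkpoints/denoising_model.pt \
    --input=noisy_clip.mp4
```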
Generating only the lip-movements for any given noisy audio/video
The synthetic visual stream (lip-movements) can be generated for any noisy audio/video using:
```
cd lipsync
python inference.py --checkpoint_path=<path_to_lipsync_student_checkpoint> --audio=<noisy_audio_or_video_file>
```
The result is saved (by default) in `results/result_voice.mp4`. The result directory can be specified in the arguments, similar to several other available options. The input file can be any audio file (`*.wav`, `*.mp3`) or even a video file, from which the code will automatically extract the audio and generate the visual stream.
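For example, when run from the lipsync folder (the file names are placeholders):
```
python inference.py \
    --checkpoint_path=../checkpoints/lipsync_student.pth \
    --audio=../noisy_speech.wav
```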
Training
We illustrate the training process using the LRS3 and VGGSound datasets. Adapting the code to other datasets requires only small modifications.
Preprocess the dataset
LRS3 train-val/pre-train dataset folder structure
```
data_root (we use both the train-val and pre-train sets of the LRS3 dataset in this work)
├── list of folders
│   └── five-digit numbered video IDs ending with (.mp4)
```
Preprocess the dataset
```
python preprocess.py --data_root=<path_to_lrs3_data> --preprocessed_root=<path_to_save_preprocessed_data>
```
Additional options such as `batch_size` and the number of GPUs to use in parallel can also be set; an example follows below.
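For example (the exact flag names for the batch size and GPU count are assumptions; check `python preprocess.py --help` for the options actually exposed):
```
python preprocess.py --data_root=LRS3/pretrain --preprocessed_root=lrs3_preprocessed --batch_size=32 --ngpu=2
```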
Preprocessed LRS3 folder structure
```
preprocessed_root (lrs3_preprocessed)
├── list of folders
│   └── folders with five-digit numbered video IDs
│       └── *.jpg (extracted face crops from each frame)
```
VGGSound folder structure
We use the VGGSound dataset as the noise data, which is mixed with the clean speech from the LRS3 dataset. We download the audio files (`*.wav`) from here.
```
data_root (vgg_sound)
├── *.wav (audio files)
```
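For intuition, below is a minimal sketch of how a clean LRS3 utterance might be mixed with a VGGSound noise clip at a chosen SNR. This is illustrative only and is not the repo's training code; the function name, the use of librosa, and the default SNR value are assumptions.
```python
# Illustrative sketch only -- not the repo's mixing code.
# Assumes numpy and librosa are available and both inputs are readable audio files.
import numpy as np
import librosa

def mix_at_snr(clean_path, noise_path, snr_db=5.0, sr=16000):
    clean, _ = librosa.load(clean_path, sr=sr)   # clean LRS3 speech
    noise, _ = librosa.load(noise_path, sr=sr)   # VGGSound noise clip

    # Loop/trim the noise so it covers the clean utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    # Scale the noise so the mixture has the requested SNR (in dB).
    clean_power = np.mean(clean ** 2) + 1e-8
    noise_power = np.mean(noise ** 2) + 1e-8
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```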
Train!
There are two major steps: (i) Train the student-lipsync model, (ii) Train the Denoising model.
Train the Student-Lipsync model
Navigate to the lipsync folder: cd lipsync
The lipsync model can be trained using:
```
python train_student.py --data_root_lrs3_pretrain=<preprocessed_lrs3_pretrain> --data_root_lrs3_train=<preprocessed_lrs3_trainval> --noise_data_root=<vggsound_data_root> --wav2lip_checkpoint_path=<wav2lip_teacher_checkpoint> --checkpoint_dir=<folder_to_save_checkpoints>
```
Note: The pre-trained Wav2Lip teacher model must be downloaded (wav2lip weights) before training the student model.
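An example invocation with illustrative paths (the folder names below follow the preprocessing layout above, but are placeholders, not paths shipped with the repo):
```
python train_student.py \
    --data_root_lrs3_pretrain=lrs3_preprocessed/pretrain \
    --data_root_lrs3_train=lrs3_preprocessed/trainval \
    --noise_data_root=vgg_sound \
    --wav2lip_checkpoint_path=checkpoints/wav2lip_teacher.pth \
    --checkpoint_dir=checkpoints/lipsync_student
```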
Train the Denoising model!
Navigate to the main directory: cd ..
The denoising model can be trained using:
```
python train.py --data_root_lrs3_pretrain=<preprocessed_lrs3_pretrain> --data_root_lrs3_train=<preprocessed_lrs3_trainval> --noise_data_root=<vggsound_data_root> --lipsync_student_model_path=<lipsync_student_checkpoint> --checkpoint_dir=<folder_to_save_checkpoints>
```
Training can also be resumed from a saved checkpoint; see `python train.py --help` for details. Additional, less commonly used hyper-parameters can be set at the bottom of the `audio/hparams.py` file.
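An example invocation with illustrative paths (placeholders, as above):
```
python train.py \
    --data_root_lrs3_pretrain=lrs3_preprocessed/pretrain \
    --data_root_lrs3_train=lrs3_preprocessed/trainval \
    --noise_data_root=vgg_sound \
    --lipsync_student_model_path=checkpoints/lipsync_student/best_checkpoint.pth \
    --checkpoint_dir=checkpoints/denoising_model
```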
Evaluation
To be updated soon!
Licence and Citation
The software is licensed under the MIT License. Please cite the following paper if you have used this code:
```
@InProceedings{Hegde_2021_WACV,
    author    = {Hegde, Sindhu B. and Prajwal, K.R. and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
    title     = {Visual Speech Enhancement Without a Real Visual Stream},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2021},
    pages     = {1926-1935}
}
```
Acknowledgements
Parts of the lipsync code have been adapted from our Wav2Lip repository. The audio functions and parameters are taken from this TTS repository. We thank the authors for their wonderful code. The code for face detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models.