Unofficial PyTorch implementation of Google AI's VoiceFilter system

Overview

VoiceFilter

Note from Seung-won (2020.10.25)

Hi everyone! It's Seung-won from MINDs Lab, Inc. It's been a long time since I released this open-source project, and I didn't expect this repository to receive so much attention for such a long time. I would like to thank everyone for that attention, and also Mr. Quan Wang (the first author of the VoiceFilter paper) for referencing this project in his paper.

Actually, I did this project only 3 months after I started studying deep learning & speech separation, without a supervisor in the relevant field. Back then, I didn't know what power-law compression was, nor the correct way to validate/test the models. Now that I've spent more time on deep learning & speech since then (I also wrote a paper published at Interspeech 2020 😊), I can see some obvious mistakes that I made. Those issues were kindly raised by GitHub users; please refer to the Issues and Pull Requests for details. That being said, this repository can be quite unreliable, and I would like to remind everyone to use this code at their own risk (as specified in the LICENSE).

Unfortunately, I can't afford the extra time to revise this project or review the Issues / Pull Requests. Instead, I would like to offer some pointers to newer, more reliable resources:

  • VoiceFilter-Lite: This is a newer version of VoiceFilter presented at Interspeech 2020, also written by Mr. Quan Wang (and his colleagues at Google). I highly recommend checking out this paper, since it focuses on a more realistic situation where VoiceFilter is needed.
  • List of VoiceFilter implementations available on GitHub: In March 2019, this repository was the only available open-source implementation of VoiceFilter. However, much better implementations that deserve more attention have since become available on GitHub. Please check them out and choose the one that meets your needs.
  • PyTorch Lightning: Back in 2019, I could not find a great deep-learning project template for myself, so my colleagues and I used this project as a template for other new projects. For people who are searching for such a project template, I would strongly recommend PyTorch Lightning. Even though I put a lot of effort into developing my own template during 2019 (VoiceFilter -> RandWireNN -> MelNet -> MelGAN), I found PyTorch Lightning much better than my own template.

Thanks for reading, and I wish everyone good health during the global pandemic.

Best regards, Seung-won Park


Unofficial PyTorch implementation of Google AI's VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.

Result

  • Training took about 20 hours on an AWS p3.2xlarge (NVIDIA V100) instance.

Audio Sample

Metric

Median SDR          Paper   Ours
Before VoiceFilter  2.5     1.9
After VoiceFilter   12.6    10.2

  • SDR converged at about 10 dB, which is slightly lower than the paper's result.
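
For reference, median SDR can be computed with mir_eval, which appears in the Python environment quoted in the comments below. A minimal sketch, assuming 16 kHz mono wav files; the "test/*-estimate.wav" naming scheme for the separated outputs is hypothetical (only the '*-target.wav' pattern comes from this repository's config):

    # Minimal sketch: median SDR over a set of (target, estimate) wav pairs.
    # The "test/*-target.wav" / "*-estimate.wav" file names are hypothetical.
    import glob
    import numpy as np
    import librosa
    from mir_eval.separation import bss_eval_sources

    sdrs = []
    for target_path in glob.glob("test/*-target.wav"):
        estimate_path = target_path.replace("-target.wav", "-estimate.wav")
        target, _ = librosa.load(target_path, sr=16000, mono=True)
        estimate, _ = librosa.load(estimate_path, sr=16000, mono=True)
        length = min(len(target), len(estimate))
        # bss_eval_sources expects arrays of shape (n_sources, n_samples)
        sdr, sir, sar, _ = bss_eval_sources(target[np.newaxis, :length],
                                            estimate[np.newaxis, :length])
        sdrs.append(sdr[0])

    print("median SDR: %.2f dB" % np.median(sdrs))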

Dependencies

  1. Python and packages

    This code was tested on Python 3.6 with PyTorch 1.0.1. Other packages can be installed by:

    pip install -r requirements.txt
  2. Miscellaneous

    ffmpeg-normalize is used for resampling and normalizing wav files. See the README of ffmpeg-normalize for installation instructions.

Prepare Dataset

  1. Download LibriSpeech dataset

    To replicate the VoiceFilter paper, get the LibriSpeech dataset at http://www.openslr.org/12/. train-clean-100.tar.gz (6.3 GB) contains speech from 252 speakers, and train-clean-360.tar.gz (23 GB) contains 922 speakers. You may use either, but the more speakers you have in the dataset, the better VoiceFilter will perform.

  2. Resample & Normalize wav files

    First, unzip the tar.gz file into the desired folder:

    tar -xvzf train-clean-360.tar.gz

    Next, copy utils/normalize-resample.sh to the root directory of the unzipped data folder. Then:

    vim normalize-resample.sh # set "N" as your CPU core number.
    chmod a+x normalize-resample.sh
    ./normalize-resample.sh # this may take long
  3. Edit config.yaml

    cd config
    cp default.yaml config.yaml
    vim config.yaml
  4. Preprocess wav files

    In order to boost training speed, perform the STFT for each file before training by:

    python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]

    This will create 100,000 (train) + 1,000 (test) examples (about 160 GB).
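
The actual preprocessing lives in generator.py and utils/audio.py; the following is only a rough sketch of the STFT it performs, using the audio parameters from the default config (n_fft 1200, hop 160, win 400, 16 kHz). The file names follow the '*-mixed.wav' / '*-mixed.pt' pattern from the config and are illustrative:

    # Rough sketch of the STFT preprocessing step (illustrative only; the real
    # logic, including any dB scaling, lives in generator.py / utils/audio.py).
    import librosa
    import numpy as np
    import torch

    N_FFT, HOP, WIN, SR = 1200, 160, 400, 16000  # from config/default.yaml

    def wav_to_mag(path):
        wav, _ = librosa.load(path, sr=SR, mono=True)
        spec = librosa.stft(wav, n_fft=N_FFT, hop_length=HOP, win_length=WIN)
        return np.abs(spec).astype(np.float32)  # magnitude, shape (601, frames)

    mag = torch.from_numpy(wav_to_mag("000000-mixed.wav"))  # hypothetical file name
    torch.save(mag, "000000-mixed.pt")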

Train VoiceFilter

  1. Get pretrained model for speaker recognition system

    VoiceFilter uses a speaker recognition system (d-vector embeddings). Here, we provide a pretrained model for obtaining d-vector embeddings; a minimal usage sketch is shown after this list.

    This model was trained on the VoxCeleb2 dataset, where utterances are randomly cropped to a length of 70-90 frames. Tests were done with window 80 / hop 40 and showed an equal error rate of about 1%. The test data were selected from the first 8 speakers of the VoxCeleb1 test set, with 10 utterances randomly selected per speaker.

    Update: Evaluation on the VoxCeleb1 selected pairs showed 7.4% EER.

    The model can be downloaded at this GDrive link.

  2. Run

    After specifying train_dir and test_dir in config.yaml, run:

    python trainer.py -c [config yaml] -e [path of embedder pt file] -m [name]

    This will create chkpt/name and logs/name under the base directory (-b option, . by default).

  3. View tensorboardX

    tensorboard --logdir ./logs

  4. Resuming from checkpoint

    python trainer.py -c [config yaml] --checkpoint_path [chkpt/name/chkpt_{step}.pt] -e [path of embedder pt file] -m [name]
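
For orientation, below is a minimal sketch of how the pretrained embedder (step 1 above) can be used to obtain a d-vector from a reference utterance. The module path model.embedder, the class name SpeechEmbedder, the hop length, and the log-mel scaling are assumptions; the mel parameters follow the embedder section of the config quoted in the comments:

    # Minimal sketch of obtaining a d-vector from a reference utterance.
    # 'model.embedder' / 'SpeechEmbedder' are assumptions about the repository
    # layout; check the actual model code and audio utilities before use.
    import librosa
    import numpy as np
    import torch

    from model.embedder import SpeechEmbedder  # assumed location of the embedder class

    wav, _ = librosa.load("reference.wav", sr=16000, mono=True)       # placeholder path
    mel = librosa.feature.melspectrogram(y=wav, sr=16000, n_fft=512,
                                         hop_length=160, n_mels=40)   # hop length assumed
    mel = torch.from_numpy(np.log10(mel + 1e-6)).float()              # log-mel; exact scaling may differ

    embedder = SpeechEmbedder()
    embedder.load_state_dict(torch.load("embedder.pt", map_location="cpu"))
    embedder.eval()
    with torch.no_grad():
        dvec = embedder(mel)   # 256-dimensional speaker embedding (emb_dim in the config)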

Evaluate

python inference.py -c [config yaml] -e [path of embedder pt file] --checkpoint_path [path of chkpt pt file] -m [path of mixed wav file] -r [path of reference wav file] -o [output directory]
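
Inference roughly follows the VoiceFilter pipeline from the paper: compute the magnitude spectrogram of the mixed audio, condition the model on the reference speaker's d-vector, predict a soft mask, and reconstruct with the mixed signal's phase. Below is a simplified sketch; load_voicefilter and get_dvector are hypothetical helpers standing in for the repository's actual loading code, and any extra magnitude scaling used by the real implementation is omitted:

    # Simplified sketch of the VoiceFilter inference flow (not the exact API).
    import librosa
    import numpy as np
    import torch
    from scipy.io import wavfile

    N_FFT, HOP, WIN, SR = 1200, 160, 400, 16000   # audio section of config/default.yaml

    mixed, _ = librosa.load("mixed.wav", sr=SR, mono=True)            # placeholder paths
    spec = librosa.stft(mixed, n_fft=N_FFT, hop_length=HOP, win_length=WIN)
    mag = torch.from_numpy(np.abs(spec).astype(np.float32)).unsqueeze(0)   # (1, 601, frames)
    phase = np.angle(spec)

    model = load_voicefilter("chkpt.pt")          # hypothetical: load the trained VoiceFilter
    dvec = get_dvector("reference.wav")           # hypothetical: d-vector of the target speaker
    model.eval()
    with torch.no_grad():
        mask = model(mag, dvec.unsqueeze(0))      # soft mask over the magnitude spectrogram
        est_mag = (mag * mask).squeeze(0).numpy()

    # reconstruct with the mixed signal's phase
    est = librosa.istft(est_mag * np.exp(1j * phase), hop_length=HOP, win_length=WIN)
    wavfile.write("output.wav", SR, est.astype(np.float32))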

Possible improvements

  • Try a power-law compressed reconstruction error as the loss function, instead of MSE. (See #14 and the sketch below.)
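
A minimal sketch of such a loss in PyTorch, compressing both the estimated and target magnitudes with an exponent of 0.3 (the power value in the default config) before taking the squared error; treat the exact formulation as an assumption and see issue #14 for discussion:

    # Sketch of an MSE loss between power-law compressed magnitude spectrograms.
    # The compression exponent 0.3 matches the 'power' value in the default
    # config; whether to add further terms (e.g. on the complex spectrogram)
    # is left out here.
    import torch
    import torch.nn as nn

    class PowerLawCompressedLoss(nn.Module):
        def __init__(self, power=0.3, eps=1e-8):
            super().__init__()
            self.power = power
            self.eps = eps

        def forward(self, est_mag, target_mag):
            est_c = (est_mag.clamp(min=0) + self.eps) ** self.power
            target_c = (target_mag.clamp(min=0) + self.eps) ** self.power
            return torch.mean((est_c - target_c) ** 2)

    # usage: criterion = PowerLawCompressedLoss()
    #        loss = criterion(mask * mixed_mag, target_mag)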

Author

Seungwon Park at MINDsLab ([email protected], [email protected])

License

Apache License 2.0

This repository contains code adapted or copied from the following:

Comments
  • VoiceFilter realization problem

    Seungwon, hello.

    My name is Vladimir. I am a researcher at Speech Technology Center, St. Petersburg, Russia. Your implementation of the VoiceFilter algorithm (https://github.com/mindslab-ai/voicefilter) is very interesting to me and my colleagues. Unfortunately, we could not reproduce SDR dynamics like yours using your code with the standard settings in the default.yaml file. SDR converged to 4.5 dB after 200k iterations (see figure below), not to 10 dB after 65k as in your results. Could you tell us the training settings, as well as the neural network architecture, that you used to get your result?

    [figure: voicefilter_train_dynamics]

    Our Python environment:

    1. tqdm (ver. 4.32.1);
    2. numpy (ver. 1.16.3);
    3. torch (ver. 1.1.0);
    4. pyyaml (ver. 5.1);
    5. librosa (ver. 0.6.3);
    6. mir_eval (ver. 0.5);
    7. matplotlib (ver. 3.1.0);
    8. tensorboardX (ver. 1.7);
    9. ffmpeg (ver. 4.1.3);
    10. ffmpeg_normalize (1.14.0);
    11. python (ver. 3.6).

    We use four Nvidia GeForce GTX 1080 Ti GPUs to train one VoiceFilter model. The train-clean-100, train-clean-360 and train-other-500 subsets of LibriSpeech are used for training, and dev-clean is used for testing. We use the pretrained d-vector model to encode the target speaker.

    We used your default configuration file:

    audio:
      n_fft: 1200
      num_freq: 601
      sample_rate: 16000
      hop_length: 160
      win_length: 400
      min_level_db: -100.0
      ref_level_db: 20.0
      preemphasis: 0.97
      power: 0.30
    
    model:
      lstm_dim: 400
      fc1_dim: 600
      fc2_dim: 601
    
    data:
      train_dir: 'path/to/train/data'
      test_dir: 'path/to/test/data'
      audio_len: 3.0
    
    form:
      input: '*-norm.wav'
      dvec: '*-dvec.txt' 
      target:
        wav: '*-target.wav'
        mag: '*-target.pt'
      mixed:
        wav: '*-mixed.wav'
        mag: '*-mixed.pt'
    
    train:
      batch_size: 8
      num_workers: 16
      optimizer: 'adam'
      adam: 0.001
      adabound:
        initial: 0.001
        final: 0.05
      summary_interval: 1
      checkpoint_interval: 1000
    
    log:
      chkpt_dir: 'chkpt'
      log_dir: 'logs'
    
    embedder:
      num_mels: 40
      n_fft: 512
      emb_dim: 256
      lstm_hidden: 768
      lstm_layers: 3
      window: 80
      stride: 40
    
    

    The neural network architecture was standard and followed your implementation.

    opened by va-volokhov 8
  • Out of memory when Inferencing a single file.

    I tried to run the trained model on a single input, and it gave an OOM error on GCP with one Nvidia P100. RuntimeError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 0; 15.90 GiB total capacity; 14.37 GiB already allocated; 889.81 MiB free; 19.21 MiB cached). The mixed wav file (19 MB) was about 5 minutes long, and the reference file was 11 seconds. I don't know why it shows 14.37 GiB allocated when it's not even training. I tried restarting the instance, but it did not help. Can you please suggest a way to reduce the memory required during inference? Thank you!

    opened by BhaveshDevjani 5
  • embedder.pt with new dataset

    Hi, if I wanted to use another dataset of audio files for training and testing (not the one used here), how can I generate the embedder.pt that I have to pass when I run "trainer.py", or which one should I use? Thank you

    opened by Devid29 4
  • Question about normalize-resample.sh

    Thank you for your great work! I have a question from when I tried to run the project. I set 'N' to my CPU core count, then ran 'chmod a+x normalize-resample.sh'. However, after I ran './normalize-resample.sh', there was no output on the command line. Is this normal? Furthermore, what is the function of this script?

    Next, copy utils/normalize-resample.sh to the root directory of the unzipped data folder. Then:
    
    vim normalize-resample.sh # set "N" as your CPU core number.
    chmod a+x normalize-resample.sh
    ./normalize-resample.sh # this may take long
    

    Looking forward to your reply!

    opened by yxixi 4
  • Question about start point of SDR

    Dear @seungwonpark

    First of all, I would like to thank you for this great open-source project. I wanted to test your nice code, so I tried to train VoiceFilter.

    But I have a problem with SDR. The SDR graph on the voicefilter GitHub goes from 2 to 10 dB, but in my case the SDR goes from -0.8 to 1.2.

    I am trying to find the cause of the problem, but I cannot find it.

    Can you help me find the cause of the problem?

    I used the default yaml and generator.py (train-clean-100, train-clean-360, and dev-clean are used for training).

    Could you let me know what I can check?

    Thank you!

    opened by lycox1 4
  • Real-time inference

    Hi, I'd like to use this voice filtering in real time. Would it be possible to modify the inference code to run the model in real time on audio PCM data?

    opened by kyungjin-lee 4
  • Adding specific versions to librosa and numba

    Newer releases of those libraries have changed things, and the methods used in the code are no longer available. Setting specific versions in the requirements file avoids installing those newer versions.

    opened by johannesmols 3
  • install Pillow for PIL

    Yes, it gives an import error in my new venv.

    P.S. I don't know if I'm running it wrong, but it took 3 hours to run only 1.4k updates on my 1080 Ti, and the SDR is even negative.

    opened by stegben 2
  • the model implementation comprehension

    Hello, I'm a master's student at ITMO University in Saint Petersburg, Russia.

    Could you please explain what exactly this model implementation does? As I understand it (variant 1), it takes as input the mixed sound of person A's voice and person B's voice, plus the clean voice of person A — the same utterance as in the mixture — and tries to extract it from the mixture (which is really strange, because that would be useless). The paper (variant 2) says it should take the mixture and a clean voice of the target person that is NOT the same utterance as in the mixture — and that is the point.

    When I looked at the train/test sets made by the generator, I found that for every example of ******-mixed.wav there is a ******-target.wav with another voice (but not another phrase of the target person, as I thought it should be).

    Am I right? Or what's going on here?

    Waiting for your answer, thank you!

    opened by kurbobo 1
  • fix pytorch version to 1.0.1

    It's 1.0.2 on PyPI, with 1.1.0 on the way.

    P.S. It would be great if you plan to migrate from requirements.txt to pipenv for versioning. It would help you resolve versions and lock them automatically.

    opened by stegben 1
  • Question when training VoiceFilter

    Hi, it's me again :) Because of insufficient disk space, I skipped the following step:

    
    Preprocess wav files
    
    In order to boost training speed, perform the STFT for each file before training by:
    
    python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]
    
    This will create 100,000 (train) + 1,000 (test) examples (about 160 GB).
    
    

    Then I downloaded embedder.pt, train-clean-100.tar.gz and dev-clean.tar.gz. I unzipped the tar.gz files and put the unzipped folders in the root directory of voicefilter. I also specified train_dir and test_dir in config.yaml, such as:

      train_dir: '/home/../voicefilter/train/train-clean-100'
      test_dir: '/home/../voicefilter/dev/dev-clean'
    

    After that, when I enter this instruction:

    python trainer.py -c [config yaml] -e [path of embedder pt file] -m [name]

    An error pops up on the screen: AssertionError: no training file found

    I want to know at which step I made a mistake, or what configuration is missing. Thanks! REALLY looking forward to your reply!

    opened by yxixi 1
  • Cannot reproduce reported SDR & retrain the speaker embedding

    Hello, I have two questions about the implementation.

    1. I cannot reproduce the results reported in the README. I have trained for more than 400k steps on the LibriSpeech 360h + 100h clean datasets, using the embedder provided in this repo. However, I can only obtain a maximum SDR of 5.5.

    To obtain data from the LibriSpeech 360h + 100h sets, I generate the mixed audios for 360h and 100h separately, then merge them into another folder. Is this the right way when I want to use more data to train the VoiceFilter module?

    2. I got worse results when retraining the speaker embedding. I retrained the embedder using the following repo: Speaker verification on 3 datasets: Librispeech, VoxCeleb1, VoxCeleb2.

    Theoretically, I would expect the VoiceFilter module to benefit from an embedder trained on more data, but the results got even worse. Can you share how you trained this embedder?

    Thank you in advance!

    opened by nnbtam99 0
  • Question about the wav2spec function in utils/audio.py

    I have noticed that there is an amp_to_db function in wav2spec, which means the input of the model is a dB-scaled magnitude. Is this right? This is not mentioned in the related paper.

    opened by HieDean 0
  • question about ffmpeg-normalize

    Hi~ I ran into a problem when running ./normalize-resample.sh: it seems the wav files in /tmp did not exist. I tried to fix it but failed. Does anyone know where the problem is? I also ran the command "ffmpeg-normalize 1.wav -o 1-norm.wav" to test the normalize tool, and the same issue occurred. How can I make this normalize tool (ffmpeg-normalize) work?

    opened by somepayphone 0
  • Training setting problem

    Hi,

    Thank you for publishing your code! I am encountering a training problem. As an initial phase, I have tried to train on only 1000 samples from the LibriSpeech train-clean-100 dataset. I am using the default configuration as published in your VoiceFilter repo. The only difference is that I used a batch size of 6 due to memory limitations. Is it possible that the problem is related to the small batch size that I use?

    Another question relates to the generation of the training and test sets. I have noticed that there is an option to use a VAD when generating the training set, but by default it is not used. What is the best practice: to use the VAD or not?

    I appreciate your help!

    opened by Morank88 6
  • Can you get the initial mean SDR on LibriSpeech using Google's test list?

    Hi, seungwonpark,

    I was trying to use Google's published test list for LibriSpeech to reproduce their results. But I cannot even get their initial mean SDR (10.1 dB in their paper); I got only 1.5 dB. Have you tried their list, and did you get around 10.1 dB mean SDR before applying VoiceFilter?

    Thank you so much.

    opened by weedwind 8