Implementation of Kaneko et al.'s MaskCycleGAN-VC model for non-parallel voice conversion.

MaskCycleGAN-VC

Unofficial PyTorch implementation of Kaneko et al.'s MaskCycleGAN-VC (2021) for non-parallel voice conversion.

MaskCycleGAN-VC is a state-of-the-art CycleGAN-based method for non-parallel voice conversion. It is trained with a novel auxiliary task, filling in frames (FIF), in which a temporal mask is applied to the input Mel-spectrogram and the generator must fill in the masked frames. The paper reports marked improvements over prior models such as CycleGAN-VC (2018), CycleGAN-VC2 (2019), and CycleGAN-VC3 (2020).
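As a rough illustration of the FIF idea, the sketch below builds a temporal mask and zeroes out the corresponding frames of a toy Mel-spectrogram. It uses plain Python lists rather than the repository's tensors. The function names make_temporal_mask and apply_mask are hypothetical, and the uniform random choice of mask length and position is an assumption. The 64 frames and maximum mask length of 25 come from the --num_frames and --max_mask_len flags in the training command of this README.

```python
import random


def make_temporal_mask(num_frames, max_mask_len, rng=random):
    """Return one 0/1 flag per frame; 0 marks a frame hidden from the generator."""
    # Assumption: mask length and start position are drawn uniformly at random.
    mask_len = rng.randint(0, min(max_mask_len, num_frames))
    start = rng.randint(0, num_frames - mask_len)
    return [0 if start <= t < start + mask_len else 1 for t in range(num_frames)]


def apply_mask(mel, mask):
    """Zero out masked frames of a Mel-spectrogram stored as rows of per-frame values."""
    return [[value * m for value, m in zip(row, mask)] for row in mel]


rng = random.Random(0)            # fixed seed so the sketch is reproducible
num_mels, num_frames = 4, 64      # toy size; real spectrograms have more Mel bins
mel = [[1.0] * num_frames for _ in range(num_mels)]

mask = make_temporal_mask(num_frames, max_mask_len=25, rng=rng)
masked_mel = apply_mask(mel, mask)
print("masked frames:", mask.count(0))
```

During training the generator would receive the masked spectrogram together with the mask itself, so it knows which frames it is expected to reconstruct.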


Figure 1: MaskCycleGAN-VC Training




Figure 2: MaskCycleGAN-VC Generator Architecture




Figure 3: MaskCycleGAN-VC PatchGAN Discriminator Architecture



Paper: https://arxiv.org/pdf/2102.12841.pdf

Repository Contributors: Claire Pajot, Hikaru Hotta, Sofian Zalouk

Setup

Clone the repository.

git clone git@github.com:GANtastic3/MaskCycleGAN-VC.git
cd MaskCycleGAN-VC

Create the conda environment.

conda env create -f environment.yml
conda activate MaskCycleGAN-VC

VCC2018 Dataset

The authors of the paper used the dataset from the Spoke task of Voice Conversion Challenge 2018 (VCC2018). This is a dataset of non-parallel utterances from 6 male and 6 female speakers. Each speaker utters approximately 80 sentences.

Download the dataset from the command line.

wget --no-check-certificate -O vcc2018_database_training.zip "https://datashare.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_training.zip?sequence=2&isAllowed=y"
wget --no-check-certificate -O vcc2018_database_evaluation.zip "https://datashare.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_evaluation.zip?sequence=3&isAllowed=y"
wget --no-check-certificate -O vcc2018_database_reference.zip "https://datashare.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_reference.zip?sequence=5&isAllowed=y"

Unzip the dataset files.

mkdir vcc2018
apt-get install unzip
unzip vcc2018_database_training.zip -d vcc2018/
unzip vcc2018_database_evaluation.zip -d vcc2018/
unzip vcc2018_database_reference.zip -d vcc2018/
mv -v vcc2018/vcc2018_reference/* vcc2018/vcc2018_evaluation
rm -rf vcc2018/vcc2018_reference

Data Preprocessing

To expedite training, we preprocess the dataset by converting waveforms to Mel-spectrograms, then save the normalized spectrograms as pickle files (normalized.pickle) and the normalization statistics (mean, std) as npz files (_norm_stats.npz). We compute the spectrograms with a MelGAN vocoder so that voice-converted spectrograms can be decoded back to waveforms, letting you listen to your samples during inference.

python data_preprocessing/preprocess_vcc2018.py \
  --data_directory vcc2018/vcc2018_training \
  --preprocessed_data_directory vcc2018_preprocessed/vcc2018_training \
  --speaker_ids VCC2SF1 VCC2SF2 VCC2SF3 VCC2SF4 VCC2SM1 VCC2SM2 VCC2SM3 VCC2SM4 VCC2TF1 VCC2TF2 VCC2TM1 VCC2TM2
python data_preprocessing/preprocess_vcc2018.py \
  --data_directory vcc2018/vcc2018_evaluation \
  --preprocessed_data_directory vcc2018_preprocessed/vcc2018_evaluation \
  --speaker_ids VCC2SF1 VCC2SF2 VCC2SF3 VCC2SF4 VCC2SM1 VCC2SM2 VCC2SM3 VCC2SM4 VCC2TF1 VCC2TF2 VCC2TM1 VCC2TM2
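
The normalization step described above can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes a single global mean and standard deviation per speaker, works on a flat list of toy values instead of real spectrograms, and the function names are hypothetical.

```python
import statistics


def compute_norm_stats(values):
    """Return (mean, std) over all spectrogram values of one speaker."""
    # Assumption: one global mean/std per speaker, not one per Mel bin.
    return statistics.mean(values), statistics.pstdev(values)


def normalize(values, mean, std):
    """Shift and scale values to zero mean and unit variance."""
    return [(v - mean) / std for v in values]


def denormalize(values, mean, std):
    """Invert normalize() so a vocoder sees values in the original range."""
    return [v * std + mean for v in values]


# Toy log-Mel values standing in for one speaker's flattened spectrograms.
speaker_values = [-4.0, -2.0, 0.0, 2.0]
mean, std = compute_norm_stats(speaker_values)

normalized = normalize(speaker_values, mean, std)
restored = denormalize(normalized, mean, std)
print("round trip ok:", all(abs(a - b) < 1e-9 for a, b in zip(restored, speaker_values)))
```

When a converted spectrogram is decoded, the statistics passed to the inverse step should be those of the target speaker, which is the point raised in the spectrogram-scaling issue listed further down.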

Training

Train MaskCycleGAN-VC to convert between two speakers, set with --speaker_A_id and --speaker_B_id. The command below uses VCC2SF3 and VCC2TF1, the pair used in the Testing section, as an example; substitute your own speaker IDs there and in --name. You should start to get excellent results after only several hundred epochs.

python -W ignore::UserWarning -m mask_cyclegan_vc.train \
    --name mask_cyclegan_vc_VCC2SF3_VCC2TF1 \
    --seed 0 \
    --save_dir results/ \
    --preprocessed_data_dir vcc2018_preprocessed/vcc2018_training/ \
    --speaker_A_id VCC2SF3 \
    --speaker_B_id VCC2TF1 \
    --epochs_per_save 100 \
    --epochs_per_plot 10 \
    --num_epochs 6172 \
    --batch_size 1 \
    --lr 5e-4 \
    --decay_after 1e4 \
    --sample_rate 22050 \
    --num_frames 64 \
    --max_mask_len 25 \
    --gpu_ids 0

To continue training from a previous checkpoint in the case that training is suspended, add the argument --continue_train while keeping all others the same. The model saver class will automatically load the most recently saved checkpoint and resume training.
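
To make the resume behaviour concrete, here is a small sketch of how a saver could pick the most recent checkpoint. The file naming scheme (a zero-padded epoch followed by the model name) and both function names are hypothetical and are not taken from model_saver.py. The point is only that parsing the epoch as an integer keeps the lookup correct once training passes 999 epochs.

```python
import re

# Hypothetical naming scheme: "<zero-padded epoch>_<model name>.pth.tar".
CKPT_PATTERN = re.compile(r"^(\d+)_(\w+)\.pth\.tar$")


def checkpoint_name(epoch, model_name, digits=5):
    """Build a checkpoint file name with the epoch zero-padded to `digits`."""
    return f"{epoch:0{digits}d}_{model_name}.pth.tar"


def latest_epoch(filenames, model_name):
    """Return the highest saved epoch for `model_name`, or None if none exist."""
    epochs = []
    for name in filenames:
        match = CKPT_PATTERN.match(name)
        if match and match.group(2) == model_name:
            epochs.append(int(match.group(1)))  # compare as integers, not strings
    return max(epochs) if epochs else None


files = [checkpoint_name(e, "generator_A2B") for e in (100, 900, 1200)]
files.append("notes.txt")  # unrelated files are ignored
print(latest_epoch(files, "generator_A2B"))
```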

Launch Tensorboard in a separate terminal window.

tensorboard --logdir results/logs

Testing

Test your trained MaskCycleGAN-VC by converting between speakers A and B (set with --speaker_A_id and --speaker_B_id) on the evaluation dataset. Your converted .wav files are stored under results/, in the converted_audio directory of your experiment.

python -W ignore::UserWarning -m mask_cyclegan_vc.test \
    --name mask_cyclegan_vc_VCC2SF3_VCC2TF1 \
    --save_dir results/ \
    --preprocessed_data_dir vcc2018_preprocessed/vcc2018_evaluation \
    --gpu_ids 0 \
    --speaker_A_id VCC2SF3 \
    --speaker_B_id VCC2TF1 \
    --ckpt_dir /data1/cycleGAN_VC3/mask_cyclegan_vc_VCC2SF3_VCC2TF1/ckpts \
    --load_epoch 500 \
    --model_name generator_A2B

Toggle between A->B and B->A conversion by setting --model_name as either generator_A2B or generator_B2A.

Select the epoch to load your model from by setting --load_epoch.

Code Organization

├── README.md                       <- Top-level README.
├── environment.yml                 <- Conda environment
├── .gitignore
├── LICENSE
│
├── args
│   ├── base_arg_parser             <- arg parser
│   ├── train_arg_parser            <- arg parser for training (inherits base_arg_parser)
│   ├── cycleGAN_train_arg_parser   <- arg parser for training MaskCycleGAN-VC (inherits train_arg_parser)
│   ├── cycleGAN_test_arg_parser    <- arg parser for testing MaskCycleGAN-VC (inherits base_arg_parser)
│
├── bash_scripts
│   ├── mask_cyclegan_train.sh      <- sample script to train MaskCycleGAN-VC
│   ├── mask_cyclegan_test.sh       <- sample script to test MaskCycleGAN-VC
│
├── data_preprocessing
│   ├── preprocess_vcc2018.py       <- preprocess VCC2018 dataset
│
├── dataset
│   ├── vc_dataset.py               <- torch dataset class for MaskCycleGAN-VC
│
├── logger
│   ├── base_logger.sh              <- logging to Tensorboard
│   ├── train_logger.sh             <- logging to Tensorboard during training (inherits base_logger)
│
├── saver
│   ├── model_saver.py              <- saves and loads models
│
├── mask_cyclegan_vc
│   ├── model.py                    <- defines MaskCycleGAN-VC model architecture
│   ├── train.py                    <- training script for MaskCycleGAN-VC
│   ├── test.py                     <- testing script for MaskCycleGAN-VC
│   ├── utils.py                    <- utility functions to train and test MaskCycleGAN-VC

Acknowledgements

This repository was inspired by jackaduma's implementation of CycleGAN-VC2.

Issues
  • Much worse results than in paper

    Hi,

    First of all, thank you for this nice implementation.

    I trained the network with the default settings and data (~500k iterations), but the results sound really unnatural (e.g., link) and are far from the samples provided by the author of the paper: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/maskcyclegan-vc/index.html

    Why is this? Did you experience the same, or did you get nice results?

    opened by terbed 21
  • Discussions about the mask implementation

    Thank you for your code! One point confused me: according to the original paper, the mask is applied to the source Mel-spectrogram but not to the target Mel-spectrogram, while your scripts mask both. Why is the target side also masked? Looking forward to your reply!

    opened by youngboy52 3
  • Can't set it up on Windows

    Tried to run the conda command, but it seems to have a problem getting the packages. What should I do?

    opened by BenRafatian 2
  • decay_after and identity_loss stopping parameters are different

    The original article says:

    We used L_id only for the first 1e4 iterations to guide the learning direction. 
    We kept the same learning rate for the first 2e5 iterations and linearly decay over the next 2e5 iterations.
    

    But in the code, decay_after = 1e4 also controls when the identity loss is removed.

    I corrected this by adding a new argument: stop_identity_after

    There are also minor updates with no effect, and an experimental change that adds spectral norm to the discriminator, which should be discarded.

    enhancement 
    opened by terbed 2
  • hardware environment

    Hi, I was very surprised to find this project. Would you mind telling me the configuration of your experimental hardware? For example, how much memory does the graphics card need? And how long does it take to train for several hundred epochs and get excellent results? Thank you.

    opened by mengxiangjiacherry 2
  • load model more than 1000 epochs

    If I want to continue training after more than 1,000 epochs (for example, 1,200) by setting --continue_train and --load_epoch 1200 in train.py, the checkpoint path is wrong. In model_saver.py, line 76, the path of the model to be loaded uses only three digits for the epoch.

    bug 
    opened by masaikk 1
  • Wrong spectrogram scaling in inference script

    In the inference script the scaling back with speaker statistics is wrong:

                wav_fake_B = decode_melspectrogram(self.vocoder, fake_B[0].detach(
                ).cpu(), self.dataset_A_mean, self.dataset_A_std).cpu()
    

    This should be scaled with B speaker statistics.

    This correction largely improves the performance of the inference.

    Solves issue #3.

    bug 
    opened by terbed 1
  • Optimal duration of audio for source and target

    Hello, what is the optimal duration? Is there any advantage to using 4 hours for the target and 2 hours for the source, or 4 hours for the source? I can definitely hear differences when the duration is longer, e.g., 30 minutes vs. 2 hours.

    Thanks

    opened by faranaziz 1
  • extended model ckpt epoch to 5 digits

    A user reported a bug in which continuing training beyond epoch 999 did not work because we hard-coded the epoch to be displayed in 3 digits. Extended this to 5 digits to allow longer training.

    bug 
    opened by HikaruHotta 0
  • Fixed training loop's number of iterations

    Hey, I noticed this minor detail in the code. Since start_epoch is equal to 1 by default, if, for example, num_epochs is specified to be 50, the loop will exit after completing the 49th epoch.

    opened by KarlHajal 0
  • Segmentation fault (core dumped)

    @HikaruHotta what epoch value should I use for training? 6172 is a very large number when each epoch takes 15 min on a GPU. As a trial, I set the number of epochs to 10, and after the last epoch completed it gave "Segmentation fault (core dumped)". Please tell me how I can fix this error.

    I ran the model on AWS SageMaker, as an epoch took 55 min on my laptop's CPU...

    opened by 44aayush 0
  • same content and voice in testing

    Hi everyone, I started testing after the debugging @HikaruHotta did. I reached a generator loss of 7.5 and a discriminator loss of 0.29, so I decided to convert a sample and see if the result is good. But there is a problem: my output wav file, converted from speaker s1 to s2, is just the same evaluation file I have for s1 with some noise added. It's not voice conversion, just the same file with some noise. Can you tell me what the problem is?

    opened by todalex 3
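
The schedule quoted in the decay_after issue above can be written down as two small functions. This is only a sketch of the quoted sentences, not code from the repository: the function names are hypothetical, the default base learning rate is the --lr value from the training command, decaying all the way to zero is an assumption, and stop_identity_after is the argument name proposed in that issue.

```python
def learning_rate(iteration, base_lr=5e-4, constant_iters=200_000, decay_iters=200_000):
    """Keep base_lr for the first 2e5 iterations, then decay linearly over the next 2e5."""
    if iteration < constant_iters:
        return base_lr
    # Assumption: the linear decay ends at zero and the rate stays there afterwards.
    progress = min(iteration - constant_iters, decay_iters) / decay_iters
    return base_lr * (1.0 - progress)


def use_identity_loss(iteration, stop_identity_after=10_000):
    """Apply the identity loss L_id only during the first 1e4 iterations."""
    return iteration < stop_identity_after


for it in (0, 10_000, 200_000, 300_000, 400_000):
    print(it, learning_rate(it), use_identity_loss(it))
```

Keeping the two thresholds as separate parameters is what lets the identity loss stop at 1e4 iterations while the learning rate stays constant until 2e5.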