Implementation of Kaneko et al.'s MaskCycleGAN-VC model for non-parallel voice conversion.

Last update: Dec 25, 2022

Overview

MaskCycleGAN-VC

Unofficial PyTorch implementation of Kaneko et al.'s MaskCycleGAN-VC (2021) for non-parallel voice conversion.

MaskCycleGAN-VC was, at the time of its publication in 2021, a state-of-the-art CycleGAN-based method for non-parallel voice conversion. It is trained with an auxiliary task called filling in frames (FIF): a temporal mask is applied to the input Mel-spectrogram, and the generator must fill in the masked frames. The paper reports marked improvements over prior models such as CycleGAN-VC (2018), CycleGAN-VC2 (2019), and CycleGAN-VC3 (2020).
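To make the FIF idea concrete, here is a minimal pure-Python sketch of building and applying a temporal mask. It is not the repository's code: the function names, the 80 Mel bins, and the uniform sampling of the mask length and position are assumptions for illustration. The defaults of 64 frames and a maximum mask length of 25 mirror the --num_frames and --max_mask_len values used in the training command.

```python
import random

def make_temporal_mask(num_frames=64, max_mask_len=25, rng=random):
    """Return a list of 0.0/1.0 values with one contiguous run of zeros.

    Hypothetical helper: zeros mark the frames hidden from the generator.
    """
    mask_len = rng.randint(0, max_mask_len)        # randint is inclusive on both ends
    start = rng.randint(0, num_frames - mask_len)  # keep the run inside the clip
    return [0.0 if start <= t < start + mask_len else 1.0
            for t in range(num_frames)]

def apply_mask(mel, mask):
    """Zero out masked frames. `mel` is a list of rows (one per Mel bin)."""
    return [[value * m for value, m in zip(row, mask)] for row in mel]

rng = random.Random(0)                  # seeded for reproducibility
mel = [[1.0] * 64 for _ in range(80)]   # dummy spectrogram: 80 bins (assumed) x 64 frames
mask = make_temporal_mask(rng=rng)
masked_mel = apply_mask(mel, mask)
```

In the paper's formulation the generator receives both the masked spectrogram and the mask itself, so a real implementation would pass `mask` along with `masked_mel`.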

Figure 1: MaskCycleGAN-VC Training

Figure 2: MaskCycleGAN-VC Generator Architecture

Figure 3: MaskCycleGAN-VC PatchGAN Discriminator Architecture

Paper: https://arxiv.org/pdf/2102.12841.pdf

Repository Contributors: Claire Pajot, Hikaru Hotta, Sofian Zalouk

Setup

Clone the repository.

git clone git@github.com:GANtastic3/MaskCycleGAN-VC.git
cd MaskCycleGAN-VC

Create the conda environment.

conda env create -f environment.yml
conda activate MaskCycleGAN-VC

VCC2018 Dataset

The authors of the paper used the dataset from the Spoke task of Voice Conversion Challenge 2018 (VCC2018). This is a dataset of non-parallel utterances from 6 male and 6 female speakers. Each speaker utters approximately 80 sentences.

Download the dataset from the command line.

wget --no-check-certificate -O vcc2018_database_training.zip "https://datashare.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_training.zip?sequence=2&isAllowed=y"
wget --no-check-certificate -O vcc2018_database_evaluation.zip "https://datashare.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_evaluation.zip?sequence=3&isAllowed=y"
wget --no-check-certificate -O vcc2018_database_reference.zip "https://datashare.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_reference.zip?sequence=5&isAllowed=y"

Unzip the dataset files.

mkdir vcc2018
apt-get install unzip
unzip vcc2018_database_training.zip -d vcc2018/
unzip vcc2018_database_evaluation.zip -d vcc2018/
unzip vcc2018_database_reference.zip -d vcc2018/
mv -v vcc2018/vcc2018_reference/* vcc2018/vcc2018_evaluation
rm -rf vcc2018/vcc2018_reference

Data Preprocessing

To expedite training, we preprocess the dataset by converting waveforms to Mel-spectrograms, then save the normalized spectrograms as pickle files (ending in normalized.pickle) and the normalization statistics (mean, std) as npz files (ending in _norm_stats.npz). We convert waveforms to spectrograms with a MelGAN vocoder so that voice-converted spectrograms can be decoded back to waveforms and you can listen to your samples during inference.
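The normalization step can be sketched as follows. This is an illustration only, not the code in preprocess_vcc2018.py: it assumes one scalar mean and one scalar standard deviation per speaker, computed over all spectrogram values, and the function names are hypothetical.

```python
import statistics

def compute_norm_stats(spectrograms):
    """Return (mean, std) over every value in a speaker's spectrograms.

    Assumption: a single scalar mean/std per speaker; the real script
    may compute statistics differently (for example, per Mel bin).
    """
    values = [v for spec in spectrograms for row in spec for v in row]
    return statistics.fmean(values), statistics.pstdev(values)

def normalize(spec, mean, std):
    """Shift and scale one spectrogram to zero mean and unit variance."""
    return [[(v - mean) / std for v in row] for row in spec]

# Toy data: one "spectrogram" with 2 bins and 2 frames.
specs = [[[1.0, 2.0], [3.0, 4.0]]]
mean, std = compute_norm_stats(specs)
normalized = normalize(specs[0], mean, std)
```

Storing the statistics next to the normalized data matters because the same mean and std are needed later to map generated spectrograms back to the vocoder's input range.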

python data_preprocessing/preprocess_vcc2018.py \
  --data_directory vcc2018/vcc2018_training \
  --preprocessed_data_directory vcc2018_preprocessed/vcc2018_training \
  --speaker_ids VCC2SF1 VCC2SF2 VCC2SF3 VCC2SF4 VCC2SM1 VCC2SM2 VCC2SM3 VCC2SM4 VCC2TF1 VCC2TF2 VCC2TM1 VCC2TM2

python data_preprocessing/preprocess_vcc2018.py \
  --data_directory vcc2018/vcc2018_evaluation \
  --preprocessed_data_directory vcc2018_preprocessed/vcc2018_evaluation \
  --speaker_ids VCC2SF1 VCC2SF2 VCC2SF3 VCC2SF4 VCC2SM1 VCC2SM2 VCC2SM3 VCC2SM4 VCC2TF1 VCC2TF2 VCC2TM1 VCC2TM2

Training

Train MaskCycleGAN-VC to convert between <speaker_A_id> and <speaker_B_id>. You should start to get good results after several hundred epochs.

python -W ignore::UserWarning -m mask_cyclegan_vc.train \
    --name mask_cyclegan_vc_<speaker_A_id>_<speaker_B_id> \
    --seed 0 \
    --save_dir results/ \
    --preprocessed_data_dir vcc2018_preprocessed/vcc2018_training/ \
    --speaker_A_id <speaker_A_id> \
    --speaker_B_id <speaker_B_id> \
    --epochs_per_save 100 \
    --epochs_per_plot 10 \
    --num_epochs 6172 \
    --batch_size 1 \
    --lr 5e-4 \
    --decay_after 1e4 \
    --sample_rate 22050 \
    --num_frames 64 \
    --max_mask_len 25 \
    --gpu_ids 0 \

To continue training from a previous checkpoint in the case that training is suspended, add the argument --continue_train while keeping all others the same. The model saver class will automatically load the most recently saved checkpoint and resume training.
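As an illustration of how "most recently saved checkpoint" can be resolved, the sketch below picks the highest epoch from a list of checkpoint filenames. The naming scheme (zero-padded epoch, underscore, model name, .pth.tar) and both function names are hypothetical and are not taken from model_saver.py. Parsing the epoch as an integer, rather than comparing strings or slicing a fixed number of digits, keeps the lookup correct past 999 epochs.

```python
import re
from pathlib import Path

# Hypothetical checkpoint naming scheme: "<epoch>_<model_name>.pth.tar"
CKPT_PATTERN = re.compile(r"^(\d+)_(.+)\.pth\.tar$")

def latest_epoch(filenames):
    """Return the largest epoch number found, or None if nothing matches."""
    epochs = []
    for name in filenames:
        match = CKPT_PATTERN.match(Path(name).name)
        if match:
            epochs.append(int(match.group(1)))  # numeric, so 1200 > 900
    return max(epochs) if epochs else None

def checkpoint_name(epoch, model_name):
    """Build a filename; the padding widens automatically for large epochs."""
    return f"{epoch:05d}_{model_name}.pth.tar"

files = [
    "00100_generator_A2B.pth.tar",
    "01200_generator_A2B.pth.tar",
    "00900_generator_A2B.pth.tar",
    "notes.txt",
]
resume_from = latest_epoch(files)
```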

Launch Tensorboard in a separate terminal window.

tensorboard --logdir results/logs

Testing

Test your trained MaskCycleGAN-VC by converting between <speaker_A_id> and <speaker_B_id> on the evaluation dataset. Your converted .wav files are stored in results/<name>/converted_audio, where <name> is the value passed to --name.

python -W ignore::UserWarning -m mask_cyclegan_vc.test \
    --name mask_cyclegan_vc_VCC2SF3_VCC2TF1 \
    --save_dir results/ \
    --preprocessed_data_dir vcc2018_preprocessed/vcc2018_evaluation \
    --gpu_ids 0 \
    --speaker_A_id VCC2SF3 \
    --speaker_B_id VCC2TF1 \
    --ckpt_dir /data1/cycleGAN_VC3/mask_cyclegan_vc_VCC2SF3_VCC2TF1/ckpts \
    --load_epoch 500 \
    --model_name generator_A2B

Toggle between A->B and B->A conversion by setting --model_name as either generator_A2B or generator_B2A.

Select the epoch to load your model from by setting --load_epoch.

Code Organization

├── README.md                       <- Top-level README.
├── environment.yml                 <- Conda environment
├── .gitignore
├── LICENSE
|
├── args
│   ├── base_arg_parser             <- arg parser
│   ├── train_arg_parser            <- arg parser for training (inherits base_arg_parser)
│   ├── cycleGAN_train_arg_parser   <- arg parser for training MaskCycleGAN-VC (inherits train_arg_parser)
│   ├── cycleGAN_test_arg_parser    <- arg parser for testing MaskCycleGAN-VC (inherits base_arg_parser)
│
├── bash_scripts
│   ├── mask_cyclegan_train.sh      <- sample script to train MaskCycleGAN-VC
│   ├── mask_cyclegan_test.sh       <- sample script to test MaskCycleGAN-VC
│
├── data_preprocessing
│   ├── preprocess_vcc2018.py       <- preprocess VCC2018 dataset
│
├── dataset
│   ├── vc_dataset.py               <- torch dataset class for MaskCycleGAN-VC
│
├── logger
│   ├── base_logger.sh              <- logging to Tensorboard
│   ├── train_logger.sh             <- logging to Tensorboard during training (inherits base_logger)
│
├── saver
│   ├── model_saver.py              <- saves and loads models
│
├── mask_cyclegan_vc
│   ├── model.py                    <- defines MaskCycleGAN-VC model architecture
│   ├── train.py                    <- training script for MaskCycleGAN-VC
│   ├── test.py                     <- testing script for MaskCycleGAN-VC
│   ├── utils.py                    <- utility functions to train and test MaskCycleGAN-VC

Acknowledgements

This repository was inspired by jackaduma's implementation of CycleGAN-VC2.

Comments

Much worse results than in paper

Hi,

First of all, thank you for this nice implementation.

I trained the network with the default settings and data (~500k iterations), but the results sound really unnatural (e.g., link) and are far from the samples provided by the author of the paper: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/maskcyclegan-vc/index.html

Why is this? Did you experience the same, or did you get good results?

opened by terbed 21

Input feature

Hi!

Thanks for sharing code.

Why did you use the vocoder() function to extract features? These features are Mel-spectrograms, right? Why not just use librosa? I tried it, and the differences between the extracted features are not negligible; the ranges are very different.

opened by EmreOzkose 5

Discussions about the mask implementation

Thank you for your code! One thing confuses me: according to the original paper, the mask is applied to the source Mel-spectrogram and not to the target Mel-spectrogram, whereas your scripts mask both. Why is the target side masked as well? Looking forward to your reply!

opened by youngboy52 3

How to use it with non-parallel data?

I was wondering whether I could use two non-parallel voice datasets, where each speaker utters completely different sentences and there is no transcript of the data.

Also, does this program perform end-to-end conversion, or does it train a speech-to-text model and then recreate the speech in the converted voice from the text?

opened by BenRafatian 2

hardware environment

Hi, I was very surprised to find this project. Would you mind telling me what your experimental hardware configuration is, for example how much GPU memory is needed? Also, how long does training for several hundred epochs take to get excellent results? Thank you.

opened by mengxiangjiacherry 2

decay_after and identity_loss stopping parameters are different

The original article says:

We used L_id only for the first 1e4 iterations to guide the learning direction. We kept the same learning rate for the first 2e5 iterations and linearly decay over the next 2e5 iterations.

But in the code, decay_after = 1e4, and this value also controls when the identity loss is removed.

I corrected this issue by adding a new argument: stop_identity_after

There are also minor updates with no effects and an experimental update regarding a spectral norm in the discriminator model which is to be discarded.
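The schedule quoted from the paper can be sketched as two small functions. This is an illustration, not the code from this pull request: the function names are hypothetical, stop_identity_after is used here only as a parameter name, and decaying linearly all the way to zero is an assumption, since the quoted sentence does not state the final learning rate.

```python
def learning_rate(iteration, base_lr, decay_start=200_000, decay_length=200_000):
    """Constant for the first 2e5 iterations, then linear decay over the next 2e5.

    Assumption: the rate decays to zero at the end of the decay window.
    """
    if iteration < decay_start:
        return base_lr
    progress = min((iteration - decay_start) / decay_length, 1.0)
    return base_lr * (1.0 - progress)

def use_identity_loss(iteration, stop_identity_after=10_000):
    """The identity loss L_id is applied only during the first 1e4 iterations."""
    return iteration < stop_identity_after

# Example with an illustrative base learning rate of 2e-4.
lr_at_start = learning_rate(0, 2e-4)
lr_midway = learning_rate(300_000, 2e-4)
lr_at_end = learning_rate(400_000, 2e-4)
```

Keeping the two thresholds as separate parameters is the point of the fix: with a single decay_after value, changing when the learning rate starts to decay would also change when the identity loss is dropped.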
enhancement

opened by terbed 2

Controllable Generation

Hi,

Thanks for the well-documented implementation; the outputs from the model are really good. I know the original paper was built for one-to-one conversion, but I'm wondering whether it is possible to control the output by manipulating the z vector input to the generator using a pre-trained classifier. For example, if I train one source speaker against multiple target speakers (recordings of multiple speakers in one dataset acting as one target) and train a classifier to differentiate each speaker, will it be possible to steer the output toward the target we want? I'm just wondering whether the structure of this model allows it. A reply would be highly appreciated, since I'm planning to use this implementation for my undergraduate final-year project in a couple of months. Thanks!

opened by Pasinduekanayake 1

Error for starting to train, and not using GPU for training

train.py: error: unrecognized arguments: --lr 5e-4

I've done everything as described, but this error occurs. I'm running Ubuntu 20.04 with a 2080 Ti.

opened by BenRafatian 1

Optimal duration of audio for source and target

Hello, what is the optimal duration? Is there any advantage to using 4 hours for the target and 2 hours for the source, or 4 hours for the source? I can definitely hear differences when the duration is longer, e.g., 30 minutes vs. 2 hours.

Thanks

opened by faranaziz 1

Wrong spectrogram scaling in inference script

In the inference script, the scaling back with speaker statistics is wrong:

wav_fake_B = decode_melspectrogram(self.vocoder, fake_B[0].detach( ).cpu(), self.dataset_A_mean, self.dataset_A_std).cpu()

This should be scaled with B speaker statistics.

This correction largely improves the performance of the inference.
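A small sketch shows why the choice of statistics matters. It assumes the usual inverse of mean/std normalization (multiply by std, add mean); the denormalize helper and all numbers are made up for illustration and are not the repository's decode_melspectrogram.

```python
def denormalize(normalized_spec, mean, std):
    """Invert (x - mean) / std for every value of a spectrogram."""
    return [[v * std + mean for v in row] for row in normalized_spec]

# Made-up per-speaker statistics (mean, std) for illustration only.
stats_A = (-5.0, 2.0)
stats_B = (-4.0, 1.5)

# A generated spectrogram in speaker B's normalized space (A -> B output).
fake_B = [[0.0, 1.0], [-1.0, 0.5]]

correct = denormalize(fake_B, *stats_B)  # scaled back with the target speaker's stats
wrong = denormalize(fake_B, *stats_A)    # the bug: source speaker's stats
```

With the wrong statistics every value is shifted and rescaled, so the vocoder receives a spectrogram outside the range the generator was trained to produce for speaker B.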

Solves issue #3.

bug

opened by terbed 1

load model more than 1000 epochs

If I want to continue training after more than 1,000 epochs (for example, 1,200) by setting --continue_train and --load_epoch 1200 in train.py, the checkpoint path is wrong. At line 76 of model_saver.py, the path of the model to load is built with only three digits for the epoch.

bug

opened by masaikk 1

High pitch source not working in MaskCycleGAN

Hi @hikaruhotta @KarlHajal @terbed, thank you so much for your excellent work. Very nice paper.

I tried using high-pitched dialogue as the source to convert to the target audio, but it didn't work.

Do you know the reason for this, and how can we overcome the issue?

Please share any suggestions.

Thanks

opened by MuruganR96 0

Music Conversion

Has this been used for genre style conversion or emotion conversion in music? Also, does this work for arbitrary lengths of audio or will one have to adjust the network for spectrograms that were not from the dataset used in the paper?

opened by milesigel 0

same content and voice in testing

Hi everyone. I started testing after the debugging that @HikaruHotta did, and I reached a generator loss of 7.5 and a discriminator loss of 0.29, so I decided to convert a sample and check whether the result is good. But there is a problem: my output .wav file, converted from speaker s1 to s2, is just the same evaluation file I have for s1 with some noise added. There is no voice conversion at all. Can you tell me what the problem is?

opened by todalex 3
