StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion

Overview

Yinghao Aaron Li, Ali Zare, Nima Mesgarani

We present an unsupervised non-parallel many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2. Using a combination of adversarial source classifier loss and perceptual loss, our model significantly outperforms previous VC models. Although our model is trained only with 20 English speakers, it generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion. Using a style encoder, our framework can also convert plain reading speech into stylistic speech, such as emotional and falsetto speech. Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task revealed that our model produces natural-sounding voices, close to the sound quality of state-of-the-art text-to-speech (TTS) based voice conversion methods without the need for text labels. Moreover, our model is fully convolutional and, with a faster-than-real-time vocoder such as Parallel WaveGAN, can perform real-time voice conversion.

Paper: https://arxiv.org/abs/2107.10394

Audio samples: https://starganv2-vc.github.io/

Pre-requisites

  1. Python >= 3.7
  2. Clone this repository:
git clone https://github.com/yl4579/StarGANv2-VC.git
cd StarGANv2-VC
  3. Install python requirements:
pip install SoundFile torchaudio munch parallel_wavegan torch pydub
  4. Download and extract the VCTK dataset and use VCTK.ipynb to prepare the data (downsample to 24 kHz etc.; a downsampling sketch follows this list). You can also download the dataset we have prepared, unzip it to the Data folder, and use the provided config.yml to reproduce our models.
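If you prepare the data yourself, the key step is resampling the 48 kHz VCTK recordings to 24 kHz mono. Below is a minimal sketch of that step using pydub (already in the requirements above); the folder layout is an example, and VCTK.ipynb remains the reference preprocessing.

from pathlib import Path
from pydub import AudioSegment

SRC = Path("VCTK-Corpus/wav48")  # original 48 kHz recordings (example layout)
DST = Path("Data")               # output folder used by the data lists

for wav in SRC.rglob("*.wav"):
    out = DST / wav.relative_to(SRC)
    out.parent.mkdir(parents=True, exist_ok=True)
    # Resample to 24 kHz and collapse to mono before exporting
    audio = AudioSegment.from_wav(str(wav))
    audio.set_frame_rate(24000).set_channels(1).export(str(out), format="wav")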

Training

python train.py --config_path ./Configs/config.yml

Please specify the training and validation data in the config.yml file. Change num_domains to the number of speakers in the dataset. The data list format is filename.wav|speaker_number, one entry per line; see train_list.txt as an example.
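For illustration, here is a hedged sketch that generates such lists from a folder of per-speaker subdirectories; the Data/<speaker> layout, the zero-based speaker numbering, and the 90/10 split are assumptions, and train_list.txt remains the authoritative format.

import random
from pathlib import Path

# One subdirectory per speaker, e.g. Data/p225/*.wav (assumed layout)
speakers = sorted(p for p in Path("Data").iterdir() if p.is_dir())
lines = [f"{wav}|{idx}"
         for idx, spk in enumerate(speakers)
         for wav in sorted(spk.glob("*.wav"))]

# Shuffle once, then split into training and validation lists
random.seed(0)
random.shuffle(lines)
split = int(0.9 * len(lines))
Path("Data/train_list.txt").write_text("\n".join(lines[:split]) + "\n")
Path("Data/val_list.txt").write_text("\n".join(lines[split:]) + "\n")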

Checkpoints and Tensorboard logs will be saved to log_dir. To speed up training, you may want to make batch_size as large as your GPU memory allows. However, please note that batch_size = 5 takes around 10 GB of GPU memory.
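For orientation, an illustrative config.yml excerpt covering the settings mentioned above; the key grouping and values here are examples, not the shipped defaults, so consult the provided Configs/config.yml.

# Illustrative excerpt only; see the provided Configs/config.yml for the real file.
log_dir: ./Models/run1    # checkpoints and Tensorboard logs are written here
batch_size: 5             # takes roughly 10 GB of GPU memory (see above)
pretrained_model: ""      # set to a checkpoint path when fine-tuning
fp16_run: false           # fp16 can produce NaNs (see the comments below)
num_domains: 20           # number of speakers in the training data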

Inference

Please refer to inference.ipynb for details.

The pretrained StarGANv2 and ParallelWaveGAN on the VCTK corpus can be downloaded at StarGANv2 Link and ParallelWaveGAN Link. Please unzip them to Models and Vocoder respectively and run each cell in the notebook.
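For a sense of the vocoder side, here is a hedged sketch of loading a ParallelWaveGAN checkpoint with the parallel_wavegan package and synthesizing a waveform; the checkpoint filename is an example, inference.ipynb is the reference, and (as discussed in the comments below) the mel spectrogram must match the preprocessing in meldataset.py.

import torch
from parallel_wavegan.utils import load_model

# Example checkpoint path; use the file inside the ParallelWaveGAN zip
vocoder = load_model("Vocoder/checkpoint-400000steps.pkl").eval()
vocoder.remove_weight_norm()

mel = torch.randn(200, 80)  # placeholder (frames, n_mels); use a real mel
with torch.no_grad():
    # The mel must match the features the vocoder was trained on
    # (see the log-base discussion in the comments below)
    wav = vocoder.inference(mel).view(-1)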

ASR & F0 Models

The pretrained F0 and ASR models are provided under the Utils folder. Both the F0 and ASR models are trained with mel spectrograms preprocessed using meldataset.py, and both models are trained on speech data only.
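For reference, a minimal log-mel sketch in the spirit of meldataset.py; the parameter values and normalization constants below are assumptions for illustration (pieced together from the comments further down), so treat meldataset.py itself as authoritative.

import torch
import torchaudio

# Assumed parameters for illustration; meldataset.py holds the real values.
# Note torchaudio's default sample_rate is 16000 even when fed 24 kHz audio
# (see the "Doubts about sampling rate" comment below).
to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4.0, 4.0  # assumed global normalization constants

def preprocess(wave: torch.Tensor) -> torch.Tensor:
    """Waveform (1, T) -> normalized log-mel (1, n_mels, frames)."""
    mel = to_mel(wave)
    # Natural log, not log10 -- a known source of vocoder mismatch
    # (see "Compatibility with custom vocoder checkpoints?" below)
    return (torch.log(1e-5 + mel) - mean) / std

wave, sr = torchaudio.load("sample.wav")  # expects 24 kHz input
logmel = preprocess(wave)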

The ASR model is trained on an English corpus, but it appears to work when training StarGANv2 models in other languages, such as Japanese. The F0 model also appears to work with singing data. For the best performance on non-English or non-speech data, however, you are encouraged to train your own ASR and F0 models.

You can edit meldataset.py to use your own mel spectrogram preprocessing, but the provided pretrained models will then no longer work, and you will need to train your own ASR and F0 models with the new preprocessing. You may refer to the repos Diamondfan/CTC_pytorch and keums/melodyExtraction_JDC, for example, to train your own ASR and F0 models.

Acknowledgement

The author would like to thank @tosaka-m for his great repository and valuable discussions.

Comments
  • How to improve multilingual singing voice conversion?

    Speech conversion performs well, but singing conversion is not ideal. For singing voice conversion, can you explain how to use HiFi-GAN? HiFi-GAN also has a pretrained model with the same parameters. Do you have any plans to improve singing conversion next?

    opened by MMMMichaelzhang 9
  • Is there pretrained HIFI gan vocoder?

    Thank you for sharing this great repository.

    I tried converting some songs, but there is a problem with high pitches: the pretrained vocoder cannot reproduce them.

    So if you have a pretrained HiFi-GAN .pkl file with a suitable config, please share it.

    Thanks.

    opened by netuserjun 8
  • Doubts about sampling rate

    I noticed that the torchaudio.transforms.MelSpectrogram you use is configured for a 16,000 Hz sampling rate, but the wav files are read at 24,000 Hz. In other words, you compute a mel spectrogram with 16 kHz settings on 24 kHz audio and use it as the target. Will this affect the results?

    opened by 980202006 7
  • Compatibility with custom vocoder checkpoints?

    Greetings!

    I've tried to replace your supplied pretrained ParallelWaveGAN checkpoint with a different one I trained (Using the implementation over at https://github.com/kan-bayashi/ParallelWaveGAN ), to go along with a custom StarGANv2-VC checkpoint.

    I copied the parameters from the config.yml that you supply with your pretrained checkpoint exactly, and used your checkpoint for finetuning.

    However, the resulting vocoder checkpoint cannot use the output of StarGANv2-VC correctly: it produces near-clipping output that is heavily skewed toward low frequencies, even when running it over the original wave files for testing.

    After some investigation (and a lot of headache), it seems that the mel spectrograms produced by StarGANv2-VC use a different log base and are not compatible. So I trained a PWG vocoder with log() instead of the default log10(), but this also did not yield acceptable results.

    It seems that the normalization you use for StarGANv2-VC is also different(?). Your own vocoder checkpoint was evidently trained with those changes implemented, since it works fine out of the box, but they are not documented.

    If possible, could you please share details about what changes you made when training your ParallelWaveGAN checkpoint, so that other vocoder checkpoints may be correctly trained for use with StarGANv2-VC? That would be great.

    opened by Kreevoz 7
  • Live inference on windows

    Great work! Is there any detailed code for live (real-time) inference? I have tried it, but unfortunately I still lack some experience here. Any help would be appreciated.

    opened by Chopin68 5
  • F0 model training on 160 mels

    (screenshot attached in the original issue)

    I am training the F0 model on 160 mel bins, using the pretrained F0 model you provided as the base model. The training data is LibriSpeech clean-100. Is the model overfitting after only 10-20 epochs?

    opened by ashfaquekhalid 5
  • How to fine-tune the model

    When I fine-tuned the model, I had 20 speakers and the checkpoint was epoch_00300.pth. Now I want to add one more speaker; how should I set this up? Do I change pretrained_model in config.yml and set num_domains to 21? Can you tell me how to do it? Thanks.

    opened by MMMMichaelzhang 5
  • Must the ASR training data be the same as the StarGAN training data?

    Hi, I am trying to train StarGANv2-VC. One question about the datasets used for training the ASR and VC models: can I use different datasets for ASR training and StarGAN training?

    opened by joan126 4
  • At what loss will the model converge?

    Hi, I want to train a new model using mixed open-source data. At roughly what loss does the model converge, and what GPU resources and training time are required?

    opened by 980202006 4
  • Error during training

    {'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 1346}
    {'max_lr': 2e-06, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 1346}
    {'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 1346}
    {'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 1346}
    {'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 1346}
    {'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 1346}
    [train]: 100%|█████| 1346/1346 [15:04<00:00, 1.49it/s]
    [eval]: 99%|█████| 92/93 [00:31<00:00, 2.88it/s]
    Traceback (most recent call last):
      File "train.py", line 156, in <module>
        main()
      File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
        return __callback(*args, **kwargs)
      File "train.py", line 126, in main
        eval_results = trainer._eval_epoch()
      File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/mnt/e/ai/src/StarGANv2-VC/trainer.py", line 255, in _eval_epoch
        g_loss, g_losses_latent = compute_g_loss(
      File "/mnt/e/ai/src/StarGANv2-VC/losses.py", line 106, in compute_g_loss
        loss_f0 = f0_loss(F0_fake, F0_real)
      File "/mnt/e/ai/src/StarGANv2-VC/losses.py", line 212, in f0_loss
        x_mean = compute_mean_f0(x_f0)
      File "/mnt/e/ai/src/StarGANv2-VC/losses.py", line 203, in compute_mean_f0
        f0_mean = f0_mean.expand(f0.shape[-1], f0_mean.shape[0]).transpose(0, 1)  # (B, M)
    IndexError: tuple index out of range

    opened by CrackerHax 3
  • How to change pitch

    Hi, thanks for this project, it is amazing. When converting between male and female voices, I didn't know how to tune the pitch. Now I'm converting a male voice to a female voice and the pitch is too high. Can you tell me where to adjust it? Thank you very much.

    opened by MMMMichaelzhang 3
  • Support needed for Teacher Forcing HiFiGAN Vocoder Finetuning

    @yl4579 Thank you so much for this wonderful work.

    I need support for teacher-forced HiFi-GAN vocoder fine-tuning: I couldn't get the converted mel spectrograms to work as input for HiFi-GAN fine-tuning.

    Thanks

    opened by MuruganR96 0
  • Support needed for Multiple Discriminators Implementation

        Using multiple discriminators is effective, and when the model converges, the sound quality on the unseen speaker is better, and the similarity to the target speaker is better than the original one.
    

    Originally posted by @980202006 in https://github.com/yl4579/StarGANv2-VC/issues/6#issuecomment-945602695

    opened by MuruganR96 2
  • Seems to be incompatible with the colab environment

    Hello! It's great to have such an interesting and great project open source, I've read some articles and would like to try to train a model myself!

    This is probably a pretty simple problem to solve, but I'm not very familiar with python and this is my first attempt at it, so I'm hoping for some help!

    I followed the readme on Colab and everything seemed to be working fine, but I'm stuck here:

    python train.py --config_path ./Configs/config.yml

    To reduce the probability of errors, I currently chose to reproduce your VCTK model (step 4 in Pre-requisites)

    After running !python train.py --config_path ./Configs/config.yml, the following error was returned:

    [libprotobuf FATAL google/protobuf/stubs/common.cc:87] This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.17.3).  Contact the program author for an update.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
    terminate called after throwing an instance of 'google::protobuf::FatalException'
      what():  This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.17.3).  Contact the program author for an update.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
    

    How can I solve this problem? Thank you!

    opened by KJZH001 3
  • For song VC, what should I do?

    Hello, and thank you for sharing your great work, but I have some questions.

    1. For singing VC in Mandarin, I tried training a new StarGANv2-VC model with the pretrained ASR and F0 models, but the results do not sound good. Do you have any advice?
    2. For singing VC in Mandarin, do I need to retrain the ASR or F0 model? I'm looking forward to your reply, and thank you again.
    opened by panxin801 6
  • fp16_run: true causes NaNs

    Posting this as a new issue in case anybody else has the problem.

    There might be hacky workarounds for this but it seems to be a pytorch issue: https://github.com/pytorch/pytorch/issues/40497

    So I guess the answer is to train with fp16_run: false for now. It might be worth noting this in the README to save people time. Also, the config file default is set to true; it could probably be changed to false.

    opened by CrackerHax 0