VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Overview

Jaehyeon Kim, Jungil Kong, and Juhee Son

In our recent paper, we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on LJ Speech, a single-speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.

Visit our demo for audio samples.

We also provide the pretrained models.

[Figure: VITS at training (left) and VITS at inference (right)]

Pre-requisites

  1. Python >= 3.6
  2. Clone this repository
  3. Install Python requirements; please refer to requirements.txt
    1. You may need to install espeak first: apt-get install espeak
  4. Download datasets
    1. Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
    2. For the multi-speaker setting, download and extract the VCTK dataset, and downsample the wav files to 22050 Hz (a downsampling sketch follows this section). Then rename or create a link to the dataset folder: ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
  5. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace

# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK have already been provided.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt 
# python preprocess.py --text_index 2 --filelists filelists/vctk_audio_sid_text_train_filelist.txt filelists/vctk_audio_sid_text_val_filelist.txt filelists/vctk_audio_sid_text_test_filelist.txt
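
A minimal sketch for the VCTK downsampling in step 4.2, assuming librosa and soundfile are installed; the source and destination paths are placeholders, not part of this repo:

# downsample_vctk.py -- hedged sketch; paths are placeholders
import glob
import os

import librosa
import soundfile as sf

SRC = "/path/to/VCTK-Corpus/wav48"            # original 48 kHz recordings
DST = "/path/to/VCTK-Corpus/downsampled_wavs" # 22050 Hz copies linked as DUMMY2
TARGET_SR = 22050

for src_path in glob.glob(os.path.join(SRC, "**", "*.wav"), recursive=True):
    audio, _ = librosa.load(src_path, sr=TARGET_SR)  # load and resample in one step
    dst_path = os.path.join(DST, os.path.relpath(src_path, SRC))
    os.makedirs(os.path.dirname(dst_path), exist_ok=True)
    sf.write(dst_path, audio, TARGET_SR)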

Training Example

# LJ Speech
python train.py -c configs/ljs_base.json -m ljs_base

# VCTK
python train_ms.py -c configs/vctk_base.json -m vctk_base

Inference Example

See inference.ipynb
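
The notebook is the authoritative reference; below is a minimal single-speaker sketch along the same lines, using the repo's utils, models, text, and commons modules. The checkpoint path is a placeholder.

# minimal LJ Speech inference sketch; the checkpoint path is a placeholder
import torch
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/ljs_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model)
net_g.eval()
utils.load_checkpoint("/path/to/pretrained_ljs.pth", net_g, None)

# text -> symbol IDs (with blank tokens interspersed when add_blank is set)
text_norm = text_to_sequence("VITS is awesome!", hps.data.text_cleaners)
if hps.data.add_blank:
    text_norm = commons.intersperse(text_norm, 0)
x = torch.LongTensor(text_norm).unsqueeze(0)
x_lengths = torch.LongTensor([x.size(1)])

with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=.667, noise_scale_w=0.8,
                        length_scale=1)[0][0, 0].numpy()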

Comments
  • Anybody having luck fine tuning?

    I'm using a clean 40-hour dataset, female American, that I used on Tacotron with good results. I've trained on VITS twice now, and it starts overfitting around 70K steps. It's definitely intelligible and in the correct tone, but the prosody is way off. The first run had the default configs. On the second run I tried decreasing the learning rate and LR decay. That helped some with overall loss, but it still started overfitting around 70K.

    opened by TaoTeCha 18
  • KL Loss is right?

    When I looked up the KL divergence between two Gaussians, I found a formula that is different from your KL loss: https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians
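
    For reference, the closed-form KL divergence between two univariate Gaussians from that link is

    $$
    \mathrm{KL}\left(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2)\right)
    = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}.
    $$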

    good first issue 
    opened by BridgetteSong 13
  • Training error

    I just ran train.py and got this error:

        INFO:baker_base:{'train': {'log_interval': 200, 'eval_interval': 10000, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 16, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/baker_train.txt', 'validation_files': 'filelists/baker_valid.txt', 'max_wav_value': 32768.0, 'sampling_rate': 16000, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 0}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False}, 'model_dir': './logs\baker_base'}
        WARNING:baker_base:E:\vits\ is not a git repository, therefore hash value comparison will be ignored.
        INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
        INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
        Traceback (most recent call last):
          File "train.py", line 294, in <module>
            main()
          File "train.py", line 50, in main
            mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
          File "C:\Python38\lib\site-packages\torch\multiprocessing\spawn.py", line 240, in spawn
            return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
          File "C:\Python38\lib\site-packages\torch\multiprocessing\spawn.py", line 198, in start_processes
            while not context.join():
          File "C:\Python38\lib\site-packages\torch\multiprocessing\spawn.py", line 160, in join
            raise ProcessRaisedException(msg, error_index, failed_process.pid)
        torch.multiprocessing.spawn.ProcessRaisedException:

        -- Process 0 terminated with the following error:
        Traceback (most recent call last):
          File "C:\Python38\lib\site-packages\torch\multiprocessing\spawn.py", line 69, in _wrap
            fn(i, *args)
          File "E:\vits\train.py", line 119, in run
            train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
          File "E:\vits\train.py", line 139, in train_and_evaluate
            for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths) in enumerate(train_loader):
          File "C:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 530, in __next__
            data = self._next_data()
          File "C:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 1224, in _next_data
            return self._process_data(data)
          File "C:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 1250, in _process_data
            data.reraise()
          File "C:\Python38\lib\site-packages\torch\_utils.py", line 457, in reraise
            raise exception
        IndexError: Caught IndexError in DataLoader worker process 0.
        Original Traceback (most recent call last):
          File "C:\Python38\lib\site-packages\torch\utils\data\_utils\worker.py", line 287, in _worker_loop
            data = fetcher.fetch(index)
          File "C:\Python38\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
            data = [self.dataset[idx] for idx in possibly_batched_index]
          File "C:\Python38\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
            data = [self.dataset[idx] for idx in possibly_batched_index]
          File "E:\vits\data_utils.py", line 90, in __getitem__
            return self.get_audio_text_pair(self.audiopaths_and_text[index])
          File "E:\vits\data_utils.py", line 61, in get_audio_text_pair
            spec, wav = self.get_audio(audiopath)
          File "E:\vits\data_utils.py", line 67, in get_audio
            raise ValueError("{} {} SR doesn't match target {} SR".format(
        IndexError: Replacement index 2 out of range for positional args tuple

    Can anyone help me?

    opened by Eternity231 6
  • CPU infer slow

    Thank you for sharing your code. I tried it on my own (Chinese) dataset with this config:

        { "train": { "log_interval": 200, "eval_interval": 1000, "seed": 1234, "epochs": 10000, "learning_rate": 2e-4, "betas": [0.8, 0.99], "eps": 1e-9, "batch_size": 32, "fp16_run": false, "lr_decay": 0.999875, "segment_size": 8192, "init_lr_ratio": 1, "warmup_epochs": 0, "c_mel": 45, "c_kl": 1.0 }, "data": { "training_files":"filelists/mt_f065_train_filelist.txt", "validation_files":"filelists/mt_f065_val_filelist.txt", "text_cleaners":["collapse_whitespace"], "max_wav_value": 32768.0, "sampling_rate": 16000, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "n_mel_channels": 80, "mel_fmin": 0.0, "mel_fmax": null, "add_blank": true, "n_speakers": 0, "cleaned_text": false }, "model": { "inter_channels": 192, "hidden_channels": 192, "filter_channels": 768, "n_heads": 2, "n_layers": 6, "kernel_size": 3, "p_dropout": 0.1, "resblock": "1", "resblock_kernel_sizes": [3,7,11], "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]], "upsample_rates": [8,8,2,2], "upsample_initial_channel": 512, "upsample_kernel_sizes": [16,16,4,4], "n_layers_q": 3, "use_spectral_norm": false, "gin_channels": 256 } }

    I synthesized on CPU with batch size 1, but the speed is much worse than on GPU:

        total audio length: 77.76 s
        total cost (GPU):   4.84 s
        total cost (CPU):   151.26 s
        average RTF (GPU):  0.06
        average RTF (CPU):  1.95
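
    For reference, the real-time factor (RTF) is synthesis time divided by the duration of the generated audio; a quick check of the figures above:

    # RTF = synthesis time / duration of generated audio
    audio_len = 77.76                  # seconds of audio synthesized
    gpu_time, cpu_time = 4.84, 151.26  # seconds spent synthesizing

    print(f"GPU RTF: {gpu_time / audio_len:.2f}")  # 0.06
    print(f"CPU RTF: {cpu_time / audio_len:.2f}")  # 1.95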

    I checked the checkpoint and found that the G_*.pth file is 445 MB.

    My question is:

    1. Is the CPU inference time expected?
    2. Is the checkpoint I'm using correct?
    3. Could you kindly share any ideas on how to make CPU inference faster?

    Thank you in advance

    opened by OnceJune 6
  • How to fix the noise during inference time?

    Hi Jaehyeon,

    May I ask how to fix the stochastic noise at inference time? I want the generated audio to be reproducible, so I need to fix the random noise. Currently it seems I can only control the noise scale.

    sid = torch.LongTensor([1]) # speaker identity
    stn_tst = get_text("Tell me the answer please", hps_ms)
    
    with torch.no_grad():
        x_tst = stn_tst.unsqueeze(0)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)])
        audio = net_g_ms.infer(x_tst, x_tst_lengths, sid = sid, noise_scale=1, noise_scale_w=2, length_scale=1)[0][0,0].data.float().numpy()
    ipd.display(ipd.Audio(audio, rate=hps_ms.data.sampling_rate))
    
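    One workaround, sketched under the assumption that infer draws its noise from PyTorch's global RNG: re-seed that RNG immediately before each call, so the same seed always reproduces the same noise.

    torch.manual_seed(1234)  # fixed seed => identical noise => reproducible audio
    with torch.no_grad():
        audio = net_g_ms.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=1,
                               noise_scale_w=2, length_scale=1)[0][0, 0].data.float().numpy()
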
    opened by xinghua-qu 5
  • Can I train on a Vietnamese custom dataset? If yes, can you specify how to do it?

    I got this issue when I trained on a custom Vietnamese dataset. At first I thought some wav files in my dataset were corrupted, but they're not. How can I solve this?

        [INFO] {'train': {'log_interval': 200, 'eval_interval': 1000, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 16, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/viet_audio_text_train_filelist.txt.cleaned', 'validation_files': 'filelists/viet_audio_text_val_filelist.txt.cleaned', 'max_wav_value': 32768.0, 'text_cleaners': ['english_cleaners2'], 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 0, 'cleaned_text': True}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False}, 'model_dir': './logs/viet_base'}
        Traceback (most recent call last):
          File "train.py", line 290, in <module>
            main()
          File "train.py", line 50, in main
            mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
          File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 200, in spawn
            return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
          File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
            while not context.join():
          File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
            raise Exception(msg)
        Exception:

        -- Process 0 terminated with the following error:
        Traceback (most recent call last):
          File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
            fn(i, *args)
          File "/content/gdrive/.shortcut-targets-by-id/1FR_wE-NryBYYhwIzjEI52awuLJDEAFfV/vits/train.py", line 117, in run
            train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
          File "/content/gdrive/.shortcut-targets-by-id/1FR_wE-NryBYYhwIzjEI52awuLJDEAFfV/vits/train.py", line 137, in train_and_evaluate
            for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths) in enumerate(train_loader):
          File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 363, in __next__
            data = self._next_data()
          File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 989, in _next_data
            return self._process_data(data)
          File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
            data.reraise()
          File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 395, in reraise
            raise self.exc_type(msg)
        UnboundLocalError: Caught UnboundLocalError in DataLoader worker process 0.
        Original Traceback (most recent call last):
          File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
            data = fetcher.fetch(index)
          File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
            data = [self.dataset[idx] for idx in possibly_batched_index]
          File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
            data = [self.dataset[idx] for idx in possibly_batched_index]
          File "/content/gdrive/.shortcut-targets-by-id/1FR_wE-NryBYYhwIzjEI52awuLJDEAFfV/vits/data_utils.py", line 94, in __getitem__
            return self.get_audio_text_pair(self.audiopaths_and_text[index])
          File "/content/gdrive/.shortcut-targets-by-id/1FR_wE-NryBYYhwIzjEI52awuLJDEAFfV/vits/data_utils.py", line 62, in get_audio_text_pair
            spec, wav = self.get_audio(audiopath)
          File "/content/gdrive/.shortcut-targets-by-id/1FR_wE-NryBYYhwIzjEI52awuLJDEAFfV/vits/data_utils.py", line 66, in get_audio
            audio, sampling_rate = load_wav_to_torch(filename)
          File "/content/gdrive/.shortcut-targets-by-id/1FR_wE-NryBYYhwIzjEI52awuLJDEAFfV/vits/utils.py", line 134, in load_wav_to_torch
            sampling_rate, data = read(full_path)
          File "/usr/local/lib/python3.7/dist-packages/scipy/io/wavfile.py", line 609, in read
            return fs, data
        UnboundLocalError: local variable 'fs' referenced before assignment

    opened by tuannvhust 4
  • about multi-speaker data

    Thanks for your great work. I want to train with my own dataset, so I want to ask: what does the number '67' mean in your data, and how is it calculated?

    E.g.: DUMMY2/p229/p229_128.wav|67|The whole process is a vicious circle at the moment.
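
    For context, the multi-speaker filelists in this repo take the form audio_path|speaker_id|text (which is why preprocess.py is run with --text_index 2 for VCTK), so the second field should be the integer speaker ID. A sketch of the format, with the schema line illustrative and the example line taken from the question:

        <audio_path>|<speaker_id>|<transcript>
        DUMMY2/p229/p229_128.wav|67|The whole process is a vicious circle at the moment.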

    opened by LH997 4
  • Mispronunciation, and what is the purpose of the add_blank config?

    I've trained my own dataset with the default config except for the add_blank option (I changed it to false). I know the add_blank option inserts a 0 between the symbol IDs, but in my case it is disabled. I trained with phonemes and have reached 200K steps, but some phonemes seem to be pronounced wrong. So I have some questions:

    1. What is the purpose of the add_blank config? (See the sketch below.)
    2. Why does the model mispronounce some phonemes, and how can I improve its pronunciation?
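
    For context on question 1, a minimal sketch of the blank-insertion behavior, mirroring the repo's commons.intersperse (applied to the symbol IDs when add_blank is true):

    def intersperse(lst, item):
        # put `item` between and around the original elements:
        # [1, 2, 3] -> [0, 1, 0, 2, 0, 3, 0]
        result = [item] * (len(lst) * 2 + 1)
        result[1::2] = lst
        return result

    print(intersperse([1, 2, 3], 0))  # [0, 1, 0, 2, 0, 3, 0]
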
    opened by leminhnguyen 4
  • espeak

    Hello, I am looking to make a web demo for this on Gradio Hub (https://gradio.app/hub). Could https://pypi.org/project/py-espeak-ng/ be used instead of espeak?

    opened by AK391 4
  • AssertionError: 4D tensors expect 4 values for padding

    Training with 4 speakers reported:

        Traceback (most recent call last):
          File "train_ms.py", line 299, in <module>
            main()
          File "train_ms.py", line 55, in main
            mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
          File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
            return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
          File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
            while not context.join():
          File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 119, in join
            raise Exception(msg)
        Exception:

        -- Process 0 terminated with the following error:
        Traceback (most recent call last):
          File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
            fn(i, *args)
          File "/home/ubuntu/9/train_ms.py", line 123, in run
            train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
          File "/home/ubuntu/9/train_ms.py", line 143, in train_and_evaluate
            for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths, speakers) in enumerate(train_loader):
          File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
            data = self._next_data()
          File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 971, in _next_data
            return self._process_data(data)
          File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
            data.reraise()
          File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
            raise self.exc_type(msg)
        AssertionError: Caught AssertionError in DataLoader worker process 6.
        Original Traceback (most recent call last):
          File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
            data = fetcher.fetch(index)
          File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
            data = [self.dataset[idx] for idx in possibly_batched_index]
          File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
            data = [self.dataset[idx] for idx in possibly_batched_index]
          File "/home/ubuntu/9/data_utils.py", line 236, in __getitem__
            return self.get_audio_text_speaker_pair(self.audiopaths_sid_text[index])
          File "/home/ubuntu/9/data_utils.py", line 199, in get_audio_text_speaker_pair
            spec, wav = self.get_audio(audiopath)
          File "/home/ubuntu/9/data_utils.py", line 214, in get_audio
            spec = spectrogram_torch(audio_norm, self.filter_length,
          File "/home/ubuntu/9/mel_processing.py", line 63, in spectrogram_torch
            y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
          File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 3567, in _pad
            assert len(pad) == 4, '4D tensors expect 4 values for padding'
        AssertionError: 4D tensors expect 4 values for padding

    My config.json:

        {'train': {'log_interval': 200, 'eval_interval': 1000, 'seed': 1234, 'epochs': 10000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 32, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/train_filelist.txt.cleaned', 'validation_files': 'filelists/val_filelist.txt.cleaned', 'text_cleaners': ['japanese_cleaners'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 4, 'cleaned_text': True}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 256}, 'speakers': ['s1', 's2', 's3', 's4'], 'symbols': ['_', ',', '.', '!', '?', '-', 'A', 'E', 'I', 'N', 'O', 'Q', 'U', 'a', 'b', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'ʃ', 'ʧ', '↓', '↑', ' '], 'model_dir': '/home/ubuntu/nine'}

    opened by kagura114 3
  • some advice about MAS Algorithm

    As the Glow-TTS definition of MAS states: to satisfy monotonicity and surjectivity, if z_j is aligned to {μ_i, σ_i}, then the previous latent variable z_{j-1} must have been aligned to either {μ_{i-1}, σ_{i-1}} or {μ_i, σ_i}. This is equivalent to Forward Attention for Tacotron 2 (https://arxiv.org/abs/1807.06736). My question is:

    1. Would replacing max(Q_{i-1,j-1}, Q_{i,j-1}) with log(exp(Q_{i-1,j-1}) + exp(Q_{i,j-1})) be a better choice? (See the recursion written out below.)
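
    For reference, the Glow-TTS MAS recursion under discussion can be written as

    $$
    Q_{i,j} = \max\left(Q_{i-1,j-1},\, Q_{i,j-1}\right) + \log \mathcal{N}\left(z_j;\, \mu_i, \sigma_i\right),
    $$

    and the proposed change would replace the hard max with the soft log-sum-exp, i.e. a forward-algorithm-style accumulation instead of a Viterbi-style one.
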
    opened by BridgetteSong 3
  • Question about VITS KL Loss Formula

    Thank you for this fantastic work. I think the KL loss in VITS might be missing one term. See: https://statproofbook.github.io/P/norm-kl.html#mjx-eqn-eq%3Anorm-KL

    opened by MMingabc 0
  • VCTK Multi Speaker Models Training Results (How many steps required for results like Pretrained)?

    Hello Everyone,

    I'm training on the VCTK dataset (at a 22050 Hz sampling rate) for the multi-speaker model. I have trained for 350,000 steps, yet the synthesis quality is not as good as the pretrained models in the repo. How many steps are needed to get a similar result?

    I resampled the dataset from 48000 Hz to 22050 Hz myself.

    Dataset Source : https://www.kaggle.com/datasets/showmik50/vctk-dataset

    opened by athenasaurav 1
  • Problems about training with multiprocess

    @jaywalnut310 My platform is an RTX 3060 with PyTorch 1.10.0+cu113. I found that the following exception is triggered when executing train_ms.py:

    THCudaCheck FAIL file=../aten/src/THC/THCCachingHostAllocator.cpp line=280 error=710 : device-side assert triggered
    terminate called after throwing an instance of 'c10::Error'
      what():  NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:181, unhandled cuda error, NCCL version 21.0.3
    Process Group destroyed on rank 0
    Exception raised from ncclCommAbort at ../torch/csrc/distributed/c10d/NCCLUtils.hpp:181 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f854408dd62 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
    frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f854408a68b in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
    frame #2: <unknown function> + 0x30a6c6e (0x7f85a136ec6e in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x113 (0x7f85a1357813 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x9 (0x7f85a1357a39 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
    frame #5: <unknown function> + 0xe97556 (0x7f860b65d556 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    frame #6: <unknown function> + 0xe7d085 (0x7f860b643085 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    frame #7: <unknown function> + 0x2a35e8 (0x7f860aa695e8 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    frame #8: <unknown function> + 0x2a48ee (0x7f860aa6a8ee in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    frame #9: <unknown function> + 0x13be28 (0x555afdef1e28 in /environment/miniconda3/bin/python)
    frame #10: PyDict_Clear + 0x133 (0x555afdef1c43 in /environment/miniconda3/bin/python)
    frame #11: <unknown function> + 0x13bc89 (0x555afdef1c89 in /environment/miniconda3/bin/python)
    frame #12: <unknown function> + 0x163141 (0x555afdf19141 in /environment/miniconda3/bin/python)
    frame #13: _PyGC_CollectNoFail + 0x2a (0x555afdf8600a in /environment/miniconda3/bin/python)
    frame #14: PyImport_Cleanup + 0x532 (0x555afdf61582 in /environment/miniconda3/bin/python)
    frame #15: Py_FinalizeEx + 0x6e (0x555afdf9f0de in /environment/miniconda3/bin/python)
    frame #16: Py_Exit + 0x8 (0x555afdf9f1f8 in /environment/miniconda3/bin/python)
    frame #17: <unknown function> + 0x1e92ae (0x555afdf9f2ae in /environment/miniconda3/bin/python)
    frame #18: PyErr_PrintEx + 0x2c (0x555afdf9f2fc in /environment/miniconda3/bin/python)
    frame #19: PyRun_SimpleStringFlags + 0x62 (0x555afdfa4b72 in /environment/miniconda3/bin/python)
    frame #20: <unknown function> + 0x1eec4a (0x555afdfa4c4a in /environment/miniconda3/bin/python)
    frame #21: _Py_UnixMain + 0x3c (0x555afdfa500c in /environment/miniconda3/bin/python)
    frame #22: __libc_start_main + 0xf3 (0x7f860dcdc0b3 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #23: <unknown function> + 0x1c24db (0x555afdf784db in /environment/miniconda3/bin/python)
    
    Traceback (most recent call last):
      File "train_ms.py", line 295, in <module>
        main()
      File "train_ms.py", line 50, in main
        mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
      File "/environment/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/environment/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
        while not context.join():
      File "/environment/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
        raise ProcessRaisedException(msg, error_index, failed_process.pid)
    torch.multiprocessing.spawn.ProcessRaisedException: 
    
    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "/environment/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
        fn(i, *args)
      File "/home/featurize/data/vits-main/train_ms.py", line 119, in run
        train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
      File "/home/featurize/data/vits-main/train_ms.py", line 147, in train_and_evaluate
        (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths, speakers)
      File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
        output = self.module(*inputs[0], **kwargs[0])
      File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/featurize/data/vits-main/models.py", line 467, in forward
        z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
      File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/featurize/data/vits-main/models.py", line 237, in forward
        x = self.enc(x, x_mask, g=g)
      File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/featurize/data/vits-main/modules.py", line 166, in forward
        n_channels_tensor)
    RuntimeError: The following operation failed in the TorchScript interpreter.
    Traceback of TorchScript (most recent call last):
      File "/home/featurize/data/vits-main/commons.py", line 103, in fused_add_tanh_sigmoid_multiply
    def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
      n_channels_int = n_channels[0]
      in_act = input_a + input_b
               ~~~~~~~~~~~~~~~~~ <--- HERE
      t_act = torch.tanh(in_act[:, :n_channels_int, :])
      s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
    RuntimeError: CUDA error: device-side assert triggered
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    

    It seems this problem happens when training on multiple GPUs, but since my platform has only one GPU, I don't know how it happened. I look forward to your kind reply. Thank you!

    opened by BrianWayland 0
  • How to finetune the given pre-trained model?

    I was looking to finetune the VITS model using some custom data, but noticed that the released checkpoints contain only the generator, while the discriminator is also required to continue training from a checkpoint. Is the discriminator model available somewhere, or is there another way to finetune the available models?

    opened by apzl 0
Owner

Jaehyeon Kim

Related projects

• ASEGAN: Speech Enhancement Generative Adversarial Network Based on Asymmetric AutoEncoder, an improved version of the SEGAN model (Nitin, 53 stars, Nov 17, 2022)
• CoSMA: Convolutional Semi-Regular Mesh Autoencoder, from the paper "Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes" (Fraunhofer SCAI, 10 stars, Oct 11, 2022)
• Singing voice synthesis built on VITS and Opencpop; maybe it will be VISinger (AmorTX, 107 stars, Dec 23, 2022)
• Clockwork Variational Autoencoders (CW-VAE), by Vaibhav Saxena, Jimmy Ba, and Danijar Hafner (Vaibhav Saxena, 35 stars, Nov 6, 2022)
• ScreenVAE: implementation of "Manga Filling Style Conversion with Screentone Variational Autoencoder" (SIGGRAPH Asia 2020) (30 stars, Dec 24, 2022)
• Recurrent Variational Autoencoder that generates sequential data, implemented with PyTorch (Daniil Gavrilov, 347 stars, Nov 14, 2022)
• Variational autoencoder for anime face reconstruction (Minzhe Zhang, 2 stars, Dec 11, 2021)
• PyTorch Autoencoders: implementing a Variational Autoencoder (VAE) series in PyTorch (Subin An, 8 stars, Nov 21, 2022)
• Code for "3D Shape Variational Autoencoder Latent Disentanglement via Mini-Batch Feature Swapping for Bodies and Faces" (37 stars, Dec 3, 2022)
• RAVE: official implementation of the Realtime Audio Variational autoEncoder (ACIDS, 587 stars, Jan 1, 2023)
• VAENAR-TTS: PyTorch implementation of "Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis" (Keon Lee, 67 stars, Nov 14, 2022)
• Visual Adversarial Imitation Learning using Variational Models (VMAIL), official NeurIPS 2021 implementation (14 stars, Nov 18, 2022)
• PyTorch implementation of "A Two-Stage End-to-End System for Speech-in-Noise Hearing Aid Processing", the Sheffield entry for the first Clarity enhancement challenge (CEC1) (10 stars, Aug 19, 2022)
• A pure PyTorch batched-computation implementation of "CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition" (張致強, 14 stars, Dec 2, 2022)
• End-to-end Music Remastering System Using Self-supervised and Adversarial Training (Junghyun (Tony) Koo, 37 stars, Dec 15, 2022)
• pix2pix-keras: "Image-to-Image Translation with Conditional Adversarial Networks" implemented in Keras (William Falcon, 141 stars, Dec 30, 2022)
• StudioGAN: a PyTorch library providing implementations of representative Generative Adversarial Networks (GANs) for conditional/unconditional image generation (3k stars, Jan 8, 2023)
• pix2pix-pytorch: PyTorch implementation of "Image-to-Image Translation Using Conditional Adversarial Networks", based on pix2pix by Phillip Isola et al. (mrzhu, 383 stars, Dec 17, 2022)
• SpeakerGAN: unofficial implementation of "SpeakerGAN: Speaker identification with conditional generative adversarial network" (7 stars, Jan 3, 2023)