VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Jaehyeon Kim

Last update: Jan 8, 2023


Overview

Jaehyeon Kim, Jungil Kong, and Juhee Son

In our recent paper, we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.

Visit our demo for audio samples.

We also provide the pretrained models.

[Figure: VITS at training (left) and VITS at inference (right)]

Pre-requisites

1. Python >= 3.6
2. Clone this repository
3. Install python requirements. Please refer to requirements.txt
   1. You may need to install espeak first: apt-get install espeak
4. Download datasets
   1. Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
   2. For the multi-speaker setting, download and extract the VCTK dataset, and downsample the wav files to 22050 Hz. Then rename or create a link to the dataset folder: ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
5. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.

# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace

# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK have already been provided.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt 
# python preprocess.py --text_index 2 --filelists filelists/vctk_audio_sid_text_train_filelist.txt filelists/vctk_audio_sid_text_val_filelist.txt filelists/vctk_audio_sid_text_test_filelist.txt

Training Example

# LJ Speech
python train.py -c configs/ljs_base.json -m ljs_base

# VCTK
python train_ms.py -c configs/vctk_base.json -m vctk_base

Inference Example

See inference.ipynb

Comments

Anybody having luck fine tuning?

I'm using a clean 40hr dataset (female, American) that I used on Tacotron with good results. I've trained VITS on it twice now and it starts overfitting around 70K. It's definitely intelligible and in the correct tone, but the prosody is way off. The first run used the default configs. For the second run I tried decreasing the learning rate and lr decay. It helped some with overall loss, but it still started overfitting around 70K.
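
When adjusting the learning rate and its decay, it can help to see how slowly the default schedule actually moves. The sketch below uses the `learning_rate` (0.0002) and `lr_decay` (0.999875) values that appear in the configs quoted on this page, and assumes the decay is applied as a multiplicative factor once per epoch. That per-epoch assumption is not stated in this thread, so check it against the training script.

```python
def lr_at_epoch(epoch, base_lr=2e-4, lr_decay=0.999875):
    """Learning rate after `epoch` decay steps (assumed: one step per epoch)."""
    return base_lr * lr_decay ** epoch


# Print the schedule at a few epochs to see how gradual the decay is.
for epoch in (0, 100, 1000, 5000):
    print(f"epoch {epoch:>5}: lr = {lr_at_epoch(epoch):.3e}")
```

Under this assumption the rate drops by only about 12% over the first 1000 epochs, so a change to `lr_decay` may take a long time to have a visible effect.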

opened by TaoTeCha 18
KL Loss is right?

When I searched for the KL divergence between two Gaussians, I found this, which is different from your KL loss: https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians
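
One possible source of the difference, offered here as an assumption rather than the authors' confirmed reasoning, is that a KL term can be written either in closed form or as a per-sample estimate that uses a latent drawn from the posterior. The sketch below compares the two for univariate Gaussians parameterized by mean and log standard deviation. The names `m_q`, `logs_q`, `m_p`, `logs_p` mirror the tensor names visible in the tracebacks on this page; the sampled form is illustrative and should be checked against the repository's loss code.

```python
import math
import random


def kl_closed_form(m_q, logs_q, m_p, logs_p):
    """Closed-form KL(q || p) for two univariate Gaussians (log-std inputs)."""
    var_q = math.exp(2.0 * logs_q)
    var_p = math.exp(2.0 * logs_p)
    return logs_p - logs_q + (var_q + (m_q - m_p) ** 2) / (2.0 * var_p) - 0.5


def kl_single_sample(z, logs_q, m_p, logs_p):
    """Per-sample term; its mean over z ~ q equals the closed form.

    The -0.5 is the expectation of the posterior's own quadratic term,
    which is why no explicit var_q appears here.
    """
    return logs_p - logs_q - 0.5 + 0.5 * (z - m_p) ** 2 * math.exp(-2.0 * logs_p)


def kl_monte_carlo(m_q, logs_q, m_p, logs_p, n=200_000, seed=0):
    """Average the per-sample term over n draws from q."""
    rng = random.Random(seed)
    std_q = math.exp(logs_q)
    total = 0.0
    for _ in range(n):
        total += kl_single_sample(rng.gauss(m_q, std_q), logs_q, m_p, logs_p)
    return total / n


# Illustrative parameter values, not taken from any trained model.
params = dict(m_q=0.3, logs_q=math.log(0.8), m_p=-0.2, logs_p=math.log(1.2))
print("closed form :", kl_closed_form(**params))
print("monte carlo :", kl_monte_carlo(**params))
```

The two numbers agree up to sampling noise, which suggests that a formula without an explicit posterior-variance term is not necessarily missing anything if it is evaluated on sampled latents.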
good first issue

opened by BridgetteSong 13
Training error

I just ran train.py and got this error:

INFO:baker_base:{'train': {'log_interval': 200, 'eval_interval': 10000, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 16, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/baker_train.txt', 'validation_files': 'filelists/baker_valid.txt', 'max_wav_value': 32768.0, 'sampling_rate': 16000, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 0}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False}, 'model_dir': './logs\baker_base'}
WARNING:baker_base:E:\vits\ is not a git repository, therefore hash value comparison will be ignored.
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
Traceback (most recent call last):
  File "train.py", line 294, in <module>
    main()
  File "train.py", line 50, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "C:\Python38\lib\site-packages\torch\multiprocessing\spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "C:\Python38\lib\site-packages\torch\multiprocessing\spawn.py", line 198, in start_processes
    while not context.join():
  File "C:\Python38\lib\site-packages\torch\multiprocessing\spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "C:\Python38\lib\site-packages\torch\multiprocessing\spawn.py", line 69, in _wrap
    fn(i, *args)
  File "E:\vits\train.py", line 119, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "E:\vits\train.py", line 139, in train_and_evaluate
    for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths) in enumerate(train_loader):
  File "C:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 530, in __next__
    data = self._next_data()
  File "C:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "C:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 1250, in _process_data
    data.reraise()
  File "C:\Python38\lib\site-packages\torch\_utils.py", line 457, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "C:\Python38\lib\site-packages\torch\utils\data\_utils\worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "C:\Python38\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Python38\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "E:\vits\data_utils.py", line 90, in __getitem__
    return self.get_audio_text_pair(self.audiopaths_and_text[index])
  File "E:\vits\data_utils.py", line 61, in get_audio_text_pair
    spec, wav = self.get_audio(audiopath)
  File "E:\vits\data_utils.py", line 67, in get_audio
    raise ValueError("{} {} SR doesn't match target {} SR".format(
IndexError: Replacement index 2 out of range for positional args tuple

Can anyone help me?
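
The last two lines of the traceback suggest that the `IndexError` is masking a different problem. The format string has three `{}` placeholders, and `str.format` raises `IndexError` when it is given fewer positional arguments than placeholders. The intended message was most likely the `ValueError` about a sample-rate mismatch between a wav file and the configured `sampling_rate` (16000 in the config above). The sketch below reproduces the masking under the assumption that the call received only two arguments; the function names and the numeric values are hypothetical.

```python
def format_sr_error(file_sr, target_sr):
    # Three "{}" placeholders but only two arguments: str.format raises
    # IndexError before any ValueError can be constructed.
    return "{} {} SR doesn't match target {} SR".format(file_sr, target_sr)


def format_sr_error_fixed(path, file_sr, target_sr):
    # Hypothetical fix: supply one argument per placeholder.
    return "{} {} SR doesn't match target {} SR".format(path, file_sr, target_sr)


try:
    format_sr_error(22050, 16000)
    masked = None
except IndexError as exc:
    masked = exc  # the formatting bug hides the real sample-rate message

print(type(masked).__name__)
print(format_sr_error_fixed("example.wav", 22050, 16000))
```

If this reading is right, the practical fix is to resample the audio to the configured rate (or change the config to match the audio), not to debug the data loader.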

opened by Eternity231 6
CPU infer slow
Thank you for sharing your code. I tried it on my own dataset (Chinese) with this config:

{ "train": { "log_interval": 200, "eval_interval": 1000, "seed": 1234, "epochs": 10000, "learning_rate": 2e-4, "betas": [0.8, 0.99], "eps": 1e-9, "batch_size": 32, "fp16_run": false, "lr_decay": 0.999875, "segment_size": 8192, "init_lr_ratio": 1, "warmup_epochs": 0, "c_mel": 45, "c_kl": 1.0 }, "data": { "training_files":"filelists/mt_f065_train_filelist.txt", "validation_files":"filelists/mt_f065_val_filelist.txt", "text_cleaners":["collapse_whitespace"], "max_wav_value": 32768.0, "sampling_rate": 16000, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "n_mel_channels": 80, "mel_fmin": 0.0, "mel_fmax": null, "add_blank": true, "n_speakers": 0, "cleaned_text": false }, "model": { "inter_channels": 192, "hidden_channels": 192, "filter_channels": 768, "n_heads": 2, "n_layers": 6, "kernel_size": 3, "p_dropout": 0.1, "resblock": "1", "resblock_kernel_sizes": [3,7,11], "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]], "upsample_rates": [8,8,2,2], "upsample_initial_channel": 512, "upsample_kernel_sizes": [16,16,4,4], "n_layers_q": 3, "use_spectral_norm": false, "gin_channels": 256 } }

I then ran synthesis on CPU in a single batch, but it is much slower than on GPU:

total audio length: 77.76s
total cost GPU: 4.84s
total cost CPU: 151.26s
average rtf GPU: 0.06
average rtf CPU: 1.95
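
For reference, the reported real-time factors are consistent with the usual definition, processing time divided by audio duration. The short check below recomputes them from the totals given above; the function name is only illustrative.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1.0 means synthesis runs faster than real time."""
    return processing_seconds / audio_seconds


audio_s = 77.76
rtf_gpu = real_time_factor(4.84, audio_s)
rtf_cpu = real_time_factor(151.26, audio_s)
print(f"GPU RTF: {rtf_gpu:.2f}")
print(f"CPU RTF: {rtf_cpu:.2f}")
```

So the CPU run takes roughly twice as long as the audio it produces, and about 31 times as long as the GPU run.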

I checked the checkpoint and found the G_*.pth file size is up to 445M.

My question is:

Is the CPU infer time correct?

Is the checkpoint I'm using correct?

Could you kindly give any ideas about how to make CPU infer faster?

Thank you in advance
opened by OnceJune 6

How to fix the noise during inference time?

Hi Jaehyeon,

May I ask how to fix the stochastic noise during inference time? I want some generated audio to be reproducible, so I need to fix the random noise part. Currently it seems I can only control the noise scale.

sid = torch.LongTensor([1]) # speaker identity
stn_tst = get_text("Tell me the answer please", hps_ms)

with torch.no_grad():
    x_tst = stn_tst.unsqueeze(0)
    x_tst_lengths = torch.LongTensor([stn_tst.size(0)])
    audio = net_g_ms.infer(x_tst, x_tst_lengths, sid = sid, noise_scale=1, noise_scale_w=2, length_scale=1)[0][0,0].data.float().numpy()
ipd.display(ipd.Audio(audio, rate=hps_ms.data.sampling_rate))
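
One approach that may work here, assuming the model draws its noise from PyTorch's global random number generator during `infer`, is to reset the seed with `torch.manual_seed` immediately before each call. The sketch below only demonstrates the seeding behaviour on a plain tensor; `seeded_noise` is a hypothetical helper, not part of the repository.

```python
import torch


def seeded_noise(seed, shape=(1, 4)):
    """Draw Gaussian noise after resetting the global RNG to `seed`."""
    torch.manual_seed(seed)
    return torch.randn(shape)


a = seeded_noise(1234)
b = seeded_noise(1234)
c = seeded_noise(4321)
print(torch.equal(a, b))  # same seed, same noise
print(torch.equal(a, c))  # different seed
```

In the snippet above, that would mean calling `torch.manual_seed(some_seed)` right before `net_g_ms.infer(...)`. On GPU, exact reproducibility may also depend on whether the kernels involved are deterministic.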

opened by xinghua-qu 5

Can I train on a Vietnamese custom dataset? If yes, can you specify how to do it?

I got the error below when training on a custom Vietnamese dataset. At first, I thought some wav files in my dataset might be corrupted, but they are not. How can I solve this?

[INFO] {'train': {'log_interval': 200, 'eval_interval': 1000, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 16, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/viet_audio_text_train_filelist.txt.cleaned', 'validation_files': 'filelists/viet_audio_text_val_filelist.txt.cleaned', 'max_wav_value': 32768.0, 'text_cleaners': ['english_cleaners2'], 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 0, 'cleaned_text': True}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False}, 'model_dir': './logs/viet_base'}
Traceback (most recent call last):
  File "train.py", line 290, in <module>
    main()
  File "train.py", line 50, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/content/gdrive/.shortcut-targets-by-id/1FR_wE-NryBYYhwIzjEI52awuLJDEAFfV/vits/train.py", line 117, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/content/gdrive/.shortcut-targets-by-id/1FR_wE-NryBYYhwIzjEI52awuLJDEAFfV/vits/train.py", line 137, in train_and_evaluate
    for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths) in enumerate(train_loader):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
UnboundLocalError: Caught UnboundLocalError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/gdrive/.shortcut-targets-by-id/1FR_wE-NryBYYhwIzjEI52awuLJDEAFfV/vits/data_utils.py", line 94, in __getitem__
    return self.get_audio_text_pair(self.audiopaths_and_text[index])
  File "/content/gdrive/.shortcut-targets-by-id/1FR_wE-NryBYYhwIzjEI52awuLJDEAFfV/vits/data_utils.py", line 62, in get_audio_text_pair
    spec, wav = self.get_audio(audiopath)
  File "/content/gdrive/.shortcut-targets-by-id/1FR_wE-NryBYYhwIzjEI52awuLJDEAFfV/vits/data_utils.py", line 66, in get_audio
    audio, sampling_rate = load_wav_to_torch(filename)
  File "/content/gdrive/.shortcut-targets-by-id/1FR_wE-NryBYYhwIzjEI52awuLJDEAFfV/vits/utils.py", line 134, in load_wav_to_torch
    sampling_rate, data = read(full_path)
  File "/usr/local/lib/python3.7/dist-packages/scipy/io/wavfile.py", line 609, in read
    return fs, data
UnboundLocalError: local variable 'fs' referenced before assignment

opened by tuannvhust 4
about multi-speaker data

Thanks for your great work. I want to train with my own dataset, so I would like to ask: what does the number '67' mean in your data, and how is it calculated?

E.g.: DUMMY2/p229/p229_128.wav|67|The whole process is a vicious circle at the moment.
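
Judging from the filelist names used in the preprocessing commands on this page (`vctk_audio_sid_text_...`), each line appears to hold an audio path, an integer speaker ID, and the text, separated by `|`. The sketch below parses the example line under that assumption. How the IDs were originally assigned is not stated here, so `build_speaker_ids` shows one hypothetical scheme for a custom dataset: numbering speakers from 0 in sorted order.

```python
def parse_filelist_line(line):
    """Split 'path|speaker_id|text' into its three fields."""
    path, sid, text = line.rstrip("\n").split("|", 2)
    return path, int(sid), text


def build_speaker_ids(speaker_names):
    """Hypothetical scheme: map each distinct speaker name to 0..N-1."""
    return {name: idx for idx, name in enumerate(sorted(set(speaker_names)))}


example = "DUMMY2/p229/p229_128.wav|67|The whole process is a vicious circle at the moment."
path, sid, text = parse_filelist_line(example)
print(path, sid, text)
print(build_speaker_ids(["s2", "s1", "s2", "s3"]))
```

Whatever scheme is used, the IDs presumably need to stay below the `n_speakers` value in the config, since they index a speaker embedding table.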

opened by LH997 4
Mispronunciation, and what is the purpose of the add_blank config?
I've trained on my own dataset with the default config, except for the add_blank option (I changed it to false). I know the add_blank option adds a 0 between the symbol ids, but in my case it has been disabled. I trained with phonemes and have reached 200K steps so far, but some phonemes seem to be pronounced wrong. So I have some questions:

What is the purpose of the add_blank config?

Why does the model pronounce some phonemes wrong, and how can I improve its pronunciation?
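
As a concrete picture of what the option does to the input, the sketch below inserts a blank ID of 0 between symbol IDs, as described in the question. Placing a blank at both ends as well is an assumption of this sketch, and `intersperse` is an illustrative name; compare with the repository's own helper before relying on it.

```python
def intersperse(ids, blank=0):
    """Return ids with `blank` between every pair and at both ends."""
    out = [blank] * (2 * len(ids) + 1)
    out[1::2] = ids  # original IDs occupy the odd positions
    return out


print(intersperse([5, 3, 8]))
```

The input sequence becomes roughly twice as long, so the setting has to be the same at training and inference time; a model trained without blanks will see differently shaped inputs if blanks are enabled later.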
opened by leminhnguyen 4
espeak

Hello, I am looking to make a web demo for this on gradio hub https://gradio.app/hub, could this be used instead of espeak? https://pypi.org/project/py-espeak-ng/

opened by AK391 4
AssertionError: 4D tensors expect 4 values for padding

Training with 4 speakers reported:

  File "train_ms.py", line 299, in <module>
    main()
  File "train_ms.py", line 55, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/ubuntu/9/train_ms.py", line 123, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/home/ubuntu/9/train_ms.py", line 143, in train_and_evaluate
    for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths, speakers) in enumerate(train_loader):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 971, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
AssertionError: Caught AssertionError in DataLoader worker process 6.
Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/9/data_utils.py", line 236, in __getitem__
    return self.get_audio_text_speaker_pair(self.audiopaths_sid_text[index])
  File "/home/ubuntu/9/data_utils.py", line 199, in get_audio_text_speaker_pair
    spec, wav = self.get_audio(audiopath)
  File "/home/ubuntu/9/data_utils.py", line 214, in get_audio
    spec = spectrogram_torch(audio_norm, self.filter_length,
  File "/home/ubuntu/9/mel_processing.py", line 63, in spectrogram_torch
    y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 3567, in _pad
    assert len(pad) == 4, '4D tensors expect 4 values for padding'
AssertionError: 4D tensors expect 4 values for padding

config.jsonnine:{'train': {'log_interval': 200, 'eval_interval': 1000, 'seed': 1234, 'epochs': 10000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 32, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/train_filelist.txt.cleaned', 'validation_files': 'filelists/val_filelist.txt.cleaned', 'text_cleaners': ['japanese_cleaners'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 4, 'cleaned_text': True}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 256}, 'speakers': ['s1', 's2', 's3', 's4'], 'symbols': ['_', ',', '.', '!', '?', '-', 'A', 'E', 'I', 'N', 'O', 'Q', 'U', 'a', 'b', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'ʃ', 'ʧ', '↓', '↑', ' '], 'model_dir': '/home/ubuntu/nine'}
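
A plausible cause of this assertion, though not confirmed for this report, is that some wav files are stereo: an extra channel dimension would make the tensor passed to `pad` four-dimensional instead of three-dimensional. The sketch below uses Python's standard `wave` module to list files whose channel count or sample rate differs from what the config expects. The function names are hypothetical, and `wave` reads only uncompressed PCM files.

```python
import wave


def wav_layout(path):
    """Return (channel_count, sample_rate) read from a PCM wav header."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getframerate()


def find_unexpected(paths, channels=1, sampling_rate=22050):
    """List (path, layout) for files that are not mono at the target rate."""
    bad = []
    for path in paths:
        layout = wav_layout(path)
        if layout != (channels, sampling_rate):
            bad.append((path, layout))
    return bad


# Usage (paths are placeholders):
# for path, (n_ch, sr) in find_unexpected(["s1/0001.wav", "s2/0001.wav"]):
#     print(f"{path}: {n_ch} channel(s) at {sr} Hz")
```

Any file reported here would need to be converted to mono at the configured `sampling_rate` before training.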

opened by kagura114 3
some advice about MAS Algorithm
According to the Glow-TTS definition of MAS:

to satisfy monotonicity and surjection, if z_j and {u_i, σ_i} are aligned, the previous latent variable z_j-1 should have been aligned to either {u_i-1, σ_i-1} or {u_i, σ_i}. This is the same transition structure as Forward Attention for Tacotron 2 (https://arxiv.org/abs/1807.06736). My question is:

Would replacing max(Q_i-1_j-1, Q_i_j-1) with log(exp(Q_i-1_j-1) + exp(Q_i_j-1)) be a better choice?
opened by BridgetteSong 3
Question about VITS KL Loss Formula

Thank you for this fantastic work. I think the KL loss in VITS might be missing one term. See: https://statproofbook.github.io/P/norm-kl.html#mjx-eqn-eq%3Anorm-KL

opened by MMingabc 0
VCTK Multi Speaker Models Training Results (How many steps required for results like Pretrained)?

Hello Everyone,

I'm training the multi-speaker model on a downloaded VCTK dataset (22050 Hz sampling rate). I have trained for 350000 steps, and yet the synthesis quality is not as good as that of the pre-trained models in the repo. How many steps are needed to get a similar result?

I resampled the dataset from 48000 Hz to 22050 Hz myself.
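
Since resampling quality can affect results, it is worth noting that 48000 Hz to 22050 Hz is not an integer ratio, so simply dropping samples is not an option; a proper resampler with anti-aliasing filtering is needed. The small calculation below finds the reduced up/down factors that a rational (polyphase-style) resampler would work with. The function name is illustrative.

```python
from fractions import Fraction


def resample_ratio(src_rate, dst_rate):
    """Return (up, down) such that dst_rate / src_rate == up / down, reduced."""
    ratio = Fraction(dst_rate, src_rate)
    return ratio.numerator, ratio.denominator


up, down = resample_ratio(48000, 22050)
print(up, down)
```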

Dataset Source : https://www.kaggle.com/datasets/showmik50/vctk-dataset

opened by athenasaurav 1

Problems about training with multiprocess

@jaywalnut310 My platform is RTX 3060 with pytorch 1.10.0+cu113. I found the following exception triggered when executing train_ms.py:

THCudaCheck FAIL file=../aten/src/THC/THCCachingHostAllocator.cpp line=280 error=710 : device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:181, unhandled cuda error, NCCL version 21.0.3
Process Group destroyed on rank 0
Exception raised from ncclCommAbort at ../torch/csrc/distributed/c10d/NCCLUtils.hpp:181 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f854408dd62 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f854408a68b in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x30a6c6e (0x7f85a136ec6e in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x113 (0x7f85a1357813 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x9 (0x7f85a1357a39 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #5: <unknown function> + 0xe97556 (0x7f860b65d556 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xe7d085 (0x7f860b643085 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x2a35e8 (0x7f860aa695e8 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x2a48ee (0x7f860aa6a8ee in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x13be28 (0x555afdef1e28 in /environment/miniconda3/bin/python)
frame #10: PyDict_Clear + 0x133 (0x555afdef1c43 in /environment/miniconda3/bin/python)
frame #11: <unknown function> + 0x13bc89 (0x555afdef1c89 in /environment/miniconda3/bin/python)
frame #12: <unknown function> + 0x163141 (0x555afdf19141 in /environment/miniconda3/bin/python)
frame #13: _PyGC_CollectNoFail + 0x2a (0x555afdf8600a in /environment/miniconda3/bin/python)
frame #14: PyImport_Cleanup + 0x532 (0x555afdf61582 in /environment/miniconda3/bin/python)
frame #15: Py_FinalizeEx + 0x6e (0x555afdf9f0de in /environment/miniconda3/bin/python)
frame #16: Py_Exit + 0x8 (0x555afdf9f1f8 in /environment/miniconda3/bin/python)
frame #17: <unknown function> + 0x1e92ae (0x555afdf9f2ae in /environment/miniconda3/bin/python)
frame #18: PyErr_PrintEx + 0x2c (0x555afdf9f2fc in /environment/miniconda3/bin/python)
frame #19: PyRun_SimpleStringFlags + 0x62 (0x555afdfa4b72 in /environment/miniconda3/bin/python)
frame #20: <unknown function> + 0x1eec4a (0x555afdfa4c4a in /environment/miniconda3/bin/python)
frame #21: _Py_UnixMain + 0x3c (0x555afdfa500c in /environment/miniconda3/bin/python)
frame #22: __libc_start_main + 0xf3 (0x7f860dcdc0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #23: <unknown function> + 0x1c24db (0x555afdf784db in /environment/miniconda3/bin/python)

Traceback (most recent call last):
  File "train_ms.py", line 295, in <module>
    main()
  File "train_ms.py", line 50, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/featurize/data/vits-main/train_ms.py", line 119, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/home/featurize/data/vits-main/train_ms.py", line 147, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths, speakers)
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/featurize/data/vits-main/models.py", line 467, in forward
    z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/featurize/data/vits-main/models.py", line 237, in forward
    x = self.enc(x, x_mask, g=g)
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/featurize/data/vits-main/modules.py", line 166, in forward
    n_channels_tensor)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/featurize/data/vits-main/commons.py", line 103, in fused_add_tanh_sigmoid_multiply
def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
  n_channels_int = n_channels[0]
  in_act = input_a + input_b
           ~~~~~~~~~~~~~~~~~ <--- HERE
  t_act = torch.tanh(in_act[:, :n_channels_int, :])
  s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

It seems that this problem is related to training on multiple GPUs. Since my platform has only one GPU, I don't know how this happened. I look forward to your kind reply. Thank you!
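
A device-side assert on a single GPU is often an indexing problem rather than a multi-GPU one, for example an embedding lookup that receives an index outside its table. That is only one possibility and is not confirmed for this report; as the log itself notes, the reported location may be wrong unless `CUDA_LAUNCH_BLOCKING=1` is set. As a cheap first check, the sketch below scans filelist lines for speaker IDs outside the range implied by `n_speakers`, assuming the `path|speaker_id|text` layout. The function name is hypothetical.

```python
def out_of_range_speaker_ids(filelist_lines, n_speakers):
    """Return (line_number, speaker_id) for IDs outside 0..n_speakers-1."""
    bad = []
    for lineno, line in enumerate(filelist_lines, start=1):
        fields = line.rstrip("\n").split("|")
        sid = int(fields[1])
        if not 0 <= sid < n_speakers:
            bad.append((lineno, sid))
    return bad


# Toy filelist: the second line uses ID 4 with only 4 speakers configured.
lines = ["a.wav|0|hello", "b.wav|4|world", "c.wav|3|again"]
print(out_of_range_speaker_ids(lines, n_speakers=4))
```

The same kind of range check could be applied to symbol IDs against the size of the symbol table.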

opened by BrianWayland 0

How to finetune the given pre-trained model?

I was looking to finetune the vits model using some custom data, but noticed that the released model is only the generator, and discriminator is also required to continue training from the checkpoint. Is the discriminator model available somewhere or is there any way to finetune the available model?

opened by apzl 0

Owner

Jaehyeon Kim

GitHub

Speech Enhancement Generative Adversarial Network Based on Asymmetric AutoEncoder

ASEGAN: Speech Enhancement Generative Adversarial Network Based on Asymmetric AutoEncoder Chinese introduction Readme with English Version Introduces an improved version based on the SEGAN model, using a self-designed non-

53 Nov 17, 2022

CoSMA: Convolutional Semi-Regular Mesh Autoencoder. From Paper "Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes"

Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes Implementation of CoSMA: Convolutional Semi-Regular Mesh Autoencoder arXiv p

10 Oct 11, 2022

Use VITS and Opencpop to develop singing voice synthesis; Maybe it will VISinger.

Init Use VITS and Opencpop to develop singing voice synthesis; Maybe it will VISinger. This project is based on https://github.com/jaywalnut310/vits https://github.com/S

107 Dec 23, 2022

Clockwork Variational Autoencoder

Clockwork Variational Autoencoders (CW-VAE) Vaibhav Saxena, Jimmy Ba, Danijar Hafner If you find this code useful, please reference in your paper: @ar

35 Nov 6, 2022

Implementation for "Manga Filling Style Conversion with Screentone Variational Autoencoder" (SIGGRAPH ASIA 2020 issue)

Manga Filling with ScreenVAE SIGGRAPH ASIA 2020 | Project Website | BibTex This repository is for ScreenVAE introduced in the following paper "Manga F

30 Dec 24, 2022

Recurrent Variational Autoencoder that generates sequential data implemented with pytorch

Pytorch Recurrent Variational Autoencoder Model: This is the implementation of Samuel Bowman's Generating Sentences from a Continuous Space with Kim's

347 Nov 14, 2022

Variational autoencoder for anime face reconstruction

VAE animeface Variational autoencoder for anime face reconstruction Introduction This repository is an exploratory example to train a variational auto

2 Dec 11, 2021

PyTorch Autoencoders - Implementing a Variational Autoencoder (VAE) Series in Pytorch.

PyTorch Autoencoders Implementing a Variational Autoencoder (VAE) Series in Pytorch. Inspired by this repository Model List check model paper conferen

8 Nov 21, 2022

Code of 3D Shape Variational Autoencoder Latent Disentanglement via Mini-Batch Feature Swapping for Bodies and Faces

3D Shape Variational Autoencoder Latent Disentanglement via Mini-Batch Feature Swapping for Bodies and Faces Installation After cloning the repo open

37 Dec 3, 2022

Official implementation of the RAVE model: a Realtime Audio Variational autoEncoder

RAVE: Realtime Audio Variational autoEncoder Official implementation of RAVE: A variational autoencoder for fast and high-quality neural audio synthes

587 Jan 1, 2023

PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

VAENAR-TTS - PyTorch Implementation PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

67 Nov 14, 2022

Visual Adversarial Imitation Learning using Variational Models (VMAIL)

Visual Adversarial Imitation Learning using Variational Models (VMAIL) This is the official implementation of the NeurIPS 2021 paper. Project website

14 Nov 18, 2022

PyTorch implementation of "A Two-Stage End-to-End System for Speech-in-Noise Hearing Aid Processing"

Implementation of the Sheffield entry for the first Clarity enhancement challenge (CEC1) This repository contains the PyTorch implementation of "A Two

10 Aug 19, 2022

A pure PyTorch batched computation implementation of "CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition"

14 Dec 2, 2022

E2e music remastering system - End-to-end Music Remastering System Using Self-supervised and Adversarial Training

End-to-end Music Remastering System This repository includes source code and pre

37 Dec 15, 2022

Image-to-Image Translation with Conditional Adversarial Networks (Pix2pix) implementation in keras

pix2pix-keras Pix2pix implementation in keras. Original paper: Image-to-Image Translation with Conditional Adversarial Networks (pix2pix) Paper Author

141 Dec 30, 2022

StudioGAN is a Pytorch library providing implementations of representative Generative Adversarial Networks (GANs) for conditional/unconditional image generation.

3k Jan 8, 2023

PyTorch implementation of "Image-to-Image Translation Using Conditional Adversarial Networks".

pix2pix-pytorch PyTorch implementation of Image-to-Image Translation Using Conditional Adversarial Networks. Based on pix2pix by Phillip Isola et al.

383 Dec 17, 2022

Unofficial implement with paper SpeakerGAN: Speaker identification with conditional generative adversarial network

Introduction This repository is about paper SpeakerGAN , and is unofficially implemented by Mingming Huang ([email protected]), Tiezheng Wang (wtz920729

7 Jan 3, 2023