Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch

Tomoki Hayashi

Last update: Dec 23, 2022

Overview

Parallel WaveGAN implementation with Pytorch

This repository provides UNOFFICIAL PyTorch implementations of the following models: Parallel WaveGAN, MelGAN, Multi-band MelGAN, HiFi-GAN, and StyleMelGAN.

You can combine these state-of-the-art non-autoregressive models to build your own great vocoder!

Please check our samples on our demo page.

Source of the figure: https://arxiv.org/pdf/1910.11480.pdf

The goal of this repository is to provide a real-time neural vocoder that is compatible with ESPnet-TTS.
This repository can also be combined with an NVIDIA/tacotron2-based implementation (see this comment).

You can try the real-time end-to-end text-to-speech demonstration in Google Colab!

Real-time demonstration with ESPnet2
Real-time demonstration with ESPnet1

What's new

2021/10/21 Single-speaker Korean recipe [egs/kss/voc1] is available.
2021/08/24 Add more pretrained models of StyleMelGAN and HiFi-GAN.
2021/08/07 Add initial pretrained models of StyleMelGAN and HiFi-GAN.
2021/08/03 Support StyleMelGAN generator and discriminator!
2021/08/02 Support HiFi-GAN generator and discriminator!
2020/10/07 JSSS recipe is available!
2020/08/19 Real-time demo with ESPnet2 is available!
2020/05/29 VCTK, JSUT, and CSMSC multi-band MelGAN pretrained model is available!
2020/05/27 New LJSpeech multi-band MelGAN pretrained model is available!
2020/05/24 LJSpeech full-band MelGAN pretrained model is available!
2020/05/22 LJSpeech multi-band MelGAN pretrained model is available!
2020/05/16 Multi-band MelGAN is available!
2020/03/25 LibriTTS pretrained models are available!
2020/03/17 Tensorflow conversion example notebook is available (Thanks, @dathudeptrai)!
2020/03/16 LibriTTS recipe is available!
2020/03/12 PWG G + MelGAN D + STFT-loss samples are available!
2020/03/12 Multi-speaker English recipe egs/vctk/voc1 is available!
2020/02/22 MelGAN G + MelGAN D + STFT-loss samples are available!
2020/02/12 Support MelGAN's discriminator!
2020/02/08 Support MelGAN's generator!

Requirements

This repository is tested on Ubuntu 20.04 with a Titan V GPU.

Python 3.6+
Cuda 10.0+
CuDNN 7+
NCCL 2+ (for distributed multi-gpu training)
libsndfile (you can install via sudo apt install libsndfile-dev in ubuntu)
jq (you can install via sudo apt install jq in ubuntu)
sox (you can install via sudo apt install sox in ubuntu)

Other CUDA versions should work but are not explicitly tested.
All of the code is tested on PyTorch 1.4, 1.5.1, 1.7.1, 1.8.1, and 1.9.

Pytorch 1.6 works but there are some issues in cpu mode (See #198).

Setup

You can select the installation method from two alternatives.

A. Use pip

$ git clone https://github.com/kan-bayashi/ParallelWaveGAN.git
$ cd ParallelWaveGAN
$ pip install -e .
# If you want to use distributed training, please install
# apex manually by following https://github.com/NVIDIA/apex
$ ...

Note that to install apex, your CUDA version must exactly match the version used to build the PyTorch binary.
To install PyTorch compiled with a different CUDA version, see tools/Makefile.

B. Make virtualenv

$ git clone https://github.com/kan-bayashi/ParallelWaveGAN.git
$ cd ParallelWaveGAN/tools
$ make
# If you want to use distributed training, please run following
# command to install apex.
$ make apex

Note that the Makefile specifies the CUDA version used to compile the PyTorch wheel.
If you want to use a different CUDA version, edit tools/Makefile to change the PyTorch wheel to be installed.

Recipe

This repository provides Kaldi-style recipes, in the same way as ESPnet.
Currently, the following recipes are supported.

LJSpeech: English female speaker
JSUT: Japanese female speaker
JSSS: Japanese female speaker
CSMSC: Mandarin female speaker
CMU Arctic: English speakers
JNAS: Japanese multi-speaker
VCTK: English multi-speaker
LibriTTS: English multi-speaker
YesNo: English speaker (For debugging)

To run a recipe, follow the instructions below.

# Move to the recipe directory
$ cd egs/ljspeech/voc1

# Run the recipe from scratch
$ ./run.sh

# You can change config via command line
$ ./run.sh --conf <your_customized_yaml_config>

# You can select the stage to start and stop
$ ./run.sh --stage 2 --stop_stage 2

# If you want to specify the gpu
$ CUDA_VISIBLE_DEVICES=1 ./run.sh --stage 2

# If you want to resume training from 10000 steps checkpoint
$ ./run.sh --stage 2 --resume <path>/<to>/checkpoint-10000steps.pkl

See more info about the recipes in this README.

Speed

The decoding speed is RTF = 0.016 on a TITAN V, much faster than real time.

[decode]: 100%|██████████| 250/250 [00:30<00:00,  8.31it/s, RTF=0.0156]
2019-11-03 09:07:40,480 (decode:127) INFO: finished generation of 250 utterances (RTF = 0.016).
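
RTF in these logs is the real-time factor: generation time divided by the duration of the generated audio, so values below 1 mean faster than real time. A minimal sketch of that ratio follows; the function name and the example numbers are illustrative and are not taken from the repository's decode script.

```python
def real_time_factor(elapsed_sec, num_samples, sampling_rate):
    """Return generation time divided by audio duration (RTF)."""
    audio_sec = num_samples / sampling_rate
    return elapsed_sec / audio_sec


# Illustrative numbers: 0.16 s spent generating 10 s of 22.05 kHz audio.
rtf = real_time_factor(0.16, 220500, 22050)
print(f"RTF = {rtf:.3f}")  # → RTF = 0.016
```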

Even on a CPU (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz 16 threads), generation is faster than real time.

[decode]: 100%|██████████| 250/250 [22:16<00:00,  5.35s/it, RTF=0.841]
2019-11-06 09:04:56,697 (decode:129) INFO: finished generation of 250 utterances (RTF = 0.734).

If you use MelGAN's generator, decoding is even faster.

# On CPU (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz 16 threads)
[decode]: 100%|██████████| 250/250 [04:00<00:00,  1.04it/s, RTF=0.0882]
2020-02-08 10:45:14,111 (decode:142) INFO: Finished generation of 250 utterances (RTF = 0.137).

# On GPU (TITAN V)
[decode]: 100%|██████████| 250/250 [00:06<00:00, 36.38it/s, RTF=0.00189]
2020-02-08 05:44:42,231 (decode:142) INFO: Finished generation of 250 utterances (RTF = 0.002).

If you use Multi-band MelGAN's generator, decoding is faster still.

# On CPU (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz 16 threads)
[decode]: 100%|██████████| 250/250 [01:47<00:00,  2.95it/s, RTF=0.048]
2020-05-22 15:37:19,771 (decode:151) INFO: Finished generation of 250 utterances (RTF = 0.059).

# On GPU (TITAN V)
[decode]: 100%|██████████| 250/250 [00:05<00:00, 43.67it/s, RTF=0.000928]
2020-05-22 15:35:13,302 (decode:151) INFO: Finished generation of 250 utterances (RTF = 0.001).

If you want to accelerate inference further, it is worth trying the conversion from PyTorch to TensorFlow.
An example of the conversion is available in the notebook (provided by @dathudeptrai).

Results

The results are summarized in the table below.
You can listen to the samples and download the pretrained models from the links to our Google Drive.

Model	Conf	Lang	Fs [Hz]	Mel range [Hz]	FFT / Hop / Win [pt]	# iters
ljspeech_parallel_wavegan.v1	link	EN	22.05k	80-7600	1024 / 256 / None	400k
ljspeech_parallel_wavegan.v1.long	link	EN	22.05k	80-7600	1024 / 256 / None	1M
ljspeech_parallel_wavegan.v1.no_limit	link	EN	22.05k	None	1024 / 256 / None	400k
ljspeech_parallel_wavegan.v3	link	EN	22.05k	80-7600	1024 / 256 / None	3M
ljspeech_melgan.v1	link	EN	22.05k	80-7600	1024 / 256 / None	400k
ljspeech_melgan.v1.long	link	EN	22.05k	80-7600	1024 / 256 / None	1M
ljspeech_melgan_large.v1	link	EN	22.05k	80-7600	1024 / 256 / None	400k
ljspeech_melgan_large.v1.long	link	EN	22.05k	80-7600	1024 / 256 / None	1M
ljspeech_melgan.v3	link	EN	22.05k	80-7600	1024 / 256 / None	2M
ljspeech_melgan.v3.long	link	EN	22.05k	80-7600	1024 / 256 / None	4M
ljspeech_full_band_melgan.v1	link	EN	22.05k	80-7600	1024 / 256 / None	1M
ljspeech_full_band_melgan.v2	link	EN	22.05k	80-7600	1024 / 256 / None	1M
ljspeech_multi_band_melgan.v1	link	EN	22.05k	80-7600	1024 / 256 / None	1M
ljspeech_multi_band_melgan.v2	link	EN	22.05k	80-7600	1024 / 256 / None	1M
ljspeech_hifigan.v1	link	EN	22.05k	80-7600	1024 / 256 / None	2.5M
ljspeech_style_melgan.v1	link	EN	22.05k	80-7600	1024 / 256 / None	1.5M
jsut_parallel_wavegan.v1	link	JP	24k	80-7600	2048 / 300 / 1200	400k
jsut_multi_band_melgan.v2	link	JP	24k	80-7600	2048 / 300 / 1200	1M
jsut_hifigan.v1	link	JP	24k	80-7600	2048 / 300 / 1200	2.5M
jsut_style_melgan.v1	link	JP	24k	80-7600	2048 / 300 / 1200	1.5M
csmsc_parallel_wavegan.v1	link	ZH	24k	80-7600	2048 / 300 / 1200	400k
csmsc_multi_band_melgan.v2	link	ZH	24k	80-7600	2048 / 300 / 1200	1M
csmsc_hifigan.v1	link	ZH	24k	80-7600	2048 / 300 / 1200	2.5M
csmsc_style_melgan.v1	link	ZH	24k	80-7600	2048 / 300 / 1200	1.5M
arctic_slt_parallel_wavegan.v1	link	EN	16k	80-7600	1024 / 256 / None	400k
jnas_parallel_wavegan.v1	link	JP	16k	80-7600	1024 / 256 / None	400k
vctk_parallel_wavegan.v1	link	EN	24k	80-7600	2048 / 300 / 1200	400k
vctk_parallel_wavegan.v1.long	link	EN	24k	80-7600	2048 / 300 / 1200	1M
vctk_multi_band_melgan.v2	link	EN	24k	80-7600	2048 / 300 / 1200	1M
vctk_hifigan.v1	link	EN	24k	80-7600	2048 / 300 / 1200	2.5M
vctk_style_melgan.v1	link	EN	24k	80-7600	2048 / 300 / 1200	1.5M
libritts_parallel_wavegan.v1	link	EN	24k	80-7600	2048 / 300 / 1200	400k
libritts_parallel_wavegan.v1.long	link	EN	24k	80-7600	2048 / 300 / 1200	1M
libritts_multi_band_melgan.v2	link	EN	24k	80-7600	2048 / 300 / 1200	1M
libritts_hifigan.v1	link	EN	24k	80-7600	2048 / 300 / 1200	2.5M
libritts_style_melgan.v1	link	EN	24k	80-7600	2048 / 300 / 1200	1.5M
kss_parallel_wavegan.v1	link	KO	24k	80-7600	2048 / 300 / 1200	400k
hui_acg_hokuspokus_parallel_wavegan.v1	link	DE	24k	80-7600	2048 / 300 / 1200	400k
ruslan_parallel_wavegan.v1	link	RU	24k	80-7600	2048 / 300 / 1200	400k

Please visit our Google Drive to check more results.
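
The FFT / Hop / Win columns are given in sample points, so the frame shift in milliseconds depends on the sampling rate. The sketch below converts the hop settings that appear in the table; the helper name is hypothetical.

```python
def hop_to_ms(hop_size, sampling_rate):
    """Convert a hop size in samples to a frame shift in milliseconds."""
    return 1000.0 * hop_size / sampling_rate


# (sampling rate [Hz], hop size [pt]) pairs taken from the table above.
for fs, hop in [(22050, 256), (24000, 300), (16000, 256)]:
    print(f"fs={fs} Hz, hop={hop} pt -> {hop_to_ms(hop, fs):.2f} ms per frame")
```

This is a quick way to check that a Text2Mel model and a vocoder agree on the frame shift before combining them.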

How-to-use pretrained models

Analysis-synthesis

The following minimal commands perform analysis-synthesis using a pretrained model.

# Please make sure you installed `parallel_wavegan`
# If not, please install via pip
$ pip install parallel_wavegan

# You can download the pretrained model from terminal
$ python << EOF
from parallel_wavegan.utils import download_pretrained_model
download_pretrained_model("<pretrained_model_tag>", "pretrained_model")
EOF

# You can get all of available pretrained models as follows:
$ python << EOF
from parallel_wavegan.utils import PRETRAINED_MODEL_LIST
print(PRETRAINED_MODEL_LIST.keys())
EOF

# Now you can find the downloaded pretrained model in `pretrained_model/<pretrained_model_tag>/`
$ ls pretrained_model/<pretrained_model_tag>
  checkpoint-400000steps.pkl    config.yml    stats.h5

# These files can also be downloaded manually from the above results

# Please put an audio file in `sample` directory to perform analysis-synthesis
$ ls sample/
  sample.wav

# Then perform feature extraction -> feature normalization -> synthesis
$ parallel-wavegan-preprocess \
    --config pretrained_model/<pretrained_model_tag>/config.yml \
    --rootdir sample \
    --dumpdir dump/sample/raw
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 914.19it/s]
$ parallel-wavegan-normalize \
    --config pretrained_model/<pretrained_model_tag>/config.yml \
    --rootdir dump/sample/raw \
    --dumpdir dump/sample/norm \
    --stats pretrained_model/<pretrained_model_tag>/stats.h5
2019-11-13 13:44:29,574 (normalize:87) INFO: the number of files = 1.
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 513.13it/s]
$ parallel-wavegan-decode \
    --checkpoint pretrained_model/<pretrained_model_tag>/checkpoint-400000steps.pkl \
    --dumpdir dump/sample/norm \
    --outdir sample
2019-11-13 13:44:31,229 (decode:91) INFO: the number of features to be decoded = 1.
[decode]: 100%|███████████████████| 1/1 [00:00<00:00, 18.33it/s, RTF=0.0146]
2019-11-13 13:44:37,132 (decode:129) INFO: finished generation of 1 utterances (RTF = 0.015).

# You can skip normalization step (on-the-fly normalization, feature extraction -> synthesis)
$ parallel-wavegan-preprocess \
    --config pretrained_model/<pretrained_model_tag>/config.yml \
    --rootdir sample \
    --dumpdir dump/sample/raw
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 914.19it/s]
$ parallel-wavegan-decode \
    --checkpoint pretrained_model/<pretrained_model_tag>/checkpoint-400000steps.pkl \
    --dumpdir dump/sample/raw \
    --normalize-before \
    --outdir sample
2019-11-13 13:44:31,229 (decode:91) INFO: the number of features to be decoded = 1.
[decode]: 100%|███████████████████| 1/1 [00:00<00:00, 18.33it/s, RTF=0.0146]
2019-11-13 13:44:37,132 (decode:129) INFO: finished generation of 1 utterances (RTF = 0.015).

# you can find the generated speech in `sample` directory
$ ls sample
  sample.wav    sample_gen.wav

Decoding with ESPnet-TTS model's features

Here, I show the procedure to generate waveforms with features generated by ESPnet-TTS models.

# Make sure you already finished running the recipe of ESPnet-TTS.
# You must use the same feature settings for both Text2Mel and Mel2Wav models.
# Move to the ESPnet recipe directory
$ cd /path/to/espnet/egs/<recipe_name>/tts1
$ pwd
/path/to/espnet/egs/<recipe_name>/tts1

# If you use ESPnet2, move to `egs2/` instead
$ cd /path/to/espnet/egs2/<recipe_name>/tts1
$ pwd
/path/to/espnet/egs2/<recipe_name>/tts1

# Please install this repository in ESPnet conda (or virtualenv) environment
$ . ./path.sh && pip install -U parallel_wavegan

# You can download the pretrained model from terminal
$ python << EOF
from parallel_wavegan.utils import download_pretrained_model
download_pretrained_model("<pretrained_model_tag>", "pretrained_model")
EOF

# You can get all of available pretrained models as follows:
$ python << EOF
from parallel_wavegan.utils import PRETRAINED_MODEL_LIST
print(PRETRAINED_MODEL_LIST.keys())
EOF

# You can find the downloaded pretrained model in `pretrained_model/<pretrained_model_tag>/`
$ ls pretrained_model/<pretrained_model_tag>
  checkpoint-400000steps.pkl    config.yml    stats.h5

# These files can also be downloaded manually from the above results

Case 1: If you use the same dataset for both Text2Mel and Mel2Wav

# In this case, you can directly use generated features for decoding.
# Please specify `feats.scp` path for `--feats-scp`, which is located in
# exp/<your_model_dir>/outputs_*_decode/<set_name>/feats.scp.
# Do not use outputs_*decode_denorm/<set_name>/feats.scp, since it contains
# de-normalized features (the input for PWG must be normalized features).
$ parallel-wavegan-decode \
    --checkpoint pretrained_model/<pretrained_model_tag>/checkpoint-400000steps.pkl \
    --feats-scp exp/<your_model_dir>/outputs_*_decode/<set_name>/feats.scp \
    --outdir <path_to_outdir>

# In the case of ESPnet2, the generated feature can be found in
# exp/<your_model_dir>/decode_*/<set_name>/norm/feats.scp.
$ parallel-wavegan-decode \
    --checkpoint pretrained_model/<pretrained_model_tag>/checkpoint-400000steps.pkl \
    --feats-scp exp/<your_model_dir>/decode_*/<set_name>/norm/feats.scp \
    --outdir <path_to_outdir>

# You can find the generated waveforms in <path_to_outdir>/.
$ ls <path_to_outdir>
  utt_id_1_gen.wav    utt_id_2_gen.wav  ...    utt_id_N_gen.wav

Case 2: If you use different datasets for Text2Mel and Mel2Wav models

# In this case, you must provide `--normalize-before` option additionally.
# And use `feats.scp` of de-normalized generated features.

# ESPnet1 case
$ parallel-wavegan-decode \
    --checkpoint pretrained_model/<pretrained_model_tag>/checkpoint-400000steps.pkl \
    --feats-scp exp/<your_model_dir>/outputs_*_decode_denorm/<set_name>/feats.scp \
    --outdir <path_to_outdir> \
    --normalize-before

# ESPnet2 case
$ parallel-wavegan-decode \
    --checkpoint pretrained_model/<pretrained_model_tag>/checkpoint-400000steps.pkl \
    --feats-scp exp/<your_model_dir>/decode_*/<set_name>/denorm/feats.scp \
    --outdir <path_to_outdir> \
    --normalize-before

# You can find the generated waveforms in <path_to_outdir>/.
$ ls <path_to_outdir>
  utt_id_1_gen.wav    utt_id_2_gen.wav  ...    utt_id_N_gen.wav
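
The two cases differ only in which `feats.scp` is passed and whether `--normalize-before` is added. A small sketch of that choice follows; `build_decode_command` is a hypothetical helper, not part of the package, and it only assembles an argument list from the options shown above.

```python
def build_decode_command(checkpoint, feats_scp, outdir, same_dataset):
    """Assemble parallel-wavegan-decode arguments for Case 1 or Case 2."""
    cmd = [
        "parallel-wavegan-decode",
        "--checkpoint", checkpoint,
        "--feats-scp", feats_scp,
        "--outdir", outdir,
    ]
    if not same_dataset:
        # Case 2: de-normalized features are re-normalized
        # with the vocoder's own statistics at decoding time.
        cmd.append("--normalize-before")
    return cmd


# Placeholder paths, for illustration only.
print(" ".join(build_decode_command("ckpt.pkl", "denorm/feats.scp", "wav", False)))
```

The returned list can be passed to `subprocess.run` if you prefer to drive decoding from Python.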

If you want to combine these models in python, you can try the real-time demonstration in Google Colab!

Real-time demonstration with ESPnet2
Real-time demonstration with ESPnet1

Decoding with dumped npy files

Sometimes you may want to decode from dumped npy files that contain mel-spectrograms generated by TTS models. Make sure the features were extracted with the same settings as the pretrained vocoder (fs, fft_size, hop_size, win_length, fmin, and fmax).
The only setting that may differ is log_base, which can be compensated for with some post-processing (log 10 is used by default instead of the natural log). See the details in the comment.

# Generate dummy npy file of mel-spectrogram
$ ipython
[ins] In [1]: import numpy as np
[ins] In [2]: x = np.random.randn(512, 80)  # (#frames, #mels)
[ins] In [3]: np.save("dummy_1.npy", x)
[ins] In [4]: y = np.random.randn(256, 80)  # (#frames, #mels)
[ins] In [5]: np.save("dummy_2.npy", y)
[ins] In [6]: exit

# Make scp file (key-path format)
$ find . -name "*.npy" | sort | awk '{print "dummy_" NR " " $1}' > feats.scp

# Check (<utt_id> <path>)
$ cat feats.scp
dummy_1 ./dummy_1.npy
dummy_2 ./dummy_2.npy
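
Each line of `feats.scp` is an utterance ID followed by a path. A minimal sketch for reading such lines into a dictionary is shown below, for example to check the entries before decoding; the function is hypothetical and handles only this plain key-path form.

```python
def parse_scp(lines):
    """Map utterance IDs to paths from '<utt_id> <path>' lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        utt_id, path = line.split(maxsplit=1)
        table[utt_id] = path
    return table


entries = parse_scp(["dummy_1 ./dummy_1.npy", "dummy_2 ./dummy_2.npy"])
print(entries)
# → {'dummy_1': './dummy_1.npy', 'dummy_2': './dummy_2.npy'}
```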

# Decode without feature normalization
# This case assumes that the input mel-spectrogram is normalized with the same statistics of the pretrained model.
$ parallel-wavegan-decode \
    --checkpoint /path/to/checkpoint-400000steps.pkl \
    --feats-scp ./feats.scp \
    --outdir wav
2021-08-10 09:13:07,624 (decode:140) INFO: The number of features to be decoded = 2.
[decode]: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 13.84it/s, RTF=0.00264]
2021-08-10 09:13:29,660 (decode:174) INFO: Finished generation of 2 utterances (RTF = 0.005).

# Decode with feature normalization
# This case assumes that the input mel-spectrogram is not normalized.
$ parallel-wavegan-decode \
    --checkpoint /path/to/checkpoint-400000steps.pkl \
    --feats-scp ./feats.scp \
    --normalize-before \
    --outdir wav
2021-08-10 09:13:07,624 (decode:140) INFO: The number of features to be decoded = 2.
[decode]: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 13.84it/s, RTF=0.00264]
2021-08-10 09:13:29,660 (decode:174) INFO: Finished generation of 2 utterances (RTF = 0.005).
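
For the log_base mismatch mentioned at the start of this section, converting between bases is a constant rescaling, since log10(x) = ln(x) / ln(10). The sketch below converts natural-log mel values to log 10; the function name is illustrative, and it assumes the values are raw (not yet normalized) log-mel features.

```python
import math


def ln_to_log10(frames):
    """Rescale natural-log mel values to log10: log10(x) = ln(x) / ln(10)."""
    return [[value / math.log(10.0) for value in frame] for frame in frames]


# ln(100) is converted to log10(100), which is 2 up to rounding.
converted = ln_to_log10([[math.log(100.0), 0.0]])
print(converted)
```

After this conversion the features can be decoded with `--normalize-before`, as in the second command above.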

References

Acknowledgement

The author would like to thank Ryuichi Yamamoto (@r9y9) for his great repository, paper, and valuable discussions.

Author

Tomoki Hayashi (@kan-bayashi)
E-mail: hayashi.tomoki<at>g.sp.m.is.nagoya-u.ac.jp

Comments

Multi-band MelGAN

Hi,

just found https://arxiv.org/pdf/2005.05106.pdf

It seems to provide significantly better quality than regular MelGAN, and is also stunningly fast (0.03 RTF on CPU). The authors will be publishing the code shortly.

Any chances we will see an implementation in this great repo? =)
discussion

opened by alexdemartos 154
Generator exploded after ~138K iters.

I observed an interesting behaviour after 138K iters, where the discriminator dominated the training and the generator exploded in both train and validation losses. Do you have any idea why, and how to prevent it?

I am training on LJSpeech and I basically use the same learning schedule you released with the v2 config for LJSpeech. (Train generator until 100K and enable the discriminator)

Here is the tensorboard screenshot.

discussion

opened by erogol 58
Cannot use WaveGAN with Glow-TTS and NVIDIA's Tacotron2

Hi. I trained the tacotron2 model (https://github.com/NVIDIA/tacotron2) and Glow-TTS model (https://github.com/jaywalnut310/glow-tts) by using the LJ speech dataset and can successfully synthesize voice by using WaveGlow as vocoder. However, when I turned to the Parallel WaveGan, the synthesized waveform is quite strange:

(In the training time, the hop_size, sample_rate and window_size were set as the same for the tacotron, WaveGlow and waveGan model.)

I successfully synthesized speech using WaveGan with espnet's FastSpeech, but I failed to use waveGan to synthesize intelligible voice with any model derived from NVIDIA's Tacotron2 implementation (e.g. Glow-TTS). Could you please give me any advice? (Because in NVIDIA's Tacotron2, there is no cmvn to the input mel-spectrogram features, so I didn't calculate the cmvn of the training waves and didn't invert it back at the inference time)

Thank you very much!
question

opened by Charlottecuc 31
Many iterations of discriminator training causes strange noise
I compared the following two models:

(Red) The model which trains the discriminator from 200k iters

(Blue) The model which trains the discriminator from the first iter

Here is the training curve.

From the curve, the blue one is better than the red in terms of log STFT magnitude loss.
However, the blue model causes strange noise.

You can listen to the samples. https://drive.google.com/open?id=1LL_A4ysUqKJ13YQBdQwzNBvGp8m8BhqY

I think this is caused by the discriminator (v1 is red and v2 is blue). If you have any idea or suggestion to avoid this issue, please share with me.
help wanted discussion
opened by kan-bayashi 21
How is the runtime on CPU?

Hi! Thx for the repo. I was curious about the performance on CPU. AFAIK, it is 8x real-time on GPU but could you also share some values about CPU performance?
question

opened by erogol 20
TTS + ParallelWaveGAN progress

If you don't mind, I like to share my progress with PWGAN with TTS.

Here is the first try results: https://soundcloud.com/user-565970875/sets/ljspeech_tacotron_5233_paralle

The results are not better than what we have with WaveRNN, but I should say it is much faster.

There is a hissing noise in the background. If you have any idea how to get rid of this, please let me know.

The only difference in training (I guess) is that I don't apply mean-normalization to mel-spectrograms and I normalize to the [-4, 4] range.
discussion

opened by erogol 18
training time for HiFiGAN LJSpeech

Hi

I am training the HiFi-GAN vocoder on LJSpeech, using the recipe provided. It has been running for more than a week.

I am using 4 Tesla GPUs with 32 GB memory

May I know how much time it took for you ?

@kan-bayashi
question

opened by nellorebhanuteja 17
Training StyleMelGan on custom dataset

Hello again :)

Are lab mono files required for training, or can that step be skipped by using this script https://gist.github.com/kan-bayashi/eceafcd35a2351f5f6bf89a1ccb956e9 ?
question

opened by skol101 17
StyleMelGAN tuning
2020/08/06

v1

MSE loss

batch size 32

repeats 4

v2

MSE Loss

batch size 8

repeats 4

v3

Hinge loss

batch size 8

repeats 4

Learning rate scheduling may need to be investigated.
discussion
opened by kan-bayashi 17
WaveGAN training on Tacotron outputs.

Hey. I trained a Rayhane-mamah Tacotron 2 synthesizer without a vocoder. As a vocoder, I wanted to use your repository; could you please tell me how to properly train WaveGAN? Do I need to train on GTA mels? If so, how do I do it, given that the preprocessing procedure in run.sh itself prepares mel spectrograms from ground truth audio in step 1?
question

opened by Alexey322 17
How can we know multi GPU is working?

It is mentioned in the paper that using more GPUs accelerates the training. I have three NVIDIA K80s and using the flags

--nnodes 1 --nproc_per_node 3 -c

binds all three GPUs and ramps them up to 98% usage. However, I cannot see any decrease in waiting time or epoch rounds, and leaving it overnight did not return any marginally better results. Am I doing anything wrong? How can we know it is actually working? I tried to set --nnodes 3 but training never even started.
question

opened by george-roussos 17
Unclear signal flow related to usage of mel spectrograms in StyleMelGAN
Hello,

This is probably just a documentation problem.

It is unclear how mel spectrograms are used by the StyleMelGAN generator module.

I've been trying to figure out how to format mel spectrograms so the generator will accept them. To figure that out, I've been looking at the initialization parameters of the StyleMelGANGenerator module.

The only obvious candidate for defining the format/dimensions of the input spectrogram is the aux_channels parameter. But that wouldn't make sense, for these reasons:

1. Its default value is 80, but a mel spectrogram contains much more than 80 points of data.

2. aux_channels controls only one parameter: the in_channels parameter of the first layer in the first TADEResBlock. That would make sense if the mel spectrograms' dimensions corresponded to this parameter, but...

3. The diagram of StyleMelGAN's signal path in the original StyleMelGAN paper conflicts with point 2); the diagram shows the spectrograms being inserted into every TADEResBlock, not just the first.

So my questions are:

What is aux_channels? (What kind of data is considered "auxiliary input" - am I correct that this is the spectrograms?)

If aux_channels does not determine how the input spectrograms should be formatted, what does?

If you can answer these questions for me, I would be happy to improve the documentation/comments myself.

Thank you!
question
opened by andrewrose43 1
how to convert model to torchscript？

import sys
sys.path.insert(1,'/root/Downloads/ParallelWaveGAN-0.5.3/parallel_wavegan/utils')
import torch
import utils
module = utils.load_model('pretrained_model/checkpoint-400000steps.pkl')
print(module)
#model = torch.load('pretrained_model/checkpoint-400000steps.pkl',map_location=torch.device('cpu'))
#print('load model successful!')
x = torch.zeros(5, 10, 5, dtype=torch.float64)
x = x + (0.1**0.5)*torch.randn(5, 10, 5)
c = torch.rand(80,80,5)
print(x)
print('-------------------')
print(c)
print('-------------------')
print(x.size(-1))
print('-------------------')
print(c.size(-1))
trace_model = torch.jit.trace(module,(x,c))

The error is:

Traceback (most recent call last):
  File "demo.py", line 19, in <module>
    trace_model = torch.jit.trace(module,(x,c))
  File "/root/anaconda3/envs/pwgan/lib/python3.7/site-packages/torch/jit/_trace.py", line 768, in trace
    _module_class,
  File "/root/anaconda3/envs/pwgan/lib/python3.7/site-packages/torch/jit/_trace.py", line 983, in trace_module
    argument_names,
  File "/root/anaconda3/envs/pwgan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/pwgan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1178, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/root/Downloads/ParallelWaveGAN-0.5.3/parallel_wavegan/models/parallel_wavegan.py", line 159, in forward
    assert c.size(-1) == x.size(-1)
AssertionError

How should I set the values of the parameters x and c?

opened by zhuziying 0
Low inference speed of TTS on GPU

May I ask why the RTF of TTS is only 0.09 for a 12-second sentence? I use the fastspeech2_HIFiGAN model and the GPU is an A2000 (8.0 capability). I thought it should be at least a 50x speedup, because the FastSpeech2 paper says it is 50x faster than Transformer and the HiFi-GAN paper says it speeds up 1000x. Can anyone tell me what's wrong? Thank you!
question

opened by dalvlv 2
Avocodo Discriminators

A new interesting vocoder was described in a paper yesterday. It's called Avocodo and supposedly helps with the artifacts that are typical for GAN based vocoding. It supposedly also works better for unseen speakers than HiFiGAN, although I never had any issues with HiFiGAN and unseen speakers anyways.

https://arxiv.org/pdf/2206.13404.pdf

The generator seems to be pretty much the same as HiFiGAN's, but it has some new discriminators, which I think would be a nice addition to this repository. Combining Avocodo with e.g. the MultiPeriodDiscriminator would be very interesting!
feature request

opened by Flux9665 0

If fine-tuning from pre-trained should generator_scheduler_params be updated?

I'm fine-tuning HiFi-GAN from the 2.5M-step pretrained model to 3M steps.

I wonder if this is the way to go by updating milestones?

generator_optimizer_type: Adam
generator_optimizer_params:
    lr: 2.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
generator_scheduler_type: MultiStepLR
generator_scheduler_params:
    gamma: 0.5
    milestones:
        - 2600000
        - 2700000
        - 2800000
        - 2900000
generator_grad_norm: -1
discriminator_optimizer_type: Adam
discriminator_optimizer_params:
    lr: 2.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
discriminator_scheduler_type: MultiStepLR
discriminator_scheduler_params:
    gamma: 0.5
    milestones:
        - 2600000
        - 2700000
        - 2800000
        - 2900000
discriminator_grad_norm: -1
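
For reference, PyTorch's MultiStepLR multiplies the learning rate by gamma each time a milestone is reached. The sketch below reproduces that closed-form schedule for the values in the config above without importing PyTorch; the function name is illustrative, and whether a resumed run restores an older scheduler state from the checkpoint is not covered here.

```python
from bisect import bisect_right


def multistep_lr(base_lr, gamma, milestones, step):
    """Learning rate at `step` under a MultiStepLR-style schedule."""
    reached = bisect_right(sorted(milestones), step)  # milestones already reached
    return base_lr * gamma ** reached


milestones = [2600000, 2700000, 2800000, 2900000]
for step in (2500000, 2650000, 3000000):
    print(step, multistep_lr(2.0e-4, 0.5, milestones, step))
```

With these milestones the rate stays at 2.0e-4 until step 2.6M and has been halved four times by step 3M.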

opened by skol101 0

Releases(v0.5.5)

v0.5.5(May 17, 2022)
What's Changed

add recipe for kiritan & ofuton_p_utagoe db (singing voice synthesis) by @PeterGuoRuc in https://github.com/kan-bayashi/ParallelWaveGAN/pull/330

Add recipe for Opencpop by @ftshijt in https://github.com/kan-bayashi/ParallelWaveGAN/pull/332

add causal option for HiFiGAN by @chomeyama in https://github.com/kan-bayashi/ParallelWaveGAN/pull/326

Fix HiFiGAN compatibility by @kan-bayashi in https://github.com/kan-bayashi/ParallelWaveGAN/pull/334

add recipe for natsume (singing voice synthesis) by @PeterGuoRuc in https://github.com/kan-bayashi/ParallelWaveGAN/pull/336

Update readme with pre-trained models on svs and demonstration by @ftshijt in https://github.com/kan-bayashi/ParallelWaveGAN/pull/335

Add recipes and pretrained models for CSD (Korean&English) and KiSIng (Mandarin) databases by @ftshijt in https://github.com/kan-bayashi/ParallelWaveGAN/pull/342

Add new recipe PJS (singing voice synthesis) by @A-Quarter-Mile in https://github.com/kan-bayashi/ParallelWaveGAN/pull/347

add no7singing training by @frankxu2004 in https://github.com/kan-bayashi/ParallelWaveGAN/pull/349

Add icelandic by @G-Thor in https://github.com/kan-bayashi/ParallelWaveGAN/pull/354

add tag_or_url for download_pretrained_model by @roholazandie in https://github.com/kan-bayashi/ParallelWaveGAN/pull/361

Apply black by @kan-bayashi in https://github.com/kan-bayashi/ParallelWaveGAN/pull/362

Update to v0.5.5 by @kan-bayashi in https://github.com/kan-bayashi/ParallelWaveGAN/pull/363

New Contributors

@PeterGuoRuc made their first contribution in https://github.com/kan-bayashi/ParallelWaveGAN/pull/330

@chomeyama made their first contribution in https://github.com/kan-bayashi/ParallelWaveGAN/pull/326

@A-Quarter-Mile made their first contribution in https://github.com/kan-bayashi/ParallelWaveGAN/pull/347

@frankxu2004 made their first contribution in https://github.com/kan-bayashi/ParallelWaveGAN/pull/349

@G-Thor made their first contribution in https://github.com/kan-bayashi/ParallelWaveGAN/pull/354

@roholazandie made their first contribution in https://github.com/kan-bayashi/ParallelWaveGAN/pull/361

Full Changelog: https://github.com/kan-bayashi/ParallelWaveGAN/compare/v0.5.4...v0.5.5
Source code(tar.gz)
Source code(zip)
v0.5.4(Feb 10, 2022)
What's Changed

add kss recipe by @windtoker in https://github.com/kan-bayashi/ParallelWaveGAN/pull/306

Fix a noise shape of StyleMelGANGenerator to export ONNX model by @c-bata in https://github.com/kan-bayashi/ParallelWaveGAN/pull/312

add recipe for oniku_kurumi_utagoe db (singing voice synthesis) by @ftshijt in https://github.com/kan-bayashi/ParallelWaveGAN/pull/323

update documentation and correct the download link for oniku-db by @ftshijt in https://github.com/kan-bayashi/ParallelWaveGAN/pull/324

Fix an error librosa update by @kan-bayashi in https://github.com/kan-bayashi/ParallelWaveGAN/pull/327

Add pytorch 1.10.x CI by @kan-bayashi in https://github.com/kan-bayashi/ParallelWaveGAN/pull/328

New Contributors

@windtoker made their first contribution in https://github.com/kan-bayashi/ParallelWaveGAN/pull/306

@c-bata made their first contribution in https://github.com/kan-bayashi/ParallelWaveGAN/pull/312

@ftshijt made their first contribution in https://github.com/kan-bayashi/ParallelWaveGAN/pull/323

Full Changelog: https://github.com/kan-bayashi/ParallelWaveGAN/compare/v0.5.3...v0.5.4
v0.5.3(Aug 26, 2021)
Added pretrained models.

v0.5.2(Aug 24, 2021)
Added pretrained models

Fixed typo

Introduced filelock to prevent duplicate downloads

v0.5.1(Aug 8, 2021)
Support normalize-before option in decoding

Support log_base option in preprocessing

Update pre-trained models

v0.5.0(Aug 7, 2021)
Support HiFiGAN

Support StyleMelGAN

Refactored

v0.4.8(Nov 2, 2020)
Support Pytorch 1.7

Add JSSS recipe

Support the case where wav.scp and segments have a different number of lines
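
For readers unfamiliar with Kaldi-style data directories: wav.scp maps a recording ID to an audio path, while segments maps each utterance ID to a recording ID plus start and end times. One recording can therefore yield several utterances, and the two files need not have the same number of lines. Below is a minimal sketch of that relationship, assuming the standard Kaldi line layouts. The helper names are hypothetical, and this is not the repository's loader.

```python
def parse_wav_scp(lines):
    """Map recording ID -> audio path from '<rec_id> <path>' lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec_id, path = line.split(maxsplit=1)
        table[rec_id] = path
    return table


def parse_segments(lines):
    """Yield (utt_id, rec_id, start_sec, end_sec) from '<utt> <rec> <start> <end>' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        utt_id, rec_id, start, end = line.split()
        yield utt_id, rec_id, float(start), float(end)


def resolve_segments(wav_scp_lines, segments_lines):
    """Attach the audio path to every segment; several segments may share one recording."""
    wavs = parse_wav_scp(wav_scp_lines)
    resolved = []
    for utt_id, rec_id, start, end in parse_segments(segments_lines):
        if rec_id not in wavs:
            raise KeyError(f"{rec_id} appears in segments but not in wav.scp")
        resolved.append((utt_id, wavs[rec_id], start, end))
    return resolved
```

With two recordings and three segments, resolve_segments returns three entries, two of which point at the same audio file.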

v0.4.6(Aug 31, 2020)
Minor fix of pretrained model save directory

v0.4.5(Aug 18, 2020)
Simplify decoding part

Add load_model function

Add an inference method to each generator

Add download_pretrained_model function to download pretrained models directly from Google Drive

v0.4.3(Aug 16, 2020)
Fix PQMF #204

v0.4.2(Aug 15, 2020)
Fixed wrong loss naming

Fixed shuffle in distributed training

Fixed pad mode

Updated to allow changing PQMF parameters

v0.4.1(Jun 28, 2020)
Fix numba version issue

Support multi-key hdf5 scp format

Support npy scp format

v0.4.0(May 28, 2020)
What's new

Support Multi-band MelGAN

Support Full-band MelGAN

v0.3.5(May 11, 2020)
What's new

Scripts

Fix the update order #131 #132

Recipes

Add egs/libritts/voc1

Update structure #134

Add template recipe #141

v0.3.4(Mar 12, 2020)
What's new

Scripts

Support --pretrain option in training

Support --skip-wav-copy option in normalization

Support scp style input for all /bin scripts for ESPnet compatibility

Better parallelization (much faster in the case of the large dataset)

Recipes

Fix format: npy support

Add VCTK recipe

Add melgan.v3 config

Add parallel_wavegan.v3 config

v0.3.0(Feb 15, 2020)
What's new

Support more recipes

Support new residual discriminator

Support MelGAN generator

Support MelGAN discriminator

And more refactoring...

v0.2.5(Nov 16, 2019)

Stable version release.

Owner

Tomoki Hayashi

Postdoctoral researcher @ Nagoya University / COO @ Human Dataware Lab. Co., Ltd.

GitHub https://kan-bayashi.github.io/ParallelWaveGAN/
