A Flow-based Generative Network for Speech Synthesis

NVIDIA Corporation

Last update: Dec 26, 2022

Overview

WaveGlow: a Flow-based Generative Network for Speech Synthesis

Ryan Prenger, Rafael Valle, and Bryan Catanzaro

In our recent paper, we propose WaveGlow: a flow-based network capable of generating high-quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet to provide fast, efficient and high-quality audio synthesis without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
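For readers who want the single cost function spelled out, the block below sketches the log-likelihood of a flow-based model with a spherical Gaussian prior, in the form used for WaveGlow. The symbols are notation chosen here rather than names from the README: z(x) is the latent produced from audio x, sigma is the assumed standard deviation of the prior, s_j are the scale outputs of the affine coupling layers, and W_k are the invertible 1x1 convolution weights. Training minimizes the negative of this quantity, up to constants.

```latex
\log p_\theta(x) =
  -\frac{z(x)^{\top} z(x)}{2\sigma^{2}}
  + \sum_{j} \log s_j(x, \text{mel})
  + \sum_{k} \log \left| \det W_k \right|
  + \text{const}
```

The last term is why the determinants of the 1x1 convolution weights show up in discussions of training stability.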

Our PyTorch implementation produces audio samples at a rate of 1200 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation.

Visit our website for audio samples.

Setup

Clone our repo and initialize submodule

git clone https://github.com/NVIDIA/waveglow.git
cd waveglow
git submodule init
git submodule update

Install requirements

pip3 install -r requirements.txt

Install Apex

Generate audio with our pre-existing model

Download our published model

Download mel-spectrograms

Generate audio

python3 inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_256channels.pt -o . --is_fp16 -s 0.6

N.b. use convert_model.py to convert your older models to the current model with fused residual and skip connections.

Train your own model

Download LJ Speech Data. In this example it's in data/

Make a list of the file names to use for training/testing

ls data/*.wav | tail -n+11 > train_files.txt
ls data/*.wav | head -n10 > test_files.txt
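The same split can be sketched in Python, which makes it easy to check that the training and test lists do not share any file. The helper name `split_file_list` and the example file names are hypothetical and only stand in for the contents of data/.

```python
def split_file_list(paths, n_test=10):
    """Return (train, test); the first n_test sorted paths are held out for testing."""
    ordered = sorted(paths)
    return ordered[n_test:], ordered[:n_test]


# Hypothetical file names standing in for data/*.wav
paths = [f"data/LJ001-{i:04d}.wav" for i in range(1, 26)]
train_files, test_files = split_file_list(paths)

# The two lists must not overlap, otherwise test audio leaks into training.
assert not set(train_files) & set(test_files)
print(len(train_files), len(test_files))  # 15 10
```

To produce the two text files, write each list with one path per line, for example with `"\n".join(train_files)`.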

Train your WaveGlow networks
```
mkdir checkpoints
python train.py -c config.json
```
For multi-GPU training, replace train.py with distributed.py. This has only been tested on a single node with NCCL.

For mixed precision training, set "fp16_run": true in config.json.
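If you prefer to flip the flag from a script rather than by hand, the sketch below sets an existing key anywhere in a nested JSON config. The helper `set_flag` is hypothetical, and the example dictionary is a made-up excerpt: it assumes the flag sits inside a `train_config` section, so check your own config.json for the actual layout.

```python
import json


def set_flag(config, key, value):
    """Set every existing occurrence of `key` in a nested dict; return how many were updated."""
    updated = 0
    for name, item in config.items():
        if name == key:
            config[name] = value
            updated += 1
        elif isinstance(item, dict):
            updated += set_flag(item, key, value)
    return updated


# Hypothetical excerpt; a real config.json has more fields.
example = json.loads('{"train_config": {"fp16_run": false, "batch_size": 12}}')
count = set_flag(example, "fp16_run", True)
print(count, json.dumps(example))
```

For a real file, load it with `json.load`, call `set_flag`, and write it back with `json.dump`. A return value of 0 means the key was not found and nothing was changed.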
Make test set mel-spectrograms

python mel2samp.py -f test_files.txt -o . -c config.json

Do inference with your network

ls *.pt > mel_files.txt
python3 inference.py -f mel_files.txt -w checkpoints/waveglow_10000 -o . --is_fp16 -s 0.6

Comments

How long will it take to get a good result?

The output printed by train.py looks like this:

46739: -4.754265308
46740: -5.550816059
46741: -4.253830433
46742: -5.338192463
46743: -4.700691700
46744: -5.625311375
46745: -5.753829479
46746: -5.032420158
......

What value does the second number need to reach before the checkpoint is usable?
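The per-iteration values in a log like this are noisy, so it can help to smooth them before comparing checkpoints. The sketch below parses "iteration: loss" entries and computes a trailing moving average. The function names and the window size are illustrative choices, and the code says nothing about which loss value corresponds to usable audio; listening to samples from a checkpoint remains the practical check.

```python
import re


def parse_losses(log_text):
    """Extract (iteration, loss) pairs from 'iteration: loss' entries."""
    pairs = re.findall(r"(\d+):\s*(-?\d+\.\d+)", log_text)
    return [(int(iteration), float(loss)) for iteration, loss in pairs]


def moving_average(values, window):
    """Trailing moving average; early entries use as many values as are available."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


log_text = "46739: -4.754265308 46740: -5.550816059 46741: -4.253830433 46742: -5.338192463"
pairs = parse_losses(log_text)
smoothed = moving_average([loss for _, loss in pairs], window=4)
print(pairs[0][0], round(smoothed[-1], 3))
```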

opened by NileZhou 41
Can't load WaveGlow checkpoint into inference code even after convert

I have trained a new WaveGlow model for my language, but I can't load it into the Tacotron2 inference.ipynb file to run a test. It returns this error: AttributeError: 'WN' object has no attribute 'cond_layers'

I tried converting the checkpoint using the convert_model.py file in the waveglow folder, but it still raises the same error.

opened by EuphoriaCelestial 37
pretrained model which can resume training

@rafaelvalle Thanks for sharing! It helps a lot. I find that training is very slow (about 1 epoch/day, batch size=1) when I train the model on my own data (about 12h). Could you provide a model from which training can be resumed?

Thanks a lot !!

opened by emmacirl 25
CUDA out of memory on 8 V100-s while finetuning on different dataset. Batch size 6, seg length 12k.

After days of different approaches, we've decided to finetune the NVIDIA pre-trained model on different language dataset. We're running model on 8 V100-s with 16GB of VRAM. Our dataset was recorded at 48kHz, and then downsampled to 22050. Checkpoint is converted. Additionally, while getting out-of-memory errors on 8 GPUs, we've decreased the batch size to 6, and segment length to 12000. Even now, 3 of 8 GPUs die, and batches are distributed on 5 of them. Training time has been increased drastically.

Any thoughts from collaborators/contributors?

opened by deepconsc 21
Abrupt noise

Has anybody else run into this problem? After training for 1000k steps on LJ Speech, an "abrupt noise" appears in the output. For example:

The audio file is: LJ001-0007.wav_synthesis_01.zip

My config.json file is:

I used a single GPU.

Looking forward to your help!

opened by UESTCgan 20
Can a single GPU get good results?

Has anyone trained this model on a single GPU (1080 Ti) and gotten good results? In this setup I can only run the model with a batch size of 1, because I don't have enough GPU resources...

opened by Cheneng 18
How to train a new WaveGlow model for a different language?

As the title says, I would like to know how to train a new model using another dataset that has the same structure as the LJ Speech dataset. What modifications need to be made for a different language?

opened by EuphoriaCelestial 17
Loss instability during training
I have already trained a WaveGlow model from scratch using LJ Speech dataset and everything worked well during training.

I am now trying to train a new model on a private dataset that contains only 2 hours of speech. Some audio clips are shorter than segment_length=16000 (approximately 10 clips out of 2300). Training is performed in FP32 and, except for batch_size=24, I use the same hyperparameters as those in config.json.
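For context on clips shorter than the segment length, one common way to build fixed-length training segments is to crop long clips at a random offset and zero-pad short ones. The sketch below illustrates that idea on plain Python lists. The function `take_segment` is hypothetical, and this is not claimed to be exactly what the repository's data loader does.

```python
import random


def take_segment(samples, segment_length, rng=random):
    """Return exactly segment_length samples: random crop if long enough, else zero-pad."""
    if len(samples) >= segment_length:
        start = rng.randint(0, len(samples) - segment_length)
        return samples[start:start + segment_length]
    # Short clip: append zeros (silence) up to the requested length.
    return samples + [0.0] * (segment_length - len(samples))


long_clip = [0.1] * 20000
short_clip = [0.1] * 12000
print(len(take_segment(long_clip, 16000)), len(take_segment(short_clip, 16000)))  # 16000 16000
```

With zero-padding, a short clip contributes a stretch of exact silence to the batch, which is one thing to look at when a small number of clips fall below the segment length.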

Training loss decreases slowly during 120k iterations (which represents a lot of epochs for my small dataset) but further iterations lead to two types of errors:

NaN loss: I tracked det(W) at each layer to check if the determinant was crossing between positive and negative values, but when loss becomes NaN, all determinants are strictly positive. I don't really know which other term in the loss could cause a NaN issue.

Jump in the loss: The loss jumps from negative values to positive ones. The loss continues to diverge at next iterations and model forgets what it learned.

I have already used this private dataset with WaveNet and everything worked well, which suggests that the dataset is not corrupted.

I tried to decrease the learning rate but instability still persists. Any insight or help to understand these problems would be greatly appreciated.
opened by julianzaidi 17
During training, the loss value goes up and down and cannot converge. Is that normal? Also, what should the final loss value look like?
I implemented a WaveGlow model in my own project. The code is almost the same as this repo, with some modifications:

Upsample the mel-spectrogram to number of groups, so n_mel_channels in WN reduce to 80.

Change logdet() in invertable1x1 to det().abs().log() as #35 did because the first few runs I did the loss became nan after thousands of steps.

n_channels is 256, so the model is 4~5 times smaller than the original. I run the model on two 1080 Ti GPUs using nn.DataParallel with a batch size of 8. After about 5k steps the loss is around -6 ~ -7 and I can hear some speech-like sentences in the model outputs. Then the loss value started to go up and down, even above zero, and would not improve any further. I added 24 flows to the model, which did not help; with a batch size of 32 the problem still exists. Maybe it will get better with more steps, but after 70k steps I still cannot see any improvement. Has anyone had similar problems?

I also want to ask about the final loss as reference. In my case, -11 is the smallest value I can get, then the aforementioned problem happened. In #5 @azraelkuan can get a -18 loss value at 56k, is it the normal loss value?
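The switch from logdet() to det().abs().log() mentioned above matters because the logarithm of a negative determinant is undefined, while the likelihood only needs log|det W|. The pure-Python sketch below computes the sign and log|det| separately using Gaussian elimination with partial pivoting. The function `sign_and_logabsdet` is written for illustration only, and the 2x2 matrix is a made-up example, not a weight from a real model.

```python
import math


def sign_and_logabsdet(matrix):
    """Return (sign, log|det|) of a square matrix via Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    sign, logabs = 1.0, 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0, float("-inf")  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign  # a row swap flips the sign of the determinant
        if a[col][col] < 0:
            sign = -sign
        logabs += math.log(abs(a[col][col]))
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return sign, logabs


w = [[0.0, 1.0], [1.0, 0.0]]  # permutation matrix with det(W) = -1
sign, logabs = sign_and_logabsdet(w)
print(sign, logabs)  # -1.0 0.0
```

Here the determinant is negative, yet log|det W| is a perfectly finite 0.0. Tracking the sign alongside the log magnitude also shows when a weight matrix crosses through a zero determinant during training.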
opened by yoyololicon 15
problem about mel2samp.py

This is very nice work, and I am studying it! But while training the model I found a problem: there may be a mistake in mel2samp.py, because it does not produce a correct mel-spectrogram. When I use the "mel_spectrograms" that NVIDIA provides, the results inferred by "waveglow_old.pt" are good. But when I use "python mel2samp.py -f test_files.txt -o . -c config.json" to compute the mel-spectrograms, the results inferred by "waveglow_old.pt" are bad. The voice sounds like a man, not the original woman.

the result of "mel_spectrograms" that the NVIDIA gives:

the result of mel2samp.py

This is my audio. my_results_wav.zip

opened by UESTCgan 14
Inference time 3 times slower than real-time on single GTX 1080ti

I tested NVIDIA/tacotron2 + WaveGlow using inference.ipynb with the pretrained models on a single GTX 1080 Ti. Inference was about 3 times slower than real time: it took 24 s to generate 7 s of speech.

Is it possible to generate faster than real time on a single GTX 1080 Ti? Thank you.
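The numbers in this report can be put on the same scale as the throughput quoted in the overview. The sketch below computes the real-time factor (processing time divided by audio duration) for the 24 s / 7 s case, and converts a samples-per-second rate into a speedup over real time. The 22050 Hz sampling rate is an assumption based on the LJ Speech data used elsewhere on this page, and both helper names are hypothetical.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Values above 1 mean slower than real time."""
    return processing_seconds / audio_seconds


def speedup_over_real_time(samples_per_second, sampling_rate_hz):
    """How many seconds of audio are generated per second of compute."""
    return samples_per_second / sampling_rate_hz


rtf = real_time_factor(24.0, 7.0)
# Assumes 22050 Hz audio; 1200 kHz is the V100 rate quoted in the overview.
speedup = speedup_over_real_time(1_200_000, 22_050)
print(round(rtf, 2), round(speedup, 1))  # 3.43 54.4
```

The two figures are not directly comparable: the reported 24 s covers the whole Tacotron2 plus WaveGlow notebook on a different GPU, while the quoted rate refers to WaveGlow alone on a V100.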

Here is screenshot while generating audio.

opened by dnnnew 13
Add CodeQL workflow for GitHub code scanning
Hi NVIDIA/waveglow!

This is a one-off automatically generated pull request from LGTM.com :robot:. You might have heard that we’ve integrated LGTM’s underlying CodeQL analysis engine natively into GitHub. The result is GitHub code scanning!

With LGTM fully integrated into code scanning, we are focused on improving CodeQL within the native GitHub code scanning experience. In order to take advantage of current and future improvements to our analysis capabilities, we suggest you enable code scanning on your repository. Please take a look at our blog post for more information.

This pull request enables code scanning by adding an auto-generated codeql.yml workflow file for GitHub Actions to your repository — take a look! We tested it before opening this pull request, so all should be working :heavy_check_mark:. In fact, you might already have seen some alerts appear on this pull request!

Where needed and if possible, we’ve adjusted the configuration to the needs of your particular repository. But of course, you should feel free to tweak it further! Check this page for detailed documentation.

Questions? Check out the FAQ below!

FAQ

How often will the code scanning analysis run?

By default, code scanning will trigger a scan with the CodeQL engine on the following events:

On every pull request — to flag up potential security problems for you to investigate before merging a PR.

On every push to your default branch and other protected branches — this keeps the analysis results on your repository’s Security tab up to date.

Once a week at a fixed time — to make sure you benefit from the latest updated security analysis even when no code was committed or PRs were opened.

What will this cost?

Nothing! The CodeQL engine will run inside GitHub Actions, making use of your unlimited free compute minutes for public repositories.

What types of problems does CodeQL find?

The CodeQL engine that powers GitHub code scanning is the exact same engine that powers LGTM.com. The exact set of rules has been tweaked slightly, but you should see almost exactly the same types of alerts as you were used to on LGTM.com: we’ve enabled the security-and-quality query suite for you.

How do I upgrade my CodeQL engine?

No need! New versions of the CodeQL analysis are constantly deployed on GitHub.com; your repository will automatically benefit from the most recently released version.

The analysis doesn’t seem to be working

If you get an error in GitHub Actions that indicates that CodeQL wasn’t able to analyze your code, please follow the instructions here to debug the analysis.

How do I disable LGTM.com?

If you have LGTM’s automatic pull request analysis enabled, then you can follow these steps to disable the LGTM pull request analysis. You don’t actually need to remove your repository from LGTM.com; it will automatically be removed in the next few months as part of the deprecation of LGTM.com (more info here).

Which source code hosting platforms does code scanning support?

GitHub code scanning is deeply integrated within GitHub itself. If you’d like to scan source code that is hosted elsewhere, we suggest that you create a mirror of that code on GitHub.

How do I know this PR is legitimate?

This PR is filed by the official LGTM.com GitHub App, in line with the deprecation timeline that was announced on the official GitHub Blog. The proposed GitHub Action workflow uses the official open source GitHub CodeQL Action. If you have any other questions or concerns, please join the discussion here in the official GitHub community!

I have another question / how do I get in touch?

Please join the discussion here to ask further questions and send us suggestions!
opened by lgtm-com[bot] 0
spectrogram (image)-to-wav

Dear Authors and readers

I would appreciate it if you would give me an answer to my question:

Is it possible to convert the spectrogram image (not the array) to wav (reconstruct the wav audio from the spectrogram in image form)?

opened by ahmeftah 0
Update train.py

When loading the published pre-trained model, the dict contains neither the optimizer nor the iteration, so a KeyError exception is raised. I changed it to set the iteration to 0 if it is not found, and to return the optimizer itself if it is not in the dict.
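The idea described in this pull request can be sketched with plain dictionaries. The helper `read_checkpoint_fields` is hypothetical and is not the code from the pull request. The key names "iteration" and "optimizer" follow the description above, while the "model" key is an assumption about the checkpoint layout.

```python
def read_checkpoint_fields(checkpoint_dict, current_optimizer_state=None):
    """Pull fields out of a checkpoint dict, tolerating a missing iteration or optimizer."""
    # Fall back to iteration 0 when the checkpoint does not record one.
    iteration = checkpoint_dict.get("iteration", 0)
    # Keep the caller's optimizer state when the checkpoint has none.
    optimizer_state = checkpoint_dict.get("optimizer", current_optimizer_state)
    # The model itself is still required; a missing "model" key should fail loudly.
    model = checkpoint_dict["model"]
    return model, optimizer_state, iteration


published = {"model": "weights-placeholder"}  # stands in for a published checkpoint
model, optimizer_state, iteration = read_checkpoint_fields(published, {"lr": 1e-4})
print(iteration, optimizer_state)  # 0 {'lr': 0.0001}
```

Using `dict.get` with a default avoids the KeyError for the optional fields while still raising for a checkpoint that has no model at all.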

opened by msalhab96 0
An important issue on multispeaker inference

Hello, I would like to know whether WaveGlow can be used in a multispeaker setting. For example, could it be trained on the VCTK dataset and still produce good inference results?
Has anyone tried this? The existing issues don't give a definite answer. Thank you very much.

opened by hongchengzhu 0
build(deps): bump numpy from 1.13.3 to 1.22.0
Bumps numpy from 1.13.3 to 1.22.0.

Release notes

Sourced from numpy's releases.

v1.22.0

NumPy 1.22.0 Release Notes

NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.

A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.

NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.

New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.

A new configurable allocator for use by downstream projects.

These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

Expired deprecations

Deprecated numeric style dtype strings have been removed

Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

(gh-19539)

Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

(gh-19615)

... (truncated)

Commits

4adc87d Merge pull request #20685 from charris/prepare-for-1.22.0-release

fd66547 REL: Prepare for the NumPy 1.22.0 release.

125304b wip

c283859 Merge pull request #20682 from charris/backport-20416

5399c03 Merge pull request #20681 from charris/backport-20954

f9c45f8 Merge pull request #20680 from charris/backport-20663

794b36f Update armccompiler.py

d93b14e Update test_public_api.py

7662c07 Update init.py

311ab52 Update armccompiler.py

Additional commits viewable in compare view

opened by dependabot[bot] 0

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis Jungil Kong, Jaehyeon Kim, Jaekyoung Bae In our paper, we p

31 Dec 8, 2022

ERISHA is a multilingual multispeaker expressive speech synthesis framework. It can transfer the expressivity to the speaker's voice for which no expressive speech corpus is available.

ERISHA: Multilingual Multispeaker Expressive Text-to-Speech Library ERISHA is a multilingual multispeaker expressive speech synthesis framework. It ca

43 Nov 27, 2022

Just Go with the Flow: Self-Supervised Scene Flow Estimation

Just Go with the Flow: Self-Supervised Scene Flow Estimation Code release for the paper Just Go with the Flow: Self-Supervised Scene Flow Estimation,

50 Nov 22, 2022

Speech Enhancement Generative Adversarial Network Based on Asymmetric AutoEncoder

ASEGAN: Speech Enhancement Generative Adversarial Network Based on Asymmetric AutoEncoder Chinese introduction Readme with English Version Introduces an improved version based on the SEGAN model, using a self-designed non

53 Nov 17, 2022

PyTorch implementation of paper "Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes", CVPR 2021

Neural Scene Flow Fields PyTorch implementation of paper "Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes", CVPR 20

585 Jan 4, 2023

[ICCV'21] Neural Radiance Flow for 4D View Synthesis and Video Processing

NeRFlow [ICCV'21] Neural Radiance Flow for 4D View Synthesis and Video Processing Datasets The pouring dataset used for experiments can be download he

44 Dec 20, 2022

Generative Flow Networks

Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation Implementation for our paper, submitted to NeurIPS 2021 (also chec

381 Jan 4, 2023

Generative Flow Networks for Discrete Probabilistic Modeling

Energy-based GFlowNets Code for Generative Flow Networks for Discrete Probabilistic Modeling by Dinghuai Zhang, Nikolay Malkin, Zhen Liu, Alexandra Vo

51 Dec 20, 2022

PyTorch Implementation of NCSOFT's FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis

FastPitchFormant - PyTorch Implementation PyTorch Implementation of FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis. Qu

63 Jan 2, 2023

PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

VAENAR-TTS - PyTorch Implementation PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

67 Nov 14, 2022

Official implementation of deep Gaussian process (DGP)-based multi-speaker speech synthesis with PyTorch.

Multi-speaker DGP This repository provides official implementation of deep Gaussian process (DGP)-based multi-speaker speech synthesis with PyTorch. O

24 Sep 7, 2022

PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

Deepvoice3_pytorch PyTorch implementation of convolutional networks-based text-to-speech synthesis models: arXiv:1710.07654: Deep Voice 3: Scaling Tex

1.8k Jan 8, 2023

Implementation of "StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis"

StrengthNet Implementation of "StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis" https://arxiv.org/abs/2110

65 Dec 20, 2022

π-GAN: Periodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis

π-GAN: Periodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis Project Page | Paper | Data Eric Ryan Chan*, Marco Monteiro*, Pe

375 Dec 31, 2022

Pytorch implementation for reproducing StackGAN_v2 results in the paper StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks

StackGAN-v2 StackGAN-v1: Tensorflow implementation StackGAN-v1: Pytorch implementation Inception score evaluation Pytorch implementation for reproduci

809 Dec 16, 2022

TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction.

TalkNet 2 [WIP] TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Predictio

69 Dec 17, 2022
