WaveGlow: a Flow-based Generative Network for Speech Synthesis

Ryan Prenger, Rafael Valle, and Bryan Catanzaro

In our recent paper, we propose WaveGlow: a flow-based network capable of generating high-quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet to provide fast, efficient, and high-quality audio synthesis without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
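To make the single cost function concrete, here is a minimal sketch of the negative log-likelihood that a flow-based model minimizes when its latent prior is a zero-mean spherical Gaussian. The function name `flow_nll` and its argument names are illustrative assumptions, not the repository's API: `log_s_total` stands for the summed log-scales of the affine coupling layers and `log_det_w_total` for the summed log-determinants of the invertible 1x1 convolutions. The additive constant of the Gaussian density is dropped.

```python
def flow_nll(z, log_s_total, log_det_w_total, sigma=1.0):
    """Per-element negative log-likelihood of a normalizing flow.

    z               -- flat list of latent values produced by the forward pass
    log_s_total     -- sum of all coupling-layer log-scale terms (assumed given)
    log_det_w_total -- sum of all 1x1-convolution log|det W| terms (assumed given)
    sigma           -- standard deviation of the Gaussian prior on z
    """
    # Gaussian prior term: z^T z / (2 * sigma^2)
    gaussian_term = sum(v * v for v in z) / (2.0 * sigma ** 2)
    # Change of variables: subtract the log-Jacobian contributions
    total = gaussian_term - log_s_total - log_det_w_total
    # Average over the number of latent elements
    return total / len(z)


if __name__ == "__main__":
    print(flow_nll([0.5, -0.5, 1.0, -1.0], log_s_total=2.0, log_det_w_total=1.0))
```

Because the log-scale and log-determinant terms can outweigh the Gaussian term, this quantity can be negative.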

Our PyTorch implementation produces audio samples at a rate of 1200 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation.

Visit our website for audio samples.

Setup

  1. Clone our repo and initialize submodule

    git clone https://github.com/NVIDIA/waveglow.git
    cd waveglow
    git submodule init
    git submodule update

  2. Install requirements

    pip3 install -r requirements.txt

  3. Install Apex

Generate audio with our pre-existing model

  1. Download our published model
  2. Download mel-spectrograms
  3. Generate audio

    python3 inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_256channels.pt -o . --is_fp16 -s 0.6

N.b. use convert_model.py to convert your older models to the current model with fused residual and skip connections.
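The `<(ls ...)` argument in the generation command relies on shell process substitution, which not every shell supports. As an alternative, the list of mel-spectrogram paths can be written to a text file first and passed with `-f`. This is a minimal sketch; the helper name `write_file_list` and the output name `mel_files.txt` are assumptions chosen for illustration.

```python
import glob
import os


def write_file_list(pattern, list_path):
    """Write every path matching `pattern` to `list_path`, one per line.

    Returns the sorted list of paths that were written.
    """
    paths = sorted(glob.glob(pattern))
    with open(list_path, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths


if __name__ == "__main__":
    found = write_file_list(os.path.join("mel_spectrograms", "*.pt"), "mel_files.txt")
    print(f"wrote {len(found)} paths to mel_files.txt")
```

The resulting file can then be used in place of the process substitution, for example `python3 inference.py -f mel_files.txt -w waveglow_256channels.pt -o . --is_fp16 -s 0.6`.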

Train your own model

  1. Download LJ Speech Data. In this example it's in data/

  2. Make a list of the file names to use for training/testing

    ls data/*.wav | tail -n+11 > train_files.txt
    ls data/*.wav | head -n10 > test_files.txt

  3. Train your WaveGlow networks

    mkdir checkpoints
    python train.py -c config.json

    For multi-GPU training replace train.py with distributed.py. Only tested with single node and NCCL.

    For mixed precision training set "fp16_run": true in config.json.

  4. Make test set mel-spectrograms

    python mel2samp.py -f test_files.txt -o . -c config.json

  5. Do inference with your network

    ls *.pt > mel_files.txt
    python3 inference.py -f mel_files.txt -w checkpoints/waveglow_10000 -o . --is_fp16 -s 0.6
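The training and test file lists from step 2 can also be built in Python. This sketch holds out the first `n_test` files for testing and uses the rest for training, so the two lists never share a file. The function name `split_file_list` is an assumption for illustration and is not part of the repository.

```python
def split_file_list(paths, n_test=10):
    """Split paths into disjoint (train, test) lists.

    The paths are sorted first so the split is reproducible; the first
    `n_test` entries become the test set and the remainder the training set.
    """
    ordered = sorted(paths)
    test_files = ordered[:n_test]
    train_files = ordered[n_test:]
    return train_files, test_files


if __name__ == "__main__":
    # Hypothetical file names, used only to demonstrate the split.
    wavs = [f"data/clip_{i:04d}.wav" for i in range(25)]
    train, test = split_file_list(wavs, n_test=10)
    print(len(train), len(test))
```

Writing each list to `train_files.txt` and `test_files.txt`, one path per line, gives files in the same format the shell commands produce.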
Comments
  • How long will it take to get a good result?

    The output of train.py looks like this:

        46739: -4.754265308
        46740: -5.550816059
        46741: -4.253830433
        46742: -5.338192463
        46743: -4.700691700
        46744: -5.625311375
        46745: -5.753829479
        46746: -5.032420158
        ......

    At what value of the second number is the checkpoint usable?

    opened by NileZhou 41
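The per-iteration values in the post above fluctuate from step to step, so a smoothed trend is easier to read than individual numbers. The following is a minimal sketch that parses output in the `iteration: loss` form quoted in the post and computes a moving average. The function names are assumptions for illustration. The sketch does not say which loss value makes a checkpoint usable; it only makes the trend easier to see.

```python
import re

# Matches entries such as "46739: -4.754265308"
LOG_ENTRY = re.compile(r"(\d+):\s*(-?\d+(?:\.\d+)?)")


def parse_loss_log(text):
    """Return a list of (iteration, loss) pairs found in the log text."""
    return [(int(step), float(loss)) for step, loss in LOG_ENTRY.findall(text)]


def moving_average(values, window):
    """Trailing moving average; early entries average over what is available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    running = 0.0
    for index, value in enumerate(values):
        running += value
        if index >= window:
            running -= values[index - window]
        averages.append(running / min(index + 1, window))
    return averages


if __name__ == "__main__":
    log = "46739: -4.754265308 46740: -5.550816059 46741: -4.253830433"
    entries = parse_loss_log(log)
    smoothed = moving_average([loss for _, loss in entries], window=2)
    for (step, _), avg in zip(entries, smoothed):
        print(step, round(avg, 4))
```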
  • Can't load waveglow checkpoint into inference code even after convert

    I have trained a new WaveGlow model for my language, but I can't load it into the Tacotron2 inference.ipynb file to run a test. It returns this error: AttributeError: 'WN' object has no attribute 'cond_layers'

    I tried converting the checkpoint using the convert_model.py file in the waveglow folder, but it still raises the same error.

    opened by EuphoriaCelestial 37
  • pretrained model which can resume training

    @rafaelvalle Thanks for sharing! It helps a lot. I find that training is very slow (about 1 epoch/day, batch size = 1) when I train a model on my own data (about 12 h). Could you offer a pretrained model from which training can be resumed?

    Thanks a lot!

    opened by emmacirl 25
  • CUDA out of memory on 8 V100-s while finetuning on different dataset. Batch size 6, seg length 12k.

    After days of trying different approaches, we've decided to fine-tune the NVIDIA pre-trained model on a dataset in a different language. We're running the model on 8 V100s with 16 GB of VRAM. Our dataset was recorded at 48 kHz and then downsampled to 22050 Hz. The checkpoint is converted. After getting out-of-memory errors on 8 GPUs, we decreased the batch size to 6 and the segment length to 12000. Even now, 3 of the 8 GPUs die, and batches are distributed across the remaining 5. Training time has increased drastically.

    Any thoughts from collaborators/contributors?

    opened by deepconsc 21
  • Abrupt noise,

    Does anybody else have this problem? After training for 1000k steps on LJ Speech, "abrupt noise" appears in the output. For example: [image] [image]

    The audio file is: LJ001-0007.wav_synthesis_01.zip

    My config.json file is: [image]

    I used a single GPU.

    I look forward to your help!

    opened by UESTCgan 20
  • Can single GPU get good result?

    Has anyone trained this model on a single GPU (1080 Ti) and gotten good results? In this situation I can only run the model with a batch size of 1, because I don't have enough GPU resources.

    opened by Cheneng 18
  • How to train a new WaveGlow model for a different language?

    As the title says, I would like to know how to train a new model using another dataset that has the same structure as the LJ Speech dataset. What modifications are needed for a different language?

    opened by EuphoriaCelestial 17
  • Loss instability during training

    I have already trained a WaveGlow model from scratch using LJ Speech dataset and everything worked well during training.

    I am now trying to train a new model on a private dataset that contains only 2 hours of speech. Some audio files are shorter than segment_length=16000 (approximately 10 files in a dataset of 2300). This training is performed in FP32 and, except for batch_size=24, I use the same hyper-parameters as those in config.json.

    Training loss decreases slowly during 120k iterations (which represents a lot of epochs for my small dataset) but further iterations lead to two types of errors:

    • NaN loss: I tracked det(W) at each layer to check if the determinant was crossing between positive and negative values, but when loss becomes NaN, all determinants are strictly positive. I don't really know which other term in the loss could cause a NaN issue.
    • Jump in the loss: The loss jumps from negative values to positive ones. It continues to diverge in subsequent iterations and the model forgets what it learned.

    I have already used this private dataset with WaveNet and everything worked well, which suggests that the dataset is not corrupted.

    I tried to decrease the learning rate but instability still persists. Any insight or help to understand these problems would be greatly appreciated.

    opened by julianzaidi 17
  • During training, the loss value goes up and down and cannot converge, is that normal? Besides, what should the final loss value look like?

    I implemented a WaveGlow model in my own project. The code is almost the same as this repo, with some modifications:

    1. Upsample the mel-spectrogram to the number of groups, so n_mel_channels in WN is reduced to 80.
    2. Change logdet() in invertable1x1 to det().abs().log(), as #35 did, because in my first few runs the loss became NaN after thousands of steps.

    n_channels is 256, so the model is 4~5 times smaller than the original. I run the model on two 1080 Ti GPUs using nn.DataParallel with a batch size of 8. After about 5k steps the loss is around -6 to -7 and I can hear some speech-like sentences in the model outputs. Then the loss started to go up and down, even above zero, and would not go any lower. I added 24 flows to the model, which doesn't help; with a batch size of 32 the problem still exists. Maybe it will improve with more steps, but after 70k steps I still cannot see any improvement. Has anyone had similar problems?

    I also want to ask about the final loss as a reference. In my case, -11 is the smallest value I can get before the aforementioned problem happens. In #5, @azraelkuan gets a loss of -18 at 56k; is that a normal loss value?

    opened by yoyololicon 15
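Point 2 in the post above swaps `logdet()` for `det().abs().log()`. The reason this can avoid NaN values is shown here with a 2x2 matrix in plain Python: the real logarithm of a negative determinant is undefined, while the change-of-variables formula only needs the logarithm of its absolute value. The helper names `det2` and `log_abs_det2` are illustrative assumptions and are not taken from the repository.

```python
import math


def det2(w):
    """Determinant of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = w
    return a * d - b * c


def log_abs_det2(w):
    """log|det W| for a 2x2 matrix; defined for any non-singular W."""
    determinant = det2(w)
    if determinant == 0.0:
        raise ValueError("matrix is singular; log|det| is undefined")
    return math.log(abs(determinant))


if __name__ == "__main__":
    # Swapping two rows of a scaled identity flips the sign of the determinant.
    flipped = [[0.0, 2.0], [1.0, 0.0]]
    print("det:", det2(flipped))            # negative, so log(det) is undefined
    print("log|det|:", log_abs_det2(flipped))
```

Taking the absolute value leaves the likelihood unchanged whenever the determinant is positive, so it only changes behavior in the case that would otherwise fail.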
  • problem about mel2samp.py

    This is very nice work, and I'm studying it! But while training the model I found a problem. There may be a mistake in mel2samp.py: it doesn't produce a correct mel-spectrogram. When I use the "mel_spectrograms" that NVIDIA provides, the audio that "waveglow_old.pt" infers is good. But when I use "python mel2samp.py -f test_files.txt -o . -c config.json" to compute the mel-spectrograms, the audio that "waveglow_old.pt" infers is bad: the voice is a man's, not the original woman's.

    The result with the "mel_spectrograms" that NVIDIA provides: [image]

    The result with mel2samp.py: [image]

    This is my audio: my_results_wav.zip

    opened by UESTCgan 14
  • Inference time 3 times slower than real-time on single GTX 1080ti

    I have tested NVIDIA/tacotron2 + WaveGlow using inference.ipynb with the pretrained models on a single GTX 1080 Ti. Inference was 3 times slower than real time: it took 24 s to generate 7 s of speech.

    Is it possible to generate faster than real time on a single GTX 1080 Ti? Thank you.

    Here is a screenshot taken while generating audio: [image: while_generating_nvidia_smi]

    opened by dnnnew 13
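The timing in the post above can be expressed as a real-time factor (synthesis time divided by audio duration) and as an effective sample rate. This is a minimal sketch; the function names are assumptions for illustration, and the 22050 Hz sampling rate is also an assumption, since the post does not state the rate of the generated audio.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Synthesis time per second of audio; values above 1 are slower than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def generated_samples_per_second(synthesis_seconds, audio_seconds, sampling_rate=22050):
    """Audio samples produced per second of wall-clock time.

    sampling_rate defaults to 22050 Hz, which is an assumption here.
    """
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds * sampling_rate / synthesis_seconds


if __name__ == "__main__":
    # Numbers from the post: 24 s of compute for 7 s of speech.
    print(round(real_time_factor(24.0, 7.0), 2))
    print(generated_samples_per_second(24.0, 7.0))
```

Faster-than-real-time synthesis corresponds to a real-time factor below 1, or equivalently an effective rate above the audio sampling rate.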
  • Add CodeQL workflow for GitHub code scanning

    Hi NVIDIA/waveglow!

    This is a one-off automatically generated pull request from LGTM.com. You might have heard that we’ve integrated LGTM’s underlying CodeQL analysis engine natively into GitHub. The result is GitHub code scanning!

    With LGTM fully integrated into code scanning, we are focused on improving CodeQL within the native GitHub code scanning experience. In order to take advantage of current and future improvements to our analysis capabilities, we suggest you enable code scanning on your repository. Please take a look at our blog post for more information.

    This pull request enables code scanning by adding an auto-generated codeql.yml workflow file for GitHub Actions to your repository. Take a look! We tested it before opening this pull request, so all should be working. In fact, you might already have seen some alerts appear on this pull request!

    Where needed and if possible, we’ve adjusted the configuration to the needs of your particular repository. But of course, you should feel free to tweak it further! Check this page for detailed documentation.

    opened by lgtm-com[bot] 0
  • spectrogram (image)-to-wav

    Dear Authors and readers

    I would appreciate it if you would give me an answer to my question:

    Is it possible to convert the spectrogram image (not the array) to wav (reconstruct the wav audio from the spectrogram in image form)?

    opened by ahmeftah 0
  • Update train.py

    When loading the published pre-trained model, the dict contains neither the optimizer nor the iteration, so a KeyError exception is raised. I changed the code to set the iteration to 0 if it is not found, and to return the optimizer itself if it is not in the dict.

    opened by msalhab96 0
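The fallback behavior described in the post above can be sketched with a plain dictionary. The key names `"iteration"` and `"optimizer"` are inferred from the post's wording, and the function name `read_resume_state` is a hypothetical helper; none of this is the repository's actual code.

```python
def read_resume_state(checkpoint_dict, fresh_optimizer_state):
    """Read resume information from a checkpoint dict without raising KeyError.

    Falls back to iteration 0 and to the caller's freshly created optimizer
    state when the checkpoint does not carry those entries.
    """
    iteration = checkpoint_dict.get("iteration", 0)
    optimizer_state = checkpoint_dict.get("optimizer", fresh_optimizer_state)
    return iteration, optimizer_state


if __name__ == "__main__":
    # A checkpoint that only stores model weights (placeholder value).
    weights_only = {"model": "model-weights-placeholder"}
    print(read_resume_state(weights_only, fresh_optimizer_state={"lr": 1e-4}))

    # A checkpoint saved during training that carries everything.
    full = {"model": "model-weights-placeholder", "iteration": 2000, "optimizer": {"lr": 5e-5}}
    print(read_resume_state(full, fresh_optimizer_state={"lr": 1e-4}))
```

Using `dict.get` with a default keeps the same loading path working for both weights-only checkpoints and full training checkpoints.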
  • An important issue on multispeaker inference

    Hello, I have a question about whether WaveGlow can be used in a multispeaker setting. For example, could it be trained on the VCTK dataset and still infer well?
    Has anyone tried this? The existing issues don't have a definite answer. Thank you very much.

    opened by hongchengzhu 0
  • build(deps): bump numpy from 1.13.3 to 1.22.0

    Bumps numpy from 1.13.3 to 1.22.0.

    opened by dependabot[bot] 0
Owner
NVIDIA Corporation