DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.

LMNT

Last update: Jan 3, 2023

Related tags

Deep Learning machine-learning text-to-speech neural-network paper speech pytorch speech-synthesis pretrained-models vocoder diffwave

Overview

DiffWave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It starts with Gaussian noise and converts it into speech via iterative refinement. The speech can be controlled by providing a conditioning signal (e.g. log-scaled Mel spectrogram). The model and architecture details are described in DiffWave: A Versatile Diffusion Model for Audio Synthesis.

What's new (2021-11-09)

unconditional waveform synthesis (thanks to Andrechang!)

What's new (2021-04-01)

fast sampling algorithm based on v3 of the DiffWave paper

What's new (2020-10-14)

new pretrained model trained for 1M steps
updated audio samples with output from new model

Status (2021-11-09)

Big thanks to Zhifeng Kong (lead author of DiffWave) for pointers and bug fixes.

Audio samples

22.05 kHz audio samples

Pretrained models

22.05 kHz pretrained model (31 MB, SHA256: d415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8)

This pre-trained model is able to synthesize speech with a real-time factor of 0.87 (smaller is faster).
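For reference, a real-time factor is conventionally the time spent synthesizing divided by the duration of the audio produced. The sketch below only illustrates that arithmetic; the function name and the timing numbers are assumptions for illustration, not part of the diffwave package.

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate=22050):
    """Return synthesis time divided by audio duration (smaller is faster)."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Hypothetical measurement: 8.7 s of compute for 10 s of 22.05 kHz audio.
rtf = real_time_factor(8.7, num_samples=220500)
print(f"RTF = {rtf:.2f}")
```

A value below 1.0 means the audio is generated faster than it plays back.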

Pre-trained model details

trained on 4x 1080Ti
default parameters
single precision floating point (FP32)
trained on LJSpeech dataset excluding LJ001* and LJ002*
trained for 1000578 steps (1273 epochs)

Install

Install using pip:

pip install diffwave

or from GitHub:

git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .

Training

Before you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono (e.g. LJSpeech, VCTK). By default, this implementation assumes a sample rate of 22.05 kHz. If you need to change this value, edit params.py.

python -m diffwave.preprocess /path/to/dir/containing/wavs
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs

# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all

You should expect to hear intelligible (but noisy) speech by ~8k steps (~1.5h on a 2080 Ti).
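Before preprocessing, it may help to confirm that every file meets the dataset requirements above (16-bit, mono, 22.05 kHz by default). The following is a minimal sketch using only the Python standard library; `find_incompatible_wavs` is a hypothetical helper, not something shipped with diffwave.

```python
import wave
from pathlib import Path


def find_incompatible_wavs(root, sample_rate=22050):
    """Return (path, reason) pairs for .wav files that are not 16-bit mono at sample_rate."""
    problems = []
    for path in sorted(Path(root).rglob("*.wav")):
        try:
            with wave.open(str(path), "rb") as wav:
                channels = wav.getnchannels()
                sample_width = wav.getsampwidth()  # bytes per sample; 2 means 16-bit
                rate = wav.getframerate()
        except (wave.Error, EOFError) as err:
            problems.append((path, f"unreadable: {err}"))
            continue
        if channels != 1:
            problems.append((path, f"{channels} channels, expected mono"))
        elif sample_width != 2:
            problems.append((path, f"{8 * sample_width}-bit, expected 16-bit"))
        elif rate != sample_rate:
            problems.append((path, f"{rate} Hz, expected {sample_rate} Hz"))
    return problems
```

Calling `find_incompatible_wavs("/path/to/dir/containing/wavs")` returns an empty list when every file matches; otherwise each entry names a file and the first mismatch found.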

Multi-GPU training

By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count(). You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
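As an illustration, GPU selection can also be scripted from Python by passing CUDA_VISIBLE_DEVICES (the standard variable honored by the CUDA runtime) to the training process. This is a sketch under that assumption; `build_training_env` and `train_on_gpus` are hypothetical helper names.

```python
import os
import subprocess
import sys


def build_training_env(gpu_ids):
    """Copy the current environment and restrict CUDA to the given GPU indices."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def train_on_gpus(model_dir, data_dir, gpu_ids):
    """Run `python -m diffwave MODEL_DIR DATA_DIR` with only gpu_ids visible."""
    cmd = [sys.executable, "-m", "diffwave", model_dir, data_dir]
    return subprocess.run(cmd, env=build_training_env(gpu_ids), check=True)
```

For example, `train_on_gpus("/path/to/model/dir", "/path/to/dir/containing/wavs", [0, 1])` would expose only the first two GPUs to the training run.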

Inference API

Basic usage:

from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'
spectrogram = ...  # placeholder: supply a spectrogram tensor in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)

# audio is a GPU tensor in [N,T] format.
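To listen to the result, the samples have to be written to an audio file. The sketch below does this with the standard library only, assuming the samples are floats in [-1.0, 1.0] and have already been moved off the GPU into a plain list (for example with `audio[0].cpu().tolist()`); `write_pcm16_wav` is a hypothetical helper, not part of the diffwave API.

```python
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate=22050):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM .wav file."""
    frames = bytearray()
    for x in samples:
        x = max(-1.0, min(1.0, float(x)))  # clip to the valid range
        frames += struct.pack("<h", int(round(x * 32767)))
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
```

Usage would look like `write_pcm16_wav("output.wav", audio[0].cpu().tolist(), sample_rate)`.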

Inference CLI

python -m diffwave.inference --fast /path/to/model /path/to/spectrogram -o output.wav

References

Comments

Help request: trying to figure out how to match up params between TTS and vocoder.

I'm using a fork of https://github.com/Tomiinek/Multilingual_Text_to_Speech as the project https://github.com/CherokeeLanguage/Cherokee-TTS.

The TTS project I'm using shows the audio params below, but I don't know what to change in either the TTS params or the vocoder params to make them match. I'm guessing that hop_samples somehow corresponds to the stft_* settings, but I'm a bit clueless about what I'm looking at. I'm thinking a good start would be to adjust the vocoder settings and train on the domain-specific voices used in the Tacotron training.

TTS Tacotron Settings

    sample_rate = 22050                  # sample rate of source .wavs, used while computing spectrograms, MFCCs, etc.
    num_fft = 1102                       # number of frequency bins used during computation of spectrograms
    num_mels = 80                        # number of mel bins used during computation of mel spectrograms
    num_mfcc = 13                        # number of MFCCs, used just for MCD computation (during training)
    stft_window_ms = 50                  # size in ms of the Hann window of short-time Fourier transform, used during spectrogram computation
    stft_shift_ms = 12.5                 # shift of the window (or better said gap between windows) in ms

diffwave Vocoder Settings

# Data params
    sample_rate=22050,
    n_mels=80,
    n_fft=1024,
    hop_samples=256,

opened by michael-conrad 10
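One way to compare the two sets of settings above is to convert the millisecond values into samples and the sample counts into milliseconds. This sketch assumes the TTS project derives its window and hop sizes by multiplying by the sample rate (how it rounds is not stated in the thread); the helper names are illustrative.

```python
def ms_to_samples(milliseconds, sample_rate):
    """Convert a duration in milliseconds to a (possibly fractional) sample count."""
    return sample_rate * milliseconds / 1000.0


def samples_to_ms(samples, sample_rate):
    """Convert a sample count to a duration in milliseconds."""
    return 1000.0 * samples / sample_rate


sample_rate = 22050

# Tacotron settings from the post, expressed in samples.
tts_window = ms_to_samples(50, sample_rate)   # stft_window_ms
tts_hop = ms_to_samples(12.5, sample_rate)    # stft_shift_ms

# diffwave settings from the post, expressed in milliseconds.
vocoder_window_ms = samples_to_ms(1024, sample_rate)  # n_fft
vocoder_hop_ms = samples_to_ms(256, sample_rate)      # hop_samples

print(f"TTS:     window {tts_window} samples, hop {tts_hop} samples")
print(f"Vocoder: window {vocoder_window_ms:.2f} ms, hop {vocoder_hop_ms:.2f} ms")
```

Under that assumption the two configurations use different frame rates (a hop of about 276 samples versus 256), so spectrograms from one cannot be fed to a vocoder trained with the other without retraining or changing one side's settings.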

Inference

When running inference with the provided LJSpeech pretrained model on one of the reference audio samples, the output is a very low-amplitude sound (almost silence). And when I used a model trained on a custom dataset, inference produced static noise. What could be going wrong?

opened by Pranjalya 10

Adopting diffusion model on TTS

Hi all, I'm currently playing with DiffSinger, a TTS system extended with diffusion models. The naive version consists of encoders (for embedding text and pitch information) and a denoiser, where the encoders' output is used to condition the denoiser. Everything is similar to DiffWave, including the denoiser's structure and prediction, except that the network predicting epsilon changes to epsilon(noisy_spectrogram, encoder_outputs, diffusion_step), compared with DiffWave's epsilon(noisy_audio, upsampled_spectrogram, diffusion_step). While the encoders train successfully, I ran into an issue when training the denoiser. I used LJSpeech. Here is what I did:

1. First, as a preliminary experiment, I checked that all modules work by setting the denoiser to epsilon(noisy_spectrogram, clean_spectrogram, diffusion_step) to predict the noisy_spectrogram.

2. After the model converged, I went back to the denoiser epsilon(noisy_spectrogram, encoder_outputs, diffusion_step) to predict clean_spectrogram. I detached encoder_outputs from autograd before feeding them to the denoiser (to prevent the encoders from being updated), fixing the conditioner so that the model could converge. The model broke when I did not detach (that is, when the encoders were allowed to update during denoiser training).

3. I found that the smaller the range of the conditioner (encoder_outputs) values, the better the evidence of successful training.

Below are the results I've got so far. In each image, the upper panel is the sampled (synthesized) mel-spectrogram and the lower panel is the ground truth.

1. The model converges during the preliminary experiment:

2. When the encoder's output is fed directly to the denoiser (value range: -9.xxx to 6.xxx):

3. When the encoder's output is multiplied by 0.01 to shrink the range:

In case 2, the model shows no sign of training. In contrast, case 3 shows 'some' degree of training, but not what we expected. I double-checked the inference (reverse) part, and it is exactly the same as in case 1 and in DiffWave.

So I'd like to know whether you have any idea what conditions the denoiser's input conditioner must satisfy for training to succeed. Why does the model produce such unsatisfying results above? Am I missing something in how I process the conditioner?

I would appreciate any suggestions or shared experience. Thanks in advance.
opened by keonlee9420 9
Starting an unconditional generation experiment

For unconditional generation, is it valid to change y = self.dilated_conv(y) + conditioner in model.py to y = self.dilated_conv(y)?

And how do I generate samples?

opened by ladium493 7

The audio has some noise.

Hi, thanks for your good work. I trained the model on a single-speaker dataset of 10000 utterances; the loss is shown in the figure. At inference, the audio has clearly audible noise. Is the dataset too small, or are there other reasons?

opened by Ziyan0829 5
High pitched voices when scaling fft size up to 4096

Let me start by saying that this repo is fantastic. I've successfully synthesized voices and would like to experiment with scaling up fft size and other audio parameters.

I'm running with the following:

n_fft: 4096
hop_samples: 256
sample_rate: 32000

I'm able to train and the loss goes down quite a lot, but when I listen to the sample voices they are very high pitched compared to when training with n_fft = 1024. I think somewhere during training the voices are being squeezed together and messing with the pitch.

Are there any modifications that need to be made to make this work? For reference I'm training on the ljspeech dataset.

Thank you!

opened by egaebel 5
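One possible explanation, offered as an assumption and not as a diagnosis confirmed in the thread: LJSpeech is distributed at 22.05 kHz, so if the audio is not resampled and the output is then written or played back at 32 kHz, everything is sped up and shifted upward in pitch. The sketch below computes the size of that shift; the function name is illustrative.

```python
import math


def pitch_shift_semitones(true_rate, playback_rate):
    """Pitch shift, in semitones, when audio sampled at true_rate is played at playback_rate."""
    return 12.0 * math.log2(playback_rate / true_rate)


speedup = 32000 / 22050
shift = pitch_shift_semitones(22050, 32000)
print(f"speed-up x{speedup:.3f}, pitch shift {shift:+.2f} semitones")
```

If that is what is happening, resampling the dataset to the configured sample_rate before preprocessing, or keeping sample_rate at 22050, would be the things to check first.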
Other feature representations besides mel-spect

I'm doing music-related research, and the mel-spectrogram doesn't seem to be the best data representation for the task I'm working on, so I'm considering switching to CQT. I trained DiffWave on music Mel-spectrograms and it yielded very impressive results. I'm wondering whether it makes sense to use an input representation other than Mel-spectrograms, such as CQT? (The representation is informative enough.)

opened by Irislucent 4
Regular sampling and fast sampling not equivalent in unconditional generation
Hi, thank you so much for your implementation.

I trained an unconditional generator. With the default noise schedule, fast sampling gives sensible output during inference, like this:

However, when I set fast_sampling to False, still with the default noise schedule, I got this:

Is this normal? Thanks in advance.

Also, is this setting correct? The maximum beta differs between the two schedules here.

noise_schedule=np.linspace(1e-4, 0.05, 50).tolist()
inference_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5]
opened by gzhu06 4
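When comparing two noise schedules, the largest individual beta is less informative than the cumulative product of (1 - beta), which determines how much of the original signal remains after the last step. The sketch below computes that quantity for the two schedules quoted above in plain Python; the helper names are illustrative, and this is not the repository's sampling code.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (like np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def cumulative_alpha(betas):
    """Running product of (1 - beta) over the schedule."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


noise_schedule = linspace(1e-4, 0.05, 50)
inference_noise_schedule = [0.0001, 0.001, 0.01, 0.05, 0.2, 0.5]

train_final = cumulative_alpha(noise_schedule)[-1]
fast_final = cumulative_alpha(inference_noise_schedule)[-1]
print(f"training schedule:  final cumulative alpha = {train_final:.3f}")
print(f"inference schedule: final cumulative alpha = {fast_final:.3f}")
```

A short schedule needs larger individual betas to reach a similar final noise level in fewer steps, so differing maxima are not by themselves a sign of a misconfiguration.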
questions about the codes.
Thanks for your great work! I've been studying your code and it has been a great help.

I have two questions about your scripts.

1. In learner.py, the variable 'noise_level' (line 50) takes a square root, so it seems to correspond to 'alpha_cumprod_sqrt' in the paper. But in lines 120 and 121 a square root is taken again, so 'noise_scale_sqrt' appears to be square-rooted a second time ('noise_scale' itself seems to equal 'alpha_cumprod_sqrt' in the paper). As a result, I think the input 'noisy_audio' differs from the original paper.

To check whether something is wrong, I trained models both with and without changing the script (removing **0.5 in line 50). The model trained with the unchanged code (same as your scripts) performs better, which suggests your code is right!

I have checked the scripts several times, but the double square root of alpha_cumprod still looks odd to me, so I cannot explain these results.

Could you confirm whether your code matches the paper exactly, or whether I missed something?

2. You used L1 distance for the objective function, but the loss in the paper is Euclidean distance. Is there a reason for using L1? (I know the WaveGrad paper said training with L1 is better!)

Thanks for your work again. :)
opened by GANNNN123 4
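For reference, in the standard DDPM formulation that DiffWave builds on, a noisy sample is formed as x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, where alpha_bar_t is the cumulative product of (1 - beta) and the square root is applied once. The following pure-Python sketch illustrates that formula only; it is not the repository's learner.py, and all names are illustrative.

```python
import math
import random


def alpha_cumprod(betas):
    """Cumulative product of (1 - beta), i.e. alpha_bar_t for each step t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def diffuse(x0, t, alpha_bar, rng):
    """Sample x_t given clean samples x0 at step t; returns (x_t, eps)."""
    signal_scale = math.sqrt(alpha_bar[t])       # square root taken once
    noise_scale = math.sqrt(1.0 - alpha_bar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal_scale * x + noise_scale * e for x, e in zip(x0, eps)]
    return xt, eps


# Same shape as the default schedule: 50 betas from 1e-4 to 0.05.
betas = [1e-4 + i * (0.05 - 1e-4) / 49 for i in range(50)]
alpha_bar = alpha_cumprod(betas)
```

With this parameterization the squared signal and noise scales sum to one at every step, which is a quick check that can be applied to any implementation.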

Trying to use pretrained model but failed

Hi, I'm having trouble using the pre-trained model and would really appreciate your help.

I wanted to check the performance of DiffWave with the pretrained parameters.

Since there was no demo for it, I wrote my own script that loads the pretrained model.

The purpose of the script is to compare the original audio with the audio generated by the pretrained vocoder.

First, I generated a Mel spectrogram from one of the audio samples provided at https://github.com/lmnt-com/diffwave#audio-samples.


import torch
import torchaudio.transforms as T

# Audio downloaded from audio samples
# https://github.com/lmnt-com/diffwave#audio-samples
# (get_speech_sample, print_stats, plot_spectrogram and plot_waveform are
# helper functions that are not shown in this post)
waveform, sample_rate = get_speech_sample()

hop_length = 256

# define transformation
spectrogram = T.MelSpectrogram(
    sample_rate=22050,
    n_fft=1024,
    hop_length=hop_length,
    win_length=hop_length * 4,
    f_min=20.0,
    f_max=sample_rate / 2.,
    n_mels=80
)

# Perform transformation
spec = spectrogram(waveform)
spec = 20 * torch.log10(torch.clamp(spec, min=1e-5)) - 20
spec = torch.clamp((spec + 100) / 100, 0.0, 1.0)

print_stats(spec)
plot_spectrogram(spec[0], title="torchaudio")
plot_waveform(waveform, sample_rate)

Shape: (1, 80, 833)
Dtype: torch.float32
 - Max:      1.000
 - Min:      0.280
 - Mean:     0.698
 - Std Dev:  0.171

tensor([[[0.5525, 0.5410, 0.5013,  ..., 0.4834, 0.5863, 0.6569],
         [0.5485, 0.5346, 0.4632,  ..., 0.4242, 0.6327, 0.6866],
         [0.4129, 0.5611, 0.5924,  ..., 0.4228, 0.6652, 0.7208],
         ...,
         [0.5441, 0.6529, 0.7050,  ..., 0.5078, 0.5972, 0.6283],
         [0.5814, 0.6205, 0.6569,  ..., 0.5178, 0.6150, 0.6492],
         [0.5728, 0.6037, 0.6395,  ..., 0.4996, 0.6498, 0.6952]]])

[Screenshot, 2022-05-12 9:26:58 PM]

Using the resulting spectrogram, spec, I generated an audio file, which should sound similar to the audio above.

from diffwave.inference import predict as diffwave_predict

# Pretrained parameters. given at the 
# https://github.com/lmnt-com/diffwave#pretrained-models
model_dir = './diffwave/' 
spectrogram = spec # get your hands on a spectrogram in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True, 
                                      device='cpu')
plot_waveform(audio, sample_rate)
play_audio(audio, sample_rate)

[Screenshot, 2022-05-12 9:30:41 PM]

However, the result was far from the original. It was unstable and did not sound like the results in the demo.

Is there a problem with my code? Or is there a proper way to use the pre-trained parameters?

I would really appreciate any example code showing how to use the pre-trained model properly.

Thanks.

opened by schinavro 2

How to match tacotron2?

I have another problem: I'm trying to match tacotron2 (https://github.com/begeekmyfriend/tacotron2), but the generated audio contains only noise. The TTS params already match diffwave; the only difference I found is the mel range (the preprocessing differs). Tacotron2's output mel range is [-4, 4], while diffwave's input mel range is [0, 1]. So I tried a few things to solve this problem.

(1) Change only the inference: rescale tacotron2's mel range to [0, 1], as in the figure. The result improved, in that I can hear a human voice and some content, but this approach loses the speaker's timbre and sounds robotic.

(2) Retraining: train diffwave on tacotron2's mels; after 800k steps, the output is still only noise.

(3) Retraining: rescale tacotron2's mel range as in (1) and then train diffwave; after 350k steps, the output is still only noise.

Do you have any good suggestions?
opened by Ziyan0829 2
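The range change described in attempt (1) can be sketched as a plain linear map from [-4, 4] to [0, 1]. This is only an illustration with a hypothetical helper name; matching the numeric range does not by itself guarantee that the two spectrograms encode the same quantity, since the log base, reference level, and mel filterbank settings may still differ between the two preprocessing pipelines.

```python
def rescale(values, src_min=-4.0, src_max=4.0, dst_min=0.0, dst_max=1.0):
    """Linearly map values from [src_min, src_max] to [dst_min, dst_max], clipping outliers."""
    scale = (dst_max - dst_min) / (src_max - src_min)
    return [
        min(dst_max, max(dst_min, dst_min + (v - src_min) * scale))
        for v in values
    ]


# One hypothetical mel frame in Tacotron2's range.
frame = [-4.0, -1.0, 0.0, 2.5, 4.0]
print(rescale(frame))
```

If the robotic timbre persists after such a rescale, comparing the two pipelines' spectrograms for the same audio file would show whether they differ by more than an affine transform.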
How can we achieve conditional waveform synthesis for SC09 dataset.

Hello, since you have already provided conditionally generated examples on your project page, can this repo perform conditional waveform synthesis for the SC09 dataset?

opened by lizaitang 1
Unconditional Generation Training Time
Hi @sharvil @Andrechang @JCBrouwer thanks for this implementation.

My issue is about the training time for unconditional generation. It takes me about 5 hours per epoch on 1x RTX8000, and most of the time is spent in loss.backward(), with the unconditional setting from #5. I wonder:

Is this common?

Do you have any suggestions for speeding it up?

After how many epochs did you start to get good-quality generations?

Thanks in advance.
opened by BenoitWang 0
Why using sub09 for class-based generation or unconditional generation?

Hi, I wonder why a subset of Speech Commands is used for training rather than the whole dataset. In my experiments, the diffusion model cannot handle a dataset of large scale (1M for ~8hours). Did you try the whole dataset, and if so, what were the generation performance and the training cost?

opened by ludanruan 0
Unconditional synthesis

I'm running this command to generate unconditional samples.

python -m diffwave.inference --fast /path/to/model -o output.wav

I've trained for almost 4k epochs on 7k+ sounds. I seem to get the same sound (or a very similar one) regardless of training time.

I have not worked with diffwave before - any tips for debugging this?

Thanks

opened by berkeleymalagon 5
Code for evaluation in paper

I found some automatic evaluation metrics mentioned in the paper. Where can I find these scripts so that I can reproduce the results and compare with other methods?

opened by v-nhandt21 5

Owner

LMNT

GitHub

Unofficial PyTorch Implementation of UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

UnivNet UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation This is an unofficial PyTorch

54 Aug 30, 2021

Fast and Simple Neural Vocoder, the Multiband RNNMS

Multiband RNN_MS Fast and Simple vocoder, Multiband RNN_MS. Demo Quick training How to Use System Details Results References Demo ToDO: Link super gre

5 Jan 11, 2022

A GPU-optional modular synthesizer in pytorch, 16200x faster than realtime, for audio ML researchers.

torchsynth The fastest synth in the universe. Introduction torchsynth is based upon traditional modular synthesis written in pytorch. It is GPU-option

229 Jan 2, 2023

efficient neural audio synthesis in the waveform domain

neural waveshaping synthesis real-time neural audio synthesis in the waveform domain paper • website • colab • audio by Ben Hayes, Charalampos Saitis,

169 Dec 23, 2022

E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation

E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation E2EC: An End-to-End Contour-based Method for High-Quality H

146 Dec 29, 2022

Pre-trained Deep Learning models and demos (high quality and extremely fast)

OpenVINO™ Toolkit - Open Model Zoo repository This repository includes optimized deep learning models and a set of demos to expedite development of hi

3.4k Dec 31, 2022

BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

Bilateral Denoising Diffusion Models (BDDMs) This is the official PyTorch implementation of the following paper: BDDM: BILATERAL DENOISING DIFFUSION M

172 Dec 23, 2022

Chinese Mandarin tts text-to-speech 中文 (普通话) 语音合成 , by fastspeech 2 , implemented in pytorch, using waveglow as vocoder,

Chinese mandarin text to speech based on Fastspeech2 and Unet This is a modification and adpation of fastspeech2 to mandrin(普通话）. Many modifications t

291 Jan 2, 2023

FFTNet vocoder implementation

Unofficial Implementation of FFTNet vocode paper. implement the model. implement tests. overfit on a single batch (sanity check). linearize weights fo

81 Dec 8, 2022

The official implementation of the Interspeech 2021 paper WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution.

WSRGlow The official implementation of the Interspeech 2021 paper WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution. Audio sa

96 Jan 3, 2023

Official implementation of the paper Chunked Autoregressive GAN for Conditional Waveform Synthesis

Chunked Autoregressive GAN (CARGAN) Official implementation of the paper Chunked Autoregressive GAN for Conditional Waveform Synthesis [paper] [compan

150 Dec 6, 2022

Implementation of Common Image Evaluation Metrics by Sayed Nadim (sayednadim.github.io). The repo is built based on full reference image quality metrics such as L1, L2, PSNR, SSIM, LPIPS. and feature-level quality metrics such as FID, IS. It can be used for evaluating image denoising, colorization, inpainting, deraining, dehazing etc. where we have access to ground truth.

Image Quality Evaluation Metrics Implementation of some common full reference image quality metrics. The repo is built based on full reference image q

10 Jan 1, 2023

Neural Nano-Optics for High-quality Thin Lens Imaging

Neural Nano-Optics for High-quality Thin Lens Imaging Project Page | Paper | Data Ethan Tseng, Shane Colburn, James Whitehead, Luocheng Huang, Seung-H

39 Dec 5, 2022

NUANCED is a user-centric conversational recommendation dataset that contains 5.1k annotated dialogues and 26k high-quality user turns.

NUANCED: Natural Utterance Annotation for Nuanced Conversation with Estimated Distributions Overview NUANCED is a user-centric conversational recommen

18 Dec 28, 2021

A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.

About This repository provides data and code for the paper: Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development (subm

86 Dec 7, 2022

Code for the paper SphereRPN: Learning Spheres for High-Quality Region Proposals on 3D Point Clouds Object Detection, ICIP 2021.

SphereRPN Code for the paper SphereRPN: Learning Spheres for High-Quality Region Proposals on 3D Point Clouds Object Detection, ICIP 2021. Authors: Th

15 Dec 2, 2022
