Overview

DiffWave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It starts with Gaussian noise and converts it into speech via iterative refinement. The speech can be controlled by providing a conditioning signal (e.g. log-scaled Mel spectrogram). The model and architecture details are described in DiffWave: A Versatile Diffusion Model for Audio Synthesis.
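The sketch below illustrates that iterative refinement in isolation: a bare-bones DDPM-style reverse loop that starts from Gaussian noise and repeatedly removes the noise predicted by a network conditioned on the spectrogram. It is only an illustration of the idea, not the package's actual sampler; the model signature, the noise schedule, and the simple sigma_t = sqrt(beta_t) choice are all assumptions.

import torch

def reverse_diffusion(model, spectrogram, num_samples, betas):
    # model(audio, spectrogram, t) is assumed to return the predicted noise (epsilon).
    betas = torch.as_tensor(betas, dtype=torch.float32)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    audio = torch.randn(1, num_samples)                       # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps = model(audio, spectrogram, torch.tensor([t]))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        audio = (audio - coef * eps) / torch.sqrt(alphas[t])  # remove predicted noise
        if t > 0:                                             # no noise added at the final step
            audio = audio + torch.sqrt(betas[t]) * torch.randn_like(audio)
    return torch.clamp(audio, -1.0, 1.0)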

What's new (2021-11-09)

  • unconditional waveform synthesis (thanks to Andrechang!)

What's new (2021-04-01)

  • fast sampling algorithm based on v3 of the DiffWave paper

What's new (2020-10-14)

  • new pretrained model trained for 1M steps
  • updated audio samples with output from new model

Status (2021-11-09)

  • fast inference procedure
  • stable training
  • high-quality synthesis
  • mixed-precision training
  • multi-GPU training
  • command-line inference
  • programmatic inference API
  • PyPI package
  • audio samples
  • pretrained models
  • unconditional waveform synthesis

Big thanks to Zhifeng Kong (lead author of DiffWave) for pointers and bug fixes.

Audio samples

22.05 kHz audio samples

Pretrained models

22.05 kHz pretrained model (31 MB, SHA256: d415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8)

This pre-trained model is able to synthesize speech with a real-time factor of 0.87 (smaller is faster).

Pre-trained model details

  • trained on 4x 1080Ti
  • default parameters
  • single precision floating point (FP32)
  • trained on LJSpeech dataset excluding LJ001* and LJ002*
  • trained for 1000578 steps (1273 epochs)

Install

Install using pip:

pip install diffwave

or from GitHub:

git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .

Training

Before you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono (e.g. LJSpeech, VCTK). By default, this implementation assumes a sample rate of 22.05 kHz. If you need to change this value, edit params.py.
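For reference, the audio-related fields in params.py look roughly like the sketch below (the field names are taken from the vocoder settings quoted in the comments further down; treat this as an illustration of what to edit, not the file's actual contents):

# A sketch of the audio-related values defined in diffwave/params.py.
# Edit the real file so these match your dataset.
data_params = dict(
    sample_rate=22050,   # sample rate of the training .wav files
    n_mels=80,           # mel bins in the conditioning spectrogram
    n_fft=1024,          # STFT size used during preprocessing
    hop_samples=256,     # hop length in samples between spectrogram frames
)

Once the parameters match your data, preprocess the .wav files and start training: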

python -m diffwave.preprocess /path/to/dir/containing/wavs
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs

# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all

You should expect to hear intelligible (but noisy) speech by ~8k steps (~1.5h on a 2080 Ti).

Multi-GPU training

By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count(). You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
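For example, to restrict training to the first two GPUs (this relies on the standard CUDA/PyTorch handling of CUDA_VISIBLE_DEVICES; the paths are placeholders as above):

CUDA_VISIBLE_DEVICES=0,1 python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs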

Inference API

Basic usage:

import numpy as np
import torch

from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'
# Get your hands on a spectrogram in [N,C,W] format; for example, load a mel
# spectrogram saved as a NumPy array (e.g. one produced by diffwave.preprocess).
spectrogram = torch.from_numpy(np.load('/path/to/spectrogram.npy'))  # [C, W]
spectrogram = spectrogram.unsqueeze(0)                               # -> [N, C, W]
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)

# audio is a GPU tensor in [N,T] format.
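To listen to or keep the result, one option (an illustrative snippet, not part of the diffwave API) is to move the tensor to the CPU and write it out with torchaudio:

import torchaudio

# audio is a [N, T] GPU tensor; save the first item in the batch as a mono .wav file.
torchaudio.save('output.wav', audio[0:1].cpu(), sample_rate)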

Inference CLI

python -m diffwave.inference --fast /path/to/model /path/to/spectrogram -o output.wav

References

  • Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro. DiffWave: A Versatile Diffusion Model for Audio Synthesis. ICLR 2021. arXiv:2009.09761.

Comments
  • help request. trying to figure out how to match up params for TTS to Vocoder.

    I'm using a fork of https://github.com/Tomiinek/Multilingual_Text_to_Speech as the project https://github.com/CherokeeLanguage/Cherokee-TTS.

    The TTS project I'm using shows the audio params below, but I don't know what to change in either the TTS params or the vocoder params to have them match up. I'm guessing hop_samples somehow corresponds to the stft_* settings, but I'm a bit clueless as to what I'm looking at. I'm thinking it would be a good start to adjust the vocoder settings and train on the domain-specific voices being used in the Tacotron training.

    TTS Tacotron Settings

        sample_rate = 22050                  # sample rate of source .wavs, used while computing spectrograms, MFCCs, etc.
        num_fft = 1102                       # number of frequency bins used during computation of spectrograms
        num_mels = 80                        # number of mel bins used during computation of mel spectrograms
        num_mfcc = 13                        # number of MFCCs, used just for MCD computation (during training)
        stft_window_ms = 50                  # size in ms of the Hann window of short-time Fourier transform, used during spectrogram computation
        stft_shift_ms = 12.5                 # shift of the window (or better said gap between windows) in ms   
    

    diffwave Vocoder Settings

    # Data params
        sample_rate=22050,
        n_mels=80,
        n_fft=1024,
        hop_samples=256,
    
    opened by michael-conrad 10
  • Inference

    When running inference with the provided LJSpeech pretrained model and one of the reference audio samples, the output is a very low-amplitude sound (almost silence). And when I used a model trained on a custom dataset, the inference result was static noise. What could be going wrong?

    opened by Pranjalya 10
  • Adopting diffusion model on TTS

    Hi all, I'm currently playing with DiffSinger, which is a TTS system extended with diffusion models. The naive version consists of encoders (for embedding text and pitch information) and a denoiser, where the encoders' output is used to condition the denoiser. Everything is similar to DiffWave, including the denoiser's structure and prediction, except that the network predicting epsilon becomes epsilon(noisy_spectrogram, encoder_outputs, diffusion_step) instead of DiffWave's epsilon(noisy_audio, upsampled_spectrogram, diffusion_step). While the encoders train successfully, I ran into an issue when training the denoiser. I used LJSpeech. Here is what I did:

    1. First of all, as a preliminary experiment, I checked that all modules work by setting the denoiser to epsilon(noisy_spectrogram, clean_spectrogram, diffusion_step) to predict the noisy_spectrogram.
    2. After that model converged, I went back to the denoiser epsilon(noisy_spectrogram, encoder_outputs, diffusion_step) to predict the clean_spectrogram. I detached encoder_outputs from autograd when feeding it to the denoiser (to prevent the encoders from updating), so that the conditioner stays fixed while the denoiser converges. The model broke when I didn't detach (i.e. allowed the encoders to be updated during denoiser training).
    3. I found that when the range of the conditioner (encoder_outputs) values is smaller, the model shows better evidence of successful training.

    Below are the results I've got so far. In each image, the upper plot is the sampled (synthesized) mel spectrogram and the lower plot is the ground truth.

    1. The model converges during the preliminary experiment (image attached).
    2. When the encoder's output is fed directly into the denoiser (value range: -9.xxx to 6.xxx) (image attached).
    3. When the encoder's output is multiplied by 0.01 to shrink the range (image attached).

    For case 2, there is no sign of training at all. In contrast, case 3 shows 'some' level of training, but it is not what we expected. I double-checked the inference part (reverse process), but it is exactly the same as in case 1 and in DiffWave.

    So I just want to know whether you have any idea about the conditions under which the denoiser's input conditioner works. Why does the model show such unsatisfying results above? Am I missing something in how the conditioner should be processed?

    I would appreciate any suggestions or shared experience. Thanks in advance.

    opened by keonlee9420 9
  • Starting an unconditional generation experiment

    For unconditional generation, is it enough to change y = self.dilated_conv(y) + conditioner in model.py to y = self.dilated_conv(y)?

    And how to generate samples?

    opened by ladium493 7
  • The audio have some noise.

    Hi, thanks for your good work. I trained the model on a single-speaker dataset with 10000 utterances; the loss is shown in the attached figure. At inference time, the audio has some clearly audible noise. Is this dataset too small, or are there other reasons?

    opened by Ziyan0829 5
  • High pitched voices when scaling fft size up to 4096

    Let me start by saying that this repo is fantastic. I've successfully synthesized voices and would like to experiment with scaling up fft size and other audio parameters.

    I'm running with the following:

    n_fft: 4096
    hop_samples: 256
    sample_rate: 32000

    I'm able to train and the loss goes down quite a lot, but when I listen to the sample voices they are very high pitched compared to when training with n_fft = 1024. I think somewhere during training the voices are being squeezed together and messing with the pitch.

    Are there any modifications that need to be made to make this work? For reference I'm training on the ljspeech dataset.

    Thank you!

    opened by egaebel 5
  • Other feature representations besides mel-spect

    I'm doing music-related research, and the mel spectrogram doesn't seem to be the best data representation for the task I'm handling, so I'm considering switching to CQT. I trained DiffWave on music mel spectrograms and it yielded very impressive results. I'm wondering whether it makes sense to use input representations other than mel spectrograms, such as CQT, provided the representation carries enough information?

    opened by Irislucent 4
  • Regular sampling and fast sampling not equivalent in unconditional generation

    Hi, thank you so much for your implementation.

    I trained an unconditional generator. Fast sampling produces sensible output at inference time with the default noise schedule (first screenshot attached).

    However, when I set fast_sampling to False, still with the default noise schedule, I got the result shown in the second screenshot (attached).

    Is this normal? Thanks in advance.

    Also, is this setting correct? The maximum beta values in the two schedules differ here.

    noise_schedule=np.linspace(1e-4, 0.05, 50).tolist()
    inference_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5]
    
    opened by gzhu06 4
  • questions about the codes.

    Thanks for your great work! I'm studying your code and it has been a great help to me.

    By the way, I have two questions about your scripts.

    In learner.py, the variable 'noise_level' (line 50) takes a square root, so it seems to correspond to 'alpha_cumprod_sqrt' in the paper. But in lines 120-121 a square root is taken again, so the variable 'noise_scale_sqrt' seems to be the square root applied a second time (and 'noise_scale' seems to equal 'alpha_cumprod_sqrt' in the paper). So I thought the input 'noisy_audio' is different from the original paper.

    To confirm whether something is wrong, I trained models both with and without changing the script (removing **0.5 in line 50), and I found that the model without the change (i.e. your original script) performs better, which means your code is right!

    I have checked the scripts several times, but I still find something weird (alpha_cumprod has its square root taken twice), so I cannot understand the results.

    May I ask you to confirm whether your code is exactly the same as the paper, and whether I missed anything?

    You used the L1 distance for the objective function, but in the paper the loss is the Euclidean distance. Is there any reason for using the L1 distance? (I know the WaveGrad paper said training with L1 is better!)

    Thanks for your work again. :)

    opened by GANNNN123 4
  • Trying to use pretrained model but failed

    Hi, I'm having trouble using the pretrained model and would really appreciate your help.

    I wanted to check the performance of DiffWave with the pretrained parameters.

    Since there was no demo for it, I wrote my own script that imports the pretrained model.

    The purpose of the script is to compare the original audio with the audio generated by the pretrained vocoder.

    First, I generated a mel spectrogram from one of the audio samples provided at https://github.com/lmnt-com/diffwave#audio-samples.

    
    import torch
    import torchaudio.transforms as T

    # Audio downloaded from the audio samples page:
    # https://github.com/lmnt-com/diffwave#audio-samples
    # (get_speech_sample, print_stats, plot_spectrogram and plot_waveform are
    # helper functions defined elsewhere in my script.)
    waveform, sample_rate = get_speech_sample()

    # Define the transformation (parameters chosen to match the diffwave settings).
    hop_length = 256
    spectrogram = T.MelSpectrogram(
        sample_rate=22050,
        n_fft=1024,
        hop_length=hop_length,
        win_length=hop_length * 4,
        f_min=20.0,
        f_max=sample_rate / 2.0,
        n_mels=80,
    )

    # Perform the transformation and map the log-magnitudes into [0, 1].
    spec = spectrogram(waveform)
    spec = 20 * torch.log10(torch.clamp(spec, min=1e-5)) - 20
    spec = torch.clamp((spec + 100) / 100, 0.0, 1.0)

    print_stats(spec)
    plot_spectrogram(spec[0], title="torchaudio")
    plot_waveform(waveform, sample_rate)
    
    Shape: (1, 80, 833)
    Dtype: torch.float32
     - Max:      1.000
     - Min:      0.280
     - Mean:     0.698
     - Std Dev:  0.171
    
    tensor([[[0.5525, 0.5410, 0.5013,  ..., 0.4834, 0.5863, 0.6569],
             [0.5485, 0.5346, 0.4632,  ..., 0.4242, 0.6327, 0.6866],
             [0.4129, 0.5611, 0.5924,  ..., 0.4228, 0.6652, 0.7208],
             ...,
             [0.5441, 0.6529, 0.7050,  ..., 0.5078, 0.5972, 0.6283],
             [0.5814, 0.6205, 0.6569,  ..., 0.5178, 0.6150, 0.6492],
             [0.5728, 0.6037, 0.6395,  ..., 0.4996, 0.6498, 0.6952]]])
    

    [screenshot of the plotted spectrogram and waveform]

    Using the created spectrogram, spec, I generated an audio file, which should sound similar to the audio above.

    from diffwave.inference import predict as diffwave_predict
    
    # Pretrained parameters, downloaded from
    # https://github.com/lmnt-com/diffwave#pretrained-models
    model_dir = './diffwave/' 
    spectrogram = spec # get your hands on a spectrogram in [N,C,W] format
    audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True, 
                                          device='cpu')
    plot_waveform(audio, sample_rate)
    play_audio(audio, sample_rate)
    

    [screenshot of the generated waveform]

    However, the result was far from the original. It was unstable and doesn't sound like the demo samples.

    Is there any problem with my code, or is there a proper way of using the pretrained parameters?

    I would really appreciate any example code showing how to use the pretrained model properly.

    Thanks.

    opened by schinavro 2
  • How to match tacotron2?

    I have another problem: I'm trying to match tacotron2 (https://github.com/begeekmyfriend/tacotron2), but the generated audio contains only noise. The TTS params already match diffwave; I found that the only difference is the mel range (the preprocessing is different). Tacotron2's output mel range is [-4, 4], while diffwave's input mel range is [0, 1]. So I tried a few things to solve this problem.

    • Only changing inference: rescale tacotron's mel output to [0, 1], as in the attached figure. The result is better in that I can hear a human voice and some content, but this way loses the speaker's timbre and sounds like a machine.

    • Retraining: use tacotron2's mels to train diffwave; after 800k steps, the output is still only noise.

    • Retraining: rescale tacotron2's mel range as in (1) and then train diffwave; after 350k steps, the output is still only noise.

    Do you have any good suggestions?

    opened by Ziyan0829 2
  • How can we achieve conditional waveform synthesis for SC09 dataset.

    Hello, since you have already given conditionally generated examples on your project page, can this repo achieve conditional waveform synthesis for the SC09 dataset?

    opened by lizaitang 1
  • Unconditional Generation Training Time

    Hi @sharvil @Andrechang @JCBrouwer thanks for this implementation.

    My issue is about the training time for unconditional generation. It takes me about 5 hours/epoch on a single RTX 8000 and most of the time is spent in loss.backward(). With the unconditional setting from #5, I wonder:

    1. Is this common?
    2. Any suggestions for acceleration please?
    3. After how many epochs did you start to get good-quality generations?

    Thanks in advance.

    opened by BenoitWang 0
  • Why using sub09 for class-based generation or unconditional generation?

    Hi, I wonder why you train on a subset of Speech Commands rather than the whole dataset. In my experiments, the diffusion model cannot handle a dataset at a large scale (1M for ~8 hours). Did you try the whole dataset, and what were the generation performance and training cost?

    opened by ludanruan 0
  • Unconditional synthesis

    I"m running the this command to generate unconditional samples.

    python -m diffwave.inference --fast /path/to/model -o output.wav

    I've trained for almost 4k epochs on 7k+ sounds. I seem to get the same sound (or a very similar one) regardless of training time.

    I have not worked with diffwave before - any tips for debugging this?

    Thanks

    opened by berkeleymalagon 5
  • Code for evaluation in paper

    I found some automatic evaluation metrics mentioned in the paper. Where can I find the corresponding scripts so that I can reproduce the results and compare with other methods?

    opened by v-nhandt21 5