DiffWave
DiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It starts with Gaussian noise and converts it into speech via iterative refinement. The speech can be controlled by providing a conditioning signal (e.g. log-scaled Mel spectrogram). The model and architecture details are described in DiffWave: A Versatile Diffusion Model for Audio Synthesis.
What's new (2021-11-09)
- unconditional waveform synthesis (thanks to Andrechang!)
What's new (2021-04-01)
- fast sampling algorithm based on v3 of the DiffWave paper
What's new (2020-10-14)
- new pretrained model trained for 1M steps
- updated audio samples with output from new model
Status (2021-11-09)
- fast inference procedure
- stable training
- high-quality synthesis
- mixed-precision training
- multi-GPU training
- command-line inference
- programmatic inference API
- PyPI package
- audio samples
- pretrained models
- unconditional waveform synthesis
Big thanks to Zhifeng Kong (lead author of DiffWave) for pointers and bug fixes.
Audio samples
Pretrained models
22.05 kHz pretrained model (31 MB, SHA256: d415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8)
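After downloading the checkpoint, it may be worth confirming that it matches the published checksum. The following is a minimal sketch using only the Python standard library; the helper names and the file name in the usage note are illustrative and not part of the diffwave package.

```python
import hashlib

# Checksum published above for the 22.05 kHz pretrained model.
EXPECTED_SHA256 = "d415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest matches the expected checksum."""
    return sha256_of_file(path) == expected.lower()
```

For example, verify_checkpoint("weights.pt") should return True for an intact download, where "weights.pt" stands in for wherever the file was saved.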
This pretrained model can synthesize speech with a real-time factor of 0.87 (synthesis time divided by audio duration; smaller is faster).
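To measure the real-time factor on your own hardware, one option is to time a synthesis call and divide by the duration of the audio it produced. This sketch assumes the usual definition (synthesis time over audio duration); the helper names are illustrative and not part of the diffwave package.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate=22050):
    """Synthesis time divided by the duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

When timing GPU inference, the elapsed time is only meaningful if the device has finished its work, so calling torch.cuda.synchronize() inside the timed function before it returns is a reasonable precaution.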
Pre-trained model details
- trained on 4x 1080Ti
- default parameters
- single precision floating point (FP32)
- trained on LJSpeech dataset excluding LJ001* and LJ002*
- trained for 1000578 steps (1273 epochs)
Install
Install using pip:
pip install diffwave
or from GitHub:
git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .
Training
Before you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono (e.g. LJSpeech, VCTK). By default, this implementation assumes a sample rate of 22.05 kHz. If you need to change this value, edit params.py.
python -m diffwave.preprocess /path/to/dir/containing/wavs
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs
# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all
You should expect to hear intelligible (but noisy) speech by ~8k steps (~1.5h on a 2080 Ti).
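Before preprocessing, it can help to confirm that every .wav file in the dataset is 16-bit mono at the expected sample rate. The sketch below uses the standard-library wave module, which only reads PCM files (other encodings raise wave.Error); check_wav and check_dataset are illustrative names, not part of the diffwave package.

```python
import wave
from pathlib import Path


def check_wav(path, expected_rate=22050):
    """Return a list of problems found in one .wav file (empty if none)."""
    problems = []
    with wave.open(str(path), "rb") as w:
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels, expected mono")
        if w.getsampwidth() != 2:
            problems.append(f"{8 * w.getsampwidth()}-bit samples, expected 16-bit")
        if w.getframerate() != expected_rate:
            problems.append(f"{w.getframerate()} Hz, expected {expected_rate} Hz")
    return problems


def check_dataset(root, expected_rate=22050):
    """Map each non-conforming .wav under root (searched recursively) to its problems."""
    report = {}
    for path in sorted(Path(root).rglob("*.wav")):
        problems = check_wav(path, expected_rate)
        if problems:
            report[str(path)] = problems
    return report
```

An empty dictionary from check_dataset("/path/to/dir/containing/wavs") suggests the files match the defaults; if the sample rate was changed in params.py, pass the same value as expected_rate.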
Multi-GPU training
By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count(). You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
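As an illustration of restricting training to specific GPUs, the sketch below builds an environment with the standard CUDA_VISIBLE_DEVICES variable (the one the CUDA runtime honors) and the training command from the Training section. The helper names are hypothetical, and launching through subprocess is just one possible approach.

```python
import os
import sys


def training_env(gpu_ids, base_env=None):
    """Copy an environment and restrict the visible GPUs to gpu_ids."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def training_command(model_dir, data_dir):
    """Build the training invocation shown in the Training section."""
    return [sys.executable, "-m", "diffwave", model_dir, data_dir]


# Example launch on GPUs 0 and 2 (not executed here):
# import subprocess
# subprocess.run(
#     training_command("/path/to/model/dir", "/path/to/dir/containing/wavs"),
#     env=training_env([0, 2]),
#     check=True,
# )
```

With this environment, torch.cuda.device_count() in the child process should report 2, and the selected devices are renumbered from 0.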
Inference API
Basic usage:
from diffwave.inference import predict as diffwave_predict
model_dir = '/path/to/model/dir'
spectrogram = ...  # placeholder: supply a spectrogram in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)
# audio is a GPU tensor in [N,T] format.
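To write the returned audio to disk without extra dependencies, one possibility is to move a single waveform to the CPU and save it as 16-bit PCM. This is a sketch that assumes the samples are floats in [-1, 1]; write_wav_16bit is an illustrative helper, not part of the diffwave package.

```python
import struct
import wave


def write_wav_16bit(path, samples, sample_rate):
    """Write float samples in [-1, 1] as a 16-bit mono PCM .wav file."""
    pcm = bytearray()
    for x in samples:
        x = max(-1.0, min(1.0, float(x)))  # clip to avoid integer overflow
        pcm += struct.pack("<h", int(round(x * 32767)))
    with wave.open(str(path), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(bytes(pcm))
```

For the first item in a batch, this could be called as write_wav_16bit("output.wav", audio[0].cpu().tolist(), sample_rate).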
Inference CLI
python -m diffwave.inference --fast /path/to/model /path/to/spectrogram -o output.wav