Unofficial Pytorch Implementation of WaveGrad2

MINDs Lab

Last update: Nov 29, 2022


Overview

WaveGrad 2 — Unofficial PyTorch Implementation

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Unofficial PyTorch + Lightning implementation of WaveGrad 2 by Chen et al. (JHU, Google Brain).
Audio Samples: https://mindslab-ai.github.io/wavegrad2/

TODO

More training for WaveGrad-Base setup
Checkpoint release
WaveGrad-Large Decoder
Inference by reduced sampling steps

Requirements

PyTorch
PyTorch-Lightning==1.2.10
The requirements are listed in requirements.txt.
We also provide a Docker setup; see Dockerfile.

Datasets

The supported datasets are

LJSpeech: a single-speaker English dataset consisting of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
AISHELL-3: a Mandarin TTS dataset with 218 male and female speakers, roughly 85 hours in total.
etc.

We take LJSpeech as an example hereafter.

Preprocessing

Adjust preprocess.yaml, especially the path section.

path:
  corpus_path: '/DATA1/LJSpeech-1.1' # LJSpeech corpus path
  lexicon_path: 'lexicon/librispeech-lexicon.txt'
  raw_path: './raw_data/LJSpeech'
  preprocessed_path: './preprocessed_data/LJSpeech'

Run prepare_align.py to prepare the corpus for alignment.

python prepare_align.py -c preprocess.yaml

Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments for the LJSpeech and AISHELL-3 datasets are provided here. You have to unzip the files in preprocessed_data/LJSpeech/TextGrid/.
After that, run preprocess.py.

python preprocess.py -c preprocess.yaml
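Because preprocess.py depends on the unzipped alignments, it can help to confirm that the TextGrid files are where the script expects them before running it. The following is a minimal sketch and not part of the repository: the helper name `count_textgrids` is hypothetical, and it assumes the alignments use the `.TextGrid` extension and may sit in per-speaker subfolders.

```python
from pathlib import Path


def count_textgrids(textgrid_dir):
    """Count .TextGrid files under textgrid_dir, including subfolders.

    Hypothetical helper; returns 0 if the directory does not exist.
    """
    root = Path(textgrid_dir)
    if not root.is_dir():
        return 0
    return sum(1 for _ in root.rglob("*.TextGrid"))


if __name__ == "__main__":
    n = count_textgrids("preprocessed_data/LJSpeech/TextGrid")
    print(f"Found {n} TextGrid files")
```

LJSpeech contains 13100 clips, so a count of zero or far below that number suggests the archive was unzipped into a different folder.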

Alternatively, you can align the corpus yourself.
Download the official MFA package and run it to align the corpus.

./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech

or

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech

And then run preprocess.py.

python preprocess.py -c preprocess.yaml

Training

Adjust hparameter.yaml, especially the train section.

train:
  batch_size: 12 # Dependent on GPU memory size
  adam:
    lr: 3e-4
    weight_decay: 1e-6
  decay:
    rate: 0.05
    start: 25000
    end: 100000
  num_workers: 16 # Dependent on CPU cores
  gpus: 2 # number of GPUs
  loss_rate:
    dur: 1.0
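The `decay` block is not explained further in this README. One plausible reading, stated here as an assumption rather than the repository's confirmed behavior, is an exponential schedule: the learning rate stays at `adam.lr` until step `start`, then shrinks smoothly until it reaches `rate * lr` at step `end`. A minimal sketch of that interpretation (the function name `lr_multiplier` is hypothetical):

```python
def lr_multiplier(step, rate=0.05, start=25000, end=100000):
    """Multiplier applied to the base learning rate at a given step.

    Assumed schedule: 1.0 before `start`, `rate` after `end`,
    exponential interpolation in between.
    """
    progress = (step - start) / (end - start)
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return rate ** progress


base_lr = 3e-4
for step in (0, 25000, 62500, 100000, 200000):
    print(step, base_lr * lr_multiplier(step))
```

Under this assumption the learning rate settles at 3e-4 * 0.05 = 1.5e-5 after step 100000. A function of this shape can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`.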

If you want to train with another dataset, adjust the data section in hparameter.yaml.

data:
  lang: 'eng'
  text_cleaners: ['english_cleaners'] # korean_cleaners, english_cleaners, chinese_cleaners
  speakers: ['LJSpeech']
  train_dir: 'preprocessed_data/LJSpeech'
  train_meta: 'train.txt'  # relative path of metadata file from train_dir
  val_dir: 'preprocessed_data/LJSpeech'
  val_meta: 'val.txt'  # relative path of metadata file from val_dir
  lexicon_path: 'lexicon/librispeech-lexicon.txt'

Run trainer.py.

python trainer.py

If you want to resume training from a checkpoint, see the options defined by the argument parser in trainer.py.

import argparse

parser = argparse.ArgumentParser()
# Epoch number of the checkpoint to resume from
parser.add_argument('-r', '--resume_from', type=int, required=False,
                    help="Resume Checkpoint epoch number")
parser.add_argument('-s', '--restart', action="store_true", required=False,
                    help="Significant change occurred, use this")
parser.add_argument('-e', '--ema', action="store_true", required=False,
                    help="Start from ema checkpoint")
args = parser.parse_args()
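To see how these flags combine, the parser can be exercised with an explicit argument list instead of the real command line. This sketch rebuilds the same three options; the example value (epoch 10) is purely illustrative.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-r', '--resume_from', type=int, required=False,
                    help="Resume Checkpoint epoch number")
parser.add_argument('-s', '--restart', action="store_true", required=False,
                    help="Significant change occurred, use this")
parser.add_argument('-e', '--ema', action="store_true", required=False,
                    help="Start from ema checkpoint")

# Equivalent to: python trainer.py -r 10 -e
args = parser.parse_args(['-r', '10', '-e'])
print(args.resume_from, args.restart, args.ema)

# With no flags, resume_from defaults to None and both switches are False
fresh = parser.parse_args([])
print(fresh.resume_from, fresh.restart, fresh.ema)
```

Going by the help strings, `python trainer.py -r 10 -e` would request a resume from the epoch-10 checkpoint, starting from its EMA weights.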

During training, the TensorBoard logger records the loss, spectrograms, and audio.

tensorboard --logdir=./tensorboard --bind_all

Inference

Run inference.py.

python inference.py -c <checkpoint_path> --text <'text'>

Or you can run inference.ipynb.

Checkpoint file will be released!

Note

Since this repo is an unofficial implementation and the WaveGrad 2 paper does not provide several details, slight differences from the paper may exist.
We list our modifications and arbitrary setups below.

Normal LSTM without ZoneOut is applied for encoder.
g2p_en is applied instead of Google's unknown G2P.
Trained with the LJSpeech dataset instead of Google's proprietary dataset.
- Due to dataset replacement, output audio's sampling rate becomes 22.05kHz instead of 24kHz.
MT + SpecAug are not implemented.
Hyperparameters:
- train.batch_size: 12 for 2 A100 (40GB) GPUs
- train.adam.lr: 3e-4 and train.adam.weight_decay: 1e-6
- train.decay: learning rate decay is applied during training
- train.loss_rate: 1 as total_loss = 1 * L1_loss + 1 * duration_loss
- ddpm.ddpm_noise_schedule: torch.linspace(1e-6, 0.01, hparams.ddpm.max_step)
- encoder.channel is reduced to 512 from 1024 or 2048
Current sample page only contains samples from WaveGrad-Base decoder.
The items listed under TODO are not yet done.
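The linear `ddpm.ddpm_noise_schedule` listed above determines how much noise each diffusion step mixes in. The sketch below uses plain Python instead of torch so that it runs without dependencies. It follows the standard DDPM definitions from Ho et al. (alpha_n = 1 - beta_n, with alpha_bar_n as the cumulative product). The value `max_step=1000` is an assumption for illustration, since the value of `hparams.ddpm.max_step` is not stated here, and the function names are hypothetical.

```python
import math


def linear_betas(beta_start=1e-6, beta_end=0.01, max_step=1000):
    """Evenly spaced betas, mirroring torch.linspace(beta_start, beta_end, max_step)."""
    if max_step == 1:
        return [beta_start]
    delta = (beta_end - beta_start) / (max_step - 1)
    return [beta_start + i * delta for i in range(max_step)]


def noise_levels(betas):
    """Return sqrt(alpha_bar_n) per step, where alpha_bar_n = prod(1 - beta_i)."""
    levels, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        levels.append(math.sqrt(alpha_bar))
    return levels


betas = linear_betas()          # max_step=1000 is an assumed value
levels = noise_levels(betas)

# A noisy sample at step n is: level * clean + sqrt(1 - level**2) * gaussian_noise
print(f"first step: signal scale {levels[0]:.6f}")
print(f"last step:  signal scale {levels[-1]:.6f}")
```

The signal scale starts close to 1 and decreases monotonically, so late steps are dominated by noise. This is the regime from which iterative refinement starts at inference time.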

Tree

.
├── Dockerfile
├── README.md
├── dataloader.py
├── docs
│   ├── spec.png
│   ├── tb.png
│   └── tblogger.png
├── hparameter.yaml
├── inference.py
├── lexicon
│   ├── librispeech-lexicon.txt
│   └── pinyin-lexicon-r.txt
├── lightning_model.py
├── model
│   ├── base.py
│   ├── downsampling.py
│   ├── encoder.py
│   ├── gaussian_upsampling.py
│   ├── interpolation.py
│   ├── layers.py
│   ├── linear_modulation.py
│   ├── nn.py
│   ├── resampling.py
│   ├── upsampling.py
│   └── window.py
├── prepare_align.py
├── preprocess.py
├── preprocess.yaml
├── preprocessor
│   ├── ljspeech.py
│   └── preprocessor.py
├── text
│   ├── __init__.py
│   ├── cleaners.py
│   ├── cmudict.py
│   ├── numbers.py
│   └── symbols.py
├── trainer.py
├── utils
│   ├── mel.py
│   ├── stft.py
│   ├── tblogger.py
│   └── utils.py
└── wavegrad2_tester.ipynb

Author

This code is implemented by

Seungu Han at MINDs Lab
Junhyeok Lee at MINDs Lab

Special thanks to

Kang-wook Kim at MINDs Lab
Wonbin Jung at MINDs Lab
Sang Hoon Woo at MINDs Lab

References

Chen et al., WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Chen et al., WaveGrad: Estimating Gradients for Waveform Generation
Ho et al., Denoising Diffusion Probabilistic Models
Shen et al., Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

This implementation uses code from following repositories:

The webpage for the audio samples uses a template from:

WaveGrad2 Official Github.io

The audio samples on our webpage (TBD) are partially derived from:

LJSpeech: a single-speaker English dataset consisting of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
WaveGrad2 Official Github.io


Comments

Missing key fine_tuning while training

Here is the log:

  File "/home/tts/wavegrad2/dataloader.py", line 26, in __init__
    self.meta = self.load_metadata(metadata_path)
  File "/home/tts/wavegrad2/dataloader.py", line 94, in load_metadata
    if self.hparams.train.fine_tuning and stripped[2] != self.hparams.train.tuning_speaker:
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 353, in __getattr__
    self._format_and_raise(
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/base.py", line 190, in _format_and_raise
    format_and_raise(
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 821, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 719, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 438, in _get_impl
    node = self._get_node(key=key, throw_on_missing_key=True)
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 470, in _get_node
    raise ConfigKeyError(f"Missing key {key}")
omegaconf.errors.ConfigAttributeError: Missing key fine_tuning
    full_key: train.fine_tuning
    object_type=dict
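The traceback indicates that dataloader.py reads `train.fine_tuning` and `train.tuning_speaker`, while the `train` section shown under Training defines neither key. One possible workaround, offered as an assumption rather than a confirmed fix, is to give both keys explicit defaults. The sketch below illustrates the idea with a plain nested dict; the helper name `with_fine_tuning_defaults` and the default values are hypothetical.

```python
def with_fine_tuning_defaults(hparams):
    """Return a copy of hparams whose train section has the fine-tuning keys.

    Defaults are assumptions: fine-tuning off, no speaker selected.
    """
    train = dict(hparams.get("train", {}))
    train.setdefault("fine_tuning", False)
    train.setdefault("tuning_speaker", None)
    return {**hparams, "train": train}


hparams = {"train": {"batch_size": 12, "gpus": 2}}
patched = with_fine_tuning_defaults(hparams)
print(patched["train"]["fine_tuning"])
```

In hparameter.yaml this would correspond to adding `fine_tuning: False` under `train:`. With fine-tuning disabled, the `and` in the condition at dataloader.py line 94 short-circuits, so `tuning_speaker` is never compared.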

Opened by huypl53
