# Neural HMMs are all you need (for high-quality attention-free TTS)

Shivam Mehta, Éva Székely, Jonas Beskow, and Gustav Eje Henter
This is the official code repository for the paper "Neural HMMs are all you need (for high-quality attention-free TTS)". For audio examples, visit our demo page. A pre-trained model is also available.
## Setup and training using LJ Speech
1. Download and extract the LJ Speech dataset. Place it in the `data` folder so that the directory becomes `data/LJSpeech-1.1`. Otherwise, update the filelists in `data/filelists` accordingly.
2. Clone this repository:
   ```bash
   git clone https://github.com/shivammehta007/Neural-HMM.git
   ```
   - If you are training on a single GPU, check out the `gradient_checkpointing` branch; gradient checkpointing helps fit a larger batch size during training.
3. Initialise the submodules:
   ```bash
   git submodule init; git submodule update
   ```
4. Make sure you have Docker installed and running.
   - Using Docker is recommended, since it manages the CUDA runtime libraries and the Python dependencies specified in the Dockerfile for you.
   - Alternatively, if you do not intend to use Docker, you can install the dependencies with pip:
     ```bash
     pip install -r requirements.txt
     ```
5. Run
   ```bash
   bash start.sh
   ```
   to install all the dependencies and start the container.
6. Check `src/hparams.py` for hyperparameters and set the GPUs.
   - For multi-GPU training, set GPUs to `[0, 1, ...]`.
   - For CPU training (not recommended), set GPUs to an empty list `[]`.
   - Check the location of the transcriptions; a quick sanity check is sketched after this list.
7. Run
   ```bash
   python train.py
   ```
   to train the model.
   - Checkpoints will be saved in `hparams.checkpoint_dir`.
   - Tensorboard logs will be saved in `hparams.tensorboard_log_dir`.
8. To resume training, run
   ```bash
   python train.py -c <CHECKPOINT_PATH>
   ```
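Before launching a long training run, it can be worth verifying that the filelists and the audio on disk actually line up. Below is a minimal sanity-check sketch; it assumes the filelists use the pipe-separated `<wav path>|<transcription>` format common to Tacotron 2-style repositories, and the filelist name is a placeholder (check `data/filelists` for the actual filenames):

```python
from pathlib import Path

# Placeholder name; look in data/filelists for the actual training filelist.
filelist = Path("data/filelists/train_filelist.txt")

missing = []
for line in filelist.read_text(encoding="utf-8").splitlines():
    # Assumed format: <wav path>|<transcription>
    wav_path = line.split("|", 1)[0]
    if not Path(wav_path).exists():
        missing.append(wav_path)

print(f"{len(missing)} file(s) referenced in the filelist are missing on disk")
```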
## Synthesis

1. Download our pre-trained LJ Speech model. (This is the exact same model as system NH2 in the paper, but with training continued until reaching 200k updates total.)
2. Download Nvidia's WaveGlow model.
3. Run `jupyter notebook` and open `synthesis.ipynb`.
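If you want to vocode generated mel-spectrograms outside the notebook, the sketch below shows the WaveGlow half of the pipeline, following Nvidia's published torch.hub usage. The random `mel` tensor is only a stand-in for the Neural HMM model's output; `synthesis.ipynb` shows how to produce the real thing:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load Nvidia's pre-trained WaveGlow vocoder (mirrors Nvidia's torch.hub example).
waveglow = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub", "nvidia_waveglow", model_math="fp32"
)
waveglow = waveglow.remove_weightnorm(waveglow).to(device).eval()

# Stand-in for the mel-spectrogram generated by the Neural HMM model in
# synthesis.ipynb: shape (batch, n_mel_channels, frames).
mel = torch.randn(1, 80, 200, device=device)

with torch.no_grad():
    audio = waveglow.infer(mel, sigma=0.666)  # (batch, samples) at 22.05 kHz
print(audio.shape)
```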
## Miscellaneous

### Mixed-precision training or full-precision training
- In `src/hparams.py`, change `hparams.precision` to `16` for mixed-precision training or to `32` for full-precision training.
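For illustration, the relevant line might look like the following (a sketch; the exact layout of `src/hparams.py` may differ):

```python
# Inside src/hparams.py (layout assumed for illustration):
precision = 16  # 16 = mixed precision (faster, less memory); 32 = full precision
```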
### Multi-GPU training or single-GPU training
- Since the code uses PyTorch Lightning, providing more than one element in the list of GPUs enables multi-GPU training. So change `hparams.gpus` to `[0, 1, 2]` for multi-GPU training, or to a single-element list `[0]` for single-GPU training.
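Under the hood, these values are what PyTorch Lightning's `Trainer` consumes. Here is a minimal sketch of the mechanism, assuming Lightning 1.x's `gpus`/`accelerator` arguments (the actual wiring in `train.py` may differ):

```python
import pytorch_lightning as pl

# Placeholders standing in for hparams.gpus and hparams.precision.
gpus = [0, 1, 2]  # more than one element -> distributed multi-GPU training
precision = 16    # mixed precision, as set in src/hparams.py

trainer = pl.Trainer(
    gpus=gpus,          # [0] for single-GPU training
    precision=precision,
    accelerator="ddp",  # Lightning 1.x flag for DistributedDataParallel
)
# trainer.fit(model, datamodule)  # model and data come from the repository's code
```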
## Known issues/warnings

### PyTorch dataloader
- If you encounter the warning message `[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)`, this is a known issue in the PyTorch DataLoader.
- It will be fixed when PyTorch releases a new Docker container image with an updated version of Torch. If you are not using Docker, the warning goes away with `torch > 1.9.1`.
## Support
If you have any questions or comments, please open an issue on our GitHub repository.
## Citation information
If you use or build on our method or code for your research, please cite our paper:
```bibtex
@article{mehta2021neural,
  title={Neural {HMM}s are all you need (for high-quality attention-free {TTS})},
  author={Mehta, Shivam and Sz{\'e}kely, {\'E}va and Beskow, Jonas and Henter, Gustav Eje},
  journal={arXiv preprint arXiv:2108.13320},
  year={2021}
}
```
## Acknowledgements
The code implementation is based on Nvidia's implementation of Tacotron 2 and uses PyTorch Lightning for boilerplate-free code.