# ERISHA: Multilingual Multispeaker Expressive Text-to-Speech Library
ERISHA is a multilingual, multispeaker expressive speech synthesis framework. It can transfer expressivity to the voice of a speaker for whom no expressive speech corpus is available. The term ERISHA means speech in Sanskrit. The framework includes various deep learning architectures for building the prosody encoder, such as Global Style Token (GST), Variational Autoencoder (VAE), Gaussian Mixture Variational Autoencoder (GMVAE), and X-vectors.
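To give an idea of the GST approach: a reference encoder summarizes a reference utterance into an embedding, and attention over a bank of learned style tokens turns that embedding into a style vector that conditions the TTS decoder. Below is a minimal, single-head sketch in PyTorch, assuming a precomputed reference embedding; the class name and dimensions are illustrative and the published GST design uses multi-head attention, so this is not ERISHA's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    """Attend over a bank of learned style tokens with a reference embedding.

    In a full GST model the reference embedding comes from a reference
    encoder run over a mel spectrogram; here it is just a fixed-size vector.
    """

    def __init__(self, ref_dim=128, num_tokens=10, token_dim=256):
        super().__init__()
        # Bank of learnable style tokens (the "global style tokens").
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        # Project the reference embedding into token space for attention.
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, ref_dim)
        query = self.query_proj(ref_embedding)             # (batch, token_dim)
        keys = torch.tanh(self.tokens)                     # (num_tokens, token_dim)
        # Scaled dot-product attention over the token bank.
        scores = query @ keys.t() / keys.size(-1) ** 0.5   # (batch, num_tokens)
        weights = F.softmax(scores, dim=-1)
        # Style embedding: weighted combination of the tokens.
        return weights @ keys                              # (batch, token_dim)

style_layer = StyleTokenLayer()
style = style_layer(torch.randn(4, 128))  # e.g. 4 reference utterances
print(style.shape)                        # torch.Size([4, 256])
```

Because the tokens are learned jointly with the synthesizer rather than labeled, each token tends to pick up an interpretable style factor (speaking rate, arousal, etc.) that can be selected or interpolated at inference time.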
The library is currently in an early stage of development and will be updated frequently.
Stay tuned for more updates; we are open to collaboration!
## Installation and Training
Refer to INSTALL for the initial setup.
## Available recipes
- Global Style Token (GST)
- Variational Autoencoder (VAE); see the latent-sampling sketch after this list
- Gaussian Mixture VAE (GMVAE)
- X-vectors (proposed work)
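The VAE and GMVAE recipes treat prosody as a latent variable. A minimal sketch of the VAE half, assuming a precomputed reference embedding: the encoder predicts a Gaussian posterior, a latent is drawn with the reparameterization trick, and a KL term regularizes it toward the prior (a GMVAE replaces the single Gaussian prior with a mixture). All names and dimensions here are illustrative, not ERISHA's API.

```python
import torch
import torch.nn as nn

class VAEProsodyHead(nn.Module):
    """Map a reference embedding to a Gaussian latent and sample from it."""

    def __init__(self, ref_dim=128, latent_dim=16):
        super().__init__()
        self.to_mu = nn.Linear(ref_dim, latent_dim)
        self.to_logvar = nn.Linear(ref_dim, latent_dim)

    def forward(self, ref_embedding):
        mu = self.to_mu(ref_embedding)
        logvar = self.to_logvar(ref_embedding)
        # Reparameterization trick: z = mu + sigma * eps keeps the sample
        # differentiable with respect to mu and logvar.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # KL divergence of N(mu, sigma^2) from the standard normal prior,
        # averaged over the batch; added to the TTS loss during training.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

head = VAEProsodyHead()
z, kl = head(torch.randn(4, 128))
print(z.shape, kl.item())  # torch.Size([4, 16]) and a scalar KL term
```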
## Available Features
- Resampling of speech waveforms to the target sampling rate in recipes (see the sketch after this list)
- Support for training a TTS system in other languages
- Support for training a multilingual TTS system across languages
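As an illustration of the resampling step, here is a short torchaudio snippet; the file names and the 22,050 Hz target rate are assumptions for illustration, not values fixed by the recipes.

```python
import torchaudio
import torchaudio.transforms as T

TARGET_SR = 22050  # assumed target sampling rate; your recipe may use another

# "utterance.wav" is a hypothetical input file.
waveform, orig_sr = torchaudio.load("utterance.wav")
if orig_sr != TARGET_SR:
    # Resample only when the source rate differs from the target.
    waveform = T.Resample(orig_freq=orig_sr, new_freq=TARGET_SR)(waveform)
torchaudio.save("utterance_22k.wav", waveform, TARGET_SR)
```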
## Upcoming updates
- User documentation
- PyTorch Lightning support
- Multiclass N-pair loss (see the sketch after this list)
- Cluster sampling for improving the latent representations of speaker and expressivity (proposed work)
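For reference, the multiclass N-pair loss (Sohn, 2016) pulls each anchor embedding toward its positive example while pushing it away from the positives of the other N-1 classes in the batch. A minimal PyTorch sketch follows, using the standard identity that the loss reduces to cross-entropy over the anchor-positive similarity matrix; this is an illustration, not the planned ERISHA implementation.

```python
import torch
import torch.nn.functional as F

def multiclass_n_pair_loss(anchors, positives):
    """Multiclass N-pair loss (Sohn, 2016).

    anchors, positives: (N, D) embeddings, one positive per anchor, each
    row from a distinct class. Every other positive in the batch serves
    as a negative for a given anchor.
    """
    logits = anchors @ positives.t()  # (N, N) similarity matrix
    targets = torch.arange(anchors.size(0), device=anchors.device)
    # log(1 + sum_{j != i} exp(f_i . f_j+ - f_i . f_i+)) is exactly
    # cross-entropy with the diagonal entries as the correct classes.
    return F.cross_entropy(logits, targets)

loss = multiclass_n_pair_loss(torch.randn(8, 64), torch.randn(8, 64))
print(loss.item())
```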
## Acknowledgements
This implementation uses code from the following repositories: NVIDIA, Keith Ito, Prem Seetharaman, Chengqi Deng, Dannynis, and Jhosimar George Arias Figueroa.