End-2-end speech synthesis with recurrent neural networks

Overview

Introduction

New: Interactive demo using Google Colaboratory can be found here

TTS-Cube is an end-2-end speech synthesis system that provides a full processing pipeline to train and deploy TTS models.

It is entirely based on neural networks, requires no pre-aligned data and can be trained to produce audio just by using character or phoneme sequences.

Markdown does not allow embedding of audio files. For a better experience, check out the project's website.

For installation please follow these instructions. Training and usage examples can be found here. A notebook demo can be found here.

Output examples

Encoder outputs:

"Arată că interesul utilizatorilor de internet față de acțiuni ecologiste de genul Earth Hour este unul extrem de ridicat." encoder_output_1

"Pentru a contracara proiectul, Rusia a demarat un proiect concurent, South Stream, în care a încercat să atragă inclusiv o parte dintre partenerii Nabucco." encoder_output_2

Vocoder output (conditioned on gold-standard data)

Note: The mel-spectrum is computed with a frame shift of 12.5 ms. This means that Griffin-Lim reconstruction produces sloppy results at best (regardless of the number of iterations).
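As an illustration of the note above, a minimal Griffin-Lim round trip at a 12.5 ms frame shift might look like the sketch below (librosa-based; the sample rate, FFT size, and file names are assumptions, not TTS-Cube's actual feature-extraction settings):

```python
# Minimal mel -> Griffin-Lim reconstruction sketch (illustrative, not the TTS-Cube pipeline).
import librosa
import soundfile as sf

sr = 22050                    # assumed sample rate
hop = int(0.0125 * sr)        # 12.5 ms frame shift -> 275 samples at 22.05 kHz

y, _ = librosa.load("sample.wav", sr=sr)          # "sample.wav" is a placeholder path
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=hop, n_mels=80)

# Approximate inversion: mel -> linear magnitude, then Griffin-Lim phase estimation.
linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=1024)
y_hat = librosa.griffinlim(linear, hop_length=hop, n_iter=60)
sf.write("reconstructed.wav", y_hat, sr)
```

Even with many iterations, the phase estimated this way remains noticeably imperfect at this frame shift, which is why a neural vocoder is used instead.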

original        vocoder

original        vocoder

original        vocoder

End to end decoding

The encoder model is still converging, so the examples below are currently of low quality. We will update the files as soon as we have a stable Encoder model.

synthesized         original (unseen)

synthesized         original (unseen)

synthesized         original (unseen)

synthesized         original (unseen)

Technical details

TTS-Cube is based on concepts described in Tacotron (1 and 2), Char2Wav and WaveRNN, but its architecture does not stick to the exact recipes:

  • It has a dual architecture, composed of (a) a module (Encoder) that converts sequences of characters or phonemes into a mel-log spectrogram and (b) an RNN-based Vocoder that is conditioned on the spectrogram to produce audio
  • The Encoder is similar to those proposed in Tacotron (Wang et al., 2017) and Char2Wav (Sotelo et al., 2017), but
    • has a lightweight architecture with just a two-layer BDLSTM encoder and a two-layer LSTM decoder
    • uses the guided attention trick (Tachibana et al., 2017), which provides incredibly fast convergence of the attention module (in our experiments we were unable to reach an acceptable model without this trick)
    • does not employ any CNN/pre-net or post-net
    • uses a simple highway connection from the attention to the output of the decoder (which, we observed, forces the encoder to actually learn how to produce the mean values of the mel-log spectrum for particular phones/characters); a rough sketch of this encoder/decoder shape and of the guided-attention penalty is given after this list
  • The initial vocoder was similar to WaveRNN (Kalchbrenner et al., 2018), but instead of modifying the RNN cells (as proposed in their paper), we used two coupled neural networks
  • We are now using ClariNet (Ping et al., 2018)
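Below is a minimal PyTorch sketch of the encoder/decoder shape and the guided-attention penalty described above. All class names, layer sizes, and the dot-product attention are illustrative assumptions for exposition; this is not the actual TTS-Cube code.

```python
import torch
import torch.nn as nn

class SketchEncoderDecoder(nn.Module):
    """Illustrative shape of the Encoder module: characters/phonemes -> mel frames."""
    def __init__(self, num_symbols, emb_size=128, enc_size=128, dec_size=256, mel_bins=80):
        super().__init__()
        self.embed = nn.Embedding(num_symbols, emb_size)
        # two-layer bidirectional LSTM over the input symbols
        self.encoder = nn.LSTM(emb_size, enc_size, num_layers=2,
                               bidirectional=True, batch_first=True)
        # two-layer LSTM decoder driven by the attention context
        self.decoder = nn.LSTM(2 * enc_size, dec_size, num_layers=2, batch_first=True)
        self.attn_query = nn.Linear(dec_size, 2 * enc_size)
        # skip-style connection: the attention context feeds the output layer directly
        self.to_mel = nn.Linear(dec_size + 2 * enc_size, mel_bins)

    def forward(self, symbols, num_frames):
        enc, _ = self.encoder(self.embed(symbols))            # (B, T_in, 2*enc_size)
        B = symbols.size(0)
        state, frames, alignments = None, [], []
        dec_out = torch.zeros(B, 1, self.decoder.hidden_size, device=symbols.device)
        for _ in range(num_frames):
            query = self.attn_query(dec_out)                  # (B, 1, 2*enc_size)
            scores = torch.bmm(query, enc.transpose(1, 2))    # (B, 1, T_in)
            align = torch.softmax(scores, dim=-1)
            context = torch.bmm(align, enc)                   # (B, 1, 2*enc_size)
            dec_out, state = self.decoder(context, state)
            frames.append(self.to_mel(torch.cat([dec_out, context], dim=-1)))
            alignments.append(align)
        # (B, T_out, mel_bins) and (B, T_out, T_in)
        return torch.cat(frames, dim=1), torch.cat(alignments, dim=1)

def guided_attention_loss(alignments, g=0.2):
    """Penalize attention mass far from the diagonal (Tachibana et al., 2017)."""
    B, T_out, T_in = alignments.shape
    n = torch.arange(T_in, device=alignments.device).float() / T_in
    t = torch.arange(T_out, device=alignments.device).float() / T_out
    weight = 1.0 - torch.exp(-((n[None, :] - t[:, None]) ** 2) / (2 * g * g))  # (T_out, T_in)
    return (alignments * weight[None]).mean()
```

The guided-attention term is added to the reconstruction loss during training; because it pushes the alignment toward the diagonal from the first updates, the attention module converges much faster, matching the observation in the list above.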

References

The ParallelWavenet/ClariNet code is adapted from this ClariNet repo.

Comments
  • English model and hardware requirements


    Hello Tiberiu,

    I'd love to test TTS-Cube, but unfortunately I don't currently have access to a good GPU (and I don't think I could train a TTS on a laptop with a 940MX). Do you have a pretrained English model? (It seems you were working on one, but I don't know its current status.)

    Also, do you have an idea of the hardware requirements for running synthesis? For example, the NVIDIA Jetson Nano seems like a nice platform for a self-hosted TTS, but I'm not sure whether it's powerful enough to run TTS-Cube.

    opened by prototux 26
  • Negative loss when training step2


    Here is my output from training step 2:

    /usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    Found 4995 training files and 5 development files
    	Rendering devset
    		1/5 processing file data/processed/dev/0000001 
    		2/5 processing file data/processed/dev/0000002 
    		3/5 processing file data/processed/dev/0000003 
    		4/5 processing file data/processed/dev/0000004 
    		5/5 processing file data/processed/dev/0000005 
    
    Starting epoch 1
    Shuffling training data
    	1/4995 processing file data/processed/train/0000007
    100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.29it/s]
     avg loss=0.9230358004570007 execution time=0.8138909339904785
    	2/4995 processing file data/processed/train/0000008
    100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.47it/s]
     avg loss=0.8607208132743835 execution time=0.714789867401123
    ...
    avg loss=0.1945137083530426 execution time=0.626471757888794
    	17/4995 processing file data/processed/train/0000022
    100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.46it/s]
     avg loss=0.0572626106441021 execution time=0.7513647079467773
    	18/4995 processing file data/processed/train/0000023
    100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.68it/s]
     avg loss=-0.061442214995622635 execution time=0.6261122226715088
    	19/4995 processing file data/processed/train/0000024
    100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.47it/s]
     avg loss=-0.18586862087249756 execution time=0.7162132263183594
    	20/4995 processing file data/processed/train/0000025
    100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.68it/s]
     avg loss=0.06383810192346573 execution time=0.6265075206756592
    	21/4995 processing file data/processed/train/0000026
    100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.67it/s]
     avg loss=-0.20782051980495453 execution time=0.628434419631958
    	22/4995 processing file data/processed/train/0000027
    100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.46it/s]
     avg loss=-0.31225016713142395 execution time=0.7171187400817871
    	23/4995 processing file data/processed/train/0000028
    100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.30it/s]
     avg loss=-0.5820147395133972 execution time=0.8073093891143799
    	24/4995 processing file data/processed/train/0000029
    100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.68it/s]
     avg loss=-0.46214190125465393 execution time=0.6245768070220947
    	25/4995 processing file data/processed/train/0000030
    100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.46it/s]
     avg loss=-0.7601555585861206 execution time=0.720339298248291
    	26/4995 processing file data/processed/train/0000031
    
    
    opened by shartoo 19
  • Fine-tuning/Speaker adaptation


    First off, this is great work. Can't wait to play around with the code 👍

    In the training instructions, I see that you do have multispeaker support. Is it possible to "fine-tune" from an existing checkpoint with another dataset using --resume? Has anyone tried it and seen whether the results are good?

    opened by ZohaibAhmed 15
  • Demo on Colab, possible improvements?


    Hi,

    I ran the updated English model in Google Colab; the code can be found here: https://colab.research.google.com/drive/1L6BYGA0CmQhF6FULWbVr4-yZeKtp4uPK

    The synthesis is good, but there is room for improvement in the pronunciation of the vowels, which sound as if they are underwater, and the consonants, which are a bit sharp. Any idea why this is, or how it could be solved?

    Thank you for creating this project, and making it available :)

    opened by roodrallec 11
  • what should the development set's content be in a speech dataset and g2p?


    This is my first GitHub issue, so please forgive me if there are any mistakes. The problem I'm having right now is simply not understanding what should go in the development folder of a training set. What I've done: I've downloaded the M-AILABS Italian training set and split the csv into txt files so that each one corresponds to a wav file, and that's the training set. My question is: what should I put in the other folder? The readme says there should be no more than 5 files in there, but when I start training with an empty dev folder it gives me an error about a lab file that was not found. I have the same doubt about the g2p thing, but since I'm not going to use that feature it's a secondary issue for me, as is adding custom things to the lab file (in fact, I haven't added any).
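    For illustration only, the kind of split described above might look like the sketch below; the paths, file names, and pipe-separated metadata layout are assumptions and not part of TTS-Cube:

    ```python
    # Hypothetical split of a pipe-separated metadata.csv into one .txt per .wav,
    # roughly as described above; adjust paths/columns to the actual M-AILABS layout.
    import csv
    from pathlib import Path

    corpus = Path("mailabs_it")           # assumed corpus root
    out_dir = corpus / "txt"
    out_dir.mkdir(exist_ok=True)

    with open(corpus / "metadata.csv", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|"):
            utt_id, text = row[0], row[-1]
            if (corpus / "wavs" / f"{utt_id}.wav").exists():
                (out_dir / f"{utt_id}.txt").write_text(text, encoding="utf-8")
    ```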

    opened by albluc24 5
  • some words are missing during synthesizing


    I have trained an encoder on custom Telugu data for about 4 days, but during inference some words are not synthesized and the audio skips over them. Do you suggest any hyperparameter adjustments, or anything else, to make the synthesizer work correctly? I am using the LJSpeech vocoder and trained on the last version of the repo before the new g2p pull. The loss is around 1.8 to 2.3 and has remained in that range for the past 20 hours. Thank you

    opened by saibharani 5
  • audio samples on English dataset


    Hello,

    Thank you for the wonderful repository.

    I read that you're currently training on the LJSpeech dataset for English TTS.

    Do you have any updates on audio samples?

    Also, would you be able to provide some rough training stats (number of GPUs used, hours needed per pass through the data, etc.)?

    Thanks again for the awesome repository and open-source effort.

    opened by G-Wang 3
  • what is the present inference time for generating 10sec audio using the vocoder?


    I want to know the general inference time for generating 10 seconds of audio using the vocoder. Presently I don't have the setup to test it on my own. Can anyone help? Thank you.

    opened by saibharani 2
  • Is there any interest in providing a model trained in Brazilian Portuguese?


    Hello, I recently developed a dataset for voice synthesis in Brazilian Portuguese using my own voice, called the TTS-Portuguese Corpus. The dataset has approximately 10 hours of speech and is available at: https://github.com/Edresson/TTS-Portuguese-Corpus

    I have already successfully trained the DCTTS model on the dataset.

    Is anyone interested in training/adjusting the hyperparameters of the model to get a TTS model in Portuguese?

    opened by Edresson 2
  • Add requirements.txt


    No idea what the dependencies are for this repo. Apparently it requires python 2 (due to the use of xrange) and is using an old version of scipy ("module scipy.misc has no attribute 'toimage'"). Also, did the folder structure change? Because trainer.py cannot find 'data/processed/train'. I had to manually correct it to '../data/processed/train' (many times because it is hardcoded all over the place).

    You may be on to something, but this repo is unusable by others as it is.
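    For anyone hitting the same scipy.misc.toimage error (the function was removed from newer SciPy releases), a commonly used workaround is a small Pillow-based shim like the hypothetical helper below; the min-max scaling is an assumption and may need adjusting to the repo's data:

    ```python
    # Hypothetical replacement for the removed scipy.misc.toimage, built on Pillow.
    # Assumes a 2-D array (e.g. a spectrogram) and scales it to 0-255 grayscale.
    import numpy as np
    from PIL import Image

    def toimage(arr):
        a = np.asarray(arr, dtype=np.float32)
        rng = max(float(np.ptp(a)), 1e-8)            # avoid division by zero for flat arrays
        a = (255.0 * (a - a.min()) / rng).astype(np.uint8)
        return Image.fromarray(a)
    ```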

    opened by balbertalli 2
  • Bump pillow from 6.2.0 to 9.3.0


    Bumps pillow from 6.2.0 to 9.3.0.

    Release notes

    Sourced from pillow's releases.

    9.3.0

    https://pillow.readthedocs.io/en/stable/releasenotes/9.3.0.html

    Changes

    ... (truncated)

    Changelog

    Sourced from pillow's changelog.

    9.3.0 (2022-10-29)

    • Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [wiredfool]

    • Initialize libtiff buffer when saving #6699 [radarhere]

    • Inline fname2char to fix memory leak #6329 [nulano]

    • Fix memory leaks related to text features #6330 [nulano]

    • Use double quotes for version check on old CPython on Windows #6695 [hugovk]

    • Remove backup implementation of Round for Windows platforms #6693 [cgohlke]

    • Fixed set_variation_by_name offset #6445 [radarhere]

    • Fix malloc in _imagingft.c:font_setvaraxes #6690 [cgohlke]

    • Release Python GIL when converting images using matrix operations #6418 [hmaarrfk]

    • Added ExifTags enums #6630 [radarhere]

    • Do not modify previous frame when calculating delta in PNG #6683 [radarhere]

    • Added support for reading BMP images with RLE4 compression #6674 [npjg, radarhere]

    • Decode JPEG compressed BLP1 data in original mode #6678 [radarhere]

    • Added GPS TIFF tag info #6661 [radarhere]

    • Added conversion between RGB/RGBA/RGBX and LAB #6647 [radarhere]

    • Do not attempt normalization if mode is already normal #6644 [radarhere]

    ... (truncated)


    dependencies 
    opened by dependabot[bot] 1
  • melgan vocoder is fast, let's integrate it?


    Hello, I've made the required changes to integrate TTS-Cube with the MelGAN vocoder.

    Inference is really fast: with WaveNet it took 1.5 hours to vocode a few sentences; now it takes literally seconds.

    I trained the MelGAN vocoder for 325 epochs (106170 iterations); it took about a day, I think, and the output is already understandable.

    The problem is that the encoder takes very long to train: I've been training for days and days and it still says whatever it wants. I wish I had a faster GPU.

    It speaks, just not exactly what is in the text file.

    The datasets are Japanese and Russian. I want to build a common (multilingual) model in the future (just for fun).

    Is anyone else interested in reproducing my experiment on their own dataset? I can share my code.

    opened by neurlang 0
  • colab notebook missing command to enter the github folder


    I tried using the Colab notebook; it works, but there is an error in the paths.

    After installing the dependencies, the code assumes it's in the repository folder.

    Add a line to enter the folder (%cd TTS-Cube) before !git submodule update --init --recursive,

    or change the paths to include it.
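    For reference, the corrected cell order might look like the sketch below; the clone URL is an assumption, since the comment only specifies the %cd and submodule lines:

    ```python
    # Suggested Colab cell order (IPython magics), per this comment.
    !git clone https://github.com/tiberiu44/TTS-Cube.git   # assumed repository URL
    %cd TTS-Cube
    !git submodule update --init --recursive
    ```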

    opened by seranus 0