Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS)

Overview

Real-Time Voice Cloning

This repository is an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) with a vocoder that works in real time. Feel free to check my thesis if you're curious or looking for information I haven't documented; mostly I would recommend taking a quick look at the figures beyond the introduction.

SV2TTS is a three-stage deep learning framework that creates a numerical representation of a voice from a few seconds of audio and uses it to condition a text-to-speech model trained to generalize to new voices.
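In this repo, the three stages map to three models: a speaker encoder, a synthesizer, and a vocoder. The sketch below, loosely based on demo_cli.py, shows how they chain together; the module names and model paths are assumptions about the current repo layout, so treat it as an illustration rather than a supported API.

    # Illustrative pipeline sketch (loosely following demo_cli.py); module names
    # and model paths are assumptions about the repo layout.
    from pathlib import Path

    from encoder import inference as encoder
    from synthesizer.inference import Synthesizer
    from vocoder import inference as vocoder

    # 1) Speaker encoder: a few seconds of reference audio -> fixed-size embedding.
    encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
    wav = encoder.preprocess_wav(Path("reference.wav"))
    embed = encoder.embed_utterance(wav)

    # 2) Synthesizer: text + embedding -> mel spectrogram conditioned on that voice.
    synthesizer = Synthesizer(Path("synthesizer/saved_models/pretrained/pretrained.pt"))
    specs = synthesizer.synthesize_spectrograms(["Hello, this is a cloned voice."], [embed])

    # 3) Vocoder: mel spectrogram -> waveform.
    vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))
    generated_wav = vocoder.infer_waveform(specs[0])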

Video demonstration (click the picture):

Toolbox demo

Papers implemented

URL | Designation | Title | Implementation source
1806.04558 | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | This repo
1802.08435 | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | fatchord/WaveRNN
1703.10135 | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | fatchord/WaveRNN
1710.10467 | GE2E (encoder) | Generalized End-To-End Loss for Speaker Verification | This repo

News

14/02/21: This repo now runs on PyTorch instead of TensorFlow, thanks to the help of @bluefish. If you wish to run the TensorFlow version instead, check out commit 5425557.

13/11/19: I'm now working full time and I will not maintain this repo anymore. To anyone who reads this:

  • If you just want to clone your voice (and not someone else's): I recommend our free plan on Resemble.AI. You will get better voice quality and fewer prosody errors.
  • If this is not your case: proceed with this repository, but you might end up being disappointed by the results. If you're planning to work on a serious project, my strong advice: find another TTS repo. Go here for more info.

20/08/19: I'm working on resemblyzer, an independent package for the voice encoder. You can use your trained encoder models from this repo with it.

06/07/19: Need to run within a docker container on a remote server? See here.

25/06/19: Experimental support for low-memory GPUs (~2 GB) added for the synthesizer. Pass --low_mem to demo_cli.py or demo_toolbox.py to enable it. It adds a big overhead, so it's not recommended if you have enough VRAM.

Setup

1. Install Requirements

Python 3.6 or 3.7 is needed to run the toolbox.

  • Install PyTorch (>=1.0.1).
  • Install ffmpeg.
  • Run pip install -r requirements.txt to install the remaining necessary packages. (A quick sanity check of these prerequisites is sketched below.)
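A minimal sketch of such a check (not part of the repo):

    # Minimal environment check (not part of the repo): verifies the Python version,
    # the PyTorch install, CUDA availability, and that ffmpeg is on the PATH.
    import shutil
    import sys

    import torch

    print("Python:", sys.version.split()[0])
    print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    print("ffmpeg on PATH:", shutil.which("ffmpeg") is not None)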

2. Download Pretrained Models

Download the latest here.

3. (Optional) Test Configuration

Before you download any dataset, you can begin by testing your configuration with:

python demo_cli.py

If all tests pass, you're good to go.

4. (Optional) Download Datasets

For playing with the toolbox alone, I only recommend downloading LibriSpeech/train-clean-100. Extract the contents as <datasets_root>/LibriSpeech/train-clean-100, where <datasets_root> is a directory of your choosing. Other datasets are supported in the toolbox; see here. You're free not to download any dataset, but then you will need your own data as audio files, or you will have to record it with the toolbox.
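If in doubt about the layout, a quick check like this (illustrative only) confirms the toolbox will find the dataset:

    # Illustrative check of the expected layout: <datasets_root>/LibriSpeech/train-clean-100
    from pathlib import Path

    datasets_root = Path("/path/to/datasets_root")  # the directory you chose
    librispeech = datasets_root / "LibriSpeech" / "train-clean-100"
    print(librispeech, "exists:", librispeech.is_dir())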

5. Launch the Toolbox

You can then try the toolbox:

python demo_toolbox.py -d <datasets_root>
or
python demo_toolbox.py

depending on whether you downloaded any datasets. If you are running an X-server or if you have the error Aborted (core dumped), see this issue.

Comments
  • Training a new encoder model

    Training a new encoder model

    In #126 it is mentioned that most of the ability to clone voices lies in the encoder. @mbdash is contributing a GPU to help train a better encoder model.

    • Increase the hidden layer size (model_hidden_size) to 768 as suggested here: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/126#issuecomment-529235670
    • All other hparams will be kept at their defaults
    • We will try to strictly follow the instructions for encoder training on the wiki page: wiki/Training

    Instructions

    1. Download the LibriSpeech/train-other-500 and VoxCeleb 1/2 datasets. Extract these to your <datasets_root> folder as follows:
      • LibriSpeech: train-other-500 (extract as LibriSpeech/train-other-500)
      • VoxCeleb1: Dev A - D as well as the metadata file (extract as VoxCeleb1/wav and VoxCeleb1/vox1_meta.csv)
      • VoxCeleb2: Dev A - H (extract as VoxCeleb2/dev)
    2. Change model_hidden_size to 768 in encoder/params_model.py (see the sketch after this list)
    3. python encoder_preprocess.py <datasets_root>
    4. Open a separate terminal and start visdom
    5. python encoder_train.py new_model_name <datasets_root>/SV2TTS/encoder
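    For step 2, the change amounts to editing one constant in encoder/params_model.py. A sketch of the intended edit (the other values shown are assumed defaults and stay untouched):

        # encoder/params_model.py -- the only intended change is model_hidden_size;
        # the other values are the assumed defaults and remain as they are.
        model_hidden_size = 768     # increased from the default of 256
        model_embedding_size = 256
        model_num_layers = 3
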
    opened by ghost 113
  • Training from scratch

    Training from scratch

    Thanks for publishing the code and basic training instructions!

    Environment

    Datasets: (9,063 speakers)

    • LibriTTS (train-other-500)
    • VoxCeleb1
    • VoxCeleb2
    • OpenSLR (42-44, 61-66, 69-80)
    • VCTK

    I'm working on adding TEDLIUM_release-3, which would add 1,925 new speakers, and potentially SLR68, which would add 1,017 Chinese speakers but would require some cleanup, as there is a lot of silence in the audio files.

    Hyperparameters: left all parameters untouched.

    Encoder training:

    39,300 steps: [image]

    115,900 steps (almost exactly 24 hours of training): [image]

    Typical step

    Step 115950   Loss: 0.9941   EER: 0.0717   Step time:  mean:   889ms  std:  1320ms
    
    Average execution time over 10 steps:
      Blocking, waiting for batch (threaded) (10/10):  mean:  449ms   std: 1317ms
      Data to cuda (10/10):                            mean:    3ms   std:    0ms
      Forward pass (10/10):                            mean:    8ms   std:    2ms
      Loss (10/10):                                    mean:   67ms   std:    7ms
      Backward pass (10/10):                           mean:  237ms   std:   26ms
      Parameter update (10/10):                        mean:  118ms   std:    3ms
      Extras (visualizations, saving) (10/10):         mean:    6ms   std:   18ms
    

    Questions

    1. Will adding an additional ~2,900 speakers make much of a difference for the encoder?
      1. Will adding the remaining LibriTTS datasets (train-clean-100, train-clean-360, dev-clean, dev-other) with 1,221 speakers have any adverse effects on training the synthesizer and vocoder?
    2. Does using different languages in the encoder help or hurt?
    3. Does my encoder training thus far look okay? It appears it will take me roughly 7 days to train the encoder up to 846,000 steps.
    4. Can I train the encoder using 16,000 Hz audio while training the synthesizer and vocoder using 24,000 Hz? Or do I need to restart and train the encoder on 24,000 Hz mel spectrograms?
    5. I've downloaded the source videos for TEDLIUM-3, so I can extract audio at up to 44,100 Hz, allowing me to expand the synthesizer and vocoder training dataset to TEDLIUM + LibriTTS at 24,000 Hz.
    6. Based on other issues I've read, it appears you would like to use fatchord's taco1 implementation. Would you advise I go that route vs. NVIDIA's taco2 PyTorch implementation?
    opened by sberryman 105
  • Pytorch synthesizer

    Pytorch synthesizer

    Splitting this off from #370, which will remain for tensorflow2 conversion. I would prefer this route if we can get it to work. Asking for help from the community on this one.

    One example of a pytorch-based tacotron is: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2

    Another option is to manually convert the code and pretrained models which would be extremely time-consuming, but also an awesome learning experience.

    dependencies 
    opened by ghost 74
  • Single speaker fine-tuning process and results

    Single speaker fine-tuning process and results

    Summary

    A relatively easy way to improve the quality of the toolbox output is through fine-tuning of the multispeaker pretrained models on a dataset of a single target speaker. Although it is no longer voice cloning, it is a shortcut for obtaining a single-speaker TTS model with less training data needed relative to training from scratch. This idea is not original, but a sample single-speaker model is presented along with a process and data for replicating the model.

    Improvement in quality is obtained by taking the pretrained synthesizer model and training a few thousand steps on a single-speaker dataset. This amount of training can be done in less than a day on a CPU, and even faster with a GPU.

    Procedure

    Pretrained models and all files and commands needed to replicate this training can be found here: https://www.dropbox.com/s/bf4ti3i1iczolq5/logs-singlespeaker.zip?dl=0

    1. First, create a dataset of a single speaker from LibriSpeech. All embeddings are updated to reference the same file; a sketch of one way to achieve this follows the list. (I'm not sure if this helps or not, but the idea is to get it to converge faster.)
      • It doesn't have to be LibriSpeech. This demonstrates the concept with minimal changes to existing files.
      • Total of 13.28 minutes (train-clean-100/211/122425/*)
    2. Next, continue training of the pretrained synthesizer model using the restricted dataset. Running overnight on a CPU, loss decreased from 0.70 to 0.50 over 2,600 steps. I plan to go further in subsequent tests.
    3. Generate new training data for the vocoder using the updated synthesizer model.
    4. Continue training of the pretrained vocoder. I only added 1,000 steps for now because I was eager to see if it worked, but the difference is noticeable even with a little fine-tuning.
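    For step 1, one way to get the same embedding everywhere is sketched below. This is illustrative only (not the script from the archive above), and the embeds/ layout under <datasets_root>/SV2TTS/synthesizer is an assumption about the preprocessing output:

        # Overwrite every per-utterance embedding of the single-speaker dataset with
        # one shared embedding (here: the normalized mean of the existing ones).
        from pathlib import Path

        import numpy as np

        embed_dir = Path("<datasets_root>/SV2TTS/synthesizer/embeds")  # assumed layout
        embed_files = sorted(embed_dir.glob("*.npy"))

        shared = np.mean([np.load(f) for f in embed_files], axis=0)
        shared /= np.linalg.norm(shared, 2)  # re-normalize to unit length

        for f in embed_files:
            np.save(f, shared)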

    Results

    Download audio samples: samples.zip

    These samples were generated with demo_toolbox.py and demonstrate the effect of synthesizer fine-tuning. "Pretrained" uses the original models, and "singlespeaker" uses the fine-tuned synthesizer model with the original vocoder model. I found the #432 changes helpful for benchmarking: all samples are generated with seed=1 and no trimming of silences. The single-speaker model is noticeably better, with fewer long gaps and artifacts for short utterances. However, gaps still occur sometimes: one example is "this is a big red apple." Output is also somewhat better with a fine-tuned vocoder model, though no samples with the new vocoder are shared at this time.

    Discussion

    This work helps to demonstrate the following points:

    1. Deficiencies with the synthesizer and its pretrained model can be compensated for, to some extent, by fine-tuning on a single speaker. This is much easier than implementing a new synthesizer and requires far less training.
    2. A small dataset of 0.2 hours is sufficient for fine-tuning the synthesizer.
    3. Better single-speaker performance can be obtained with just a few thousand steps of additional synthesizer training.

    The major obstacle preventing single-speaker fine-tuning is the lack of a suitable tool for creating a custom dataset. The existing preprocessing scripts are suited to batch processing of organized, labeled datasets and are not helpful unless the target speaker is already part of a supported dataset. The preprocessing does not need to be fully automated, because a small dataset on the order of 100 utterances is sufficient for fine-tuning. I am going to write a tool that will allow users to manually select or record files to add to a custom dataset, and facilitate transcription (maybe using DeepSpeech). This tool will be hosted in a separate repository.

    Acknowledgements

    • @CorentinJ (for the toolbox and original models)
    • @matheusfillipe (for the #402 features which make the toolbox much more usable for these experiments)
    • @mbdash (for asking questions in #433 that inspired me to try this)
    • @plummet555 (for support on #384 to make the toolbox deterministic, helps a lot with benchmarking)
    • @pusalieth (for #331 to make toolbox work on CPU)
    opened by ghost 71
  • Training a new model based on LibriTTS

    Training a new model based on LibriTTS

    @blue-fish, would it be useful if I were to offer a GPU (2080 Ti) to contribute to training a new model based on LibriTTS? I have yet to train any models and would gladly exchange GPU time for an opportunity to learn. I wonder how long it would take on a single 2080 Ti.

    Originally posted by @mbdash in https://github.com/CorentinJ/Real-Time-Voice-Cloning/pull/441#issuecomment-663076421

    opened by ghost 66
  • Pytorch synthesizer

    Pytorch synthesizer

    I have taken the tacotron model from fatchord/WaveRNN and integrated it with this repo (#447). Aside from the new format of the synthesizer model (.pt) this change should be completely transparent to the end user.

    Major Changes

    • Toolbox no longer requires tensorflow 🎉
    • Synthesizer is tacotron1 instead of tacotron2

    Pretrained Model

    A download link and instructions are provided here: https://github.com/CorentinJ/Real-Time-Voice-Cloning/pull/472#issuecomment-695206377

    Task List

    • [x] Inference
    • [x] Training
    • [x] Update preprocessing scripts to use synthesizer_pt
    • [x] Cleanup files in synthesizer_pt
    • [x] Match repo code style
    • [x] Move synthesizer_pt to synthesizer (no more tensorflow)
    • [x] Testing
    • [x] Release pretrained model
    • [x] Update documentation
    • [x] Review
      • [x] Retest as needed if code changes made
    • [x] Merge into master branch of repo
    opened by ghost 55
  • Tensorflow 2.x implementation of synthesizer

    Tensorflow 2.x implementation of synthesizer

    As mentioned in the « Pytorch synthesizer » issue, I'm trying to retrain a synthesizer in TensorFlow 2.x (the model is inspired by NVIDIA's PyTorch implementation and is available on my GitHub).

    So far I have run some tests and have some results (see below for results, tests, and future experiments).

    Data and hardware

    For my tests, I train it on French with the CommonVoice dataset combined with SIWIS (a total of 180k utterances from around 2k speakers). I use a GeForce GTX 1070 with 6.2 GB of RAM.

    Preprocessing and encoder

    As the encoder, I use a siamese network trained on the datasets described above. The final siamese network achieves a binary accuracy of 97% (train and validation sets); see the « alternative approach to encoder » issue for more details about the model, approach, and results.

    For preprocessing, I use the default preprocessing from NVIDIA's Tacotron 2 implementation (since I do transfer learning from that model to speed up training).

    Training procedure and parameters

    For training, I have to split the input spectrogram into sub-blocks of N frames, because I don't have enough memory to train on the whole spectrogram in one step. The training step is available on my GitHub if you want to see it.
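    As I understand it, the sub-block idea amounts to something like the following sketch (my own illustration, not the author's training code):

        import numpy as np

        def split_into_subblocks(mel, n_frames):
            """Split a [T, n_mels] spectrogram into chunks of at most n_frames frames."""
            return [mel[i:i + n_frames] for i in range(0, len(mel), n_frames)]

        mel = np.random.randn(813, 80).astype(np.float32)  # dummy spectrogram
        chunks = split_into_subblocks(mel, n_frames=20)
        print(len(chunks), chunks[0].shape, chunks[-1].shape)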

    The hyperparameters are:

    • Batch size: 64 (graph mode) or 48 (eager mode)
    • N (number of frames per optimization step): 15-20 (graph mode) and 40-50 (eager mode)
    • Dataset size: around 160,000 samples
    • Training / validation split: 90% of the dataset for training and 10% for validation
    • Optimizer: Adam with epsilon 1e-3 and a custom learning-rate scheduler (going from 0.00075 to 0.00025)
    • Loss: Tacotron loss (inspired by the NVIDIA repo): sum of masked MSE on the mel output, masked MSE on the mel-postnet output, and BCE on the gate output (see the sketch after this list)
    • Training time: around 15 s / step (graph mode) and 20 s / step in eager mode (1 batch on the entire spectrogram), and around 11 hours for 1 epoch (training + validation)
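    My reading of the loss described above, written as a TensorFlow 2.x sketch (this is not the author's code; shapes and names are assumptions):

        import tensorflow as tf

        def tacotron_loss(mel_true, mel_pred, mel_postnet_pred, gate_true, gate_logits, lengths):
            """Masked MSE on the mel and mel-postnet outputs plus masked BCE on the gate."""
            # mel_*: [batch, frames, n_mels]; gate_*: [batch, frames]; lengths: [batch]
            mask = tf.sequence_mask(lengths, maxlen=tf.shape(mel_true)[1], dtype=mel_true.dtype)
            frame_mask = tf.expand_dims(mask, -1)  # [batch, frames, 1]
            n_mels = tf.cast(tf.shape(mel_true)[-1], mel_true.dtype)

            mel_loss = tf.reduce_sum(tf.square(mel_pred - mel_true) * frame_mask) \
                / (tf.reduce_sum(frame_mask) * n_mels)
            postnet_loss = tf.reduce_sum(tf.square(mel_postnet_pred - mel_true) * frame_mask) \
                / (tf.reduce_sum(frame_mask) * n_mels)

            gate_bce = tf.nn.sigmoid_cross_entropy_with_logits(labels=gate_true, logits=gate_logits)
            gate_loss = tf.reduce_sum(gate_bce * mask) / tf.reduce_sum(mask)

            return mel_loss + postnet_loss + gate_loss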

    Note: graph mode is specific to TensorFlow 2.x (the tf.function decorator); for this model it is more memory-efficient and also much faster, so I run most experiments in graph mode (I only put the call() method of the decoder in graph mode; the rest stays in eager mode because it doesn't work in graph mode).

    To compare against the loss, I have already trained a (single-speaker) Tacotron 2 model with this loss, and the generated wavs become interesting at a loss of around 0.5 (mel-postnet loss around 0.2).

    Results

    Siamese encoder (5 epochs, ~10k steps)

    • epoch 4 : loss decreases from 1.27 to 0.95
    • epoch 5 : loss decreases from 1.22 to 0.85

    Siamese encoder with an additional dense layer

    The loss decreases to 1.0 but not below (only trained for 2-3 epochs because I didn't have enough time...)

    Encoder of this repo

    • Loss decreases from 2.8 to 1.8 in epoch 1 (3k steps, batch_size 48 with 40 frames) (eager mode)

    Continuing training with 15 frames and batch_size 64 (in graph mode):

    • Epoch 2 : avg loss of 1.27
    • Epoch 3 : avg loss of 1.22 (min around 1.5)
    • Epoch 4 : loss is around 1.14 in first 500 steps

    Future experiments

    I think I will train the current model for 2-3 more epochs and see the results; the loss is still decreasing within each epoch, so I hope it will drop below 0.7 in the future.

    If that is not the case, here are a few ideas to improve the model:

    • [x] Add a Dense layer after the encoder concatenation to reduce the embedding size. That way I can do full transfer learning from the pretrained model (currently I only do partial transfer learning, because the RNN decoder has different shapes due to the concatenation of the encoder output). With full transfer learning, I could train only the encoder (and the new Dense layer) for a few steps, and then train the full model for several epochs. The intuition is that the attention mechanism would already be learned, so training should be much faster.
    • [x] It could also be interesting to train the model with speaker embeddings computed by the encoder of this repo (I haven't done this yet because embedding my entire dataset with that encoder takes a long time).
    • [ ] Another thing to try would be to train a siamese encoder with an embedding dimension of 256 (currently the embedding is 64-dimensional).
    • [ ] I could also try a siamese encoder trained on spectrograms instead of raw audio; it might learn more about frequencies, which could help the synthesizer more.

    If you have any ideas to improve the model or its performance, or if you want to use my model to run tests, you can post comments here!

    opened by Ananas120 47
  • Update on maintaining this project

    Update on maintaining this project

    We're one year after the initial publication of this project. I've been busy with both exams and work since, and it's only last week that I passed my last exam. During that year, I have received SO many messages from people asking for help in setting up the repo and I just had no time to allocate for any of that. I kinda wished that the popularity of this repo would have died down, but new people keep coming in at a fairly constant rate. I have no intentions to start developing on this repo again, but I hope I can answer some questions and possibly review some PRs. Use this issue to ask me questions and to bring light upon things that you believe need to be improved, and we'll see what can be done.

    opened by CorentinJ 47
  • Training Voice Cloning model for another language

    Training Voice Cloning model for another language

    Hi! I already know how to train the synthesizer and vocoder, and I also know how to create a relevant dataset. But if I want to train a voice cloning model for another language, e.g. Ukrainian, what else should I do?

    opened by rlutsyshyn 39
  • Slow Training GPU RTX 2080

    Slow Training GPU RTX 2080

    Hi, I am training on a Portuguese dataset, but the process is very slow using CUDA with default hparams.

    In this test I'm using batch_size = 8.

    [image]

    Expected: it should run 2-3 times faster.

    Has anyone else had this problem on Windows with Anaconda?

    opened by MGSousa 38
  • Abort when python demo_toolbox.py

    Abort when python demo_toolbox.py

    Hi: I am trying to run your code on a CentOS server with X11 forwarding enabled. But when I try python demo_toolbox.py dataset, it prints

    Arguments:
        datasets_root:    dataset
        enc_models_dir:   encoder/saved_models
        syn_models_dir:   synthesizer/saved_models
        voc_models_dir:   vocoder/saved_models
    Aborted
    

    I believe I have installed all the required packages. It looks like the error is not caused by Python but by some low-level call. Is there any way to print more error messages? Or is there a way to run without the GUI? (Although I have X11 forwarding enabled on this server, it still might not work as well as a machine with a proper display.)

    Thanks!

    opened by Interfish 36
  • how I can train the model for Arabic language

    how I can train the model for Arabic language

    I've tried to make the decoder and synthesizer handle Arabic characters, but I still haven't been able to make the model spell or read Arabic text well. What do I have to do to add Arabic to Real-Time-Voice-Cloning, and what do I have to change? Thank you in advance for your help.

    opened by baraalmasri 0
  • Dear Friends, I am facing segmentation fault

    Dear Friends, I am facing segmentation fault

    Dear friends, I am facing a segmentation fault (Segmentation fault (core dumped)) when running python demo_toolbox.py. Nothing happens other than this fault: no window popup or any other error. I have tried several Python and torch versions, with the same result. I am using CentOS 7. Earlier everything was fine. I would appreciate any help or hints about what could be the reason.

    Originally posted by @Tortoise17 in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/1143

    Originally posted by @ImanuillKant1 in https://github.com/HTTPS-PhoenixEnterprise-com/HTTPS-PhoenixEnterprise-com/issues/3

    opened by ImanuillKant1 0