IMS-Toucan is a toolkit to train state-of-the-art Speech Synthesis models

Overview

IMS-Toucan is a toolkit for teaching, training and using state-of-the-art Speech Synthesis models, developed at the Institute for Natural Language Processing (IMS), University of Stuttgart, Germany. Everything is written in pure Python and PyTorch to keep it as simple and beginner-friendly as possible while still being powerful.

The PyTorch modules of Tacotron 2 and FastSpeech 2 are taken from ESPnet, and the PyTorch modules of HiFiGAN are taken from the ParallelWaveGAN repository, both of which are also authored by the brilliant Tomoki Hayashi.

For a version of the toolkit that includes TransformerTTS instead of Tacotron2 and MelGAN instead of HiFiGAN, check out the TransformerTTS and MelGAN branch. They are separated to keep the code clean, simple and minimal.

Demonstration

Here are two sentences produced by Tacotron 2 combined with HiFi-GAN, trained on Nancy Krebs using this toolkit.

Here is some speech produced by FastSpeech2 and MelGAN trained on LJSpeech using this toolkit.

And here is a sentence produced by TransformerTTS and MelGAN trained on Thorsten using this toolkit.

Here is some speech produced by a multi-speaker FastSpeech2 with MelGAN trained on LibriTTS using this toolkit. Fans of the videogame Portal may recognize who was used as the reference speaker for this utterance.


Installation

To install this toolkit, clone it onto the machine you want to use it on (the machine should have at least one GPU if you intend to train models on it; for inference, you can get by without a GPU). Navigate to the directory you have cloned. It is recommended to first create and activate a pip virtual environment and then install the requirements with the command shown below.
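
Creating and activating such a virtual environment could, for example, look like this (the environment name is just an example):

python3 -m venv toucan_venv
source toucan_venv/bin/activate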

pip install -r requirements.txt 

If you want to use multi-speaker synthesis, you will need a speaker embedding function. The one assumed in the code is dvector, because it is incredibly easy to use and freely available. In the current version of the toolkit it is included by default and should require no further action.

And finally you need to have espeak installed on your system, because it is used as the backend for the phonemizer. If you replace the phonemizer, you don't need it. On most Linux environments it will already be installed, and if it is not and you have sufficient rights, you can install it by simply running

apt-get install espeak

Creating a new Pipeline

To create a new pipeline to train a HiFiGAN vocoder, you only need a set of audio files. To create a new pipeline for a Tacotron2, you need audio files and corresponding text labels. To create a new pipeline for a FastSpeech2, you need audio files, corresponding text labels, and an already trained Tacotron2 model to estimate the duration information that FastSpeech2 needs as input. Let's go through them in order of increasing complexity.

Build a HiFiGAN Pipeline

In the directory called Utility there is a file called file_lists.py. In this file you should write a function that returns a list of all the absolute paths to each of the audio files in your dataset as strings.
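
As a rough sketch, such a function could look like the following; the function name and the dataset path are placeholders that you would adapt (and you may need to search subdirectories if your data is nested):

    import os

    def get_file_list_yourdataset():
        # return the absolute path of every audio file in the dataset as a string
        root = "/path/to/your/dataset"  # placeholder, adapt to where your audio files live
        file_list = list()
        for file_name in sorted(os.listdir(root)):
            if file_name.endswith(".wav"):
                file_list.append(os.path.join(root, file_name))
        return file_list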

Then go to the directory TrainingInterfaces/TrainingPipelines. In there, make a copy of any existing pipeline that has HiFiGAN in its name. We will use this as a reference and only make the necessary changes to use the new dataset. Import the function you have just written as get_file_list. Now look out for a variable called model_save_dir. This is the default directory that checkpoints will be saved into, unless you specify another one when calling the training script. Change it to whatever you like.

Now you need to add your newly created pipeline to the pipeline dictionary in the file run_training_pipeline.py in the top level of the toolkit. In this file, import the run function from the pipeline you just created and give it a descriptive name. Now in the pipeline_dict, add your imported function as the value and use a shorthand that makes sense as the key. And just like that you're done.
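
As a rough sketch of that registration (the module name and the shorthand below are made up for illustration; the real file already contains analogous imports and entries that you can mirror):

    # in run_training_pipeline.py
    from TrainingInterfaces.TrainingPipelines.HiFiGAN_YourDataset import run as hifigan_yourdataset

    # in the real file, simply add your entry to the existing pipeline_dict
    pipeline_dict = {"hifi_yourdataset": hifigan_yourdataset}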

Build a Tacotron2 Pipeline

In the directory called Utility there is a file called path_to_transcript_dicts.py. In this file you should write a function that returns a dictionary whose keys are the absolute paths to each of the audio files in your dataset (as strings) and whose values are the textual transcriptions of the corresponding audios.
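
A minimal sketch of such a function, assuming a pipe-separated metadata file similar to the LJSpeech layout; the function name, paths and parsing are placeholders you would adapt to your corpus:

    import os

    def build_path_to_transcript_dict_yourdataset():
        root = "/path/to/your/dataset"  # placeholder
        path_to_transcript = dict()
        with open(os.path.join(root, "metadata.csv"), "r", encoding="utf8") as metadata_file:
            for line in metadata_file:
                # each line is assumed to look like "file_id|transcription"
                file_id, transcript = line.strip().split("|", maxsplit=1)
                path_to_transcript[os.path.join(root, "wavs", file_id + ".wav")] = transcript
        return path_to_transcript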

Then go to the directory TrainingInterfaces/TrainingPipelines. In there, make a copy of any existing pipeline that has Tacotron2 in its name. If your dataset is single-speaker, choose any that is not LibriTTS; if your dataset is multi-speaker, choose the one for LibriTTS as your template. We will use this copy as a reference and only make the necessary changes to use the new dataset. Import the function you have just written as build_path_to_transcript_dict. Since the data will be processed a considerable amount, a cache will be built and saved as a file for quick and easy restarts. So find the variable cache_dir and adapt it to your needs. The same goes for the variable save_dir, which is where the checkpoints will be saved to. This is a default value; you can overwrite it with a command line argument when calling the pipeline later, in case you want to fine-tune from a checkpoint and thus save into a different directory.

Since we are using text here, we have to make sure that the text processing is adequate for the language. So check in Preprocessing/TextFrontend whether the TextFrontend already has a language ID (e.g. 'en' and 'de') for the language of your dataset. If not, you'll have to implement handling for that, but it should be pretty simple by just doing it analogous to what is there already. Now back in the pipeline, change the lang argument in the creation of the dataset and in the call to the train loop function to the language ID that matches your data.

Now navigate to the implementation of the train_loop that is called in the pipeline. In this file, find the function called plot_attention. This function will produce attention plots during training, which is the most important way to monitor the progress of the training. In there, you may need to add an example sentence for the language of the data you are using. It should all be pretty clear from looking at it.

Once this is done, the only thing left is to make the new pipeline available to the run_training_pipeline.py file in the top level. In said file, import the run function from the pipeline you just created and give it a descriptive name. Now in the pipeline_dict, add your imported function as the value and use a shorthand that makes sense as the key, just as for the HiFiGAN pipeline above. And that's it.

Build a FastSpeech2 Pipeline

Most of this is exactly analogous to building a Tacotron2 pipeline. So to keep this brief, this section will only mention the additional things you have to do.

In your new pipeline file, look out for the line in which the acoustic_model is loaded. Change the path to the checkpoint of a Tacotron2 model that you trained on the same dataset previously. This is used to estimate phoneme durations via knowledge distillation.

Everything else is exactly like creating a Tacotron2 pipeline, except that in the training_loop, spectrograms are plotted instead of attention plots to visualize training progress. So there, in the function called plot_progress_spec, you may need to add an example sentence if you are using a new language.

Training a Model

Once you have a pipeline built, training is super easy. Just activate your virtual environment and run the command below. You might want to use something like nohup to keep it running after you log out from the server (then you should also add -u as an option to python) and add an & to start it in the background. Also, you might want to redirect stdout and stderr into a file using >, but all of that is just standard shell use and has nothing to do with the toolkit.

python run_training_pipeline.py 

You can supply any of the following arguments, but don't have to (although for training you should definitely specify at least a GPU ID).

--gpu_id  

--resume_checkpoint 

--finetune (if this is present, the provided checkpoint will be fine-tuned on the data from this pipeline)

--model_save_dir 
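
A complete call could therefore look like the following (the shorthand, GPU ID and save directory are just examples):

nohup python -u run_training_pipeline.py tacotron2_yourdataset --gpu_id 0 --model_save_dir Models/Tacotron2_YourDataset > training.log 2>&1 &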

After every epoch, some logs will be written to the console. If the loss becomes NaN, you'll need to use a smaller learning rate or more warmup steps in the arguments of the call to the training_loop in the pipeline you are running.

If you get CUDA out-of-memory errors, you need to decrease the batch size in the arguments of the call to the training_loop in the pipeline you are running. Try decreasing the batch size in small steps until you no longer get out-of-memory errors. Decreasing the batch size may also require you to use a smaller learning rate. The use of GroupNorm should make it so that the training remains mostly stable.
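
For orientation, the relevant arguments in a pipeline's call to the train loop might look roughly like the sketch below; the argument names and values are only illustrative and can differ between model types, so check the train loop that your pipeline actually calls:

    train_loop(net=model,
               train_dataset=train_set,
               device=device,
               save_directory=save_dir,
               batch_size=16,       # decrease in small steps if you run out of GPU memory
               lr=0.001,            # lower this if the loss turns to NaN
               warmup_steps=14000)  # or raise this if the loss turns to NaN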

Speaking of plots: in the directory you specified for saving, the model's checkpoint files and self-explanatory visualization data will appear. Since the checkpoints are quite big, only the five most recent ones will be kept. By default, training will stop after 100,000 update steps for Tacotron2, after 300,000 for FastSpeech2, and after 500,000 for HiFiGAN. Depending on the machine and configuration you are using, this will take between 2 and 4 days, so verify that everything works on small tests before running the big thing. If you want to stop earlier, just kill the process; since everything is daemonic, all the child processes should die with it.

After training is complete, it is recommended to run run_weight_averaging.py. If you made no changes to the architectures and stuck to the default directory layout, it will automatically load any models you produced with one pipeline, average their parameters to get a slightly more robust model and save the result as best.pt in the same directory where all the corresponding checkpoints lie. This also compresses the file size slightly, so you should do this and then use the best.pt model for inference.

Creating a new InferenceInterface

To build a new InferenceInterface, which you can then use for super simple inference, we're going to use an existing one as template again. If you use multi-speaker, take the LibriTTS ones as template, otherwise take any other one. Make a copy of the InferenceInterface. Change the name of the class in the copy and change the paths to the models to use the trained models of your choice. Instantiate the model with the same hyperparameters that you used when you created it in the corresponding training pipeline. The last thing to check is the language that you supply to the text frontend. Make sure it matches what you used during training.

With your newly created InferenceInterface, you can use your trained models pretty much anywhere, e.g. in other projects. All you need is the Utility directory, the Layers directory, the Preprocessing directory and the InferenceInterfaces directory (and of course your model checkpoint). That's all the code you need, it works standalone.

Using a trained Model for Inference

An InferenceInterface contains 2 useful methods. They are read_to_file and read_aloud.

  • read_to_file takes as input a list of strings and a filename. It will synthesize the sentences in the list, concatenate them with a short pause in between, and write them to the filepath you supply as the other argument.

  • read_aloud takes just a string, which it will then convert to speech and immediately play using the system's speakers. If you set the optional argument view to True when calling it, it will also show a plot of the phonemes it produced, the spectrogram it came up with, and the wave it created from that spectrogram. So all the representations can be seen, text to phoneme, phoneme to spectrogram and finally spectrogram to wave.

  • Additionally, Tacotron2 InferenceInterfaces offer a method called plot_attention. This will take a string, synthesize it and show a plot of the attention matrix, which can be useful to gain insights.
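
A minimal usage sketch is shown below; the class name is a placeholder for the InferenceInterface you created, and the exact constructor arguments depend on the interface you copied:

    from InferenceInterfaces.YourDataset_FastSpeech2 import YourDataset_FastSpeech2  # placeholder name

    tts = YourDataset_FastSpeech2()  # constructor arguments may differ in your copy
    tts.read_to_file(["This is sentence one.", "This is sentence two."], "output.wav")
    tts.read_aloud("This sentence is played directly over the speakers.", view=True)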

Those methods are used in demo code in the toolkit. In run_interactive_demo.py and run_text_to_file_reader.py, you can import InferenceInterfaces that you created and add them to the dictionary in each of the files with a shorthand that makes sense. In the interactive demo, you can just call the python script, then type in the shorthand when prompted and immediately listen to your synthesis saying whatever you put in next (be wary of out of memory errors for too long inputs). In the text reader demo script you have to call the function that wraps around the InferenceInterface and supply the shorthand of your choice. It should be pretty clear from looking at it.

FAQ

Here are a few points that were brought up by users:

  • My error message shows GPU0, even though I specified a different GPU - The way GPU selection works is that the specified GPU is set as the only visible device, in order to avoid backend stuff running accidentally on different GPUs. So internally the program will name the device GPU0, because it is the only GPU it can see. It is actually running on the GPU you specified.
  • I'm getting device side assert triggered errors - The number of phonemes in the phoneme set used has to be specified as idim in the instantiation of a model. If a phoneme index that is higher than the number specified as idim is passed to the model during runtime, this is the cryptic error that will occur. So if you make changes to the phoneme set, remember to also change the model's idim.

Example Pipelines available

Dataset       Language   Single or Multi Speaker
Thorsten      German     Single Speaker
LJSpeech      English    Single Speaker
Nancy Krebs   English    Single Speaker
LibriTTS      English    Multi Speaker

This toolkit has been written by Florian Lux (except for the PyTorch modules taken from ESPnet and ParallelWaveGAN, as mentioned above), so if you come across problems or questions, feel free to write a mail. Also let me know if you do something cool with it. Thank you for reading.

Comments
  • Getting "BrokenPipeError", "ConnectionResetError", and "EOFError" errors for hifigan training

    I've put wav files at 48k SR into a folder using the following script: https://github.com/CherokeeLanguage/Cherokee-IMS-Toucan/blob/main/create_vocoder_files.py

    I am using the following to call the training: https://github.com/CherokeeLanguage/Cherokee-IMS-Toucan/blob/main/HiFiGAN_combined.py

    At preprocessing 55% it crashes, and I don't see what the cause is in the error log.

    hifigan-crash.log

    Assistance appreciated.

    python=3.8.12

    Python Environment: https://github.com/CherokeeLanguage/Cherokee-IMS-Toucan/blob/main/environment.yml

    feature request 
    opened by michael-conrad 17
  • Creating new text to IPA encoder. Does the existing model setup have place holders for IPA tone markers?

    Looking to take advantage of the wonderful work y'all have done.

    In regards to creating a new text to IPA encoder. Does the existing model embedding have place holders for the full IPA character setup including the IPA standard tone markers?

    feature request 
    opened by michael-conrad 14
  • Adding a New Language

    First of all, many thanks for a great repo. I'm kind of new to this stuff, please forgive me. Can we train a new speaker and language using this repo, for example Turkish. I would be very grateful if you could provide information on what the structure of the data set should be and how it was prepared.

    opened by Winchester37 11
  • Is there a way to change Speaker Embedding layer to other Models

    Hi is there any chance we can change the Speaker Embedding layer from current Speechbrain's ECAPA-TDNN and Speechbrain's x-Vector to some other models like the Speaker Embedding model from Coqui TTS

    With the current model sometimes the output voice gender is Female even when the input reference audio gender is Male. So I want to try some other Speaker Embedding models too.

    And do we need to change the Sample rate of the reference file to 16K before passing to tts.set_utterance_embedding(path_to_reference_audio=reference) function?

    Thanks

    opened by saibharani 11
  • IndexError: index -1 is out of bounds for axis 0 with size 0

    I am getting an index error after several rounds of training a new model.

    Any suggestions?

    Prepared a FastSpeech dataset with 2756 datapoints in Corpora/chr-w.
    Training model
    Reloading checkpoint_126775.pt
    0%| | 0/275 [00:00<?, ?it/s]

    /home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
      warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
    100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 275/275 [00:29<00:00,  9.26it/s]
    Traceback (most recent call last):
      File "run_training_pipeline.py", line 78, in <module>
        pipeline_dict[args.pipeline](gpu_id=args.gpu_id,
      File "/home/muksihs/git/IMS-Toucan/TrainingInterfaces/TrainingPipelines/FastSpeech2_Cherokee_West.py", line 45, in run
        train_loop(net=model,
      File "/home/muksihs/git/IMS-Toucan/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2/fastspeech2_train_loop.py", line 195, in train_loop
        plot_progress_spec(net, device, save_dir=save_directory, step=step_counter, lang=lang, default_emb=default_embedding)
      File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
        return func(*args, **kwargs)
      File "/home/muksihs/git/IMS-Toucan/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2/fastspeech2_train_loop.py", line 57, in plot_progress_spec
        lbd.specshow(spec,
      File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/librosa/display.py", line 959, in specshow
        kwargs.setdefault("cmap", cmap(data))
      File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/librosa/display.py", line 576, in cmap
        min_val, max_val = np.percentile(data, [min_p, max_p])
      File "<__array_function__ internals>", line 5, in percentile
      File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/numpy/lib/function_base.py", line 3818, in percentile
        return _quantile_unchecked(
      File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/numpy/lib/function_base.py", line 3937, in _quantile_unchecked
        r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
      File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/numpy/lib/function_base.py", line 3515, in _ureduce
        r = func(a, **kwargs)
      File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/numpy/lib/function_base.py", line 4050, in _quantile_ureduce_func
        n = np.isnan(ap[-1])
    IndexError: index -1 is out of bounds for axis 0 with size 0
    
    opened by michael-conrad 10
  • Will increasing the Duration, Pitch and Energy layers help improve quality

    Hi I am curious if we increase the number of layers for the duration, pitch, and energy using the 'duration_predictor_layers' parameter and some other parameters in the architecture, will it improve the duration and pitch predictions accuracy closer to the audio in the given embeddings sample?

    If it does can you suggest some of the parameters that I can tweak if I want to train a bigger and better model.

    Thanks

    opened by saibharani 9
  • How is the new PortaSpeech Implementation performing?

    Hi I noticed that you are working on new PortaSpeech Implementation, Can I know how the model is performing, is the implementation completed? Can I try training with my own data?

    Thanks

    opened by bharaniyv 7
  • Question about dataset

    Hi,

    In the paper 5 minutes of speech are used to train on a new language. But for finetuning the Meta model on an already seen language (say French) is it worth it providing hours of single speaker audio ? I mean will the quality of the model improve when providing more than 5 minutes of audio ? I tried to provide a home made 75 minute dataset but I still could not recognize the speaker in the generated audio after finetuning up to 120k steps, although the prosody was awesome!

    And regarding speaker transfer (see interactive demo) because you output 44kHz audio, should the input audio (reference audio) also be 44 kHz ? I tried with 16 kHz audios but could not recognize the reference speaker from the generated output, although as mentioned earlier prosody was pretty good.

    So to summarize :

    1. how long the dataset should be ?
    2. which framerate should it have ?
    3. how long should each sample be ? I understand from this chat that adding longer samples enhanced the quality of the model. But to what extent, should I add 20 s, 30 s, 60 s audios in my dataset ?

    Thank you very much for your help :smile:

    opened by Ca-ressemble-a-du-fake 6
  • There is a problem with fine-tuning HiFiGAN

    Hi, thank you for your contribution. I have a problem with fine-tuning HiFiGAN based on the pre-training model you gave me: hifigan_train_loop.py error on line 71 of check_dict "generator_optimizer" not found. I checked your release of v2. For HiFiGAN model version 2, only check_dict ["generator"] can be obtained. There is no way to get resources like "generator_optimizer" and "discriminator_optimizer". How can I solve this problem?

    Here is the error code in hifigan_train_loop.py:

    if path_to_checkpoint is not None:
        check_dict = torch.load(path_to_checkpoint, map_location=device)
        optimizer_g.load_state_dict(check_dict["generator_optimizer"])
        optimizer_d.load_state_dict(check_dict["discriminator_optimizer"])
        scheduler_g.load_state_dict(check_dict["generator_scheduler"])
        scheduler_d.load_state_dict(check_dict["discriminator_scheduler"])
        g.load_state_dict(check_dict["generator"])
        d.load_state_dict(check_dict["discriminator"])
        step_counter = check_dict["step_counter"]

    opened by guo453585719 6
  • Overfitting - how to detect and stop training?

    I have an issue with overfitting on the data which seems to degrade the Cherokee portion of the output.

    The Cherokee output starts dropping trailing syllables that start with an 'h' in later iterations, which are rendered OK in earlier iterations.

    I've been trying higher iterations to get better voice matching between samples and model for data set specific voices.

    Is there a way to get the loss on a per language basis?

    I'm currently retraining the aligner with the Cherokee audio sourced from tape removed. I will then train the TTS again to see if that helps any.

    opened by michael-conrad 6
  • Inference Performance

    FastPitch, FastSpeech2 and Avocodo all claim inference times faster than 100x real time, sometimes way more (FastPitch), on GPU.

    When doing inference on an Nvidia A100, I barely get above 1x (2.6 seconds to generate 3 seconds of audio), measured with python's time.perf_counter() on the FastSpeech2Interface.forward() function (so model loading or wav file writing are not taken into account).

    More precisely, those are the results (which stay roughly the same across several runs) :

    Text2phone done in 0.0018 seconds.
    Phone2mel done in 0.6225 seconds.
    Mel2wav done in 1.8307 seconds.
    Enhancement done in 0.1348 seconds.
    

    Do other people get similar results or not with different GPUs and what could be the explanation ? Are papers inflating their inference real-case end-to-end performance, is it the implementation (did people compare with other FastSpeech2 / HifiGAN implementations), or is it a bug way lower level like cuda version related ?

    I am on the v2.3 release.

    Any help greatly appreciated, thanks :)

    opened by tomschelsen 5
  • Suggestion for the French language

    Disclaimer : this might come down to geo/political preferences, but I thought the information is worth sharing.

    When listening to the Meta model in the French language, I found some pronunciation/accent surprising (I am a French native speaker from France). Looking at the code, it seems the Meta model is mainly trained on the CSS10_fr dataset. So I went to listen to this dataset and indeed found the same "surprises" I heard before. Information on the internet is scarce, but it seems the speaker might be from Canada. So, although being definitely French in the grammatical sense, the pronunciation is more fr-ca than fr-fr.

    So if by any chance you intended to retrain the Meta model, and are ok to make the French language + French accent sounding more "French from France", I would suggest avoiding CSS10 and using the Synpaflex Corpus ( https://www.ortolang.fr/market/corpora/synpaflex-corpus , I have seen reference to a "FrenchExpressive" corpus in the code, don't know if it is this one) or the SIWIS dataset ( https://datashare.ed.ac.uk/handle/10283/2353 , at least the subparts that are segmented).

    Here are the loading functions I made (the Synpaflex one only takes the utterances for which the text was normalized by the authors, but an extended version could easily be made) :

    import glob
    from pathlib import Path
    import os
    
    def build_path_to_transcript_dict_synpaflex_norm_subset():
        root = "/data/inputs/speech_natural/fr/synpaflex-corpus/5/v0.1/"
        path_to_transcript = dict()
        for text_path in glob.iglob(os.path.join(root, "**/*_norm.txt"), recursive=True):
            with open(text_path, "r", encoding="utf8") as file:
                norm_transcript = file.read()
            path_obj = Path(text_path)
            wav_path = str((path_obj.parent.parent / path_obj.name[:-9]).with_suffix(".wav"))
            if Path(wav_path).exists():
                path_to_transcript[wav_path] = norm_transcript
        return path_to_transcript
    
    def build_path_to_transcript_dict_siwis_subset():
        root = "/data/inputs/speech_natural/fr/SiwisFrenchSpeechSynthesisDatabase/"
        # part4 and part5 are not segmented
        sub_dirs = ["part1", "part2", "part3"]
        path_to_transcript = dict()
        for sd in sub_dirs:
            for text_path in glob.iglob(os.path.join(root, "text", sd, "*.txt")):
                with open(text_path, "r", encoding="utf8") as file:
                    norm_transcript = file.read()
                path_obj = Path(text_path)
                wav_path = str((path_obj.parent.parent.parent / "wavs" / sd / path_obj.stem).with_suffix(".wav"))
                if Path(wav_path).exists():
                    path_to_transcript[wav_path] = norm_transcript
        return path_to_transcript
    
    opened by tomschelsen 1
  • Spectrogram Loss Value is NaN

    I'm trying to do some training and found that the spectrogram loss is NaN. After reading again I found in the section https://github.com/DigitalPhonetics/IMS-Toucan#faq-:~:text=Loss%20turns%20to,use%20for%20TTS. that I should try using the scorer. I do it like this:

    1. Running python3 run_training_pipeline.py integration_test --gpu_id 0, but even now the result is still NaN and I can't find the file best.py
    2. After that I ran python3 run_scorer.py

    Is this step correct? I'm trying to run this using 1000 LJ Speech data. What should I do so that the spectrogram loss value is not NaN? For information, I’m using batch size: 8 and lr=0.001

    opened by kin0303 5
  • Multi-Speaker Training

    Hi, I want to do multi-speaker training. I have data for 4 speakers and one of the speakers has a very large data of approximately 20,000 audio files with a duration of time of 1-11 seconds while the other speakers only have approximately 1000 audio files. Should the speaker with 20,000 audio files be reduced so that the dataset used is balanced? And should I change the sample rate to 16,000? My Dataset sample rate is 44,100.

    opened by kin0303 13
  • Option to generate audio file to hear how the training evolves

    Hi,

    I haven't found the option to generate audio files every now and then to check whether the training is evolving and to prevent overfitting.

    In weight and bias website or on disk only the mel spectrograms are available. I find it great if it was also possible to have audio files of the test sentences.

    If it slows down the training too much then an option should enable the generation of audio.

    I know that I can workaround this lack by merging the last checkpoints and then loading the checkpoint to infer the test sentences, but I find this process cumbersome. And this sometimes causes the training to stop (maybe because of out of memory error).

    If needed I can help implement this feature!

    feature request 
    opened by Ca-ressemble-a-du-fake 2
  • Can I improve the generated output naturalness ?

    Hi,

    I have been playing around with Toucan TTS for some time and it is really easy to use and training is fast. I finetuned the provided Meta pretrained model with an 8-hour dataset and the result is not as good as I was expecting. So I wonder if I could make it even better or if you could help me spot where the "problem" lies in the generated audio :

    Here are the waveforms (top is Coqui VITS with 260k step trained from scratch model, bottom is Toucan FastSpeech2 with 200k step trained from Meta model) : ToucanVsCoquiWaveforms

    The associated spectrograms : ToucanVsCoquiSpectrograms

    And the audios :

    This is from Coqui VITS, I find it crystal clear : https://user-images.githubusercontent.com/91517923/202889766-0c2ad9ad-2ec2-4376-9abc-17a008e58364.mp4

    This is from FastSpeech2. It sounds like an old tape, the voice is like shivering (I don't know if that's the right term!) https://user-images.githubusercontent.com/91517923/202889734-3a02486d-3785-4e83-8365-614c6ac0f64f.mp4

    Both generated audios have been compressed to mp4 to be able to post them, but they are pretty close to what the wavs sound like (to my hearing there is no difference).

    So how can I make the Toucan FastSpeech2 model sound better ? Should I train it for some more steps, or is it on the contrary over-trained / over-fitted ? Or would the only way be to implement VITS in Toucan (I don't think it is straightforward to do) ?

    Thank you in advance for helping me improve the results!

    opened by Ca-ressemble-a-du-fake 4
Releases(v2.3)
  • v2.3(Oct 25, 2022)

    This release extends the toolkit's functionality and provides new checkpoints.

    • self-contained embeddings: we no longer use an external embedding model for TTS conditioning. Instead, we train one that is specifically tailored for this use.
    • new vocoder: Avocodo replaces HiFi-GAN
    • new controllability options through artificial speaker generation
    • quality-of-life changes, such as Weights & Biases integration, a graphic demo script and automated model downloading
    • diverse bug fixes and speed increases

    This release breaks backwards compatibility, please download the new models or stick to a prior release if you rely on your old models.

    Source code(tar.gz)
    Source code(zip)
    aligner.pt(210.42 MB)
    Avocodo.pt(52.88 MB)
    embedding_function.pt(1.83 MB)
    embedding_gan.pt(925.41 KB)
    FastSpeech2_Meta.pt(180.78 MB)
  • v2.2(May 20, 2022)

    This release extends the toolkit's functionality and provides new checkpoints.

    New Features:

    • support for all phonemes in the IPA standard through an extended lookup of articulatory features
    • support for some suprasegmental markers in the IPA standard through parsing (tone, lengthening, primary stress)
    • praat-parselmouth for greatly improved pitch extraction
    • faster phonemization
    • word boundaries are added, which are invisible to the aligner and the decoder, but can help the encoder in multilingual scenarios
    • tonal languages added, tested and included into the pretraining (Chinese, Vietnamese)
    • Scorer class to inspect data given a trained model and dataset cache (provided pretrained models can be used for this)
    • intuitive controls for scaling durations and variance in pitch and energy
    • diverse bug fixes and speed increases

    Note:

    • This release breaks backwards compatibility. Make sure you are using the associated pretrained models. Old checkpoints and dataset caches become incompatible. Only HiFiGAN remains compatible.
    • Work on upcoming releases is already in progress. Improved voice adaptation will be our next goal.
    • To use the pretrained checkpoints, download them, create their corresponding directories and place them into your clone as follows (you have to rename the HiFiGAN and FastSpeech2 checkpoints once in place):
    ...
    Models
    └─ Aligner
          └─ aligner.pt
    └─ FastSpeech2_Meta
          └─ best.pt
    └─ HiFiGAN_combined
          └─ best.pt
    ...
    
    
    Source code(tar.gz)
    Source code(zip)
    aligner.pt(210.42 MB)
    FastSpeech2_MetaLearningCheckpoint.pt(179.81 MB)
    HiFiGAN.pt(52.89 MB)
  • v2.1(Mar 1, 2022)

    • self-contained aligner to get high quality durations quickly and easily without reliance on external tools or knowledge distillation
    • modelling speakers and languages jointly but disentangled, so you can use speakers across languages
    • look at the demo section for an interactive online demo

    A pretrained FastSpeech2 model that can speak in many languages in any voice, a HiFiGAN model and an Aligner model are attached to this release.

    Source code(tar.gz)
    Source code(zip)
    aligner.pt(210.42 MB)
    HiFiGAN.pt(52.89 MB)
    MultiLingualMultiSpeakerFastSpeech2.pt(179.81 MB)
  • v1.1(Feb 28, 2022)

  • v1.0(Jan 14, 2022)

    The basic versions of Tacotron 2, FastSpeech 2 and HiFiGAN are complete. A pretrained model for HiFiGAN is attached to this release.

    Future updates will include different models and new features and changes to existing models which will break backwards compatibility. This version is the most basic, but complete.

    Source code(tar.gz)
    Source code(zip)
    hifigan_checkpoint.pt(52.89 MB)