A multi-voice TTS system trained with an emphasis on quality

Overview

TorToiSe

Tortoise is a text-to-speech program built with the following priorities:

  1. Strong multi-voice capabilities.
  2. Highly realistic prosody and intonation.

This repo contains all the code needed to run Tortoise TTS in inference mode.

What's in a name?

I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue-in-cheek: this model is insanely slow. It leverages both an autoregressive decoder and a diffusion decoder, both known for their low sampling rates. On a K80, expect to generate a medium-sized sentence every 2 minutes.

Demos

See this page for a large list of example outputs.

Usage guide

Colab

Colab is the easiest way to try this out. I've put together a notebook you can use here: https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing

Installation

If you want to use this on your own computer, you must have an NVIDIA GPU. Installation:

git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
pip install -r requirements.txt
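
Before generating anything, it's worth a quick sanity check that PyTorch can actually see your GPU:

python -c "import torch; print(torch.cuda.is_available())"

If this prints False, fix your CUDA/PyTorch install first; the bundled scripts assume CUDA is available.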

do_tts.py

This script allows you to speak a single phrase with one or more voices.

python do_tts.py --text "I'm going to speak this" --voice dotrice --preset fast

read.py

This script provides tools for reading large amounts of text.

python read.py --textfile <your text to be read> --voice dotrice

This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and output that as well.
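
Conceptually, the recombination step is just concatenation along the time axis. A minimal sketch of the same idea using torchaudio (the clip filenames here are hypothetical placeholders, not the exact names read.py writes):

import torch
import torchaudio

# Load the individual sentence clips (hypothetical filenames).
clips = [torchaudio.load(f'clip_{i}.wav')[0] for i in range(3)]
# Concatenate along the time axis and write out a single combined file.
combined = torch.cat(clips, dim=-1)
torchaudio.save('combined.wav', combined, 24000)  # Tortoise outputs 24 kHz audio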

Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running read.py with the --regenerate argument.

API

Tortoise can be used programmatically, like so:

# Imports assume the package layout of this repo.
import tortoise.utils.audio  # ensures utils.audio is importable below
from tortoise import api, utils

# clips_paths: a list of paths to your reference WAV files.
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", reference_clips, preset='fast')
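
The returned pcm_audio is a PyTorch tensor of 24 kHz audio. A hedged snippet for writing it to disk (torchaudio is already among this repo's dependencies):

import torchaudio
# Drop the batch dimension and save; Tortoise generates at 24 kHz.
torchaudio.save('generated.wav', pcm_audio.squeeze(0).cpu(), 24000)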

Voice customization guide

Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips.

These reference clips are recordings of a speaker that you provide to guide speech generation. These clips are used to determine many properties of the output, such as the pitch and tone of the voice, speaking speed, and even speaking defects like a lisp or stuttering. The reference clip is also used to determine non-voice related aspects of the audio output like volume, background noise, recording quality and reverb.

Provided voices

This repo comes with several pre-packaged voices. You will be familiar with many of them. :)

Most of the provided voices were not found in the training set. Experimentally, it seems that voices from the training set produce more realistic outputs than those outside of it. Any voice prepended with "train" came from the training set.

Adding a new voice

To add new voices to Tortoise, you will need to do the following:

  1. Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks or podcasts. Guidelines for good clips are in the next section.
  2. Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.
  3. Save the clips as WAV files in floating-point format with a 22,050 Hz sample rate (see the preparation sketch after this list).
  4. Create a subdirectory in voices/
  5. Put your clips in that subdirectory.
  6. Run tortoise utilities with --voice=<your_subdirectory_name>.
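
A minimal clip-preparation sketch for steps 2-5, assuming librosa and soundfile are installed (the voice name and paths below are placeholders):

import librosa
import soundfile as sf

# Hypothetical source clips; replace with your own ~10 second segments.
sources = ['raw/clip0.wav', 'raw/clip1.wav', 'raw/clip2.wav']
for i, src in enumerate(sources):
    audio, _ = librosa.load(src, sr=22050)  # resamples and decodes to float32
    # Write a 32-bit float WAV at 22,050 Hz into your new voice directory.
    # voices/myvoice/ must already exist (step 4).
    sf.write(f'voices/myvoice/{i}.wav', audio, 22050, subtype='FLOAT')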

Picking good reference clips

As mentioned above, your reference clips have a profound impact on the output of Tortoise. Following are some tips for picking good clips:

  1. Avoid clips with background music, noise or reverb. These clips were removed from the training dataset. Tortoise is unlikely to do well with them.
  2. Avoid speeches. These generally have distortion caused by the amplification system.
  3. Avoid clips from phone calls.
  4. Avoid clips that have excessive stuttering, stammering or words like "uh" or "like" in them.
  5. Try to find clips that are spoken in the style you want your output to sound like. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
  6. The text being spoken in the clips does not matter, but diverse text does seem to perform better.

Advanced Usage

Generation settings

Tortoise is primarily an autoregressive decoder model combined with a diffusion model. Both of these have a lot of knobs that can be turned, which I've abstracted away for the sake of ease of use. I did this by generating thousands of clips using various permutations of the settings and measuring their effects with a metric for voice realism and intelligibility. I've set the defaults to the best overall settings I was able to find. For specific use cases, it might be effective to play with these settings (and it's very likely that I missed something!).

These settings are not available in the normal scripts packaged with Tortoise. They are available, however, in the API. See api.tts for a full list.
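
As a hedged illustration, a few of those knobs as exposed by api.tts, reusing the tts and reference_clips objects from the API example above (consult the signature of api.tts in your checkout for the authoritative names and defaults):

# A sketch only; parameter names match api.tts at the time of writing.
pcm_audio = tts.tts(
    'your text here',
    voice_samples=reference_clips,
    num_autoregressive_samples=256,  # candidates generated by the AR decoder
    diffusion_iterations=200,        # more diffusion steps: cleaner, slower
    temperature=0.8,                 # sampling temperature of the AR decoder
)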

Playing with the voice latent

Tortoise ingests reference clips by feeding each one individually through a small submodel that produces a point latent, then taking the mean of all the produced latents. My experimentation indicates that these point latents are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.

This lends itself to some neat tricks. For example, you can feed two different voices to Tortoise and it will output what it thinks the "average" of those two voices sounds like. You could also theoretically build a small extension to Tortoise that gradually shifts the latent from one speaker to another, then apply it across a bit of spoken text (something I haven't implemented yet, but might get to soon!). I am sure there are other interesting things that can be done here. Please let me know what you find!
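
For illustration, a sketch of the voice-averaging trick, assuming the load_voice and get_conditioning_latents helpers behave as in this repo's api.py (the voice names are placeholders and the 50/50 blend ratio is arbitrary):

from tortoise import api
from tortoise.utils.audio import load_voice

tts = api.TextToSpeech()
clips_a, _ = load_voice('voice_a')  # placeholder voice directory names
clips_b, _ = load_voice('voice_b')

# Compute the point latents for each voice, then blend them elementwise.
latents_a = tts.get_conditioning_latents(clips_a)
latents_b = tts.get_conditioning_latents(clips_b)
alpha = 0.5  # sweep from 0 to 1 to morph between the two speakers
blended = tuple(alpha * a + (1 - alpha) * b
                for a, b in zip(latents_a, latents_b))

gen = tts.tts_with_preset('your text here', conditioning_latents=blended,
                          preset='fast')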

Send me feedback!

Probabilistic models like Tortoise are best thought of as an "augmented search" - in this case, through the space of possible utterances of a specific string of text. The impact of community involvement in exploring these spaces (such as is being done with GPT-3 or CLIP) has really surprised me. If you find something neat that you can do with Tortoise that isn't documented here, please report it to me! I would be glad to publish it to this page.

Tortoise-detect

Out of concerns that this model might be misused, I've built a classifier that tells the likelihood that an audio clip came from Tortoise.

This classifier can be run on any computer; usage is as follows:

python is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>

This model has 100% accuracy on the contents of the results/ and voices/ folders in this repo. Still, treat this classifier as a "strong signal". Classifiers can be fooled and it is likewise not impossible for this classifier to exhibit false positives.

Model architecture

Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate models that work together. I've assembled a write-up of the system architecture here: https://nonint.com/2022/04/25/tortoise-architectural-design-doc/

Training

These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of ~50k hours of speech data, most of which was transcribed by ocotillo. Training was done on my own DLAS trainer.

I currently do not have plans to release the training configurations or methodology. See the next section.

Ethical Considerations

Tortoise v2 works considerably better than I had planned. When I began hearing some of the outputs of the last few versions, I began wondering whether or not I had an ethically unsound project on my hands. The ways in which a voice-cloning text-to-speech system could be misused are many. It doesn't take much creativity to think up how.

After some thought, I have decided to go forward with releasing this. Following are the reasons for this choice:

  1. It is primarily good at reading books and speaking poetry. Other forms of speech do not work well.
  2. It was trained on a dataset which does not have the voices of public figures. While it will attempt to mimic these voices if they are provided as references, it does not do so in such a way that most humans would be fooled.
  3. The above points could likely be resolved by scaling up the model and the dataset. For this reason, I am currently withholding details on how I trained the model, pending community feedback.
  4. I am releasing a separate classifier model which will tell you whether a given audio clip was generated by Tortoise or not. See tortoise-detect above.
  5. If I, a tinkerer with a BS in computer science and a ~$15k computer, can build this, then any motivated corporation or state can as well. I would prefer that it be in the open and everyone know the kinds of things ML can do.

Diversity

The diversity expressed by ML models is strongly tied to the datasets they were trained on.

Tortoise was trained primarily on a dataset consisting of audiobooks. I made no effort to balance diversity in this dataset. For this reason, Tortoise will be particularly poor at generating the voices of minorities or of people who speak with strong accents.

Looking forward

Tortoise v2 is about as good as I think I can do in the TTS world with the resources I have access to. A phenomenon that happens when training very large models is that as parameter count increases, the communication bandwidth needed to support distributed training of the model increases multiplicatively. On enterprise-grade hardware, this is not an issue: GPUs are attached together with exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck.

I want to mention here that I think Tortoise could be a lot better. The three major components of Tortoise are either vanilla Transformer encoder stacks or decoder stacks. Both of these model types have a rich experimental history with scaling in the NLP realm. I see no reason to believe that the same is not true of TTS.

The largest model in Tortoise v2 is considerably smaller than GPT-2 large. It is 20x smaller than the original DALLE transformer. Imagine what a TTS model trained at or near GPT-3 or DALLE scale could achieve.

If you are an ethical organization with computational resources to spare interested in seeing what this model could do if properly scaled out, please reach out to me! I would love to collaborate on this.

Acknowledgements

This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to credit a few of the amazing folks in the community that have helped make this happen:

  • Hugging Face, who wrote the GPT model and the generate API used by Tortoise, and who hosts the model weights.
  • Ramesh et al who authored the DALLE paper, which is the inspiration behind Tortoise.
  • Nichol and Dhariwal, who authored the revision of the code that drives the diffusion model.
  • Jang et al who developed and open-sourced univnet, the vocoder this repo uses.
  • lucidrains who writes awesome open source pytorch models, many of which are used here.
  • Patrick von Platen whose guides on setting up wav2vec were invaluable to building my dataset.

Notice

Tortoise was built entirely by me using my own hardware. My employer was not involved in any facet of Tortoise's development.

If you use this repo or the ideas therein for your research, please cite it! A BibTeX entry can be found in the right pane on GitHub.

Comments
  • Error running demonstration: TypeError: __init__() got multiple values for argument 'enabled'

    I sure think I got everything installed and ready to work with my 3090, but when I try to run

    python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast

    I receive:

    Generating autoregressive samples..
    100%|█████████████████████████████████████████████| 6/6 [00:03<00:00, 1.63it/s]
    Computing best candidates using CLVP
    0%| | 0/6 [00:00<?, ?it/s]
    /home/al/.local/lib/python3.8/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
      warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
    100%|█████████████████████████████████████████████| 6/6 [00:00<00:00, 6.15it/s]
    Transforming autoregressive outputs into audio..
    0%| | 0/80 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "tortoise/do_tts.py", line 37, in <module>
        gen, dbg_state = tts.tts_with_preset(args.text, k=args.candidates, voice_samples=voice_samples, conditioning_latents=conditioning_latents,
      File "/home/al/tortoise-tts/tortoise/api.py", line 325, in tts_with_preset
        return self.tts(text, **settings)
      File "/home/al/tortoise-tts/tortoise/api.py", line 488, in tts
        mel = do_spectrogram_diffusion(self.diffusion, diffuser, latents, diffusion_conditioning,
      File "/home/al/tortoise-tts/tortoise/api.py", line 158, in do_spectrogram_diffusion
        mel = diffuser.p_sample_loop(diffusion_model, output_shape, noise=noise,
      File "/home/al/.local/lib/python3.8/site-packages/TorToiSe-2.4.2-py3.8.egg/tortoise/utils/diffusion.py", line 565, in p_sample_loop
        for sample in self.p_sample_loop_progressive(
      File "/home/al/.local/lib/python3.8/site-packages/TorToiSe-2.4.2-py3.8.egg/tortoise/utils/diffusion.py", line 611, in p_sample_loop_progressive
        out = self.p_sample(
      File "/home/al/.local/lib/python3.8/site-packages/TorToiSe-2.4.2-py3.8.egg/tortoise/utils/diffusion.py", line 514, in p_sample
        out = self.p_mean_variance(
      File "/home/al/.local/lib/python3.8/site-packages/TorToiSe-2.4.2-py3.8.egg/tortoise/utils/diffusion.py", line 1121, in p_mean_variance
        return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
      File "/home/al/.local/lib/python3.8/site-packages/TorToiSe-2.4.2-py3.8.egg/tortoise/utils/diffusion.py", line 340, in p_mean_variance
        model_output = model(x, self._scale_timesteps(t), **model_kwargs)
      File "/home/al/.local/lib/python3.8/site-packages/TorToiSe-2.4.2-py3.8.egg/tortoise/utils/diffusion.py", line 1220, in __call__
        return self.model(x, new_ts, **kwargs)
      File "/home/al/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/al/.local/lib/python3.8/site-packages/TorToiSe-2.4.2-py3.8.egg/tortoise/models/diffusion_decoder.py", line 306, in forward
        with autocast(x.device.type, enabled=self.enable_fp16 and i != 0):
    TypeError: __init__() got multiple values for argument 'enabled'

    and I'm a bit stumped.

    opened by HandsomeDevilv112 13
  • add web demo to Huggingface

    Hi, would you be interested in adding a tortoise-tts web demo to Hugging Face using Gradio? I see there are already models set up on Hugging Face for this repo: https://huggingface.co/jbetker

    here is a guide for adding spaces to your org or username

    How to add a Space: https://huggingface.co/blog/gradio-spaces

    Example spaces with repos: github: https://github.com/salesforce/BLIP Spaces: https://huggingface.co/spaces/salesforce/BLIP

    github: https://github.com/facebookresearch/omnivore Spaces: https://huggingface.co/spaces/akhaliq/omnivore

    a Gradio demo can be set up in 2 lines of code using the inference API (if enabled) integration through Hugging Face:

    import gradio as gr
    gr.Interface.load("huggingface/jbetker/tortoise-tts-v2").launch()
    

    would launch the demo

    Please let us know if you would be interested and if you have any questions.

    opened by AK391 10
  • Problems running the program. (Failed to import soundfile)

    I've had a lot of trouble trying to get this to work. I have some experience running programs like this through the Anaconda prompt, but for some reason I can't get this to work. This is the error I have currently. I've installed all the listed dependencies and have a feeling I'm missing something very obvious; if anyone can help me I would really appreciate it. My OS is Windows 11.

    (base) C:\Users\Mok\anaconda3\envs\tortoise-tts>python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
    Traceback (most recent call last):
      File "C:\Users\Mok\anaconda3\envs\tortoise-tts\tortoise\do_tts.py", line 27, in <module>
        tts = TextToSpeech(models_dir=args.model_dir)
      File "C:\Users\Mok\anaconda3\envs\tortoise-tts\tortoise\api.py", line 242, in __init__
        self.clvp.load_state_dict(torch.load(get_model_path('clvp2.pth', models_dir)))
      File "C:\Users\Mok\anaconda3\lib\site-packages\torch\serialization.py", line 705, in load
        with _open_zipfile_reader(opened_file) as opened_zipfile:
      File "C:\Users\Mok\anaconda3\lib\site-packages\torch\serialization.py", line 242, in __init__
        super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
    RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

    opened by NakoloshGoodsby 9
  • Crash and reboot on a 3090ti, yikes!

    Ran the command:

    python tortoise/do_tts.py --text "The vote has involved a series of individual workplace-based ballots across the UK and if nurses do not back action at a local level it is possible some hospitals and services will not be involved." --voice jlaw --preset fast
    

    Power cut out and machine rebooted, scary!

    opened by chrisbward 8
  • Unable to import Tortoise on Google Colab

    When importing TextToSpeech on Google Colab, I'm encountering this error:

    src/tcmalloc.cc:283] Attempt to free invalid pointer 0x7f453d3c5d98

    Which crashes the runtime. The runtime type is GPU.

    opened by JamesLefrere 8
  • An error when trying to run the thing

    Hello everyone. For some reason, when I run the do_tts python script, I am getting this error. I input my text and select the voice to use, but I still get this. I also have an NVIDIA GPU which I have used to train tacotron models, so I really don't know what is happening here.

    Traceback (most recent call last):
      File "C:\Users\thema\Downloads\tortoise-tts-main\do_tts.py", line 22, in <module>
        tts = TextToSpeech()
      File "C:\Users\thema\Downloads\tortoise-tts-main\api.py", line 201, in __init__
        self.vocoder.load_state_dict(torch.load('.models/vocoder.pth')['model_g'])
      File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 712, in load
        return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
      File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 1046, in _load
        result = unpickler.load()
      File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 1016, in persistent_load
        load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
      File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 1001, in load_tensor
        wrap_storage=restore_location(storage, location),
      File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 176, in default_restore_location
        result = fn(storage, location)
      File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 152, in _cuda_deserialize
        device = validate_cuda_device(location)
      File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 136, in validate_cuda_device
        raise RuntimeError('Attempting to deserialize object on a CUDA '
    RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

    If anyone could help me with this, I would really appreciate it!

    opened by ethanm22 8
  • Zero Shot Intonation

    As we all have seen with the latest papers, how you prompt transformer models can greatly influence their outputs. See the typical DALL-E/Disco Diffusion prompts or the PALM paper's section on Chain-of-Thought Prompting.

    Prompt engineering is model-specific, affected by the training set.

    As an example, prompts like the ones below do not invoke intonations aligned with the text; instead, the double quotes cause shifts away from the reader's voice, as readers likely do in the training set, reading the text as the character.

    She said with a happy voice, "I start my new job today". They said happily with a happy voice, "I start my new job today".

    Matching is not necessarily expected here, so I did some further testing generating samples prompted differently to see if your model can exhibit this behavior.

    Here are my results.

    output.zip

    Here are my anecdotal findings.

    Typical Sampling is a must if you care about expressiveness, though there is a noticeable quality drop.

    PROMPT A : She said with a sad voice, "I start my new job today". PROMPT B : It is so sad, I start my new job today. PROMPT C1 : Sad, I start my new job today. PROMPT C2 : Happy, I start my new job today.

    Using A, there is a perceptual change in prosody almost in every sample over the "I start my new job today" section. This is expected, assuming aspects of the training set, where readers change to reading character quotes.

    Quotes should be avoided unless you are going for an "audio book reader effect".

    Surprisingly though, A never actually produces a "sad" sounding sentence. This could be for many reasons; I'll leave off speculating for now.

    Both B and C usefully give nice intonation aligned with the prompt, with B winning out but requiring more setup. C seems to be sufficient and simple enough that you can use it automatically.

    Further thoughts, just writing things.

    • Does Typical Sampling help? Do we get more intonation out of the model when prompting? Yes, maybe, kinda.
    • I think you have a nice balance here, Typical Sampling to pull the sampling into something novel and specific and then CLVP to pull in back.
    • Is CLVP restricting the generation of intonation?
    • Is top-k really the best choice? Since we've already spent the time computing, we should dump the top 3 or 5. Humans are the best similarity measures we have.
    • Where should the intonation be injected, along with the start token, or maybe along with the diffusion model's inputs?
    • Where does the reliance on the vocoder to carry this information come in? Should we be passing hints to the vocoder?

    This isn't my area, but I'm interested in tinkering around.

    opened by honestabelink 7
  • Read.py Prompt Engineering Error

    Attempting to use prompt engineering with Read.py leads to this error:

    Traceback (most recent call last):
      File "tortoise/beyond.py", line 68, in <module>
        gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents,
      File "/media/5TB/Env/TortoiseTTS/tortoise-tts/tortoise/api.py", line 302, in tts_with_preset
        return self.tts(text, **kwargs)
      File "/media/5TB/Env/TortoiseTTS/tortoise-tts/tortoise/api.py", line 451, in tts
        wav_candidates = [potentially_redact(wav_candidate, text) for wav_candidate in wav_candidates]
      File "/media/5TB/Env/TortoiseTTS/tortoise-tts/tortoise/api.py", line 451, in <listcomp>
        wav_candidates = [potentially_redact(wav_candidate, text) for wav_candidate in wav_candidates]
      File "/media/5TB/Env/TortoiseTTS/tortoise-tts/tortoise/api.py", line 449, in potentially_redact
        return self.aligner.redact(clip.squeeze(1), text).unsqueeze(1)
      File "/home/beyond/.local/share/virtualenvs/TortoiseTTS-3YyTW6Mb/lib/python3.8/site-packages/TorToiSe-2.4.0-py3.8.egg/tortoise/utils/wav2vec_alignment.py", line 141, in redact
      File "/home/beyond/.local/share/virtualenvs/TortoiseTTS-3YyTW6Mb/lib/python3.8/site-packages/TorToiSe-2.4.0-py3.8.egg/tortoise/utils/wav2vec_alignment.py", line 105, in align
    AssertionError: Something went wrong with the alignment algorithm. I've dumped a file, 'alignment_debug.pth', to your current working directory. Please report this along with the file so it can get fixed.

    Is there any fix for this? alignment_debug.pth.zip

    opened by nulflux 6
  • ModuleNotFoundError

    After following the installation everything went successfully, but when I try to test it I get this error:

    ModuleNotFoundError: No module named 'tortoise.models'

    What am I missing?

    opened by Tobe2d 6
  • Custom voices are not supported in colab

    Hi, thank you for making and sharing this awesome work.

    I've tried to add a custom voice right in the colab, but got this error. However, everything worked fine when I ran it locally.

    (screenshot attached: Screen Shot 2022-05-04 at 14 21 11)

    opened by bakharew 6
  • Running read.py in Colab?

    Has anyone sorted out how to get this working?

    I added a cell with "!python read.py --textfile /tortoise-tts/txt.txt --voice custom --preset fast"

    It appears to generate the first sentence but upon completion, rather than continuing to the next segment, gives an error:

    Attempt to free invalid pointer 0x7fae07039d98

    opened by cboRD181 5
  • Multiple speakers defined in input text

    Is it possible to define multiple speakers for different portions of the input text that you feed to read.py?

    Maybe via SSML syntax, or (I'm dreaming here) with natural language inside brackets (e.g., [Tom speaks:])?

    opened by system1system2 1
  • Unable to load pre-trained models on hugging face

    Just repost https://huggingface.co/jbetker/tortoise-tts-finetuned-lj/discussions/1

    I understand the risk of fine-tuning instruction and your reservation about sharing it.

    Just out of curiosity, is the loading of fine-tuned weights meant to be broken?

    opened by ti3x 0
  • Question on the speed

    How long did it take to render Red Riding Hood on a single RTX 3090 at each quality level? I am thinking about buying an RTX card, but they are a little expensive, and I would like to know whether the 24 GB memory requirement is firm. Or can I go with a smaller card like a 1080 Ti or 2080 Ti?

    Additionally, did you try using Nebullvm to accelerate the model (for Linux users)? Or, much better for Windows users, adding support for Microsoft's DirectML so all GPUs can use it (a speed-up for Windows users with AMD/Intel/Nvidia GPUs)? DirectML supports PyTorch....

    Thanks for the answer... (Questions from a poor AMD card user :-) ....)

    opened by snapo 2
  • AttributeError: 'list' object has no attribute 'encode'

    Hi.

    In the last couple of days, I have noticed that after running:

    gen = tts.tts_with_preset(text, voice_samples=voice_samples,
                              conditioning_latents=conditioning_latents,
                              preset=preset)

    I get: AttributeError: 'list' object has no attribute 'encode'

    This did not happen when I ran the code a week ago. Was something changed in the repository that would cause this?

    opened by tralala87 0