TTS is a library for advanced Text-to-Speech generation.

Mozilla

Last update: Jan 8, 2023


Overview

TTS: Text-to-Speech for all.

TTS is a library for advanced Text-to-Speech generation. It is built on the latest research and was designed to achieve the best trade-off among ease of training, speed and quality. TTS comes with pretrained models and tools for measuring dataset quality, and is already used in 20+ languages for products and research projects.

📢 English Voice Samples and SoundCloud playlist

👨‍🍳 TTS training recipes

📄 Text-to-Speech paper collection

💬 Where to ask questions

Please use our dedicated channels for questions and discussion. Help is much more valuable if it's shared publicly, so that more people can benefit from it.

Type	Platforms
🚨 Bug Reports	GitHub Issue Tracker
❔ FAQ	TTS/Wiki
🎁 Feature Requests & Ideas	GitHub Issue Tracker
👩‍💻 Usage Questions	Discourse Forum
🗯 General Discussion	Discourse Forum and Matrix Channel

🔗 Links and Resources

Type	Links
💾 Installation	TTS/README.md
👩🏾‍🏫 Tutorials and Examples	TTS/Wiki
🚀 Released Models	TTS/Wiki
💻 Docker Image	Repository by @synesthesiam
🖥️ Demo Server	TTS/server
🤖 Running TTS on Terminal	TTS/README.md
✨ How to contribute	TTS/README.md

🥇 TTS Performance

"Mozilla*" and "Judy*" are our models. Details...

Features

High performance Deep Learning models for Text2Speech tasks.
- Text2Spec models (Tacotron, Tacotron2, Glow-TTS, SpeedySpeech).
- Speaker Encoder to compute speaker embeddings efficiently.
- Vocoder models (MelGAN, Multiband-MelGAN, GAN-TTS, ParallelWaveGAN, WaveGrad, WaveRNN)
Fast and efficient model training.
Detailed training logs on console and Tensorboard.
Support for multi-speaker TTS.
Efficient Multi-GPUs training.
Ability to convert PyTorch models to Tensorflow 2.0 and TFLite for inference.
Released models in PyTorch, Tensorflow and TFLite.
Tools to curate Text2Speech datasets under dataset_analysis.
Demo server for model testing.
Notebooks for extensive model benchmarking.
Modular (but not too much) code base enabling easy testing for new ideas.

Implemented Models

Text-to-Spectrogram

Tacotron: paper
Tacotron2: paper
Glow-TTS: paper
Speedy-Speech: paper

Attention Methods

Guided Attention: paper
Forward Backward Decoding: paper
Graves Attention: paper
Double Decoder Consistency: blog

Speaker Encoder

GE2E: paper
Angular Loss: paper

Vocoders

MelGAN: paper
MultiBandMelGAN: paper
ParallelWaveGAN: paper
GAN-TTS discriminators: paper
WaveRNN: origin
WaveGrad: paper

You can also help us implement more models. Some TTS related work can be found here.

Install TTS

TTS supports Python >= 3.6, < 3.9.

If you are only interested in synthesizing speech with the released TTS models, installing from PyPI is the easiest option.

pip install TTS

If you plan to code or train models, clone TTS and install it locally.

git clone https://github.com/mozilla/TTS
cd TTS
pip install -e .

Directory Structure

|- notebooks/       (Jupyter Notebooks for model evaluation, parameter selection and data analysis.)
|- utils/           (common utilities.)
|- TTS
    |- bin/             (folder for all the executables.)
      |- train*.py                  (train your target model.)
      |- distribute.py              (train your TTS model using Multiple GPUs.)
      |- compute_statistics.py      (compute dataset statistics for normalization.)
      |- convert*.py                (convert target torch model to TF.)
    |- tts/             (text to speech models)
        |- layers/          (model layer definitions)
        |- models/          (model definitions)
        |- tf/              (Tensorflow 2 utilities and model implementations)
        |- utils/           (model specific utilities.)
    |- speaker_encoder/ (Speaker Encoder models.)
        |- (same)
    |- vocoder/         (Vocoder models.)
        |- (same)

Sample Model Output

Below you see the Tacotron model state after 16K iterations with batch size 32 on the LJSpeech dataset.

"Recent research at Harvard has shown meditating for as little as 8 weeks can actually increase the grey matter in the parts of the brain responsible for emotional regulation and learning."

Audio examples: soundcloud

Datasets and Data-Loading

TTS provides a generic dataloader that is easy to use with your custom dataset. You just need to write a simple function to format the dataset; see datasets/preprocess.py for examples. After that, set the dataset fields in config.json.
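As an illustration of such a formatting function, here is a minimal sketch. It assumes a pipe-separated metadata file with lines of the form `file_id|transcript`, audio stored under a `wavs/` subfolder, and that the loader expects one `[text, wav_path, speaker_name]` item per utterance. The function name `my_dataset` and the speaker label are hypothetical; check datasets/preprocess.py for the item layout your TTS version actually expects.

```python
import os


def my_dataset(root_path, meta_file):
    """Hypothetical formatter: returns one [text, wav_path, speaker_name] item per line.

    Assumes each metadata line looks like `file_id|transcript`.
    """
    items = []
    speaker_name = "my_speaker"  # assumption: a single-speaker dataset
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            cols = line.split("|")
            wav_path = os.path.join(root_path, "wavs", cols[0] + ".wav")
            text = cols[1]
            items.append([text, wav_path, speaker_name])
    return items
```

Keeping the formatter free of audio loading means it can be checked quickly on a handful of metadata lines before a long training run.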

Some of the public datasets to which we have successfully applied TTS:

Example: Synthesizing Speech on Terminal Using the Released Models.

After the installation, TTS provides a CLI for synthesizing speech using pre-trained models. You can use either your own model or the released models under the TTS project.

Listing released TTS models.

tts --list_models

Run a tts and a vocoder model from the released model list. (Simply copy and paste the full model names from the list as arguments for the command below.)

tts --text "Text for TTS" \
    --model_name "<type>/<language>/<dataset>/<model_name>" \
    --vocoder_name "<type>/<language>/<dataset>/<model_name>" \
    --out_path folder/to/save/output/

Run your own TTS model (Using Griffin-Lim Vocoder)

tts --text "Text for TTS" \
    --model_path path/to/model.pth.tar \
    --config_path path/to/config.json \
    --out_path output/path/speech.wav

Run your own TTS and Vocoder models

tts --text "Text for TTS" \
    --model_path path/to/model.pth.tar \
    --config_path path/to/config.json \
    --out_path output/path/speech.wav \
    --vocoder_path path/to/vocoder.pth.tar \
    --vocoder_config_path path/to/vocoder_config.json

Note: You can use ./TTS/bin/synthesize.py if you prefer running tts from the TTS project folder.
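When calling the CLI from a script, it is easy to mix up the checkpoint and config paths. The sketch below builds the argument list for the "own model" case using only the flags shown in the examples above. The helper name `build_tts_command` and the file-extension checks are illustrative assumptions, not part of TTS.

```python
import shlex


def build_tts_command(text, model_path, config_path, out_path,
                      vocoder_path=None, vocoder_config_path=None):
    """Build the argv list for the `tts` CLI (hypothetical helper)."""
    # Guard against swapped arguments: the config is JSON, the checkpoint is not.
    if not config_path.endswith(".json"):
        raise ValueError("config_path should point to a .json file")
    if model_path.endswith(".json"):
        raise ValueError("model_path should point to a checkpoint, not a config")
    cmd = [
        "tts",
        "--text", text,
        "--model_path", model_path,
        "--config_path", config_path,
        "--out_path", out_path,
    ]
    # Without a vocoder, the CLI falls back to Griffin-Lim as described above.
    if vocoder_path and vocoder_config_path:
        cmd += [
            "--vocoder_path", vocoder_path,
            "--vocoder_config_path", vocoder_config_path,
        ]
    return cmd


cmd = build_tts_command(
    "Text for TTS",
    "path/to/model.pth.tar",
    "path/to/config.json",
    "output/path/speech.wav",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.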

Example: Training and Fine-tuning LJ-Speech Dataset

Here you can find a Colab notebook with a hands-on example of training on LJSpeech. Alternatively, follow the steps below manually.

To start with, split metadata.csv into train and validation subsets, metadata_train.csv and metadata_val.csv respectively. Note that for text-to-speech, validation performance might be misleading, since the loss value does not directly measure voice quality as perceived by the human ear, nor does it measure the performance of the attention module. Therefore, running the model on new sentences and listening to the results is the best way to go.

shuf metadata.csv > metadata_shuf.csv
head -n 12000 metadata_shuf.csv > metadata_train.csv
tail -n 1100 metadata_shuf.csv > metadata_val.csv
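On systems without `shuf`, the same split can be sketched in Python. The helper name `split_metadata` and the fixed seed are assumptions made for illustration. Unlike the `head`/`tail` pair, this version keeps the two subsets disjoint for any file length.

```python
import random


def split_metadata(lines, n_val, seed=0):
    """Shuffle metadata lines and return (train_lines, val_lines)."""
    if not 0 < n_val < len(lines):
        raise ValueError("n_val must be between 1 and len(lines) - 1")
    shuffled = list(lines)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    return shuffled[:-n_val], shuffled[-n_val:]


def split_metadata_file(src="metadata.csv", n_val=1100):
    """Write metadata_train.csv and metadata_val.csv next to the current directory."""
    with open(src, "r", encoding="utf-8") as f:
        lines = f.readlines()
    train, val = split_metadata(lines, n_val)
    with open("metadata_train.csv", "w", encoding="utf-8") as f:
        f.writelines(train)
    with open("metadata_val.csv", "w", encoding="utf-8") as f:
        f.writelines(val)
```

Calling `split_metadata_file()` in the dataset folder produces the two files with the names used in the shell commands above.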

To train a new model, you need to write your own config.json to define the model details, training configuration and more (check the examples). Then call the corresponding train script.

For instance, to train a Tacotron or Tacotron2 model on the LJSpeech dataset, run:

python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json

To fine-tune a model, use --restore_path.

python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json --restore_path /path/to/your/model.pth.tar

To continue an old training run, use --continue_path.

python TTS/bin/train_tacotron.py --continue_path /path/to/your/run_folder/

For multi-GPU training, call distribute.py. It runs any provided train script in a multi-GPU setting.

CUDA_VISIBLE_DEVICES="0,1,4" python TTS/bin/distribute.py --script train_tacotron.py --config_path TTS/tts/configs/config.json

Each run creates a new output folder containing the config.json used, model checkpoints and Tensorboard logs.

In case of an error or interrupted execution, if there is no checkpoint yet under the output folder, the whole folder is removed.

You can also follow the training in Tensorboard by pointing its --logdir argument to the experiment folder.

Contribution Guidelines

This repository is governed by Mozilla's code of conduct and etiquette guidelines. For more details, please read the Mozilla Community Participation Guidelines.

Create a new branch.
Implement your changes.
(if applicable) Add Google Style docstrings.
(if applicable) Implement a test case under tests folder.
(Optional but Preferred) Run tests.

./run_tests.sh

Run the linter.

pip install pylint cardboardlint
cardboardlinter --refspec master

Send a PR to dev branch, explain what the change is about.
Let us discuss until we make it perfect :).
We merge it to the dev branch once things look good.

Feel free to ping us at any step you need help using our communication channels.

Collaborative Experimentation Guide

If you would like to use TTS to try a new idea and share your experiments with the community, we urge you to follow the guideline below for better collaboration. (If you have an idea for better collaboration, let us know.)

Create a new branch.
Open an issue pointing to your branch.
Explain your idea and experiment.
Share your results regularly. (Tensorboard log files, audio results, visuals etc.)

Major TODOs

Implement the model.
Generate human-like speech on LJSpeech dataset.
Generate human-like speech on a different dataset (Nancy) (TWEB).
Train TTS with r=1 successfully.
Enable process based distributed training. Similar to (https://github.com/fastai/imagenet-fast/).
Adapting Neural Vocoder. TTS works with WaveRNN and ParallelWaveGAN (https://github.com/erogol/WaveRNN and https://github.com/erogol/ParallelWaveGAN)
Multi-speaker embedding.
Model optimization (model export, model pruning etc.)

Acknowledgement

https://github.com/keithito/tacotron (Dataset pre-processing)
https://github.com/r9y9/tacotron_pytorch (Initial Tacotron architecture)
https://github.com/kan-bayashi/ParallelWaveGAN (vocoder library)
https://github.com/jaywalnut310/glow-tts (Original Glow-TTS implementation)
https://github.com/fatchord/WaveRNN/ (Original WaveRNN implementation)

Comments

Tacotron2 + WaveRNN experiments
Tacotron2: https://arxiv.org/pdf/1712.05884.pdf WaveRNN: https://github.com/erogol/WaveRNN forked from https://github.com/fatchord/WaveRNN

The idea is to add Tacotron2 as another alternative if it is really more useful than the current model.

[x] Code boilerplate Tacotron2 architecture.

[x] Train Tacotron2 and compare results (Baseline)

[x] Train TTS current model in a comparable size with T2. (Current TTS model has 7M and Tacotron2 has 28M parameters)

[x] Add TTS specific architectural changes to T2 and compare with the baseline.

[x] Train WaveRNN a vocoder on generated spectrograms

[x] Train a better stopnet. The stopnet sometimes misses the stop prediction, which leads to unstable predictions. Maybe it is better to use an RNN, as in the previous TTS version.

[x] Release LJspeech Tacotron 2 model. (soon)

[x] Release LJSpeech WaveRNN model. (https://github.com/erogol/WaveRNN)

Best result so far: https://soundcloud.com/user-565970875/ljspeech-logistic-wavernn

Some findings:

Adding an entropy loss for the attention seems to improve the cases where the alignment is hard to learn. It forces the network to learn sparser and noise-free alignment weights.

entropy = torch.distributions.Categorical(probs=alignments).entropy()
entropy_loss = (entropy / np.log(alignments.shape[1])).mean()
loss += 1e-4 * entropy_loss

Here is the alignment with entropy loss. However, if you keep the loss weight high, then it degrades the model's generalization for new words.
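To make the effect of this term concrete, the following dependency-free sketch computes a normalized attention entropy for two toy attention rows. It is not the training code: the function names are hypothetical, and it assumes the entropy is normalized by the log of the number of attended positions so that values fall in [0, 1].

```python
import math


def normalized_entropy(probs):
    """Entropy of one attention row divided by log(len(probs)).

    Assumes probs sums to 1 and has at least two entries.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def entropy_loss(alignments):
    """Mean normalized entropy over attention rows (one row per decoder step)."""
    return sum(normalized_entropy(row) for row in alignments) / len(alignments)


uniform = [[0.25, 0.25, 0.25, 0.25]]  # attention spread over all inputs
peaked = [[0.97, 0.01, 0.01, 0.01]]   # attention focused on one input

print("uniform:", entropy_loss(uniform))
print("peaked: ", entropy_loss(peaked))
```

The peaked row yields a much smaller value than the uniform one, which is why minimizing this term with a small weight pushes the alignment toward sparse, sharp attention.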

Replacing Prenet with a BatchNorm version enhances the performance quite a lot.

A network with a BN Prenet has a harder time learning the attention. It looks like the network needs a level of noise on the autoregressive connection to relate the encoder output to the network output. Otherwise, in teacher forcing mode, the network does not need the encoder output, since it finds the previous prediction frame sufficient to generate the next frame.

Forward attention seems more robust to longer sequences and faster to align. (https://arxiv.org/abs/1807.06736)

improvement experiment
opened by erogol 80
Train a better Speaker Encoder
Our current speaker encoder is trained with only the LibriTTS (100, 360) datasets. However, we can improve its performance using other available datasets (VoxCeleb, LibriTTS-500, Common Voice etc.). This will also increase the performance of our multi-speaker model and make it easier to adapt to new voices.

I can't really work on this alone due to the recent changes and the amount of work needed, so I need a hand here.

The TODO list is as follows; feel free to contribute to any part of it or suggest changes:

[x] decide target datasets

[x] download and preprocess the datasets

[x] write preprocessors for new datasets

[x] increase the efficiency of the speaker encoder data-loader.

[x] training a model only using Eng datasets.

[x] training a model with all the available datasets.

improvement help wanted discussion
opened by erogol 79
[Discussion] WaveGrad

This is not an issue and is more of a discussion. I read the WaveGrad paper today (which may be found here) and listened to the samples here, which sound very good. There seems to be an open source implementation already here with great progress. Has anyone read the paper or used this implementation?
wontfix discussion

opened by george-roussos 76
Multi Speaker Embeddings
Hi @erogol, I've been a bit off the radar for the past month because of vacation and other projects, but now I am back and ready for action! I am looking into how to do multi speaker embeddings, and here's my current plan of action:

Have all preprocessors output items that also have a speaker ID to be used down the line. Formats that do not have explicit speaker ids, i.e. all current preprocessors, would use a uniform ID. This speaker ID must then be passed down by the dataset through the collate function and into the forward pass of the model.

Add speaker embeddings to the model. An additional embedding with configurable number of speakers and embedding dimensionality. The embedding vector is retrieved based on speaker id and then replicated and concatenated to each encoder output. The result is passed to the decoder as before. Here we could also easily ignore speaker embeddings if we only deal with a single speaker.

It might make sense to let speaker embeddings put some constraints on the train/dev/test split, i.e. every speaker in the dev/test set should at least have some examples in the train set, otherwise their embeddings are never learned. I could implement a check for that and issue a warning if this isn't the case.
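The train/dev check described in the last point could look roughly like the sketch below. It assumes each dataset item is a `(text, wav_path, speaker_id)` tuple, following the plan of having preprocessors emit a speaker ID. The function names are hypothetical.

```python
import warnings


def unseen_eval_speakers(train_items, eval_items):
    """Return speaker IDs that occur in eval_items but never in train_items.

    Assumes each item is a (text, wav_path, speaker_id) tuple.
    """
    train_speakers = {item[2] for item in train_items}
    eval_speakers = {item[2] for item in eval_items}
    return sorted(eval_speakers - train_speakers)


def warn_on_unseen_speakers(train_items, eval_items):
    """Issue a warning if some eval speakers have no training examples."""
    missing = unseen_eval_speakers(train_items, eval_items)
    if missing:
        warnings.warn(
            f"Speakers without training examples (embeddings never learned): {missing}"
        )
    return missing
```

Running the check once after the split, before training starts, is cheap compared with discovering untrained embeddings from poor dev-set audio later.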

Any thoughts or additional hints on this?
wontfix
opened by twerkmeister 51
[New-Model] Implement Multilingual Speech Synthesis

I was wondering if anyone else would be interested by the implementation of this paper in the mozilla/TTS repo : "Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning"

I think that the ability to code-switch is a huge plus for non-English models, since English is used in everyday life, and not being able to pronounce English words in French, for example, limits the usability of the model. (My model can't say "parking".)

Furthermore, I hope that combining this with the new encoder we have trained would allow for voice cloning in languages with low resources (or at least make more voices available).

I'm a beginner when it comes to PyTorch, but I would love to help implement this paper, although I'm not sure I can do it alone.

What do you think? Would it be interesting to have that in the repo? Would it be hard to implement? Who would be willing to help?

Thanks for reading
wontfix new-model

opened by WeberJulian 43
[Poll] Should we include WaveRNN in Mozilla TTS ?

I see a lot of people still use WaveRNN although we released new faster vocoders.

I am not willing to invest time in it given the much faster alternatives, but let us know if you would like to see WaveRNN as part of the Mozilla TTS repo.

Please give a thumbs up or down to this post to vote.

You can also state your comment or reason to have WaveRNN below.
help wanted poll

opened by erogol 40
Model Release: Tacotron2 with Discrete Graves Attention - LJSpeech

Model Link: https://drive.google.com/drive/folders/12Ct0ztVWHpL7SrEbUammGMmDopOKL9X_?usp=sharing

This model is trained with Discrete Graves attention and a BatchNorm prenet. It produces good examples with robust attention alignment without any inference-time tricks. You can even hear breathing effects between pauses with this model.

You can also use this TTS model with the PWGAN or WaveRNN vocoders. PWGAN provides real-time voice synthesis; WaveRNN is slower but provides better quality.

https://github.com/erogol/ParallelWaveGAN https://github.com/erogol/WaveRNN

(Ignore the small jiggle on the figures caused by TB)

model-release

opened by erogol 36
Parallel_wavegan tensorboard results weird

I used the dev branch to train PWGAN. When I looked at the Tensorboard results, the spectrograms looked weird. Did I do something wrong or miss something?

I used the original parallel_wavegan_config.json.

wontfix

opened by PPGGG 33
Introduce github action for CI
It seemed to me like Travis-CI checks are not working anymore. I'm aware of the new pricing policy they introduced recently and suspected it might be due to that.

The CI last ran somewhere mid-october.

Since this project is hosted on GitHub, I believe their actions feature might be a good fit for the time being. So I started to port the travis tests to the best of my understanding. I hope that is alright.

You can look at the current state over here: https://github.com/mweinelt/TTS/actions/runs/363907718

There is currently the following issue, that was introduced in 39c71ee8a98bcbfea242e6b203556150ee64205b:

======================================================================
ERROR: Test if all layers are updated in a basic training cycle
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/TTS/TTS/tests/test_wavegrad_train.py", line 36, in test_train_step
    model.compute_noise_level(1000, 1e-6, 1e-2)
TypeError: compute_noise_level() takes 2 positional arguments but 4 were given
----------------------------------------------------------------------

I'll happily rebase once this fix has hit the dev branch, so we can check if this works.
opened by mweinelt 32
prenet dropout

I was using another repo previously, and now I am switching to mozilla TTS;

In my experience, the dropout in the decoder prenet is also used at inference time; without dropout at inference, the quality is bad (Tacotron 2), which is hard to understand.

Do you have a similar experience, and if so, why does this happen?
experiment

opened by xinqipony 32
Multi-speaker Tacotron model training from scratch
Hi,

I'm trying to train a multi-speaker Tacotron model from scratch using VCTK + LibriTTS databases. The model trains fine until about 50K global steps but after that I start running into "CUDA out of memory", "NaN loss with key=decoder_coarse_loss", or "NaN loss with key=decoder_loss" errors. I tried reducing batch sizes, limiting input sequence lengths, and/or reducing learning rate but those didn't seem to help. I also tried training from scratch using VCTK only and ended up with similar errors. I'm training on a single Titan X GPU with 12GB memory. I didn't want to try multi-gpu training yet so I wonder if I should be setting some parameters differently in the config file. Any suggestions? Also, can someone explain the following parameters and how they should be set for single GPU training? Or, should I simply avoid single GPU training?

"num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4, // number of evaluation data loader processes.
"batch_group_size": 4, //Number of batches to shuffle after bucketing.

Thanks!

Additional info:

My branch is based on commit ea976b0543c7fa97628c41c4a936e3113896d18a

Config file attached

Tensorboard loss plots, attention alignments, output spectrograms, Griffin-Lim synthesized audio look/sound as expected before running into these errors

As far as I can tell, the errors occur pretty randomly. Training could continue for a couple of thousand steps after 50K steps or fail after 500 steps. I also don't see any specific input files triggering these errors consistently.
opened by oytunturk 25

error in --list_speaker_idxs

Hello. I've installed tts via pip

tts --list_speaker_idxs generates the following error:

 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
Traceback (most recent call last):
  File "/home/user/.local/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/TTS/bin/synthesize.py", line 333, in main
    print(synthesizer.tts_model.speaker_manager.name_to_id)
AttributeError: 'NoneType' object has no attribute 'name_to_id'

opened by 0x199x 0

Error in conversion from Torch to TF model

Hi, I have been using convert_tacotron2_torch_to_tf.py to convert a downloaded Tacotron model to its TF version, but I ran into an error:

AssertionError: [!] weight shapes does not match: decoder/while/attention/query_layer/linear_layer/kernel:0 vs decoder.attention.query_layer.weight --> (1024, 128) vs (128, 1024)

I think it is a bug in the conversion code. Would you please help me solve the issue?

Neda

opened by nfaraji2002 0
Short input on the server does not finish

Sometimes, when I enter a short text, for example 'i'm ironman', the wave file is not short and the result is: "i'm ironmannneeeueueneeueueeneeuheuuueuehhahahhhhhahhhhaahhhhahhahanuuuuuuuuhhh", ending in what sounds like an imitation of a motorcycle.

opened by greatAznur 0

Tacotron (2?) based models appear to be limited to rather short input

Running tts --text on some meaningful sentences results in the following output:

$ tts --text "An important event is the scheduling that periodically raises or lowers the CPU priority for each process in the system based on that process’s recent CPU usage (see Section 4.4). The rescheduling calculation is done once per second. The scheduler is started at boot time, and each time that it runs, it requests that it be invoked again 1 second in the future."                                                           
 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...
 > Text: An important event is the scheduling that periodically raises or lowers the CPU priority for each process in the system based on that process’s recent CPU usage (see Section 4.4). The rescheduling calculation is done once per second. The scheduler is started at boot time, and each time that it runs, it requests that it be invoked again 1 second in the future.
 > Text splitted to sentences.
['An important event is the scheduling that periodically raises or lowers the CPU priority for each process in the system based on that process’s recent CPU usage (see Section 4.4).', 'The rescheduling calculation is done once per second.', 'The scheduler is started at boot time, and each time that it runs, it requests that it be invoked again 1 second in the future.']
   > Decoder stopped with `max_decoder_steps` 500
   > Decoder stopped with `max_decoder_steps` 500
 > Processing time: 52.66666388511658
 > Real-time factor: 3.1740607061125763
 > Saving output to tts_output.wav

The audio file is truncated with respect to the text. If I hack the config file at TTS/tts/configs/tacotron_config.py to have a larger max_decoder_steps value, the output does seem to successfully get longer, but I'm not sure how safe this is.

Are there any better solutions? Should I use a different model?

opened by deliciouslytyped 10

Releases(v0.0.9)

v0.0.9(Jan 29, 2021)
This is the first and v0.0.9 release of TTS, an open text-to-speech engine. TTS is still an evolving project and any upcoming release might be significantly different and not backward compatible.

In this release, we provide the following models.

| Language | Dataset | Model Name | Model Type | Download |
| ------------- |:------:|:-----------------:|-----------------:|----|
| English | LJSpeech | TacotronDCA | tts | 💾 |
| English | LJSpeech | Glow-TTS | tts | 💾 |
| Spanish | M-AILabs | TacotronDDC | tts | 💾 |
| French | M-AILabs | TacotronDDC | tts | 💾 |
| English | LJSpeech | MB-MelGAN | vocoder | 💾 |
| Multi-Lang | LibriTTS | FullBand-MelGAN | vocoder | 💾 |
| Multi-Lang | LibriTTS | WaveGrad | vocoder | 💾 |

Notes

Multi-Lang vocoder models are intended for non-English models.

Vocoder models are trained independently of the tts models, possibly with different sampling rates. Therefore, the performance is not optimal.

All models are trained with phonemes generated by espeak back-end (not espeak-ng).

This release has been tested under Python 3.6, 3.7, and 3.8. It is strongly suggested to use conda to install the dependencies and set-up the runtime environment.
Source code(tar.gz)
Source code(zip)
tts_models--en--ljspeech--glow-tts.zip(300.47 MB)
tts_models--en--ljspeech--tacotron2-DCA.zip(298.14 MB)
tts_models--es--mai--tacotron2-DDC.zip(498.95 MB)
tts_models--fr--mai--tacotron2-DDC.zip(500.00 MB)
vocoder_models--en--ljspeech--mulitband-melgan.zip(73.11 MB)
vocoder_models--universal--libri-tts--fullband-melgan.zip(73.11 MB)
vocoder_models--universal--libri-tts--wavegrad.zip(166.83 MB)


