PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

Overview

FastSpeech 2 - PyTorch Implementation

This is a PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. This project is based on xcmyz's implementation of FastSpeech. Feel free to use/modify the code.

There are several versions of FastSpeech 2. This implementation is closest to version 1, which uses F0 values as the pitch features; the later versions instead use pitch spectrograms extracted by continuous wavelet transform as the pitch features.
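
For reference, here is a minimal sketch of what frame-level F0 extraction with pyworld looks like; the function name and default values are illustrative assumptions, not the repo's exact code.

    import numpy as np
    import pyworld as pw

    def extract_f0(wav, sampling_rate=22050, hop_length=256):
        # DIO gives a rough F0 contour, one value per hop-sized frame
        wav = wav.astype(np.float64)
        f0, t = pw.dio(wav, sampling_rate,
                       frame_period=hop_length / sampling_rate * 1000)
        f0 = pw.stonemask(wav, f0, t, sampling_rate)  # refine the estimate
        return f0  # zeros mark unvoiced frames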

Updates

  • 2021/7/8: Release the checkpoint and audio samples of a multi-speaker English TTS model trained on LibriTTS
  • 2021/2/26: Support English and Mandarin TTS
  • 2021/2/26: Support multi-speaker TTS (AISHELL-3 and LibriTTS)
  • 2021/2/26: Support MelGAN and HiFi-GAN vocoder

Audio Samples

Audio samples generated by this implementation can be found here.

Quickstart

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Inference

You have to download the pretrained models and put them in output/ckpt/LJSpeech/, output/ckpt/AISHELL3/, or output/ckpt/LibriTTS/.

For English single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

For Mandarin multi-speaker TTS, try

python3 synthesize.py --text "大家好" --speaker_id SPEAKER_ID --restore_step 600000 --mode single -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml

For English multi-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT"  --speaker_id SPEAKER_ID --restore_step 800000 --mode single -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml

The generated utterances will be put in output/result/.

Here is an example of a synthesized mel-spectrogram for the sentence "Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition", generated with the English single-speaker TTS model.

Batch Inference

Batch inference is also supported; try

python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 900000 --mode batch -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

to synthesize all utterances in preprocessed_data/LJSpeech/val.txt.
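
The --source file is the pipe-separated metadata list produced by preprocessing. Judging from the multi-speaker example quoted in the comments below, a line has the form filename|speaker|phoneme sequence|raw text, e.g.

6694_70837_000012_000001|6694|{P IY1 S}|peace!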

Controllability

The pitch, volume, and speaking rate of the synthesized utterances can be controlled by specifying the desired pitch, energy, and duration ratios. For example, you can increase the speaking rate by 20% and decrease the volume by 20% with

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml --duration_control 0.8 --energy_control 0.8
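
Pitch can be scaled in the same way. Assuming the flag follows the same naming pattern as the other two (--pitch_control; treat this as an assumption and check synthesize.py for the exact name), raising the pitch by 20% would look like

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml --pitch_control 1.2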

Training

Datasets

The supported datasets are

  • LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • AISHELL-3: a Mandarin TTS dataset with 218 male and female speakers, roughly 85 hours in total.
  • LibriTTS: a multi-speaker English dataset containing 585 hours of speech by 2456 speakers.

We take LJSpeech as an example hereafter.

Preprocessing

First, run

python3 prepare_align.py config/LJSpeech/preprocess.yaml

to prepare the text and audio files for alignment.

As described in the paper, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments of the supported datasets are provided here. You have to unzip the downloaded files into preprocessed_data/LJSpeech/TextGrid/.

After that, run the preprocessing script by

python3 preprocess.py config/LJSpeech/preprocess.yaml
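
For intuition, here is a rough sketch of how preprocessing can turn an MFA TextGrid into per-phone durations in mel frames; it assumes the tgt package and the 22050 Hz sampling rate / 256-sample hop of the LJSpeech config, and is not the repo's exact code.

    import tgt

    def get_durations(textgrid_path, sampling_rate=22050, hop_length=256):
        tg = tgt.io.read_textgrid(textgrid_path)
        tier = tg.get_tier_by_name("phones")
        phones, durations = [], []
        for interval in tier._objects:  # interval objects of the "phones" tier
            phones.append(interval.text)
            start = int(round(interval.start_time * sampling_rate / hop_length))
            end = int(round(interval.end_time * sampling_rate / hop_length))
            durations.append(end - start)  # number of mel frames for this phone
        return phones, durations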

Alternatively, you can align the corpus yourself. Download the official MFA package and run

./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech

or

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech

to align the corpus and then run the preprocessing script.

python3 preprocess.py config/LJSpeech/preprocess.yaml
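
Note: as reported in the comments below, newer MFA releases no longer ship the mfa_align binary; with the new single mfa entry point, the roughly equivalent command is

mfa align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech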

Training

Train your model with

python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

The model needs fewer than 10k training steps (less than 1 hour on my GTX 1080Ti GPU) to generate audio samples of acceptable quality, which is much more efficient than autoregressive models such as Tacotron 2.

TensorBoard

Use

tensorboard --logdir output/log/LJSpeech

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.

Implementation Issues

  • Following xcmyz's implementation, I use an additional Tacotron-2-styled Post-Net after the decoder, which is not used in the original FastSpeech 2.
  • Gradient clipping is used during training.
  • In my experience, using phoneme-level pitch and energy prediction instead of frame-level prediction results in much better prosody, and normalizing the pitch and energy features also helps (see the sketch after this list). Please refer to config/README.md for more details.
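
As a minimal sketch (an assumption about the approach, not the repo's exact code), phoneme-level pitch values can be obtained by averaging the frame-level F0 over each phone's frames, using the MFA durations:

    import numpy as np

    def to_phoneme_level(pitch, durations):
        """pitch: frame-level F0 array; durations: frames per phoneme (summing to len(pitch))."""
        phoneme_pitch = np.zeros(len(durations))
        pos = 0
        for i, d in enumerate(durations):
            values = pitch[pos:pos + d]
            values = values[values != 0]  # drop unvoiced frames
            phoneme_pitch[i] = values.mean() if len(values) > 0 else 0.0
            pos += d
        return phoneme_pitch  # the same trick applies to the energy features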

Please let me know if you find any mistakes in this repo, or have any useful tips for training the FastSpeech 2 model.

References

Citation

@INPROCEEDINGS{chien2021investigating,
  author={Chien, Chung-Ming and Lin, Jheng-Hao and Huang, Chien-yu and Hsu, Po-chun and Lee, Hung-yi},
  booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech}, 
  year={2021},
  volume={},
  number={},
  pages={8588-8592},
  doi={10.1109/ICASSP39728.2021.9413880}}
Comments
  • RuntimeError: Error(s) in loading state_dict for FastSpeech2:

    RuntimeError: Error(s) in loading state_dict for FastSpeech2:

    • When I tried to load the pretrained model output/LJSpeech/ckpt/900000.pth.tar, I got some errors:
    size mismatch for encoder.src_word_emb.weight: copying a param with shape torch.Size([361, 256]) from checkpoint, the shape in current model is torch.Size([151, 256]).
    
    • The code which loads the model from repo
    base_config_path = "config/LJSpeech"
    prepr_path = f"{base_config_path}/preprocess.yaml"
    model_path = f"{base_config_path}/model.yaml"
    train_path = f"{base_config_path}/train.yaml"
    
    prepr_config = yaml.load(open(prepr_path, "r"), Loader=yaml.FullLoader)
    model_config = yaml.load(open(model_path, "r"), Loader=yaml.FullLoader)
    train_config = yaml.load(open(train_path, "r"), Loader=yaml.FullLoader)
    configs = (prepr_config, model_config, train_config)
    cpkt_path = "output/LJSpeech/ckpt/900000.pth.tar"
    
    def get_model(ckpt_path, configs):
        (preprocess_config, model_config, train_config) = configs
        model = FastSpeech2(preprocess_config, model_config).to(device)
        ckpt = torch.load(ckpt_path)
        model.load_state_dict(ckpt["model"])
        model.eval()
        model.requires_grad_ = False
        return model
    
    opened by leminhnguyen 15
  • Coding error when run preprocessor.py

    Coding error when run preprocessor.py

    Hello, I am trying to train a model using the LJSpeech dataset, following the steps in README.md, and got this error:

    $ python3 preprocess.py config/LJSpeech/preprocess.yaml
    Processing Data ...
      0%| | 0/1 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "preprocess.py", line 15, in <module>
        preprocessor.build_from_path()
      File "/FastSpeech2/preprocessor/preprocessor.py", line 85, in build_from_path
        if len(pitch) > 0:
    UnboundLocalError: local variable 'pitch' referenced before assignment

    I've already downloaded TextGrid files and put them in '/FastSpeech2/preprocessed_data/LJSpeech/LJSpeech/TextGrid'

    opened by EuphoriaCelestial 12
  • The loss of variance_adaptor(Mandarin dataset)

    The loss of variance_adaptor(Mandarin dataset)

    Hi, I used a Mandarin dataset (BIAOBEI) to train FastSpeech2. The mel and PostNet mel losses seem fine. But I found that the loss of the variance_adaptor (Duration Loss, F0 Loss and Energy Loss) is really high.

    The following is a part of my log: Epoch [191/1000], Step [115650/608000]: Total Loss: 68.7120, Mel Loss: 0.2892, Mel PostNet Loss: 0.2889, Duration Loss: 2.2572, F0 Loss: 59.7572, Energy Loss: 6.1195; Time Used: 28493.331s, Estimated Time Remaining: 97800.871s.

    How could I solve this?

    Thank you.

    opened by humanlost 11
  • Why is the wav quality with frame level much worse than with phoneme level?

    Why is the wav quality with frame level much worse than with phoneme level?

    The frame-level implementation seems closer to the implementation in the paper. But why is the quality much worse than with phoneme level? Also, with phoneme level, the pitch is predicted before expanding the length, which means all frames within a phoneme share the same pitch. I think the pitch within a phoneme sometimes changes. But still, the phoneme-level version has excellent performance...

    opened by WuMing757 10
  • Problem with VarianceAdaptor implementation

    Problem with VarianceAdaptor implementation

    Here : https://github.com/ming024/FastSpeech2/blob/bd4c3413c90c5c1310066b1051d5c67abb75e2fa/modules.py#L48

    and then https://github.com/ming024/FastSpeech2/blob/bd4c3413c90c5c1310066b1051d5c67abb75e2fa/modules.py#L50

    But as per the paper, the input of the energy predictor should be the output of the length regulator, not the output of the pitch predictor. In the FastSpeech 2 diagram, the input of the energy predictor is clearly x, the output of the length regulator, without the pitch component. The actual code should be like:

     def forward(self, x, duration_target=None, pitch_target=None, energy_target=None, max_length=None):
    
            duration_prediction = self.duration_predictor(x)
            if duration_target is not None:
                x, mel_pos = self.length_regulator(x, duration_target, max_length)
            else:
                duration_rounded = torch.round(duration_prediction)
                x, mel_pos = self.length_regulator(x, duration_rounded)
            
            pitch_prediction = self.pitch_predictor(x)
            if pitch_target is not None:
                pitch_embedding = self.pitch_embedding(torch.bucketize(pitch_target, self.pitch_bins))
            else:
                pitch_embedding = self.pitch_embedding(torch.bucketize(pitch_prediction, self.pitch_bins))
      
            
            energy_prediction = self.energy_predictor(x)
            if energy_target is not None:
                energy_embedding = self.energy_embedding(torch.bucketize(energy_target, self.energy_bins))
            else:
                energy_embedding = self.energy_embedding(torch.bucketize(energy_prediction, self.energy_bins))
    
            x = x + pitch_embedding
            x = x + energy_embedding
            
            return x, duration_prediction, pitch_prediction, energy_prediction, mel_pos
    
    opened by rishikksh20 9
  • Is the predicted wav really better than FastSpeech 1?

    Is the predicted wav really better than FastSpeech 1?

    I tried to train on my own dataset, but the result is not as good as I expected, and even worse than FastSpeech 1. I used the default settings, phone-level, with HiFi-GAN as the vocoder. How about your results?

    opened by Liujingxiu23 8
  • problems on chinese dataset

    problems on chinese dataset

    Hi, I trained FastSpeech2 on the BIAOBEI dataset. It seems that the pitch predictor and the duration predictor don't work well, but if you input the ground truth, you can get a good result. So does your demo use the ground-truth pitch and energy, or use the predictors to predict them? I found other FastSpeech2 code in TensorFlowTTS and ESPnet2; they use a structure different from the original paper: they apply the length regulator after all the predictors have finished, and the pitch and energy predictors work well.

    opened by yuzuda283 8
  • Train from Scratch Multispeaker with LJS and Custom Dataset

    Train from Scratch Multispeaker with LJS and Custom Dataset

    Hey, I have a custom dataset that contains about 12 hours of data, and I want to pre-train a multi-speaker model using LJSpeech.

    Would I take the pretrained FS2 LJSpeech model, go into model.yaml,

    change it to multi_speaker: True and speaker: 'universal',

    and ensure that the preprocessor outputs my data like so?
    6694_70837_000012_000001|6694|{P IY1 S}|peace!
    (name of file | speaker | alignment | utterance)

    opened by ArEnSc 7
  • Male voice

    Male voice

    First, thank you; I have solved the issue I opened thanks to your support. In my understanding, both MelGAN (which I have tried) and WaveGlow (not run yet) have a female voice. To get a male voice, is it necessary to train the model from scratch? Or to add support for a specific vocoder?

    Thank you.

    opened by loretoparisi 7
  • MFA alignment is not accurate

    MFA alignment is not accurate

    Hi, I am using MFA to do forced alignment on my own dataset. I found that the alignment results are not accurate.

    My dataset is 15 hours, maybe not big enough for training an AM from scratch. After increasing my dataset to 30 hours, it is still not accurate. Do you have any tricks to improve the alignment?

    opened by joan126 6
  • AttributeError: module 'torch' has no attribute 'bucketize'

    AttributeError: module 'torch' has no attribute 'bucketize'

    I get the following error:

    root@75adae8f35d1:/app# python3 synthesize.py --step 300000
    |{DH AH0 N EY1 SH AH0 N Z T UH1 R IH2 Z AH0 M M IH1 N AH0 S T ER0 HH AE1 Z AO1 L S OW0 EH0 N K ER1 IH0 JH D AO2 S T R EY1 L Y AH0 N Z T UW1 T EY1 K DH EH1 R HH AA1 L AH0 D EY2 Z W IH0 DH IH1 N DH AH0 K AH1 N T R IY0 DH IH1 S Y IH1 R} |
    Traceback (most recent call last):
      File "synthesize.py", line 94, in <module>
        synthesize(model, text, sentence, prefix='step_{}'.format(args.step))
      File "synthesize.py", line 48, in synthesize
        mel, mel_postnet, duration_output, f0_output, energy_output = model(text, src_pos)
      File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
        return self.module(*inputs, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/app/fastspeech2.py", line 33, in forward
        encoder_output, d_target, p_target, e_target, max_length)
      File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/app/modules.py", line 47, in forward
        pitch_embedding = self.pitch_embedding(torch.bucketize(pitch_prediction, self.pitch_bins))
    AttributeError: module 'torch' has no attribute 'bucketize'
    

    I'm running on CPU, and I had to modify get_FastSpeech2 like this:

    def get_FastSpeech2(num):
        checkpoint_path = os.path.join(hp.checkpoint_path, "checkpoint_{}.pth.tar".format(num))
        model = nn.DataParallel(FastSpeech2())
        if torch.cuda.is_available():
            model.load_state_dict(torch.load(checkpoint_path)['model'])
        else:
            model.load_state_dict(torch.load(checkpoint_path, map_location=torch.device('cpu'))['model'])
        model.requires_grad = False
        model.eval()
        return model
    

    to set map_location to cpu device.

    opened by loretoparisi 6
  • the last words of my synthesized voice are often unclear

    the last words of my synthesized voice are often unclear

    Hello, the last words of my synthesized voice are often unclear. If the sentence is too long, the second half will collapse. How can I solve this problem?

    opened by tuntun990606 0
  • FastSpeech2 training error

    FastSpeech2 training error

    The preprocessing ran successfully, but when I run the training code, I get an error as follows:

    (torch) zhonghuihang@kdf-X12DAi-N6:~/fastSpeech2-master$ python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
    Prepare training ...
    Number of FastSpeech2 Parameters: 35159361
    Traceback (most recent call last):
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/tarfile.py", line 186, in nti
        s = nts(s, "ascii", "strict")
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/tarfile.py", line 170, in nts
        return s.decode(encoding, errors)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 2: ordinal not in range(128)

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/tarfile.py", line 2289, in next
        tarinfo = self.tarinfo.fromtarfile(self)
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/tarfile.py", line 1095, in fromtarfile
        obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/tarfile.py", line 1037, in frombuf
        chksum = nti(buf[148:156])
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/tarfile.py", line 189, in nti
        raise InvalidHeaderError("invalid header")
    tarfile.InvalidHeaderError: invalid header

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/torch/serialization.py", line 595, in _load
        return legacy_load(f)
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/torch/serialization.py", line 506, in legacy_load
        with closing(tarfile.open(fileobj=f, mode='r:', format=tarfile.PAX_FORMAT)) as tar,
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/tarfile.py", line 1593, in open
        return func(name, filemode, fileobj, **kwargs)
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/tarfile.py", line 1623, in taropen
        return cls(name, mode, fileobj, **kwargs)
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/tarfile.py", line 1486, in __init__
        self.firstmember = self.next()
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/tarfile.py", line 2301, in next
        raise ReadError(str(e))
    tarfile.ReadError: invalid header

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "train.py", line 198, in <module>
        main(args, configs)
      File "train.py", line 48, in main
        vocoder = get_vocoder(model_config, device)
      File "/home/zhonghuihang/fastSpeech2-master/utils/model.py", line 63, in get_vocoder
        ckpt = torch.load("/home/zhonghuihang/fastSpeech2-master/hifigan/generator_LJSpeech.pth.tar.zip")
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/torch/serialization.py", line 426, in load
        return _load(f, map_location, pickle_module, **pickle_load_args)
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/torch/serialization.py", line 599, in _load
        raise RuntimeError("{} is a zip archive (did you mean to use torch.jit.load()?)".format(f.name))
    RuntimeError: /home/zhonghuihang/fastSpeech2-master/hifigan/generator_LJSpeech.pth.tar.zip is a zip archive (did you mean to use torch.jit.load()?)

    (torch) zhonghuihang@kdf-X12DAi-N6:~/fastSpeech2-master$ python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
    Prepare training ...
    channel 3: open failed: connect failed: Connection refused
    channel 4: open failed: connect failed: Connection refused
    channel 3: open failed: connect failed: Connection refused
    channel 4: open failed: connect failed: Connection refused
    channel 3: open failed: connect failed: Connection refused
    channel 4: open failed: connect failed: Connection refused
    Number of FastSpeech2 Parameters: 35159361
    Removing weight norm...
    Training:   0%| | 0/900000 [00:00<?, ?it/s
    Traceback (most recent call last): | 0/197 [00:00<?, ?it/s]
      File "train.py", line 198, in <module>
        main(args, configs)
      File "train.py", line 82, in main
        output = model(*(batch[2:]))
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
        return self.module(*inputs[0], **kwargs[0])
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/zhonghuihang/fastSpeech2-master/model/fastspeech2.py", line 66, in forward
        output = self.encoder(texts, src_masks)
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/zhonghuihang/fastSpeech2-master/transformer/Models.py", line 95, in forward
        enc_output, mask=mask, slf_attn_mask=slf_attn_mask
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/zhonghuihang/fastSpeech2-master/transformer/Layers.py", line 23, in forward
        enc_input, enc_input, enc_input, mask=slf_attn_mask
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/zhonghuihang/fastSpeech2-master/transformer/SubLayers.py", line 39, in forward
        q = self.w_qs(q).view(sz_b, len_q, n_head, d_k)
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
        return F.linear(input, self.weight, self.bias)
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/torch/nn/functional.py", line 1372, in linear
        output = input.matmul(weight.t())
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
    Training:   0%| | 1/900000 [00:00<142:38:55, 1.75it/s]
    Exception ignored in: <function tqdm.__del__ at 0x7ff84f88e4d0>
    Traceback (most recent call last):
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1086, in __del__
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 1270, in close
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/std.py", line 572, in _decr_instances
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/site-packages/tqdm/_monitor.py", line 51, in exit
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/threading.py", line 522, in set
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/threading.py", line 365, in notify_all
      File "/home/zhonghuihang/miniconda3/envs/torch/lib/python3.7/threading.py", line 348, in notify
    TypeError: 'NoneType' object is not callable

    opened by lunar333 0
  • Facing dimension mismatch

    Facing dimension mismatch

    I am facing a dimension mismatch issue in the pitch embedding addition to the encoder output when trying to train the FastSpeech2 model on Hindi data. The screenshot of the issue is attached below.

    Note: I have made the necessary changes in the script to adapt it to the Hindi dataset. Error_screen

    The same code runs for the LJSpeech dataset but fails for the Hindi dataset. Kindly help me resolve the issue!

    opened by nayanjha16 1
  • How to do the MFA alignment by myself.

    How to do the MFA alignment by myself.

    It seems that MFA has moved to a new version and your instructions in the README are no longer valid. There is no file named ./montreal-forced-aligner/bin/mfa_align.

    I also tried to install MFA following the instructions at https://montreal-forced-aligner.readthedocs.io/en/latest/. When I run mfa align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech, I get the error: FileNotFoundError: [Errno 2] No such file or directory: '/root/Documents/MFA/LJSpeech/corpus_data/split1/feats.librispeech-lexicon.0.scp'

    What should I do next?

    opened by tuannvhust 1
Owner
Chung-Ming Chien
Graduate Student, NTU CSIE | Speech Processing Lab. | Speech synthesis & Natural language processing