A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate TTS.

Overview

Comprehensive-Transformer-TTS - PyTorch Implementation

A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate TTS. Any suggestions toward the best Non-AR TTS are welcome :)

Transformers

Supervised Duration Modelings

Unsupervised Duration Modelings

  • One TTS Alignment To Rule Them All (Badlani et al., 2021): We are finally freed from external aligners such as MFA! Validation alignments for LJ014-0329 up to 70K are shown below as an example.

Transformer Performance Comparison on LJSpeech (1 TITAN RTX 24G / 16 batch size)

Model Memory Usage Training Time (1K steps)
Fastformer (lucidrains') 10531MiB / 24220MiB 4m 25s
Fastformer (wuch15's) 10515MiB / 24220MiB 4m 45s
Long-Short Transformer 10633MiB / 24220MiB 5m 26s
Conformer 18903MiB / 24220MiB 7m 4s
Reformer 10293MiB / 24220MiB 10m 16s
Transformer 7909MiB / 24220MiB 4m 51s

Toggle the type of building blocks by

# In the model.yaml
block_type: "transformer" # ["transformer", "fastformer", "lstransformer", "conformer", "reformer"]

Toggle the type of duration modelings by

# In the model.yaml
duration_modeling:
  learn_alignment: True # for unsupervised modeling, False for supervised modeling

Quickstart

DATASET refers to the names of datasets such as LJSpeech and VCTK in the following documents.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Also, Dockerfile is provided for Docker users.

Inference

You have to download the pretrained models and put them in output/ckpt/DATASET/. The models are trained with unsupervised duration modeling under transformer building block.

For a single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET

For a multi-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.

Batch Inference

Batch inference is also supported, try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt.

Controllability

The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20 % and decrease the volume by 20 % by

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8

Add --speaker_id SPEAKER_ID for a multi-speaker TTS.

Training

Datasets

The supported datasets are

  • LJSpeech: a single-speaker English dataset consists of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.

Any of both single-speaker TTS dataset (e.g., Blizzard Challenge 2013) and multi-speaker TTS dataset (e.g., LibriTTS) can be added following LJSpeech and VCTK, respectively. Moreover, your own language and dataset can be adapted following here.

Preprocessing

  • For a multi-speaker TTS with external speaker embedder, download ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and locate it in ./deepspeaker/pretrained_models/.

  • Run

    python3 prepare_align.py --dataset DATASET
    

    for some preparations.

    For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files in preprocessed_data/DATASET/TextGrid/. Alternately, you can run the aligner by yourself.

    After that, run the preprocessing script by

    python3 preprocess.py --dataset DATASET
    

Training

Train your model with

python3 train.py --dataset DATASET

Useful options:

  • To use a Automatic Mixed Precision, append --use_amp argument to the above command.
  • The trainer assumes single-node multi-GPU training. To use specific GPUs, specify CUDA_VISIBLE_DEVICES= at the beginning of the above command.

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audios are shown.

Notes

  • Both phoneme-level and frame-level variance are supported in both supervised and unsupervised duration modeling.
  • Note that there are no pre-extracted phoneme-level variance features in unsupervised duration modeling.
  • Convolutional embedding is used as StyleSpeech for phoneme-level variance in unsupervised duration modeling. Otherwise, bucket-based embedding is used as FastSpeech2.
  • Unsupervised duration modeling in phoneme-level will take longer time than frame-level since the additional computation of phoneme-level variance is activated at runtime.
  • Two options for embedding for the multi-speaker TTS setting: training speaker embedder from scratch or using a pre-trained philipperemy's DeepSpeaker model (as STYLER did). You can toggle it by setting the config (between 'none' and 'DeepSpeaker').
  • DeepSpeaker on VCTK dataset shows clear identification among speakers. The following figure shows the T-SNE plot of extracted speaker embedding.

  • For vocoder, HiFi-GAN and MelGAN are supported.

Citation

Please cite this repository by the "Cite this repository" of About section (top right of the main page).

References

Comments
  • weird sounding voices with MelGAN

    weird sounding voices with MelGAN

    Hello,

    Audio samples generated with multi-speaker MelGEN (haven't tried single-speaker) sound unnatural.

    I know worse quality is expected, but all samples sound like having significantly too high pitch.

    Maybe there is a bug in implementation ported from FastSpeech?

    opened by gen35 4
  • Weird sound in LONG sentence

    Weird sound in LONG sentence

    Hi, it's a really nice work! In my experiment, the speech goes weird after 10s (short sentences are all good). Losses decrese normally, and I checked the predicted dur/pitch/energy, they were all good as well. Only the mel goes weird. Have you ever encontered this kind of problem in long sentences?

    opened by blx0102 2
  • about duration predictor

    about duration predictor

    In "learn_alignment: True" mode, the input of duration predictor is "x.detach() + self.predictor_grad * (x - x.detach())".

    1. Why do we need detach()?
    2. Why do we need add self.predictor_grad * (x - x.detach()) since it always is zero?
    opened by wizardk 2
  • Reason for std and input scaling in cwt?

    Reason for std and input scaling in cwt?

    Hey, I have some questions about your pitch predictor in cwt domain:

    decoder_inp = decoder_inp.detach() + self.predictor_grad * (decoder_inp - decoder_inp.detach())
    pitch_padding = mel2ph == 0
    
    
    if self.pitch_type == "cwt":
        pitch_padding = None
        cwt = cwt_out = self.cwt_predictor(decoder_inp) * control
        stats_out = self.cwt_stats_layers(encoder_out[:, 0, :])  # [B, 2]
        mean = f0_mean = stats_out[:, 0]
        std = f0_std = stats_out[:, 1]
        cwt_spec = cwt_out[:, :, :10]
        if f0 is None:
            std = std * self.cwt_std_scale
            f0 = cwt2f0_norm(
    

    I have three questions:

    1. What is the reason for the first line? Isn't the right side always zero and therefore no gradients flow back?
    2. Why do you scale inputs by 0.1?
    3. Why did you scale ground truth std by 0.8?

    Thanks for any help in advance!

    opened by dunky11 2
  • How about the inference speed of those different block types?

    How about the inference speed of those different block types?

    This is really a great project. There are many types of block in this project, you mentioned the differences of memory cost and the trining speed, could you further introduce the inference speed of every block type? Thank you!

    opened by blx0102 2
  • Voice embeddings: Can VCTK embeddings be used after training on non VCTK data?

    Voice embeddings: Can VCTK embeddings be used after training on non VCTK data?

    Voice embeddings: Can VCTK embeddings be used after training on non VCTK data?

    I'm looking to see if I can train Cherokee, but then use VCTK voices go speak the Cherokee as donor voices.

    Would this be possible?

    opened by michael-conrad 2
  • RuntimeError: The size of tensor a (1191) must match the size of tensor b (1000) at non-singleton dimension 1

    RuntimeError: The size of tensor a (1191) must match the size of tensor b (1000) at non-singleton dimension 1

    Hi, Thanks for the great work. I met an error when training on the ryanspeech dataset:

    Traceback (most recent call last):
      File "train.py", line 254, in <module>
        train(0, args, configs, batch_size, num_gpus)
      File "train.py", line 110, in train
        losses = Loss(batch, output, step=step)
      File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/root/Comprehensive-Transformer-TTS/model/loss.py", line 334, in forward
        pitch_loss = self.get_pitch_loss(pitch_predictions, pitch_targets)
      File "/root/Comprehensive-Transformer-TTS/model/loss.py", line 197, in get_pitch_loss
        losses["uv"] = (F.binary_cross_entropy_with_logits(uv_pred, uv, reduction="none") * nonpadding) \
    RuntimeError: The size of tensor a (1191) must match the size of tensor b (1000) at non-singleton dimension 1
    

    I printed the shape of both uv_pred and uv, and they were both [16, 1191].

    My configuration is

     ---> Automatic Mixed Precision: True
     ---> Number of used GPU: 1
     ---> Batch size per GPU: 16
     ---> Batch size in total: 16
     ---> Type of Building Block: conformer
     ---> Type of Duration Modeling: supervised
     ---> Type of Prosody Modeling: liu2021
    

    This happened at around 50k+ steps. What am I missing? Thank you!

    opened by godspirit00 0
  • Preprocess error

    Preprocess error

    █████████| 137/137 [18:49:05<00:00, 494.49s/it]
    Computing statistic quantities ...
    Traceback (most recent call last):
    File "preprocess.py", line 19, in
    preprocessor.build_from_path()
    File "/GPUFS/sysu_hpcedu_123/Comprehensive-Transformer-TTS/preprocessor/preprocessor.py", line 267, in build_from_path f0s_sup_stats = compute_f0_stats(f0s_sup)
    File "/GPUFS/sysu_hpcedu_123/Comprehensive-Transformer-TTS/preprocessor/preprocessor.py", line 145, in compute_f0_stats return (f0_mean, f0_std) UnboundLocalError: local variable 'f0_mean' referenced before assignment

    It crashed after a long time I run prerprocess For dataset,I use a dataset which simliar as VCTK but chinese,and It did not throw any error before this step. Could anyone can help me?

    opened by Stardust-minus 0
  • Unvoiced loss is too high for me.

    Unvoiced loss is too high for me.

    Hello, I'm trying to train TTS model with frame level pitch prediction(not cwt) for my spanish dataset.

    First, I made a small modification for training. before var_start_steps, I just detach encoder input from variational predictor instead of setting init_losses like below.

    if step < self.config.var_start_steps:
        x_pitch = x.detach()       
    else:
        x_pitch = x
    
    pitch_prediction, pitch_embed = self.get_pitch_embedding(x_pitch, pitch_target, uv_target, pitch_control)
    

    When I do this and observe losses, synthesized mel spectrogram by ground truth pitch, uv, dur was perfect, and pitch, dur loss descends satisfactorily even before var_start_steps.

    But only Unvoiced loss starts from 0.9 and descends too slowly(and not descends before var_start_steps)in my case. (for 50k ~ 100k steps) And the mel synthesized in eval state have no pitch(uv is almost 1) .

    Does anyone had the same problem like me? Any help would be appreciated.

    Preprocessed data in my dataset is like below. I think grount truth unvoiced segment have no problem. image

    Thank you.

    opened by LEECHOONGHO 0
  • Bump pillow from 8.3.1 to 8.3.2

    Bump pillow from 8.3.1 to 8.3.2

    Bumps pillow from 8.3.1 to 8.3.2.

    Release notes

    Sourced from pillow's releases.

    8.3.2

    https://pillow.readthedocs.io/en/stable/releasenotes/8.3.2.html

    Security

    • CVE-2021-23437 Raise ValueError if color specifier is too long [hugovk, radarhere]

    • Fix 6-byte OOB read in FliDecode [wiredfool]

    Python 3.10 wheels

    • Add support for Python 3.10 #5569, #5570 [hugovk, radarhere]

    Fixed regressions

    • Ensure TIFF RowsPerStrip is multiple of 8 for JPEG compression #5588 [kmilos, radarhere]

    • Updates for ImagePalette channel order #5599 [radarhere]

    • Hide FriBiDi shim symbols to avoid conflict with real FriBiDi library #5651 [nulano]

    Changelog

    Sourced from pillow's changelog.

    8.3.2 (2021-09-02)

    • CVE-2021-23437 Raise ValueError if color specifier is too long [hugovk, radarhere]

    • Fix 6-byte OOB read in FliDecode [wiredfool]

    • Add support for Python 3.10 #5569, #5570 [hugovk, radarhere]

    • Ensure TIFF RowsPerStrip is multiple of 8 for JPEG compression #5588 [kmilos, radarhere]

    • Updates for ImagePalette channel order #5599 [radarhere]

    • Hide FriBiDi shim symbols to avoid conflict with real FriBiDi library #5651 [nulano]

    Commits
    • 8013f13 8.3.2 version bump
    • 23c7ca8 Update CHANGES.rst
    • 8450366 Update release notes
    • a0afe89 Update test case
    • 9e08eb8 Raise ValueError if color specifier is too long
    • bd5cf7d FLI tests for Oss-fuzz crash.
    • 94a0cf1 Fix 6-byte OOB read in FliDecode
    • cece64f Add 8.3.2 (2021-09-02) [CI skip]
    • e422386 Add release notes for Pillow 8.3.2
    • 08dcbb8 Pillow 8.3.2 supports Python 3.10 [ci skip]
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump tensorflow from 2.5.0 to 2.5.1

    Bump tensorflow from 2.5.0 to 2.5.1

    Bumps tensorflow from 2.5.0 to 2.5.1.

    Release notes

    Sourced from tensorflow's releases.

    TensorFlow 2.5.1

    Release 2.5.1

    This release introduces several vulnerability fixes:

    • Fixes a heap out of bounds access in sparse reduction operations (CVE-2021-37635)
    • Fixes a floating point exception in SparseDenseCwiseDiv (CVE-2021-37636)
    • Fixes a null pointer dereference in CompressElement (CVE-2021-37637)
    • Fixes a null pointer dereference in RaggedTensorToTensor (CVE-2021-37638)
    • Fixes a null pointer dereference and a heap OOB read arising from operations restoring tensors (CVE-2021-37639)
    • Fixes an integer division by 0 in sparse reshaping (CVE-2021-37640)
    • Fixes a division by 0 in ResourceScatterDiv (CVE-2021-37642)
    • Fixes a heap OOB in RaggedGather (CVE-2021-37641)
    • Fixes a std::abort raised from TensorListReserve (CVE-2021-37644)
    • Fixes a null pointer dereference in MatrixDiagPartOp (CVE-2021-37643)
    • Fixes an integer overflow due to conversion to unsigned (CVE-2021-37645)
    • Fixes a bad allocation error in StringNGrams caused by integer conversion (CVE-2021-37646)
    • Fixes a null pointer dereference in SparseTensorSliceDataset (CVE-2021-37647)
    • Fixes an incorrect validation of SaveV2 inputs (CVE-2021-37648)
    • Fixes a null pointer dereference in UncompressElement (CVE-2021-37649)
    • Fixes a segfault and a heap buffer overflow in {Experimental,}DatasetToTFRecord (CVE-2021-37650)
    • Fixes a heap buffer overflow in FractionalAvgPoolGrad (CVE-2021-37651)
    • Fixes a use after free in boosted trees creation (CVE-2021-37652)
    • Fixes a division by 0 in ResourceGather (CVE-2021-37653)
    • Fixes a heap OOB and a CHECK fail in ResourceGather (CVE-2021-37654)
    • Fixes a heap OOB in ResourceScatterUpdate (CVE-2021-37655)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToSparse (CVE-2021-37656)
    • Fixes an undefined behavior arising from reference binding to nullptr in MatrixDiagV* ops (CVE-2021-37657)
    • Fixes an undefined behavior arising from reference binding to nullptr in MatrixSetDiagV* ops (CVE-2021-37658)
    • Fixes an undefined behavior arising from reference binding to nullptr and heap OOB in binary cwise ops (CVE-2021-37659)
    • Fixes a division by 0 in inplace operations (CVE-2021-37660)
    • Fixes a crash caused by integer conversion to unsigned (CVE-2021-37661)
    • Fixes an undefined behavior arising from reference binding to nullptr in boosted trees (CVE-2021-37662)
    • Fixes a heap OOB in boosted trees (CVE-2021-37664)
    • Fixes vulnerabilities arising from incomplete validation in QuantizeV2 (CVE-2021-37663)
    • Fixes vulnerabilities arising from incomplete validation in MKL requantization (CVE-2021-37665)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToVariant (CVE-2021-37666)
    • Fixes an undefined behavior arising from reference binding to nullptr in unicode encoding (CVE-2021-37667)
    • Fixes an FPE in tf.raw_ops.UnravelIndex (CVE-2021-37668)
    • Fixes a crash in NMS ops caused by integer conversion to unsigned (CVE-2021-37669)
    • Fixes a heap OOB in UpperBound and LowerBound (CVE-2021-37670)
    • Fixes an undefined behavior arising from reference binding to nullptr in map operations (CVE-2021-37671)
    • Fixes a heap OOB in SdcaOptimizerV2 (CVE-2021-37672)
    • Fixes a CHECK-fail in MapStage (CVE-2021-37673)
    • Fixes a vulnerability arising from incomplete validation in MaxPoolGrad (CVE-2021-37674)
    • Fixes an undefined behavior arising from reference binding to nullptr in shape inference (CVE-2021-37676)
    • Fixes a division by 0 in most convolution operators (CVE-2021-37675)
    • Fixes vulnerabilities arising from missing validation in shape inference for Dequantize (CVE-2021-37677)
    • Fixes an arbitrary code execution due to YAML deserialization (CVE-2021-37678)
    • Fixes a heap OOB in nested tf.map_fn with RaggedTensors (CVE-2021-37679)

    ... (truncated)

    Changelog

    Sourced from tensorflow's changelog.

    Release 2.5.1

    This release introduces several vulnerability fixes:

    • Fixes a heap out of bounds access in sparse reduction operations (CVE-2021-37635)
    • Fixes a floating point exception in SparseDenseCwiseDiv (CVE-2021-37636)
    • Fixes a null pointer dereference in CompressElement (CVE-2021-37637)
    • Fixes a null pointer dereference in RaggedTensorToTensor (CVE-2021-37638)
    • Fixes a null pointer dereference and a heap OOB read arising from operations restoring tensors (CVE-2021-37639)
    • Fixes an integer division by 0 in sparse reshaping (CVE-2021-37640)
    • Fixes a division by 0 in ResourceScatterDiv (CVE-2021-37642)
    • Fixes a heap OOB in RaggedGather (CVE-2021-37641)
    • Fixes a std::abort raised from TensorListReserve (CVE-2021-37644)
    • Fixes a null pointer dereference in MatrixDiagPartOp (CVE-2021-37643)
    • Fixes an integer overflow due to conversion to unsigned (CVE-2021-37645)
    • Fixes a bad allocation error in StringNGrams caused by integer conversion (CVE-2021-37646)
    • Fixes a null pointer dereference in SparseTensorSliceDataset (CVE-2021-37647)
    • Fixes an incorrect validation of SaveV2 inputs (CVE-2021-37648)
    • Fixes a null pointer dereference in UncompressElement (CVE-2021-37649)
    • Fixes a segfault and a heap buffer overflow in {Experimental,}DatasetToTFRecord (CVE-2021-37650)
    • Fixes a heap buffer overflow in FractionalAvgPoolGrad (CVE-2021-37651)
    • Fixes a use after free in boosted trees creation (CVE-2021-37652)
    • Fixes a division by 0 in ResourceGather (CVE-2021-37653)
    • Fixes a heap OOB and a CHECK fail in ResourceGather (CVE-2021-37654)
    • Fixes a heap OOB in ResourceScatterUpdate (CVE-2021-37655)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToSparse

    ... (truncated)

    Commits
    • 8222c1c Merge pull request #51381 from tensorflow/mm-fix-r2.5-build
    • d584260 Disable broken/flaky test
    • f6c6ce3 Merge pull request #51367 from tensorflow-jenkins/version-numbers-2.5.1-17468
    • 3ca7812 Update version numbers to 2.5.1
    • 4fdf683 Merge pull request #51361 from tensorflow/mm-update-relnotes-on-r2.5
    • 05fc01a Put CVE numbers for fixes in parentheses
    • bee1dc4 Update release notes for the new patch release
    • 47beb4c Merge pull request #50597 from kruglov-dmitry/v2.5.0-sync-abseil-cmake-bazel
    • 6f39597 Merge pull request #49383 from ashahab/abin-load-segfault-r2.5
    • 0539b34 Merge pull request #48979 from liufengdb/r2.5-cherrypick
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Prosody Loss

    Prosody Loss

    Hi, I am adding your MDN prosody modeling code segment to my tacotron but I encountered several problems about the code segment about prosody modeling. First, the prosody loss is added into the total loss only after the prosody_loss_enable_steps but in the training steps before the prosody_loss_enable_steps the prosody representation is already added with the text encoding. Does it means in the training steps before the prosody_loss_enable_steps, the prosody representation is optimized without the prosody loss? Second, in the training steps, the backward gradient of training prosody predictor should be acted like "stop gradient" but it seems little relevant code. Thanks!

    eeObXqHdtF

    opened by inconnu11 7
  • An errors with running the preprocess.py

    An errors with running the preprocess.py

    I'm trying to preprocess the VCTK dataset, and stuck on the 'Computing statistic quantities' step. When I copy from repo preprocessed_data files instead, the training run successful.

    Firstly, there is a runtime error:

    preprocessor.py

    625: cont_lf0_lpf_norm = (cont_lf0_lpf - logf0s_mean_org) / logf0s_std_org
    RuntimeWarning: invalid value encountered in true_divide
    

    After applying a simple crutch to fix a value of logf0s_std_org, next error appear:

    165: energy_mean = energy_scaler.mean_[0]
    'StandardScaler' object has no attribute 'mean_'
    

    win 10 conda python 3.6.15 all packages from the requirements.txt is installed

    opened by dasstyx 2
  • Multi-GPU training could not work normally?

    Multi-GPU training could not work normally?

    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates th at your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; As suggested, I modified the model by adding find_unused_parameters=True, as followings, model = DistributedDataParallel(model, device_ids=[rank], find_unused_parameters=True).to(device), but I still got the same errors, could you train normally when with multi GPU? Any suggestions to fix this? Many Thanks.

    opened by GuangChen2016 1
  • Problem with Utterance-level Prosody extractor of DelightfulTTS

    Problem with Utterance-level Prosody extractor of DelightfulTTS

    I've recently been experimenting with your implementation of DelightfulTTS and the voice quality is awesome. However I found out that the embedding vector output of Utterance-level Prosody extractor is very small, making the that of Utterance-level Prosody predictor small as well (L2 is roughly 12 and each element in the vector is roughly 0.2 to 0.3). Vectors with element close to zero means this layer mostly doesn't add any information at all. Have you find any solution to this?

    opened by vietvq-vbee 3
  • New TTS Model request

    New TTS Model request

    opened by rishikksh20 19
Releases(v0.2.1)
  • v0.2.1(Mar 6, 2022)

    Fix and update codebase & pre-trained models with demo samples

    1. Fix variance adaptor to make it work with all combinations of building block and variance type/level
    2. Update pre-trained models with demo samples of LJSpeech and VCTK under "transformer_fs2" building block and "cwt" pitch conditioning
    3. Share the result of ablation studies of comparing "transformer" vs. "transformer_fs2" paired among three types of pitch conditioning ("frame", "ph", and "cwt")
    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Feb 18, 2022)

    A lot of improvements with new features!

    1. Prepare two different types of data pipeline in preprocessor to maximize unsupervised/supervised duration modelings

    2. Adopt wavelet for pitch modeling & loss

    3. Add fine-trained duration loss

    4. Apply var_start_steps for better model convergence, especially under unsupervised duration modeling

    5. Remove dependency of energy modeling on pitch variance

    6. Add "transformer_fs2" building block, which is more close to the original FastSpeech2 paper

    7. Add two types of prosody modeling methods

    8. Loss camparison on validation set:

      • LJSpeech - blue: v0.1.1 / green: v0.2.0

      • VCTK - skyblue: v0.1.1 / orange: v0.2.0

    Source code(tar.gz)
    Source code(zip)
Owner
Keon Lee
Expressive Speech Synthesis | Conversational AI | Open-domain Dialog | NLP | Generative Models | Empathic Computing | HCI
Keon Lee
Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra. What is Lightning Tran

Pytorch Lightning 581 Dec 21, 2022
PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

VAENAR-TTS - PyTorch Implementation PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

Keon Lee 67 Nov 14, 2022
Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS)

Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS) Yoonhyung Lee, Joongbo Shin, Kyomin Jung Abstract: Although early

LEE YOON HYUNG 147 Dec 5, 2022
PyTorch Implementation of "Non-Autoregressive Neural Machine Translation"

Non-Autoregressive Transformer Code release for Non-Autoregressive Neural Machine Translation by Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K.

Salesforce 261 Nov 12, 2022
Learning to Rewrite for Non-Autoregressive Neural Machine Translation

RewriteNAT This repo provides the code for reproducing our proposed RewriteNAT in EMNLP 2021 paper entitled "Learning to Rewrite for Non-Autoregressiv

Xinwei Geng 20 Dec 25, 2022
PyTorch implementation of NATSpeech: A Non-Autoregressive Text-to-Speech Framework

A Non-Autoregressive Text-to-Speech (NAR-TTS) framework, including official PyTorch implementation of PortaSpeech (NeurIPS 2021) and DiffSpeech (AAAI 2022)

null 760 Jan 3, 2023
:mag: End-to-End Framework for building natural language search interfaces to data by utilizing Transformers and the State-of-the-Art of NLP. Supporting DPR, Elasticsearch, HuggingFace’s Modelhub and much more!

Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want

deepset 1.4k Feb 18, 2021
vits chinese, tts chinese, tts mandarin

vits chinese, tts chinese, tts mandarin 史上训练最简单,音质最好的语音合成系统

AmorTX 12 Dec 14, 2022
Ukrainian TTS (text-to-speech) using Coqui TTS

title emoji colorFrom colorTo sdk app_file pinned Ukrainian TTS ?? green green gradio app.py false Ukrainian TTS ?? ?? Ukrainian TTS (text-to-speech)

Yurii Paniv 85 Dec 26, 2022
:mag: Transformers at scale for question answering & neural search. Using NLP via a modular Retriever-Reader-Pipeline. Supporting DPR, Elasticsearch, HuggingFace's Modelhub...

Haystack is an end-to-end framework for Question Answering & Neural search that enables you to ... ... ask questions in natural language and find gran

deepset 6.4k Jan 9, 2023
A CRM department in a local bank works on classify their lost customers with their past datas. So they want predict with these method that average loss balance and passive duration for future.

Rule-Based-Classification-in-a-Banking-Case. A CRM department in a local bank works on classify their lost customers with their past datas. So they wa

ÖMER YILDIZ 4 Mar 20, 2022
Machine learning models from Singapore's NLP research community

SG-NLP Machine learning models from Singapore's natural language processing (NLP) research community. sgnlp is a Python package that allows you to eas

AI Singapore | AI Makerspace 21 Dec 17, 2022
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

MMF is a modular framework for vision and language multimodal research from Facebook AI Research. MMF contains reference implementations of state-of-t

Facebook Research 5.1k Dec 26, 2022
A toolkit for document-level event extraction, containing some SOTA model implementations

Document-level Event Extraction via Heterogeneous Graph-based Interaction Model with a Tracker Source code for ACL-IJCNLP 2021 Long paper: Document-le

null 84 Dec 15, 2022
A Word Level Transformer layer based on PyTorch and 🤗 Transformers.

Transformer Embedder A Word Level Transformer layer based on PyTorch and ?? Transformers. How to use Install the library from PyPI: pip install transf

Riccardo Orlando 27 Nov 20, 2022
One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

One Stop Anomaly Shop (OSAS) Quick start guide Step 1: Get/build the docker image Option 1: Use precompiled image (might not reflect latest changes):

Adobe, Inc. 148 Dec 26, 2022
Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing

Token Shift GPT Implementation of Token Shift GPT - An autoregressive model that relies solely on shifting along the sequence dimension and feedforwar

Phil Wang 32 Oct 14, 2022
Multispeaker & Emotional TTS based on Tacotron 2 and Waveglow

This Repository contains a sample code for Tacotron 2, WaveGlow with multi-speaker, emotion embeddings together with a script for data preprocessing.

Ivan Didur 106 Jan 1, 2023
Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx

Anchored CorEx: Hierarchical Topic Modeling with Minimal Domain Knowledge Correlation Explanation (CorEx) is a topic model that yields rich topics tha

Greg Ver Steeg 592 Dec 18, 2022