STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech

Overview

Hits

STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech

Keon Lee, Kyumin Park, Daeyoung Kim

In our paper, we propose STYLER, a non-autoregressive TTS framework with style factor modeling that achieves rapidity, robustness, expressivity, and controllability at the same time.

Abstract: Previous works on neural text-to-speech (TTS) have been addressed on limited speed in training and inference time, robustness for difficult synthesis conditions, expressiveness, and controllability. Although several approaches resolve some limitations, there has been no attempt to solve all weaknesses at once. In this paper, we propose STYLER, an expressive and controllable TTS framework with high-speed and robust synthesis. Our novel audio-text aligning method called Mel Calibrator and excluding autoregressive decoding enable rapid training and inference and robust synthesis on unseen data. Also, disentangled style factor modeling under supervision enlarges the controllability in synthesizing process leading to expressive TTS. On top of it, a novel noise modeling pipeline using domain adversarial training and Residual Decoding empowers noise-robust style transfer, decomposing the noise without any additional label. Various experiments demonstrate that STYLER is more effective in speed and robustness than expressive TTS with autoregressive decoding and more expressive and controllable than reading style non-autoregressive TTS. Synthesis samples and experiment results are provided via our demo page, and code is available publicly.

Dependencies

Please install the python dependencies given in requirements.txt.

pip3 install -r requirements.txt

Training

Preparation

Clean Data

  1. Download VCTK dataset and resample audios to a 22050Hz sampling rate.
  2. We provide a bash script for the resampling. Refer to data/resample.sh for the detail.
  3. Put audio files and corresponding text (transcript) files in the same directory. Both audio and text files must have the same name, excluding the extension.
  4. You may need to trim the audio for stable model convergence. Refer to Yeongtae's preprocess_audio.py for helpful preprocessing, including the trimming.
  5. Modify the hp.data_dir in hparams.py.

Noisy Data

  1. Download WHAM! dataset and resample audios to a 22050Hz sampling rate.
  2. Modify the hp.noise_dir in hparams.py.

Vocoder

  1. Unzip hifigan/generator_universal.pth.tar.zip in the same directory.

Preprocess

First, download ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding as described in our paper and locate it in hp.speaker_embedder_dir.

Second, download the Montreal Forced Aligner(MFA) package and the pretrained (LibriSpeech) lexicon file through the following commands. MFA is used to obtain the alignments between the utterances and the phoneme sequences as FastSpeech2.

wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.1.0-beta.2/montreal-forced-aligner_linux.tar.gz
tar -zxvf montreal-forced-aligner_linux.tar.gz

wget http://www.openslr.org/resources/11/librispeech-lexicon.txt -O montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt

Then, process all the necessary features. You will get a stat.txt file in your hp.preprocessed_path/. You have to modify the f0 and energy parameters in the hparams.py according to the content of stat.txt.

python3 preprocess.py

Finally, get the noisy data separately from the clean data by mixing each utterance with a randomly selected piece of background noise from WHAM! dataset.

python3 preprocess_noisy.py

Train

Now you have all the prerequisites! Train the model using the following command:

python3 train.py

Inference

Prepare Texts

Create sentences.py in data/ which has a python list named sentences of texts to be synthesized. Note that sentences can contain more than one text.

# In 'data/sentences.py',
sentences = [
    "Nothing is lost, everything is recycled."
]

Prepare Reference Audios

Reference audio preparation has a similar process to training data preparation. There could be two kinds of references: clean and noisy.

First, put clean audios with corresponding texts in a single directory and modify the hp.ref_audio_dir in hparams.py and process all the necessary features. Refer to the Clean Data section of Train Preparation.

python3 preprocess_refs.py

Then, get the noisy references.

python3 preprocess_noisy.py --refs

Synthesize

The following command will synthesize all combinations of texts in data/sentences.py and audios in hp.ref_audio_dir.

python3 synthesize.py --ckpt CHECKPOINT_PATH

Or you can specify single reference audio in hp.ref_audio_dir as follows.

python3 synthesize.py --ckpt CHECKPOINT_PATH --ref_name AUDIO_FILENAME

Also, there are several useful options.

  1. --speaker_id will specify the speaker. The specified speaker's embedding should be in hp.preprocessed_path/spker_embed. The default value is None, and the speaker embedding is calculated at runtime on each input audio.

  2. --inspection will give you additional outputs that show the effects of each encoder of STYLER. The samples are the same as the Style Factor Modeling section on our demo page.

  3. --cont will generate the samples as the Style Factor Control section on our demo page.

    python3 synthesize.py --ckpt CHECKPOINT_PATH --cont --r1 AUDIO_FILENAME_1 --r2 AUDIO_FILENAME_1

    Note that --cont option is only working on preprocessed data. In detail, the audios' name should have the same format as VCTK dataset (e.g., p323_229), and the preprocessed data must be existing in hp.preprocessed_path.

TensorBoard

The TensorBoard loggers are stored in the log directory. Use

tensorboard --logdir log

to serve the TensorBoard on your localhost. Here are some logging views of the model training on VCTK for 560k steps.

Notes

  1. There were too many noise data where extraction was not possible through pyworld as in clean data. To resolve this, pysptk was applied to extract log f0 for the noisy data's fundamental frequency. The --noisy_input option will automate this process during synthesizing.

  2. If MFA-related problems occur during running preprocess.py, try to manually run MFA by the following command.

    # Replace $data_dir and $PREPROCESSED_PATH with ./VCTK-Corpus-92/wav48_silence_trimmed and ./preprocessed/VCTK/TextGrid, for example
    ./montreal-forced-aligner/bin/mfa_align $YOUR_data_dir montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt english $YOUR_PREPROCESSED_PATH -j 8
  3. DeepSpeaker on VCTK dataset shows clear identification among speakers. The following figure shows the T-SNE plot of extracted speaker embedding in our experiments.

  4. Currently, preprocess.py divides the dataset into two subsets: train and validation set. If you need other sets, such as a test set, the only thing to do is modifying the text files (train.txt or val.txt) in hp.preprocessed_path/.

Citation

If you would like to use or refer to this implementation, please cite our paper with the repo.

@article{lee2021styler,
  title={STYLER: Style Modeling with Rapidity and Robustness via SpeechDecomposition for Expressive and Controllable Neural Text to Speech},
  author={Lee, Keon and Park, Kyumin and Kim, Daeyoung},
  journal={arXiv preprint arXiv:2103.09474},
  year={2021}
}

References

Comments
  • some questions

    some questions

    1.in paper, why output of encoder (text_encoding) upsample and downsample? 2. what is the meaning of text_encoding_neck+pitch_encoding、text_encoding_neck+energy_encoding? why not cat?

    opened by Pydataman 3
  • Low resource languages that won't work with MFA?

    Low resource languages that won't work with MFA?

    Is there a way to fine tune a model or training two languages side by side such that a very low resource language can be trained with the voices of a high resource language?

    opened by michael-conrad 3
  • Undefined names

    Undefined names

    Hi, I noticed some undefined names around the code:

    synthesize.py:495:67: F821 undefined name 'reference' noise_mixer_refs.py:56:42: F821 undefined name 'eps' noise_mixer_refs.py:59:40: F821 undefined name 'eps'

    opened by L3str4nge 2
  • About the pre-process

    About the pre-process

    Hi, I want to ask the trimming operation whether is very important for training your model? Furthermore, can you share the scripts to trimming VCTK dataset?

    opened by yangdongchao 0
  • Bump tensorflow from 2.4.0 to 2.5.1

    Bump tensorflow from 2.4.0 to 2.5.1

    Bumps tensorflow from 2.4.0 to 2.5.1.

    Release notes

    Sourced from tensorflow's releases.

    TensorFlow 2.5.1

    Release 2.5.1

    This release introduces several vulnerability fixes:

    • Fixes a heap out of bounds access in sparse reduction operations (CVE-2021-37635)
    • Fixes a floating point exception in SparseDenseCwiseDiv (CVE-2021-37636)
    • Fixes a null pointer dereference in CompressElement (CVE-2021-37637)
    • Fixes a null pointer dereference in RaggedTensorToTensor (CVE-2021-37638)
    • Fixes a null pointer dereference and a heap OOB read arising from operations restoring tensors (CVE-2021-37639)
    • Fixes an integer division by 0 in sparse reshaping (CVE-2021-37640)
    • Fixes a division by 0 in ResourceScatterDiv (CVE-2021-37642)
    • Fixes a heap OOB in RaggedGather (CVE-2021-37641)
    • Fixes a std::abort raised from TensorListReserve (CVE-2021-37644)
    • Fixes a null pointer dereference in MatrixDiagPartOp (CVE-2021-37643)
    • Fixes an integer overflow due to conversion to unsigned (CVE-2021-37645)
    • Fixes a bad allocation error in StringNGrams caused by integer conversion (CVE-2021-37646)
    • Fixes a null pointer dereference in SparseTensorSliceDataset (CVE-2021-37647)
    • Fixes an incorrect validation of SaveV2 inputs (CVE-2021-37648)
    • Fixes a null pointer dereference in UncompressElement (CVE-2021-37649)
    • Fixes a segfault and a heap buffer overflow in {Experimental,}DatasetToTFRecord (CVE-2021-37650)
    • Fixes a heap buffer overflow in FractionalAvgPoolGrad (CVE-2021-37651)
    • Fixes a use after free in boosted trees creation (CVE-2021-37652)
    • Fixes a division by 0 in ResourceGather (CVE-2021-37653)
    • Fixes a heap OOB and a CHECK fail in ResourceGather (CVE-2021-37654)
    • Fixes a heap OOB in ResourceScatterUpdate (CVE-2021-37655)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToSparse (CVE-2021-37656)
    • Fixes an undefined behavior arising from reference binding to nullptr in MatrixDiagV* ops (CVE-2021-37657)
    • Fixes an undefined behavior arising from reference binding to nullptr in MatrixSetDiagV* ops (CVE-2021-37658)
    • Fixes an undefined behavior arising from reference binding to nullptr and heap OOB in binary cwise ops (CVE-2021-37659)
    • Fixes a division by 0 in inplace operations (CVE-2021-37660)
    • Fixes a crash caused by integer conversion to unsigned (CVE-2021-37661)
    • Fixes an undefined behavior arising from reference binding to nullptr in boosted trees (CVE-2021-37662)
    • Fixes a heap OOB in boosted trees (CVE-2021-37664)
    • Fixes vulnerabilities arising from incomplete validation in QuantizeV2 (CVE-2021-37663)
    • Fixes vulnerabilities arising from incomplete validation in MKL requantization (CVE-2021-37665)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToVariant (CVE-2021-37666)
    • Fixes an undefined behavior arising from reference binding to nullptr in unicode encoding (CVE-2021-37667)
    • Fixes an FPE in tf.raw_ops.UnravelIndex (CVE-2021-37668)
    • Fixes a crash in NMS ops caused by integer conversion to unsigned (CVE-2021-37669)
    • Fixes a heap OOB in UpperBound and LowerBound (CVE-2021-37670)
    • Fixes an undefined behavior arising from reference binding to nullptr in map operations (CVE-2021-37671)
    • Fixes a heap OOB in SdcaOptimizerV2 (CVE-2021-37672)
    • Fixes a CHECK-fail in MapStage (CVE-2021-37673)
    • Fixes a vulnerability arising from incomplete validation in MaxPoolGrad (CVE-2021-37674)
    • Fixes an undefined behavior arising from reference binding to nullptr in shape inference (CVE-2021-37676)
    • Fixes a division by 0 in most convolution operators (CVE-2021-37675)
    • Fixes vulnerabilities arising from missing validation in shape inference for Dequantize (CVE-2021-37677)
    • Fixes an arbitrary code execution due to YAML deserialization (CVE-2021-37678)
    • Fixes a heap OOB in nested tf.map_fn with RaggedTensors (CVE-2021-37679)

    ... (truncated)

    Changelog

    Sourced from tensorflow's changelog.

    Release 2.5.1

    This release introduces several vulnerability fixes:

    • Fixes a heap out of bounds access in sparse reduction operations (CVE-2021-37635)
    • Fixes a floating point exception in SparseDenseCwiseDiv (CVE-2021-37636)
    • Fixes a null pointer dereference in CompressElement (CVE-2021-37637)
    • Fixes a null pointer dereference in RaggedTensorToTensor (CVE-2021-37638)
    • Fixes a null pointer dereference and a heap OOB read arising from operations restoring tensors (CVE-2021-37639)
    • Fixes an integer division by 0 in sparse reshaping (CVE-2021-37640)
    • Fixes a division by 0 in ResourceScatterDiv (CVE-2021-37642)
    • Fixes a heap OOB in RaggedGather (CVE-2021-37641)
    • Fixes a std::abort raised from TensorListReserve (CVE-2021-37644)
    • Fixes a null pointer dereference in MatrixDiagPartOp (CVE-2021-37643)
    • Fixes an integer overflow due to conversion to unsigned (CVE-2021-37645)
    • Fixes a bad allocation error in StringNGrams caused by integer conversion (CVE-2021-37646)
    • Fixes a null pointer dereference in SparseTensorSliceDataset (CVE-2021-37647)
    • Fixes an incorrect validation of SaveV2 inputs (CVE-2021-37648)
    • Fixes a null pointer dereference in UncompressElement (CVE-2021-37649)
    • Fixes a segfault and a heap buffer overflow in {Experimental,}DatasetToTFRecord (CVE-2021-37650)
    • Fixes a heap buffer overflow in FractionalAvgPoolGrad (CVE-2021-37651)
    • Fixes a use after free in boosted trees creation (CVE-2021-37652)
    • Fixes a division by 0 in ResourceGather (CVE-2021-37653)
    • Fixes a heap OOB and a CHECK fail in ResourceGather (CVE-2021-37654)
    • Fixes a heap OOB in ResourceScatterUpdate (CVE-2021-37655)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToSparse

    ... (truncated)

    Commits
    • 8222c1c Merge pull request #51381 from tensorflow/mm-fix-r2.5-build
    • d584260 Disable broken/flaky test
    • f6c6ce3 Merge pull request #51367 from tensorflow-jenkins/version-numbers-2.5.1-17468
    • 3ca7812 Update version numbers to 2.5.1
    • 4fdf683 Merge pull request #51361 from tensorflow/mm-update-relnotes-on-r2.5
    • 05fc01a Put CVE numbers for fixes in parentheses
    • bee1dc4 Update release notes for the new patch release
    • 47beb4c Merge pull request #50597 from kruglov-dmitry/v2.5.0-sync-abseil-cmake-bazel
    • 6f39597 Merge pull request #49383 from ashahab/abin-load-segfault-r2.5
    • 0539b34 Merge pull request #48979 from liufengdb/r2.5-cherrypick
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
Releases(v1.0.0)
Owner
Keon Lee
Expressive Speech Synthesis | Disentangled Representation | Generative Models | NLP | HCI
Keon Lee
Code for Blind Image Decomposition (BID) and Blind Image Decomposition network (BIDeN).

arXiv, porject page, paper Blind Image Decomposition (BID) Blind Image Decomposition is a novel task. The task requires separating a superimposed imag

null 64 Dec 20, 2022
Transfer style api - An API to use with Tranfer Style App, where you can use two image and transfer the style

Transfer Style API It's an API to use with Tranfer Style App, where you can use

Brian Alejandro 1 Feb 13, 2022
Implementation of "GNNAutoScale: Scalable and Expressive Graph Neural Networks via Historical Embeddings" in PyTorch

PyGAS: Auto-Scaling GNNs in PyG PyGAS is the practical realization of our G NN A uto S cale (GAS) framework, which scales arbitrary message-passing GN

Matthias Fey 139 Dec 25, 2022
Code for "Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks", CVPR 2021

Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks This repository contains the code that accompanies our CVPR 20

Despoina Paschalidou 161 Dec 20, 2022
The source code of the ICCV2021 paper "PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering"

The source code of the ICCV2021 paper "PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering"

Ren Yurui 261 Jan 9, 2023
The source code of the ICCV2021 paper "PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering"

Website | ArXiv | Get Start | Video PIRenderer The source code of the ICCV2021 paper "PIRenderer: Controllable Portrait Image Generation via Semantic

Ren Yurui 81 Sep 25, 2021
PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

Daft-Exprt - PyTorch Implementation PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis The

Keon Lee 47 Dec 18, 2022
Autolfads-tf2 - A TensorFlow 2.0 implementation of Latent Factor Analysis via Dynamical Systems (LFADS) and AutoLFADS

autolfads-tf2 A TensorFlow 2.0 implementation of LFADS and AutoLFADS. Installati

Systems Neural Engineering Lab 11 Oct 29, 2022
Fast Neural Style for Image Style Transform by Pytorch

FastNeuralStyle by Pytorch Fast Neural Style for Image Style Transform by Pytorch This is famous Fast Neural Style of Paper Perceptual Losses for Real

Bengxy 81 Sep 3, 2022
Expressive Power of Invariant and Equivaraint Graph Neural Networks (ICLR 2021)

Expressive Power of Invariant and Equivaraint Graph Neural Networks In this repository, we show how to use powerful GNN (2-FGNN) to solve a graph alig

Marc Lelarge 36 Dec 12, 2022
Code, Data and Demo for Paper: Controllable Generation from Pre-trained Language Models via Inverse Prompting

InversePrompting Paper: Controllable Generation from Pre-trained Language Models via Inverse Prompting Code: The code is provided in the "chinese_ip"

THUDM 101 Dec 16, 2022
Weakly Supervised Text-to-SQL Parsing through Question Decomposition

Weakly Supervised Text-to-SQL Parsing through Question Decomposition The official repository for the paper "Weakly Supervised Text-to-SQL Parsing thro

null 14 Dec 19, 2022
NeRD: Neural Reflectance Decomposition from Image Collections

NeRD: Neural Reflectance Decomposition from Image Collections Project Page | Video | Paper | Dataset Implementation for NeRD. A novel method which dec

Computergraphics (University of Tübingen) 195 Dec 29, 2022
This is the code repository implementing the paper "TreePartNet: Neural Decomposition of Point Clouds for 3D Tree Reconstruction".

TreePartNet This is the code repository implementing the paper "TreePartNet: Neural Decomposition of Point Clouds for 3D Tree Reconstruction". Depende

刘彦超 34 Nov 30, 2022
Only a Matter of Style: Age Transformation Using a Style-Based Regression Model

Only a Matter of Style: Age Transformation Using a Style-Based Regression Model The task of age transformation illustrates the change of an individual

null 444 Dec 30, 2022
Expressive Body Capture: 3D Hands, Face, and Body from a Single Image

Expressive Body Capture: 3D Hands, Face, and Body from a Single Image [Project Page] [Paper] [Supp. Mat.] Table of Contents License Description Fittin

Vassilis Choutas 1.3k Jan 7, 2023
Towards Ultra-Resolution Neural Style Transfer via Thumbnail Instance Normalization

Towards Ultra-Resolution Neural Style Transfer via Thumbnail Instance Normalization Official PyTorch implementation for our URST (Ultra-Resolution Sty

czczup 148 Dec 27, 2022
Official PyTorch implementation of "ArtFlow: Unbiased Image Style Transfer via Reversible Neural Flows"

ArtFlow Official PyTorch implementation of the paper: ArtFlow: Unbiased Image Style Transfer via Reversible Neural Flows Jie An*, Siyu Huang*, Yibing

null 123 Dec 27, 2022
Pytorch re-implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition (CVPR 2022)

SwinTextSpotter This is the pytorch implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text R

mxin262 183 Jan 3, 2023