Meta-TTS: Meta-Learning for Few-shot Speaker Adaptive Text-to-Speech
This repository is the official implementation of "Meta-TTS: Meta-Learning for Few-shot Speaker Adaptive Text-to-Speech".
Figure: comparison of multi-task learning and meta learning.
Meta-TTS
Requirements
This is how I built my environment; yours does not need to be exactly the same:
- Sign up for Comet.ml, find your workspace and API key via www.comet.ml/api/my/settings, and fill them in config/comet.py (a sketch of this file appears after this list). The Comet logger is used throughout the train/val/test stages.
- Check my training logs here.
- [Optional] Install pyenv for Python version control, and switch to Python 3.8.6.
# After downloading and installing pyenv:
pyenv install 3.8.6
pyenv local 3.8.6
- [Optional] Install pyenv-virtualenv as a pyenv plugin for a clean virtual environment.
# After installing pyenv-virtualenv:
pyenv virtualenv meta-tts
pyenv activate meta-tts
- Install learn2learn from source.
# Install Cython first:
pip install cython
# Then install learn2learn from source:
git clone https://github.com/learnables/learn2learn.git
cd learn2learn
pip install -e .
- Install requirements:
pip install -r requirements.txt
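As mentioned in the first step above, config/comet.py holds the Comet credentials. A minimal sketch of what it might contain (the variable names here are assumptions; match whatever the file already defines):
# config/comet.py -- fill in the values from www.comet.ml/api/my/settings
api_key = "<your-api-key>"
workspace = "<your-workspace>"
project_name = "meta-tts"  # checkpoints are grouped under this project name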
Preprocessing
First, download LibriTTS and VCTK, then set the dataset paths in config/LibriTTS/preprocess.yaml and config/VCTK/preprocess.yaml to point at your downloads (a sketch of this edit follows the commands below). Then run the preparation script for each corpus:
python3 prepare_align.py config/LibriTTS/preprocess.yaml
python3 prepare_align.py config/VCTK/preprocess.yaml
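The path edit mentioned above just points each config at the corpus location. A minimal sketch, assuming the configs follow the usual FastSpeech2-style layout (the key names are an assumption; match whatever the files already define):
# In config/LibriTTS/preprocess.yaml:
#   path:
#     corpus_path: "/path/to/LibriTTS"
# and likewise in config/VCTK/preprocess.yaml:
#   path:
#     corpus_path: "/path/to/VCTK"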
Alignments of LibriTTS are provided here, and the alignments of VCTK are provided here. Unzip the files into preprocessed_data/LibriTTS/TextGrid/ and preprocessed_data/VCTK/TextGrid/ respectively.
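For example, assuming the downloaded archives are named as below (placeholders; substitute the actual file names):
mkdir -p preprocessed_data/LibriTTS/TextGrid preprocessed_data/VCTK/TextGrid
unzip LibriTTS_TextGrid.zip -d preprocessed_data/LibriTTS/TextGrid/
unzip VCTK_TextGrid.zip -d preprocessed_data/VCTK/TextGrid/
# (If an archive already contains a top-level TextGrid/ folder, extract it one level up instead.)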
Then run the preprocessing script:
python3 preprocess.py config/LibriTTS/preprocess.yaml
# Copy stats from LibriTTS to VCTK so pitch/energy normalization uses the same shift and bias.
cp preprocessed_data/LibriTTS/stats.json preprocessed_data/VCTK/
python3 preprocess.py config/VCTK/preprocess.yaml
Training
To train the models in the paper, run this command:
python3 main.py -s train \
-p config/preprocess/<corpus>.yaml \
-m config/model/base.yaml \
-t config/train/base.yaml config/train/<corpus>.yaml \
-a config/algorithm/<algorithm>.yaml
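For example, to train a meta model on LibriTTS (the algorithm file name below is an assumption; use one of the YAML files actually present under config/algorithm/):
python3 main.py -s train \
    -p config/preprocess/LibriTTS.yaml \
    -m config/model/base.yaml \
    -t config/train/base.yaml config/train/LibriTTS.yaml \
    -a config/algorithm/meta.yaml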
To reproduce our results, please use 8 V100 GPUs for the meta models and 1 V100 GPU for the baseline models; otherwise you might need to tune the gradient accumulation step (grad_acc_step) setting in config/train/base.yaml to get the correct meta batch size. Note that each GPU has its own random seed, so even if the meta batch size is the same, a different number of GPUs is equivalent to a different random seed.
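A rough sketch of the relation (assuming the meta batch is split evenly across GPUs and accumulated over grad_acc_step optimizer steps):
# effective meta batch size = (number of GPUs) x (tasks per GPU per step) x grad_acc_step
# e.g., reproducing the 8-GPU meta setting on 1 GPU means multiplying
# grad_acc_step in config/train/base.yaml by 8.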
After training, you can find your checkpoints under output/ckpt/, where the project name is set in config/comet.py.
To run inference with the trained models, run:
python3 main.py -s test \
-p config/preprocess/<corpus>.yaml \
-m config/model/base.yaml \
-t config/train/base.yaml config/train/<corpus>.yaml \
-a config/algorithm/<algorithm>.yaml \
-e <experiment_key> -c <checkpoint_file_name>
and the results will be under output/result/.
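For example, a full inference call might look like this (the experiment key and checkpoint file name are placeholders; take them from your own training run, and match the algorithm config to the one used for training):
python3 main.py -s test \
    -p config/preprocess/LibriTTS.yaml \
    -m config/model/base.yaml \
    -t config/train/base.yaml config/train/LibriTTS.yaml \
    -a config/algorithm/meta.yaml \
    -e 0123456789abcdef -c epoch=99.ckpt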
Evaluation
Note: the evaluation code is not fully refactored yet. cd into evaluation/ and check its README.md.
Pre-trained Models
Note: the checkpoints were saved with an older version of the code and might not be compatible with the current code. We will fix this in the future.
Since our code uses the Comet logger, you might need to create a dummy experiment by running:
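# Create a dummy Comet experiment (comet_ml needs a valid API key for this).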
from comet_ml import Experiment
experiment = Experiment()
then put the checkpoint files under output/ckpt/LibriTTS/.
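For example (the checkpoint file name is a placeholder for whatever you downloaded):
mkdir -p output/ckpt/LibriTTS
cp <downloaded_checkpoint>.ckpt output/ckpt/LibriTTS/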
You can download pretrained models here.
Results
Result figures for each corpus (LibriTTS and VCTK): Speaker Similarity, Speaker Verification, and Synthesized Speech Detection.