Official implementation of Meta-StyleSpeech and StyleSpeech

min95

Last update: Jan 5, 2023

Related tags

Text Data & NLP text-to-speech tts official meta-learning neural-tts stylespeech meta-stylespeech

Overview

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Dongchan Min, Dong Bok Lee, Eunho Yang, and Sung Ju Hwang

This is the official code for our recent paper, Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation. We provide our implementation and pretrained models as open source in this repository.

Abstract : With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech with only a few audio samples from the given speaker, that are also short in length. However, existing methods either require to fine-tune the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN) which aligns gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from single speech audio. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker's voice with single short-duration (1-3 sec) speech audio, significantly outperforming baselines.
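The SALN idea described in the abstract can be sketched as follows. This is a minimal, dependency-free illustration and not the repository's implementation: the function names, the toy weights, and the choice of a single affine projection of the style vector to produce the gain and bias are all assumptions made for the example.

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize one hidden vector to zero mean and unit variance (no learned affine)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def affine(weight, bias, vec):
    """Compute weight @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def saln(h, style, gain_w, gain_b, bias_w, bias_b):
    """Style-adaptive layer norm: gain and bias are predicted from the style vector."""
    gain = affine(gain_w, gain_b, style)
    bias = affine(bias_w, bias_b, style)
    return [g * x + b for g, x, b in zip(gain, layer_norm(h), bias)]

# Toy example: 4-dim hidden vector, 2-dim style vector (all values are made up).
hidden = [1.0, 2.0, 3.0, 4.0]
style = [0.5, -1.0]
gain_w = [[0.0, 0.0] for _ in range(4)]  # zero weights, so the gain equals gain_b
gain_b = [1.0] * 4
bias_w = [[1.0, 0.0] for _ in range(4)]  # bias equals the first style component
bias_b = [0.0] * 4
out = saln(hidden, style, gain_w, gain_b, bias_w, bias_b)
```

Unlike ordinary layer normalization, where the gain and bias are fixed learned parameters, here they change with every reference audio, which is what lets one set of model weights speak in different styles.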

Demo audio samples are available on the demo page.

Recent Updates

A few modifications to the Variance Adaptor were found to improve the quality of the model. 1) We replace the architecture of the variance embedding, changing it from one Conv1D layer to two Conv1D layers followed by a linear layer. 2) We add a layernorm and phoneme-wise positional encoding. Please refer to here.

Getting the pretrained models

Model	Link to the model
Meta-StyleSpeech	Link
StyleSpeech	Link

Prerequisites

Clone this repository.
Install the Python requirements. Please refer to requirements.txt.

Inference

You have to download the pretrained models and prepare an audio file to use as the reference speech sample.

python synthesize.py --text <raw text to synthesize> --ref_audio <path to reference speech audio> --checkpoint_path <path to pretrained model>

The generated mel-spectrogram will be saved in the results/ folder.
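For scripted use, the command above can be assembled from Python. The sketch below only builds the argument list from the three flags shown above. The helper name build_synthesize_command and the example paths are hypothetical, and actually running the command requires the repository and a downloaded checkpoint.

```python
import shlex
import sys

def build_synthesize_command(text, ref_audio, checkpoint_path, python=sys.executable):
    """Return the argv list for synthesize.py using the flags from the README."""
    return [
        python, "synthesize.py",
        "--text", text,
        "--ref_audio", ref_audio,
        "--checkpoint_path", checkpoint_path,
    ]

cmd = build_synthesize_command(
    "Hello, this is a test.",           # raw text to synthesize
    "samples/reference.wav",            # hypothetical reference audio path
    "checkpoints/stylespeech.pth.tar",  # hypothetical checkpoint path
    python="python",
)

# shlex.join quotes the text so it survives as a single shell argument.
print(shlex.join(cmd))

# To run it from the repository root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text to synthesize contains spaces, commas, or quotes.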

Preprocessing the dataset

Our models are trained on the LibriTTS dataset. Download, extract and place it in the dataset/ folder.

To preprocess the dataset : First, run

python prepare_align.py

to resample the audio to 16 kHz and carry out some other preparations.

Second, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.

./montreal-forced-aligner/bin/mfa_align dataset/wav16/ lexicon/librispeech-lexicon.txt english dataset/TextGrid/ -j 10 -v

Third, preprocess the dataset to prepare mel-spectrogram, duration, pitch and energy for fast training.

python preprocess.py
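To illustrate what the duration targets look like, the sketch below converts phoneme intervals of the kind MFA writes to a TextGrid (start and end times in seconds) into per-phoneme mel-frame counts. It is not taken from preprocess.py: the function name, the hop length of 256 samples, and the example intervals are assumptions, and only the 16 kHz sampling rate comes from the text above.

```python
def intervals_to_durations(intervals, sampling_rate=16000, hop_length=256):
    """Map (phoneme, start_sec, end_sec) intervals to mel-frame counts.

    Rounding the boundaries (rather than the lengths) keeps the total
    frame count consistent with the utterance length.
    """
    phonemes, durations = [], []
    for phoneme, start, end in intervals:
        start_frame = int(round(start * sampling_rate / hop_length))
        end_frame = int(round(end * sampling_rate / hop_length))
        phonemes.append(phoneme)
        durations.append(end_frame - start_frame)
    return phonemes, durations

# Made-up alignment for the word "hello" (times in seconds).
example = [
    ("HH", 0.00, 0.08),
    ("AH0", 0.08, 0.15),
    ("L", 0.15, 0.24),
    ("OW1", 0.24, 0.40),
]
phonemes, durations = intervals_to_durations(example)
print(list(zip(phonemes, durations)))
```

With these assumed settings one second of audio corresponds to 62.5 mel frames, so the 0.4 second example spans 25 frames in total, and the per-phoneme durations sum to that number.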

Train!

Train StyleSpeech from scratch with

python train.py

Train Meta-StyleSpeech from a pretrained StyleSpeech model with

python train_meta.py --checkpoint_path <path to pretrained StyleSpeech model>

Acknowledgements

We referred to

Comments

Why was a 16000 Hz sampling rate chosen for this research?

I have tried to train many models with a sampling rate of 22050 Hz, but they cannot reproduce the quality of the 16000 Hz model. Can you explain why you use 16000 Hz in your research?

opened by chazo1994 8
Synthesized audio is not similar to the reference audio

Dear author @KevinMIN95

Thank you for sharing this interesting project. I use the pretrained models (StyleSpeech and Meta-StyleSpeech) and a pretrained MelGAN. I also use the same speakers as on the demo page (https://stylespeech.github.io/) to evaluate the trained-speaker and unseen-speaker cases. However, the synthesized audio is not as good as you report. Could you give me some advice on how to reproduce results similar to the ones you report? Best regards

opened by tranquangchung 6
Cannot Reproduce quality of pretrained model

I have trained a StyleSpeech model using LibriTTS, but the quality was far worse than the author's pretrained StyleSpeech model. I used the default config and parameters and trained the model within 100k steps. The loss looks like below:

I have also uploaded my audio samples, for the same text as the demo page, in the folder Train_LibriTTS_StyleSpeech in the attached file. There are always strange sounds at the end of each audio file, and I can't explain that. meta_stylespeech_results.zip

opened by chazo1994 4
The Quality of Pretrained Melgan is not good

@KevinMIN95 I use this pretrained model: https://huggingface.co/Guan-Ting/StyleSpeech-MelGAN-vocoder-16kHz. I also use this repo, https://github.com/descriptinc/melgan-neurips, for inference, but the output audio sounds very bad, and the output pitch may have changed. 1463_infer.zip

@KevinMIN95 Could you please tag the author of this model if it is necessary.

opened by chazo1994 2
How to better convert mel-spectrogram generated to audio

After running the inference command, the result is a png file showing spectrograms. So I tried to make some changes in the synthesize.py file to convert the generated mel-spectrogram to an audio file. I used librosa for this purpose. Below is the code snippet to do so. But the audio obtained is blank. Can you please help me convert the mel-spectrogram to proper audio? I am not yet an expert in this field. @KevinMIN95 your help is appreciated here. Thank you

opened by purnima291 2
MelGAN vocoder

As I understand, you train your own version of MelGAN for multi-speaker synthesis, as the official code supports the sampling rate of 22.05 kHz, while StyleSpeech operates at 16 kHz. Could you share the details for reproducibility purposes: which dataset did you use, which parameters did you change? Or you can maybe upload the trained vocoder itself? It would be great!

opened by vsgogoryan 2
Unseen Speaker Adaptation

For "Unseen Speaker Adaptation", did you refine the model using data from the target speaker, or did you just use MelStyleEncoder to get ws, which adjusts the output (like zero-shot/one-shot)?

opened by Liujingxiu23 2
Error when running train.py (models/VarianceAdaptor.py line 52)

When running train.py, models/VarianceAdaptor.py line 52: x = self.ln(x) + pitch_embedding + energy_embedding

returns an error. The shape of x seems to be [Batch_size, max_text_input_length, 256]. The shape of the other two seems to be [Batch_size, ??????, 256] (I don't know what pitch_embedding.shape[1] should be.)

Is there a solution to this?

opened by SeongYeonPark 1
How to improve the synthesized results?

I have trained the model for 200k steps, and still, the synthesized results are extremely bad. The sampling rate I have used is 22050 Hz and the batch size is 16.
This is how my loss curve looks after 200k steps. Can you help me with what I can do now to improve my synthesized audio results?

opened by sanjeevani279 3
The audio quality is not good when using HiFi-GAN

As the title says, I use the HiFi-GAN vocoder to generate the audio, but the audio is full of noise. How did you produce audio of the quality shown on the demo page? Could you please share some experience?

Thanks a lot.

opened by 443127316 1
some bugs

https://github.com/KevinMIN95/StyleSpeech/blob/ddff11e680e7c1491df7733fa498aa07b0be1c5f/preprocessors/libritts.py#L130 The remove_outlier here returns None, and lines 51 and 52 raise: list indices must be integers or slices, not list

opened by MrYANG23 2

