Official implementation of Meta-StyleSpeech and StyleSpeech

Overview

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Dongchan Min, Dong Bok Lee, Eunho Yang, and Sung Ju Hwang

This is the official code for our recent paper, Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation. We provide our implementation and pretrained models as open source in this repository.

Abstract: With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech with only a few audio samples from the given speaker, which are also short in length. However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes and performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker's voice with a single short-duration (1-3 sec) speech audio, significantly outperforming baselines.
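
To make SALN concrete, here is a minimal PyTorch sketch of the idea described in the abstract. The dimensions and the single-linear-layer parameterization are illustrative assumptions, not the released implementation:

import torch.nn as nn

class StyleAdaptiveLayerNorm(nn.Module):
    # Sketch of SALN: layer-normalize the hidden sequence, then apply a
    # gain and bias predicted from the style vector w (assumed shapes below).
    def __init__(self, hidden_dim, style_dim):
        super().__init__()
        # LayerNorm without learnable affine parameters; the affine
        # transform comes from the style vector instead.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # A single linear layer predicts both gain and bias from the style.
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, x, w):
        # x: (batch, time, hidden_dim), w: (batch, style_dim)
        gain, bias = self.affine(w).chunk(2, dim=-1)
        return gain.unsqueeze(1) * self.norm(x) + bias.unsqueeze(1)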

Demo audio samples are available on the demo page.


Recent Updates

A few modifications to the Variance Adaptor were found to improve the quality of the model: 1) We replaced the variance embedding architecture, going from one Conv1D layer to two Conv1D layers followed by a linear layer. 2) We added a layer normalization and a phoneme-wise positional encoding. Please refer to here.
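
For illustration, below is a minimal PyTorch sketch of the updated variance embedding; the channel sizes, kernel width, and activation are assumptions rather than the released configuration:

import torch.nn as nn

class VarianceEmbedding(nn.Module):
    # Sketch of the updated embedding: two Conv1D layers followed by a
    # linear layer (the previous version used a single Conv1D layer).
    def __init__(self, in_dim=1, hidden_dim=256, kernel_size=3):
        super().__init__()
        padding = (kernel_size - 1) // 2
        self.conv1 = nn.Conv1d(in_dim, hidden_dim, kernel_size, padding=padding)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=padding)
        self.fc = nn.Linear(hidden_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, x):
        # x: (batch, time, in_dim); Conv1d expects (batch, channels, time)
        x = self.act(self.conv1(x.transpose(1, 2)))
        x = self.act(self.conv2(x))
        return self.fc(x.transpose(1, 2))

# The layer normalization is applied to the hidden sequence before the pitch
# and energy embeddings are added (phoneme-wise positional encoding omitted):
#     x = layer_norm(x) + pitch_embedding + energy_embedding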

Getting the pretrained models

Model              Link to the model
Meta-StyleSpeech   Link
StyleSpeech        Link

Prerequisites

  • Clone this repository.
  • Install Python requirements. Please refer to requirements.txt.

Inference

You have to download the pretrained models and prepare an audio file as the reference speech sample.

python synthesize.py --text <raw text to synthesize> --ref_audio <path to reference speech audio> --checkpoint_path <path to pretrained model>

The generated mel-spectrogram will be saved in the results/ folder.
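
Note that synthesize.py outputs a mel-spectrogram rather than a waveform, so a separate neural vocoder (e.g., the pretrained 16 kHz MelGAN discussed in the comments below) is needed for high-quality audio. As a rough sanity check only, a mel-spectrogram can be inverted with Griffin-Lim; this is a minimal sketch assuming a hypothetical saved-array path and illustrative mel parameters, not the repository's exact configuration:

import numpy as np
import librosa
import soundfile as sf

mel = np.load("results/mel.npy")  # hypothetical path; shape (n_mels, time)
mel = np.exp(mel)                 # undo log compression if the mel is log-scaled
wav = librosa.feature.inverse.mel_to_audio(mel, sr=16000, n_fft=1024, hop_length=256)
sf.write("results/sample.wav", wav, 16000)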

Preprocessing the dataset

Our models are trained on the LibriTTS dataset. Download it, extract it, and place it in the dataset/ folder.

To preprocess the dataset, first run

python prepare_align.py 

to resample the audio to 16 kHz and perform some other preparations.
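
Conceptually, the resampling step amounts to the following sketch (hypothetical paths; prepare_align.py also performs other preparations not shown here):

import librosa
import soundfile as sf

wav, sr = librosa.load("dataset/LibriTTS/some_utterance.wav", sr=16000)  # resamples on load
sf.write("dataset/wav16/some_utterance.wav", wav, 16000)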

Second, use the Montreal Forced Aligner (MFA) to obtain the alignments between the utterances and the phoneme sequences:

./montreal-forced-aligner/bin/mfa_align dataset/wav16/ lexicon/librispeech-lexicon.txt english dataset/TextGrid/ -j 10 -v

Third, preprocess the dataset to prepare the mel-spectrograms, durations, pitch, and energy for fast training:

python preprocess.py
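
For reference, here is a minimal sketch of one common way such features are extracted (the exact method and parameters in preprocess.py may differ; durations come from the MFA TextGrids and are not shown):

import numpy as np
import librosa
import pyworld as pw

wav, sr = librosa.load("dataset/wav16/some_utterance.wav", sr=16000)

# Mel-spectrogram (illustrative STFT parameters)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Pitch (F0) via WORLD: coarse estimate, then refinement
x = wav.astype(np.float64)
f0, t = pw.dio(x, sr, frame_period=256 / sr * 1000)
f0 = pw.stonemask(x, f0, t, sr)

# Energy: L2 norm of the magnitude of each STFT frame
energy = np.linalg.norm(np.abs(librosa.stft(wav, n_fft=1024, hop_length=256)), axis=0)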

Train!

Train StyleSpeech from scratch with

python train.py 

Train Meta-StyleSpeech from a pretrained StyleSpeech model with

python train_meta.py --checkpoint_path <path to pretrained StyleSpeech model>

Acknowledgements

We referred to

Comments
  • Why was a 16000 Hz sampling rate chosen for this research?

    I have tried to train many models with a 22050 Hz sampling rate, but they cannot reproduce the quality of the 16000 Hz model. Can you explain why you used 16000 Hz in your research?

    opened by chazo1994 8
  • Synthesized audio is not similar to the reference audio

    Dear author @KevinMIN95,

    Thank you for sharing this interesting project. I used the pretrained models (StyleSpeech and Meta-StyleSpeech) and the pretrained MelGAN. I also used the same speakers as on the demo page (https://stylespeech.github.io/) to evaluate the Trained Speaker and Unseen Speaker cases. However, the synthesized audio is not as good as you report. Could you give me some advice on how to reproduce the results you report? Best regards

    opened by tranquangchung 6
  • Cannot reproduce quality of pretrained model

    I have trained a StyleSpeech model on LibriTTS, but the quality was far worse than the author's pretrained StyleSpeech model. I used the default config and parameters and trained the model for 100k steps. [loss-curve screenshots attached]

    I have also uploaded my audio samples, for the same text as the demo page, in the folder Train_LibriTTS_StyleSpeech in the attached file. There are always strange sounds at the end of each audio file, which I cannot explain. meta_stylespeech_results.zip

    opened by chazo1994 4
  • The quality of the pretrained MelGAN is not good

    @KevinMIN95 I use this pretrained model: https://huggingface.co/Guan-Ting/StyleSpeech-MelGAN-vocoder-16kHz, and this repo for inference: https://github.com/descriptinc/melgan-neurips, but the output audio sounds very bad and the output pitch may have changed. 1463_infer.zip

    @KevinMIN95 Could you please tag the author of this model if necessary?

    opened by chazo1994 2
  • How to better convert the generated mel-spectrogram to audio

    After running the inference command, the result is a PNG file showing spectrograms. So I made some changes in synthesize.py to convert the generated mel-spectrogram to an audio file, using librosa for this purpose (code snippet attached as a screenshot). But the audio obtained is blank. Can you please help me convert the mel to proper audio? I am not yet an expert in this field. @KevinMIN95 your help is appreciated here. Thank you.

    opened by purnima291 2
  • MelGAN vocoder

    As I understand it, you trained your own version of MelGAN for multi-speaker synthesis, since the official code supports a sampling rate of 22.05 kHz while StyleSpeech operates at 16 kHz. Could you share the details for reproducibility purposes: which dataset did you use, and which parameters did you change? Or could you perhaps upload the trained vocoder itself? It would be great!

    opened by vsgogoryan 2
  • Unseen Speaker Adaptation

    For "Unseen Speaker Adaptation", did you fine-tune the model using data from the target speaker, or just use the MelStyleEncoder to get ws, which adjusts the output (i.e., zero-shot/one-shot)?

    opened by Liujingxiu23 2
  • Error when running train.py (models/VarianceAdaptor.py line 52)

    When running train.py, models/VarianceAdaptor.py line 52, x = self.ln(x) + pitch_embedding + energy_embedding, returns an error. The shape of x seems to be [batch_size, max_text_input_length, 256], while the shape of the other two seems to be [batch_size, ??????, 256] (I don't know what pitch_embedding.shape[1] should be).

    Is there a solution to this?

    opened by SeongYeonPark 1
  • How to improve the synthesized results?

    I have trained the model for 200k steps and the synthesized results are still extremely bad. The sampling rate I used is 22050 Hz and the batch size is 16. [loss-curve screenshot attached] This is how my loss curve looks after 200k steps. Can you help me with what I can do now to improve my synthesized audio results?

    opened by sanjeevani279 3
  • The audio quality is not good when using HiFi-GAN

    As the title says: I use the HiFi-GAN vocoder to generate the audio, but the audio is full of noise. How did you produce audio of the quality on the demo page? Could you please share some experience?

    Thanks a lot.

    opened by 443127316 1
  • Some bugs

    At https://github.com/KevinMIN95/StyleSpeech/blob/ddff11e680e7c1491df7733fa498aa07b0be1c5f/preprocessors/libritts.py#L130, remove_outlier returns None (screenshot attached), and in lines 51-52: list indices must be integers or slices, not list.

    opened by MrYANG23 2