Official implementation of Meta-StyleSpeech and StyleSpeech

Overview

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Dongchan Min, Dong Bok Lee, Eunho Yang, and Sung Ju Hwang

This is the official code for our recent paper, Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation. We provide our implementation and pretrained models as open source in this repository.

Abstract : With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech with only a few audio samples from the given speaker, which are also short in length. However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker's voice with single short-duration (1-3 sec) speech audio, significantly outperforming baselines.
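
For intuition, SALN can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed names and dimensions, not the exact code in this repository:

    import torch.nn as nn

    class SALN(nn.Module):
        # Style-Adaptive Layer Normalization (sketch): instead of the fixed
        # learned gain/bias of standard LayerNorm, predict them from a style
        # vector extracted from reference speech.
        def __init__(self, hidden_dim, style_dim):
            super().__init__()
            self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
            self.affine = nn.Linear(style_dim, 2 * hidden_dim)  # style -> (gain, bias)

        def forward(self, x, style):
            # x: (batch, time, hidden_dim); style: (batch, style_dim)
            gain, bias = self.affine(style).chunk(2, dim=-1)
            return gain.unsqueeze(1) * self.norm(x) + bias.unsqueeze(1)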

Demo audio samples are available on the demo page.


Recent Updates

We made a few modifications to the Variance Adaptor which were found to improve the quality of the model. 1) We replaced the variance embedding architecture, from a single Conv1D layer to two Conv1D layers followed by a linear layer (see the sketch below). 2) We added a layer normalization and a phoneme-wise positional encoding. Please refer to here.
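
A rough PyTorch sketch of change 1); layer names and sizes here are illustrative assumptions, not the repository's exact code:

    import torch
    import torch.nn as nn

    class VarianceEmbedding(nn.Module):
        # Embeds a scalar variance signal (pitch or energy) per frame:
        # two Conv1D layers followed by a linear layer, replacing the
        # original single-Conv1D embedding.
        def __init__(self, hidden_dim=256, kernel_size=3):
            super().__init__()
            pad = (kernel_size - 1) // 2
            self.conv1 = nn.Conv1d(1, hidden_dim, kernel_size, padding=pad)
            self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=pad)
            self.linear = nn.Linear(hidden_dim, hidden_dim)

        def forward(self, v):
            # v: (batch, time) pitch or energy values
            h = torch.relu(self.conv1(v.unsqueeze(1)))   # (batch, hidden, time)
            h = torch.relu(self.conv2(h))
            return self.linear(h.transpose(1, 2))        # (batch, time, hidden)

Change 2) then combines these embeddings with the layer-normalized hidden sequence, roughly x = LayerNorm(x) + pitch_embedding + energy_embedding, together with a phoneme-wise positional encoding.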

Getting the pretrained models

Model            | Link to the model
Meta-StyleSpeech | Link
StyleSpeech      | Link

Prerequisites

  • Clone this repository.
  • Install Python requirements. Please refer to requirements.txt.

Inference

Download a pretrained model and prepare an audio file to use as the reference speech sample.

python synthesize.py --text <raw text to synthesize> --ref_audio <path to reference speech audio> --checkpoint_path <path to pretrained model>
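
For example (file names here are purely illustrative):

    python synthesize.py --text "The birch canoe slid on the smooth planks." \
        --ref_audio samples/reference.wav \
        --checkpoint_path checkpoints/meta_stylespeech.pth.tar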

The generated mel-spectrogram will be saved in the results/ folder. To convert it to a waveform you will also need a vocoder trained at 16 kHz (see the MelGAN discussion in the comments below).

Preprocessing the dataset

Our models are trained on the LibriTTS dataset. Download and extract it, then place it in the dataset/ folder.

To preprocess the dataset, first run

python prepare_align.py 

to resample the audio to 16 kHz and perform some other preparations.
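
The resampling step amounts to something like the following. This is an illustrative sketch using librosa with a placeholder path; prepare_align.py handles this for the whole corpus:

    import librosa
    import soundfile as sf

    # Load at the target rate; librosa resamples on load.
    wav, sr = librosa.load("path/to/utterance.wav", sr=16000)
    sf.write("dataset/wav16/utterance.wav", wav, sr)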

Second, use the Montreal Forced Aligner (MFA) to obtain the alignments between the utterances and the phoneme sequences.

./montreal-forced-aligner/bin/mfa_align dataset/wav16/ lexicon/librispeech-lexicon.txt english dataset/TextGrid/ -j 10 -v

Third, preprocess the dataset to prepare the mel-spectrograms, durations, pitch, and energy values for fast training.

python preprocess.py
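
For intuition, the features it prepares are roughly the following. This is an illustrative sketch, and the STFT parameters here are assumptions; the actual script follows a FastSpeech2-style pipeline:

    import numpy as np
    import librosa
    import pyworld as pw

    wav, sr = librosa.load("path/to/utterance.wav", sr=16000)

    # Mel-spectrogram: the decoder's training target
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)

    # Frame-level fundamental frequency (pitch) via WORLD
    wav64 = wav.astype(np.float64)
    f0, t = pw.dio(wav64, sr, frame_period=256 / sr * 1000)
    f0 = pw.stonemask(wav64, f0, t, sr)

    # Energy: L2 norm of each STFT frame
    stft = librosa.stft(wav, n_fft=1024, hop_length=256)
    energy = np.linalg.norm(np.abs(stft), axis=0)

    # Durations come from the MFA TextGrid alignments produced in the previous step.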

Train!

Train StyleSpeech from scratch with

python train.py 

Train Meta-StyleSpeech from a pretrained StyleSpeech model with

python train_meta.py --checkpoint_path <path to pretrained StyleSpeech model>

Acknowledgements

We referred to

Comments
  • Why was a 16000 Hz sampling rate chosen for this research?

    I have trained many models with a 22050 Hz sampling rate, but they cannot reproduce the quality of the 16000 Hz model. Can you explain why you used 16000 Hz in your research?

    opened by chazo1994 8
  • Synthesized audio is not similar to the reference audio

    Dear author @KevinMIN95

    Thank you for sharing this interesting project. I use the pretrained models (StyleSpeech and Meta-StyleSpeech) and the pretrained MelGAN model. I also use the same speakers as on the demo page (https://stylespeech.github.io/) to evaluate the Trained Speaker and Unseen Speaker settings. However, the synthesized audio is not as good as you report. Could you give me some advice on how to reproduce the results you report? Best regards

    opened by tranquangchung 6
  • Cannot reproduce quality of pretrained model

    I have trained a StyleSpeech model on LibriTTS, but the quality was far worse than the author's pretrained StyleSpeech model. I used the default config and parameters and trained the model for 100k steps. (Loss curve screenshots omitted.)

    I have also uploaded audio samples for the same text as the demo page, in the folder Train_LibriTTS_StyleSpeech in the attached file. There are always strange sounds at the end of each audio file, which I cannot explain. meta_stylespeech_results.zip

    opened by chazo1994 4
  • The quality of the pretrained MelGAN is not good

    @KevinMIN95 I use this pretrained model: https://huggingface.co/Guan-Ting/StyleSpeech-MelGAN-vocoder-16kHz, and this repo https://github.com/descriptinc/melgan-neurips for inference, but the output audio sounds very bad; the output pitch may have changed. 1463_infer.zip

    @KevinMIN95 Could you please tag the author of this model if necessary?

    opened by chazo1994 2
  • How to better convert the generated mel-spectrogram to audio

    After running the inference command, the result is a PNG file showing spectrograms. So I made some changes in synthesize.py to convert the generated mel-spectrogram to an audio file, using librosa for this purpose. (Code screenshot omitted.) But the audio obtained is blank. Can you please help me convert the mel to proper audio? I am not yet an expert in this field. @KevinMIN95 your help is appreciated here. Thank you.

    opened by purnima291 2
  • MelGAN vocoder

    As I understand it, you trained your own version of MelGAN for multi-speaker synthesis, since the official code supports a 22.05 kHz sampling rate while StyleSpeech operates at 16 kHz. Could you share the details for reproducibility: which dataset did you use, and which parameters did you change? Or could you perhaps upload the trained vocoder itself? It would be great!

    opened by vsgogoryan 2
  • Unseen speaker adaptation

    For "unseen speaker adaptation", did you fine-tune the model using data from the target speaker, or just use the MelStyleEncoder to get the style vector ws that adjusts the output (like zero-shot/one-shot)?

    opened by Liujingxiu23 2
  • Error when running train.py (models/VarianceAdaptor.py line 52)

    When running train.py, models/VarianceAdaptor.py line 52:

    x = self.ln(x) + pitch_embedding + energy_embedding

    returns an error. The shape of x seems to be [batch_size, max_text_input_length, 256], while the shape of the other two seems to be [batch_size, ??????, 256] (I don't know what pitch_embedding.shape(1) should be).

    Is there a solution to this?

    opened by SeongYeonPark 1
  • How to improve the synthesized results?

    I have trained the model for 200k steps, and the synthesized results are still extremely bad. The sampling rate I used is 22050 Hz and the batch size is 16.
    (Loss curve screenshot omitted.) This is how my loss curve looks after 200k steps. Can you help me figure out what I can do now to improve my synthesized audio results?

    opened by sanjeevani279 3
  • The audio quality is not good when using HiFi-GAN

    As the title says, I use the HiFi-GAN vocoder to generate the audio, but the audio is full of noise. How did you produce audio as good as the demo page? Could you please share some experience?

    Thanks a lot.

    opened by 443127316 1
  • Some bugs

    https://github.com/KevinMIN95/StyleSpeech/blob/ddff11e680e7c1491df7733fa498aa07b0be1c5f/preprocessors/libritts.py#L130 — the remove_outlier call here returns None (screenshot omitted), and in lines 51-52: "list indices must be integers or slices, not list".

    opened by MrYANG23 2