Multispeaker & Emotional TTS based on Tacotron 2 and Waveglow

General description
DONE
TODO
Getting Started
- Requirements
- Setup
Code structure description
Data Preprocessing
- Preparing for data preprocessing
- Run preprocessing
Training
Running TensorBoard
Inference
Parameters
Contributing

General description

This repository contains sample code for Tacotron 2 and WaveGlow with multi-speaker and emotion embeddings, together with a script for data preprocessing.
Checkpoints and code originate from the following sources:

Done:

TODO:

make it work with pytorch-1.4.0
add multi-spot instance training for AWS

Getting Started

The following section lists the requirements in order to start training the Tacotron 2 and WaveGlow models.

Clone the repository:

git clone https://github.com/ide8/tacotron2  
cd tacotron2
PROJDIR=$(pwd)
export PYTHONPATH=$PROJDIR:$PYTHONPATH

Requirements

This repository contains a Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:

NVIDIA Docker
PyTorch 19.06-py3 NGC container or newer
NVIDIA Volta or Turing based GPU

Setup

Build an image from the Dockerfile:

docker build --tag taco .

Run docker container:

docker run --shm-size=8G --runtime=nvidia -v /absolute/path/to/your/code:/app -v /absolute/path/to/your/training_data:/mnt/train -v /absolute/path/to/your/logs:/mnt/logs -v /absolute/path/to/your/raw-data:/mnt/raw-data -v /absolute/path/to/your/pretrained-checkpoint:/mnt/pretrained --detach taco sleep inf

Check container id:

docker ps

Find the id of the container created from the image tagged taco and open a shell in it:

docker exec -it container_id bash

Code structure description

The tacotron2 and waveglow folders contain the scripts for the Tacotron 2 and WaveGlow models. Each consists of:

/model.py - model architecture
/data_function.py - data loading functions
/loss_function.py - loss function

Folder common contains common layers for both models (common/layers.py), utils (common/utils.py) and audio processing (common/audio_processing.py and common/stft.py).

Folder router is used by the training script to select the appropriate model.

In the root directory:

train.py - script for model training
preprocess.py - performs audio processing and creates training and validation datasets
inference.ipynb - notebook for running inference

Folder configs contains __init__.py with all parameters needed for training and data processing. Folder configs/experiments consists of all the experiments. waveglow.py and tacotron2.py are provided as examples for WaveGlow and Tacotron 2. On training or data processing start, parameters are copied from your experiment (in our case - from waveglow.py or from tacotron2.py) to __init__.py, from which they are used by the system.

Data preprocessing

Preparing for data preprocessing

For each speaker, prepare a folder named after the speaker that contains a wavs folder and a metadata.csv file with lines in the format: file_name.wav|text.
All parameters needed for preprocessing should be set in configs/experiments/waveglow.py or in configs/experiments/tacotron2.py, in the class PreprocessingConfig.
If you are running preprocessing for the first time, set the start_from_preprocessed flag to False. preprocess.py then trims silence at the beginning and end of each audio file using the PreprocessingConfig.top_db threshold, and runs an ffmpeg command that converts all wavs in the dataset to mono with the same sampling rate and bit rate.
It saves a wavs folder with the processed audio files and a data.csv file in PreprocessingConfig.output_directory with the following format: path|text|speaker_name|speaker_id|emotion|text_len|duration.
Trimming and the ffmpeg command are applied only to speakers whose process_audio flag is True. Speakers whose emotion_present flag is False are treated as having the emotion neutral-normal.
Once the preprocessing script has finished, you no longer need start_from_preprocessed = False. The only exception is when new raw data comes in.
Once start_from_preprocessed is set to True, the script loads data.csv (created by the start_from_preprocessed = False run) and forms train.txt and val.txt from it.
Main PreprocessingConfig parameters:
1. cpus - defines number of cores for batch generator
2. sr - defines the sample rate for reading and writing audio
3. emo_id_map - dictionary for emotion name to emotion_id mapping
4. data[{'path'}] - is path to folder named with speaker name and containing wavs folder and metadata.csv with the following line format: file_name.wav|text|emotion (optional)
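
As a rough illustration of the metadata.csv line format described above, the following sketch parses lines of the form file_name.wav|text with an optional third emotion field. The names MetadataRow and parse_metadata_line are hypothetical and exist only for this example; preprocess.py may read the file differently.

```python
from typing import NamedTuple, Optional


class MetadataRow(NamedTuple):
    file_name: str
    text: str
    emotion: Optional[str]  # None when the line has no emotion column


def parse_metadata_line(line: str) -> MetadataRow:
    """Split one line: file_name.wav|text or file_name.wav|text|emotion."""
    parts = line.rstrip("\n").split("|")
    if len(parts) == 2:
        file_name, text = parts
        return MetadataRow(file_name, text, None)
    if len(parts) == 3:
        file_name, text, emotion = parts
        return MetadataRow(file_name, text, emotion)
    raise ValueError(f"expected 2 or 3 '|'-separated fields, got {len(parts)}")


# Hypothetical sample lines, not taken from any real dataset.
rows = [
    parse_metadata_line("0001.wav|Hello there.\n"),
    parse_metadata_line("0002.wav|I am so happy!|happy\n"),
]
for row in rows:
    print(row)
```

A row whose emotion is None corresponds to the case where the speaker has no emotion labels, which the preprocessing step treats as neutral-normal.
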
Preprocessing script forms training and validation datasets in the following way:
1. selects rows whose audio duration and text length are less than or equal to those of the speaker PreprocessingConfig.limit_by (this step is needed for a proper batch size)
2. if that speaker is not present, it selects rows within PreprocessingConfig.text_limit and PreprocessingConfig.dur_limit. The lower limit for audio duration is defined by PreprocessingConfig.minimum_viable_dur
3. in order to be able to use the same batch size as NVIDIA, set PreprocessingConfig.limit_by to linda_jonson
4. splits dataset randomly by ratio train : val = 0.95 : 0.05
5. if a speaker's train set has more than PreprocessingConfig.n rows, samples n rows from it
6. saves train.txt and val.txt to PreprocessingConfig.output_directory
7. saves emotion_coefficients.json and speaker_coefficients.json with coefficients for loss balancing (used by train.py).
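
The filtering, splitting and sampling steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the code in preprocess.py: the function names, the seed handling and the toy rows are all hypothetical, and the loss-coefficient step is left out because its formula is not described here.

```python
import random
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]  # keys follow the data.csv columns named above


def resolve_limits(rows: List[Row], limit_by: Optional[str],
                   text_limit: int, dur_limit: float) -> Tuple[int, float]:
    """Steps 1-2: take limits from the limit_by speaker, else use the configured ones."""
    ref = [r for r in rows if r["speaker_name"] == limit_by]
    if ref:
        return max(r["text_len"] for r in ref), max(r["duration"] for r in ref)
    return text_limit, dur_limit


def filter_rows(rows: List[Row], max_text: int, max_dur: float,
                min_dur: float) -> List[Row]:
    """Keep rows within the text limit and between the duration bounds."""
    return [r for r in rows
            if r["text_len"] <= max_text and min_dur <= r["duration"] <= max_dur]


def split_train_val(rows: List[Row], val_ratio: float = 0.05,
                    seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Step 4: random split, train : val = 0.95 : 0.05 by default."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_ratio))
    return shuffled[n_val:], shuffled[:n_val]


def cap_per_speaker(train: List[Row], n: int, seed: int = 0) -> List[Row]:
    """Step 5: sample n rows for any speaker that has more than n."""
    rng = random.Random(seed)
    by_speaker: Dict[str, List[Row]] = {}
    for r in train:
        by_speaker.setdefault(r["speaker_name"], []).append(r)
    capped: List[Row] = []
    for speaker_rows in by_speaker.values():
        if len(speaker_rows) > n:
            speaker_rows = rng.sample(speaker_rows, n)
        capped.extend(speaker_rows)
    return capped


# Toy data: one hypothetical speaker with 200 utterances.
rows = [{"speaker_name": "alice", "text_len": i, "duration": 1.0 + (i % 10)}
        for i in range(200)]
max_text, max_dur = resolve_limits(rows, limit_by=None, text_limit=159, dur_limit=8.0)
kept = filter_rows(rows, max_text, max_dur, min_dur=2.0)
train, val = split_train_val(kept)
train = cap_per_speaker(train, n=100)
print(len(kept), len(train), len(val))
```

Fixing the random seed, as done here, keeps the split reproducible between runs; whether the actual script does so is not stated above.
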

Run preprocessing

Since both waveglow.py and tacotron2.py contain the class PreprocessingConfig, the training and validation datasets can be produced by running preprocessing with either experiment:

python preprocess.py --exp tacotron2

python preprocess.py --exp waveglow

Training

Preparing for training

Tacotron 2

In configs/experiments/tacotron2.py, in the class Config set:

training_files and validation_files - paths to train.txt, val.txt;
tacotron_checkpoint - path to pretrained Tacotron 2 if it exists (we were able to restore WaveGlow from NVIDIA, but the Tacotron 2 code was edited to add speakers and emotions, so Tacotron 2 needs to be trained from scratch);
speaker_coefficients - path to speaker_coefficients.json;
emotion_coefficients - path to emotion_coefficients.json;
output_directory - path for writing logs and checkpoints;
use_emotions - flag indicating emotions usage;
use_loss_coefficients - flag indicating loss scaling to compensate for possible data imbalance across speakers and emotions; to balance the loss, set the paths to the JSON files with coefficients in emotion_coefficients and speaker_coefficients;
model_name - "Tacotron2".
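
For orientation, the settings listed above might look like this in an experiment file. This is only a sketch: every path is a hypothetical example based on the volumes mounted in the docker run command, and using None for a missing checkpoint is an assumption, so check the provided tacotron2.py for the actual conventions.

```python
class Config:
    # Hypothetical paths; adjust them to your PreprocessingConfig.output_directory
    # and to the volumes mounted into your container.
    training_files = "/mnt/train/train.txt"
    validation_files = "/mnt/train/val.txt"

    # Assumption: None means "no pretrained Tacotron 2, train from scratch".
    tacotron_checkpoint = None

    # JSON files written by preprocess.py, used for loss balancing.
    speaker_coefficients = "/mnt/train/speaker_coefficients.json"
    emotion_coefficients = "/mnt/train/emotion_coefficients.json"

    output_directory = "/mnt/logs"  # logs and checkpoints are written here

    use_emotions = True
    use_loss_coefficients = True
    model_name = "Tacotron2"
```
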

Launch training

Single gpu:
```
python train.py --exp tacotron2
```

Multigpu training:

python -m multiproc train.py --exp tacotron2

WaveGlow

In configs/experiments/waveglow.py, in the class Config set:

training_files and validation_files - paths to train.txt, val.txt;
waveglow_checkpoint - path to pretrained WaveGlow, restored from NVIDIA. Download checkpoint.
output_directory - path for writing logs and checkpoints;
use_emotions - False;
use_loss_coefficients - False;
model_name - "WaveGlow".

Launch training

Single gpu:
```
python train.py --exp waveglow
```

Multigpu training:

python -m multiproc train.py --exp waveglow

Running TensorBoard

Once training has started, you may want to monitor its progress. Check the container id:

docker ps

Find the id of the container created from the image tagged taco and run:

docker exec -it container_id bash

Start TensorBoard:

 tensorboard --logdir=path_to_folder_with_logs --host=0.0.0.0

The training loss is written to TensorBoard.

Audio samples together with attention alignments are saved to TensorBoard every Config.epochs_per_checkpoint epochs. Transcripts for the audio samples are listed in Config.phrases.

Inference

Running inference with the inference.ipynb notebook.

Run Jupyter Notebook:

jupyter notebook --ip 0.0.0.0 --port 6006 --no-browser --allow-root

output:

root@04096a19c266:/app# jupyter notebook --ip 0.0.0.0 --port 6006 --no-browser --allow-root
[I 09:31:25.393 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
[I 09:31:25.393 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 09:31:25.395 NotebookApp] Serving notebooks from local directory: /app
[I 09:31:25.395 NotebookApp] The Jupyter Notebook is running at:
[I 09:31:25.395 NotebookApp] http://(04096a19c266 or 127.0.0.1):6006/?token=bbd413aef225c1394be3b9de144242075e651bea937eecce
[I 09:31:25.395 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 09:31:25.398 NotebookApp] 
    
    To access the notebook, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/nbserver-15398-open.html
    Or copy and paste one of these URLs:
        http://(04096a19c266 or 127.0.0.1):6006/?token=bbd413aef225c1394be3b9de144242075e651bea937eecce

Select the address with 127.0.0.1 and open it in the browser. In this case: http://127.0.0.1:6006/?token=bbd413aef225c1394be3b9de144242075e651bea937eecce

This script takes text as input and runs Tacotron 2 and then WaveGlow inference to produce an audio file. It requires pre-trained checkpoints from Tacotron 2 and WaveGlow models, input text, speaker_id and emotion_id.

Change the paths to the pretrained Tacotron 2 and WaveGlow checkpoints in cell [2] of inference.ipynb.
Write the text to be synthesized in cell [7] of inference.ipynb.

Parameters

In this section, we list the most important hyperparameters, together with their default values that are used to train Tacotron 2 and WaveGlow models.

Shared parameters

epochs - number of epochs (Tacotron 2: 1501, WaveGlow: 1001)
learning-rate - learning rate (Tacotron 2: 1e-3, WaveGlow: 1e-4)
batch-size - batch size (Tacotron 2: 64, WaveGlow: 11)
grad_clip_thresh - gradient clipping threshold (0.1)

Shared audio/STFT parameters

sampling-rate - sampling rate in Hz of input and output audio (22050)
filter-length - (1024)
hop-length - hop length for FFT, i.e., sample stride between consecutive FFTs (256)
win-length - window size for FFT (1024)
mel-fmin - lowest frequency in Hz (0.0)
mel-fmax - highest frequency in Hz (8000.0)
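
To make these defaults more concrete, the short calculation below derives the frame rate and the hop and window durations they imply. The helper approx_num_frames is hypothetical and assumes a centered STFT that yields one frame per hop plus one; the repository's own STFT code may pad differently.

```python
sampling_rate = 22050  # Hz
hop_length = 256       # samples between consecutive frames
win_length = 1024      # samples per analysis window

frames_per_second = sampling_rate / hop_length
hop_ms = 1000.0 * hop_length / sampling_rate
win_ms = 1000.0 * win_length / sampling_rate


def approx_num_frames(num_samples: int, hop: int = hop_length) -> int:
    """Approximate mel frame count, assuming a centered STFT."""
    return num_samples // hop + 1


print(f"{frames_per_second:.2f} frames per second")
print(f"hop: {hop_ms:.2f} ms, window: {win_ms:.2f} ms")
print(f"frames in one second of audio: {approx_num_frames(sampling_rate)}")
```
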

Tacotron parameters

anneal-steps - epochs at which to anneal the learning rate (500/ 1000/ 1500)
anneal-factor - factor by which to anneal the learning rate (0.1)

These two parameters change the learning rate at the epochs listed in anneal-steps according to:
learning_rate = learning_rate * (anneal_factor ** p),
where p = 0 before the first anneal step and increases by 1 at each anneal step.
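
One reading of the annealing rule above, written out as a small function. The name annealed_learning_rate is hypothetical, and counting p as the number of anneal steps already reached is an assumption about how the rule is applied; train.py may implement it differently.

```python
from typing import Sequence


def annealed_learning_rate(base_lr: float, epoch: int,
                           anneal_steps: Sequence[int],
                           anneal_factor: float) -> float:
    """Return base_lr * anneal_factor ** p for the given epoch."""
    # p counts how many anneal steps have been reached at this epoch.
    p = sum(1 for step in anneal_steps if epoch >= step)
    return base_lr * (anneal_factor ** p)


# Tacotron 2 defaults from this section.
for epoch in (0, 499, 500, 1000, 1500):
    lr = annealed_learning_rate(1e-3, epoch, (500, 1000, 1500), 0.1)
    print(epoch, lr)
```

With the defaults, the learning rate stays at 1e-3 until epoch 500 and is then divided by 10 at each of the three listed epochs.
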

WaveGlow parameters

segment-length - segment length of input audio processed by the neural network (8000). Before being passed to the network, audio is padded or cropped to segment-length.
wn_config - dictionary with parameters of affine coupling layers. Contains n_layers, n_channels, kernel_size.
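
The pad-or-crop step for segment-length can be sketched as below. This is a simplified illustration on plain Python lists: the function name fit_to_segment is hypothetical, and both the random crop position and the zero padding at the end are assumptions about details this section does not specify.

```python
import random
from typing import List, Optional


def fit_to_segment(audio: List[float], segment_length: int = 8000,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return exactly segment_length samples by cropping or zero-padding."""
    if len(audio) >= segment_length:
        # Assumption: crop a window starting at a random position.
        rng = rng or random.Random()
        start = rng.randint(0, len(audio) - segment_length)
        return audio[start:start + segment_length]
    # Assumption: pad short clips with zeros at the end.
    return audio + [0.0] * (segment_length - len(audio))


long_clip = fit_to_segment([0.1] * 20000, rng=random.Random(0))
short_clip = fit_to_segment([0.5] * 3000)
print(len(long_clip), len(short_clip))
```
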

Contributing

If you've ever wanted to contribute to open source, and a great cause, now is your chance!

See the contributing docs for more information

Owner

Ivan Didur
