Comprehensive-E2E-TTS - PyTorch Implementation

Overview

A Non-Autoregressive End-to-End Text-to-Speech model (generating a waveform directly from text), supporting a family of SOTA unsupervised duration modeling methods. This project grows with the research community, aiming to achieve the ultimate E2E-TTS. Any suggestions toward the best End-to-End TTS are welcome :)

Architecture Design

Linguistic Encoder

Audio Upsampler

Duration Modeling

Quickstart

In the following documentation, DATASET refers to the name of a dataset such as LJSpeech or VCTK.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

A Dockerfile is also provided for Docker users.
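
For example, you could build and run the image with something like the following (the image tag and mount path are arbitrary choices, not defined by this repository):

docker build -t comprehensive-e2e-tts .
docker run --gpus all -it -v "$(pwd)":/workspace comprehensive-e2e-tts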

Inference

Download the pretrained models (to be shared soon) and put them in output/ckpt/DATASET/.

For a single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET

For a multi-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
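
If you are unsure which SPEAKER_ID to use, you can inspect the speaker dictionary directly. A minimal sketch, assuming speakers.json is a plain JSON mapping from speaker name to integer ID:

import json

# Load the speaker dictionary produced during preprocessing
with open("preprocessed_data/DATASET/speakers.json") as f:
    speakers = json.load(f)

# List every available speaker name and its ID
for name, speaker_id in speakers.items():
    print(name, speaker_id)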

Batch Inference

Batch inference is also supported; try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt.

Controllability

The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, you can increase the speaking rate by 20% and decrease the volume by 20% with

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8

Add --speaker_id SPEAKER_ID for a multi-speaker TTS.
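
Pitch can presumably be controlled in the same way; assuming a --pitch_control flag analogous to the two flags shown above (not shown in this document, so treat the flag name as an assumption), raising the pitch by 20% would look like

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --pitch_control 1.2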

Training

Datasets

The supported datasets are

  • LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads out about 400 sentences, selected from a newspaper, the rainbow passage, and an elicitation paragraph used for the speech accent archive.

Any other single-speaker TTS dataset (e.g., Blizzard Challenge 2013) or multi-speaker TTS dataset (e.g., LibriTTS) can be added by following LJSpeech or VCTK, respectively. Moreover, your own language and dataset can be adapted by following the guide here.

Preprocessing

Training

Train your model with

python3 train.py --dataset DATASET

Useful options:

  • The trainer assumes single-node multi-GPU training. To use specific GPUs, prepend CUDA_VISIBLE_DEVICES=<GPU_IDs> to the above command, as shown below.
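
For example, to train on GPUs 0 and 1 only:

CUDA_VISIBLE_DEVICES=0,1 python3 train.py --dataset DATASET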

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost.
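
By default, TensorBoard will then be available at http://localhost:6006.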

Notes

  • There are two options for speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pre-trained DeepSpeaker model (as STYLER did). You can toggle between them in the config (between 'none' and 'DeepSpeaker'); see the sketch after this list.
  • DeepSpeaker on the VCTK dataset shows clear discrimination among speakers. The following figure shows the t-SNE plot of the extracted speaker embeddings.
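
A hypothetical sketch of the toggle in the model config (the exact file and key name may differ in this repository; only the 'none'/'DeepSpeaker' values are taken from the note above):

speaker_embedder: "DeepSpeaker"  # or "none" to train a speaker embedder from scratch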

Citation

Please cite this repository via the "Cite this repository" button in the About section (top right of the main page).

References

Comments
  • Variance Loss RuntimeError

    Hi there,

    I'm trying to train a model with LJ data, but at step 50331 I get:

      File "train.py", line 339, in <module>
        train(0, args, configs, batch_size, num_gpus)
      File "train.py", line 196, in train
        ) = Loss.variance_loss(batch, output, step=step)
      File "/gpfs/fs2c/nrc/ict/portage/u/tts/code/Comprehensive-E2E-TTS-new/model/loss.py", line 206, in variance_loss
        ctc_loss = self.sum_loss(attn_logprob=attn_logprob, in_lens=src_lens, out_lens=mel_lens)
      File "/space/partner/nrc/work/ict/portage/u/tts/opt/miniconda3/envs/jets/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/gpfs/fs2c/nrc/ict/portage/u/tts/code/Comprehensive-E2E-TTS-new/model/loss.py", line 249, in forward
        target_lengths=key_lens[bid : bid + 1],
      File "/space/partner/nrc/work/ict/portage/u/tts/opt/miniconda3/envs/jets/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/space/partner/nrc/work/ict/portage/u/tts/opt/miniconda3/envs/jets/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 1502, in forward
        self.zero_infinity)
      File "/space/partner/nrc/work/ict/portage/u/tts/opt/miniconda3/envs/jets/lib/python3.7/site-packages/torch/nn/functional.py", line 2201, in ctc_loss
        zero_infinity)
    RuntimeError: Expected input_lengths to have value at most 693, but got value 694 (while checking arguments for ctc_loss_gpu)
    

    I'm using the default configuration (https://github.com/keonlee9420/Comprehensive-E2E-TTS/tree/main/config/LJSpeech) except that I reduced the batch size to 10 to fit my GPU. Is the reason this only showed up now related to var_start_steps? Any advice would be appreciated.

    opened by roedoejet 2
  • Question about Differentiable Duration Modeling

    Hello, I'm trying to implement the Differentiable Duration Modeling (DDM) module introduced in Differentiable Duration Modeling for End-to-End Text-to-Speech.

    I opened this issue to get advice on implementing DDM.

    My implementation of the Differentiable Alignment Encoder outputs something attention-like from noise input, but training with DDM is too slow (10 s/iter). It seems to hang during the backward pass.

    Can anyone give me advice on improving the speed of the recursive tensor operations? Should I use cuda.jit as Soft-DTW implementations do, or is there something wrong with the approach itself?

    The module's outputs from noise input, along with the code, are shown below.

    Thank you.

    import torch
    import matplotlib.pyplot as plt

    # Build the encoder and a random batch of dummy inputs
    dae = DifferentiableAlignmentEncoder()
    b = 5
    text_max_len = 25
    mel_max_len = 85
    dim = 256
    x_len = torch.randint(1, text_max_len, (b,))
    mel_len = torch.randint(2, mel_max_len, (b,))
    x = torch.randn(b, int(x_len.max()), dim)
    s, l, q, dur = dae(x, x_len, mel_len)

    # Inspect the alignment-related outputs of one sample
    i = 2
    plt.imshow(l[i, :x_len[i], :mel_len[i]].detach().numpy())
    plt.imshow(q[i, :x_len[i], :mel_len[i]].detach().numpy())
    plt.imshow(s[i, :x_len[i], :mel_len[i]].detach().numpy())
    plt.plot(dur[i, :x_len[i]])
    

    (Figures: the L matrix, the Q matrix, S (the soft attention matrix), and the predicted durations.)

    Code

    import torch
    import torch.nn as nn

    # ConvNorm, LinearNorm, and get_mask_from_lengths are assumed to be imported from
    # this repository's layer/utility modules.

    class DifferentiableAlignmentEncoder(nn.Module):
        def __init__(
            self,
            hidden_dim=256,
            conv_kernels=3,
            num_layers=3,
            dropout_p=0.2,
            max_mel_len=1150 # Max Length of Mel-Spectrogram Frame in training data
        ):
            super().__init__()
            
            self.conv_layer_blocks = nn.ModuleList([
                nn.Sequential(
                    ConvNorm(hidden_dim, hidden_dim, conv_kernels, bias=True, transpose=True),
                    nn.ReLU(),
                    nn.LayerNorm(hidden_dim),
                    nn.Dropout(dropout_p)
                )
                for i in range(num_layers)
            ])
            self.dur_prob_proj = LinearNorm(hidden_dim, max_mel_len, bias=False)
            
            self.ddm = DifferentiableDurationModeling()
        
        def forward(self, x, phon_lens, mel_lens, x_masks=None):
            
            """
            x  : Tensor[B, T_phon, C_phone]
            phon_lens : LongTensor[B]
            mel_lens : LongTensor[B]
            s : S Matrix : Tensor[B, T_phon, T_mel]
            dur : Duration Matrix : Tensor[B, T_phon]
            """
            
            max_mel_len = int(torch.max(mel_lens))
            
            for layer in self.conv_layer_blocks:
                if x_masks is not None:
                    x = x * (1 - x_masks.float())
                x = layer(x)
            x = self.dur_prob_proj(x)
                    
            norm = torch.randn(x.shape).to(x.device)
            x = x + norm
            
            p = torch.sigmoid(x)
            p = p[:, :, :max_mel_len]
            
            s, l, q, dur = self.ddm(p, phon_lens, mel_lens)
            
            dur = dur.detach()
            
            return s, l, q, dur
        
        
    class DifferentiableDurationModeling(nn.Module):
        def __init__(self):
            super().__init__()
            
        def _get_attn_mask(self, phon_lens, mel_lens):
            phon_mask = ~get_mask_from_lengths(phon_lens)
            mel_mask = ~get_mask_from_lengths(mel_lens)
            
            return phon_mask.unsqueeze(-1) * mel_mask.unsqueeze(1), phon_mask
        
        def forward(self, p, phon_lens, mel_lens):
            
            attn_mask, phon_mask = self._get_attn_mask(phon_lens, mel_lens)
            
            p = p * attn_mask
            
            l = self._get_l(p, attn_mask)
            
            l = l * attn_mask
    
            dur = self._get_duration(l)
            
            dur = dur * phon_mask
    
            q = self._get_q(l)
            
            q = q * attn_mask
            
            s = self._get_s(q, l)
            
            s = s * attn_mask
                
            return s, l, q, dur
        
        def _get_duration(self, l):
            with torch.no_grad():
                m = torch.arange(1, l.shape[-1] + 1)[None, :].expand_as(l).to(l.device)
                dur = torch.sum(m * l, dim=-1)
            return dur
        
        def _get_l(self, p, mask):
            # Computing l directly is numerically unstable for the gradient computation.
            # The paper's authors resolve this issue by computing the product in log-space.
            _p = torch.log(mask[:, :, 1:].float() - p[:, :, 1:] + 1e-8)
            p = torch.log(p + 1e-8)
            com = torch.cumsum(_p, dim=-1)
            l_0 = com[:, :, -1].unsqueeze(-1)
            l_1 = p[:, :, 1].unsqueeze(-1)
            
            l_m = com[:, :, :-1] + p[:, :, 2:]
                    
            l = torch.cat([l_0, l_1, l_m], dim=-1)
    
            l = torch.exp(l)
            
            return l
        
        def _variable_kernel_size_convolution(self, x, y, length):
            matrix = torch.flip(x.unsqueeze(1) * y.unsqueeze(-1), dims=[-1])
            output =  torch.flip(
                torch.cat(
                    [
                        torch.sum(
                            torch.diagonal(
                                matrix, offset=idx, dim1=-2, dim2=-1
                            ), dim=1
                        ).unsqueeze(1) 
                        for idx in range(length)
                    ],
                    dim=1
                ),
                dims=[1] 
            )
            return output
        
        def _get_q(self, l):
            length = l.shape[-1]
            q = [l[:, 0, :]]
            if l.shape[-1] > 1:
                for i in range(1, l.shape[1]):
                    q.append(self._variable_kernel_size_convolution(q[i-1], l[:, i], length))
                            
            q = torch.cat([_.unsqueeze(1) for _ in q], dim=1)
            
            return q   
    
        def _reverse_cumsum(self, x):
            return torch.flip(torch.cumsum(torch.flip(x, dims=[-1]), dim=-1), dims=[-1])
        
        def _get_s(self, q, l):
            length = l.shape[-1]
            l_rev_cumsum = self._reverse_cumsum(l)
            s = [l_rev_cumsum[:, 0, :]]
            
            if l.shape[-1] > 1:
                for i in range(1, q.shape[1]):
                    s.append(self._variable_kernel_size_convolution(q[:, i-1], l_rev_cumsum[:, i], length))
            
            s = torch.cat([_.unsqueeze(1) for _ in s], dim=1)
                
            return s
    
    opened by LEECHOONGHO 0
  • severe metallic sound

    Hi, thanks for your nice work. I used your code on my own dataset, and the synthesized voices still don't sound normal at 160K steps. Although we can still make out what is being said, the spectrum is abnormal (especially the high-frequency part, as you can see from the following figures), with a severe metallic sound. I have double-checked the feature extraction and training processes, and both are normal. Do you know any possible reason for this? BTW, how many steps are required to train the LJSpeech model?

    Thanks again.

    opened by GuangChen2016 9
Owner
Keon Lee
Everything towards conversational AI