PyTorch implementation of the Tacotron speech synthesis model.

Overview

tacotron_pytorch

Inspired by keithito/tacotron. It does not yet generate speech of the same quality as keithito/tacotron, but it seems to be basically working. You can find some generated speech examples trained on the LJ Speech Dataset here.

If you are comfortable working with TensorFlow, I'd recommend trying https://github.com/keithito/tacotron instead. The reason for rewriting it in PyTorch is that, at least for me, it is easier to debug and extend (multi-speaker architecture, etc.).

Requirements

  • PyTorch
  • TensorFlow (only needed to run the training script. This could be made optional, but is required for now.)

Installation

git clone --recursive https://github.com/r9y9/tacotron_pytorch
pip install -e . # or python setup.py develop

If you want to run the training script, then you need to install additional dependencies.

pip install -e ".[train]"

Training

The package relies on keithito/tacotron (added as a git submodule) for text processing, audio preprocessing, and audio reconstruction. Please follow the quick start section at https://github.com/keithito/tacotron and prepare your dataset accordingly.

Once your data is prepared and lives in "~/tacotron/training" (the default location), you can train your model with:

python train.py

The alignment, predicted spectrogram, target spectrogram, predicted waveform, and checkpoint (model and optimizer states) are saved every 1000 global steps in the checkpoints directory. Training progress can be monitored with:

tensorboard --logdir=log

Testing model

Open the notebook in the notebooks directory and change checkpoint_path to point to your trained model.
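
If you'd rather script synthesis than use the notebook, a minimal sketch could look like the following. The class name, the checkpoint key ("state_dict"), and the forward signature are assumptions based on the test notebook; adjust them to match your checkout.

import sys

import numpy as np
import torch

sys.path.insert(0, "lib/tacotron")     # keithito submodule providing the text frontend
from text import text_to_sequence
from tacotron_pytorch import Tacotron

checkpoint_path = "checkpoints/checkpoint_step100000.pth"   # change to your model

model = Tacotron(n_vocab=149)          # hypothetical arguments; match your training config
checkpoint = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()

sequence = np.array(text_to_sequence("Hello, world.", ["english_cleaners"]))
sequence = torch.LongTensor(sequence).unsqueeze(0)

with torch.no_grad():
    mel_outputs, linear_outputs, alignments = model(sequence)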

Comments
  • Update model

    Note to self: when I finish the current experiment, I will update http://nbviewer.jupyter.org/github/r9y9/tacotron_pytorch/blob/master/notebooks/Test%20Tacotron.ipynb.

    wontfix 
    opened by r9y9 37
  • Implementation of Bahdanau attention is possibly different from paper?

    Hi @r9y9. On the forward pass of the attention RNN, I think you're computing the attention state using the previous attention instead of the current attention; see page 3 of https://arxiv.org/pdf/1409.0473.pdf.

    Shouldn't the order of operations be this:

    def forward(self, query, attention, cell_state, memory, processed_memory=None, mask=None, memory_lengths=None):
        
        if processed_memory is None:
            processed_memory = memory
        if memory_lengths is not None and mask is None:
            mask = get_mask_from_lengths(memory, memory_lengths)
    
        # Compute Alignment (batch, max_time)
        # e_{ij} = a(s_{i-1}, h_j)
        alignment = self.attention_mechanism(cell_state, processed_memory)
    
        if mask is not None:
            mask = mask.view(query.size(0), -1)
            alignment.data.masked_fill_(mask, self.score_mask_value)
    
        # Normalize attention weight
        # \alpha_{ij} = softmax(e_{ij})
        alignment = F.softmax(alignment)
    
        # Attention context vector
        # c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
        attention = torch.bmm(alignment.unsqueeze(1), memory)
        attention = attention.squeeze(1)
    
        # Concat y_{i-1} and c_{i}
        cell_input = torch.cat((query, attention), -1)
        cell_input = cell_input.unsqueeze(1)
    
        # Feed it to RNN
        # s_i = f(y_{i-1}, c_{i}, s_{i-1})
        cell_output = self.rnn_cell(cell_input, cell_state)
    
        return cell_output, attention, alignment
    

    instead of what we have in this repo right now:

    def forward(self, query, attention, cell_state, memory, processed_memory=None, mask=None, memory_lengths=None):
    
        if processed_memory is None:
            processed_memory = memory
        if memory_lengths is not None and mask is None:
            mask = get_mask_from_lengths(memory, memory_lengths)
    
        # Concat y_{i-1} and c_{i}
        cell_input = torch.cat((query, attention), -1)
    
        # Feed it to RNN
        # s_i = f(y_{i-1}, c_{i-1}, s_{i-1}) should be f(y_{i-1}, c_{i}, s_{i-1}) according to the paper.
        cell_output = self.rnn_cell(cell_input, cell_state)
    
        # Compute Alignment (batch, max_time)
        # e_{ij} = a(s_{i-1}, h_j)
        alignment = self.attention_mechanism(cell_output, processed_memory)
    
        if mask is not None:
            mask = mask.view(query.size(0), -1)
            alignment.data.masked_fill_(mask, self.score_mask_value)
    
        # Normalize attention weight
        # \alpha_{ij} = softmax(e_{ij})
        alignment = F.softmax(alignment)
    
        # Attention context vector
        # c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
        attention = torch.bmm(alignment.unsqueeze(1), memory)
    
        # (batch, dim)
        attention = attention.squeeze(1)
    
        return cell_output, attention, alignment
    
    wontfix 
    opened by rafaelvalle 9
  • ModuleNotFoundError: No module named 'text'

    Hi

    Thanks for this repo. I am very much a newbie. I have cloned the repo and run setup.py.

    However, I get the following error:

    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input> in <module>()
          5 import sys
          6 sys.path.insert(0, "../lib/tacotron")
    ----> 7 from text import text_to_sequence, symbols
          8 from util import audio

    ModuleNotFoundError: No module named 'text'

    Could you please advise me what to do next?

    Thanks

    I am trying to use the model checkpoint from keithito/tacotron.
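
    For context, this error usually means the keithito/tacotron submodule was not checked out or is not on sys.path. A minimal sketch of the usual fix, assuming you run from the repository root after git submodule update --init --recursive:

    import sys
    sys.path.insert(0, "lib/tacotron")   # path to the keithito submodule, relative to the repo root
    from text import text_to_sequence, symbols
    from util import audio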

    wontfix 
    opened by dza6549 5
  • retraining from the checkpoint on GPU fails

    Hey,

    Thanks for the implementation. I noticed that retraining from a checkpoint fails during optimizer.step().

    This seems to be because the optimizer is initialized from the model parameters and .cuda() is called on the model afterwards. Calling .cuda() on the model before defining the optimizer (only needed when retraining) seems to help.
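
    A sketch of that workaround (build_model and the checkpoint keys below are illustrative, not the exact train.py code):

    import torch
    from torch import optim

    model = build_model()     # hypothetical helper that constructs the Tacotron model
    model = model.cuda()      # move to GPU *before* creating the optimizer

    optimizer = optim.Adam(model.parameters(), lr=0.002)

    checkpoint = torch.load("checkpoints/checkpoint_step100000.pth")   # your checkpoint
    model.load_state_dict(checkpoint["state_dict"])        # assumed key names
    optimizer.load_state_dict(checkpoint["optimizer"])

    # Alternatively, move the restored optimizer state onto the GPU explicitly:
    for state in optimizer.state.values():
        for k, v in state.items():
            if torch.is_tensor(v):
                state[k] = v.cuda()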

    wontfix 
    opened by saikrishnarallabandi 4
  • How can is_end_of_frames detect the end frame in the test phase?

    Thanks for your code. I have a question about tacotron_pytorch/tacotron.py line 274: why does output.data <= 0.2 mark the end frame in the test phase? If I use this function, I can only decode 2 steps at test time.

    def is_end_of_frames(output, eps=0.2):
        return (output.data <= eps).all()

    wontfix 
    opened by wangjunchao1118 4
  • About BatchNorm

    https://github.com/r9y9/tacotron_pytorch/blob/5f41d9d8f70a299a49f02aa5422478e1693ebe93/tacotron_pytorch/tacotron.py#L36

    In your implementation, the code says "following tensorflow's default parameters". However, the momentum in PyTorch is the opposite of TensorFlow's.

    # PyTorch
    x(n+1) = (1 - momentum) * avg(x(1..n)) + momentum * x(n)    # so the default momentum = 0.1
    # TensorFlow
    x(n+1) = momentum * avg(x(1..n)) + (1 - momentum) * x(n)    # so the default momentum = 0.99
    

    What's your take on that? Also, why is such a large eps (1e-3) used for audio? The default eps in PyTorch is 1e-5.
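
    For reference, if the intent is to mirror tf.layers.batch_normalization defaults (momentum/decay 0.99, eps 1e-3), the equivalent PyTorch construction flips the momentum; a sketch, not necessarily what the repo intends:

    import torch.nn as nn

    # TensorFlow: running = 0.99 * running + 0.01 * batch_stat   (momentum/decay = 0.99)
    # PyTorch:    running = (1 - m) * running + m * batch_stat   (so m = 1 - 0.99 = 0.01)
    bn = nn.BatchNorm1d(num_features=128, momentum=0.01, eps=1e-3)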

    opened by mazzzystar 3
  • Masked loss function

    Hello,

    Shouldn't you apply the L1 loss only to the real frames and not the padding? I.e., you correctly implement the GRU in the CBHG with pack_padded_sequence and the masked attention, but in the end I think you calculate the L1 loss over the whole generated utterance.

    Please tell me if I am missing something, because I am in the middle of debugging the same kind of problem!
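
    A sketch of what a length-masked L1 loss could look like (illustrative only, not the repo's actual loss function):

    import torch
    import torch.nn.functional as F

    def masked_l1_loss(outputs, targets, lengths):
        """L1 loss averaged over real frames only, ignoring padded frames.

        outputs, targets: (batch, T, dim); lengths: iterable of valid frame counts.
        """
        mask = torch.zeros_like(targets)
        for i, length in enumerate(lengths):
            mask[i, :length, :] = 1.0
        loss = F.l1_loss(outputs * mask, targets * mask, reduction="sum")
        return loss / mask.sum()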

    wontfix 
    opened by njellinas 2
  • Why does turning off dropout in the decoder's prenet cause a serious performance regression?

    Thanks very much for your code. While working with it, I also found that disabling dropout at eval time causes a serious regression. I cannot figure out where the problem is. Could it be caused by the combination of BatchNorm and dropout?

    wontfix 
    opened by zwlanpishu 2
  • Checkpoint file and dropout on eval.

    Would it be possible for you to share your checkpoint that produced the samples at http://nbviewer.jupyter.org/github/r9y9/tacotron_pytorch/blob/master/notebooks/Test%20Tacotron.ipynb

    I am not able to reproduce the same quality with LJSpeech at 720k steps. Also, any ideas on why dropout at eval time makes so much difference in speech quality?

    Thanks!

    opened by pbaljeka 2
  • BahdanauMonoAttention cannot work well

    I followed the monotonic attention described here: https://arxiv.org/pdf/1704.00784.pdf.

    In TensorFlow, it works well (source code here: https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py).

    But in PyTorch, it does not work. Here is my source code. Could you take a look, please?

    import math

    import numpy as np
    import torch
    import torch.nn.functional as F
    from torch import nn
    from torch.autograd import Variable
    from torch.nn import Parameter

    # Prenet and is_end_of_frames are assumed to come from tacotron_pytorch.tacotron.

    def safe_cumprod(x, exclusive=False, max_value=1):
        """
        exclusive=True: cumprod(x) = [1, x1, x1*x2, x1*x2*x3, ...]
        exclusive=False: cumprod(x) = [x1, x1*x2, x1*x2*x3, ...]
        Args:
            x (torch.Tensor): shape of [batch, input_dim]
            exclusive ():
            max_value (): clip max value
    
        Returns:
    
        """
        tiny = float(np.finfo(np.float32).tiny)
        clip_x = torch.clamp(x, tiny, max_value)
        cumprod_x = torch.exp(torch.cumsum(torch.log(clip_x), dim=1))
        if exclusive is True:
            return F.pad(cumprod_x, (1, 0, 0, 0), value=1)[:, :-1]
        else:
            return cumprod_x
    
    
    class BahdanauAttention(nn.Module):
        def __init__(self, dim):
            super(BahdanauAttention, self).__init__()
            self.query_layer = nn.Linear(dim, dim, bias=False)
            self.tanh = nn.Tanh()
            self.v = Parameter(torch.Tensor(1, dim))
            self.reset_parameters()
    
        def reset_parameters(self):
            fan_in, fan_out = self.v.size()
            scale = 1 / max(1., (fan_in + fan_out) / 2.)
            limit = math.sqrt(3.0 * scale)
            self.v.data.uniform_(-limit, limit)
    
        def _alignment_probability(self, score, previous_alignment=None):
            return F.softmax(score, dim=1)
    
        def forward(self, query, processed_memory):
            """
            Args:
                query: (batch, 1, dim) or (batch, dim)
                processed_memory: (batch, max_time, dim)
            """
            if query.dim() == 2:
                # insert time-axis for broadcasting
                query = query.unsqueeze(1)
            # (batch, 1, dim)
            processed_query = self.query_layer(query)
    
            # (batch, max_time, 1)
            alignment = F.linear(self.tanh(processed_query + processed_memory), self.v)
    
            # (batch, max_time)
            return alignment.squeeze(-1)
    
    
    class BahdanauMonoAttention(BahdanauAttention):
        """BahdanauMonoAttention
        """
        def __init__(self, dim):
            super(BahdanauMonoAttention, self).__init__(dim)
            self.score_bias = Parameter(torch.Tensor(1))
            self.reset_parameters()
    
        def reset_parameters(self):
            # Note: the parent __init__ calls reset_parameters() before score_bias
            # exists, so guard against that and keep the parent's init of self.v.
            super(BahdanauMonoAttention, self).reset_parameters()
            if hasattr(self, "score_bias"):
                self.score_bias.data.zero_()
    
        def forward(self, query, processed_memory):
            return super(BahdanauMonoAttention, self).forward(query, processed_memory) + self.score_bias
    
        def _alignment_probability(self, score, previous_alignment=None):
            """
            _mono_score, https://arxiv.org/pdf/1704.00784.pdf
            Args:
                score (): shape of [batch, encoder_length]
                previous_alignment (): shape of [batch, encoder_length]
    
            Returns:
    
            """
            # score += Variable(torch.FloatTensor(np.random.randn(*score.shape) * 2).cuda())
            p_choose_i = F.sigmoid(score)
            cumprod_1mp_choose_i = safe_cumprod(1 - p_choose_i, exclusive=True, max_value=1)
            attention = p_choose_i * cumprod_1mp_choose_i * torch.cumsum(
                previous_alignment / torch.clamp(cumprod_1mp_choose_i, 1e-10, 1.), dim=1)
            return attention
    
    
    
    def get_mask_from_lengths(memory, memory_lengths):
        """Get mask tensor from list of length
    
        Args:
            memory: (batch, max_time, dim)
            memory_lengths: array like
        """
        mask = memory.data.new(memory.size(0), memory.size(1)).byte().zero_()
        for idx, l in enumerate(memory_lengths):
            mask[idx][:l] = 1
        return ~mask
    
    
    class AttentionWrapper(nn.Module):
        def __init__(self, rnn_cell, attention_mechanism,
                     score_mask_value=-float("inf")):
            super(AttentionWrapper, self).__init__()
            self.rnn_cell = rnn_cell
            self.attention_mechanism = attention_mechanism
            self.score_mask_value = score_mask_value
    
        def forward(self, query, attention, cell_state, memory, previous_alignment=None,
                    processed_memory=None, mask=None, memory_lengths=None):
            if processed_memory is None:
                processed_memory = memory
            if memory_lengths is not None and mask is None:
                mask = get_mask_from_lengths(memory, memory_lengths)
    
            # Concat input query and previous attention context
            cell_input = torch.cat((query, attention), -1)
    
            # Feed it to RNN
            cell_output = self.rnn_cell(cell_input, cell_state)
    
            # Alignment
            # (batch, max_time)
            alignment = self.attention_mechanism(cell_output, processed_memory)
    
            if mask is not None:
                mask = mask.view(query.size(0), -1)
                alignment.data.masked_fill_(mask, self.score_mask_value)
    
            # Normalize attention weight
            # alignment = F.softmax(alignment, dim=-1)
            alignment = self.attention_mechanism._alignment_probability(alignment, previous_alignment)
    
            # Attention context vector
            # (batch, 1, dim)
            attention = torch.bmm(alignment.unsqueeze(1), memory)
    
            # (batch, dim)
            attention = attention.squeeze(1)
    
            return cell_output, attention, alignment
    
    class Decoder(nn.Module):
        def __init__(self, in_dim, r, use_mono=True):
            super(Decoder, self).__init__()
            self.in_dim = in_dim
            self.r = r
            self.prenet = Prenet(in_dim, sizes=[256, 128])
            # (prenet_out + attention context) -> output
            if use_mono is True:
                attention_mechanism = BahdanauMonoAttention(256)
            else:
                attention_mechanism = BahdanauAttention(256)
            self.attention_rnn = AttentionWrapper(
                nn.GRUCell(256 + 128, 256),
                attention_mechanism
            )
            self.memory_layer = nn.Linear(256, 256, bias=False)
            self.project_to_decoder_in = nn.Linear(512, 256)
    
            self.decoder_rnns = nn.ModuleList(
                [nn.GRUCell(256, 256) for _ in range(2)])
    
            self.proj_to_mel = nn.Linear(256, in_dim * r)
            self.max_decoder_steps = 200
    
        def forward(self, encoder_outputs, inputs=None, memory_lengths=None):
            """
            Decoder forward step.
    
            If decoder inputs are not given (e.g., at testing time), as noted in
            Tacotron paper, greedy decoding is adapted.
    
            Args:
                encoder_outputs: Encoder outputs. (B, T_encoder, dim)
                inputs: Decoder inputs. i.e., mel-spectrogram. If None (at eval-time),
                  decoder outputs are used as decoder inputs.
                memory_lengths: Encoder output (memory) lengths. If not None, used for
                  attention masking.
            """
            B = encoder_outputs.size(0)
            T_encoder = encoder_outputs.size(1)
    
            processed_memory = self.memory_layer(encoder_outputs)
            if memory_lengths is not None:
                mask = get_mask_from_lengths(processed_memory, memory_lengths)
            else:
                mask = None
    
            # Run greedy decoding if inputs is None
            greedy = inputs is None
    
            if inputs is not None:
                # Grouping multiple frames if necessary
                if inputs.size(-1) == self.in_dim:
                    inputs = inputs.view(B, inputs.size(1) // self.r, -1)
                assert inputs.size(-1) == self.in_dim * self.r
                T_decoder = inputs.size(1)
    
            # go frames
            initial_input = Variable(
                encoder_outputs.data.new(B, self.in_dim).zero_())
    
            # Init decoder states
            attention_rnn_hidden = Variable(
                encoder_outputs.data.new(B, 256).zero_())
            decoder_rnn_hiddens = [Variable(
                encoder_outputs.data.new(B, 256).zero_())
                for _ in range(len(self.decoder_rnns))]
            current_attention = Variable(
                encoder_outputs.data.new(B, 256).zero_())
    
            # Time first (T_decoder, B, in_dim)
            if inputs is not None:
                inputs = inputs.transpose(0, 1)
    
            outputs = []
            alignments = []
    
            t = 0
            current_input = initial_input
            previous_alignment = Variable(
                encoder_outputs.data.new(B, T_encoder).zero_())
            previous_alignment[:, 0] = 1.0
            while True:
                if t > 0:
                    current_input = outputs[-1] if greedy else inputs[t - 1]
                    current_input = current_input[:, -self.in_dim:]
                # Prenet
                current_input = self.prenet(current_input)
    
                # Attention RNN
                attention_rnn_hidden, current_attention, alignment = self.attention_rnn(
                    current_input, current_attention, attention_rnn_hidden,
                    encoder_outputs, previous_alignment=previous_alignment,
                    processed_memory=processed_memory, mask=mask)
                previous_alignment = alignment
    
                # Concat RNN output and attention context vector
                decoder_input = self.project_to_decoder_in(
                    torch.cat((attention_rnn_hidden, current_attention), -1))
    
                # Pass through the decoder RNNs
                for idx in range(len(self.decoder_rnns)):
                    decoder_rnn_hiddens[idx] = self.decoder_rnns[idx](
                        decoder_input, decoder_rnn_hiddens[idx])
                    # Residual connectinon
                    decoder_input = decoder_rnn_hiddens[idx] + decoder_input
    
                output = decoder_input
                output = self.proj_to_mel(output)
    
                outputs += [output]
                alignments += [alignment]
    
                t += 1
    
                if greedy:
                    if t > 1 and is_end_of_frames(output):
                        break
                    elif t > self.max_decoder_steps:
                        print("Warning! doesn't seems to be converged")
                        break
                else:
                    if t >= T_decoder:
                        break
    
            assert greedy or len(outputs) == T_decoder
    
            # Back to batch first
            alignments = torch.stack(alignments).transpose(0, 1)
            outputs = torch.stack(outputs).transpose(0, 1).contiguous()
    
            return outputs, alignments
    

    @r9y9

    wontfix discussion 
    opened by zhbbupt 2
  • How are linear targets being passed to the model?

    This might be a really dumb question, but I am not sure I understand why you are not passing the linear targets to the model here: https://github.com/r9y9/tacotron_pytorch/blob/master/train.py#L240. Are you using a pre-trained post-net in your implementation?

    opened by 7404N 2
  • about validation

    First, thanks for your great work! But I have some questions: why doesn't the training code have a validation part? Can I just add one the way the testing code does? In the testing code the input batch size is 1, and I don't know whether I can run inference on a whole batch. Also, when I train the model on the CMU ARCTIC clb dataset the alignment looks good on the training data, but when I use the checkpoint for inference it is quite bad. Is the model overfitting on such a small dataset (about one hour)?
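
    For what it's worth, a minimal teacher-forced validation pass could be sketched like this (the loader fields and the forward signature are assumptions; match them to your training code):

    import torch
    import torch.nn.functional as F

    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for x, input_lengths, mel, linear in val_loader:   # assumed batch layout
            mel_outputs, linear_outputs, _ = model(x, mel, input_lengths=input_lengths)
            total += F.l1_loss(mel_outputs, mel).item()
            count += 1
    print("validation mel L1:", total / max(count, 1))
    model.train()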

    opened by ArtemisZGL 0