Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech

Neosapience

Last update: Dec 25, 2022

Related tags

Deep Learning text-to-speech speech pytorch tts speech-synthesis speech-edit

Overview

EdiTTS: Score-based Editing for Controllable Text-to-Speech

Audio samples are available on our demo page.

Abstract

We present EdiTTS, an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis. EdiTTS allows for targeted, granular editing of audio, both in terms of content and pitch, without the need for any additional training, task-specific optimization, or architectural modifications to the score-based model backbone. Specifically, we apply coarse yet deliberate perturbations in the Gaussian prior space to induce desired behavior from the diffusion model, while applying masks and softening kernels to ensure that iterative edits are applied only to the target region. Listening tests demonstrate that EdiTTS is capable of reliably generating natural-sounding audio that satisfies user-imposed requirements.

Citation

Please cite this work as follows.

```
@misc{tae&kim2021editts,
      title={EdiTTS: Score-based Editing for Controllable Text-to-Speech},
      author={Jaesung Tae and Hyeongju Kim and Taesu Kim},
      year={2021}
}
```

Setup

Create a Python virtual environment (venv or conda) and install package requirements as specified in requirements.txt.
```
python -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt
```

Build the monotonic alignment module.

```
cd model/monotonic_align
python setup.py build_ext --inplace
```

For more information, refer to the official repository of Grad-TTS.

Checkpoints

The following checkpoints are already included as part of this repository, under checkpts.

Pitch Shifting

Prepare an input file containing one sample per line for speech generation. Mark the segment to be edited by enclosing it in vertical bar separators (|). For instance, a single sample might look like

```
In | the face of impediments confessedly discouraging |
```

We provide a sample input file in resources/filelists/edit_pitch_example.txt.
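
The marker format above can be checked before running inference. The following is a minimal sketch, assuming each line holds exactly two vertical bars; `split_marked_segment` is a hypothetical helper written for illustration and is not necessarily how edit_pitch.py parses its input.

```python
from typing import Tuple


def split_marked_segment(line: str) -> Tuple[str, str, str]:
    """Split 'before | target | after' into stripped (before, target, after).

    Hypothetical helper: assumes exactly two '|' separators per line.
    """
    parts = line.split("|")
    if len(parts) != 3:
        raise ValueError(
            f"expected exactly two '|' separators, found {len(parts) - 1}"
        )
    before, target, after = (part.strip() for part in parts)
    return before, target, after


example = "In | the face of impediments confessedly discouraging |"
before, target, after = split_marked_segment(example)
print(repr(before), repr(target), repr(after))
# prints: 'In' 'the face of impediments confessedly discouraging' ''
```

An empty `after` string, as in this example, simply means the marked segment runs to the end of the sentence.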

To run inference, type

```
CUDA_VISIBLE_DEVICES=0 python edit_pitch.py \
    -f resources/filelists/edit_pitch_example.txt \
    -c checkpts/grad-tts-old.pt -t 1000 \
    -s out/pitch/wavs
```

Adjust CUDA_VISIBLE_DEVICES as appropriate.

Content Replacement

Prepare an input file containing pairs of sentences: the original sentence followed by the edited one. Join the two sentences of each pair with # and, in each sentence, enclose the part to be replaced in vertical bar separators (|). For instance, a single pair might look like

```
Three others subsequently | identified | Oswald from a photograph. #Three others subsequently | recognized | Oswald from a photograph.
```

We provide a sample input file in resources/filelists/edit_content_example.txt.
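
A pair line can be sanity-checked in the same spirit. The sketch below assumes one # per line and exactly two vertical bars in each sentence; `parse_content_pair` and `_split_marked` are hypothetical names used for illustration, not functions taken from edit_content.py.

```python
from typing import Tuple

Segments = Tuple[str, str, str]


def _split_marked(sentence: str) -> Segments:
    """Split 'before | target | after' into stripped parts (hypothetical helper)."""
    parts = sentence.split("|")
    if len(parts) != 3:
        raise ValueError(
            f"expected exactly two '|' separators, found {len(parts) - 1}"
        )
    before, target, after = (part.strip() for part in parts)
    return before, target, after


def parse_content_pair(line: str) -> Tuple[Segments, Segments]:
    """Parse 'source # edited' into two (before, target, after) triples.

    Assumes a single '#' joins the original sentence and the edited sentence.
    """
    sentences = line.split("#")
    if len(sentences) != 2:
        raise ValueError(
            f"expected exactly one '#' separator, found {len(sentences) - 1}"
        )
    source, edited = sentences
    return _split_marked(source), _split_marked(edited)


example = (
    "Three others subsequently | identified | Oswald from a photograph. "
    "#Three others subsequently | recognized | Oswald from a photograph."
)
source, edited = parse_content_pair(example)
print(source[1], "->", edited[1])
# prints: identified -> recognized
```

In this example the text outside the bars is identical in both sentences, so only the marked word differs between the two triples.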

To run inference, type

```
CUDA_VISIBLE_DEVICES=0 python edit_content.py \
    -f resources/filelists/edit_content_example.txt \
    -c checkpts/grad-tts-old.pt -t 1000 \
    -s out/content/wavs
```

License

Released under the modified GNU General Public License.


