EdiTTS: Score-based Editing for Controllable Text-to-Speech

Neosapience

Last update: Jan 2, 2023


Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech. Audio samples are available on our demo page.

Abstract

We present EdiTTS, an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis. EdiTTS allows for targeted, granular editing of audio, both in terms of content and pitch, without the need for any additional training, task-specific optimization, or architectural modifications to the score-based model backbone. Specifically, we apply coarse yet deliberate perturbations in the Gaussian prior space to induce desired behavior from the diffusion model, while applying masks and softening kernels to ensure that iterative edits are applied only to the target region. Listening tests demonstrate that EdiTTS is capable of reliably generating natural-sounding audio that satisfies user-imposed requirements.
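To make the masking idea concrete, here is a minimal sketch; it is not the repository's code. It assumes a per-frame binary mask marking the edit region, softens the mask edges with a moving-average kernel (one possible choice of softening kernel, assumed here for illustration), and blends an edited signal with the original so that frames away from the target region stay unchanged. The names `soften_mask` and `blend`, the kernel width, and the toy data are all hypothetical.

```python
def soften_mask(mask, width=3):
    """Smooth a 0/1 mask with a moving-average kernel of odd `width`."""
    if width % 2 == 0:
        raise ValueError("width must be odd")
    half = width // 2
    n = len(mask)
    soft = []
    for i in range(n):
        # Average over the window, clamping indices at the sequence boundaries.
        window = [mask[min(max(j, 0), n - 1)] for j in range(i - half, i + half + 1)]
        soft.append(sum(window) / width)
    return soft


def blend(original, edited, soft_mask):
    """Per-frame convex combination: mask 1 keeps the edit, mask 0 keeps the original."""
    return [m * e + (1.0 - m) * o for o, e, m in zip(original, edited, soft_mask)]


# Toy example: frames 3..5 are the (hypothetical) edit region.
mask = [0, 0, 0, 1, 1, 1, 0, 0, 0]
soft = soften_mask(mask, width=3)
original = [0.0] * 9
edited = [1.0] * 9
mixed = blend(original, edited, soft)
```

In this toy case the blended values ramp up and down around the edit region instead of switching abruptly, which is the effect a softened mask is meant to have at segment boundaries.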

Citation

Please cite this work as follows.

@misc{tae&kim2021editts,
      title={EdiTTS: Score-based Editing for Controllable Text-to-Speech}, 
      author={Jaesung Tae and Hyeongju Kim and Taesu Kim},
      year={2021}
}

Setup

Create a Python virtual environment (venv or conda) and install package requirements as specified in requirements.txt.
```
python -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt
```

Build the monotonic alignment module.

```
cd model/monotonic_align
python setup.py build_ext --inplace
```

For more information, refer to the official repository of Grad-TTS.

Checkpoints

Pretrained checkpoints are already included in this repository under checkpts. The inference commands below use checkpts/grad-tts-old.pt.

Pitch Shifting

Prepare an input file containing samples for speech generation. Mark the segment to be edited via a vertical bar separator, |. For instance, a single sample might look like

```
In | the face of impediments confessedly discouraging |
```

We provide a sample input file in resources/filelists/edit_pitch_example.txt.
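As an illustration of the input format, the sketch below splits a marked line into the text before, inside, and after the edited segment. It is an assumption about how such a line could be parsed, not the parser used by edit_pitch.py; the function name `split_marked_segment` is hypothetical, and the sketch assumes exactly two vertical bars per line.

```python
def split_marked_segment(line):
    """Split 'before | target | after' into three whitespace-stripped parts."""
    parts = line.split("|")
    if len(parts) != 3:
        raise ValueError(
            f"expected exactly two '|' separators, got {len(parts) - 1}"
        )
    before, target, after = (p.strip() for p in parts)
    return before, target, after


sample = "In | the face of impediments confessedly discouraging |"
before, target, after = split_marked_segment(sample)
print(repr(before), repr(target), repr(after))
```

For the sample above, the segment to be edited is everything between the two bars, and the trailing part is empty because the second bar closes the line.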

To run inference, type

```
CUDA_VISIBLE_DEVICES=0 python edit_pitch.py \
    -f resources/filelists/edit_pitch_example.txt \
    -c checkpts/grad-tts-old.pt -t 1000 \
    -s out/pitch/wavs
```

Adjust CUDA_VISIBLE_DEVICES as appropriate.
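If you prefer to launch inference from Python, the following sketch assembles the same command and environment. It only reuses the script name, flags, and paths shown above; the helper names `build_inference_command` and `build_env` are hypothetical and not part of this repository.

```python
import os
import shlex


def build_inference_command(script, filelist, checkpoint, t, out_dir):
    """Assemble the argument list using the flags from the command above."""
    return ["python", script, "-f", filelist, "-c", checkpoint,
            "-t", str(t), "-s", out_dir]


def build_env(gpu_ids, base_env=None):
    """Copy the environment and expose only the given GPU indices."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


cmd = build_inference_command(
    "edit_pitch.py",
    "resources/filelists/edit_pitch_example.txt",
    "checkpts/grad-tts-old.pt",
    1000,
    "out/pitch/wavs",
)
env = build_env([1])  # e.g. use the second GPU instead of the first
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, env=env, check=True)
```

CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices, so `build_env([0, 2])` would expose two GPUs to the process.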

Content Replacement

Prepare an input file containing pairs of sentences. Concatenate each pair with # and mark the parts to be replaced with a vertical bar separator. For instance, a single pair might look like

```
Three others subsequently | identified | Oswald from a photograph. #Three others subsequently | recognized | Oswald from a photograph.
```

We provide a sample input file in resources/filelists/edit_content_example.txt.
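The pair format can be illustrated with a short parsing sketch. This is an assumption about how such a line could be read, not the logic in edit_content.py; the names `split_marked` and `parse_pair` are hypothetical, and the sketch assumes one `#` per line and exactly two vertical bars per sentence.

```python
def split_marked(sentence):
    """Split 'before | target | after' into three whitespace-stripped parts."""
    parts = sentence.split("|")
    if len(parts) != 3:
        raise ValueError("expected exactly two '|' separators per sentence")
    return tuple(p.strip() for p in parts)


def parse_pair(line):
    """Split a 'source #target' line and parse the marked segment in each half."""
    if line.count("#") != 1:
        raise ValueError("expected exactly one '#' between the two sentences")
    source, target = line.split("#")
    return split_marked(source), split_marked(target)


line = ("Three others subsequently | identified | Oswald from a photograph. "
        "#Three others subsequently | recognized | Oswald from a photograph.")
src, tgt = parse_pair(line)
print(src[1], "->", tgt[1])
```

In the sample pair only the marked word differs, so `src[0] == tgt[0]` and `src[2] == tgt[2]` hold; whether the actual script requires identical context is not stated here.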

To run inference, type

```
CUDA_VISIBLE_DEVICES=0 python edit_content.py \
    -f resources/filelists/edit_content_example.txt \
    -c checkpts/grad-tts-old.pt -t 1000 \
    -s out/content/wavs
```

License

Released under the modified GNU General Public License.

Comments

How to use voice files instead of pure TTS?

In the paper you describe a test on the LJ Speech dataset (Section 4.3, Content replacement). Can you provide code for loading voice files instead of pure sample generation in tts.py?

opened by Vadim2S
