🗣 VoiceMe: Personalized voice generation in TTS
Abstract
Novel text-to-speech systems can generate entirely new voices that were not seen during training. However, it remains difficult to efficiently create personalized voices from a high-dimensional speaker space. In this work, we use speaker embeddings from a state-of-the-art speaker verification model (SpeakerNet) trained on thousands of speakers to condition a TTS model. We employ a human sampling paradigm to explore this speaker latent space. We show that users can create voices that fit well to photos of faces, art portraits, and cartoons. We recruit online participants to collectively manipulate the voice of a speaking face. We show that (1) a separate group of human raters confirms that the created voices match the faces, (2) speaker gender apparent from the face is well recovered in the voice, and (3) people consistently move towards the real voice prototype for the given face. Our results demonstrate that this technology can be applied in a wide range of applications, including character voice development in audiobooks and games, personalized speech assistants, and individual voices for people with speech impairment.
Demos
- 📢 Demo website
- 🔇 Unmute to listen to the videos on GitHub:
  - Examples-for-art-works.mp4
  - Example-chain.mp4
Preprocessing
Set up the repository
git clone https://github.com/polvanrijn/VoiceMe.git
cd VoiceMe
main_dir=$PWD
preprocessing_env="$main_dir/preprocessing-env"
conda create --prefix $preprocessing_env python=3.7
conda activate $preprocessing_env
pip install Cython
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all]
pip install requests
Create face styles
We used the same sentence ("Kids are talking by the door", neutral recording) from all 24 speakers of the RAVDESS corpus. You can download all videos by running download_RAVDESS.sh. However, the stills used in the paper are also part of the repository (stills). We can create the AI Gahaku styles by running python ai_gahaku.py and the toonified versions by running python toonify.py (you need to add your API key).
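If you prefer to regenerate the stills from the downloaded videos instead of using the ones shipped in stills, grabbing a single frame with OpenCV is enough. The sketch below is not one of the repository scripts, and the video filename is a placeholder.
# Sketch: extract a still frame from a downloaded RAVDESS video (placeholder path).
import cv2

cap = cv2.VideoCapture("RAVDESS/Actor_01/01-01-01-01-01-01-01.mp4")  # hypothetical filename
ok, frame = cap.read()                  # first frame of the neutral recording
if ok:
    cv2.imwrite("stills/Actor_01.png", frame)
cap.release()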
Obtain the PCA space
The model used in the paper was trained on SpeakerNet embeddings, so we need to extract the embeddings from a dataset. Here we use the Common Voice data. To download it, run:
python preprocess_commonvoice.py --language en
To extract the principal components, run compute_pca.py.
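For orientation, the core of these two steps could look roughly like the sketch below. This is not the repository code: the folder name and the number of retained components are assumptions; the embeddings come from NeMo's pretrained SpeakerNet speaker verification model and the PCA is fitted with scikit-learn.
# Minimal sketch (not preprocess_commonvoice.py / compute_pca.py):
# extract SpeakerNet embeddings with NeMo and fit a PCA with scikit-learn.
import glob
import numpy as np
import torch
from sklearn.decomposition import PCA
from nemo.collections.asr.models import EncDecSpeakerLabelModel

speaker_model = EncDecSpeakerLabelModel.from_pretrained("speakerverification_speakernet")

embeddings = []
for wav_path in glob.glob("commonvoice_wavs/*.wav"):   # hypothetical folder of 16 kHz WAVs
    with torch.no_grad():
        emb = speaker_model.get_embedding(wav_path)    # one speaker embedding per utterance
    embeddings.append(emb.squeeze().cpu().numpy())

X = np.stack(embeddings)
pca = PCA(n_components=10)   # the number of components kept here is an assumption
pca.fit(X)
np.save("pca_components.npy", pca.components_)
np.save("pca_mean.npy", pca.mean_)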
Synthesis
Setup
We'll assume you'll set up a remote instance for synthesis. Clone the repository and set up the virtual environment:
git clone https://github.com/polvanrijn/VoiceMe.git
cd VoiceMe
main_dir=$PWD
synthesis_env="$main_dir/synthesis-env"
conda create --prefix $synthesis_env python=3.7
conda activate $synthesis_env
##############
# Setup Wav2Lip
##############
git clone https://github.com/Rudrabha/Wav2Lip.git
cd Wav2Lip
# Install Requirements
pip install -r requirements.txt
pip install opencv-python-headless==4.1.2.30
wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "face_detection/detection/sfd/s3fd.pth" --no-check-certificate
# Install as package
mv ../setup_wav2lip.py setup.py
pip install -e .
cd ..
##############
# Setup VITS
##############
git clone https://github.com/jaywalnut310/vits
cd vits
# Install Requirements
pip install -r requirements.txt
# Install monotonic_align
mv monotonic_align ../monotonic_align
# Download the VCTK checkpoint
pip install gdown
gdown https://drive.google.com/uc?id=11aHOlhnxzjpdWDpsz1vFDCzbeEfoIxru
# Install as package
mv ../setup_vits.py setup.py
pip install -e .
cd ../monotonic_align
python setup.py build_ext --inplace
cd ..
pip install flask
pip install wget
You'll need to do the last step manually (let me know if you know an automatic way). Download the checkpoint wav2lip_gan.pth from here and put it into Wav2Lip/checkpoints. Make sure you have espeak installed and that it is on your PATH.
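Once everything is installed, you can sanity-check the multi-speaker VITS checkpoint with the upstream inference API. The sketch below is adapted from the example in the VITS repository and is not the VoiceMe server code: the checkpoint filename and speaker ID are assumptions, and it should be run from inside the vits directory so its modules resolve.
# Sanity-check sketch for the VITS VCTK checkpoint (adapted from the upstream
# VITS inference example; checkpoint filename and speaker ID are assumptions).
import torch
from scipy.io.wavfile import write

import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

def get_text(text, hps):
    text_norm = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    return torch.LongTensor(text_norm)

hps = utils.get_hparams_from_file("configs/vctk_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    n_speakers=hps.data.n_speakers,
    **hps.model,
)
net_g.eval()
utils.load_checkpoint("pretrained_vctk.pth", net_g, None)  # rename to whatever gdown saved

stn_tst = get_text("Kids are talking by the door.", hps)
with torch.no_grad():
    x_tst = stn_tst.unsqueeze(0)
    x_tst_lengths = torch.LongTensor([stn_tst.size(0)])
    sid = torch.LongTensor([4])          # any valid VCTK speaker ID
    audio = net_g.infer(
        x_tst, x_tst_lengths, sid=sid,
        noise_scale=0.667, noise_scale_w=0.8, length_scale=1.0,
    )[0][0, 0].cpu().numpy()

write("sanity_check.wav", hps.data.sampling_rate, audio)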
Running
Start the remote service (I used port 31337):
python server.py --port 31337
You can send an example request locally by running the following (don't forget to change the host and port accordingly):
python request_demo.py
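If you want to script requests yourself instead of using request_demo.py, a client could look roughly like this. The endpoint name and payload fields below are assumptions for illustration; check request_demo.py for the actual format server.py expects.
# Hypothetical client sketch -- the route and payload fields are assumptions,
# see request_demo.py for the real request format.
import requests

HOST = "http://localhost:31337"   # change host and port to your remote instance

payload = {
    "text": "Kids are talking by the door",
    "sliders": [0.0] * 10,        # PCA slider values; the dimensionality is an assumption
}
response = requests.post(f"{HOST}/synthesize", json=payload)
response.raise_for_status()
with open("output.mp4", "wb") as f:   # assuming the server returns the talking-face video
    f.write(response.content)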
We also made a small 'playground' so you can see how slider values influence the voice. Start the local Flask app client.py.
Experiment
The GSP (Gibbs Sampling with People) experiment cannot be shared at the moment, as PsyNet is still under development.