Cross-Speaker-Emotion-Transfer - PyTorch Implementation
PyTorch Implementation of ByteDance's Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech.
In the following document, DATASET refers to the name of a dataset such as
RAVDESS.
You can install the Python dependencies with
pip3 install -r requirements.txt
Also, install fairseq (official document, github) to utilize LConvBlock. Please check here to resolve any issues with installing it. Note that a Dockerfile is provided for Docker users, but you still have to install fairseq manually.
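To quickly confirm that fairseq is importable after installation, you can run something like
python3 -c "import fairseq; print(fairseq.__version__)"
(this assumes fairseq exposes a __version__ attribute, which recent releases do).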
You have to download the pretrained models and put them in
To extract soft emotion tokens from a reference audio, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --ref_audio REF_AUDIO_PATH --restore_step RESTORE_STEP --mode single --dataset DATASET
Or, to use hard emotion tokens from an emotion id, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --emotion_id EMOTION_ID --restore_step RESTORE_STEP --mode single --dataset DATASET
The dictionary of learned speakers can be found at
preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in
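As a convenience, here is a minimal sketch of how you might inspect speakers.json to pick a valid SPEAKER_ID; it assumes the file maps speaker names to integer ids, which you should verify against your own preprocessed data:

```python
import json

# Hypothetical helper: list the speakers learned during preprocessing.
# Assumes speakers.json maps speaker name -> integer id (verify for your dataset).
with open("preprocessed_data/DATASET/speakers.json") as f:
    speakers = json.load(f)

for name, idx in sorted(speakers.items(), key=lambda kv: kv[1]):
    print(idx, name)
```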
Batch inference is also supported; try
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET
to synthesize all utterances in
preprocessed_data/DATASET/val.txt. Please note that only the hard emotion tokens from a given emotion id are supported in this mode.
The supported datasets are
- RAVDESS: This portion of the RAVDESS contains 1440 files: 60 trials per actor x 24 actors = 1440. The RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. A sketch of how RAVDESS file names encode these labels is shown below.
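For reference only (this is not part of the repository's code), RAVDESS encodes these labels in its file names as seven dash-separated fields; a minimal parsing sketch, assuming audio-only speech files such as 03-01-06-01-02-01-12.wav:

```python
# Reference sketch for the public RAVDESS naming convention:
# modality-vocal_channel-emotion-intensity-statement-repetition-actor.wav
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITY = {"01": "normal", "02": "strong"}

def parse_ravdess_name(filename):
    # e.g. "03-01-06-01-02-01-12.wav" -> fearful, normal intensity, actor 12
    fields = filename.replace(".wav", "").split("-")
    modality, channel, emotion, intensity, statement, repetition, actor = fields
    return {
        "emotion": EMOTIONS[emotion],
        "intensity": INTENSITY[intensity],
        "actor": int(actor),
        "gender": "male" if int(actor) % 2 == 1 else "female",
    }

print(parse_ravdess_name("03-01-06-01-02-01-12.wav"))
```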
Your own language and dataset can be adapted by following the instructions here.
For a multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model from philipperemy's DeepSpeaker for the speaker embedding and locate it in
Run
python3 prepare_align.py --dataset DATASET
for some preparations.
For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files in
preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner by yourself.
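If you do run the aligner yourself, an MFA 2.x invocation takes roughly the following shape, where the corpus directory, lexicon, and acoustic model are placeholders you must supply (only the output directory matches this repository's layout):
mfa align CORPUS_DIRECTORY LEXICON_PATH ACOUSTIC_MODEL preprocessed_data/DATASET/TextGrid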
After that, run the preprocessing script by
python3 preprocess.py --dataset DATASET
Train your model with
python3 train.py --dataset DATASET
- To use Automatic Mixed Precision, append the --use_amp argument to the above command.
- The trainer assumes single-node multi-GPU training. To use specific GPUs, specify CUDA_VISIBLE_DEVICES=<GPU_IDs> at the beginning of the above command, as in the combined example below.
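For example, to combine both options (assuming GPUs 0 and 1 are the ones you want to use):
CUDA_VISIBLE_DEVICES=0,1 python3 train.py --dataset DATASET --use_amp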
Run
tensorboard --logdir output/log
to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.
- The current implementation is not trained in a semi-supervised way due to the small dataset size, but semi-supervised training can be easily activated by specifying target speakers and passing no emotion ID (with no emotion classifier loss).
- In the decoder, a 15 x 1 LConv block is used instead of 17 x 1 due to memory issues.
- There are two options for speaker embedding in the multi-speaker TTS setting: training the speaker embedder from scratch or using philipperemy's pre-trained DeepSpeaker model (as STYLER did). You can toggle it by setting the config (between
- DeepSpeaker on the RAVDESS dataset shows clear identification among speakers. The following figure shows the t-SNE plot of the extracted speaker embeddings; a generic plotting sketch follows this list.
- For the vocoder, HiFi-GAN and MelGAN are supported.
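The repository's own plotting code is not shown here; the following is only a generic sketch of how such a t-SNE plot can be produced from a matrix of extracted speaker embeddings, using scikit-learn and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_speaker_tsne(embeddings, speaker_labels, out_path="tsne_speakers.png"):
    # embeddings: (N, D) array of utterance-level speaker embeddings
    # speaker_labels: length-N array of speaker ids used to color the points
    speaker_labels = np.asarray(speaker_labels)
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(embeddings)
    for spk in np.unique(speaker_labels):
        mask = speaker_labels == spk
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=str(spk))
    plt.legend(fontsize=6, ncol=2)
    plt.title("t-SNE of speaker embeddings")
    plt.savefig(out_path, dpi=150)
```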
Please cite this repository using the "Cite this repository" feature in the About section (top right of the main page).