Cross-Speaker-Emotion-Transfer - PyTorch Implementation
PyTorch Implementation of ByteDance's Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech.
In the following document, DATASET refers to the name of a dataset such as
RAVDESS.
You can install the Python dependencies with
pip3 install -r requirements.txt
Also, install fairseq (official document, github) to utilize LConvBlock. Please check here to resolve any issues with installing it. Note that a Dockerfile is provided for Docker users, but you still have to install fairseq manually.
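To quickly confirm that fairseq is importable after installation, you can run something like
python3 -c "import fairseq; print(fairseq.__version__)"
(this assumes fairseq exposes a __version__ attribute, which recent releases do).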
You have to download the pretrained models and put them in
To extract soft emotion tokens from a reference audio, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --ref_audio REF_AUDIO_PATH --restore_step RESTORE_STEP --mode single --dataset DATASET
Or, to use hard emotion tokens from an emotion id, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --emotion_id EMOTION_ID --restore_step RESTORE_STEP --mode single --dataset DATASET
The dictionary of learned speakers can be found at
preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in
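As a convenience, here is a minimal sketch of how you might inspect speakers.json to pick a valid SPEAKER_ID; it assumes the file maps speaker names to integer ids, which you should verify against your own preprocessed data:

```python
import json

# Hypothetical helper: list the speakers learned during preprocessing.
# Assumes speakers.json maps speaker name -> integer id (verify for your dataset).
with open("preprocessed_data/DATASET/speakers.json") as f:
    speakers = json.load(f)

for name, idx in sorted(speakers.items(), key=lambda kv: kv[1]):
    print(idx, name)
```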
Batch inference is also supported; try
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET
to synthesize all utterances in
preprocessed_data/DATASET/val.txt. Please note that only the hard emotion tokens from a given emotion id are supported in this mode.
The supported datasets are
- RAVDESS: This portion of the RAVDESS contains 1440 files: 60 trials per actor x 24 actors = 1440. The RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. A sketch of how RAVDESS file names encode these labels is shown below.
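For reference only (this is not part of the repository's code), RAVDESS encodes these labels in its file names as seven dash-separated fields; a minimal parsing sketch, assuming audio-only speech files such as 03-01-06-01-02-01-12.wav:

```python
# Reference sketch for the public RAVDESS naming convention:
# modality-vocal_channel-emotion-intensity-statement-repetition-actor.wav
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITY = {"01": "normal", "02": "strong"}

def parse_ravdess_name(filename):
    # e.g. "03-01-06-01-02-01-12.wav" -> fearful, normal intensity, actor 12
    fields = filename.replace(".wav", "").split("-")
    modality, channel, emotion, intensity, statement, repetition, actor = fields
    return {
        "emotion": EMOTIONS[emotion],
        "intensity": INTENSITY[intensity],
        "actor": int(actor),
        "gender": "male" if int(actor) % 2 == 1 else "female",
    }

print(parse_ravdess_name("03-01-06-01-02-01-12.wav"))
```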
Your own language and dataset can be adapted by following the instructions here.
For a multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model from philipperemy's DeepSpeaker for the speaker embedding and locate it in
Run
python3 prepare_align.py --dataset DATASET
for some preparations.
For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files in
preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner by yourself.
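If you do run the aligner yourself, an MFA 2.x invocation takes roughly the following shape, where the corpus directory, lexicon, and acoustic model are placeholders you must supply (only the output directory matches this repository's layout):
mfa align CORPUS_DIRECTORY LEXICON_PATH ACOUSTIC_MODEL preprocessed_data/DATASET/TextGrid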
After that, run the preprocessing script by
python3 preprocess.py --dataset DATASET
Train your model with
python3 train.py --dataset DATASET
- To use Automatic Mixed Precision, append the --use_amp argument to the above command.
- The trainer assumes single-node multi-GPU training. To use specific GPUs, specify CUDA_VISIBLE_DEVICES=<GPU_IDs> at the beginning of the above command, as in the combined example below.
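For example, to combine both options (assuming GPUs 0 and 1 are the ones you want to use):
CUDA_VISIBLE_DEVICES=0,1 python3 train.py --dataset DATASET --use_amp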
Run
tensorboard --logdir output/log
to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.
- The current implementation is not trained in a semi-supervised way due to the small dataset size, but semi-supervised training can be easily activated by specifying target speakers and passing no emotion ID (with no emotion classifier loss).
- In the decoder, a 15 x 1 LConv block is used instead of 17 x 1 due to memory issues.
- There are two options for speaker embedding in the multi-speaker TTS setting: training the speaker embedder from scratch or using philipperemy's pre-trained DeepSpeaker model (as STYLER did). You can toggle it by setting the config (between
- DeepSpeaker on the RAVDESS dataset shows clear identification among speakers. The following figure shows the t-SNE plot of the extracted speaker embeddings; a generic plotting sketch follows this list.
- For the vocoder, HiFi-GAN and MelGAN are supported.
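The repository's own plotting code is not shown here; the following is only a generic sketch of how such a t-SNE plot can be produced from a matrix of extracted speaker embeddings, using scikit-learn and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_speaker_tsne(embeddings, speaker_labels, out_path="tsne_speakers.png"):
    # embeddings: (N, D) array of utterance-level speaker embeddings
    # speaker_labels: length-N array of speaker ids used to color the points
    speaker_labels = np.asarray(speaker_labels)
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(embeddings)
    for spk in np.unique(speaker_labels):
        mask = speaker_labels == spk
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=str(spk))
    plt.legend(fontsize=6, ncol=2)
    plt.title("t-SNE of speaker embeddings")
    plt.savefig(out_path, dpi=150)
```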
Please cite this repository using the "Cite this repository" feature in the About section (top right of the main page).