Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

Overview

This repo contains the official implementation of the VAE-GAN from the INTERSPEECH 2020 paper Voice Conversion Using Speech-to-Speech Neuro-Style Transfer.

Examples of generated audio using the Flickr8k Audio Corpus: https://ebadawy.github.io/post/speech_style_transfer. Note that these examples are the result of feeding audio reconstructions from this VAE-GAN into an implementation of WaveNet.

1. Data Preparation

Dataset file structure:

/path/to/database
├── spkr_1
│   ├── sample.wav
├── spkr_2
│   ├── sample.wav
│   ...
└── spkr_N
    ├── sample.wav
    ...
# Directories under each speaker must not be nested.
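
As a sanity check before preprocessing, the layout can be verified with a short script. Below is a minimal sketch (not part of the repo; the path is a placeholder) that flags nested directories and non-wav files:

```python
import os

def check_dataset(root):
    """Flag nested directories and non-wav files under each speaker directory."""
    for spkr in sorted(os.listdir(root)):
        spkr_dir = os.path.join(root, spkr)
        if not os.path.isdir(spkr_dir):
            continue  # ignore stray top-level files
        for name in os.listdir(spkr_dir):
            path = os.path.join(spkr_dir, name)
            if os.path.isdir(path):
                print(f"nested directory (not allowed): {path}")
            elif not name.endswith(".wav"):
                print(f"unexpected non-wav file: {path}")

check_dataset("/path/to/database")
```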

Here is an example script for setting up data preparation from the Flickr8k Audio Corpus. The speakers of interest are the same as in the paper, but they may be changed to other speakers if desired.

2. Data Preprocessing

The prepared dataset is organised into a train/eval/test split, the audio is preprocessed, and melspectrograms are computed.

python preprocess.py --dataset [path/to/dataset] --test-size [float] --eval-size [float]
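
For reference, the melspectrogram step can be sketched with librosa. The parameters below (22050 Hz sample rate, 128 mel bands) are illustrative assumptions, not necessarily the exact settings used in preprocess.py:

```python
import librosa
import numpy as np

def wav_to_melspec(path, sr=22050, n_mels=128):
    """Load a wav file and compute a log-scaled melspectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, n_frames)

mel = wav_to_melspec("sample.wav")
print(mel.shape)
```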

3. Training

The VAE-GAN model uses the melspectrograms to learn style transfer between two speakers.

python train.py --model_name [name of the model] --dataset [path/to/dataset]

3.1. Visualization

By default, the code plots a batch of input and output melspectrograms every epoch. You may add --plot-interval -1 to the above command to disable this. Alternatively, you may add --plot-interval 20 to plot every 20 epochs.

3.2. Saving Models

By default, models are saved every epoch. With datasets smaller than Flickr8k it may be more appropriate to save less frequently, e.g. by adding --checkpoint_interval 20 to save every 20 epochs.

3.3. Epochs

The max number of epochs may be set with --n_epochs. For smaller datasets, you may want to increase this to more than the default 100. To load a pretrained model you can use --epoch and set it to the epoch number of the saved model.

3.4. Pretrained Model

Pretrained model files are available here. After downloading and storing them in the directory src/saved_models/pretrained, you may use them for training or inference with:

--model_name pretrained --epoch 99

Note that the discriminator files D1 and D2 are not required for inference (they are required for further training). Here, G1 refers to the decoding generator for speaker 1 (female) and G2 to the one for speaker 2 (male).

4. Inference

The trained VAE-GAN is used for inference on a specified audio file. It works by sliding a window over the full melspectrogram, locally inferring melspectrogram subsamples, and averaging the overlaps. The script then uses Griffin-Lim to reconstruct audio from the generated melspectrogram. A sketch of this idea follows the command below.

python inference.py --model_name [name of the model] --epoch [epoch number] --trg_id [id of target generator] --wav [path/to/source_audio.wav]
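
A minimal sketch of the sliding-window averaging described above; the window width, hop size, and generator call here are illustrative assumptions rather than the exact parameters of inference.py:

```python
import numpy as np

def sliding_infer(mel, generator, win=128, hop=64):
    """Slide a fixed-width window over the melspectrogram, infer each
    window with the generator, and average the overlapping regions."""
    n_mels, n_frames = mel.shape
    out = np.zeros_like(mel)
    counts = np.zeros(n_frames)
    for start in range(0, max(n_frames - win, 0) + 1, hop):
        chunk = mel[:, start:start + win]
        out[:, start:start + win] += generator(chunk)  # hypothetical model call
        counts[start:start + win] += 1
    # average the overlaps; the max() guards frames no window covered
    return out / np.maximum(counts, 1)
```

If the averaged output is a power-scale melspectrogram, the Griffin-Lim reconstruction could then be done with, e.g., librosa.feature.inverse.mel_to_audio(out, sr=22050).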

To achieve high-quality results like those in the paper, you can feed the reconstructed audio to a trained vocoder such as WaveNet. An example pipeline that uses this model with WaveNet can be found here.

4.1. Directory Input

Instead of a single .wav as input you may specify a whole directory of .wav files by using --wavdir instead of --wav.

4.2. Visualization

By default, plotting of input and output melspectrograms is enabled. This is useful for a visual comparison between trained models. To disable it, set --plot -1.

4.3. Reconstructive Evaluation

Alongside generation, reconstruction and cyclic reconstruction may be enabled by specifying the generator id of the source audio with --src_id [id of source generator].

When set, SSIM metrics for reconstructed melspectrograms and cyclically reconstructed melspectrograms are computed and printed at the end of inference.

This is an extra feature to help compare the reconstructive capabilities of different models. The higher the SSIM, the higher the quality of the reconstruction.
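
For reference, SSIM between two melspectrograms can be computed with scikit-image. A minimal sketch; the shared min-max normalisation to [0, 1] is an assumption of this example, not necessarily what inference.py does:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mel_ssim(mel_a, mel_b):
    """SSIM between two equally shaped melspectrograms, rescaled to [0, 1]."""
    mel_a, mel_b = np.asarray(mel_a), np.asarray(mel_b)
    lo = min(mel_a.min(), mel_b.min())
    hi = max(mel_a.max(), mel_b.max())
    a = (mel_a - lo) / (hi - lo)
    b = (mel_b - lo) / (hi - lo)
    return ssim(a, b, data_range=1.0)
```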

Citation

If you find this code useful, please cite us in your work:

@inproceedings{AlBadawy2020,
  author={Ehab A. AlBadawy and Siwei Lyu},
  title={{Voice Conversion Using Speech-to-Speech Neuro-Style Transfer}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4726--4730},
  doi={10.21437/Interspeech.2020-3056},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3056}
}

TODO:

  • Rewrite preprocess.py to handle:
    • multi-process feature extraction
    • display error messages for failed cases
  • Create:
    • Notebook for data visualisation
  • Want to add something else? Please feel free to submit a PR with your changes or open an issue for that.
Comments
  • I get an error when I start to train

    Hi, thank you for your great work. I am sorry, but I get a problem when I try to train. When I run this command: python train.py --model_name [name of the model] --dataset [path/to/dataset] I get this error: NameError: name 'transfer_combos' is not defined

    opened by kikirizki 6
  • Evaluate and Transfer Plotting

    Overview

    • Added evaluate.py, whose purpose is to run on a test set without weight updates and to save out.pickle of the conversions when done
    • Added plotting functionality to both evaluate.py and train.py
    • Updated the Encoder model definition, removing the final_block (shared_block) param
    • Updated preprocessing to use defaultdict(list), for less verbosity and for consistency with the feats dictionary in evaluate.py

    Plotting Notes

    • In train.py, it plots the first batch every epoch or every few epochs (specified by the plot_interval arg)
    • In evaluate.py, the plot_interval arg controls how often it plots a batch (with respect to every few batches, since there are no epochs in evaluate)
    • The plotting code in utils.py looks somewhat repetitive, but it seemed more readable this way as opposed to a single method with lots of conditions. Also, the eval and train plot functions need to be separate, since train is also concerned with plotting the target used in updating weights, while eval isn't.

    Train and eval plot screenshots are attached to the PR.

    opened by RussellSB 6
  • Indentation in src/train.py

    Hi -- I get the following when I run training:

    File "src/train.py", line 250
        for i, batch in progress:
                                ^
    TabError: inconsistent use of tabs and spaces in indentation
    

    Do you not get this? I have manually standardized the use of tabs/spaces in that file locally so that it runs. Maybe your environment deals with indentation differently. Regardless, I was wondering if, to the extent you agree with what I'm seeing, you'd be open to a pull request on this file so it works "out of the box" (e.g. on Colab).

    Thanks for any input

    opened by rohitgupta3 3
  • Generating audio after model training

    How to generate or convert audio after model training is done is not mentioned in the README. Can you please add that to the README as well?

    Great work on this! I saw the results and they look promising.

    opened by Himanshu-KF 2
  • The output vocals have serious electronic sounds

    Hello, I used the pretrained model you provided to run inference on the audio provided on your github.io webpage. The result has a very heavy electronic sound (the output is pretty bad compared with the demo on your github.io webpage). Can you provide the pretrained model, method, and parameters that correspond to the github.io webpage demo? I would be very grateful.

    opened by FashengChen0622 1
  • SSIM evaluation in Inference

    • Needed to specify the src generator for reconstruction
    • Refactored G to G_trg for reference consistency
    • Computed SSIM for reconstruction (should be higher quality, as it is only encoded once)
    • Computed SSIM for cyclic reconstruction (should be lower quality, as it is encoded twice)

    SSIM Results

    These are based on the same two speakers of the main paper. SSIM is generally high, indicating effective content-encoding abilities from the shared encoder.

    • Cyclic reconstruction scores lower than reconstruction since its data passes through the encoder twice, and on the second encoding the input to the encoder is fake instead of real (causing a noticeable dilution of quality).
    • SSIM is higher for male than for female; this may be attributed to there being more male samples than female samples in the dataset, or to there being fewer "blowing into the microphone" artefacts in the male recordings.

    | Target | Reconstruction | Cyclic Reconstruction |
    |:------:|:--------------:|:---------------------:|
    | Female |      0.86      |          0.71         |
    |  Male  |      0.89      |          0.79         |

    opened by RussellSB 1
  • Inference wav directory support

    • Added wav directory support in inference.py
    • Chooses whether to infer on one audio file or a whole directory based on whether --wav or --wavdir is used
    • Reformed the output structure to better mirror out_train
    • All output is now localised in a directory "[model_name]_[epoch]_[generator]"
    • Each out_infer dir has three dirs: plots, ref, gen (this makes sending generated audio to WaveNet for processing more straightforward)
    opened by RussellSB 1
  • Many-to-Many Style Transfer

    Made code compatible with 2+ speakers. Tested for 4 speakers.

    • Rearranged preprocessing so speakers are referenced not as A and B but by their ids in a dict (i.e. 0 and 1 for the two initial speakers)
    • Modified training to work with respect to cyclic combinations of each speaker
    • In train.py, included a main method for calling the train loop; otherwise it crashes on Windows due to num_workers > 0 in the data loader
    • Inference now includes the source in the saved directory name when specified (keeps track of the source for many-to-many).
    opened by RussellSB 0
  • Inference and Refactoring

    Inference:

    • Needs input wav specification
    • Needs model name and epoch no. specification
    • Needs generator id specification (i.e. whether to use G1 or G2)
    • Features plotting
    • Features parameterizable overlap for the sliding windows in global inference
    • Saves reconstructed output (as well as the preprocessed input, for comparison)

    Refactoring:

    • Changed the previous asserts to have warning messages rather than comments next to them
    • Fixed previous plots so that the y-axis (values 0-128) now runs bottom-up rather than top-down
    • Updated gitignore

    An example inference plot is attached to the PR.

    opened by RussellSB 0
  • Train Eval Test Split

    • Introduced split in preprocess.py
    • Modified data_proc so it takes the dataset param as a file, not a folder (useful for pointing to datasets other than the training one)
    opened by RussellSB 0
  • Subtle Refactoring

    • Fixed n_spkr descriptions
    • Corrected minor typos
    • Added pin_memory for faster training
    • Renamed shared_block to final_block in the Encoder, since all encoder blocks are universally shared
    • Updated librosa.output.write_wav to sf.write, since the former is deprecated
    • Decoupled the naming of pickled data from the model experiment directory
    opened by RussellSB 0
  • model structure thinking

    I have several questions:

    1. Why use shared blocks in the generator when the shared hidden code z already exists?
    2. Why not use data augmentation?
    3. Why use only one encoder?
    opened by profection 0
  • Train.py skips first 99 epochs

    When trying to run the train.py script, the first 99 epochs are skipped without any progress and the 99th runs as expected. Any ideas on how to fix this? We are using our own dataset of 1100 files, with lengths between 5 and 20 seconds, at 16 kHz mono.

    opened by Luca-Lob 1
  • Exploding losses during voice-conversion training

    Thanks for the great repository. Unfortunately I have a problem during the voice-conversion training.

    After the first 2 epochs I get exploding losses (loss plot attached).

    What's the reason for that and how can I solve this?

    I would be happy about any tips.

    Thanks in advance!

    opened by neuronx1 6
Owner: Ehab AlBadawy