Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

Overview

This repo contains the official implementation of the VAE-GAN from the INTERSPEECH 2020 paper Voice Conversion Using Speech-to-Speech Neuro-Style Transfer.

Examples of generated audio using the Flickr8k Audio Corpus: https://ebadawy.github.io/post/speech_style_transfer. Note that these examples are the result of feeding audio reconstructions from this VAE-GAN into an implementation of WaveNet.

1. Data Preparation

Dataset file structure:

/path/to/database
├── spkr_1
│   ├── sample.wav
├── spkr_2
│   ├── sample.wav
│   ...
└── spkr_N
    ├── sample.wav
    ...
# Directories under each speaker must not be nested.
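
As a sanity check before preprocessing, the layout can be verified with a short script. Below is a minimal sketch (not part of the repo; the path is a placeholder) that flags nested directories and non-wav files:

```python
import os

def check_dataset(root):
    """Flag nested directories and non-wav files under each speaker directory."""
    for spkr in sorted(os.listdir(root)):
        spkr_dir = os.path.join(root, spkr)
        if not os.path.isdir(spkr_dir):
            continue  # ignore stray top-level files
        for name in os.listdir(spkr_dir):
            path = os.path.join(spkr_dir, name)
            if os.path.isdir(path):
                print(f"nested directory (not allowed): {path}")
            elif not name.endswith(".wav"):
                print(f"unexpected non-wav file: {path}")

check_dataset("/path/to/database")
```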

Here is an example script for setting up data preparation from the Flickr8k Audio Corpus. The speakers of interest are the same as in the paper, but they may be changed to other speakers if desired.

2. Data Preprocessing

The prepared dataset is organised into a train/eval/test split, the audio is preprocessed, and melspectrograms are computed.

python preprocess.py --dataset [path/to/dataset] --test-size [float] --eval-size [float]
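
For reference, the melspectrogram step can be sketched with librosa. The parameters below (22050 Hz sample rate, 128 mel bands) are illustrative assumptions, not necessarily the exact settings used in preprocess.py:

```python
import librosa
import numpy as np

def wav_to_melspec(path, sr=22050, n_mels=128):
    """Load a wav file and compute a log-scaled melspectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, n_frames)

mel = wav_to_melspec("sample.wav")
print(mel.shape)
```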

3. Training

The VAE-GAN model uses the melspectrograms to learn style transfer between two speakers.

python train.py --model_name [name of the model] --dataset [path/to/dataset]

3.1. Visualization

By default, the code plots a batch of input and output melspectrograms every epoch. You may add --plot-interval -1 to the above command to disable this. Alternatively, you may add --plot-interval 20 to plot every 20 epochs.

3.2. Saving Models

By default, models are saved every epoch. With datasets smaller than Flickr8k it may be more appropriate to save less frequently, e.g. by adding --checkpoint_interval 20 to save every 20 epochs.

3.3. Epochs

The max number of epochs may be set with --n_epochs. For smaller datasets, you may want to increase this to more than the default 100. To load a pretrained model you can use --epoch and set it to the epoch number of the saved model.

3.4. Pretrained Model

Pretrained model files are available here. After downloading and storing them in the directory src/saved_models/pretrained, you may use them for training or inference with:

--model_name pretrained --epoch 99

Note that the discriminator files D1 and D2 are not required for inference (they are required for further training). Here, G1 refers to the decoding generator for speaker 1 (female) and G2 to the one for speaker 2 (male).

4. Inference

The trained VAE-GAN is used for inference on a specified audio file. It works by sliding a window over the full melspectrogram, locally inferring melspectrogram subsamples, and averaging the overlaps. The script then uses Griffin-Lim to reconstruct audio from the generated melspectrogram. A sketch of this idea follows the command below.

python inference.py --model_name [name of the model] --epoch [epoch number] --trg_id [id of target generator] --wav [path/to/source_audio.wav]
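
A minimal sketch of the sliding-window averaging described above; the window width, hop size, and generator call here are illustrative assumptions rather than the exact parameters of inference.py:

```python
import numpy as np

def sliding_infer(mel, generator, win=128, hop=64):
    """Slide a fixed-width window over the melspectrogram, infer each
    window with the generator, and average the overlapping regions."""
    n_mels, n_frames = mel.shape
    out = np.zeros_like(mel)
    counts = np.zeros(n_frames)
    for start in range(0, max(n_frames - win, 0) + 1, hop):
        chunk = mel[:, start:start + win]
        out[:, start:start + win] += generator(chunk)  # hypothetical model call
        counts[start:start + win] += 1
    # average the overlaps; the max() guards frames no window covered
    return out / np.maximum(counts, 1)
```

If the averaged output is a power-scale melspectrogram, the Griffin-Lim reconstruction could then be done with, e.g., librosa.feature.inverse.mel_to_audio(out, sr=22050).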

To achieve high-quality results like those in the paper, you can feed the reconstructed audio to a trained vocoder such as WaveNet. An example pipeline that uses this model with WaveNet can be found here.

4.1. Directory Input

Instead of a single .wav as input you may specify a whole directory of .wav files by using --wavdir instead of --wav.

4.2. Visualization

By default, plotting of input and output melspectrograms is enabled. This is useful for a visual comparison between trained models. To disable it, set --plot -1.

4.3. Reconstructive Evaluation

Alongside generation, reconstruction and cyclic reconstruction may be enabled by specifying the generator id of the source audio with --src_id [id of source generator].

When set, SSIM metrics for reconstructed melspectrograms and cyclically reconstructed melspectrograms are computed and printed at the end of inference.

This is an extra feature to help compare the reconstructive capabilities of different models. The higher the SSIM, the higher the quality of the reconstruction.
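
For reference, SSIM between two melspectrograms can be computed with scikit-image. A minimal sketch; the shared min-max normalisation to [0, 1] is an assumption of this example, not necessarily what inference.py does:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mel_ssim(mel_a, mel_b):
    """SSIM between two equally shaped melspectrograms, rescaled to [0, 1]."""
    mel_a, mel_b = np.asarray(mel_a), np.asarray(mel_b)
    lo = min(mel_a.min(), mel_b.min())
    hi = max(mel_a.max(), mel_b.max())
    a = (mel_a - lo) / (hi - lo)
    b = (mel_b - lo) / (hi - lo)
    return ssim(a, b, data_range=1.0)
```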

Citation

If you find this code useful, please cite us in your work:

@inproceedings{AlBadawy2020,
  author={Ehab A. AlBadawy and Siwei Lyu},
  title={{Voice Conversion Using Speech-to-Speech Neuro-Style Transfer}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4726--4730},
  doi={10.21437/Interspeech.2020-3056},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3056}
}

TODO:

  • Rewrite preprocess.py to handle:
    • multi-process feature extraction
    • display error messages for failed cases
  • Create:
    • Notebook for data visualisation
  • Want to add something else? Please feel free to submit a PR with your changes or open an issue for that.
Comments
  • I get an error when I start to train

    Hi, thank you for your great work. I am sorry, but I get a problem when I try to train. When I run this command: python train.py --model_name [name of the model] --dataset [path/to/dataset] I get this error: NameError: name 'transfer_combos' is not defined

    opened by kikirizki 6
  • Evaluate and Transfer Plotting

    Overview

    • Added evaluate.py, whose purpose is to run on a test set without weight updates and to save out.pickle of the conversions when done
    • Added plotting functionality to both evaluate.py and train.py
    • Updated the Encoder model definition, removing the final_block (shared_block) param
    • Updated preprocessing to use defaultdict(list), for less verbosity and for consistency with the feats dictionary in evaluate.py

    Plotting Notes

    • In train.py, it plots the first batch every epoch or every few epochs (specified by the plot_interval arg)
    • In evaluate.py, the plot_interval arg controls how often it plots a batch (with respect to every few batches, since there are no epochs in evaluate)
    • The plotting code in utils.py looks somewhat repetitive, but it seemed more readable this way as opposed to a single method with lots of conditions. Also, the eval and train plot functions need to be separate, since train is also concerned with plotting the target used in updating weights, while eval isn't.

    Train and eval plot screenshots are attached to the PR.

    opened by RussellSB 6
  • Indentation in src/train.py

    Hi -- I get the following when I run training:

    File "src/train.py", line 250
        for i, batch in progress:
                                ^
    TabError: inconsistent use of tabs and spaces in indentation
    

    Do you not get this? I have manually standardized the use of tabs/spaces in that file locally so that it runs. Maybe your environment deals with indentation differently. Regardless, I was wondering if, to the extent you agree with what I'm seeing, you'd be open to a pull request on this file so it works "out of the box" (e.g. on Colab).

    Thanks for any input

    opened by rohitgupta3 3
  • Generating audio after model training

    How to generate or convert audio after model training is done is not mentioned in the README. Can you please add that to the README as well?

    Great work on this! I saw the results and they look promising.

    opened by Himanshu-KF 2
  • The output vocals have serious electronic sounds

    Hello, I used the pretrained model you provided to run inference on the audio provided on your github.io webpage. The result has a very heavy electronic sound (the output is pretty bad compared with the demo on your github.io webpage). Can you provide the pretrained model, method, and parameters that correspond to the github.io webpage demo? I would be very grateful.

    opened by FashengChen0622 1
  • SSIM evaluation in Inference

    • Needed to specify the src generator for reconstruction
    • Refactored G to G_trg for reference consistency
    • Computed SSIM for reconstruction (should be higher quality, as it is only encoded once)
    • Computed SSIM for cyclic reconstruction (should be lower quality, as it is encoded twice)

    SSIM Results

    These are based on the same two speakers of the main paper. SSIM is generally high, indicating effective content-encoding abilities from the shared encoder.

    • Cyclic reconstruction scores lower than reconstruction since its data passes through the encoder twice, and on the second encoding the input to the encoder is fake instead of real (causing a noticeable dilution of quality).
    • SSIM is higher for male than for female; this may be attributed to there being more male samples than female samples in the dataset, or to there being fewer "blowing into the microphone" artefacts in the male recordings.

    | Target | Reconstruction | Cyclic Reconstruction |
    |:------:|:--------------:|:---------------------:|
    | Female |      0.86      |          0.71         |
    |  Male  |      0.89      |          0.79         |

    opened by RussellSB 1
  • Inference wav directory support

    • Added wav directory support in inference.py
    • Chooses whether to infer on one audio file or a whole directory based on whether --wav or --wavdir is used
    • Reformed the output structure to better mirror out_train
    • All output is now localised in a directory "[model_name]_[epoch]_[generator]"
    • Each out_infer dir has three dirs: plots, ref, gen (this makes sending generated audio to WaveNet for processing more straightforward)
    opened by RussellSB 1
  • Many-to-Many Style Transfer

    Made code compatible with 2+ speakers. Tested for 4 speakers.

    • Rearranged preprocessing so speakers are referenced not as A and B but by their ids in a dict (i.e. 0 and 1 for the two initial speakers)
    • Modified training to work with respect to cyclic combinations of each speaker
    • In train.py, included a main method for calling the train loop; otherwise it crashes on Windows due to num_workers > 0 in the data loader
    • Inference now includes the source in the saved directory name when specified (keeps track of the source for many-to-many).
    opened by RussellSB 0
  • Inference and Refactoring

    Inference:

    • Needs input wav specification
    • Needs model name and epoch no. specification
    • Needs generator id specification (i.e. whether to use G1 or G2)
    • Features plotting
    • Features parameterizable overlap for the sliding windows in global inference
    • Saves reconstructed output (as well as the preprocessed input, for comparison)

    Refactoring:

    • Changed the previous asserts to have warning messages rather than comments next to them
    • Fixed previous plots so that the y-axis (values 0-128) now runs bottom-up rather than top-down
    • Updated gitignore

    An example inference plot is attached to the PR.

    opened by RussellSB 0
  • Train Eval Test Split

    • Introduced split in preprocess.py
    • Modified data_proc so it takes the dataset param as a file, not a folder (useful for pointing to datasets other than the training one)
    opened by RussellSB 0
  • Subtle Refactoring

    • Fixed n_spkr descriptions
    • Corrected minor typos
    • Added pin_memory for faster training
    • Renamed shared_block to final_block in the Encoder, since all encoder blocks are universally shared
    • Updated librosa.output.write_wav to sf.write, since the former is deprecated
    • Decoupled the naming of pickled data from the model experiment directory
    opened by RussellSB 0
  • model structure thinking

    I have several questions:

    1. Why use shared blocks in the generator when the shared hidden code z already exists?
    2. Why not use data augmentation?
    3. Why use only one encoder?
    opened by profection 0
  • Train.py skips first 99 epochs

    When trying to run the train.py script, the first 99 epochs are skipped without any progress and the 99th runs as expected. Any ideas on how to fix this? We are using our own dataset of 1100 files, with lengths between 5 and 20 seconds, at 16 kHz mono.

    opened by Luca-Lob 1
  • Exploding losses during voice-conversion training

    Thanks for the great repository. Unfortunately I have a problem during the voice-conversion training.

    After the first 2 epochs I get exploding losses (loss plot attached).

    What's the reason for that and how can I solve this?

    I would be happy about any tips.

    Thanks in advance!

    opened by neuronx1 6
Owner: Ehab AlBadawy