Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features

Overview

MediumVC

MediumVC is an utterance-level method for any-to-any VC. As a first step, we propose SingleVC to perform the any-to-one (A2O) task (Xi → Ŷi, where Xi is utterance i spoken by speaker X). The converted Ŷi are regarded as the SSIF (synthetic specific-speaker intermedium features). To build SingleVC, we employ a novel data augmentation strategy, pitch-shifted and duration-remained (PSDR), to produce paired asymmetrical training data. Then, on top of the pre-trained SingleVC, MediumVC performs an asymmetrical reconstruction task (Ŷi → X̂i). Owing to this asymmetrical reconstruction mode, MediumVC achieves more effective feature decoupling and fusion. Experiments demonstrate that MediumVC is robust to unseen speakers across multiple public datasets. This repository is the official implementation of the MediumVC paper.
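In short, PSDR shifts the pitch of a source utterance while keeping its length, so the shifted copy and the original form an asymmetrical training pair. A minimal sketch of this idea with librosa (the semitone value is illustrative, not the setting used in the paper):

    import librosa

    def psdr_pair(wav_path, n_steps, sr=22050):
        """Return (original, pitch-shifted) versions of one utterance; duration is unchanged."""
        y, _ = librosa.load(wav_path, sr=sr)
        # The phase-vocoder-based pitch shift alters pitch only, so the output
        # keeps the same number of samples as the input (duration-remained).
        y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
        return y, y_shifted

    # e.g. build an asymmetrical pair (shifted input -> original target):
    # original, shifted = psdr_pair("p225_001.wav", n_steps=3)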

The figure below shows the overall model architecture.

Model architecture

For audio samples, please refer to our demo page. More converted speeches can be found in "Demo/ConvertedSpeeches/".

Envs

You can install the dependencies with

pip install -r requirements.txt

Speaker Encoder

Dvector is a robust speaker verification (SV) system pre-trained on VoxCeleb1 with the GE2E loss; it produces 256-dim speaker embeddings. In our evaluation on multiple datasets (VCTK with 30,000 pairs, LibriSpeech with 30,000 pairs, and VCC2020 with 10,000 pairs), the equal error rates (EERs) and decision thresholds (THRs) are recorded in the table below. Dvector with these THRs is then used to compute the SV accuracy (ACC) of pairs produced by MediumVC and the contrast methods for objective evaluation; a usage sketch follows the table. More details are available in the paper.

Dataset      VCTK         LibriSpeech   VCC2020
EER(%)/THR   7.71/0.462   7.95/0.337    1.06/0.432
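As a usage sketch (the checkpoint paths follow this repository's default layout, and the wav file names are placeholders), a converted/target pair can be scored with Dvector and accepted when the cosine similarity exceeds the dataset's THR:

    import torch
    import torchaudio

    wav2mel = torch.jit.load("Any2Any/model/dvector/pre_model/wav2mel.pt")
    dvector = torch.jit.load("Any2Any/model/dvector/pre_model/dvector-step250000.pt").eval()

    def embed(path):
        wav, sr = torchaudio.load(path)
        mel = wav2mel(wav, sr)               # log mel features expected by Dvector
        return dvector.embed_utterance(mel)  # 256-dim speaker embedding

    # Same-speaker decision against the VCTK threshold (0.462) from the table above.
    sim = torch.nn.functional.cosine_similarity(
        embed("converted.wav"), embed("target.wav"), dim=0)
    same_speaker = sim.item() > 0.462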

Vocoder

The HiFi-GAN vocoder is employed to convert log mel-spectrograms into waveforms. The model is trained on universal datasets and has 13.93M parameters. In our evaluation, it synthesizes 22.05 kHz high-fidelity speech scoring above 4.0 MOS, even for cross-language or noisy inputs.
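A minimal synthesis sketch following the upstream HiFi-GAN inference recipe; the module names (env.AttrDict, models.Generator) are those of the original HiFi-GAN code and may differ from the hifivoice/ wrapper bundled here:

    import json
    import torch
    from env import AttrDict        # upstream HiFi-GAN helper
    from models import Generator    # upstream HiFi-GAN generator

    with open("hifivoice/pretrained/UNIVERSAL_V1/config.json") as f:
        h = AttrDict(json.load(f))

    generator = Generator(h)
    state = torch.load("hifivoice/pretrained/UNIVERSAL_V1/g_02500000", map_location="cpu")
    generator.load_state_dict(state["generator"])
    generator.eval()
    generator.remove_weight_norm()

    mel = torch.randn(1, h.num_mels, 200)   # placeholder; feed the mel predicted by MediumVC
    with torch.no_grad():
        wav = generator(mel).squeeze()      # 22.05 kHz waveform in [-1, 1]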

Infer

You can download the pretrained models and then edit "Any2Any/infer/infer_config.yaml". Test samples should be organized as "wav22050/$figure$/*.wav" (one folder per speaker).
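A hedged example of the fields you will typically edit there (the paths are placeholders that follow the repository's default layout):

    resume_path: Any2Any/model/checkpoint-3900.pt           # pretrained MediumVC checkpoint
    singlevc_model_path: Any2Any/model/checkpoint-3000.pt   # pretrained SingleVC checkpoint
    test_wav_dir: Any2Any/audio/in/                         # directory holding the test samples
    out_dir: Any2Any/audio/out/                             # converted speeches are written here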

python Any2Any/infer/infer.py
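If launching the script directly raises "ModuleNotFoundError: No module named 'Any2Any'" (as reported in the Comments below), a common workaround, though not an official fix, is to run it from the repository root with the root on the Python module path:

    PYTHONPATH=. python Any2Any/infer/infer.py
    # or, equivalently
    python -m Any2Any.infer.infer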

Train from scratch

Preprocessing

The corpus should be organized as "VCTK22050/$figure$/*.wav"; then edit the config file "Any2Any/pre_feature/preprocess_config.yaml". The output "spk_emb_mel_label.pkl" will be used for training.
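An illustrative layout (the speaker IDs are placeholders; each $figure$ folder holds the 22.05 kHz wav files of one speaker):

    VCTK22050/
        p225/
            p225_001.wav
            p225_002.wav
        p226/
            p226_001.wav
            ...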

python Any2Any/pre_feature/figure_spkemb_mel.py

Training

Please first edit the paths of the pretrained HiFi-GAN model, wav2mel, dvector, and SingleVC in the config file "Any2Any/config.yaml".
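A hedged example of those entries (the key names below mirror the inference config and are an assumption; point them at wherever you stored the pretrained components):

    hifi_model_path: hifivoice/pretrained/UNIVERSAL_V1/g_02500000
    hifi_config_path: hifivoice/pretrained/UNIVERSAL_V1/config.json
    wav2mel_model_path: Any2Any/model/dvector/pre_model/wav2mel.pt
    dvector_model_path: Any2Any/model/dvector/pre_model/dvector-step250000.pt
    singlevc_model_path: Any2Any/model/checkpoint-3000.pt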

python Any2Any/solver.py

Comments
  • Inference Error

    Hello, when I try to run the inference code, I get the following error:

        File "Any2Any/infer/infer.py", line 11, in <module>
            from Any2Any import util
        ModuleNotFoundError: No module named 'Any2Any'

    I get this error when running on Google Colab and locally on Windows. I believe this means that the code doesn't recognize the Any2Any folder as a module, which should be solved when __init__.py exists in the directory. Unfortunately, it still gives an error even when __init__.py exists.

    opened by AhmedHashish123 6
  • RuntimeError: Error(s) in loading state_dict for MagicModel:

    Hello, I have a problem when I try to run the inference code with the pretrained model; I get the following error:

    【Solver】
    *********  [load]   ***********
    01/28 07:21:45 PM (Elapsed: 00:00:03) loading the model from /content/MediumVC/Any2Any/model/checkpoint-3000.pt
    Traceback (most recent call last):
      File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/content/MediumVC/Any2Any/infer/infer.py", line 95, in <module>
        solver = Solver(config)
      File "/content/MediumVC/Any2Any/infer/infer.py", line 28, in __init__
        self.resume_model(self.config['resume_path'])
      File "/content/MediumVC/Any2Any/infer/infer.py", line 56, in resume_model
        self.Generator.load_state_dict(checkpoint['Generator'])
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1483, in load_state_dict
        self.__class__.__name__, "\n\t".join(error_msgs)))
    RuntimeError: Error(s) in loading state_dict for MagicModel:
    	Missing key(s) in state_dict: "any2one.encoder.pre_block.0.conv_block1.conv_block.conv0.bias", "any2one.encoder.pre_block.0.conv_block1.conv_block.conv0.weight", "any2one.encoder.pre_block.0.conv_block2.conv_block.conv0.bias", "any2one.encoder.pre_block.0.conv_block2.conv_block.conv0.weight", "any2one.encoder.pre_block.0.adjust_dim_layer.bias", "any2one.encoder.pre_block.0.adjust_dim_layer.weight", "any2one.encoder.pre_block.1.conv_block1.conv_block.conv0.bias", "any2one.encoder.pre_block.1.conv_block1.conv_block.conv0.weight", "any2one.encoder.pre_block.1.conv_block2.conv_block.conv0.bias", "any2one.encoder.pre_block.1.conv_block2.conv_block.conv0.weight", "any2one.encoder.pre_block.1.adjust_dim_layer.bias", "any2one.encoder.pre_block.1.adjust_dim_layer.weight", "any2one.encoder.pre_block.2.conv_block1.conv_block.conv0.bias", "any2one.encoder.pre_block.2.conv_block1.conv_block.conv0.weight", "any2one.encoder.pre_block.2.conv_block2.conv_block.conv0.bias", "any2one.encoder.pre_block.2.conv_block2.conv_block.conv0.weight", "any2one.encoder.pre_block.2.adjust_dim_layer.bias", "any2one.encoder.pre_block.2.adjust_dim_layer.weight", "any2one.encoder.post_block.0.cross_attn.in_proj_weight", "any2one.encoder.post_block.0.cross_attn.in_proj_bias", "any2one.encoder.post_block.0.cross_attn.out_proj.weight", "any2one.encoder.post_block.0.cross_attn.out_proj.bias", "any2one.decoder.pre_conv_block.0.conv_block1.conv_block.conv0.bias", "any2one.decoder.pre_conv_block.0.conv_block1.conv_block.conv0.weight", "any2one.decoder.pre_conv_block.0.conv_block2.conv_block.conv0.bias", "any2one.decoder.pre_conv_block.0.conv_block2.conv_block.conv0.weight", "any2one.decoder.pre_conv_block.0.adjust_dim_layer.bias", "any2one.decoder.pre_conv_block.0.adjust_dim_layer.weight", "any2one.decoder.pre_attention_block.0.cross_attn.in_proj_weight", "any2one.decoder.pre_attention_block.0.cross_attn.in_proj_bias", "any2one.decoder.pre_attention_block.0.cross_attn.out_proj.weight", "any2one.decoder.pre_attention_block.0.cross_attn.out_proj.bias", "any2one.decoder.mel_linear1.weight", "any2one.decoder.mel_linear1.bias", "any2one.decoder.mel_linear2.weight", "any2one.decoder.mel_linear2.bias", "any2one.decoder.smoothers.0.self_attn.in_proj_weight", "any2one.decoder.smoothers.0.self_attn.in_proj_bias", "any2one.decoder.smoothers.0.self_attn.out_proj.weight", "any2one.decoder.smoothers.0.self_attn.out_proj.bias", "any2one.decoder.smoothers.0.conv0.bias", "any2one.decoder.smoothers.0.conv0.weight", "any2one.decoder.smoothers.0.conv1.bias", "any2one.decoder.smoothers.0.conv1.weight", "any2one.decoder.smoothers.1.self_attn.in_proj_weight", "any2one.decoder.smoothers.1.self_attn.in_proj_bias", "any2one.decoder.smoothers.1.self_attn.out_proj.weight", "any2one.decoder.smoothers.1.self_attn.out_proj.bias", "any2one.decoder.smoothers.1.conv0.bias", "any2one.decoder.smoothers.1.conv0.weight", "any2one.decoder.smoothers.1.conv1.bias", "any2one.decoder.smoothers.1.conv1.weight", "any2one.decoder.smoothers.2.self_attn.in_proj_weight", "any2one.decoder.smoothers.2.self_attn.in_proj_bias", "any2one.decoder.smoothers.2.self_attn.out_proj.weight", "any2one.decoder.smoothers.2.self_attn.out_proj.bias", "any2one.decoder.smoothers.2.conv0.bias", "any2one.decoder.smoothers.2.conv0.weight", "any2one.decoder.smoothers.2.conv1.bias", "any2one.decoder.smoothers.2.conv1.weight", "any2one.decoder.post_block.0.conv_block1.conv_block.conv0.bias", "any2one.decoder.post_block.0.conv_block1.conv_block.conv0.weight", 
"any2one.decoder.post_block.0.conv_block2.conv_block.conv0.bias", "any2one.decoder.post_block.0.conv_block2.conv_block.conv0.weight", "any2one.decoder.post_block.0.adjust_dim_layer.bias", "any2one.decoder.post_block.0.adjust_dim_layer.weight", "any2one.decoder.post_block.1.conv_block1.conv_block.conv0.bias", "any2one.decoder.post_block.1.conv_block1.conv_block.conv0.weight", "any2one.decoder.post_block.1.conv_block2.conv_block.conv0.bias", "any2one.decoder.post_block.1.conv_block2.conv_block.conv0.weight", "any2one.decoder.post_block.1.adjust_dim_layer.bias", "any2one.decoder.post_block.1.adjust_dim_layer.weight", "any2one.decoder.post_block.2.conv_block1.conv_block.conv0.bias", "any2one.decoder.post_block.2.conv_block1.conv_block.conv0.weight", "any2one.decoder.post_block.2.conv_block2.conv_block.conv0.bias", "any2one.decoder.post_block.2.conv_block2.conv_block.conv0.weight", "any2one.decoder.post_block.2.adjust_dim_layer.bias", "any2one.decoder.post_block.2.adjust_dim_layer.weight", "any2one.decoder.post_block.3.conv_block1.conv_block.conv0.bias", "any2one.decoder.post_block.3.conv_block1.conv_block.conv0.weight", "any2one.decoder.post_block.3.conv_block2.conv_block.conv0.bias", "any2one.decoder.post_block.3.conv_block2.conv_block.conv0.weight", "any2one.decoder.post_block.3.adjust_dim_layer.bias", "any2one.decoder.post_block.3.adjust_dim_layer.weight", "cont_encoder.conv_block0.0.conv_block1.conv_block.conv0.bias", "cont_encoder.conv_block0.0.conv_block1.conv_block.conv0.weight_g", "cont_encoder.conv_block0.0.conv_block1.conv_block.conv0.weight_v", "cont_encoder.conv_block0.0.conv_block2.conv_block.conv0.bias", "cont_encoder.conv_block0.0.conv_block2.conv_block.conv0.weight_g", "cont_encoder.conv_block0.0.conv_block2.conv_block.conv0.weight_v", "cont_encoder.conv_block0.0.adjust_dim_layer.bias", "cont_encoder.conv_block0.0.adjust_dim_layer.weight_g", "cont_encoder.conv_block0.0.adjust_dim_layer.weight_v", "cont_encoder.conv_block0.1.conv_block1.conv_block.conv0.bias", "cont_encoder.conv_block0.1.conv_block1.conv_block.conv0.weight_g", "cont_encoder.conv_block0.1.conv_block1.conv_block.conv0.weight_v", "cont_encoder.conv_block0.1.conv_block2.conv_block.conv0.bias", "cont_encoder.conv_block0.1.conv_block2.conv_block.conv0.weight_g", "cont_encoder.conv_block0.1.conv_block2.conv_block.conv0.weight_v", "cont_encoder.conv_block0.1.adjust_dim_layer.bias", "cont_encoder.conv_block0.1.adjust_dim_layer.weight_g", "cont_encoder.conv_block0.1.adjust_dim_layer.weight_v", "cont_encoder.attention_norm0.0.cross_attn.in_proj_weight", "cont_encoder.attention_norm0.0.cross_attn.in_proj_bias", "cont_encoder.attention_norm0.0.cross_attn.out_proj.weight", "cont_encoder.attention_norm0.0.cross_attn.out_proj.bias", "cont_encoder.conv_block1.0.conv_block1.conv_block.conv0.bias", "cont_encoder.conv_block1.0.conv_block1.conv_block.conv0.weight_g", "cont_encoder.conv_block1.0.conv_block1.conv_block.conv0.weight_v", "cont_encoder.conv_block1.0.conv_block2.conv_block.conv0.bias", "cont_encoder.conv_block1.0.conv_block2.conv_block.conv0.weight_g", "cont_encoder.conv_block1.0.conv_block2.conv_block.conv0.weight_v", "cont_encoder.conv_block1.0.adjust_dim_layer.bias", "cont_encoder.conv_block1.0.adjust_dim_layer.weight_g", "cont_encoder.conv_block1.0.adjust_dim_layer.weight_v", "cont_encoder.conv_block1.1.conv_block1.conv_block.conv0.bias", "cont_encoder.conv_block1.1.conv_block1.conv_block.conv0.weight_g", "cont_encoder.conv_block1.1.conv_block1.conv_block.conv0.weight_v", 
"cont_encoder.conv_block1.1.conv_block2.conv_block.conv0.bias", "cont_encoder.conv_block1.1.conv_block2.conv_block.conv0.weight_g", "cont_encoder.conv_block1.1.conv_block2.conv_block.conv0.weight_v", "cont_encoder.conv_block1.1.adjust_dim_layer.bias", "cont_encoder.conv_block1.1.adjust_dim_layer.weight_g", "cont_encoder.conv_block1.1.adjust_dim_layer.weight_v", "cont_encoder.attention_norm1.0.cross_attn.in_proj_weight", "cont_encoder.attention_norm1.0.cross_attn.in_proj_bias", "cont_encoder.attention_norm1.0.cross_attn.out_proj.weight", "cont_encoder.attention_norm1.0.cross_attn.out_proj.bias", "generator.pre_block0.0.conv_block1.conv_block.conv0.bias", "generator.pre_block0.0.conv_block1.conv_block.conv0.weight_g", "generator.pre_block0.0.conv_block1.conv_block.conv0.weight_v", "generator.pre_block0.0.conv_block2.conv_block.conv0.bias", "generator.pre_block0.0.conv_block2.conv_block.conv0.weight_g", "generator.pre_block0.0.conv_block2.conv_block.conv0.weight_v", "generator.pre_block0.0.adjust_dim_layer.bias", "generator.pre_block0.0.adjust_dim_layer.weight_g", "generator.pre_block0.0.adjust_dim_layer.weight_v", "generator.pre_block0.1.conv_block1.conv_block.conv0.bias", "generator.pre_block0.1.conv_block1.conv_block.conv0.weight_g", "generator.pre_block0.1.conv_block1.conv_block.conv0.weight_v", "generator.pre_block0.1.conv_block2.conv_block.conv0.bias", "generator.pre_block0.1.conv_block2.conv_block.conv0.weight_g", "generator.pre_block0.1.conv_block2.conv_block.conv0.weight_v", "generator.pre_block0.1.adjust_dim_layer.bias", "generator.pre_block0.1.adjust_dim_layer.weight_g", "generator.pre_block0.1.adjust_dim_layer.weight_v", "generator.attention0.cross_attn.in_proj_weight", "generator.attention0.cross_attn.in_proj_bias", "generator.attention0.cross_attn.out_proj.weight", "generator.attention0.cross_attn.out_proj.bias", "generator.pre_block1.0.conv_block1.conv_block.conv0.bias", "generator.pre_block1.0.conv_block1.conv_block.conv0.weight_g", "generator.pre_block1.0.conv_block1.conv_block.conv0.weight_v", "generator.pre_block1.0.conv_block2.conv_block.conv0.bias", "generator.pre_block1.0.conv_block2.conv_block.conv0.weight_g", "generator.pre_block1.0.conv_block2.conv_block.conv0.weight_v", "generator.pre_block1.0.adjust_dim_layer.bias", "generator.pre_block1.0.adjust_dim_layer.weight_g", "generator.pre_block1.0.adjust_dim_layer.weight_v", "generator.pre_block1.1.conv_block1.conv_block.conv0.bias", "generator.pre_block1.1.conv_block1.conv_block.conv0.weight_g", "generator.pre_block1.1.conv_block1.conv_block.conv0.weight_v", "generator.pre_block1.1.conv_block2.conv_block.conv0.bias", "generator.pre_block1.1.conv_block2.conv_block.conv0.weight_g", "generator.pre_block1.1.conv_block2.conv_block.conv0.weight_v", "generator.pre_block1.1.adjust_dim_layer.bias", "generator.pre_block1.1.adjust_dim_layer.weight_g", "generator.pre_block1.1.adjust_dim_layer.weight_v", "generator.attention1.cross_attn.in_proj_weight", "generator.attention1.cross_attn.in_proj_bias", "generator.attention1.cross_attn.out_proj.weight", "generator.attention1.cross_attn.out_proj.bias", "generator.smoothers.0.self_attn.in_proj_weight", "generator.smoothers.0.self_attn.in_proj_bias", "generator.smoothers.0.self_attn.out_proj.weight", "generator.smoothers.0.self_attn.out_proj.bias", "generator.smoothers.0.conv0.bias", "generator.smoothers.0.conv0.weight_g", "generator.smoothers.0.conv0.weight_v", "generator.smoothers.0.conv1.bias", "generator.smoothers.0.conv1.weight_g", "generator.smoothers.0.conv1.weight_v", 
"generator.smoothers.1.self_attn.in_proj_weight", "generator.smoothers.1.self_attn.in_proj_bias", "generator.smoothers.1.self_attn.out_proj.weight", "generator.smoothers.1.self_attn.out_proj.bias", "generator.smoothers.1.conv0.bias", "generator.smoothers.1.conv0.weight_g", "generator.smoothers.1.conv0.weight_v", "generator.smoothers.1.conv1.bias", "generator.smoothers.1.conv1.weight_g", "generator.smoothers.1.conv1.weight_v", "generator.smoothers.2.self_attn.in_proj_weight", "generator.smoothers.2.self_attn.in_proj_bias", "generator.smoothers.2.self_attn.out_proj.weight", "generator.smoothers.2.self_attn.out_proj.bias", "generator.smoothers.2.conv0.bias", "generator.smoothers.2.conv0.weight_g", "generator.smoothers.2.conv0.weight_v", "generator.smoothers.2.conv1.bias", "generator.smoothers.2.conv1.weight_g", "generator.smoothers.2.conv1.weight_v", "generator.post_block.0.conv_block1.conv_block.conv0.bias", "generator.post_block.0.conv_block1.conv_block.conv0.weight_g", "generator.post_block.0.conv_block1.conv_block.conv0.weight_v", "generator.post_block.0.conv_block2.conv_block.conv0.bias", "generator.post_block.0.conv_block2.conv_block.conv0.weight_g", "generator.post_block.0.conv_block2.conv_block.conv0.weight_v", "generator.post_block.0.adjust_dim_layer.bias", "generator.post_block.0.adjust_dim_layer.weight_g", "generator.post_block.0.adjust_dim_layer.weight_v", "generator.post_block.1.conv_block1.conv_block.conv0.bias", "generator.post_block.1.conv_block1.conv_block.conv0.weight_g", "generator.post_block.1.conv_block1.conv_block.conv0.weight_v", "generator.post_block.1.conv_block2.conv_block.conv0.bias", "generator.post_block.1.conv_block2.conv_block.conv0.weight_g", "generator.post_block.1.conv_block2.conv_block.conv0.weight_v", "generator.post_block.1.adjust_dim_layer.bias", "generator.post_block.1.adjust_dim_layer.weight_g", "generator.post_block.1.adjust_dim_layer.weight_v", "generator.post_block.2.conv_block1.conv_block.conv0.bias", "generator.post_block.2.conv_block1.conv_block.conv0.weight_g", "generator.post_block.2.conv_block1.conv_block.conv0.weight_v", "generator.post_block.2.conv_block2.conv_block.conv0.bias", "generator.post_block.2.conv_block2.conv_block.conv0.weight_g", "generator.post_block.2.conv_block2.conv_block.conv0.weight_v", "generator.post_block.2.adjust_dim_layer.bias", "generator.post_block.2.adjust_dim_layer.weight_g", "generator.post_block.2.adjust_dim_layer.weight_v", "generator.post_block.3.conv_block1.conv_block.conv0.bias", "generator.post_block.3.conv_block1.conv_block.conv0.weight_g", "generator.post_block.3.conv_block1.conv_block.conv0.weight_v", "generator.post_block.3.conv_block2.conv_block.conv0.bias", "generator.post_block.3.conv_block2.conv_block.conv0.weight_g", "generator.post_block.3.conv_block2.conv_block.conv0.weight_v", "generator.post_block.3.adjust_dim_layer.bias", "generator.post_block.3.adjust_dim_layer.weight_g", "generator.post_block.3.adjust_dim_layer.weight_v", "generator.post_block.4.conv_block1.conv_block.conv0.bias", "generator.post_block.4.conv_block1.conv_block.conv0.weight_g", "generator.post_block.4.conv_block1.conv_block.conv0.weight_v", "generator.post_block.4.conv_block2.conv_block.conv0.bias", "generator.post_block.4.conv_block2.conv_block.conv0.weight_g", "generator.post_block.4.conv_block2.conv_block.conv0.weight_v", "generator.post_block.4.adjust_dim_layer.bias", "generator.post_block.4.adjust_dim_layer.weight_g", "generator.post_block.4.adjust_dim_layer.weight_v". 
    	Unexpected key(s) in state_dict: "encoder.pre_block.0.conv_block1.conv_block.conv0.bias", "encoder.pre_block.0.conv_block1.conv_block.conv0.weight_g", "encoder.pre_block.0.conv_block1.conv_block.conv0.weight_v", "encoder.pre_block.0.conv_block2.conv_block.conv0.bias", "encoder.pre_block.0.conv_block2.conv_block.conv0.weight_g", "encoder.pre_block.0.conv_block2.conv_block.conv0.weight_v", "encoder.pre_block.0.adjust_dim_layer.bias", "encoder.pre_block.0.adjust_dim_layer.weight_g", "encoder.pre_block.0.adjust_dim_layer.weight_v", "encoder.pre_block.1.conv_block1.conv_block.conv0.bias", "encoder.pre_block.1.conv_block1.conv_block.conv0.weight_g", "encoder.pre_block.1.conv_block1.conv_block.conv0.weight_v", "encoder.pre_block.1.conv_block2.conv_block.conv0.bias", "encoder.pre_block.1.conv_block2.conv_block.conv0.weight_g", "encoder.pre_block.1.conv_block2.conv_block.conv0.weight_v", "encoder.pre_block.1.adjust_dim_layer.bias", "encoder.pre_block.1.adjust_dim_layer.weight_g", "encoder.pre_block.1.adjust_dim_layer.weight_v", "encoder.pre_block.2.conv_block1.conv_block.conv0.bias", "encoder.pre_block.2.conv_block1.conv_block.conv0.weight_g", "encoder.pre_block.2.conv_block1.conv_block.conv0.weight_v", "encoder.pre_block.2.conv_block2.conv_block.conv0.bias", "encoder.pre_block.2.conv_block2.conv_block.conv0.weight_g", "encoder.pre_block.2.conv_block2.conv_block.conv0.weight_v", "encoder.pre_block.2.adjust_dim_layer.bias", "encoder.pre_block.2.adjust_dim_layer.weight_g", "encoder.pre_block.2.adjust_dim_layer.weight_v", "encoder.post_block.0.cross_attn.in_proj_weight", "encoder.post_block.0.cross_attn.in_proj_bias", "encoder.post_block.0.cross_attn.out_proj.weight", "encoder.post_block.0.cross_attn.out_proj.bias", "decoder.pre_conv_block.0.conv_block1.conv_block.conv0.bias", "decoder.pre_conv_block.0.conv_block1.conv_block.conv0.weight_g", "decoder.pre_conv_block.0.conv_block1.conv_block.conv0.weight_v", "decoder.pre_conv_block.0.conv_block2.conv_block.conv0.bias", "decoder.pre_conv_block.0.conv_block2.conv_block.conv0.weight_g", "decoder.pre_conv_block.0.conv_block2.conv_block.conv0.weight_v", "decoder.pre_conv_block.0.adjust_dim_layer.bias", "decoder.pre_conv_block.0.adjust_dim_layer.weight_g", "decoder.pre_conv_block.0.adjust_dim_layer.weight_v", "decoder.pre_attention_block.0.cross_attn.in_proj_weight", "decoder.pre_attention_block.0.cross_attn.in_proj_bias", "decoder.pre_attention_block.0.cross_attn.out_proj.weight", "decoder.pre_attention_block.0.cross_attn.out_proj.bias", "decoder.mel_linear1.weight", "decoder.mel_linear1.bias", "decoder.mel_linear2.weight", "decoder.mel_linear2.bias", "decoder.smoothers.0.self_attn.in_proj_weight", "decoder.smoothers.0.self_attn.in_proj_bias", "decoder.smoothers.0.self_attn.out_proj.weight", "decoder.smoothers.0.self_attn.out_proj.bias", "decoder.smoothers.0.conv0.bias", "decoder.smoothers.0.conv0.weight_g", "decoder.smoothers.0.conv0.weight_v", "decoder.smoothers.0.conv1.bias", "decoder.smoothers.0.conv1.weight_g", "decoder.smoothers.0.conv1.weight_v", "decoder.smoothers.1.self_attn.in_proj_weight", "decoder.smoothers.1.self_attn.in_proj_bias", "decoder.smoothers.1.self_attn.out_proj.weight", "decoder.smoothers.1.self_attn.out_proj.bias", "decoder.smoothers.1.conv0.bias", "decoder.smoothers.1.conv0.weight_g", "decoder.smoothers.1.conv0.weight_v", "decoder.smoothers.1.conv1.bias", "decoder.smoothers.1.conv1.weight_g", "decoder.smoothers.1.conv1.weight_v", "decoder.smoothers.2.self_attn.in_proj_weight", "decoder.smoothers.2.self_attn.in_proj_bias", 
"decoder.smoothers.2.self_attn.out_proj.weight", "decoder.smoothers.2.self_attn.out_proj.bias", "decoder.smoothers.2.conv0.bias", "decoder.smoothers.2.conv0.weight_g", "decoder.smoothers.2.conv0.weight_v", "decoder.smoothers.2.conv1.bias", "decoder.smoothers.2.conv1.weight_g", "decoder.smoothers.2.conv1.weight_v", "decoder.post_block.0.conv_block1.conv_block.conv0.bias", "decoder.post_block.0.conv_block1.conv_block.conv0.weight_g", "decoder.post_block.0.conv_block1.conv_block.conv0.weight_v", "decoder.post_block.0.conv_block2.conv_block.conv0.bias", "decoder.post_block.0.conv_block2.conv_block.conv0.weight_g", "decoder.post_block.0.conv_block2.conv_block.conv0.weight_v", "decoder.post_block.0.adjust_dim_layer.bias", "decoder.post_block.0.adjust_dim_layer.weight_g", "decoder.post_block.0.adjust_dim_layer.weight_v", "decoder.post_block.1.conv_block1.conv_block.conv0.bias", "decoder.post_block.1.conv_block1.conv_block.conv0.weight_g", "decoder.post_block.1.conv_block1.conv_block.conv0.weight_v", "decoder.post_block.1.conv_block2.conv_block.conv0.bias", "decoder.post_block.1.conv_block2.conv_block.conv0.weight_g", "decoder.post_block.1.conv_block2.conv_block.conv0.weight_v", "decoder.post_block.1.adjust_dim_layer.bias", "decoder.post_block.1.adjust_dim_layer.weight_g", "decoder.post_block.1.adjust_dim_layer.weight_v", "decoder.post_block.2.conv_block1.conv_block.conv0.bias", "decoder.post_block.2.conv_block1.conv_block.conv0.weight_g", "decoder.post_block.2.conv_block1.conv_block.conv0.weight_v", "decoder.post_block.2.conv_block2.conv_block.conv0.bias", "decoder.post_block.2.conv_block2.conv_block.conv0.weight_g", "decoder.post_block.2.conv_block2.conv_block.conv0.weight_v", "decoder.post_block.2.adjust_dim_layer.bias", "decoder.post_block.2.adjust_dim_layer.weight_g", "decoder.post_block.2.adjust_dim_layer.weight_v", "decoder.post_block.3.conv_block1.conv_block.conv0.bias", "decoder.post_block.3.conv_block1.conv_block.conv0.weight_g", "decoder.post_block.3.conv_block1.conv_block.conv0.weight_v", "decoder.post_block.3.conv_block2.conv_block.conv0.bias", "decoder.post_block.3.conv_block2.conv_block.conv0.weight_g", "decoder.post_block.3.conv_block2.conv_block.conv0.weight_v", "decoder.post_block.3.adjust_dim_layer.bias", "decoder.post_block.3.adjust_dim_layer.weight_g", "decoder.post_block.3.adjust_dim_layer.weight_v". 
    
    opened by ahmadsab95 0
  • Issue with inference

    Hi there, I feel I have my folders set up correctly and the dependencies installed, since an output folder is generated at the start of inference; however, shortly after starting I receive the error:

    11/04 03:09:35 AM (Elapsed: 00:00:04) loading the model from Any2Any/model/checkpoint-3900.pt
    11/04 03:09:36 AM (Elapsed: 00:00:04) config = {'hifi_model_path': 'hifivoice/pretrained/UNIVERSAL_V1/g_02500000', 'hifi_config_path': 'hifivoice/pretrained/UNIVERSAL_V1/config.json', 'wav2mel_model_path': 'Any2Any/model/dvector/pre_model/wav2mel.pt', 'dvector_model_path': 'Any2Any/model/dvector/pre_model/dvector-step250000.pt', 'pre_train_singlevc': True, 'singlevc_model_path': 'Any2Any/model/checkpoint-3000.pt', 'test_wav_dir': 'Any2Any/audio/in/', 'out_dir': 'Any2Any/audio/out/', 'batch_size': 1, 'resume_path': 'Any2Any/model/checkpoint-3900.pt', 'num_mels': 80, 'num_freq': 1025, 'n_fft': 1024, 'hop_size': 256, 'win_size': 1024, 'sampling_rate': 22050, 'fmin': 0, 'fmax': 8000, 'num_workers': 1}
    param Generator size = 26.418132M
    Traceback (most recent call last):
      File "Any2Any/infer/infer.py", line 96, in <module>
        solver.infer()
      File "Any2Any/infer/infer.py", line 60, in infer
        test_data_loader = self.get_test_data_loaders()
      File "Any2Any/infer/infer.py", line 43, in get_test_data_loaders
        test_filelist = get_infer_dataset_filelist(self.config["test_wav_dir"])
      File "/content/drive/MyDrive/MediumVC/Any2Any/meldataset.py", line 109, in get_infer_dataset_filelist
        source_file = random.choice(source_file_list)
      File "/usr/lib/python3.7/random.py", line 261, in choice
        raise IndexError('Cannot choose from an empty sequence') from None
    IndexError: Cannot choose from an empty sequence

    Would greatly appreciate help here, thanks :)

    opened by corranmac 1