PyTorch Implementation of Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations

Overview

NANSY:

Unofficial PyTorch implementation of Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations

Notice

Paper's Demo

Check Authors' Demo page

Sample-Only Demo Page

Check Demo Page

Concerns

Among the various controllabilities, it is rather obvious that the voice conversion technique can be misused and potentially harm other people. 
More concretely, there are possible scenarios where it is being used by random unidentified users and contributing to spreading fake news. 
In addition, it can raise concerns about biometric security systems based on speech. 
To mitigate such issues, the proposed system should not be released without consent, so that it cannot easily be used by random users with malicious intentions.
That being said, there is still a potential for this technology to be used by unidentified users. 
As a more solid solution, therefore, we believe a detection system that can discriminate between fake and real speech should be developed.

We provide both a pretrained checkpoint of the Discriminator network and inference code to address this concern.

Environment

Requirements

pip install -r requirements.txt

Docker

Image

For a cu113-compatible (CUDA 11.3) environment, use Dockerfile.
For a cu102-compatible (CUDA 10.2) environment, use Dockerfile-cu102.

docker build -f Dockerfile -t nansy:v0.0 .

Container

After building the appropriate image, use docker-compose or docker to run a container.
You may want to modify docker-compose.yml or docker_run_script.sh first.

docker-compose -f docker-compose.yml run --service-ports --name CONTAINER_NAME nansy_container bash
or
bash docker_run_script.sh

Pretrained hifi-gan

Download the pretrained hifi-gan config and checkpoint from the hifi-gan repository (https://github.com/jik876/hifi-gan)
and place them in ./configs/hifi-gan/UNIVERSAL_V1

Pretrained Checkpoints

TODO

Datasets

The following datasets were used for training:

VCTK:
- CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)
- https://datashare.ed.ac.uk/handle/10283/3443
LibriTTS:
- Large-scale corpus of English speech derived from the original materials of the LibriSpeech corpus
- https://openslr.org/60/
- train-clean-360 set
CSS10:
- CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages
- https://github.com/Kyubyong/css10

Custom Datasets

Write your own dataset class.
If you inherit from datasets.custom.CustomDataset, self.data should have the following structure:

self.data: list
self.data[i]: dict must have:
    'wav_path_22k': str = path_to_22k_wav_file
    'wav_path_16k': str = (optional) path_to_16k_wav_file
    'speaker_id': str = speaker_id
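
As a hedged illustration of the structure above, the sketch below builds such a list from a directory tree. The layout it assumes (root/wav22/SPEAKER_ID/*.wav with an optional mirrored wav16 folder) and the helper name build_data are hypothetical and not part of this repository. Adapt them to your corpus before assigning the result to self.data in your own dataset class.

```python
from pathlib import Path


def build_data(root):
    """Collect entries in the self.data format described above.

    Assumed (hypothetical) layout:
        root/wav22/<speaker_id>/<utterance>.wav   22k audio (required)
        root/wav16/<speaker_id>/<utterance>.wav   16k audio (optional)
    """
    root = Path(root)
    data = []
    for wav22 in sorted((root / "wav22").glob("*/*.wav")):
        speaker_id = wav22.parent.name
        entry = {"wav_path_22k": str(wav22), "speaker_id": speaker_id}
        # The 16k path is optional, so add it only when the file exists.
        wav16 = root / "wav16" / speaker_id / wav22.name
        if wav16.is_file():
            entry["wav_path_16k"] = str(wav16)
        data.append(entry)
    return data
```

Each returned dict carries the two required keys, plus 'wav_path_16k' when a matching 16k file is found.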

Train

If you prefer pytorch-lightning, run python train.py -g 1. The script accepts the following arguments:

import argparse


def parse_args():
    parser = argparse.ArgumentParser()
    # Path to the training config (see Configs Description)
    parser.add_argument("--config", type=str, default="configs/train_nansy.yaml")
    parser.add_argument('-g', '--gpus', type=str,
                        help="number of gpus to use")
    parser.add_argument('-p', '--resume_checkpoint_path', type=str, default=None,
                        help="path of checkpoint for resuming")
    return parser.parse_args()

Otherwise, run python train_torch.py (TODO: not fully supported yet).

Configs Description

Edit configs/train_nansy.yaml.

Dataset settings

Adjust the datasets.*.datasets lists.
- Each list should contain the paths to the dataset config files to use.

datasets:
  train:
    class: datasets.base.MultiDataset
    datasets: [
      # 'configs/datasets/css10.yaml',
        'configs/datasets/vctk.yaml',
        'configs/datasets/libritts360.yaml',
    ]

    mode: train
    batch_size: 32 # Depends on GPU Memory, Original paper used 32
    shuffle: True
    num_workers: 16 # Depends on available CPU cores

  eval:
    class: datasets.base.MultiDataset
    datasets: [
      # 'configs/datasets/css10.yaml',
        'configs/datasets/vctk.yaml',
        'configs/datasets/libritts360.yaml',
    ]

    mode: eval
    batch_size: 32
    shuffle: False
    num_workers: 4
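
Before starting a long run, it can help to confirm that every dataset config path listed above actually exists. The following is a minimal sketch under stated assumptions: the helper name missing_dataset_configs is hypothetical, and a plain dict stands in for the parsed YAML so the example does not depend on any particular YAML loader.

```python
from pathlib import Path


def missing_dataset_configs(config, base_dir="."):
    """Return {split: [paths]} for dataset config files that do not exist."""
    missing = {}
    for split, settings in config.get("datasets", {}).items():
        absent = [
            path for path in settings.get("datasets", [])
            if not (Path(base_dir) / path).is_file()
        ]
        if absent:
            missing[split] = absent
    return missing


# Plain-dict stand-in for the parsed YAML shown above (illustrative only).
config = {
    "datasets": {
        "train": {"datasets": ["configs/datasets/vctk.yaml",
                               "configs/datasets/libritts360.yaml"]},
        "eval": {"datasets": ["configs/datasets/vctk.yaml"]},
    }
}

if __name__ == "__main__":
    # An empty dict means every listed config file was found.
    print(missing_dataset_configs(config))
```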

Dataset Config

Dataset configs are in ./configs/datasets/.
You will probably need to replace /raid/vision/dhchoi/data with YOUR_PATH_TO_DATA, especially in the path section.

class: datasets.vctk.VCTKDataset # implemented Dataset class name
load:
  audio: 'configs/audio/22k.yaml'

path:
  root: /raid/vision/dhchoi/data/
  wav22: /raid/vision/dhchoi/data/VCTK-Corpus/wav22
  wav16: /raid/vision/dhchoi/data/VCTK-Corpus/wav16
  txt: /raid/vision/dhchoi/data/VCTK-Corpus/txt
  timestamp: ./vctk-silence-labels/vctk-silences.0.92.txt

  configs:
    train: /raid/vision/dhchoi/data/VCTK-Corpus/vctk_22k_train.txt
    eval: /raid/vision/dhchoi/data/VCTK-Corpus/vctk_22k_val.txt
    test: /raid/vision/dhchoi/data/VCTK-Corpus/vctk_22k_test.txt
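
Replacing the data root by hand in every dataset config is error-prone, so a small script can do it. The sketch below is one possible approach, not part of the repository: the function names are hypothetical, and it uses plain text substitution, so it would also rewrite any longer path that merely starts with the old root.

```python
from pathlib import Path


def rewrite_data_root(config_text, old_root, new_root):
    """Replace every occurrence of old_root in a config file's text."""
    # Strip trailing slashes so "/data" and "/data/" behave the same way.
    old_root = old_root.rstrip("/")
    new_root = new_root.rstrip("/")
    return config_text.replace(old_root, new_root)


def rewrite_config_dir(config_dir, old_root, new_root):
    """Rewrite all *.yaml files in config_dir in place; return changed paths."""
    changed = []
    for path in sorted(Path(config_dir).glob("*.yaml")):
        original = path.read_text()
        updated = rewrite_data_root(original, old_root, new_root)
        if updated != original:
            path.write_text(updated)
            changed.append(str(path))
    return changed
```

For example, rewrite_config_dir("configs/datasets", "/raid/vision/dhchoi/data", "/mnt/data") would update the path section of each dataset config, where /mnt/data is a placeholder for your own location. Relative entries such as the timestamp path are left untouched.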

Model Settings

Comment out or delete the Discriminator section if you do not need a Discriminator.
Adjust the optimizer class, lr, and betas if needed.

models:
  Analysis:
    class: models.analysis.Analysis

    optim:
      class: torch.optim.Adam
      kwargs:
        lr: 1e-4
        betas: [ 0.5, 0.9 ]

  Synthesis:
    class: models.synthesis.Synthesis

    optim:
      class: torch.optim.Adam
      kwargs:
        lr: 1e-4
        betas: [ 0.5, 0.9 ]

  Discriminator:
    class: models.synthesis.Discriminator

    optim:
      class: torch.optim.Adam
      kwargs:
        lr: 1e-4
        betas: [ 0.5, 0.9 ]
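
Each entry above pairs a dotted class path with optional kwargs. One common way to turn such an entry into an object is to import the module dynamically and call the class. The sketch below shows that general pattern only. The function names are hypothetical, and the loader actually used by this repository may work differently.

```python
import importlib


def resolve_class(dotted_path):
    """Import and return the object named by 'package.module.ClassName'."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def build_from_spec(spec, *args):
    """Instantiate spec['class'] with positional args and spec['kwargs']."""
    cls = resolve_class(spec["class"])
    return cls(*args, **spec.get("kwargs", {}))


# With torch installed, an optimizer entry could be built like this:
#   spec = {"class": "torch.optim.Adam",
#           "kwargs": {"lr": 1e-4, "betas": [0.5, 0.9]}}
#   optimizer = build_from_spec(spec, model.parameters())
```

Keeping the class path in the config means the optimizer can be swapped by editing the YAML alone, without touching training code.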

Logging & Pytorch-lightning settings

For the pytorch-lightning configs in the pl section, check the official pytorch-lightning docs.

pl:
  checkpoint:
    callback:
      save_top_k: -1
      monitor: "train/backward"
      verbose: True
      every_n_epochs: 1 # epochs

  trainer:
    gradient_clip_val: 0 # don't clip (default value)
    max_epochs: 10000
    num_sanity_val_steps: 1
    fast_dev_run: False
    check_val_every_n_epoch: 1
    progress_bar_refresh_rate: 1
    accelerator: "ddp"
    benchmark: True

logging:
  log_dir: /raid/vision/dhchoi/log/nansy/ # PATH TO SAVE TENSORBOARD LOG FILES
  seed: "31" # Experiment Seed
  freq: 100 # Logging frequency (step)
  device: cuda # Training Device (used only in train_torch.py) 
  nepochs: 1000 # Max epochs to run

  save_files: [ # Files To save for each experiment
      './*.py',
      './*.sh',
      'configs/*.*',
      'datasets/*.*',
      'models/*.*',
      'utils/*.*',
  ]

Tensorboard

During training, the tensorboard logger logs losses, spectrograms, and audio.

tensorboard --logdir YOUR_LOG_DIR_AT_CONFIG/YOUR_SEED --bind_all

Inference

Generator

python inference.py or bash inference.sh

You may want to edit inference.py for custom manipulation.

import argparse


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--path_audio_conf', type=str, default='configs/audio/22k.yaml',
                        help='path to audio config file')
    parser.add_argument('--path_ckpt', type=str, required=True,
                        help='path to pl checkpoint')
    parser.add_argument('--path_audio_source', type=str, required=True,
                        help='path to source audio file, sr=22k')
    parser.add_argument('--path_audio_target', type=str, required=True,
                        help='path to target audio file, sr=16k')
    parser.add_argument('--tsa_loop', type=int, default=100,
                        help='iterations for tsa')
    parser.add_argument('--device', type=str, default='cuda',
                        help='device to run inference on')
    return parser.parse_args()
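
The help strings above expect the source audio at sr=22k and the target audio at sr=16k. A quick check before running inference can catch mismatched files early. The sketch below is not part of the repository: the helper names are hypothetical, it reads "22k" and "16k" as 22050 Hz and 16000 Hz (an assumption), and it relies on the standard-library wave module, which only handles uncompressed PCM WAV files.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate (Hz) of a PCM WAV file using only the stdlib."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def check_inference_inputs(path_source, path_target,
                           source_sr=22050, target_sr=16000):
    """Raise ValueError if either file has an unexpected sample rate.

    The defaults 22050 and 16000 are assumed readings of "22k" and "16k".
    """
    for path, expected in ((path_source, source_sr), (path_target, target_sr)):
        actual = wav_sample_rate(path)
        if actual != expected:
            raise ValueError(f"{path}: expected {expected} Hz, got {actual} Hz")
```

If a file fails the check, resample it with your preferred audio tool before passing it to inference.py.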

Discriminator

Note that label 0 means ground truth (gt) and label 1 means generated (gen).

python classify.py or bash classify.sh

import argparse


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--path_audio_conf', type=str, default='configs/audio/22k.yaml',
                        help='path to audio config file')
    parser.add_argument('--path_ckpt', type=str, required=True,
                        help='path to pl checkpoint')
    parser.add_argument('--path_audio_gt', type=str, required=True,
                        help='path to audio with same speaker')
    parser.add_argument('--path_audio_gen', type=str, required=True,
                        help='path to generated audio')
    parser.add_argument('--device', type=str, default='cuda')
    return parser.parse_args()
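
To make the 0=gt, 1=gen convention concrete, the sketch below maps a score to a label. It assumes the score is a probability in [0, 1] that the audio is generated. That is an assumption for illustration, since the exact output format of classify.py is not described here, and the names are hypothetical.

```python
LABELS = {0: "gt", 1: "gen"}


def label_from_probability(p_gen, threshold=0.5):
    """Map an assumed probability of 'generated' to (label, name)."""
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must be a probability in [0, 1]")
    # Scores at or above the threshold count as generated (label 1).
    label = int(p_gen >= threshold)
    return label, LABELS[label]
```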

License

NEEDS WORK

BSD 3-Clause License.

model/hifi_gan.py, utils/mel.py, and the pretrained checkpoints are copied or modified from https://github.com/jik876/hifi-gan (MIT License).
The Wav2Vec2 (MIT License) pretrained checkpoint is the version ported to HuggingFace (Apache License 2.0).

References

Choi, Hyeong-Seok, et al. "Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations."
Baevski, Alexei, et al. "wav2vec 2.0: A framework for self-supervised learning of speech representations."
Desplanques, Brecht, Jenthe Thienpondt, and Kris Demuynck. "Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification."
Chen, Mingjian, et al. "Adaspeech: Adaptive text to speech for custom voice."
Cookbook formulae for audio equalizer biquad filter coefficients

This implementation uses code and data from the following repositories:

Provided Checkpoints are trained from:

Special Thanks

MINDsLab Inc. for GPU support

Special Thanks to:

for help with Audio-domain knowledge

Comments

Different wav2vec layers used

The speaker input should use layer 1 and the linguistic input should use layer 12 (as in the paper). I noticed that in your implementation it is the other way around. Did you get satisfactory results despite that?

opened by PiotrDabkowski 5

Pitch shifting is not done

Hello, and thank you for your amazing work, especially for comments like this and this; they are super helpful. I have a question about the apply_formant_and_pitch_shift function from here: after computing newMedian here, its value is not used anywhere, and the Parselmouth function "Shift frequencies" (something like call(pitch_tier, "Shift frequencies", sound.xmin, sound.xmax, -30, "Hertz")) is not called either. Am I missing something?

opened by AndreyBocharnikov 1

Preprocessing before feeding wav2vec2.0

First, I appreciate your open-source contributions.

huggingface:Wav2Vec2ForPretraining and huggingface:Wav2Vec2Model use Wav2Vec2Processor or Wav2Vec2FeatureExtractor as their preprocessor. In git:huggingface/transformers-Wav2VecFeatureExtractor#L222, the preprocessor normalizes the inputs with their mean and variance if do_normalize is True. facebook/wav2vec2-large-xlsr-53 also enables do_normalize.

>>> ext = Wav2Vec2FeatureExtractor.from_pretrained('facebook/wav2vec2-large-xlsr-53')
>>> ext
Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}

I traced your code but could not find any audio normalization routine before the audio is fed to wav2vec2.0 (before linguistic, custom dataset, or inference). Does wav2vec2 work without this audio preprocessing?
opened by revsic 0

Not able to replicate results. Colab notebook available to check it

Hi! The project looks really awesome, thanks for sharing it. I made this colab and tried replicating the demos, but I got pretty poor results even with a dataset where both speakers say exactly the same thing.

You can check the poor quality of the audio in the link. Let me know if you see something wrong in my process. I trained it for around 48 hours with a single colab GPU, and the loss value stopped decreasing.

opened by mathigatti 0

Results :(

Are you still getting poor results on voice conversion? Can you share the model checkpoints if possible?

Also, are you willing to work more on this to improve the results?

opened by Prashil99 0

Question about the effect of vc

I ran the code once using vctk, but the conversion didn't work well. Is there any data preprocessing needed? Like VAD? I often see the warning: "PraatWarning: There were no voiced segments found."

opened by tobefans 1

KeyError: 'state_dict'

Traceback (most recent call last):
  File "inference.py", line 322, in <module>
    final_audios = main()
  File "inference.py", line 88, in main
    state_dict = data_ckpt["state_dict"]
KeyError: 'state_dict'

opened by MMMMichaelzhang 1

Owner

Dongho Choi 최동호
