Chunked Autoregressive GAN (CARGAN)

Official implementation of the paper Chunked Autoregressive GAN for Conditional Waveform Synthesis [paper] [companion website]

Table of contents

  • Installation
  • Configuration
  • Inference
  • Reproducing results
  • Running tests
  • Citation

Installation

pip install cargan

Configuration

All configuration is performed in cargan/constants.py. The default configuration is CARGAN. Additional configuration files for experiments described in our paper can be found in config/.
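
As a rough illustration, switching experiments amounts to editing module-level constants. The names below are referenced elsewhere in this README, but the values shown are assumptions; consult cargan/constants.py for the authoritative ones.

# Illustrative excerpt of cargan/constants.py (values are assumptions)
AUTOREGRESSIVE = True    # False selects the non-autoregressive ablation
SAMPLE_RATE = 22050      # audio sample rate used for training and inference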

Inference

CLI

Infer from audio files on disk. audio_files and output_files can be lists of files to perform batch inference.

python -m cargan \
    --audio_files <audio_files> \
    --output_files <output_files> \
    --checkpoint <checkpoint> \
    --gpu <gpu>

Infer from feature files on disk. feature_files and output_files can be lists of files to perform batch inference.

python -m cargan \
    --feature_files <feature_files> \
    --output_files <output_files> \
    --checkpoint <checkpoint> \
    --gpu <gpu>

API

cargan.from_audio

"""Perform vocoding from audio

Arguments
    audio : torch.Tensor(shape=(1, samples))
        The audio to vocode
    sample_rate : int
        The audio sample rate
    gpu : int or None
        The index of the gpu to use

Returns
    vocoded : torch.Tensor(shape=(1, samples))
        The vocoded audio
"""

cargan.from_audio_file_to_file

"""Perform vocoding from audio file and save to file

Arguments
    audio_file : Path
        The audio file to vocode
    output_file : Path
        The location to save the vocoded audio
    checkpoint : Path
        The generator checkpoint
    gpu : int or None
        The index of the gpu to use
"""

cargan.from_audio_files_to_files

"""Perform vocoding from audio files and save to files

Arguments
    audio_files : list(Path)
        The audio files to vocode
    output_files : list(Path)
        The locations to save the vocoded audio
    checkpoint : Path
        The generator checkpoint
    gpu : int or None
        The index of the gpu to use
"""

cargan.from_features

"""Perform vocoding from features

Arguments
    features : torch.Tensor(shape=(1, cargan.NUM_FEATURES, frames))
        The features to vocode
    gpu : int or None
        The index of the gpu to use

Returns
    vocoded : torch.Tensor(shape=(1, cargan.HOPSIZE * frames))
        The vocoded audio
"""

cargan.from_feature_file_to_file

"""Perform vocoding from feature file and save to disk

Arguments
    feature_file : Path
        The feature file to vocode
    output_file : Path
        The location to save the vocoded audio
    checkpoint : Path
        The generator checkpoint
    gpu : int or None
        The index of the gpu to use
"""

cargan.from_feature_files_to_files

"""Perform vocoding from feature files and save to disk

Arguments
    feature_files : list(Path)
        The feature files to vocode
    output_files : list(Path)
        The locations to save the vocoded audio
    checkpoint : Path
        The generator checkpoint
    gpu : int or None
        The index of the gpu to use
"""

Reproducing results

For the following subsections, the arguments are as follows

  • checkpoint - Path to an existing checkpoint on disk
  • datasets - A list of datasets to use. Supported datasets are vctk, daps, cumsum, and musdb.
  • gpu - The index of the gpu to use
  • gpus - A list of indices of gpus to use for distributed data parallelism (DDP)
  • name - The name to give to an experiment or evaluation
  • num - The number of samples to evaluate

Download

Downloads, unzips, and formats datasets. Stores datasets in data/datasets/. Stores formatted datasets in data/cache/.

python -m cargan.data.download --datasets <datasets>
vctk must be downloaded before cumsum.

Preprocess

Prepares features for training. Features are stored in data/cache/.

python -m cargan.preprocess --datasets <datasets> --gpu <gpu>
Running this step is not required for the cumsum experiment.

Partition

Partitions a dataset into training, validation, and testing partitions. You should not need to run this, as the partitions used in our work are provided for each dataset in cargan/assets/partitions/.

python -m cargan.partition --datasets <datasets>
The optional --overwrite flag forces the existing partition to be overwritten.

Train

Trains a model. Checkpoints and logs are stored in runs/.

python -m cargan.train \
    --name <name> \
    --datasets <datasets> \
    --gpus <gpus>
You can optionally specify a --checkpoint option pointing to the directory of a previous run. The most recent checkpoint will automatically be loaded and training will resume from that checkpoint. You can overwrite a previous training by passing the --overwrite flag.

You can monitor training via tensorboard as follows.

tensorboard --logdir runs/ --port <port>

Evaluate

Objective

Reports the pitch RMSE (in cents), periodicity RMSE, and voiced/unvoiced F1 score. Results are both printed and stored in eval/objective/.
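
For reference, pitch error in cents compares frequencies on a log scale (1200 cents per octave). A minimal sketch of the metric follows; it is not necessarily the exact implementation used in this repository.

import torch

def pitch_rmse_cents(true_hz, pred_hz):
    # 1200 * log2(f_pred / f_true) is the pitch error in cents;
    # in practice this is evaluated only on voiced frames
    cents = 1200 * torch.log2(pred_hz / true_hz)
    return torch.sqrt(torch.mean(cents ** 2))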

python -m cargan.evaluate.objective \
    --name <name> \
    --datasets <datasets> \
    --checkpoint <checkpoint> \
    --num <num> \
    --gpu <gpu>

Subjective

Generates samples for subjective evaluation. Also performs benchmarking of inference speed. Results are stored in eval/subjective/.

python -m cargan.evaluate.subjective \
    --name <name> \
    --datasets <datasets> \
    --checkpoint <checkpoint> \
    --num <num> \
    --gpu <gpu>

Receptive field

Get the size of the (non-causal) receptive field of the generator. cargan.AUTOREGRESSIVE must be False to use this.

python -m cargan.evaluate.receptive_field
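
To sanity-check the reported number empirically, one generic probe (an assumption, not necessarily this repository's method) backpropagates from a single output sample of a 1-D convolutional model and counts the inputs that receive nonzero gradient.

import torch

def empirical_receptive_field(model, length=16384):
    # Feed a zero signal that tracks gradients
    x = torch.zeros(1, 1, length, requires_grad=True)
    y = model(x)

    # Backpropagate from one output sample near the center
    y[0, 0, y.shape[-1] // 2].backward()

    # Inputs with nonzero gradient lie inside the receptive field
    return int((x.grad.abs().squeeze() > 0).sum())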

Running tests

pip install pytest
pytest

Citation

IEEE

M. Morrison, R. Kumar, K. Kumar, P. Seetharaman, A. Courville, and Y. Bengio, "Chunked Autoregressive GAN for Conditional Waveform Synthesis," Submitted to ICLR 2022, April 2022.

BibTeX

@inproceedings{morrison2022chunked,
    title={Chunked Autoregressive GAN for Conditional Waveform Synthesis},
    author={Morrison, Max and Kumar, Rithesh and Kumar, Kundan and Seetharaman, Prem and Courville, Aaron and Bengio, Yoshua},
    booktitle={Submitted to ICLR 2022},
    month={April},
    year={2022}
}
Comments
  • Will discriminator weights be released?

    It would be helpful for finetuning. If not, maybe HiFi-GAN's Universal V1 discriminator could be used, though I'm not sure how much the changed feature matching/mel-spectrogram loss weighting will impact things.

    opened by PluieElectrique 3
  • Pitch Losses

    Hello, first of all thanks for sharing your work results and all the implementation.

    I noticed that the code implements a PitchLoss term, but it is not used in any of the configs and is not mentioned in the article. I also saw that you implemented a PitchDiscriminator, but I did not notice any results from using it.

    Would you mind commenting on the results of using pitch as part of vocoder loss?

    opened by Whyki 2
  • Pip package is missing submodules

    I tried to import cargan after running pip install cargan, but from . import model failed because the model module could not be found. Indeed, on PyPI, the 0.0.2 wheel and tar.gz only have the following source files:

    cargan/__init__.py
    cargan/__main__.py
    cargan/constants.py
    cargan/core.py
    cargan/load.py
    cargan/partition.py
    cargan/train.py
    

    This seems like a setup.py issue. Maybe find_packages() should be used. Or, submodules should be listed out explicitly (since find_packages() might include tests).

    opened by PluieElectrique 1
  • Versions of torch and torchaudio to use on Colab?

    UPDATE: !pip install torch==1.10.2 torchaudio==0.10.2 did the trick. Still not sure about how to use TensorBoard but closing this issue as my goal was to at least run training on Colab.


    Hi,

    This may be pretty Google-Colab-specific but I would appreciate guidance.

    On Colab, I was trying to train CARGAN on VCTK. I ran into an exception on line 70 of train.py (writer = SummaryWriter(str(directory))). Exception pasted below:

    [libprotobuf FATAL google/protobuf/stubs/common.cc:87] This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.17.3).  Contact the program author for an update.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
    terminate called after throwing an instance of 'google::protobuf::FatalException'
    

    Along the lines of this error message, I tried installing libprotobuf 3.9, but then got some sort of low-level C error (I'm forgetting details but can reproduce if helpful). Rather than investigate I commented out all the references to the writer object as I wanted to just get training to work as a first step (even w/o TensorBoard monitoring).

    That allowed me to get further, to line 523 of train.py (metrics.update(x_t, x_pred_t)), but this resulted in AttributeError: module 'torchaudio.functional' has no attribute 'magphase' on line 115 of metrics.py.

    I assume this is a torchaudio version issue, so I did !pip uninstall torchaudio and then ran !pip install -e . from the repo root to reinstall it via setup.py, but got the same exception. I believe the old (already installed) torchaudio version was 0.12.1+cu113 and the reinstalled version was then 0.12.1+cu102. Colab appears to have CUDA 11.1 installed.

    Anyways, I suppose I'm asking, does anyone have a recommendation of versions of torchaudio (and perhaps torch) to install to have the least chance of issues along these lines? Appreciate any and all help greatly.

    opened by rohitgupta3 0
  • Pass sample rate to from_audio

    Currently, core.from_audio_file_to_file does not pass the sample rate to core.from_audio. This causes the checkpoint path to be interpreted as the sample rate, which throws an error.

    opened by PluieElectrique 0
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • Poor results on Mandarin singing voice data

    Thank you for your work. I used this repository to experiment on a Mandarin singing voice dataset, and the result after 500k training steps is not satisfactory. The main problem is that the spectrogram looks like chunks stitched together one by one, with very obvious vertical streaks (clearly audible).

    I am using the default hyperparameter configuration, how should I avoid this problem?

    opened by WelkinYang 1
  • TypeError: can't convert np.ndarray of type numpy.uint16.

    When I ran the code on my own dataset with python -m cargan.preprocess --dataset ljspeech, an error occurred:

    Traceback (most recent call last):
    File "XX/anaconda3/envs/cargan/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
    File "XX/anaconda3/envs/cargan/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
    File "XX/models/cargan/cargan/preprocess/main.py", line 26, in
    cargan.preprocess.datasets(**vars(parse_args()))
    File "XX/models/cargan/cargan/preprocess/core.py", line 37, in datasets
    mels, pitch, periodicity = from_audio(audio, gpu=gpu)
    File "XX/models/cargan/cargan/preprocess/core.py", line 62, in from_audio
    pitch, periodicity = cargan.preprocess.pitch.from_audio(
    File "XX/models/cargan/cargan/preprocess/pitch.py", line 38, in from_audio
    pitch, periodicity = torchcrepe.predict(
    File "XX/anaconda3/envs/cargan/lib/python3.8/site-packages/torchcrepe-0.0.15-py3.8.egg/torchcrepe/core.py", line 127, in predict
    result = postprocess(probabilities,
    File "XX/anaconda3/envs/cargan/lib/python3.8/site-packages/torchcrepe-0.0.15-py3.8.egg/torchcrepe/core.py", line 605, in postprocess
    bins, pitch = decoder(probabilities)
    File "XX/anaconda3/envs/cargan/lib/python3.8/site-packages/torchcrepe-0.0.15-py3.8.egg/torchcrepe/decode.py", line 76, in viterbi
    bins = torch.tensor(bins, device=probs.device)
    TypeError: can't convert np.ndarray of type numpy.uint16. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

    I guess it is cause by

    # Perform viterbi decoding
    bins = [librosa.sequence.viterbi(sequence, viterbi.transition)
            for sequence in sequences]

    # Convert to pytorch
    bins = torch.tensor(bins, device=probs.device)
    

    in torchcrepe\decode.py

    The datatype of bins is numpy.uint16. Do I need to modify the code in torchcrepe?

    opened by zerlinwang 3
  • Training models with 24000 Hz audio data

    Thank you for your nice work! If I want to train CARGAN on 24000 Hz audio data, besides SAMPLE_RATE in cargan/constants.py, what other parts of the code do I need to modify?

    opened by zerlinwang 3
  • Discriminator weights

    I saw someone ask for these weights a few months ago and was just curious whether they will get released, or if there are any updates? Appreciate it, and great work on speeding up the training time significantly.

    opened by pranavmalikk 0
  • about the ar loop?

    From the code: https://github.com/descriptinc/cargan/blob/61051faea3b8fffe0b02bf47d1737b5859633d99/cargan/core.py#L212

    Each chunk of output samples is added to signals. But in the for loop (https://github.com/descriptinc/cargan/blob/61051faea3b8fffe0b02bf47d1737b5859633d99/cargan/core.py#L207) we have feat_hop; as I understand it, it will cumsum over the signals, but we only need the first feat_hop * hop_size samples, right?

    opened by azraelkuan 2