Chunked Autoregressive GAN (CARGAN)

Official implementation of the paper Chunked Autoregressive GAN for Conditional Waveform Synthesis [paper] [companion website]

Table of contents

  • Installation
  • Configuration
  • Inference
  • Reproducing results
  • Running tests
  • Citation

Installation

pip install cargan

Configuration

All configuration is performed in cargan/constants.py. The default configuration is CARGAN. Additional configuration files for experiments described in our paper can be found in config/.
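
As a rough illustration, switching experiments amounts to editing module-level constants. The names below are referenced elsewhere in this README, but the values shown are assumptions; consult cargan/constants.py for the authoritative ones.

# Illustrative excerpt of cargan/constants.py (values are assumptions)
AUTOREGRESSIVE = True    # False selects the non-autoregressive ablation
SAMPLE_RATE = 22050      # audio sample rate used for training and inference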

Inference

CLI

Infer from audio files on disk. audio_files and output_files can be lists of files to perform batch inference.

python -m cargan \
    --audio_files <audio_files> \
    --output_files <output_files> \
    --checkpoint <checkpoint> \
    --gpu <gpu>

Infer from feature files on disk. feature_files and output_files can be lists of files to perform batch inference.

python -m cargan \
    --feature_files <feature_files> \
    --output_files <output_files> \
    --checkpoint <checkpoint> \
    --gpu <gpu>

API

cargan.from_audio

"""Perform vocoding from audio

Arguments
    audio : torch.Tensor(shape=(1, samples))
        The audio to vocode
    sample_rate : int
        The audio sample rate
    gpu : int or None
        The index of the gpu to use

Returns
    vocoded : torch.Tensor(shape=(1, samples))
        The vocoded audio
"""

cargan.from_audio_file_to_file

"""Perform vocoding from audio file and save to file

Arguments
    audio_file : Path
        The audio file to vocode
    output_file : Path
        The location to save the vocoded audio
    checkpoint : Path
        The generator checkpoint
    gpu : int or None
        The index of the gpu to use
"""

cargan.from_audio_files_to_files

"""Perform vocoding from audio files and save to files

Arguments
    audio_files : list(Path)
        The audio files to vocode
    output_files : list(Path)
        The locations to save the vocoded audio
    checkpoint : Path
        The generator checkpoint
    gpu : int or None
        The index of the gpu to use
"""

cargan.from_features

"""Perform vocoding from features

Arguments
    features : torch.Tensor(shape=(1, cargan.NUM_FEATURES, frames))
        The features to vocode
    gpu : int or None
        The index of the gpu to use

Returns
    vocoded : torch.Tensor(shape=(1, cargan.HOPSIZE * frames))
        The vocoded audio
"""

cargan.from_feature_file_to_file

"""Perform vocoding from feature file and save to disk

Arguments
    feature_file : Path
        The feature file to vocode
    output_file : Path
        The location to save the vocoded audio
    checkpoint : Path
        The generator checkpoint
    gpu : int or None
        The index of the gpu to use
"""

cargan.from_feature_files_to_files

"""Perform vocoding from feature files and save to disk

Arguments
    feature_files : list(Path)
        The feature files to vocode
    output_files : list(Path)
        The locations to save the vocoded audio
    checkpoint : Path
        The generator checkpoint
    gpu : int or None
        The index of the gpu to use
"""

Reproducing results

For the following subsections, the arguments are as follows

  • checkpoint - Path to an existing checkpoint on disk
  • datasets - A list of datasets to use. Supported datasets are vctk, daps, cumsum, and musdb.
  • gpu - The index of the gpu to use
  • gpus - A list of indices of gpus to use for distributed data parallelism (DDP)
  • name - The name to give to an experiment or evaluation
  • num - The number of samples to evaluate

Download

Downloads, unzips, and formats datasets. Stores datasets in data/datasets/. Stores formatted datasets in data/cache/.

python -m cargan.data.download --datasets <datasets>
vctk must be downloaded before cumsum.

Preprocess

Prepares features for training. Features are stored in data/cache/.

python -m cargan.preprocess --datasets <datasets> --gpu <gpu>
Running this step is not required for the cumsum experiment.

Partition

Partitions a dataset into training, validation, and testing partitions. You should not need to run this, as the partitions used in our work are provided for each dataset in cargan/assets/partitions/.

python -m cargan.partition --datasets <datasets>
The optional --overwrite flag forces the existing partition to be overwritten.

Train

Trains a model. Checkpoints and logs are stored in runs/.

python -m cargan.train \
    --name <name> \
    --datasets <datasets> \
    --gpus <gpus>
You can optionally specify a --checkpoint option pointing to the directory of a previous run. The most recent checkpoint will automatically be loaded and training will resume from that checkpoint. You can overwrite a previous training by passing the --overwrite flag.

You can monitor training via tensorboard as follows.

tensorboard --logdir runs/ --port <port>

Evaluate

Objective

Reports the pitch RMSE (in cents), periodicity RMSE, and voiced/unvoiced F1 score. Results are both printed and stored in eval/objective/.
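
For reference, pitch error in cents compares frequencies on a log scale (1200 cents per octave). A minimal sketch of the metric follows; it is not necessarily the exact implementation used in this repository.

import torch

def pitch_rmse_cents(true_hz, pred_hz):
    # 1200 * log2(f_pred / f_true) is the pitch error in cents;
    # in practice this is evaluated only on voiced frames
    cents = 1200 * torch.log2(pred_hz / true_hz)
    return torch.sqrt(torch.mean(cents ** 2))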

python -m cargan.evaluate.objective \
    --name <name> \
    --datasets <datasets> \
    --checkpoint <checkpoint> \
    --num <num> \
    --gpu <gpu>

Subjective

Generates samples for subjective evaluation. Also performs benchmarking of inference speed. Results are stored in eval/subjective/.

python -m cargan.evaluate.subjective \
    --name <name> \
    --datasets <datasets> \
    --checkpoint <checkpoint> \
    --num <num> \
    --gpu <gpu>

Receptive field

Get the size of the (non-causal) receptive field of the generator. cargan.AUTOREGRESSIVE must be False to use this.

python -m cargan.evaluate.receptive_field
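
To sanity-check the reported number empirically, one generic probe (an assumption, not necessarily this repository's method) backpropagates from a single output sample of a 1-D convolutional model and counts the inputs that receive nonzero gradient.

import torch

def empirical_receptive_field(model, length=16384):
    # Feed a zero signal that tracks gradients
    x = torch.zeros(1, 1, length, requires_grad=True)
    y = model(x)

    # Backpropagate from one output sample near the center
    y[0, 0, y.shape[-1] // 2].backward()

    # Inputs with nonzero gradient lie inside the receptive field
    return int((x.grad.abs().squeeze() > 0).sum())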

Running tests

pip install pytest
pytest

Citation

IEEE

M. Morrison, R. Kumar, K. Kumar, P. Seetharaman, A. Courville, and Y. Bengio, "Chunked Autoregressive GAN for Conditional Waveform Synthesis," Submitted to ICLR 2022, April 2022.

BibTeX

@inproceedings{morrison2022chunked,
    title={Chunked Autoregressive GAN for Conditional Waveform Synthesis},
    author={Morrison, Max and Kumar, Rithesh and Kumar, Kundan and Seetharaman, Prem and Courville, Aaron and Bengio, Yoshua},
    booktitle={Submitted to ICLR 2022},
    month={April},
    year={2022}
}
Comments
  • Will discriminator weights be released?

    It would be helpful for finetuning. If not, maybe HiFi-GAN's Universal V1 discriminator could be used, though I'm not sure how much the changed feature matching/mel-spectrogram loss weighting will impact things.

    opened by PluieElectrique 3
  • Pitch Losses

    Hello, first of all thanks for sharing your work results and all the implementation.

    I noticed that the code implements a PitchLoss term, but it is not used in any of the configs and is not mentioned in the article. I also saw that you implemented a PitchDiscriminator, but I did not notice any results from using it.

    Would you mind commenting on the results of using pitch as part of vocoder loss?

    opened by Whyki 2
  • Pip package is missing submodules

    I tried to import cargan after running pip install cargan, but from . import model failed because the model module could not be found. Indeed, on PyPI, the 0.0.2 wheel and tar.gz only have the following source files:

    cargan/__init__.py
    cargan/__main__.py
    cargan/constants.py
    cargan/core.py
    cargan/load.py
    cargan/partition.py
    cargan/train.py
    

    This seems like a setup.py issue. Maybe find_packages() should be used. Or, submodules should be listed out explicitly (since find_packages() might include tests).

    opened by PluieElectrique 1
  • Versions of torch and torchaudio to use on Colab?

    UPDATE: !pip install torch==1.10.2 torchaudio==0.10.2 did the trick. Still not sure about how to use TensorBoard but closing this issue as my goal was to at least run training on Colab.


    Hi,

    This may be pretty Google-Colab-specific but I would appreciate guidance.

    On Colab, I was trying to train CARGAN on VCTK. I ran into an exception on line 70 of train.py (writer = SummaryWriter(str(directory))). Exception pasted below:

    [libprotobuf FATAL google/protobuf/stubs/common.cc:87] This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.17.3).  Contact the program author for an update.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
    terminate called after throwing an instance of 'google::protobuf::FatalException'
    

    Along the lines of this error message, I tried installing libprotobuf 3.9, but then got some sort of low-level C error (I'm forgetting details but can reproduce if helpful). Rather than investigate I commented out all the references to the writer object as I wanted to just get training to work as a first step (even w/o TensorBoard monitoring).

    That allowed me to get further, to line 523 of train.py (metrics.update(x_t, x_pred_t)), but this resulted in AttributeError: module 'torchaudio.functional' has no attribute 'magphase' on line 115 of metrics.py.

    I assume this is a torchaudio version issue, so I did !pip uninstall torchaudio and then ran !pip install -e . from the repo root to reinstall it via setup.py, but got the same exception. I believe the old (already installed) torchaudio version was 0.12.1+cu113 and the reinstalled version was then 0.12.1+cu102. Colab appears to have CUDA 11.1 installed.

    Anyways, I suppose I'm asking, does anyone have a recommendation of versions of torchaudio (and perhaps torch) to install to have the least chance of issues along these lines? Appreciate any and all help greatly.

    opened by rohitgupta3 0
  • Pass sample rate to from_audio

    Currently, core.from_audio_file_to_file does not pass the sample rate to core.from_audio. This causes the checkpoint path to be interpreted as the sample rate, which throws an error.

    opened by PluieElectrique 0
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • Poor results on Mandarin singing voice data

    Thank you for your work. I used this repository to experiment on a Mandarin singing voice dataset, and the result after 500k training steps is not satisfactory. The main problem is that the spectrogram looks like chunks stitched together one by one, with very obvious vertical streaks (clearly audible).

    I am using the default hyperparameter configuration, how should I avoid this problem?

    opened by WelkinYang 1
  • TypeError: can't convert np.ndarray of type numpy.uint16.

    When I ran the code on my own dataset with python -m cargan.preprocess --dataset ljspeech, an error occurred:

    Traceback (most recent call last):
    File "XX/anaconda3/envs/cargan/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
    File "XX/anaconda3/envs/cargan/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
    File "XX/models/cargan/cargan/preprocess/main.py", line 26, in
    cargan.preprocess.datasets(**vars(parse_args()))
    File "XX/models/cargan/cargan/preprocess/core.py", line 37, in datasets
    mels, pitch, periodicity = from_audio(audio, gpu=gpu)
    File "XX/models/cargan/cargan/preprocess/core.py", line 62, in from_audio
    pitch, periodicity = cargan.preprocess.pitch.from_audio(
    File "XX/models/cargan/cargan/preprocess/pitch.py", line 38, in from_audio
    pitch, periodicity = torchcrepe.predict(
    File "XX/anaconda3/envs/cargan/lib/python3.8/site-packages/torchcrepe-0.0.15-py3.8.egg/torchcrepe/core.py", line 127, in predict
    result = postprocess(probabilities,
    File "XX/anaconda3/envs/cargan/lib/python3.8/site-packages/torchcrepe-0.0.15-py3.8.egg/torchcrepe/core.py", line 605, in postprocess
    bins, pitch = decoder(probabilities)
    File "XX/anaconda3/envs/cargan/lib/python3.8/site-packages/torchcrepe-0.0.15-py3.8.egg/torchcrepe/decode.py", line 76, in viterbi
    bins = torch.tensor(bins, device=probs.device)
    TypeError: can't convert np.ndarray of type numpy.uint16. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

    I guess it is cause by

    # Perform viterbi decoding
    bins = [librosa.sequence.viterbi(sequence, viterbi.transition)
            for sequence in sequences]

    # Convert to pytorch
    bins = torch.tensor(bins, device=probs.device)
    

    in torchcrepe\decode.py

    The datatype of bins is numpy.uint16. Do I need to modify the code in torchcrepe?

    opened by zerlinwang 3
  • Training models with 24000 Hz audio data

    Thank you for your nice work! If I want to train CARGAN on 24000 Hz audio data, besides SAMPLE_RATE in cargan/constants.py, what other parts of the code do I need to modify?

    opened by zerlinwang 3
  • Discriminator weights

    I saw someone ask for these weights a few months ago and was just curious whether they will get released, or if there are any updates? Appreciate it, and great work on speeding up the training time significantly.

    opened by pranavmalikk 0
  • about the ar loop?

    From the code: https://github.com/descriptinc/cargan/blob/61051faea3b8fffe0b02bf47d1737b5859633d99/cargan/core.py#L212

    Each chunk of output samples is added to signals. But in the for loop (https://github.com/descriptinc/cargan/blob/61051faea3b8fffe0b02bf47d1737b5859633d99/cargan/core.py#L207) we have feat_hop; as I understand it, it will cumsum over the signals, but we only need the first feat_hop * hop_size samples, right?

    opened by azraelkuan 2