The official implementation of the Interspeech 2021 paper WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution.

Overview

WSRGlow

The official implementation of the Interspeech 2021 paper WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution. Audio samples can be found here.

Feel free to create issues or send an email to [email protected] if you have problems running the code.

Before running the code, you need to install the dependencies with pip install -r requirements.txt.

The configs for the model architecture and training scheme are saved in config.yaml. You can override some of the attributes by adding the --hparams flag when running a command. The general way to run a Python script is

python $SRC$ --config $CONFIG$ --hparams $KEY1$=$VALUE1$,$KEY2$=$VALUE2$,...

See hparams.py for more details.
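
For example, to point training at a specific working directory from the command line (work_dir is a key referenced elsewhere on this page; the value below is only illustrative):

python train.py --config config.yaml --hparams work_dir=checkpoints/wsrglow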

To prepare data

Before training, you need to binarize the data first. The raw wav files should be placed in hparams['raw_data_path']; the binarized data will be written to hparams['binary_data_path'].

Specifically, for the VCTK corpus, the file structure should look like

.
|--data
    |--raw
        |--VCTK-Corpus
            |--wav48
                |--$WAVS
|--checkpoints
    |--wsrglow
    

where the model checkpoints are in checkpoints/wsrglow.

The command to binarize is

python binarizer.py --config config.yaml
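
If you keep the directory layout above, the data locations can presumably also be overridden from the command line in the same way; the key names raw_data_path and binary_data_path come from the paragraph above, while the values below are only illustrative:

python binarizer.py --config config.yaml --hparams raw_data_path=data/raw/VCTK-Corpus/wav48,binary_data_path=data/binary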

To modify the architecture of the model

The current WSRGlow model in model.py is designed for x4 super-resolution and takes waveform, spectrogram and phase information as input.
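
As a rough sketch, the model can be constructed from the waveglow_config section of the hparams, mirroring the usage shown in the comments further down this page (the import paths below are assumptions about the module layout, not confirmed by the repo):

# Minimal sketch: build the model from config.yaml.
# Assumptions: set_hparams/hparams live in hparams.py and WaveGlowMelHF in model.py,
# following the snippet posted in the comments below.
from hparams import set_hparams, hparams
from model import WaveGlowMelHF

set_hparams(config='config.yaml')
model = WaveGlowMelHF(**hparams['waveglow_config']).cuda()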

To train

Run python train.py --config config.yaml on a GPU.
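
If training exits immediately with FileNotFoundError: [Errno 2] No such file or directory: '' (see the comments below), passing an experiment name so that a non-empty work directory gets created should help, for example:

python train.py --config config.yaml --exp_name WSRGlow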

To infer

Change the code in infer.py to specify the checkpoint you want to load and the sample inputs you want to use for inference (i.e., point it at the correct checkpoint and wav file paths), then run python infer.py --config config.yaml on a GPU.
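
A minimal sketch of what such an inference run might look like; the helper names (set_hparams, load_ckpt, load_wav, run) and the star import follow the snippet posted in the comments below, and the file paths here are hypothetical:

from infer import *  # brings in set_hparams, hparams, load_ckpt, load_wav, run, torch, sf per the comment below

set_hparams(config='config.yaml')
model = WaveGlowMelHF(**hparams['waveglow_config']).cuda()
load_ckpt(model, 'checkpoints/wsrglow/model_ckpt_best.pt')  # hypothetical checkpoint path
model.eval()

lr, sr = load_wav('sample_lr.wav')  # hypothetical low-resolution input
with torch.no_grad():
    pred = run(model, lr, sigma=1)
# Write the prediction at the upsampled rate (x2 here; use sr * 4 for the default x4 model).
sf.write('pred_sample_lr.wav', pred, sr * 2)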

Comments
  • Add Cog config and web demo

    Hi @zkx06111! 👋

    This pull request adds an interactive web demo of the 2x upsampling checkpoint, based on your Colab notebook. You can try it out here: https://replicate.ai/zkx06111/wsrglow

    Under the hood I've used an open source tool called Cog to build a Docker image, which is run by the Replicate servers. The Docker image can also be downloaded from the website for people who want to use WSRGlow from the command line without installing any Python dependencies.

    If you click the "Sign in with GitHub" link you can edit the page and add more examples, and we'll feature your model on the Explore page.

    In case you wonder who I am, I used to be a PhD student working on music source separation, and now I'm working on Replicate. I used to struggle to get baseline models working in my research, so we're building Cog and Replicate to make it easier to package trained models in a reproducible way. As part of that we're going around the internet making demos for our favorite models.

    opened by andreasjansson 1
  • Is it possible to implement reading any other files than WAV? (e.g. MKA (Matroska) files)

    On Google Colab and replicate.com virtual machines, a 44 kHz stereo WAV file has to be chunked into 53-second parts; otherwise it throws a CUDA out-of-memory error.

    I find it very convenient to chunk WAVs using MKVToolnix, which losslessly places the WAV inside a Matroska container. I work around the MKA issue by using MKVExtractGUI-2 with version 20 of MKVToolnix (the only compatible one). I also used LosslessCut for plain WAVs, but it's more cumbersome and cannot automatically chunk a file every 53 seconds. At least it has a merge option.
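
    A short sketch of one way to do that chunking directly in Python with soundfile (sf is already used elsewhere on this page); the helper below is purely illustrative:

    import soundfile as sf

    def chunk_wav(path, seconds=53):
        # Split a WAV into fixed-length parts so each piece fits in GPU memory.
        data, sr = sf.read(path)
        step = seconds * sr
        for i in range(0, len(data), step):
            sf.write(f'{path[:-4]}_part{i // step:03d}.wav', data[i:i + step], sr)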

    opened by deton24 0
  • FileNotFoundError: [Errno 2] No such file or directory: ''

    Hi authors, thank you for open-sourcing this great repository.

    I ran python train.py --config config.yaml, and got this error: FileNotFoundError: [Errno 2] No such file or directory: ''

    Traceback (most recent call last):
      File "/home/wschoi/PycharmProjects/WSRGlow/train.py", line 345, in <module>
        WaveGlowTask4.start()
      File "/home/wschoi/PycharmProjects/WSRGlow/train.py", line 274, in start
        period=1 if hparams['save_ckpt'] else 100000
      File "/home/wschoi/PycharmProjects/WSRGlow/training_utils.py", line 23, in __init__
        os.makedirs(filepath, exist_ok=True)
      File "/home/wschoi/miniconda3/envs/wsrglow/lib/python3.7/os.py", line 223, in makedirs
        mkdir(name, mode)
    FileNotFoundError: [Errno 2] No such file or directory: ''
    
    Process finished with exit code 1
    

    I guess this error occurred because args_work_dir is set to '' when args.exp_name is left at its default value.

    https://github.com/zkx06111/WSRGlow/blob/1b8fc4939c72b319efdb520ba2868eacf468ca18/hparams.py#L39-L42

    Then hparams_['work_dir'] is set to args_work_dir regardless of the work_dir in config.yaml.

    https://github.com/zkx06111/WSRGlow/blob/1b8fc4939c72b319efdb520ba2868eacf468ca18/hparams.py#L84-L86
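
    A rough paraphrase of the logic being described (this is a sketch of the behaviour, not the actual hparams.py code):

    # Paraphrase of the described behaviour, not the actual repo code.
    if args.exp_name != '':
        args_work_dir = f'checkpoints/{args.exp_name}'
    else:
        args_work_dir = ''  # default exp_name leaves the work dir empty
    hparams_['work_dir'] = args_work_dir  # overwrites work_dir from config.yaml, even when empty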


    TLDR;

    This error occurs only when args.exp_name == ''.

    For those who want to quickly run train.py, I would recommend a command like the one below.

    python train.py --config config.yaml --exp_name WSRGlow

    opened by ws-choi 0
  • Real world application, upsampling historic recordings?

    Hi, I've been testing your model for a side project I'm working on. I'd like to take early historic recordings (1890s-1920s), denoise them, and upsample them. I've already denoised them (amazingly so!), and I'm trying to upsample using your model, but it doesn't seem to be doing much. I used the code from the Colab notebook and the config that's in the repo.

    Is this not a good application of the model or did I do something incorrectly?

    Here are the results I produced: example_and_prediction_wav_files.zip

    Spectrogram - Top is the example wav (Thomas Edison speaking 1912), bottom is the prediction. I can't hear a discernible difference and I'm well versed in audio engineering.

    from infer import *

    set_hparams(config='config.yaml')
    model = WaveGlowMelHF(**hparams['waveglow_config']).cuda()
    load_ckpt(model, 'model_ckpt_best.pt')
    model.eval()

    fns = ['te_small.wav']
    sigma = 1
    for lr_fn in fns:
        lr, sr = load_wav(lr_fn)
        print(f'sampling rate (lr) = {sr}')
        print(f'lr.shape = {lr.shape}', flush=True)
        with torch.no_grad():
            pred = run(model, lr, sigma=sigma)
        print(lr.shape, pred.shape)
        pred_fn = f'pred_{lr_fn}'
        print(f'sampling rate = {sr * 2}')
        sf.write(open(pred_fn, 'wb'), pred, sr * 2)

    opened by go-dustin 0
  • Having Trouble in training: utils.tensors_to_scalars

    Hello, I'm trying to run your code. I ran train.py with the commands in the README (with the additional argument --hparams work_dir=ccc), but ran into this error.

    File "train.py", line 143, in training_step log_outputs = utils.tensors_to_scalars(log_outputs) AttributeError: module 'utils' has no attribute 'tensors_to_scalars'

    I looked over the commit logs, but utils.py never had that function.

    opened by jc5201 1
  • distorted spectrograms after model

    Hi! I tried your pretrained checkpoint in Colab and got some extra values in the spectrogram in the first case and broken harmonics in the second case. The first audio is 44100 Hz real speech (converted to 24 kHz and then upscaled to 48 kHz). The second audio is the output of a text-to-speech system (22050 Hz, upscaled to 44100 Hz).

    I don't hear any noticeable difference in either audio; is this expected?

    This is the spectrogram representation in Audacity (Mel scale); the upper one is before, the bottom one after.

    opened by thepowerfuldeez 4
Owner
Kexun Zhang
Interested in linguistics. Former participant in programming contests.