Unofficial PyTorch Implementation of UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

MINDs Lab

Last update: Jan 4, 2023

Related tags

Deep Learning text-to-speech deep-learning pytorch tts speech-synthesis gan vocoder

Overview

UnivNet

UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

This is an unofficial PyTorch implementation of Jang et al. (Kakao), UnivNet.

To-Do List

Release checkpoint of pre-trained model
Extract wav samples for audio sample page
Add results including validation loss graph

Key Features

According to the authors of the paper, UnivNet obtained the best objective results among recent GAN-based neural vocoders (including HiFi-GAN) and outperformed HiFi-GAN in a subjective evaluation. The authors also report that its inference is 1.5 times faster than HiFi-GAN.
This repository uses the same mel-spectrogram function as the official HiFi-GAN, which is compatible with NVIDIA/tacotron2.

Our default mel calculation hyperparameters are as below, following the original paper.

audio:
  n_mel_channels: 100
  filter_length: 1024
  hop_length: 256 # WARNING: this can't be changed.
  win_length: 1024
  sampling_rate: 24000
  mel_fmin: 0.0
  mel_fmax: 12000.0

You can modify the hyperparameters to be compatible with your acoustic model.
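
As a quick sanity check when adapting these values, the sketch below derives a few quantities from the defaults listed above (frame rate, hop and window duration, Nyquist limit). The helper name `describe_audio_config` is illustrative and not part of this repository.

```python
# Illustrative helper (not part of the repository): derive timing figures
# from the default audio settings listed above.

DEFAULT_AUDIO = {
    "n_mel_channels": 100,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "sampling_rate": 24000,
    "mel_fmin": 0.0,
    "mel_fmax": 12000.0,
}


def describe_audio_config(audio):
    """Return derived quantities for a mel configuration dict."""
    sr = audio["sampling_rate"]
    return {
        # Number of mel frames produced per second of audio.
        "frames_per_second": sr / audio["hop_length"],
        # Duration of one hop and one analysis window, in milliseconds.
        "hop_ms": 1000.0 * audio["hop_length"] / sr,
        "window_ms": 1000.0 * audio["win_length"] / sr,
        # Highest frequency representable at this sampling rate.
        "nyquist_hz": sr / 2,
        "fmax_within_nyquist": audio["mel_fmax"] <= sr / 2,
    }


if __name__ == "__main__":
    for key, value in describe_audio_config(DEFAULT_AUDIO).items():
        print(key, value)
```

With the defaults this gives 93.75 mel frames per second, and mel_fmax sits exactly at the Nyquist frequency of 12,000 Hz. If you change sampling_rate to match another acoustic model, mel_fmax should stay at or below half of the new rate.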

Prerequisites

The implementation requires the following dependencies.

Python 3.6
PyTorch 1.6.0
NumPy 1.17.4 and SciPy 1.5.4
Install the other dependencies listed in requirements.txt:
```
pip install -r requirements.txt
```

Datasets

Preparing Data

Download the training dataset. Any collection of wav files with a sampling rate of 24,000 Hz will work. The original paper used LibriTTS.
- LibriTTS train-clean-360 split tar.gz link
- Unzip and place its contents under datasets/LibriTTS/train-clean-360.
To use wav files with a different sampling rate, edit the configuration file (see below).

Note: The mel-spectrogram calculated from each audio file is saved to disk as **.mel the first time it is needed, and is loaded from disk afterwards.
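
The compute-once, load-afterwards behavior in the note can be sketched as follows. This is a simplified stand-in, not the repository's loader: it uses pickle instead of PyTorch serialization, and `load_or_compute` and the `compute` callback are hypothetical names. Writing to a temporary file and renaming it is an extra precaution added here so that an interrupted run does not leave a truncated cache file behind.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute):
    """Load a cached value from cache_path, or compute and cache it.

    `compute` is a zero-argument callable (hypothetical) that returns the
    value to cache, e.g. a mel-spectrogram for one wav file.
    """
    # Reuse the cache only if the file exists and is not empty.
    if os.path.exists(cache_path) and os.path.getsize(cache_path) > 0:
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    value = compute()

    # Write to a temporary file in the same directory, then rename it into
    # place, so readers never see a half-written cache file.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(value, f)
        os.replace(tmp_path, cache_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return value
```

Because the cache is keyed only by file path in this sketch, cached files would need to be deleted after changing any mel hyperparameter.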

Preparing Metadata

Following the format from NVIDIA/tacotron2, the metadata should be formatted as:

path_to_wav|transcript|speaker_id
path_to_wav|transcript|speaker_id
...

Train/validation metadata for the LibriTTS train-clean-360 split is already prepared in datasets/metadata. 5% of the train-clean-360 utterances were randomly sampled for validation.
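
For a custom dataset, a comparable random 5% validation split could be produced along these lines. This is a minimal sketch, not the script used to build the provided metadata; the function name and the seed value are assumptions.

```python
import random


def split_metadata(lines, val_fraction=0.05, seed=1234):
    """Randomly split metadata lines into (train, validation) lists.

    The seed is an arbitrary choice that makes the split reproducible.
    """
    # Ignore blank lines so they are never counted as utterances.
    entries = [line for line in lines if line.strip()]
    shuffled = entries[:]
    random.Random(seed).shuffle(shuffled)

    # Keep at least one validation utterance when there is any data.
    n_val = max(1, round(len(shuffled) * val_fraction)) if shuffled else 0
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    demo = ["wavs/{:03d}.wav|text|0".format(i) for i in range(100)]
    train, val = split_metadata(demo)
    print(len(train), len(val))
```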

Since this model is a vocoder, the transcripts are NOT used during training.
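
The metadata format above can be read with a few lines of Python. The sketch below is illustrative only (the names `MetadataEntry` and `parse_metadata` are not from this repository) and assumes that transcripts never contain the `|` character.

```python
from collections import namedtuple

MetadataEntry = namedtuple("MetadataEntry", ["wav_path", "transcript", "speaker_id"])


def parse_metadata(lines):
    """Parse 'path_to_wav|transcript|speaker_id' lines into entries."""
    entries = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("|")
        if len(parts) != 3:
            raise ValueError(
                "line {}: expected 3 '|'-separated fields, got {}".format(lineno, len(parts))
            )
        entries.append(MetadataEntry(*(part.strip() for part in parts)))
    return entries


if __name__ == "__main__":
    sample = ["a/b/utt1.wav|Hello there.|14\n", "a/b/utt2.wav|Good morning.|27\n"]
    for entry in parse_metadata(sample):
        # A vocoder only needs the audio path; the transcript is ignored.
        print(entry.wav_path, entry.speaker_id)
```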

Train

Preparing Configuration Files

Run cp config/default.yaml config/config.yaml and then edit config.yaml.

Set the root paths of the train/validation data in the data section. The data loader recursively parses the list of files within each path.

data:
  train_dir: 'datasets/'	# root path of train data (either relative/absolute path is ok)
  train_meta: 'metadata/libritts_train_clean_360_train.txt'	# relative path of metadata file from train_dir
  val_dir: 'datasets/'		# root path of validation data
  val_meta: 'metadata/libritts_train_clean_360_val.txt'		# relative path of metadata file from val_dir

We provide the default metadata for LibriTTS train-clean-360 split.
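
Going by the comments in the data section, each metadata path is interpreted relative to its root directory. The sketch below shows how the two values would combine; it uses a plain dict in place of the loaded config, and `resolve_metadata_paths` is a hypothetical helper, not a function from this repository.

```python
import os


def resolve_metadata_paths(data_cfg):
    """Join each root directory with its relative metadata path."""
    return {
        "train": os.path.join(data_cfg["train_dir"], data_cfg["train_meta"]),
        "val": os.path.join(data_cfg["val_dir"], data_cfg["val_meta"]),
    }


if __name__ == "__main__":
    data_cfg = {
        "train_dir": "datasets/",
        "train_meta": "metadata/libritts_train_clean_360_train.txt",
        "val_dir": "datasets/",
        "val_meta": "metadata/libritts_train_clean_360_val.txt",
    }
    for split, path in resolve_metadata_paths(data_cfg).items():
        print(split, path)
```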

Modify channel_size in gen to switch between UnivNet-c16 and c32.

gen:
  noise_dim: 64
  channel_size: 32 # 32 or 16
  dilations: [1, 3, 9, 27]
  strides: [8, 8, 4]
  lReLU_slope: 0.2
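
Note that the strides multiply to 8 × 8 × 4 = 256, the same value as hop_length in the audio section. A plausible reading (an assumption, not something this README states) is that the generator upsamples each mel frame by the product of the strides, which would explain the warning that hop_length cannot be changed on its own. The check below is an illustrative sketch with hypothetical function names.

```python
def total_upsampling(strides):
    """Return the product of all upsampling strides."""
    factor = 1
    for stride in strides:
        factor *= stride
    return factor


def check_hop_length(strides, hop_length):
    """Raise ValueError if the strides do not multiply to hop_length."""
    factor = total_upsampling(strides)
    if factor != hop_length:
        raise ValueError(
            "strides multiply to {}, but hop_length is {}".format(factor, hop_length)
        )
    return factor


if __name__ == "__main__":
    print(check_hop_length([8, 8, 4], 256))
```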

Training

python trainer.py -c CONFIG_YAML_FILE -n NAME_OF_THE_RUN

Tensorboard

tensorboard --logdir logs/

If you are running tensorboard on a remote machine, you can make the tensorboard page reachable from other hosts by adding the --bind_all option.

Inference

python inference.py -p CHECKPOINT_PATH -i INPUT_MEL_PATH

Pre-trained Model

A pre-trained model will be released soon. The model was trained on LibriTTS train-clean-360 split.

Results

See audio samples at https://mindslab-ai.github.io/univnet/

Comparison with the results on paper

| Model | MOS | PESQ(↑) | RMSE(↓) |
| --- | --- | --- | --- |
| Recordings | 4.16±0.09 | 4.50 | 0.000 |
| Results in Paper (UnivNet-c32) | 3.93±0.09 | 3.70 | 0.316 |
| Ours (UnivNet-c32) | - | TBD | TBD |

Note

This code is an unofficial implementation, so there may be some differences from the original paper.

Our UnivNet generator has a smaller number of parameters (c32: 5.11M, c16: 1.42M) than the one in the paper (c32: 14.89M, c16: 4.00M). So far, we have not encountered any issues from using a smaller model size. If you run into any problem, please report it as an issue.

Implementation Authors

Implementation authors are:

Special thanks to

License

This code is licensed under BSD 3-Clause License.

We referred to the following code and repositories.

The overall structure of the repository is based on https://github.com/seungwonpark/melgan.
datasets/dataloader.py from https://github.com/NVIDIA/waveglow (BSD 3-Clause License)
model/mpd.py from https://github.com/jik876/hifi-gan (MIT License)
model/lvcnet.py from https://github.com/zceng/LVCNet (Apache License 2.0)
utils/stft_loss.py: Copyright 2019 Tomoki Hayashi (MIT License, https://opensource.org/licenses/MIT)

References

Papers

Datasets

LibriTTS

Comments

GAN loss for the first 200k steps

The paper says

We trained the generator with only auxiliary loss without discriminators in the first 200k steps.

but I don't think your training code reflects that, and starts combining stft_loss and score_loss right off the bat at step 0. Is there any reason behind this modification?
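
For reference, the schedule quoted from the paper could be expressed roughly as follows. This is a hypothetical sketch: the function name, the `adv_weight` parameter, and the way the two terms are combined are assumptions, not code taken from this repository or from the paper.

```python
def generator_loss_at_step(step, aux_loss, adv_loss,
                           warmup_steps=200000, adv_weight=1.0):
    """Combine generator loss terms with an auxiliary-only warm-up.

    Before `warmup_steps`, only the auxiliary (spectral) loss is used.
    From `warmup_steps` onwards, the adversarial term is added.
    """
    if step < warmup_steps:
        return aux_loss
    return aux_loss + adv_weight * adv_loss


if __name__ == "__main__":
    for step in (0, 199999, 200000):
        print(step, generator_loss_at_step(step, aux_loss=1.0, adv_loss=5.0))
```

In a real training loop the discriminator update would be skipped during the warm-up as well, since its scores are not used.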

opened by tebin 2

The cause of mismatch for model params

for your note:

Our UnivNet generator has smaller number of parameters (c32: 5.11M, c16: 1.42M) than the paper (c32: 14.89M, c16: 4.00M). So far, we have not encountered any issues from using a smaller model size. If run into any problem, please report it as an issue.

it should be a mistake in your code: https://github.com/mindslab-ai/univnet/blob/df77c9a37f71e3d6be1b504e16abaf99ce131de3/model/lvcnet.py#L110

The default value should be 3, but you set it to 1 in the code.
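
The thread does not name the parameter on the linked line. If it is a convolution kernel size (an assumption), the effect on model size is easy to estimate, because a 1D convolution has in_channels × out_channels × kernel_size weights plus one bias per output channel. The channel counts in the example are arbitrary and not taken from the model.

```python
def conv1d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Number of trainable parameters in a standard 1D convolution."""
    weights = in_channels * out_channels * kernel_size
    return weights + (out_channels if bias else 0)


if __name__ == "__main__":
    # Arbitrary example sizes, chosen only to show the scaling.
    for k in (1, 3):
        print(k, conv1d_param_count(64, 64, k))
```

Going from a kernel size of 1 to 3 roughly triples the weight count of each affected layer, which is the kind of gap the note describes.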

opened by azraelkuan 2

Config structure mismatch when trainer.py

A simple run with the default config (with all data placed correctly) fails:

$ python trainer.py -c config/default.yaml -n test_run
Traceback (most recent call last):
  File "trainer.py", line 30, in <module>
    assert hp.data.train != '' and hp.data.validation != '', \
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 353, in __getattr__
    self._format_and_raise(
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/base.py", line 190, in _format_and_raise
    format_and_raise(
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/_utils.py", line 821, in format_and_raise
    _raise(ex, cause)
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/_utils.py", line 719, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 438, in _get_impl
    node = self._get_node(key=key, throw_on_missing_key=True)
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 470, in _get_node
    raise ConfigKeyError(f"Missing key {key}")
omegaconf.errors.ConfigAttributeError: Missing key train
    full_key: data.train
    object_type=dict

This assert on line 30 should be changed, I guess.
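
The traceback shows the assert reading hp.data.train and hp.data.validation, whereas the data section documented in the README uses train_dir, train_meta, val_dir, and val_meta. A check written against the documented keys might look like the sketch below. It operates on a plain dict rather than the OmegaConf object, and it is not the repository's actual fix.

```python
REQUIRED_DATA_KEYS = ("train_dir", "train_meta", "val_dir", "val_meta")


def missing_data_keys(data_cfg):
    """Return the required data keys that are absent or empty."""
    return [key for key in REQUIRED_DATA_KEYS if not data_cfg.get(key)]


if __name__ == "__main__":
    data_cfg = {
        "train_dir": "datasets/",
        "train_meta": "metadata/libritts_train_clean_360_train.txt",
        "val_dir": "datasets/",
        "val_meta": "",
    }
    missing = missing_data_keys(data_cfg)
    if missing:
        print("Please fill in these data keys:", ", ".join(missing))
```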

opened by SolomidHero 1

Pitch too high

Hello,

I have trained a voice using your framework. I wanted to use it as a Vocoder for Grad-TTS. Unfortunately the voice that is created as a result is way too high in its pitch.

Could you provide me with a hint or advice how this can happen? Do I need to change some configs or can this happen in the inference? Do I need to pre-process the input wav?
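
The thread records no resolution. One thing worth ruling out (an assumption, not a confirmed diagnosis) is a sampling-rate mismatch between the acoustic model's mel features and this vocoder's 24,000 Hz configuration: features designed for one rate but rendered at a higher rate come out proportionally faster and higher in pitch. The sketch below converts such a rate ratio into semitones; 22,050 Hz is used purely as an illustrative value for the acoustic model.

```python
import math


def pitch_shift_semitones(rendered_rate, intended_rate):
    """Pitch shift, in semitones, when audio intended for `intended_rate`
    is rendered at `rendered_rate` instead."""
    return 12.0 * math.log2(rendered_rate / intended_rate)


if __name__ == "__main__":
    # Illustrative rates only: vocoder at 24,000 Hz, features at 22,050 Hz.
    print(round(pitch_shift_semitones(24000, 22050), 2))
```

A positive result means the output sounds higher than intended. Matching sampling_rate and the other mel hyperparameters to the acoustic model removes this source of error.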

opened by icklerly1 0

Training fail: EOFError: Ran out of input

Hi there, i am getting an error after 1 iteration of training and I cannot figure out the reason.

Do you have any idea what is causing the error EOFError: Ran out of input ? Thanks in advance!

The error looks like this:

Loading train data: 0%| | 0/2732 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "trainer.py", line 44, in <module>
    train(0, args, args.checkpoint_path, hp, hp_str)
  File "/opt/3tbdrive1/products/voicesurfer/02.VoiceTraining/univnet/utils/train.py", line 125, in train
    for mel, audio in loader:
  File "/opt/3tbdrive1/products/voicesurfer/venv/lib/python3.6/site-packages/tqdm/std.py", line 1185, in __iter__
    for obj in iterable:
  File "/opt/3tbdrive1/products/voicesurfer/venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/3tbdrive1/products/voicesurfer/venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/opt/3tbdrive1/products/voicesurfer/venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/opt/3tbdrive1/products/voicesurfer/venv/lib/python3.6/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
EOFError: Caught EOFError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/3tbdrive1/products/voicesurfer/venv/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/3tbdrive1/products/voicesurfer/venv/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/3tbdrive1/products/voicesurfer/venv/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/3tbdrive1/products/voicesurfer/02.VoiceTraining/univnet/datasets/dataloader.py", line 61, in __getitem__
    return self.my_getitem(idx)
  File "/opt/3tbdrive1/products/voicesurfer/02.VoiceTraining/univnet/datasets/dataloader.py", line 78, in my_getitem
    mel = self.get_mel(wavpath)
  File "/opt/3tbdrive1/products/voicesurfer/02.VoiceTraining/univnet/datasets/dataloader.py", line 96, in get_mel
    mel = torch.load(melpath, map_location='cpu')
  File "/opt/3tbdrive1/products/voicesurfer/venv/lib/python3.6/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/3tbdrive1/products/voicesurfer/venv/lib/python3.6/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
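
The last frames of the traceback show the failure happening while a cached mel file is being read. A common cause of "Ran out of input" from a pickle-based loader is an empty or truncated file, for example one left behind by an interrupted earlier run. This is a guess, as the thread does not confirm it. The sketch below lists zero-byte cache files so they can be deleted and regenerated; the `.mel` suffix follows the note in the Datasets section, and the function name is hypothetical.

```python
import os


def find_empty_cache_files(root, suffix=".mel"):
    """Return sorted paths of zero-byte files under root ending in suffix."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                if os.path.getsize(path) == 0:
                    empty.append(path)
    return sorted(empty)


if __name__ == "__main__":
    for path in find_empty_cache_files("datasets"):
        print("empty cache file:", path)
```

This only detects files that are completely empty. A partially written file would need to be found by attempting to load it.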

opened by icklerly1 1

Owner

MINDs Lab

MINDsLab provides an AI platform and various AI engines based on deep machine learning.

GitHub https://mindslab-ai.github.io/univnet/

UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

UnivNet UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation. Training python train.py --c

55 Dec 26, 2022

Unofficial pytorch implementation of the paper "Dynamic High-Pass Filtering and Multi-Spectral Attention for Image Super-Resolution"

DFSA Unofficial pytorch implementation of the ICCV 2021 paper "Dynamic High-Pass Filtering and Multi-Spectral Attention for Image Super-Resolution" (p

2 Nov 15, 2021

The official implementation of the Interspeech 2021 paper WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution.

WSRGlow The official implementation of the Interspeech 2021 paper WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution. Audio sa

96 Jan 3, 2023

Parallel and High-Fidelity Text-to-Lip Generation; AAAI 2022 ; Official code

Parallel and High-Fidelity Text-to-Lip Generation This repository is the official PyTorch implementation of our AAAI-2022 paper, in which we propose P

77 Dec 21, 2022

PyTorch Implementation of DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

DiffGAN-TTS - PyTorch Implementation PyTorch implementation of DiffGAN-TTS: High

157 Jan 1, 2023

Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging

Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging This repository contains an implementation

1.1k Jan 2, 2023

A PyTorch Implementation of the paper - Choi, Woosung, et al. "Investigating u-nets with various intermediate blocks for spectrogram-based singing voice separation." 21th International Society for Music Information Retrieval Conference, ISMIR. 2020.

Investigating U-NETS With Various Intermediate Blocks For Spectrogram-based Singing Voice Separation A Pytorch Implementation of the paper "Investigat

63 Nov 14, 2022

Tensorflow python implementation of "Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos"

Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos This repository is the official tensorflow python implementation

287 Jan 6, 2023

Implementation for HFGI: High-Fidelity GAN Inversion for Image Attribute Editing

HFGI: High-Fidelity GAN Inversion for Image Attribute Editing High-Fidelity GAN Inversion for Image Attribute Editing Update: We released the inferenc

371 Dec 30, 2022

Fast and Simple Neural Vocoder, the Multiband RNNMS

Multiband RNN_MS Fast and Simple vocoder, Multiband RNN_MS. Demo Quick training How to Use System Details Results References Demo ToDO: Link super gre

5 Jan 11, 2022

Chinese Mandarin tts text-to-speech 中文 (普通话) 语音合成 , by fastspeech 2 , implemented in pytorch, using waveglow as vocoder,

Chinese Mandarin text to speech based on Fastspeech2 and Unet This is a modification and adaptation of fastspeech2 to Mandarin (普通话). Many modifications t

291 Jan 2, 2023

FFTNet vocoder implementation

Unofficial Implementation of FFTNet vocode paper. implement the model. implement tests. overfit on a single batch (sanity check). linearize weights fo

81 Dec 8, 2022

efficient neural audio synthesis in the waveform domain

neural waveshaping synthesis real-time neural audio synthesis in the waveform domain paper • website • colab • audio by Ben Hayes, Charalampos Saitis,

169 Dec 23, 2022

Deep generative modeling for time-stamped heterogeneous data, enabling high-fidelity models for a large variety of spatio-temporal domains.

Neural Spatio-Temporal Point Processes [arxiv] Ricky T. Q. Chen, Brandon Amos, Maximilian Nickel Abstract. We propose a new class of parameterizations

75 Dec 19, 2022

《Towards High Fidelity Face Relighting with Realistic Shadows》(CVPR 2021)

Towards High Fidelity Face-Relighting with Realistic Shadows Andrew Hou, Ze Zhang, Michel Sarkis, Ning Bi, Yiying Tong, Xiaoming Liu. In CVPR, 2021. T

114 Dec 10, 2022

HiFi-GAN: High Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks

HiFiGAN Denoiser This is a Unofficial Pytorch implementation of the paper HiFi-GAN: High Fidelity Denoising and Dereverberation Based on Speech Deep F

134 Dec 27, 2022

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis Jungil Kong, Jaehyeon Kim, Jaekyoung Bae In our paper, we p

31 Dec 8, 2022

This repository contains the code for using the H3DS dataset introduced in H3D-Net: Few-Shot High-Fidelity 3D Head Reconstruction

H3DS Dataset This repository contains the code for using the H3DS dataset introduced in H3D-Net: Few-Shot High-Fidelity 3D Head Reconstruction Access

72 Dec 10, 2022

A two-stage U-Net for high-fidelity denoising of historical recordings

A two-stage U-Net for high-fidelity denoising of historical recordings Official repository of the paper (not submitted yet): E. Moliner and V. Välimäk

57 Jan 5, 2023
