Unofficial implementation of HiFi-GAN+ from the paper "Bandwidth Extension is All You Need" by Su, et al.

Brent M. Spell

Last update: Dec 30, 2022

Overview

HiFi-GAN+

This project is an unofficial implementation of the HiFi-GAN+ model for audio bandwidth extension, from the paper Bandwidth Extension is All You Need by Jiaqi Su, Yunyun Wang, Adam Finkelstein, and Zeyu Jin.

The model takes a band-limited audio signal (usually 8/16/24kHz) and attempts to reconstruct the high frequency components needed to restore a full-band signal at 48kHz. This is useful for upsampling low-rate outputs from upstream tasks like text-to-speech, voice conversion, etc. or enhancing audio that was filtered to remove high frequency noise. For more information, please see this blog post.
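As a rough illustration of what "high frequency components" means here: a signal sampled at a given rate can only carry content up to half that rate (the Nyquist frequency), so the band the model has to reconstruct runs from the source Nyquist frequency up to 24kHz. The helper below is a hypothetical sketch of that arithmetic and is not part of the library.

```python
TARGET_RATE_HZ = 48000  # full-band output rate described above


def missing_band_hz(source_rate_hz, target_rate_hz=TARGET_RATE_HZ):
    """Return the (low, high) band in Hz that is absent from the source.

    Hypothetical helper for illustration only: it applies the Nyquist
    limit (half the sample rate) to both the source and target rates.
    """
    if not 0 < source_rate_hz <= target_rate_hz:
        raise ValueError("source rate must be positive and at most the target rate")
    return source_rate_hz / 2, target_rate_hz / 2


for rate in (8000, 16000, 24000):
    low, high = missing_band_hz(rate)
    print(f"{rate} Hz source: reconstruct {low:.0f} Hz to {high:.0f} Hz")
```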

Usage

The example below uses a pretrained HiFi-GAN+ model to upsample a 1 second 24kHz sawtooth to 48kHz.

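import torch
from hifi_gan_bwe import BandwidthExtender

# load a pretrained model by name (see Pretrained Models below)
model = BandwidthExtender.from_pretrained("hifi-gan-bwe-10-42890e3-vctk-48kHz")

# generate a 1 second sawtooth at 261.63Hz (middle C), sampled at 24kHz
fs = 24000
x = torch.full([fs], 261.63 / fs).cumsum(-1) % 1.0 - 0.5

# upsample the signal to 48kHz
y = model(x, fs)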
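The sawtooth expression in the example above is a phase accumulator: it adds 261.63 / fs cycles per sample, wraps the running phase to [0, 1), and shifts it to be centered on zero. The pure-Python sketch below mirrors that one-liner without torch so the steps are easier to follow. The function name is made up for illustration, and it uses Python's double-precision floats rather than torch's default float32.

```python
from itertools import accumulate


def sawtooth(freq_hz, sample_rate, num_samples):
    """Illustrative stand-in for the torch one-liner in the usage example."""
    step = freq_hz / sample_rate                # cycles advanced per sample
    phases = accumulate([step] * num_samples)   # running sum, like cumsum(-1)
    # wrap each phase to [0, 1), then center the ramp on zero
    return [phase % 1.0 - 0.5 for phase in phases]


x = sawtooth(261.63, 24000, 24000)
print(len(x), min(x), max(x))
```

Every sample lies in [-0.5, 0.5), and the ramp resets once per period of the tone.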

There is a Gradio demo on Hugging Face Spaces where you can upload audio clips and run the model. You can also run the model on Colab with this notebook.

Running with pipx

The HiFi-GAN+ library can be run directly from PyPI if you have the pipx application installed. The following script uses a hosted pretrained model to upsample an MP3 file to 48kHz. The input audio can be in any format supported by the audioread library, and the output can be in any format supported by soundfile.

pipx run --python=python3.9 hifi-gan-bwe \
  hifi-gan-bwe-10-42890e3-vctk-48kHz \
  input.mp3 \
  output.wav

Running in a Virtual Environment

If you have a Python 3.9 virtual environment installed, you can install the HiFi-GAN+ library into it and run synthesis, training, etc. using it.

pip install hifi-gan-bwe

hifi-synth hifi-gan-bwe-10-42890e3-vctk-48kHz input.mp3 output.wav
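If you have several files to upsample, one option is to build the same hifi-synth command once per file and run them in sequence. The sketch below does only that: it relies on the `hifi-synth MODEL INPUT OUTPUT` form shown above, and the helper names and the choice of `.wav` output are assumptions made for this example.

```python
import subprocess
from pathlib import Path

MODEL = "hifi-gan-bwe-10-42890e3-vctk-48kHz"


def plan_commands(sources, dst_dir, model=MODEL):
    """Build one hifi-synth command per input file (nothing is executed)."""
    dst_dir = Path(dst_dir)
    commands = []
    for src in map(Path, sources):
        dst = dst_dir / (src.stem + ".wav")  # assumed output naming scheme
        commands.append(["hifi-synth", model, str(src), str(dst)])
    return commands


def run_all(commands):
    """Run each planned command in order, stopping at the first failure."""
    for command in commands:
        # make sure the output directory exists before writing into it
        Path(command[-1]).parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(command, check=True)
```

For example, `run_all(plan_commands(["a.mp3", "b.mp3"], "upsampled"))` would write `upsampled/a.wav` and `upsampled/b.wav`, assuming hifi-synth is on your PATH.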

Pretrained Models

The following models can be loaded with BandwidthExtender.from_pretrained and used for audio upsampling. You can also download the model file from the link and use it offline.

Name	Sample Rate	Parameters	Wandb Metrics	Notes
hifi-gan-bwe-10-42890e3-vctk-48kHz	48kHz	1M	bwe-10-42890e3	Same as bwe-05, but uses bandlimited interpolation for upsampling, for reduced noise and aliasing. Uses the same parameters as resampy's kaiser_best mode.
hifi-gan-bwe-11-d5f542d-vctk-8kHz-48kHz	48kHz	1M	bwe-11-d5f542d	Same as bwe-10, but trained only on 8kHz sources, for specialized upsampling.
hifi-gan-bwe-12-b086d8b-vctk-16kHz-48kHz	48kHz	1M	bwe-12-b086d8b	Same as bwe-10, but trained only on 16kHz sources, for specialized upsampling.
hifi-gan-bwe-13-59f00ca-vctk-24kHz-48kHz	48kHz	1M	bwe-13-59f00ca	Same as bwe-10, but trained only on 24kHz sources, for specialized upsampling.
hifi-gan-bwe-05-cd9f4ca-vctk-48kHz	48kHz	1M	bwe-05-cd9f4ca	Trained for 200K iterations on the VCTK speech dataset with noise augmentation from the DNS Challenge dataset.
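One way to use the table above is to pick a specialized model when the source sample rate matches one exactly, and fall back to the general bwe-10 model otherwise. The lookup below is only a sketch of that idea: the model names come from the table, but the selection rule and the function name are assumptions, and the table does not say how much the specialized models help on any given clip.

```python
GENERAL_MODEL = "hifi-gan-bwe-10-42890e3-vctk-48kHz"

# specialized models from the table, keyed by source sample rate in Hz
SPECIALIZED_MODELS = {
    8000: "hifi-gan-bwe-11-d5f542d-vctk-8kHz-48kHz",
    16000: "hifi-gan-bwe-12-b086d8b-vctk-16kHz-48kHz",
    24000: "hifi-gan-bwe-13-59f00ca-vctk-24kHz-48kHz",
}


def choose_model(source_rate_hz, prefer_specialized=True):
    """Return a pretrained model name for the given source sample rate."""
    if prefer_specialized:
        # fall back to the general model for rates without a specialized one
        return SPECIALIZED_MODELS.get(source_rate_hz, GENERAL_MODEL)
    return GENERAL_MODEL


print(choose_model(16000))
print(choose_model(22050))
```

The returned name can then be passed to BandwidthExtender.from_pretrained or to hifi-synth.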

Training

If you want to train your own model, you can use any of the methods above to install/run the library or fork the repo and run the script commands locally. The following commands are supported:

Name	Description
hifi-train	Start a new training run. Pass in a name for the run.
hifi-clone	Clone an existing training run at a given checkpoint or at the latest one.
hifi-export	Optimize a model for inference and export it to a PyTorch model file (.pt).
hifi-synth	Run model inference using a trained model on a source audio file.

For example, you might start a new training run called bwe-01 with the following command:

hifi-train 01

To train a model, you will first need to download the VCTK and DNS Challenge datasets. By default, these datasets are assumed to be in the ./data/vctk and ./data/dns directories. See train.py for how to specify your own training data directories. If you want to use a custom training dataset, you can implement a dataset wrapper in datasets.py.
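Before starting a long training run, it can help to confirm that the dataset directories are where the scripts expect them. The check below is a small sketch that uses only the default paths named above. The helper is hypothetical, and it only tests that each directory exists, not that its contents are complete.

```python
from pathlib import Path

# default dataset locations described above
DEFAULT_DATA_DIRS = {
    "vctk": Path("./data/vctk"),
    "dns": Path("./data/dns"),
}


def missing_datasets(data_dirs=DEFAULT_DATA_DIRS):
    """Return the names of datasets whose directory does not exist."""
    return [name for name, path in data_dirs.items() if not Path(path).is_dir()]


missing = missing_datasets()
if missing:
    print("missing dataset directories:", ", ".join(missing))
else:
    print("all dataset directories found")
```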

The training scripts use wandb.ai for experiment tracking and visualization. Wandb metrics can be disabled by passing --no_wandb to the training script. All of my own experiment results are publicly available at wandb.ai/brentspell/hifi-gan-bwe.

Each training run is identified by a name and a git hash (ex: bwe-01-8abbca9). The git hash is used for simple experiment tracking, reproducibility, and model provenance. Using git to manage experiments also makes it easy to change model hyperparameters by simply changing the code, making a commit, and starting the training run. This is why there is no hyperparameter configuration file in the project, since I often end up having to change the code anyway to run interesting experiments.
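To make the naming scheme concrete, the sketch below builds and parses identifiers of the form shown above (a run name, a hyphen, and a short git hash). It is an illustration under assumptions and not the project's own code: the helper names, the 7-character abbreviation, and the parsing rule are all made up for this example.

```python
import re

# a run name followed by a hyphen and a 7 to 40 character hex git hash
RUN_ID = re.compile(r"^(?P<name>.+)-(?P<git_hash>[0-9a-f]{7,40})$")


def make_run_id(name, git_hash, short=7):
    """Join a run name and an abbreviated git hash, e.g. bwe-01-8abbca9."""
    return f"{name}-{git_hash[:short]}"


def parse_run_id(run_id):
    """Split a run identifier back into (name, git_hash)."""
    match = RUN_ID.match(run_id)
    if match is None:
        raise ValueError(f"not a run identifier: {run_id!r}")
    return match["name"], match["git_hash"]


print(make_run_id("bwe-01", "8abbca9f0c1d2e3f"))
print(parse_run_id("bwe-01-8abbca9"))
```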

Development

Setup

The following script creates a virtual environment using pyenv for the project and installs dependencies.

pyenv install 3.9.10
pyenv virtualenv 3.9.10 hifi-gan-bwe
pip install -r requirements.txt

If you want to run the hifi-* scripts described above in development, you can install the package locally:

pip install -e .

You can then run tests, etc. as follows:

pytest --cov=hifi_gan_bwe
black .
isort --profile=black .
flake8 .
mypy .

These checks are also included in the pre-commit configuration for the project, so you can set them up to run automatically on commit by running:

pre-commit install

Acknowledgements

The original research on the HiFi-GAN+ model is not my own, and all credit goes to the paper's authors. I also referred to kan-bayashi's excellent Parallel WaveGAN implementation, specifically the WaveNet module. If you use this code, please cite the original paper:

@inproceedings{su2021bandwidth,
  title={Bandwidth extension is all you need},
  author={Su, Jiaqi and Wang, Yunyun and Finkelstein, Adam and Jin, Zeyu},
  booktitle={ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={696--700},
  year={2021},
  organization={IEEE},
  url={https://doi.org/10.1109/ICASSP39728.2021.9413575},
}

License

Licensed under the MIT License (the "License"). You may not use this package except in compliance with the License. You may obtain a copy of the License at

https://opensource.org/licenses/MIT

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments

Very high CPU and memory usage with hifi-synth
hifi-synth hifi-gan-bwe-10-42890e3-vctk-48kHz input.mp3 output.wav

Can I enable GPU acceleration if the running memory and CPU consumption are too much?
opened by Baiyuetribe 7
audio data frame size in inference time

In the current implementation, the entire audio signal is processed at once at inference time to reconstruct the high-quality signal.

I modified the processing logic to work frame by frame at inference time and observed degraded performance.

Any suggestions are highly appreciated.

opened by saivinaypsv 7
I would like to know about the training data

Hello. Your implementation is very nice and I would like to use your pre-trained model as a baseline model in my research. I want to use the same training data to make a fair comparison but I cannot find the information about it. So, if possible, would you please publish the division of the VCTK dataset? Or it is alright to send me the information via email or something else. Best regards.

opened by chomeyama 4
Hugging Face Space: Aliasing in audio

tl;dr I suspect the input audio is getting resampled without a lowpass filter or otherwise improperly and then it's going to hifi-gan-bwe in the Hugging Face Space demo, uncertain if issue exists in repo.

I don't know if this issue is present in this code or just the Hugging Face interface, but in the Hugging Face Space demo there appears to be an issue with aliasing. If I feed it the same sound at two sample rates, one at the original sample rate of 20825hz (E-mu Drumulator sample data if you were curious about the weird value) and another resampled to 48khz, they'll be very different. The 20825hz version clearly has the vertically reflected spectrogram of aliasing, while the 48khz version doesn't. I tried 24khz as well since it seemed to be a sample rate used in training, but that also seemed like it might have aliasing.

sounds_input_output.zip

opened by torridgristle 3
Post-Training-Quantization

I trained the algorithm below and verified its performance with and without post-training quantization: HiFi-GAN: High Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks.

Any update on Post-Training-Quantization model for [Bandwidth Extension is All You Need]?

opened by saivinaypsv 1

