HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Overview

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our paper, we proposed HiFi-GAN: a GAN-based model capable of generating high fidelity speech efficiently.
We provide our implementation and pretrained models as open source in this repository.

Abstract : Several recent work on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of an audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.

Visit our demo website for audio samples.

Pre-requisites

  1. Python >= 3.6
  2. Clone this repository.
  3. Install python requirements. Please refer requirements.txt
  4. Download and extract the LJ Speech dataset. And move all wav files to LJSpeech-1.1/wavs

Training

python train.py --config config_v1.json

To train V2 or V3 Generator, replace config_v1.json with config_v2.json or config_v3.json.
Checkpoints and copy of the configuration file are saved in cp_hifigan directory by default.
You can change the path by adding --checkpoint_path option.

Validation loss during training with V1 generator.
validation loss

Pretrained Model

You can also use pretrained models we provide.
Download pretrained models
Details of each folder are as in follows:

Folder Name Generator Dataset Fine-Tuned
LJ_V1 V1 LJSpeech No
LJ_V2 V2 LJSpeech No
LJ_V3 V3 LJSpeech No
LJ_FT_T2_V1 V1 LJSpeech Yes (Tacotron2)
LJ_FT_T2_V2 V2 LJSpeech Yes (Tacotron2)
LJ_FT_T2_V3 V3 LJSpeech Yes (Tacotron2)
VCTK_V1 V1 VCTK No
VCTK_V2 V2 VCTK No
VCTK_V3 V3 VCTK No
UNIVERSAL_V1 V1 Universal No

We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.

Fine-Tuning

  1. Generate mel-spectrograms in numpy format using Tacotron2 with teacher-forcing.
    The file name of the generated mel-spectrogram should match the audio file and the extension should be .npy.
    Example:
    Audio File : LJ001-0001.wav
    Mel-Spectrogram File : LJ001-0001.npy
    
  2. Create ft_dataset folder and copy the generated mel-spectrogram files into it.
  3. Run the following command.
    python train.py --fine_tuning True --config config_v1.json
    
    For other command line options, please refer to the training section.

Inference from wav file

  1. Make test_files directory and copy wav files into the directory.
  2. Run the following command.
    python inference.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files by default.
You can change the path by adding --output_dir option.

Inference for end-to-end speech synthesis

  1. Make test_mel_files directory and copy generated mel-spectrogram files into the directory.
    You can generate mel-spectrograms using Tacotron2, Glow-TTS and so forth.
  2. Run the following command.
    python inference_e2e.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files_from_mel by default.
You can change the path by adding --output_dir option.

Acknowledgements

We referred to WaveGlow, MelGAN and Tacotron2 to implement this.

You might also like...
NR-GAN: Noise Robust Generative Adversarial Networks
NR-GAN: Noise Robust Generative Adversarial Networks

NR-GAN: Noise Robust Generative Adversarial Networks (CVPR 2020) This repository provides PyTorch implementation for noise robust GAN (NR-GAN). NR-GAN

Partial implementation of ODE-GAN technique from the paper Training Generative Adversarial Networks by Solving Ordinary Differential Equations
Partial implementation of ODE-GAN technique from the paper Training Generative Adversarial Networks by Solving Ordinary Differential Equations

ODE GAN (Prototype) in PyTorch Partial implementation of ODE-GAN technique from the paper Training Generative Adversarial Networks by Solving Ordinary

Pytorch implementation for reproducing StackGAN_v2 results in the paper StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks
Pytorch implementation for reproducing StackGAN_v2 results in the paper StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks

StackGAN-v2 StackGAN-v1: Tensorflow implementation StackGAN-v1: Pytorch implementation Inception score evaluation Pytorch implementation for reproduci

Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis (CVPR2022)
Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis (CVPR2022)

Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis Multi-View Consistent Generative Adversarial Networks for 3D-aware

PyTorch implementation of Lip to Speech Synthesis with Visual Context Attentional GAN (NeurIPS2021)
PyTorch implementation of Lip to Speech Synthesis with Visual Context Attentional GAN (NeurIPS2021)

Lip to Speech Synthesis with Visual Context Attentional GAN This repository contains the PyTorch implementation of the following paper: Lip to Speech

Fre-GAN: Adversarial Frequency-consistent Audio Synthesis

Fre-GAN Vocoder Fre-GAN: Adversarial Frequency-consistent Audio Synthesis Training: python train.py --config config.json Citation: @misc{kim2021frega

 House-GAN++: Generative Adversarial Layout Refinement Network towards Intelligent Computational Agent for Professional Architects
House-GAN++: Generative Adversarial Layout Refinement Network towards Intelligent Computational Agent for Professional Architects

House-GAN++ Code and instructions for our paper: House-GAN++: Generative Adversarial Layout Refinement Network towards Intelligent Computational Agent

A method that utilized Generative Adversarial Network (GAN) to interpret the black-box deep image classifier models by PyTorch.

A method that utilized Generative Adversarial Network (GAN) to interpret the black-box deep image classifier models by PyTorch.

Parallel and High-Fidelity Text-to-Lip Generation; AAAI 2022 ; Official code

Parallel and High-Fidelity Text-to-Lip Generation This repository is the official PyTorch implementation of our AAAI-2022 paper, in which we propose P

Owner
Rishikesh (ऋषिकेश)
Deep Learning/ AI Researcher | Open Source enthusiast | Text to Speech | Speech Synthesis | Generative Models | Object detection | Language Understanding
Rishikesh (ऋषिकेश)
Combine Tacotron2 and Hifi GAN to generate speech from text

EndToEndTextToSpeech Combine Tacotron2 and Hifi GAN to generate speech from text Download weights Hifi GAN -> hifi_gan/checkpoint/ : pretrain 2.5M ste

Phạm Quốc Huy 1 Dec 18, 2021
PyTorch Implementation of DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

DiffGAN-TTS - PyTorch Implementation PyTorch implementation of DiffGAN-TTS: High

Keon Lee 157 Jan 1, 2023
π-GAN: Periodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis

π-GAN: Periodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis Project Page | Paper | Data Eric Ryan Chan*, Marco Monteiro*, Pe

null 375 Dec 31, 2022
Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative adversarial networks (GAN)

Flickr-Faces-HQ Dataset (FFHQ) Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative

NVIDIA Research Projects 2.9k Dec 28, 2022
Generate high quality pictures. GAN. Generative Adversarial Networks

ESRGAN generate high quality pictures. GAN. Generative Adversarial Networks """ Super-resolution of CelebA using Generative Adversarial Networks. The

Lieon 1 Dec 14, 2021
Implementation for HFGI: High-Fidelity GAN Inversion for Image Attribute Editing

HFGI: High-Fidelity GAN Inversion for Image Attribute Editing High-Fidelity GAN Inversion for Image Attribute Editing Update: We released the inferenc

Tengfei Wang 371 Dec 30, 2022
SCI-AIDE : High-fidelity Few-shot Histopathology Image Synthesis for Rare Cancer Diagnosis

SCI-AIDE : High-fidelity Few-shot Histopathology Image Synthesis for Rare Cancer Diagnosis Pretrained Models In this work, we created synthetic tissue

Emirhan Kurtuluş 1 Feb 7, 2022
Deep generative modeling for time-stamped heterogeneous data, enabling high-fidelity models for a large variety of spatio-temporal domains.

Neural Spatio-Temporal Point Processes [arxiv] Ricky T. Q. Chen, Brandon Amos, Maximilian Nickel Abstract. We propose a new class of parameterizations

Facebook Research 75 Dec 19, 2022
Unofficial implementation of HiFi-GAN+ from the paper "Bandwidth Extension is All You Need" by Su, et al.

HiFi-GAN+ This project is an unoffical implementation of the HiFi-GAN+ model for audio bandwidth extension, from the paper Bandwidth Extension is All

Brent M. Spell 134 Dec 30, 2022
HiFi++: a Unified Framework for Neural Vocoding, Bandwidth Extension and Speech Enhancement

HiFi++ : a Unified Framework for Neural Vocoding, Bandwidth Extension and Speech Enhancement This is the unofficial implementation of Vocoder part of

Rishikesh (ऋषिकेश) 118 Dec 29, 2022