HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Overview

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our paper, we proposed HiFi-GAN: a GAN-based model capable of generating high fidelity speech efficiently.
We provide our implementation and pretrained models as open source in this repository.

Abstract: Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.

Visit our demo website for audio samples.

Pre-requisites

  1. Python >= 3.6
  2. Clone this repository.
  3. Install python requirements. Please refer to requirements.txt.
  4. Download and extract the LJ Speech dataset, and move all wav files to LJSpeech-1.1/wavs.

Training

python train.py --config config_v1.json

To train V2 or V3 Generator, replace config_v1.json with config_v2.json or config_v3.json.
Checkpoints and a copy of the configuration file are saved in the cp_hifigan directory by default.
You can change the path by adding the --checkpoint_path option.

Validation loss during training with the V1 generator:
[validation loss plot]

Pretrained Model

You can also use the pretrained models we provide.
Download pretrained models
Details of each folder are as follows:

Folder Name   Generator  Dataset    Fine-Tuned
LJ_V1         V1         LJSpeech   No
LJ_V2         V2         LJSpeech   No
LJ_V3         V3         LJSpeech   No
LJ_FT_T2_V1   V1         LJSpeech   Yes (Tacotron2)
LJ_FT_T2_V2   V2         LJSpeech   Yes (Tacotron2)
LJ_FT_T2_V3   V3         LJSpeech   Yes (Tacotron2)
VCTK_V1       V1         VCTK       No
VCTK_V2       V2         VCTK       No
VCTK_V3       V3         VCTK       No
UNIVERSAL_V1  V1         Universal  No

We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.
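
As a warm-start sketch (not part of this repository's documented workflow, and assuming train.py resumes from the newest g_*/do_* files found in the --checkpoint_path directory), one way to use the universal weights for transfer learning is to seed a fresh checkpoint directory with them before launching training; the universal checkpoint file names below are illustrative:

# Hedged sketch: seed a new checkpoint directory with the universal generator (g_*)
# and discriminator (do_*) weights so training on a new dataset starts from them.
import os
import shutil

ckpt_dir = "cp_hifigan_mydataset"                 # later passed via --checkpoint_path
os.makedirs(ckpt_dir, exist_ok=True)

# Illustrative file names; use whatever the downloaded UNIVERSAL_V1 folder actually contains.
shutil.copy("UNIVERSAL_V1/g_02500000", os.path.join(ckpt_dir, "g_02500000"))
shutil.copy("UNIVERSAL_V1/do_02500000", os.path.join(ckpt_dir, "do_02500000"))

# Then train on the new dataset, e.g.:
#   python train.py --config config_v1.json --checkpoint_path cp_hifigan_mydataset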

Fine-Tuning

  1. Generate mel-spectrograms in numpy format using Tacotron2 with teacher-forcing.
    The file name of the generated mel-spectrogram should match that of the audio file, and the extension should be .npy (a sketch of this naming convention follows this list).
    Example:
    Audio File : LJ001-0001.wav
    Mel-Spectrogram File : LJ001-0001.npy
    
  2. Create the ft_dataset folder and copy the generated mel-spectrogram files into it.
  3. Run the following command.
    python train.py --fine_tuning True --config config_v1.json
    
    For other command line options, please refer to the training section.
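
As a concrete illustration of the naming convention in step 1, here is a minimal sketch (not part of this repository); teacher_forced_mel is a hypothetical placeholder for however your Tacotron2 setup produces a teacher-forced mel-spectrogram for a given utterance:

# Hedged sketch: write one .npy per wav into ft_dataset, with matching base names.
import glob
import os

import numpy as np

os.makedirs("ft_dataset", exist_ok=True)
for wav_path in sorted(glob.glob("LJSpeech-1.1/wavs/*.wav")):
    base = os.path.splitext(os.path.basename(wav_path))[0]    # e.g. LJ001-0001
    mel = teacher_forced_mel(wav_path)                        # hypothetical helper returning a numpy array
    np.save(os.path.join("ft_dataset", base + ".npy"), mel)   # -> ft_dataset/LJ001-0001.npy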

Inference from wav file

  1. Make a test_files directory and copy wav files into it.
  2. Run the following command.
    python inference.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files by default.
You can change the path by adding the --output_dir option.

Inference for end-to-end speech synthesis

  1. Make a test_mel_files directory and copy generated mel-spectrogram files into it.
    You can generate mel-spectrograms using Tacotron2, Glow-TTS, and so forth (see the sketch at the end of this section).
  2. Run the following command.
    python inference_e2e.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files_from_mel by default.
You can change the path by adding the --output_dir option.
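
As a rough illustration of preparing inputs for inference_e2e.py, here is a hedged sketch (not from this repository); acoustic_model_mel is a hypothetical placeholder for your Tacotron2/Glow-TTS front end, and the channels-first, batched [1, 80, frames] layout is an assumption inferred from the dimension-mismatch tracebacks quoted in the comments below, not something this README specifies:

# Hedged sketch: save an acoustic model's mel output into test_mel_files.
import os

import numpy as np

os.makedirs("test_mel_files", exist_ok=True)

mel = acoustic_model_mel("Hello world.")       # hypothetical helper returning a numpy array
mel = np.asarray(mel, dtype=np.float32)
if mel.ndim == 2 and mel.shape[0] != 80:       # assumed 80 mel bins; transpose [frames, 80] -> [80, frames]
    mel = mel.T
if mel.ndim == 2:
    mel = mel[np.newaxis]                      # add a batch axis: [1, 80, frames]
np.save("test_mel_files/sample_0001.npy", mel)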

Acknowledgements

We referred to WaveGlow, MelGAN and Tacotron2 to implement this.

Comments
  • v1 Training time + training a 44Khz model on Universal VCTK+Blizzard2011+Clipper Datasets

    What's a rough estimate for the training time of the 22kHz v1 pretrained model provided here?


    I'm currently training a 44kHz config. At the moment I'm training using 3 GPUs, batch_size=24 and segment_size=16384, for a total of 1,179,648 samples per iteration. Not sure how long I should leave it running; any rough estimate would be awesome.

    Thanks!

    opened by CookiePPP 18
  • [Question] Quality improvement using different generator parameters

    I was wondering if it is possible to further improve quality by adjusting the generator's parameters in the MRF, like the hidden dimension or the kernel sizes. Ideally it would get rid of even more artifacts. Has anybody tried this?

    opened by george-roussos 16
  • [Question] How to apply on 16k data?

    Hi, thanks for sharing your impressive code. I tried to apply HiFi-GAN to 16k data, with this config: "upsample_rates": [8,5,5], "upsample_kernel_sizes": [16,10,10], "segment_size": 6400, "hop_size": 200, "win_size": 800, "sampling_rate": 16000,

    And it reports an error like:
    Traceback (most recent call last):
      File "train.py", line 271, in <module> main()
      File "train.py", line 267, in main train(0, a, h)
      File "train.py", line 149, in train loss_fm_f = feature_loss(fmap_f_r, fmap_f_g)
      File "models.py", line 255, in feature_loss loss += torch.mean(torch.abs(rl - gl))
    RuntimeError: The size of tensor a (1067) must match the size of tensor b (1068) at non-singleton dimension 2

    Is there anything wrong in the modified config? Is it padding-related?

    opened by OnceJune 15
  • Buzzing sound when using Tacotron2+HiFi-GAN

    Hi @jik876 @Edresson

    I have been trying to integrate Tacotron2 and HiFi-GAN to create a fully end-to-end TTS pipeline. But when I feed the Tacotron2 output to your fine-tuned HiFi-GAN model, the output audio is just a buzzing sound. To make sure the Tacotron2 output is correct, I fed it to the Parallel WaveGAN model, and it works as expected. So I believe there is some incompatibility in feeding the Tacotron2 output to HiFi-GAN. To reproduce this, I created a Colab notebook; you can reproduce the output with the above-mentioned notebook.

    Also, I created an end-to-end Colab notebook to run the HiFi-GAN model. If you want me to add this to your repo, I will go ahead and create a PR.

    Also, I am planning to convert this HiFi-GAN model to TFLite format to help mobile developers. We (@sayakpaul and I) have already converted a few models to TFLite format. You can find more details about our TFLite repo here.

    opened by tulasiram58827 14
  • break point problem

    [image] The spectrogram has a break point; what do you think the reason for this problem is? I have enlarged the receptive field: [1,3,5] --> [1,3,5,7], but the problem still exists.

    And why is your leaky ReLU slope 0.1 instead of 0.2?

    opened by hdmjdp 14
  • RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

    I've encountered the following error when trying to fine-tune on mel-spectrograms from Tacotron2:

    Traceback (most recent call last):
      File "train.py", line 276, in <module>
        main()
      File "train.py", line 272, in main
        train(0, a, h)
      File "train.py", line 127, in train
        y_g_hat = generator(x)
      File "/home/kynh/anaconda3/envs/nguyenlm_hifigan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/home2/nguyenlm/Projects/hifi-gan-clone/models.py", line 101, in forward
        x = self.conv_pre(x)
      File "/home/kynh/anaconda3/envs/nguyenlm_hifigan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/kynh/anaconda3/envs/nguyenlm_hifigan/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 202, in forward
        self.padding, self.dilation, self.groups)
    RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same
    
    opened by leminhnguyen 12
  • question about the finetune data

    In the readme, you said that for fine-tuning one should "Generate mel-spectrograms in numpy format using Tacotron2 with teacher-forcing", but the number of frames in the generated mel-spectrograms is not equal to that of the original mels. In meldataset.py, the audio is cropped like this: if audio.size(1) >= self.segment_size: mel_start = random.randint(0, mel.size(2) - frames_per_seg - 1); mel = mel[:, :, mel_start:mel_start + frames_per_seg]; audio = audio[:, mel_start * self.hop_size:(mel_start + frames_per_seg) * self.hop_size]

    In this way, if we use the original audio but the generated mel, the training data will be wrong, right?

    opened by yanyanxixi 10
  • error when fine-tuning

    I want to use FastSpeech 2 + HiFi-GAN, but there is a little noise in some of the generated audio, so I took the mel-spectrograms generated by FastSpeech 2 to retrain HiFi-GAN, but I met this error when fine-tuning:

    `Loading 'cp_hifigan/g_00320000' Complete. Loading 'cp_hifigan/do_00320000' Complete. Epoch: 1227 Traceback (most recent call last): File "train.py", line 286, in main() File "train.py", line 280, in main mp.spawn(train, nprocs=h.num_gpus, args=(a, h,)) File "/espnet/tools/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn return start_processes(fn, args, nprocs, join, daemon, start_method='spawn') File "/espnet/tools/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes while not context.join(): File "/espnet/tools/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join raise Exception(msg) Exception:

    -- Process 2 terminated with the following error: Traceback (most recent call last): File "/espnet/tools/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap fn(i, *args) File "/data/glusterfs_speech_tts_v2/public_data/tts_public_data/11104653/vocoder/hifi-gan/train.py", line 113, in train for i, batch in enumerate(train_loader): File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in next data = self._next_data() File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data return self._process_data(data) File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data data.reraise() File "/espnet/tools/venv/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise raise self.exc_type(msg) RuntimeError: Caught RuntimeError in DataLoader worker process 0. Original Traceback (most recent call last): File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop data = fetcher.fetch(index) File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch return self.collate_fn(data) File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 84, in default_collate return [default_collate(samples) for samples in transposed] File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 84, in return [default_collate(samples) for samples in transposed] File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate return torch.stack(batch, 0, out=out) RuntimeError: stack expects each tensor to be equal size, but got [8192] at entry 0 and [6712] at entry 10`

    opened by JoeyHeisenberg 10
  • GPU memory allocation

    I'm getting this error on a relatively low-powered system I'm using for testing:
    RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 2.80 GiB already allocated; 17.10 MiB free; 2.82 GiB reserved in total by PyTorch)
    Where can I set this to a lower value for this purpose? Also, if I try to use fine_tuning (using mel .npy files created by FastSpeech 2 pre-processing), I get errors indicating a tensor dimension mismatch. Are these files incompatible with HiFi-GAN? Thank you!

    opened by rspiewak47 7
  • How to draw the "Pixel-wise difference in the mel-spectrogram domain" picture?

    In the HiFi-GAN paper, Figure 3 shows the difference between the mel-spectrogram of the generated waveform and the Tacotron2-generated mel. Those two mel-spectrograms may have different lengths, so how do you pad the two mel sequences to do the subtraction and get the 'pixel-wise' picture? Are there any tools for that?

    opened by JohnHerry 5
  • Why does the mel.npy file I trained with the tacotron2 model not match the dimensions in hifi-gan?

    I trained a Tacotron2 model to produce mel_**.npy files, but with this model a dimension-mismatch error is reported:
    python3 inference_e2e.py --checkpoint_file cp_hifigan-1208-test/g_00036000
    Initializing Inference Process.. cp_hifigan-1208-test/g_00036000
    Loading 'cp_hifigan-1208-test/g_00036000' Complete.
    Removing weight norm...
    Traceback (most recent call last):
      File "inference_e2e.py", line 90, in <module> main()
      File "inference_e2e.py", line 86, in main inference(a)
      File "inference_e2e.py", line 51, in inference y_g_hat = generator(x)
      File "/home/zhchen/python3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 547, in __call__ result = self.forward(*input, **kwargs)
      File "/home/zhchen/hifi-gan-master/models.py", line 101, in forward x = self.conv_pre(x)
      File "/home/zhchen/python3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 547, in __call__ result = self.forward(*input, **kwargs)
      File "/home/zhchen/python3/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 200, in forward self.padding, self.dilation, self.groups)
    RuntimeError: Expected 3-dimensional input for 3-dimensional weight 128 80, but got 2-dimensional input of size [288, 80] instead

    opened by c9412600 5
  • Teacher-Forcing How To

    I've read just about every single post mentioning teacher-forcing, and it's always discussed as a foregone conclusion that everybody understands exactly what this means. I cannot find any mention of this other than the general "you must use teacher-forcing". The closest thing to an explanation I saw was about fine-tuning, where you need the original wavs and the mel outputs. But in what format are these? The same folder? Different folders? Is there a text file that links them in some way?

    I assume this is something as simple as:

    1. Create a folder with x data in y format
    2. Run this python script with these parameters

    but for the life of me I cannot figure out what these things are.

    Is there anybody out there that can explain what this process actually looks like?

    opened by SuperJonotron 0
  • Mel spectrogram npy contents

    I'd like to train hifi-gan on a custom dataset with its own set of wav files. For this I need to generate the corresponding mel spectrograms, which the readme says can be done using Tacotron2 although it does not give much detail into it.

    However, generating mel spectrograms should be pretty straightforward as it just involves a stft, a complex magnitude, mel banks you can get from librosa, and possibly a log operation with some minimum value. You just need to make sure to use the same settings as specified in the config file.

    I'd like to use my own scripts to generate these mel spectrograms without having to deal with the Tacotron2 code if possible, but I need to know what the contents of the npy files should be. Are they just numpy arrays of shape [num_frames, mel_bins]? Are the values in log scale (log mel spectrograms)?

    Thanks.

    opened by leandro-gracia-gil 0
  • Tacotron + HIFI GAN Fine tuned: Sounds distorted.

    Hello, this is something very strange that has been happening to me more and more frequently. When fine-tuning on a Tacotron model's output, the end result sounds distorted. I have trained it for more than 5k steps (as indicated in the notebook I am using and will leave below). The dataset is about 40 minutes of very good quality audio, all at 22 kHz, mono, 16-bit. Tacotron could train on it without problems, but HiFi-GAN could not.

    This is the notebook: https://colab.research.google.com/github/justinjohn0306/FakeYou-Tacotron2-Notebook/blob/main/FakeYou_HiFi_GAN_Fine_Tuning.ipynb?authuser=1#scrollTo=teF-Ut8Z7Gjp

    This is the demo distorted audio: https://drive.google.com/file/d/1cuqfWGS1JmSMNlcnyd_PaH3Atv-PPxvB/view?usp=share_link

    This is the original dataset audio sample: https://drive.google.com/file/d/1ReqoxwHSRfu3D186jhQynCJXPQ1vhWZx/view?usp=share_link

    This is what I get when I synthesize: [image]

    Thank you!

    opened by Mixomo 0
  • Train/test split used for VCTK data

    Thank you for making your code and pretrained models available.

    I would like to use your pretrained VCTK models for benchmarking in an evaluation. I assume that the VCTK models at https://drive.google.com/drive/folders/1-eEYTB5Av9jNql0WGBlRoi-WH2J7bp5Y are the ones used for the work presented in Section 4.3 of the HiFi-GAN paper (https://arxiv.org/pdf/2010.05646). The paper mentions that nine speakers were randomly held out from training: it would be helpful to know which speakers these were. The 5 samples given in the "Unseen Speakers (VCTK Dataset)" section of https://jik876.github.io/hifi-gan-demo/ are from VCTK speakers p226, p271, p226 (again), p318 and p292; also, the samples seem to be from the mic1 version of the data, which I assume was used for training. It would be great if you could provide the IDs of the other 5 speakers, or the training.txt/validation.txt files that were used to train these models.

    Thanks!

    opened by spun-oliver 0