iSTFTNet : Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform

Rishikesh (ऋषिकेश)

Last update: Jan 2, 2023

Overview

This repo tries to implement iSTFTNet: Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform, specifically the C8C8I model. Disclaimer: This repo was built for testing purposes. The code is not optimized for performance.

Training :

python train.py --config config_v1.json

Note:

We are able to get good audio quality with 30% less training than the original HiFi-GAN.
This model is approximately 60% faster than its HiFi-GAN counterpart.

Citations :

@inproceedings{kaneko2022istftnet,
title={{iSTFTNet}: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform},
author={Takuhiro Kaneko and Kou Tanaka and Hirokazu Kameoka and Shogo Seki},
booktitle={ICASSP},
year={2022},
}

References:

https://github.com/jik876/hifi-gan

Comments

window_sum in stft is just a constant?

I printed window_sum in stft (line 155) and found that its value is constant, except at the leading and trailing padding positions, so the window function only acts as a linear scaling. Does this result match the expected windowing behavior?

opened by xiaoyangnihao 3
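
A constant window_sum away from the edges is what overlap-add theory predicts for a Hann window at 75% overlap. The sketch below is a pure-Python illustration, not the code in stft.py: it assumes a periodic Hann window with n_fft=16 and hop=4 (the values that appear in config_v1.json elsewhere on this page), and the function names are hypothetical.

```python
import math

def hann_periodic(n_fft):
    # Periodic Hann window: w[n] = 0.5 - 0.5 * cos(2*pi*n / n_fft)
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]

def window_sumsquare(n_fft, hop, n_frames):
    # Overlap-add the squared window once per frame, as an inverse STFT
    # does when it builds its normalization envelope.
    win_sq = [w * w for w in hann_periodic(n_fft)]
    total = [0.0] * (n_fft + hop * (n_frames - 1))
    for frame in range(n_frames):
        start = frame * hop
        for n, value in enumerate(win_sq):
            total[start + n] += value
    return total

n_fft, hop = 16, 4  # assumed values, taken from the config discussed on this page
envelope = window_sumsquare(n_fft, hop, n_frames=10)

# Samples covered by all n_fft // hop = 4 overlapping frames
interior = envelope[n_fft - hop : len(envelope) - (n_fft - hop)]
print(min(interior), max(interior))  # both are 1.5 up to rounding error
print(envelope[:4])                  # smaller values at the leading edge
```

Under these assumptions the envelope equals 1.5 wherever four frames overlap and only deviates near the two ends, so dividing by window_sum is a plain rescaling in the interior.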

RuntimeError: istft input and window must be on the same device but got self on cuda:0 and window on cpu

My command to run:

python3 train.py --config config_v1.json --input_wavs_dir /home/yehor/iSTFTNet-pytorch/lada_wavs --input_training_file /home/yehor/iSTFTNet-pytorch/training_list.txt --input_validation_file /home/yehor/iSTFTNet-pytorch/validation_list.txt

Error:

...        (2): Conv1d(128, 128, kernel_size=(11,), stride=(1,), padding=(5,))
      )
    )
  )
  (conv_post): Conv1d(128, 18, kernel_size=(7,), stride=(1,), padding=(3,))
  (reflection_pad): ReflectionPad1d((1, 0))
)
checkpoints directory :  cp_hifigan
Epoch: 1
/home/yehor/.local/lib/python3.8/site-packages/torch/functional.py:632: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:801.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/home/yehor/.local/lib/python3.8/site-packages/torch/functional.py:632: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:801.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/home/yehor/.local/lib/python3.8/site-packages/torch/functional.py:632: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:801.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/home/yehor/.local/lib/python3.8/site-packages/torch/functional.py:632: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:801.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "train.py", line 280, in <module>
    main()
  File "train.py", line 276, in main
    train(0, a, h)
  File "train.py", line 126, in train
    y_g_hat = stft.inverse(spec, phase)
  File "/home/yehor/iSTFTNet-pytorch/stft.py", line 198, in inverse
    inverse_transform = torch.istft(
RuntimeError: istft input and window must be on the same device but got self on cuda:0 and window on cpu

opened by egorsmkv 2
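
One way to avoid this class of error is to create the window on the device of the tensor passed to torch.istft. The following is a minimal sketch under assumptions, not the repository's stft.py: the function name inverse_stft is hypothetical, and n_fft=16 with hop_length=4 are taken from the config values mentioned on this page.

```python
import torch

def inverse_stft(magnitude, phase, n_fft=16, hop_length=4):
    """Rebuild a waveform from magnitude and phase with torch.istft."""
    # Allocate the window on the same device and dtype as the input, so
    # torch.istft never sees a CUDA spectrogram paired with a CPU window.
    window = torch.hann_window(n_fft, device=magnitude.device, dtype=magnitude.dtype)
    spec = torch.polar(magnitude, phase)  # magnitude * exp(1j * phase)
    return torch.istft(spec, n_fft, hop_length=hop_length,
                       win_length=n_fft, window=window)

# Round trip on whichever device is available
device = "cuda" if torch.cuda.is_available() else "cpu"
wave = torch.randn(1, 256, device=device)
analysis_window = torch.hann_window(16, device=device)
spec = torch.stft(wave, 16, hop_length=4, win_length=16,
                  window=analysis_window, return_complex=True)
restored = inverse_stft(spec.abs(), spec.angle())
```

If the window is stored on a module instead, registering it with register_buffer lets model.to(device) move it along with the parameters.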

Different sample rate

Hi @rishikksh20 , thanks for your work.

I have a question. If I want to use a 16 kHz sampling rate, how should I modify the configuration file? Changing only sampling_rate in the JSON is presumably not enough.

opened by wizardk 2
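
The thread does not record an answer, but two structural constraints can be checked whatever values are chosen. The sketch below assumes HiFi-GAN-style config keys (sampling_rate, hop_size, fmax, upsample_rates) plus the gen_istft_hop_size key mentioned on this page. The upsample_rates value [8, 8] is inferred from the model name C8C8I and is not confirmed here, and check_vocoder_config is a hypothetical helper.

```python
import math

def check_vocoder_config(cfg):
    """Return a list of consistency problems found in a vocoder config dict."""
    problems = []
    # The generator turns one mel frame into hop_size samples: the conv
    # upsampling factors times the iSTFT hop must match the mel hop size.
    total = math.prod(cfg["upsample_rates"]) * cfg["gen_istft_hop_size"]
    if total != cfg["hop_size"]:
        problems.append(f"upsampling gives {total} samples per frame, "
                        f"but hop_size is {cfg['hop_size']}")
    # The mel filterbank cannot extend past the Nyquist frequency.
    if cfg["fmax"] is not None and cfg["fmax"] > cfg["sampling_rate"] / 2:
        problems.append(f"fmax {cfg['fmax']} exceeds Nyquist "
                        f"{cfg['sampling_rate'] / 2}")
    return problems

# Assumed baseline values, for illustration only
base = {"sampling_rate": 22050, "hop_size": 256, "fmax": 8000,
        "upsample_rates": [8, 8], "gen_istft_hop_size": 4}

# 16 kHz with a shorter hop, but with the generator left unchanged
scaled = dict(base, sampling_rate=16000, hop_size=200)

print(check_vocoder_config(base))    # []
print(check_vocoder_config(scaled))  # one problem: 256 != 200
```

The check only flags inconsistencies. Whether to keep hop_size at 256 or rescale it, together with the window and FFT sizes, for 16 kHz audio remains a modeling choice.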

Added generator's iSTFT size to config file & iSTFT speed-ups

Added 2 extra hyperparameters to config_v1.json:

"gen_istft_n_fft": 16

"gen_istft_hop_size": 4

Replaced Seetharaman's version of STFT with the built-in torch implementation for a speed boost (because the window_sumsquare function invoked by STFT.inverse runs in plain CPU code).

opened by aqtq314 2
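
These two hyperparameters fix the shape of the generator's last layer and the length of its output. The arithmetic below is a sketch under one assumption: the final convolution emits one magnitude channel and one phase channel per one-sided FFT bin. This agrees with the conv_post layer printed in the log above, Conv1d(128, 18), but is not confirmed by the thread. The function names are hypothetical.

```python
def conv_post_channels(gen_istft_n_fft):
    # A one-sided spectrum of a real signal has n_fft // 2 + 1 bins.
    n_bins = gen_istft_n_fft // 2 + 1
    # Assumption: one magnitude channel and one phase channel per bin.
    return 2 * n_bins

def samples_per_frame(gen_istft_hop_size):
    # Each iSTFT frame advances the output waveform by one hop.
    return gen_istft_hop_size

print(conv_post_channels(16))  # 18
print(samples_per_frame(4))    # 4
```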

A multi-GPU training bug

In stft.py, lines 164-165:

window_sum = window_sum.cuda() if magnitude.is_cuda else window_sum
inverse_transform[:, :, approx_nonzero_indices] /= window_sum[approx_nonzero_indices]

raise an error, because inverse_transform might be on cuda:1 while window_sum is on cuda:0. Changing line 164 to

window_sum = window_sum.to(inverse_transform.device) if magnitude.is_cuda else window_sum

will fix the problem.

opened by mayfool 1

Single frequency line problem

Thanks for the implementation of iSTFT. It has better inference speed than HiFi-GAN v1. However, I found a single frequency line that causes a little noise. I use a 16 kHz dataset for training, and the line is always exactly at 4 kHz, which is the middle of the frequency range. I'm trying to fix this problem. Do you have the same problem?

opened by mayfool 7

Fix TypeError: 'torch.device' object is not callable

Following issue https://github.com/rishikksh20/iSTFTNet-pytorch/issues/1, line 164 in stft.py was changed to https://github.com/rishikksh20/iSTFTNet-pytorch/blob/e928a6b604033a3857757562af36241f9225adfc/stft.py#L164

However, inverse_transform.device() raises the exception mentioned in the title. Changing it to inverse_transform.device fixes the problem.

opened by leminhnguyen 0
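
The difference comes down to device being an attribute of a tensor, not a method. A minimal sketch with placeholder tensors (the shapes and the helper name match_device are illustrative, not taken from stft.py):

```python
import torch

def match_device(window_sum, inverse_transform):
    # `device` is an attribute holding a torch.device object. Writing
    # inverse_transform.device() tries to call that object and raises
    # TypeError: 'torch.device' object is not callable.
    return window_sum.to(inverse_transform.device)

inverse_transform = torch.zeros(1, 1, 8)  # placeholder for the iSTFT output
window_sum = torch.ones(8)                # placeholder for the window envelope
window_sum = match_device(window_sum, inverse_transform)
```

Because .to() returns the tensor unchanged when it already lives on the target device, the same line works for CPU, single-GPU, and multi-GPU runs.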
