LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search


Overview

LightSpeech

Unofficial PyTorch implementation of LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search. This repo uses the ESPnet implementation of FastSpeech 2 as a base. It implements only the final LightSpeech model, not the Neural Architecture Search described in the paper.

However, I was able to compress the model only about 3x (from 27 M to 7.99 M trainable parameters), not the 15x reported in the paper.

Requirements :

All code is written in Python 3.6.2.

Install PyTorch

Before installing PyTorch, check your CUDA version by running the following command: nvcc --version

pip install torch torchvision

This repo uses PyTorch 1.6.0 because it relies on torch.bucketize, which is not available in earlier versions of PyTorch.
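As a quick sanity check on the version requirement above, the sketch below parses a PyTorch version string and tests it against 1.6.0. The helper names `version_tuple` and `has_bucketize` are hypothetical and not part of this repo; the 1.6.0 threshold is taken from the note above.

```python
import re


def version_tuple(version):
    """Parse the leading 'major.minor[.patch]' digits of a version string.

    Local build suffixes such as '+cu101' are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return tuple(int(part or 0) for part in match.groups())


def has_bucketize(version):
    """True if the version is at least 1.6.0 (hypothetical helper)."""
    return version_tuple(version) >= (1, 6, 0)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print(torch.__version__, has_bucketize(torch.__version__))
```

Comparing integer tuples rather than raw strings avoids ranking "1.10.0" below "1.6.0".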

Installing other requirements :

pip install -r requirements.txt

To use TensorBoard, install tensorboard version 1.14.0 separately, together with a supported TensorFlow version (1.14.0).

For Preprocessing :

The filelists folder contains LJSpeech dataset files already processed with MFA (Montreal Forced Aligner), so for LJSpeech you don't need to align the text with the audio to extract durations. For other datasets, follow the instructions here. For the remaining pre-processing, run the following command:

python .\nvidia_preprocessing.py -d path_of_wavs -c configs/default.yaml
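For datasets other than LJSpeech, phoneme durations have to be derived from the aligner output. The sketch below shows one common way to turn phoneme intervals in seconds into mel-frame counts. It is an illustration only, not this repo's conversion script: the function name is hypothetical, the 22050 Hz sample rate matches LJSpeech, and the hop length of 256 is an assumption that should be replaced by the value in your config.

```python
def intervals_to_frame_durations(intervals, sample_rate=22050, hop_length=256):
    """Convert (start_sec, end_sec) phoneme intervals into mel-frame counts.

    Each boundary is rounded to a frame index first, so for contiguous
    intervals the counts sum to the frame index of the final boundary.
    """
    durations = []
    for start, end in intervals:
        start_frame = int(round(start * sample_rate / hop_length))
        end_frame = int(round(end * sample_rate / hop_length))
        # Guard against inverted intervals producing negative durations.
        durations.append(max(end_frame - start_frame, 0))
    return durations


if __name__ == "__main__":
    # Hypothetical alignment for three phonemes.
    print(intervals_to_frame_durations([(0.0, 0.1), (0.1, 0.25), (0.25, 0.3)]))
```

Rounding the boundaries instead of the individual durations keeps the total number of frames consistent with the length of the mel-spectrogram.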

To find the minimum and maximum of F0 and energy:

python .\compute_statistics.py

Update the following values in hparams.py with the computed minimum and maximum of F0 and energy:

p_min = Min F0/pitch
p_max = Max F0
e_min = Min energy
e_max = Max energy
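The role of these bounds can be sketched as follows, assuming they are used as in FastSpeech 2 style variance adaptors, where pitch and energy values are quantized into bins before an embedding lookup. This is a pure-Python stand-in for torch.linspace and torch.bucketize, not the repo's code; the function names, the bin count, and the example numbers are hypothetical.

```python
from bisect import bisect_left


def linear_boundaries(v_min, v_max, n_bins):
    """Return n_bins - 1 evenly spaced boundaries from v_min to v_max inclusive."""
    count = n_bins - 1
    return [v_min + (v_max - v_min) * i / (count - 1) for i in range(count)]


def bucketize(values, boundaries):
    """Map each value to a bin index in [0, len(boundaries)].

    Uses bisect_left, which matches torch.bucketize with right=False:
    boundaries[i - 1] < value <= boundaries[i].
    """
    return [bisect_left(boundaries, v) for v in values]


if __name__ == "__main__":
    p_min, p_max = 0.0, 90.0  # hypothetical values, not real dataset statistics
    boundaries = linear_boundaries(p_min, p_max, n_bins=5)
    print(boundaries)
    print(bucketize([-5.0, 0.0, 45.0, 90.0, 150.0], boundaries))
```

Values below the minimum fall into the first bin and values above the maximum into the last one, which is why stale bounds taken from a different dataset would collapse many frames into the two outer bins.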

For training

python train_lightspeech.py --outdir etc -c configs/default.yaml -n "name"

For inference

WIP

python .\inference.py -c .\configs\default.yaml -p .\checkpoints\first_1\xyz.pyt --out output --text "ModuleList can be indexed like a regular Python list but modules it contains are properly registered."

For TorchScript Export

python export_torchscript.py -c configs/default.yaml -n fastspeech_scrip --outdir etc

Checkpoint and samples:

WIP

Comments

[Bug] duration predictor safe guard

https://github.com/rishikksh20/LightSpeech/blob/d9290f755f02d33d520c2304c5b6624f87864e55/core/duration_modeling/length_regulator.py#L89

A runtime error occurs when the duration predictor outputs a negative value. The line may need to be changed to: out.append(x_.repeat(max(int(d_), 0), 1))
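The effect of the proposed guard can be illustrated with a framework-free sketch of length regulation. It uses plain Python lists instead of tensors and is not the repo's length_regulator.py; the function name `expand_by_duration` is hypothetical.

```python
def expand_by_duration(frames, durations):
    """Repeat each phoneme vector by its predicted duration.

    Durations are truncated to integers and clamped at zero, so a negative
    prediction drops the phoneme instead of raising an error.
    """
    out = []
    for frame, d in zip(frames, durations):
        repeats = max(int(d), 0)
        # Copy the vector so repeated rows do not share one list object.
        out.extend(list(frame) for _ in range(repeats))
    return out


if __name__ == "__main__":
    hidden = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5]]
    print(expand_by_duration(hidden, [2, -1, 1.7]))
```

Clamping silently skips phonemes with non-positive durations. If every predicted duration in an utterance is non-positive, the output is empty, so a caller may still want to handle that case explicitly.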

opened by OnceJune

Duration in filelist starts with -2?

Hi, thanks for sharing the code. When I tried to run LightSpeech on my own dataset, I found that in the sample file list the cumulative durations always start from -2: https://github.com/rishikksh20/LightSpeech/blob/d9290f755f02d33d520c2304c5b6624f87864e55/filelists/valid_filelist.txt#L1 However, the guide linked from the README (https://github.com/ivanvovk/DurIAN#6-how-to-align-your-own-data) provides a file list whose cumulative durations start from 0: https://github.com/ivanvovk/DurIAN/blob/a708e9c5bb89895ddf08ca1a13bc8fd683b1e23f/filelists/valid_filelist.txt#L1 So my question is: where does the -2 come from? I have finished the MFA alignment and am now building the file list. Should I add a -2 at the very beginning? Thanks in advance.

opened by OnceJune

Custom dataset - TextGrid to filelist

I have 2 questions about the filelist:

Why is -2 the default start point?

For some entries, the number of phonemes is 1 less than the number of durations. What does that mean?

Thanks for your code!

opened by huypl53



Owner

Rishikesh (ऋषिकेश)
