🐤 Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation

Rendi Chevi

Last update: Jan 9, 2023

Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji

This is a repository for our paper, 🐤 Nix-TTS (Submitted to INTERSPEECH 2022). We released the pretrained models, an interactive demo, and audio samples below.

[ 📄 Paper Link] [ 🤗 Interactive Demo] [ 📢 Audio Samples]

Abstract: We propose Nix-TTS, a lightweight neural TTS (Text-to-Speech) model achieved by applying knowledge distillation to a powerful yet large-sized generative TTS teacher model. Distilling a TTS model might sound unintuitive due to the generative and disjointed nature of TTS architectures, but pre-trained TTS models can be simplified into encoder and decoder structures, where the former encodes text into some latent representation and the latter decodes the latent into speech data. We devise a framework to distill each component in a non end-to-end fashion. Nix-TTS is end-to-end (vocoder-free) with only 5.23M parameters, or up to an 89.34% reduction from the teacher model. It achieves over 3.04x and 8.36x inference speedup on Intel-i7 CPU and Raspberry Pi respectively, and still retains fair voice naturalness and intelligibility compared to the teacher model.

Getting Started with Nix-TTS

Clone the nix-tts repository and move to its directory

git clone https://github.com/rendchevi/nix-tts.git
cd nix-tts

Install the dependencies

Install the Python dependencies. We recommend Python >= 3.8.

pip install -r requirements.txt

Install espeak on your device (it is used for text tokenization).

sudo apt-get install espeak

If that does not work, follow the official installation instructions.

Download your chosen pre-trained model here.

Model	Num. of Params	Faster than real-time^* (CPU Intel-i7)	Faster than real-time^* (RasPi Model 3B)
Nix-TTS (ONNX)	5.23 M	11.9x	0.50x
Nix-TTS w/ Stochastic Duration (ONNX)	6.03 M	10.8x	0.50x

^* Here we compute how much faster than real-time the model runs as the inverse of the Real Time Factor (RTF). The complete speedup table for all models is given in the paper.
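The footnote above can be made concrete with a small calculation. This is a minimal sketch, assuming the usual definition of RTF as synthesis time divided by the duration of the generated audio; the function names and the timings in the usage lines are hypothetical and are not taken from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if synthesis_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("both durations must be positive")
    return synthesis_seconds / audio_seconds


def faster_than_real_time(synthesis_seconds: float, audio_seconds: float) -> float:
    """Inverse of RTF: seconds of audio produced per second of compute."""
    if synthesis_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("both durations must be positive")
    return audio_seconds / synthesis_seconds


# Hypothetical timings: 0.5 s of compute for 5 s of generated speech.
rtf = real_time_factor(0.5, 5.0)
speedup = faster_than_real_time(0.5, 5.0)
print(f"RTF = {rtf:.3f}, faster than real-time = {speedup:.1f}x")
```

Read this way, the 11.9x entry in the table corresponds to an RTF of roughly 1 / 11.9, or about 0.084.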

And running Nix-TTS is as easy as:

from nix.models.TTS import NixTTSInference
from IPython.display import Audio

# Initiate Nix-TTS
nix = NixTTSInference(model_dir = "<path_to_the_downloaded_model>")
# Tokenize input text
c, c_length, phoneme = nix.tokenize("Born to multiply, born to gaze into night skies.")
# Convert text to raw speech
xw = nix.vocalize(c, c_length)

# Listen to the generated speech
Audio(xw[0,0], rate = 22050)

Acknowledgement

This research is fully and exclusively funded by Kata.ai, where the authors work as part of the Kata.ai Research Team.
Some of the complex parts of our model, as mentioned in the paper, are adapted from the original implementation of VITS and Comprehensive-Transformer-TTS.

Comments

latest phonemizer unable to run
First, it raises an error here:

phonemizer_backend.phonemize( self._expand_abbreviations(t.lower()), strip=True, )

The latest phonemizer needs a list of strings as input.

The return value of the new phonemizer is also different. It returns a list like this:

['bˈɔːɹn tə mˈʌltɪplˌaɪ, bˈɔːɹn tə ɡˈeɪz ˌɪntʊ nˈaɪt skˈaɪz.']
opened by jinfagang 3
input text to phonemize() is str but it must be list of str

Hi, can you help me? I'm trying this code on Google Colab and I'm getting this error message.

Your demo code:
c, c_length, phoneme = nix.tokenize("Born to multiply, born to gaze into night skies.")
error: input text to phonemize() is str but it must be list of str

I changed the line like this:
c, c_length, phoneme = nix.tokenize(["hi"])
error: 'list' object has no attribute 'lower'
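The two errors above describe the same mismatch: the phonemize call wants a list of strings and returns a list, while the tokenizer lowercases a single string before calling it. Below is a minimal sketch of a wrapper that bridges the two. `phonemize_single` is a hypothetical helper, not part of Nix-TTS or phonemizer, and `fake_phonemize` is only a stand-in for `phonemizer_backend.phonemize` so the example runs without espeak installed.

```python
from typing import Callable, List, Union


def phonemize_single(
    phonemize_fn: Callable[..., Union[str, List[str]]],
    text: str,
    **kwargs,
) -> str:
    """Call a list-based phonemize function on one string, return one string."""
    if not isinstance(text, str):
        raise TypeError("text must be a single str, not a list")
    # Lowercase the string first, then wrap it in a list for the backend.
    result = phonemize_fn([text.lower()], **kwargs)
    # Unwrap the one-element list; pass a plain str through unchanged.
    if isinstance(result, str):
        return result
    return result[0]


def fake_phonemize(texts: List[str], strip: bool = True) -> List[str]:
    """Stand-in backend (assumption): returns one output string per input."""
    return [f"<{t}>" for t in texts]


print(phonemize_single(fake_phonemize, "Born to multiply", strip=True))
```

With a wrapper of this shape, the caller keeps passing a plain string to `nix.tokenize`, and only the call into the phonemizer backend changes.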

opened by bmox 1
add Gradio web demo/model to Huggingface

Hi, would you be interested in adding nix-tts to Hugging Face as a Gradio Demo? I see there is already a streamlit demo and I think having a Gradio version would be cool as well.

Examples from other organizations:
Keras: https://huggingface.co/keras-io
Microsoft: https://huggingface.co/microsoft
Facebook: https://huggingface.co/facebook

Example Spaces with repos:
GitHub: https://github.com/salesforce/BLIP
Spaces: https://huggingface.co/spaces/salesforce/BLIP

GitHub: https://github.com/facebookresearch/omnivore
Spaces: https://huggingface.co/spaces/akhaliq/omnivore

Here are guides for adding Spaces, models, and datasets to your org:

How to add a Space: https://huggingface.co/blog/gradio-spaces
How to add models: https://huggingface.co/docs/hub/adding-a-model
Uploading a dataset: https://huggingface.co/docs/datasets/upload_dataset.html

Please let us know if you would be interested and if you have any questions, we can also help with the technical implementation.

opened by AK391 1
Question about the parameter size

Hello, according to your paper and GitHub page, the parameter size is either 5.23M (deterministic predictor) or 6.03M (stochastic predictor). However, the parameter size of the pretrained model far exceeds those. In the text & latent encoder, there are 18 convs in total, each having a weight with shape 192x192x5, for a total of (192x192x5) x 18 x 4 bytes (since a float variable is 4 bytes) = 12.65625 MB. If we include the biases of the conv layers, an embedding layer, a projection layer, a duration predictor module, norm layers and a decoder module, the total will be even bigger. Can you tell me how you calculated the parameter size?

EDIT: I noticed that the given value is the number of parameters, not the size of the parameters in bytes. Thank you.
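The distinction in the EDIT can be checked with a few lines of arithmetic. This is a sketch that takes the 18 convolutions with 192x192x5 weights from the question above and assumes 4-byte float32 storage; the helper names are illustrative.

```python
BYTES_PER_FLOAT32 = 4


def conv_weight_params(out_channels: int, in_channels: int, kernel_size: int) -> int:
    """Number of weights in a 1-D convolution, biases excluded."""
    return out_channels * in_channels * kernel_size


def size_in_mib(num_params: int, bytes_per_param: int = BYTES_PER_FLOAT32) -> float:
    """Storage needed for num_params values, in MiB (1024 * 1024 bytes)."""
    return num_params * bytes_per_param / (1024 ** 2)


# 18 convs with 192x192x5 weights, as described in the question.
conv_params = 18 * conv_weight_params(192, 192, 5)
print(conv_params)               # 3317760 parameters
print(size_in_mib(conv_params))  # 12.65625 MiB when stored as float32

# The reported 5.23 M is a parameter count; its float32 footprint is larger.
print(round(size_in_mib(5_230_000), 2))
```

So the 18 convolutions hold about 3.3 M parameters, which fits inside the reported 5.23 M count. The 12.65625 figure is their storage size, not their count.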

opened by aask1357 0
Tuning and compilation with Apache TVM.
Dear Developers, I spent some time trying to figure out how to compile Nix-TTS using Apache TVM. The idea was the same in concept as your compilation for Raspberry Pi. I discovered the following:

Apache TVM requires inputs to have static, not dynamic, shapes.

Nix-TTS models differ in how the shapes of inputs and outputs are constructed, depending on whether the model is deterministic or stochastic.

It is possible to change the input shape to static, using for example the Sclblonnx package. Unfortunately, I am not familiar enough with neural network frameworks to figure out the fixes required to allow compilation.

Since the deterministic model is faster and its voice quality is still fantastic, I had a look at its inputs:

The most important inputs of the "encoder" network are both dynamic:
Input: "c" as INT64 with dimensions: [0, 0]
Input: "c_lengths" as INT64 with dimension: [0]

The only input of the "decoder" network is as follows:
Input: "z" as FLOAT with dimensions: [0, 0, 0]

What would be the procedure to make the inputs static, if that is possible for the deterministic model? Would the same be possible for the stochastic model? Last but not least: since the deterministic model's encoder output, which is the decoder input, has a dynamic shape, would it be possible to merge both graphs into a single model file?

While I don't know whether my case can be implemented, I will be more than happy to describe the procedure or publish the compiled model.

Thank you in advance for any hints and feedback...
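One common way to feed a model that has a static input shape is to pad every token sequence to a fixed maximum length and pass the true length separately. That matches the "c" / "c_lengths" input pair listed above. Below is a minimal sketch in plain Python. `MAX_TOKENS` and `PAD_ID` are hypothetical values. It is also an assumption, not confirmed by the repository, that the encoder uses "c_lengths" to ignore the padded positions.

```python
from typing import List, Tuple

MAX_TOKENS = 128  # hypothetical fixed sequence length chosen at export time
PAD_ID = 0        # hypothetical padding token id


def pad_to_static(
    tokens: List[int],
    max_tokens: int = MAX_TOKENS,
    pad_id: int = PAD_ID,
) -> Tuple[List[List[int]], List[int]]:
    """Return (c, c_lengths) with c of fixed shape [1, max_tokens]."""
    if len(tokens) > max_tokens:
        raise ValueError(f"sequence of {len(tokens)} tokens exceeds {max_tokens}")
    padded = tokens + [pad_id] * (max_tokens - len(tokens))
    # Batch dimension of 1; the true length is kept so padding can be masked.
    return [padded], [len(tokens)]


c, c_lengths = pad_to_static([12, 7, 33, 5])
print(len(c), len(c[0]), c_lengths)
```

The trade-off is that longer inputs have to be split into chunks or rejected, and shorter ones still pay the compute cost of the full fixed length.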
opened by pawelurbanski 0
Comparison of distilled model vs end-to-end training from scratch

Just curious - did you try training the same model architecture end-to-end from scratch (i.e. not distilling from VITS), and if so, are there any audio comparison samples available?

opened by nmfisher 1
Replace the IPython audio player with file writer in example
I just tested nix-tts on a Raspberry Pi 4, pretty impressive 👍 The real-time factor is 0.5, by the way (2x faster than real-time). However, I had some trouble writing the audio buffer into a file, because the example depends on IPython, which is a) not in the requirements and b) probably meant for Jupyter or Hugging Face (?), not a local test.

I tried to replace it with wave (because it is lightweight) like this:

import wave
...
wf = wave.open('test.wav', 'wb')
wf.setnchannels(1)
wf.setsampwidth(2)
wf.setframerate(22050)
wf.setnframes(len(xw[0,0]))
wf.writeframesraw(xw[0,0].tobytes())

It does work somehow, but there is obviously something wrong in the encoding, since I'm getting mostly noise from that code.

Since I couldn't solve the issue, I gave up and used scipy.io.wavfile.write("test.wav", 22050, xw[0,0]) in the end. It works, but you have to install a bazillion more dependencies, which takes forever on an RPi4.

So can you recommend any working alternative to scipy (which is not librosa ^^)?
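A likely cause of the noise in the wave attempt above is a sample-format mismatch: `setsampwidth(2)` declares 16-bit integer samples, but the raw bytes of a floating-point array are written. Below is a minimal standard-library sketch that converts the samples first. It assumes, without confirmation from the repository, that `xw[0,0]` holds float samples in the range [-1.0, 1.0]. `write_wav_int16` is a hypothetical helper name.

```python
import struct
import wave
from typing import Iterable


def write_wav_int16(path: str, samples: Iterable[float], sample_rate: int = 22050) -> None:
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    pcm = []
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip so scaling cannot overflow int16
        pcm.append(int(round(s * 32767)))  # scale to the signed 16-bit range
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        # WAV stores samples little-endian, so pack explicitly with "<h".
        wf.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
```

Under that assumption, the call in the example would be `write_wav_int16("test.wav", xw[0,0])`. It uses only `wave` and `struct`, at the cost of a Python-level loop over the samples.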
opened by fquirin 0
Hi, will u opensource distillation training code as well?

Nix-TTS is quite impressive. I tried it, and it is fast and natural compared with models of a similar parameter count.

However, it seems the distillation part is not open-sourced. I wonder whether this part can be made available, so that users can compress their own trained models.

opened by jinfagang 5
