🐤 Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation

Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji

This is the repository for our paper, 🐤 Nix-TTS (submitted to INTERSPEECH 2022). We have released the pretrained models, an interactive demo, and audio samples below.

[ 📄 Paper Link] [ 🤗 Interactive Demo] [ 📢 Audio Samples]

Abstract    We propose Nix-TTS, a lightweight neural TTS (Text-to-Speech) model obtained by applying knowledge distillation to a powerful yet large generative TTS teacher model. Distilling a TTS model might sound unintuitive due to the generative and disjointed nature of TTS architectures, but pre-trained TTS models can be simplified into encoder and decoder structures, where the former encodes text into a latent representation and the latter decodes the latent into speech data. We devise a framework to distill each component in a non end-to-end fashion. Nix-TTS is end-to-end (vocoder-free) with only 5.23M parameters, up to an 82% reduction relative to the teacher model. It achieves over 3.26x and 8.36x inference speedup on an Intel-i7 CPU and a Raspberry Pi respectively, while still retaining fair voice naturalness and intelligibility compared to the teacher model.

Getting Started with Nix-TTS

Clone the nix-tts repository and move to its directory

git clone https://github.com/rendchevi/nix-tts.git
cd nix-tts

Install the dependencies

  • Install the Python dependencies. We recommend Python >= 3.8.
pip install -r requirements.txt
  • Install espeak on your device (for text tokenization).
sudo apt-get install espeak

If that does not work, follow the official espeak installation instructions.

Download your chosen pre-trained model here.

Model                                  | Num. of Params | Faster than real-time* (CPU Intel-i7) | Faster than real-time* (RasPi Model 3B)
Nix-TTS (ONNX)                         | 5.23 M         | 11.9x                                 | 0.50x
Nix-TTS w/ Stochastic Duration (ONNX)  | 6.03 M         | 10.8x                                 | 0.50x

* We compute how much faster than real-time the model runs as the inverse of the Real Time Factor (RTF). The complete table of speedups for all models is given in the paper.
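
As a rough illustration of the footnote, RTF is conventionally the time spent synthesizing divided by the duration of the audio produced, so its inverse says how many seconds of speech come out per second of compute. The sketch below uses made-up timings, and the function names are hypothetical helpers, not part of the nix-tts package.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF: time spent synthesizing divided by the duration of the audio produced."""
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(synthesis_seconds: float, audio_seconds: float) -> float:
    """Inverse of RTF: values above 1 mean faster than real-time."""
    return audio_seconds / synthesis_seconds


# Hypothetical timing: 3.0 s of speech generated in 0.25 s of compute.
print(real_time_factor(0.25, 3.0))
print(speedup_over_real_time(0.25, 3.0))  # 12.0

# A value such as 0.50x corresponds to an RTF of 2.0 (slower than real-time).
print(speedup_over_real_time(2.0, 1.0))   # 0.5
```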

And running Nix-TTS is as easy as:

from nix.models.TTS import NixTTSInference
from IPython.display import Audio

# Initiate Nix-TTS
nix = NixTTSInference(model_dir = "<path_to_the_downloaded_model>")
# Tokenize input text
c, c_length, phoneme = nix.tokenize("Born to multiply, born to gaze into night skies.")
# Convert text to raw speech
xw = nix.vocalize(c, c_length)

# Listen to the generated speech
Audio(xw[0,0], rate = 22050)

Acknowledgement

Comments
  • latest phonemizer unable to run

    First, it raises an error here:

    phonemizer_backend.phonemize(
        self._expand_abbreviations(t.lower()),
        strip=True,
    )

    The latest version requires a list of strings as input.

    Also, the return value of the new phonemizer looks strange. It returns something like this:

    ['bˈɔːɹn tə mˈʌltɪplˌaɪ, bˈɔːɹn tə ɡˈeɪz ˌɪntʊ nˈaɪt skˈaɪz.']

    opened by jinfagang 3
  • input text to phonemize() is str but it must be list of str

    Hi, can you help me? I'm trying this code on Google Colab and I'm getting this error message.

    Your demo code: c, c_length, phoneme = nix.tokenize("Born to multiply, born to gaze into night skies.")
    Error: input text to phonemize() is str but it must be list of str

    I changed the line like this: c, c_length, phoneme = nix.tokenize(["hi"])
    Error: 'list' object has no attribute 'lower'
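
The two errors reported in these comments are consistent with a phonemizer backend that accepts only a list of strings, while the tokenizer still calls `.lower()` on a single string. A minimal sketch of one possible workaround: lowercase the string first, wrap it in a one-element list, and unwrap the single result. Here `phonemize_fn` stands in for the real backend call, and `fake_backend` is a hypothetical substitute that only imitates the list-in, list-out contract; neither name comes from the nix-tts code.

```python
from typing import Callable, List


def phonemize_single(text: str, phonemize_fn: Callable[[List[str]], List[str]]) -> str:
    """Phonemize one string with a backend that only accepts a list of strings."""
    # Normalize while the text is still a str (a list has no .lower()).
    normalized = text.lower()
    # Wrap in a one-element list for the backend, then unwrap the single result.
    phonemes = phonemize_fn([normalized])
    return phonemes[0]


def fake_backend(texts: List[str]) -> List[str]:
    """Hypothetical stand-in: rejects str input, returns one output per input."""
    if isinstance(texts, str):
        raise RuntimeError("input text to phonemize() is str but it must be list of str")
    return [t.upper() for t in texts]


print(phonemize_single("Born to multiply.", fake_backend))  # BORN TO MULTIPLY.
```

With the real library, the same wrapping would go inside the tokenizer where `phonemizer_backend.phonemize(...)` is called; that placement is an assumption based on the snippet quoted in the first comment.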

    opened by bmox 1
  • add Gradio web demo/model to Huggingface

    Hi, would you be interested in adding nix-tts to Hugging Face as a Gradio demo? I see there is already a Streamlit demo, and I think having a Gradio version would be cool as well.

    Examples from other organizations:
    Keras: https://huggingface.co/keras-io
    Microsoft: https://huggingface.co/microsoft
    Facebook: https://huggingface.co/facebook

    Example Spaces with repos:
    GitHub: https://github.com/salesforce/BLIP
    Spaces: https://huggingface.co/spaces/salesforce/BLIP

    GitHub: https://github.com/facebookresearch/omnivore
    Spaces: https://huggingface.co/spaces/akhaliq/omnivore

    Here are guides for adding Spaces, models, and datasets to your org:

    How to add a Space: https://huggingface.co/blog/gradio-spaces
    How to add models: https://huggingface.co/docs/hub/adding-a-model
    Uploading a dataset: https://huggingface.co/docs/datasets/upload_dataset.html

    Please let us know if you would be interested and if you have any questions, we can also help with the technical implementation.

    opened by AK391 1
  • Question about the parameter size

    Hello, according to your paper and GitHub page, the parameter size is either 5.23M (deterministic predictor) or 6.03M (stochastic predictor). However, the size of the pretrained model far exceeds those figures. In the text and latent encoder there are 18 convs in total, each with a weight of shape 192x192x5, for a total of (192x192x5) x 18 x 4 bytes (since a float variable is 4 bytes) = 12.65625 MB. If we include the conv biases, an embedding layer, a projection layer, a duration predictor module, norm layers, and a decoder module, the total will be even bigger. Can you tell me how you calculated the parameter size?

    EDIT: I noticed that the given value was the number of parameters, not the size of the parameters in bytes. Thank you.
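
The arithmetic in this comment can be checked directly. The short sketch below uses only the shapes and counts quoted in the comment (18 conv weights of shape 192x192x5, 4 bytes per float) and separates the number of parameters from the storage they occupy.

```python
# Shapes and counts as quoted in the comment above.
CHANNELS_IN = 192
CHANNELS_OUT = 192
KERNEL_SIZE = 5
NUM_CONVS = 18
BYTES_PER_FLOAT32 = 4

# Number of weights across the 18 conv layers (biases excluded).
conv_weight_params = CHANNELS_IN * CHANNELS_OUT * KERNEL_SIZE * NUM_CONVS
# Storage needed for those weights at 4 bytes each.
conv_weight_bytes = conv_weight_params * BYTES_PER_FLOAT32
conv_weight_mib = conv_weight_bytes / (1024 ** 2)

print(f"parameters: {conv_weight_params / 1e6:.2f} M")  # parameters: 3.32 M
print(f"storage: {conv_weight_mib:.5f} MiB")            # storage: 12.65625 MiB
```

So these conv weights account for about 3.32 M parameters, which is below the reported 5.23 M count, while the 12.65625 figure is their size in (binary) megabytes.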

    opened by aask1357 0
  • Tuning and compilation with Apache TVM.

    Dear Developers, I spent some time trying to figure out how to compile Nix-TTS using Apache TVM. The idea was similar in concept to your compilation for Raspberry Pi. I discovered the following:

    1. Apache TVM requires inputs to have static, not dynamic, shapes.
    2. Nix-TTS models differ in how the shapes of inputs and outputs are constructed, depending on whether the model is deterministic or stochastic.

    It is possible to change the input shape to static using, for example, the Sclblonnx package. Unfortunately, I am not familiar enough with neural network frameworks to figure out the fixes required to allow compilation.

    Since the deterministic model is faster and its voice quality is still fantastic, I had a look at its inputs:

    1. The most important inputs of the "encoder" network are both dynamic:
       • Input "c" as INT64 with dimensions [0, 0]
       • Input "c_lengths" as INT64 with dimensions [0]
    2. The only input of the "decoder" network is as follows:
       • Input "z" as FLOAT with dimensions [0, 0, 0]

    What would be the procedure to make the inputs static for the deterministic model, if that is possible? What about the stochastic model? Last but not least: since the deterministic model's encoder output has a dynamic shape and serves as the decoder input, would it be possible to merge both graphs into a single model file?

    While I don't know whether my case can be implemented, I will be more than happy to describe the procedure or publish the compiled model.

    Thank you in advance for any hints and feedback...
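
One generic way to feed variable-length text to a graph with fixed input shapes is to pad every token sequence to a chosen maximum length and pass the true length separately. The sketch below illustrates only that padding step under stated assumptions: `max_len`, `pad_id = 0`, and the helper name are hypothetical, and whether the exported Nix-TTS graphs handle padded positions correctly through `c_lengths` is not confirmed by anything in this thread.

```python
from typing import List, Tuple


def pad_to_static_length(
    token_ids: List[int], max_len: int, pad_id: int = 0
) -> Tuple[List[int], int]:
    """Pad a token sequence to a fixed length and return it with its true length."""
    length = len(token_ids)
    if length > max_len:
        raise ValueError(f"sequence of length {length} exceeds static length {max_len}")
    # Right-pad with pad_id so every input has the same shape [max_len].
    padded = token_ids + [pad_id] * (max_len - length)
    return padded, length


# Hypothetical token ids standing in for the "c" input; 3 real tokens, static length 6.
c, c_length = pad_to_static_length([5, 3, 9], max_len=6)
print(c, c_length)  # [5, 3, 9, 0, 0, 0] 3
```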

    opened by pawelurbanski 0
  • Comparison of distilled model vs end-to-end training from scratch

    Just curious - did you try training the same model architecture end-to-end from scratch (i.e. not distilling from VITS), and if so, are there any audio comparison samples available?

    opened by nmfisher 1
  • Replace the IPython audio player with file writer in example

    I just tested nix-tts on Raspberry Pi 4, pretty impressive :+1: Realtime factor is 0.5 btw (2x faster than realtime), but I had some trouble writing the audio buffer into a file because the example depends on IPython which is a) not in the requirements and b) probably meant for Jupyter or Huggingface (?) not some local test.

    I tried to replace it with wave (because it is lightweight) like this:

    import wave
    ...
    wf = wave.open('test.wav', 'wb')
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(22050)
    wf.setnframes(len(xw[0,0]))
    wf.writeframesraw(xw[0,0].tobytes())
    

    It does work somehow, but there is obviously something wrong in the encoding, since I'm getting mostly noise from that code.

    Since I couldn't solve the issue I gave up and used: scipy.io.wavfile.write("test.wav", 22050, xw[0,0]) in the end. It works but you have to install a bazillion more dependencies which takes forever on RPi4.

    So can you recommend any working alternative to scipy (which is not librosa ^^)?
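
The noise described in this comment is consistent with writing floating-point samples as raw bytes while declaring a 2-byte sample width. A minimal standard-library sketch of one possible alternative: clip the samples, scale them to 16-bit integers, and then write them with `wave`. It assumes the generated audio holds float samples in [-1.0, 1.0], which this thread does not confirm; the helper name and the sine-tone test signal are hypothetical.

```python
import array
import math
import os
import tempfile
import wave


def write_wav_int16(path, samples, sample_rate=22050):
    """Write float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    # Clip each float and scale it to the signed 16-bit integer range.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)             # mono
        wf.setsampwidth(2)             # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())  # header frame count is filled in on close


# Stand-in for xw[0, 0]: one second of a 440 Hz sine tone (hypothetical signal).
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
out_path = os.path.join(tempfile.gettempdir(), "nix_tts_test.wav")
write_wav_int16(out_path, tone)
```

Under the same assumption, the call for real output would be `write_wav_int16("test.wav", xw[0, 0])`, since the helper accepts any iterable of numbers.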

    opened by fquirin 0
  • Hi, will you open-source the distillation training code as well?

    Nix-TTS is quite impressive. I tried it, and it is fast and natural compared with models of a similar parameter count.

    However, it seems the distillation part is not open-sourced. I wonder whether this part could be made available, so that users can compress their own trained models.

    opened by jinfagang 5