The official implementation of VAENAR-TTS, a VAE-based non-autoregressive TTS model.

Overview

VAENAR-TTS

This repo contains code accompanying the paper "VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis".

Samples | Paper | Pretrained Models

Usage

0. Dataset

  1. English: LJSpeech
  2. Mandarin: DataBaker (标贝)

1. Environment setup

conda env create -f environment.yml
conda activate vaenartts-env
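
To verify the setup (a quick check; this assumes the environment provides TensorFlow 2.x, which the TF_FORCE_GPU_ALLOW_GROWTH flag used below implies):

python -c "import tensorflow as tf; print(tf.__version__, tf.config.list_physical_devices('GPU'))"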

2. Data pre-processing

For English using LJSpeech:

CUDA_VISIBLE_DEVICES= python preprocess.py --dataset ljspeech --data_dir /path/to/extracted/LJSpeech-1.1 --save_dir ./ljspeech

For Mandarin using DataBaker (标贝):

CUDA_VISIBLE_DEVICES= python preprocess.py --dataset databaker --data_dir /path/to/extracted/biaobei --save_dir ./databaker
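
Note that the empty CUDA_VISIBLE_DEVICES= hides all GPUs, so preprocessing runs on CPU. Afterwards the save directory should contain the TFRecords consumed by training; a quick check (the exact layout may differ):

ls ./ljspeech/tfrecords/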

3. Training

For English using LJSpeech:

CUDA_VISIBLE_DEVICES=0 TF_FORCE_GPU_ALLOW_GROWTH=true python train.py --dataset ljspeech --log_dir ./lj-log_dir --test_dir ./lj-test_dir --data_dir ./ljspeech/tfrecords/ --model_dir ./lj-model_dir

For Mandarin using DataBaker (标贝):

CUDA_VISIBLE_DEVICES=0 TF_FORCE_GPU_ALLOW_GROWTH=true python train.py --dataset databaker --log_dir ./db-log_dir --test_dir ./db-test_dir --data_dir ./databaker/tfrecords/ --model_dir ./db-model_dir
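
Training can be monitored with TensorBoard, assuming train.py writes standard TensorFlow summaries to --log_dir:

tensorboard --logdir ./lj-log_dir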

4. Inference (synthesize speech for the whole test set)

For English using LJSpeech:

CUDA_VISIBLE_DEVICES=0 TF_FORCE_GPU_ALLOW_GROWTH=true python inference.py --dataset ljspeech --test_dir ./lj-test-2000 --data_dir ./ljspeech/tfrecords/ --batch_size 16 --write_wavs true --draw_alignments true --ckpt_path ./lj-model_dir/ckpt-2000

For Mandarin using DataBaker (标贝):

CUDA_VISIBLE_DEVICES=0 TF_FORCE_GPU_ALLOW_GROWTH=true python inference.py --dataset databaker --test_dir ./db-test-2000 --data_dir ./databaker/tfrecords/ --batch_size 16 --write_wavs true --draw_alignments true --ckpt_path ./db-model_dir/ckpt-2000
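
--ckpt_path points at a specific checkpoint prefix (here ckpt-2000). To find the newest checkpoint in a model directory, a generic TensorFlow helper (not repo-specific) can be used:

python -c "import tensorflow as tf; print(tf.train.latest_checkpoint('./lj-model_dir'))"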

Reference

  1. XuezheMax/flowseq
  2. keithito/tacotron
Comments
  • Can I use a GAN-based network to replace the flow-based prior P(Z|X)?

    If I understand this paper and FlowSeq correctly, the normalizing flow is used to model the dependence on the text X (from the posterior P(Z|X, Y)). Since a GAN can also model a distribution, can I use a GAN-based network to replace the flow-based prior P(Z|X)?

    opened by seekerzz 4
  • Different result

    Hi, thank you for your awesome work. I tried to synthesize using the inference script and the checkpoint you provided, but the result sounds robotic. Why is that? Am I missing something?

    opened by kikirizki 2
  • Type mismatch exception when running the code

    An exception occurred while running the preprocess.py step:

    Traceback (most recent call last):
      File "train.py", line 329, in <module>
        main()
      File "train.py", line 257, in main
        for fids, texts, mels, t_lengths, m_lengths in train_set.take(1):
      File "/home/wuyx/miniconda3/envs/vc/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 761, in __next__
        return self._next_internal()
      File "/home/wuyx/miniconda3/envs/vc/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 747, in _next_internal
        output_shapes=self._flat_output_shapes)
      File "/home/wuyx/miniconda3/envs/vc/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2728, in iterator_get_next
        _ops.raise_from_not_ok_status(e, name)
      File "/home/wuyx/miniconda3/envs/vc/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6897, in raise_from_not_ok_status
        six.raise_from(core._status_to_exception(e.code, message), None)
      File "<string>", line 3, in raise_from
    tensorflow.python.framework.errors_impl.InvalidArgumentError: Type mismatch between parsed tensor (float) and dtype (double)
      [[{{node ParseTensor_1}}]] [Op:IteratorGetNext]
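
    This error means the tensors in the TFRecords were serialized with one float dtype while the dataset pipeline parses them with another (float32 vs. float64). A minimal sketch of the usual fix, casting to a single dtype on both the write and the read side; the variable names are illustrative, not the repo's actual code:

    import numpy as np
    import tensorflow as tf

    # Hypothetical mel spectrogram; NumPy defaults to float64, which is
    # exactly what triggers a mismatch against a float32 reader.
    mel = np.random.randn(80, 100)

    # Write side: cast to float32 before serializing into the TFRecord.
    serialized = tf.io.serialize_tensor(tf.cast(mel, tf.float32))

    # Read side: out_type must match the dtype used at write time, otherwise
    # tf.io.parse_tensor raises "Type mismatch between parsed tensor ... and dtype ...".
    parsed = tf.io.parse_tensor(serialized, out_type=tf.float32)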

    opened by wuyx517 2
  • About the log probability?

    In the posterior.py code:

    time_level_log_probs = -0.5 * (tf.cast(dim, tf.float32) * tf.math.log(2 * np.pi) + tf.reduce_sum(expanded_logvar + normalized_samples ** 2., axis=3))

    But the log-density of a Gaussian is:

    log_probs = log(1.0 / (sqrt(2.0 * pi) * std) * exp(-0.5 * (x - u) ** 2 / std ** 2))
              = -0.5 * (log(2.0 * pi) + 2.0 * log(std) + (x - u) ** 2 / std ** 2)

    So a 2.0 factor on expanded_logvar seems to be missing, although it may not matter in practice? The corrected line would be:

    time_level_log_probs = -0.5 * (tf.cast(dim, tf.float32) * tf.math.log(2 * np.pi) + tf.reduce_sum(2.0 * expanded_logvar + normalized_samples ** 2., axis=3))
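
    Whether the 2.0 factor is needed depends on whether expanded_logvar stores log(σ²) (log-variance, in which case the original line is already correct) or log(σ) (log-std, in which case the factor is missing). A standalone check of the log-variance convention, illustrative and not from the repo:

    import numpy as np
    from scipy.stats import norm

    x, mu, std = 0.7, 0.1, 1.3
    logvar = np.log(std ** 2)   # convention: logvar = log(sigma^2)
    z = (x - mu) / std          # the "normalized sample"

    # With logvar = log(sigma^2), no extra 2.0 factor is needed;
    # with logvar = log(sigma), the 2.0 factor would be required.
    manual = -0.5 * (np.log(2 * np.pi) + logvar + z ** 2)
    print(manual, norm.logpdf(x, loc=mu, scale=std))  # the two values agree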

    opened by BridgetteSong 1
  • synthesized wavs of long texts

    I downloaded the pretrained DataBaker model and synthesized wavs using inference.py. The results are not very good: the alignment is wrong, especially when the input text is long. For example, for "失恋的人特别喜欢往人烟罕至的角落里钻。", the synthesized wav sounds like 失恋的人特别喜欢往人烟罕至的_角角落里钻钻钻钻_.

    For even longer input texts, the synthesized wavs are completely wrong.

    opened by Liujingxiu23 2
  • The HiFi-GAN config used when generating samples

    Hi, I would like to know which config you used when training the HiFi-GAN model on DataBaker to produce the samples on the website https://light1726.github.io/vaenar-tts/.
    With these parameters clarified, we can better compare the quality of the synthesized wavs with other SOTA acoustic models.

    I mean the following three parameters in the config file:
    "upsample_rates":
    "upsample_kernel_sizes":
    "upsample_initial_channel":

    Thank you!
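
    For reference only (not an answer from the authors): the official HiFi-GAN config_v1.json for 22.05 kHz audio uses the values below; whether the DataBaker samples used the same settings is exactly what this issue asks.

    # Reference values from the official HiFi-GAN config_v1.json (22.05 kHz):
    hifigan_v1 = {
        "upsample_rates": [8, 8, 2, 2],          # product = 256 = hop size
        "upsample_kernel_sizes": [16, 16, 4, 4],
        "upsample_initial_channel": 512,
    }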

    opened by Liujingxiu23 14
Owner
THUHCSI
Human-Computer Speech Interaction Lab at Tsinghua University