Official PyTorch implementation of SyntaSpeech (IJCAI 2022)

Zhenhui YE

Last update: Nov 24, 2022

Overview

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

This repository is the official PyTorch implementation of our IJCAI-2022 paper, in which we propose SyntaSpeech for syntax-aware non-autoregressive Text-to-Speech.

SyntaSpeech builds on PortaSpeech (NeurIPS 2021) and adds three new features:

We propose the Syntactic Graph Builder (Sec. 3.1) and the Syntactic Graph Encoder (Sec. 3.2), which together prove effective at extracting syntactic features that improve the prosody modeling and duration accuracy of the TTS model.
We introduce Multi-Length Adversarial Training (Sec. 3.3), which replaces the flow-based post-net in PortaSpeech, speeding up inference and improving the naturalness of the generated audio.
We support three datasets: LJSpeech (single-speaker English), Biaobei (single-speaker Chinese), and LibriTTS (multi-speaker English).
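To make the idea of a syntactic graph concrete, here is a minimal sketch of turning a dependency parse into graph edges. It is not the repository's implementation: the function name `build_syntactic_edges`, the `heads` input format, and the choice to emit one forward (head to dependent) and one backward (dependent to head) edge per dependency are assumptions made for illustration.

```python
from typing import List, Tuple

Edge = Tuple[int, int]  # (source word index, target word index)


def build_syntactic_edges(heads: List[int]) -> Tuple[List[Edge], List[Edge]]:
    """Build forward and backward edges from a dependency parse.

    heads[i] is the index of the head of word i, or -1 if word i is the root.
    This input format is a hypothetical convention chosen for the sketch.
    """
    forward: List[Edge] = []
    backward: List[Edge] = []
    for dependent, head in enumerate(heads):
        if head < 0:
            continue  # the root has no incoming dependency edge
        if head >= len(heads) or head == dependent:
            raise ValueError(f"invalid head {head} for word {dependent}")
        forward.append((head, dependent))   # head -> dependent
        backward.append((dependent, head))  # dependent -> head
    return forward, backward


# "the cat sat": "the" depends on "cat", "cat" depends on "sat" (root).
words = ["the", "cat", "sat"]
heads = [1, 2, -1]
forward_edges, backward_edges = build_syntactic_edges(heads)
print(forward_edges, backward_edges)
```

Edge lists of this kind can be handed to a graph library as source and target index arrays; how SyntaSpeech actually encodes edge types is described in Sec. 3.1 and 3.2 of the paper.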

Environments

conda create -n synta python=3.7
conda activate synta
pip install -U pip
pip install Cython numpy==1.19.1
pip install torch==1.9.0 
pip install -r requirements.txt
# install dgl for graph neural networks; dgl-cu102 supports RTX 2080, dgl-cu113 supports RTX 3090
pip install dgl-cu102 dglgo -f https://data.dgl.ai/wheels/repo.html 
sudo apt install -y sox libsox-fmt-mp3
bash mfa_usr/install_mfa.sh # install forced alignment tools
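After running the commands above, a quick import check can confirm that the key packages are visible in the active environment. This is a minimal sketch: the helper name `missing_packages` is hypothetical, and the package list is only an assumption drawn from the install commands in this section.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list, based on the pip commands above (top-level import names).
REQUIRED = ["numpy", "Cython", "torch", "dgl"]
print("missing:", missing_packages(REQUIRED))
```

An empty list means every listed module can be located; it does not verify versions or CUDA compatibility.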

Run SyntaSpeech!

Follow the steps below to run this repo.

1. Preparation

Data Preparation

You can directly use our binarized datasets for LJSpeech and Biaobei. Download them and unzip them into the data/binary/ folder.

For LibriTTS, download the raw dataset and process it with our data_gen modules. Detailed instructions can be found in docs/prepare_data.

Vocoder Preparation

We provide pre-trained vocoders for the three datasets: HiFi-GAN for LJSpeech and Biaobei, and ParallelWaveGAN for LibriTTS. Download them and unzip them into the checkpoints/ folder.
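Before training, it can help to confirm that the two folders named above exist under the repository root. The sketch below checks only `data/binary` and `checkpoints`; the helper name `check_layout` is hypothetical, and the names of the sub-folders inside them are not assumed.

```python
from pathlib import Path
from typing import Dict


def check_layout(root: str) -> Dict[str, bool]:
    """Report whether the folders mentioned in the preparation step exist."""
    root_path = Path(root)
    expected = ("data/binary", "checkpoints")
    return {rel: (root_path / rel).is_dir() for rel in expected}


# Run from the root of the SyntaSpeech folder.
print(check_layout("."))
```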

2. Training Example

You can then train SyntaSpeech on any of the three datasets.

cd <the root_dir of your SyntaSpeech folder>
export PYTHONPATH=./
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset # train on LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset # train on Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset # train on LibriTTS

3. Tensorboard

tensorboard --logdir=checkpoints/lj_synta
tensorboard --logdir=checkpoints/biaobei_synta
tensorboard --logdir=checkpoints/libritts_synta

4. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset --infer # inference on LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset --infer # inference on Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset --infer # inference on LibriTTS
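The training and inference commands differ only in the trailing `--infer` flag, so they can be assembled by one small helper when scripting experiments. This is a sketch, not part of the repository: `build_run_command` is a hypothetical name, and it uses only the flags that appear in the commands above.

```python
import shlex
from typing import List


def build_run_command(config: str, exp_name: str, infer: bool = False) -> List[str]:
    """Assemble the tasks/run.py command line for training or inference."""
    cmd = ["python", "tasks/run.py",
           "--config", config,
           "--exp_name", exp_name,
           "--reset"]
    if infer:
        cmd.append("--infer")  # note: two dashes
    return cmd


train_cmd = build_run_command("egs/tts/lj/synta.yaml", "lj_synta")
infer_cmd = build_run_command("egs/tts/lj/synta.yaml", "lj_synta", infer=True)
print(" ".join(shlex.quote(part) for part in train_cmd))
print(" ".join(shlex.quote(part) for part in infer_cmd))
```

The resulting list can be passed to `subprocess.run` with `CUDA_VISIBLE_DEVICES` and `PYTHONPATH` set in the `env` argument.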

Audio Demos

Audio samples from the paper can be found on our demo page.

We also provide a HuggingFace demo page for LJSpeech. Try it with your own sentences!

Citation

@article{ye2022syntaspeech,
  title={SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech},
  author={Ye, Zhenhui and Zhao, Zhou and Ren, Yi and Wu, Fei},
  journal={arXiv preprint arXiv:2204.11792},
  year={2022}
}

Acknowledgements

Our code is based on the following repos:

Comments

pinyin preprocess problem

005804 你当#1我傻啊#3？脑子#1那么大#2怎么#1塞进去#4？ ni3 dang1 wo2 sha3 a5 nao3 zi5 na4 me5 da4 zen3 me5 sai1 jin4 qu4

txt_struct=[['', ['']], ['你', ['n', 'i3']], ['当', ['d', 'ang1']], ['我', ['uo3']], ['傻', ['sh', 'a3']], ['啊', ['a', '?', 'n', 'ao3']], ['?', ['z', 'i']], ['脑', ['n', 'a4']], ['子', ['m', 'e']], ['那', ['d', 'a4']], ['么', ['z', 'en3']], ['大', ['m', 'e']], ['怎', ['s', 'ai1']], ['么', ['j', 'in4']], ['塞', ['q', 'v4', '?']], ['进', []], ['去', []], ['?', []], ['', ['']]]

ph_gb_word=['', 'n_i3', 'd_ang1', 'uo3', 'sh_a3', 'a_?n_ao3', 'z_i', 'n_a4', 'm_e', 'd_a4', 'z_en3', 'm_e', 's_ai1', 'j_in4', 'q_v4?', '', '', '', '']

What is 'a_?_n_ao3'?

In the mfa_dict, entries such as ch_a1_d_ou1 and a_?_n_ao3 appear.

opened by windowxiaoming 2
discriminator output['y_c'] never used

The discriminator's output['y_c'] is never used, and it is never computed in the discriminator's forward function. What does this variable mean? https://github.com/yerfor/SyntaSpeech/blob/5b07439633a3e714d2a6759ea4097eb36d6cd99a/tasks/tts/synta.py#L81

opened by mayfool 2
A question of KL divergence calculation

In modules/tts/portaspeech/fvae.py, SyntaFVAE computes loss_kl (line 121). Can someone help explain why loss_kl = ((logqx - logpx) * nonpadding_sqz).sum() / nonpadding_sqz.sum() / logqx.shape[1]? I think loss_kl should be computed as loss_kl = logqx.exp() * (logqx - logpx). I would be very grateful for a reply!
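One possible reading of the formula in question, assuming (this is not confirmed against the repository) that logqx and logpx are log-densities evaluated at latent samples z drawn from the posterior q: since KL(q || p) = E_{z~q}[log q(z) - log p(z)], a plain average of (log q - log p) over the samples is already a Monte Carlo estimate of the KL divergence. The weight q(z) is supplied by the sampling itself; multiplying by exp(log q) is needed only when integrating over the whole support instead of averaging over samples. The sketch below checks this on two 1-D Gaussians, with hypothetical helper names.

```python
import math
import random


def log_normal_pdf(z: float, mu: float, sigma: float) -> float:
    """Log-density of N(mu, sigma^2) at z."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((z - mu) / sigma) ** 2)


def mc_kl(mu_q, sigma_q, mu_p, sigma_p, n_samples=100_000, seed=0):
    """Monte Carlo KL(q || p): average of log q(z) - log p(z) with z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu_q, sigma_q)  # sample from q, so no extra q(z) weight
        total += log_normal_pdf(z, mu_q, sigma_q) - log_normal_pdf(z, mu_p, sigma_p)
    return total / n_samples


def closed_form_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Analytic KL divergence between two 1-D Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2) - 0.5)


print(mc_kl(0.5, 1.0, 0.0, 1.0), closed_form_kl(0.5, 1.0, 0.0, 1.0))
```

In the quoted expression, the mask nonpadding_sqz and the two divisions would then only average the estimate over non-padded positions and latent channels.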

opened by JiaYK 2

mfa for multi speaker.

The code groups MFA inputs for better parallelism. With multi-speaker data, this may go wrong. For the input g_uang3 zh_ou1 n_v3 d_a4 x_ve2 sh_eng1 d_eng1 sh_an1 sh_i1 l_ian2 s_i4 t_ian1 j_ing3 f_ang1 zh_ao3 d_ao4 i2 s_i4 n_v3 sh_i1, the resulting TextGrid is:

	item [1]:
		class = "IntervalTier"
		name = "words"
		xmin = 0.0
		xmax = 9.4444
		intervals: size = 56
			intervals [1]:
				xmin = 0
				xmax = 0.5700000000000001
				text = ""
			intervals [2]:
				xmin = 0.5700000000000001
				xmax = 0.61
				text = "eng"
			intervals [3]:
				xmin = 0.61
				xmax = 0.79
				text = "s_an1"
			intervals [4]:
				xmin = 0.79
				xmax = 0.89
				text = "eng"
			intervals [5]:
				xmin = 0.89
				xmax = 1.06
				text = "i1"
			intervals [6]:
				xmin = 1.06
				xmax = 1.24
				text = "eng"
			intervals [7]:
				xmin = 1.24
				xmax = 1.3
				text = ""
			intervals [8]:
				xmin = 1.3
				xmax = 1.36
				text = "s_an1"
			intervals [9]:
				xmin = 1.36
				xmax = 1.42
				text = ""
			intervals [10]:
				xmin = 1.42
				xmax = 1.49
				text = "eng"
			intervals [11]:
				xmin = 1.49
				xmax = 1.67
				text = "s_i4"
			intervals [12]:
				xmin = 1.67
				xmax = 1.78
				text = "eng"
			intervals [13]:
				xmin = 1.78
				xmax = 1.91
				text = ""
			intervals [14]:
				xmin = 1.91
				xmax = 1.96
				text = "er4"
			intervals [15]:
				xmin = 1.96
				xmax = 2.06
				text = "eng"
			intervals [16]:
				xmin = 2.06
				xmax = 2.19
				text = ""
			intervals [17]:
				xmin = 2.19
				xmax = 2.35
				text = "i1"
			intervals [18]:
				xmin = 2.35
				xmax = 2.53
				text = "eng"
			intervals [19]:
				xmin = 2.53
				xmax = 3.03
				text = "i1"
			intervals [20]:
				xmin = 3.03
				xmax = 3.42
				text = "eng"
			intervals [21]:
				xmin = 3.42
				xmax = 3.48
				text = "i1"
			intervals [22]:
				xmin = 3.48
				xmax = 3.6
				text = ""
			intervals [23]:
				xmin = 3.6
				xmax = 3.64
				text = "eng"
			intervals [24]:
				xmin = 3.64
				xmax = 3.86
				text = "i1"
			intervals [25]:
				xmin = 3.86
				xmax = 3.99
				text = "eng"
			intervals [26]:
				xmin = 3.99
				xmax = 4.59
				text = ""
			intervals [27]:
				xmin = 4.59
				xmax = 4.869999999999999
				text = "er4"
			intervals [28]:
				xmin = 4.869999999999999
				xmax = 4.9799999999999995
				text = "eng"
			intervals [29]:
				xmin = 4.9799999999999995
				xmax = 5.1899999999999995
				text = "s_i4"
			intervals [30]:
				xmin = 5.1899999999999995
				xmax = 5.34
				text = ""
			intervals [31]:
				xmin = 5.34
				xmax = 5.43
				text = "eng"
			intervals [32]:
				xmin = 5.43
				xmax = 5.6
				text = ""
			intervals [33]:
				xmin = 5.6
				xmax = 5.76
				text = "i1"
			intervals [34]:
				xmin = 5.76
				xmax = 6.279999999999999
				text = "eng"
			intervals [35]:
				xmin = 6.279999999999999
				xmax = 6.359999999999999
				text = "s_an1"
			intervals [36]:
				xmin = 6.359999999999999
				xmax = 6.47
				text = ""
			intervals [37]:
				xmin = 6.47
				xmax = 6.6
				text = "eng"
			intervals [38]:
				xmin = 6.6
				xmax = 6.9399999999999995
				text = "i1"
			intervals [39]:
				xmin = 6.9399999999999995
				xmax = 7.039999999999999
				text = "eng"
			intervals [40]:
				xmin = 7.039999999999999
				xmax = 7.289999999999999
				text = "s_an1"
			intervals [41]:
				xmin = 7.289999999999999
				xmax = 7.369999999999999
				text = "eng"
			intervals [42]:
				xmin = 7.369999999999999
				xmax = 7.6
				text = "s_i4"
			intervals [43]:
				xmin = 7.6
				xmax = 7.699999999999999
				text = "eng"
			intervals [44]:
				xmin = 7.699999999999999
				xmax = 7.869999999999999
				text = ""
			intervals [45]:
				xmin = 7.869999999999999
				xmax = 8.049999999999999
				text = "er4"
			intervals [46]:
				xmin = 8.049999999999999
				xmax = 8.26
				text = ""
			intervals [47]:
				xmin = 8.26
				xmax = 8.299999999999999
				text = "eng"
			intervals [48]:
				xmin = 8.299999999999999
				xmax = 8.36
				text = "s_i4"
			intervals [49]:
				xmin = 8.36
				xmax = 8.389999999999999
				text = ""
			intervals [50]:
				xmin = 8.389999999999999
				xmax = 8.42
				text = "eng"
			intervals [51]:
				xmin = 8.42
				xmax = 8.45
				text = ""
			intervals [52]:
				xmin = 8.45
				xmax = 8.59
				text = "s_an1"
			intervals [53]:
				xmin = 8.59
				xmax = 8.83
				text = ""
			intervals [54]:
				xmin = 8.83
				xmax = 9.1
				text = "eng"
			intervals [55]:
				xmin = 9.1
				xmax = 9.44
				text = "i1"
			intervals [56]:
				xmin = 9.44
				xmax = 9.4444
				text = ""
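A mismatch like the one above can be detected automatically by comparing the words in the TextGrid with the words that were sent to the aligner. The sketch below uses a regular expression on the TextGrid text; the helper names are hypothetical and the sample is a shortened excerpt of the listing above.

```python
import re
from typing import Iterable, List


def interval_texts(textgrid: str) -> List[str]:
    """Return the non-empty interval labels found in a TextGrid string."""
    return [t for t in re.findall(r'text\s*=\s*"([^"]*)"', textgrid) if t]


def unexpected_words(textgrid: str, expected_words: Iterable[str]) -> List[str]:
    """List aligned labels that never occur in the expected input words."""
    expected = set(expected_words)
    return sorted({w for w in interval_texts(textgrid) if w not in expected})


sample = '''
    intervals [1]:
        text = ""
    intervals [2]:
        text = "eng"
    intervals [3]:
        text = "s_an1"
    intervals [5]:
        text = "i1"
'''
expected = ("g_uang3 zh_ou1 n_v3 d_a4 x_ve2 sh_eng1 d_eng1 sh_an1 sh_i1 l_ian2 "
            "s_i4 t_ian1 j_ing3 f_ang1 zh_ao3 d_ao4 i2 s_i4 n_v3 sh_i1").split()
print(unexpected_words(sample, expected))
```

A non-empty result suggests the TextGrid does not correspond to the given input, for example because it was produced for a different utterance.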

opened by leon2milan 2

Problem with DDP

Hello, I have been experimenting with your excellent work in this repo, but I found that DDP does not seem to take effect. Am I launching it incorrectly?

CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node 3 tasks/run.py --config //fs.yaml --exp_name fs_test_demo --reset
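One way to check whether several worker processes were actually started is to read the environment variables that torch.distributed.launch exports to each worker, namely RANK and WORLD_SIZE. The sketch below is only a diagnostic with a hypothetical helper name; whether this repository's trainer reads these variables is not confirmed here.

```python
import os
from typing import Mapping, Tuple


def ddp_rank_info(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Return (rank, world_size), defaulting to a single process."""
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    return rank, world_size


rank, world_size = ddp_rank_info()
print(f"rank {rank} of {world_size}")
```

If every process prints a world size of 1, the launcher did not spawn distributed workers.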

opened by zhazl 0

Releases(v1.0.0)

v1.0.0(May 21, 2022)

We release the pretrained models of SyntaSpeech on LJSpeech, Biaobei, and LibriTTS. For the pretrained vocoders and datasets, please refer to the links provided in README.md.
Source code(tar.gz)
Source code(zip)
biaobei_synta.zip(295.58 MB)
libritts_synta.zip(310.03 MB)
lj_synta.zip(304.98 MB)

Owner

Zhenhui YE

I am currently a second-year computer science Ph.D student at Zhejiang University, working on deep learning and reinforcement learning.
