Mandarin Chinese text-to-speech based on FastSpeech 2 and UNet
This is a modification and adaptation of FastSpeech 2 for Mandarin (普通话). It makes many changes to the original paper, including:
- UNet is used instead of the postnet (a 1D conv stack). UNet is good at recovering spectrogram details and is much easier to train than the original postnet.
- Added a hanzi (汉字, Chinese character) embedding. Pinyin is harder for humans to read than Chinese characters, and hanzi input also makes the system more end-to-end.
- Removed the pitch and energy embeddings along with their prediction networks. This makes the model much easier to train, especially on my GTX 1060. I will try bringing them back if I have the time (and hardware resources).
- Only WaveGlow is used for synthesis, as it sounds much better than MelGAN and Griffin-Lim.
- Subtracted the mel mean from the targets, which seems to make prediction much easier.
- Changed the loss weighting to mel_postnet_loss x 1.0 + d_loss x 0.01 + mel_loss x 0.1.
- Used a linear duration scale instead of a log scale, and subtracted the duration mean during training.
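The loss weighting and mean-subtraction tricks above can be sketched as follows (a minimal illustration with made-up function names; the repository's actual training code may differ):

```python
def total_loss(mel_postnet_loss, d_loss, mel_loss):
    """Weighted sum of the three losses, using the weights listed above."""
    return 1.0 * mel_postnet_loss + 0.01 * d_loss + 0.1 * mel_loss


def normalize_targets(mel_frames, durations, mel_mean, duration_mean):
    """Zero-center the training targets: subtract the dataset-wide mel mean
    from every spectrogram bin, and the mean duration from each phoneme
    duration (kept on a linear, not log, scale)."""
    mel_frames = [[b - mel_mean for b in frame] for frame in mel_frames]
    durations = [d - duration_mean for d in durations]
    return mel_frames, durations
```

Predicting zero-centered residuals gives the network an easier target distribution than raw mel magnitudes or absolute durations.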
Dependencies
All experiments were done under Ubuntu 16.04 + Python 3.7 + torch 1.7.1. Other environments will probably work too.
- torch for training and inference
- librosa and ffmpeg for basic audio processing
- pypinyin for converting hanzi to pinyin
- jieba for word segmentation
- perf_logger for writing training logs
First, clone the project:
git clone https://github.com/ranchlai/mandarin-tts.git
If that is too slow, try the mirror:
git clone https://hub.fastgit.org/ranchlai/mandarin-tts.git
To install all dependencies, run
sudo apt-get install ffmpeg
pip3 install -r requirements.txt
Synthesize
python synthesize.py --input="您的电话余额不足,请及时充值"
Alternatively, put all the text in input.txt and run
python synthesize.py --input="./input.txt"
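When given a file, synthesize.py presumably treats each line as one utterance. Preparing and reading such a file is straightforward (a sketch; `read_inputs` is an illustrative helper, not part of the repository):

```python
def read_inputs(path):
    """Read one utterance per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Example: write two sentences, then read them back.
with open("input.txt", "w", encoding="utf-8") as f:
    f.write("您的电话余额不足,请及时充值\n\n欢迎使用\n")

print(read_inputs("input.txt"))
```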
Checkpoints and the WaveGlow model are downloaded automatically on the first run; you will see files appear under ./checkpoint and ./waveglow. If the download fails, download the checkpoints manually here.
Audio samples
Audio samples can be found on this page.
Model architecture
Training
(under testing)
Currently I am using the baker (标贝) dataset, which can be downloaded from baker. The dataset is for non-commercial use only, and so is the pretrained model.
I have processed the data for this experiment. You can also try
python3 preprocess_pinyin.py
python3 preprocess_hanzi.py
to generate the required alignments, mels, and vocabularies for pinyin and hanzi. Everything should be ready under the directory './data/' (you can change it in hparams.py) before training:
python3 train.py
You can monitor the training log in '/home/<user>/.perf_logger/'.
Best practice: copy the ./data folder to /dev/shm to avoid hard-disk reads (if you have enough memory).
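That staging step can be automated as below (`stage_data` is an illustrative helper, not part of the repository; adjust paths to your setup):

```python
import os
import shutil


def stage_data(src="./data", shm_dir="/dev/shm"):
    """Copy the dataset into shared memory (a RAM-backed tmpfs) so each
    epoch reads from memory instead of the hard disk. Falls back to the
    original path when shm_dir is unavailable; skips copying if the data
    is already staged."""
    if not os.path.isdir(shm_dir):
        return src
    dst = os.path.join(shm_dir, os.path.basename(os.path.abspath(src)))
    if not os.path.isdir(dst):
        shutil.copytree(src, dst)
    return dst
```

Point the data directory in hparams.py at the returned path before starting training.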
The following are some spectrograms synthesized at step 300000.
TODO
- Clean up the training code
- Add a GAN for better spectrogram prediction
- Add Aishell3 support
References
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, Y. Ren, et al.