Chinese Mandarin text-to-speech (TTS) based on FastSpeech 2, implemented in PyTorch, using WaveGlow as the vocoder.

Overview

Chinese Mandarin text-to-speech based on FastSpeech 2 and UNet

This is a modification and adaptation of FastSpeech 2 to Mandarin (普通话). It makes many changes to the original paper, including:

  1. Use UNet instead of the postnet (1D conv). UNet is good at recovering spectrogram details and is much easier to train than the original postnet.
  2. Added hanzi (汉字, Chinese character) embedding. Pinyin is harder for humans to read than Chinese characters, and this also makes the model more end-to-end.
  3. Removed the pitch and energy embeddings and the corresponding prediction networks. This makes the model much easier to train, especially on my GTX 1060 card. I will try bringing them back if I have time (and hardware resources).
  4. Use only WaveGlow for synthesis, as it is much better than MelGAN and Griffin-Lim.
  5. Subtracted the mel mean, which seems to make prediction much easier.
  6. Changed the loss weights to mel_postnet_loss x 1.0 + d_loss x 0.01 + mel_loss x 0.1.
  7. Used a linear duration scale instead of log, and subtracted duration_mean in training.
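
Items 6 and 7 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the repository's training code: the function names are hypothetical, and rounding the predicted durations to whole frames and clamping them at zero are my own choices. Because `total_loss` uses only arithmetic operators, it should work on plain floats as well as on tensors.

```python
def total_loss(mel_postnet_loss, d_loss, mel_loss):
    """Weighted sum from item 6 (hypothetical helper name)."""
    return 1.0 * mel_postnet_loss + 0.01 * d_loss + 0.1 * mel_loss


def normalize_durations(durations, duration_mean):
    """Item 7: keep durations on a linear scale and subtract the mean."""
    return [d - duration_mean for d in durations]


def denormalize_durations(predicted, duration_mean):
    """Undo the normalization at inference time.

    Assumption: durations are whole frame counts, so the result is
    rounded and clamped at zero.
    """
    return [max(0, round(p + duration_mean)) for p in predicted]


if __name__ == "__main__":
    targets = normalize_durations([3, 7, 5], duration_mean=5.0)
    print(targets)
    print(denormalize_durations(targets, duration_mean=5.0))
    print(total_loss(2.0, 100.0, 10.0))
```

With these weights the duration loss contributes far less than the postnet mel loss, which is consistent with durations being left on an unscaled linear axis where raw errors are numerically larger.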

Dependencies

All experiments were done under Ubuntu 16.04 + Python 3.7 + torch 1.7.1. Other environments probably work too.

  • torch for training and inference
  • librosa and ffmpeg for basic audio processing
  • pypinyin for converting hanzi to pinyin
  • jieba for word segmentation
  • perf_logger for writing training logs

First clone the project

git clone https://github.com/ranchlai/mandarin-tts.git

If too slow, try

git clone https://hub.fastgit.org/ranchlai/mandarin-tts.git

To install all dependencies, run


sudo apt-get install ffmpeg
pip3 install -r requirements.txt

Synthesize

python synthesize.py --input="您的电话余额不足,请及时充值"

Or put all the text in input.txt, then run

python synthesize.py --input="./input.txt"
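
For batch use, the two invocations above can be wrapped in a small helper. This is a sketch, not part of the repository: `write_input_file` and `synth_command` are hypothetical names, and the one-sentence-per-line layout of input.txt is an assumption, since the README does not specify the file format.

```python
from pathlib import Path


def write_input_file(sentences, path="input.txt"):
    """Write non-empty sentences to a UTF-8 text file, one per line.

    Assumption: synthesize.py reads one sentence per line.
    """
    lines = [s.strip() for s in sentences if s.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def synth_command(input_arg):
    """Build the argument list for the synthesize.py call shown above."""
    return ["python", "synthesize.py", f"--input={input_arg}"]


if __name__ == "__main__":
    write_input_file(["您的电话余额不足,请及时充值"])
    # Pass this list to subprocess.run(...) from the repository root.
    print(synth_command("./input.txt"))
```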

Checkpoints and WaveGlow should be downloaded automatically on the first run. You will then see some files in ./checkpoint and ./waveglow.

In case it fails, download the checkpoint manually here

Audio samples

Audio samples can be found on this page.

Model architecture

[Figure: model architecture]

Training

(under testing)

Currently I am using the Baker dataset (标贝), which can be downloaded from baker. The dataset is for non-commercial purposes only, and so is the pretrained model.

I have processed the data for this experiment. You can also try

python3 preprocess_pinyin.py 
python3 preprocess_hanzi.py 

to generate the required alignments, mels, and vocab for pinyin and hanzi for training. Everything should be ready under the directory './data/' (you can change the directory in hparams.py) before training.

python3 train.py

You can monitor the log in '/home/<user>/.perf_logger/'

Best practice: copy the ./data folder to /dev/shm to avoid hard-disk reads (if you have enough memory).
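
The copy to /dev/shm can be scripted with a free-space check first. This is a minimal sketch under assumptions: `stage_data` is a hypothetical helper, not something the repository provides, and the data directory in hparams.py still has to be pointed at the returned path by hand.

```python
import shutil
from pathlib import Path


def stage_data(src="./data", shm_root="/dev/shm"):
    """Copy the data folder into shared memory and return the new path."""
    src_path = Path(src)
    dst = Path(shm_root) / src_path.name

    # Refuse to copy if the folder would not fit into the target filesystem.
    needed = sum(f.stat().st_size for f in src_path.rglob("*") if f.is_file())
    free = shutil.disk_usage(shm_root).free
    if needed > free:
        raise RuntimeError(f"need {needed} bytes, only {free} free in {shm_root}")

    # Skip the copy when a staged folder is already present.
    if not dst.exists():
        shutil.copytree(src_path, dst)
    return dst


if __name__ == "__main__":
    print(stage_data())
```

Note that /dev/shm is cleared on reboot, so the copy has to be repeated after a restart.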

The following are some spectrograms synthesized at step 300000

[Figures: three synthesized spectrograms]

TODO

  • Clean the training code
  • Add gan for better spectrogram prediction
  • Add Aishell3 support

Comments
  • A beginner needs the data in aishell3_wav_folder and mel_folder

    Sorry if I am asking a repeated question, as I am just a beginner.

    I followed the README until wav2mel.py required two folders, which I believe hold training data from some other website. I would appreciate it if you could share a link where I can download the data, or email me the information.

    Thank you very much.

    opened by mcitew 5
  • ckpt = torch.load(ckpt_file) raises an error

    Line 200 of hubconf.py under waveglow/, ckpt = torch.load(ckpt_file), raises: RuntimeError: unexpected EOF, expected 459824 more bytes. The file might be corrupted. Is the ckpt file broken? I am on ubuntu 2020, cuda version: 11.3

    opened by wingdi 3
  • File not found running ../../mtts/train.py -c config.yaml -d cuda

    Here is my Colab file. I am stuck at

    %run ../../mtts/train.py -c config.yaml -d cuda

    It says: No such file or directory: '../../mel_folder/SSB10560130.npy'

    Please find my Colab file below.

    https://colab.research.google.com/drive/13gXl3NQSz97Fl__9wmCQJulEODD7Xe5-?usp=sharing

    opened by mcitew 2
  • Pinyin doesn't seem to align with hanzi when erhua is involved

    First, thanks for your great work. As the title says, for example:

    001464|sil wo3 dou1 hui4 shuo1 er2 hua4 yin1 le5 sp1 zher4 ne5 sp1 mingr2 jian4 sp1 hai2 bu2 cuo4 ba5 sil
    001464|sil 我 都 会 说 儿 化 音 了 sp1 这 儿 sp1 呢 明 sp1 儿 见 还 不 sil

    zip: [('sil', 'sil'), ('我', 'wo3'), ('都', 'dou1'), ('会', 'hui4'), ('说', 'shuo1'), ('儿', 'er2'), ('化', 'hua4'), ('音', 'yin1'), ('了', 'le5'), ('sp1', 'sp1'), ('这', 'zher4'), ('儿', 'ne5'), ('sp1', 'sp1'), ('呢', 'mingr2'), ('明', 'jian4'), ('sp1', 'sp1'), ('儿', 'hai2'), ('见', 'bu2'), ('还', 'cuo4'), ('不', 'ba5'), ('sil', 'sil')]

    'zher4' should correspond to '这儿', and some hanzi are missing at the tail. Do you think it matters?

    opened by lturing 1
  • The audio folder is missing

    Hi Author,

    During testing, I noticed that the audio folder is missing, and I took it from the original FastSpeech2 repo to make the demo work. Could you upload that package?

    Thanks, John

    opened by FengYen-Chang 1
  • A problem occurs when the input contains VIP. How can English be recognized?

    python ../../mtts/text/gp2py.py -t "请谢国俊到VIP八诊室就诊"

    2022-02-27 19:08:54,545 synthesize.py: INFO: processing text1|sil qing3 xie4 guo2 jun4 dao4 VIP5 ba1 zhen3 shi4 jiu4 zhen3 sil|sil 请 谢 国 俊 到 V I P 八 诊 室 就 诊 sil|0
    Traceback (most recent call last):
      File "../../mtts/synthesize.py", line 93, in <module>
        name, tokens = text_processor(line)
      File "/home/xgj/mandarin-tts/mtts/text/text_processor.py", line 41, in __call__
        return self._process(input)
      File "/home/xgj/mandarin-tts/mtts/text/text_processor.py", line 35, in _process
        tokens = tokenizer.tokenize(seg)
      File "/home/xgj/mandarin-tts/mtts/datasets/dataset.py", line 61, in tokenize
        tokens = [self.v2i[t] for t in text.split()]
      File "/home/xgj/mandarin-tts/mtts/datasets/dataset.py", line 61, in <listcomp>
        tokens = [self.v2i[t] for t in text.split()]
    KeyError: 'VIP5'

    opened by xgj1988 0
  • AttributeError: module 'distutils' has no attribute 'version': what does this mean?

    Traceback (most recent call last):
      File "../../mtts/train.py", line 10, in <module>
        from torch.utils.tensorboard import SummaryWriter
      File "/usr/local/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
        LooseVersion = distutils.version.LooseVersion
    AttributeError: module 'distutils' has no attribute 'version'

    opened by xgj1988 0
  • The example command line is wrong

    $ python ../../mtts/synthesize.py -d cuda --c config.yaml --checkpoint ./checkpoints/checkpoint_1240000.pth.tar -i input.txt
    usage: synthesize.py [-h] [-i INPUT] [--duration DURATION] [--output_dir OUTPUT_DIR] --checkpoint CHECKPOINT [-c CONFIG] [-d {cuda,cpu}]
    synthesize.py: error: ambiguous option: --c could match --checkpoint, --config

    opened by fa1c4 0
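  • Mel spectrogram generation fails for some AISHELL3 data

    In the AISHELL3 data, when some wav files are loaded with librosa into an amplitude vector, the amplitude exceeds 1. For example, the maximum amplitude of SSB08870032.wav is 1.0116, which makes wav2mel.py abort with an error.

    Specifically: in /mtts/utils/stft.py, lines 248 and 249, why is the wav amplitude vector restricted to [-1, 1]?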

    opened by SoloPro-Git 4
  • Cannot get the source code with git clone

    error: invalid path 'docs/novel2/hz_0.9_500000_在这儿做啥呢,不能啥玩意儿都带到这里来?.wav'
    fatal: unable to checkout working tree
    warning: Clone succeeded, but checkout failed.
    You can inspect what was checked out with 'git status'
    and retry with 'git restore --source=HEAD :/'

    opened by longglecc 2
  • unable to train biaobei

    mandarin-tts/mtts/models/layers.py", line 91, in forward
        output = output.permute(1, 2, 0, 3).contiguous().view(sz_b, len_q, -1)  # b x lq x (n*dv)
    RuntimeError: cannot reshape tensor of 0 elements into shape [0, 8, -1] because the unspecified dimension size -1 can be any value and is ambiguous
    
    opened by jinfagang 1
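  • A question about downloading the vocoder model

    Hello author, and thank you for sharing so generously. While reproducing your work I ran into some small problems. wav2mel.py uses some functions that need GAN models for the vocoder, and these models are not in the provided files. Should I download the checkpoint or the link given in the README? ImportError: cannot import name 'HiFiGAN' from 'mtts.models.vocoder.hifi_gan' (unknown location)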

    opened by DanMerry 0
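
The "ambiguous option: --c" error reported in the comments above comes from argparse's prefix matching of long options, and it can be reproduced with a small stand-in parser. This is a sketch and not the actual synthesize.py: the option set is reconstructed from the usage line quoted in that comment, and the long names `--input` and `--device` are assumptions (only `--checkpoint` and `--config` are confirmed by the error message).

```python
import argparse


def build_parser():
    """Stand-in parser mirroring the quoted usage line (partly assumed)."""
    parser = argparse.ArgumentParser(prog="synthesize.py")
    parser.add_argument("-i", "--input")          # long name assumed
    parser.add_argument("--checkpoint", required=True)
    parser.add_argument("-c", "--config")
    parser.add_argument("-d", "--device", choices=["cuda", "cpu"])  # long name assumed
    return parser


def parses_ok(argv):
    """Return True if argv parses, False if argparse exits with an error."""
    try:
        build_parser().parse_args(argv)
        return True
    except SystemExit:
        return False


if __name__ == "__main__":
    # "--c" is a prefix of both --checkpoint and --config, so it is rejected.
    print(parses_ok(["--c", "config.yaml", "--checkpoint", "ckpt.pth.tar"]))
    # The short flag "-c" (or the full "--config") is unambiguous.
    print(parses_ok(["-c", "config.yaml", "--checkpoint", "ckpt.pth.tar"]))
```

Replacing `--c` with `-c` or `--config` in the example command should therefore avoid the error.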