A fast Text-to-Speech (TTS) model. Works well for English, Mandarin Chinese, Japanese, Korean, Russian and Tibetan (tested so far).

Overview

Parallel Speech Synthesis

What's New

Directory Structure

.
|--- config/      # configuration files
     |--- default.yaml
     |--- ...
|--- datasets/    # data processing
|--- encoder/     # speaker (voiceprint) encoder
     |--- voice_encoder.py
     |--- ...
|--- helpers/     # helper classes
     |--- trainer.py
     |--- synthesizer.py
     |--- ...
|--- logdir/      # training output directory
|--- losses/      # loss functions
|--- models/      # synthesis models
     |--- layers.py
     |--- duration.py
     |--- parallel.py
|--- pretrained/  # pretrained models (LJSpeech dataset)
|--- samples/     # synthesized samples
|--- utils/       # general utilities
|--- vocoder/     # vocoder
     |--- melgan.py
     |--- ...
|--- wandb/       # Wandb output directory
|--- extract-duration.py
|--- extract-embedding.py
|--- LICENSE
|--- prepare-dataset.py  # data preparation script
|--- README.md
|--- README_en.md
|--- requirements.txt    # dependencies
|--- synthesize.py       # synthesis script
|--- train-duration.py   # training scripts
|--- train-parallel.py

Samples

Some synthesized samples are available here.

Pretrained Models

Some pretrained models are available here.

Quick Start

Step (1): Clone the repository

$ git clone https://github.com/atomicoo/ParallelTTS.git

Step (2): Install dependencies

$ conda create -n ParallelTTS python=3.7.9
$ conda activate ParallelTTS
$ pip install -r requirements.txt

Step (3): Synthesize speech

$ python synthesize.py \
  --checkpoint ./pretrained/ljspeech-parallel-epoch0100.pth \
  --melgan_checkpoint ./pretrained/ljspeech-melgan-epoch3200.pth \
  --input_texts ./samples/english/synthesize.txt \
  --outputs_dir ./outputs/

To synthesize speech in another language, specify the corresponding configuration file with --config.
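When synthesis is driven from a script, the command can be assembled programmatically. The sketch below is only an illustration: it uses the flags shown in the quick start, while the function name build_synthesize_command and every path in the usage example are hypothetical placeholders, not files shipped with the repository.

```python
import shlex
from typing import List, Optional


def build_synthesize_command(
    checkpoint: str,
    melgan_checkpoint: str,
    input_texts: str,
    outputs_dir: str,
    config: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for synthesize.py from the quick-start flags."""
    cmd = [
        "python", "synthesize.py",
        "--checkpoint", checkpoint,
        "--melgan_checkpoint", melgan_checkpoint,
        "--input_texts", input_texts,
        "--outputs_dir", outputs_dir,
    ]
    if config is not None:
        # A config file is only passed when a non-default language is wanted.
        cmd += ["--config", config]
    return cmd


# Hypothetical paths: replace them with files that exist in your checkout.
cmd = build_synthesize_command(
    checkpoint="./pretrained/my-parallel.pth",
    melgan_checkpoint="./pretrained/my-melgan.pth",
    input_texts="./samples/my-language/synthesize.txt",
    outputs_dir="./outputs/",
    config="./config/my-language.yaml",
)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root.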

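How to Train

Step (1): Prepare the data

$ python prepare-dataset.py

Use --config to specify a configuration file; the default, default.yaml, targets the LJSpeech dataset.

Step (2): Train the alignment model

$ python train-duration.py

Step (3): Extract durations

$ python extract-duration.py

Use --ground_truth to specify whether to generate ground-truth spectrograms with the alignment model.

Step (4): Train the synthesis model

$ python train-parallel.py

Use --ground_truth to specify whether to train the model on ground-truth spectrograms.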
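The four training steps must run in order, since each stage consumes the output of the previous one. A minimal sketch of a driver script follows; the script names come from the steps above, while build_pipeline, run_pipeline and the dry_run switch are hypothetical helpers introduced for illustration. Per-stage flags are supplied by the caller, because the exact argument syntax of each script is not documented here.

```python
import subprocess
import sys
from typing import Dict, List, Optional

# The four stages, in the order given in the training steps.
PIPELINE = [
    "prepare-dataset.py",
    "train-duration.py",
    "extract-duration.py",
    "train-parallel.py",
]


def build_pipeline(extra_args: Optional[Dict[str, List[str]]] = None) -> List[List[str]]:
    """Return one command per stage; extra_args maps a script name to extra flags."""
    extra_args = extra_args or {}
    unknown = set(extra_args) - set(PIPELINE)
    if unknown:
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    return [[sys.executable, script, *extra_args.get(script, [])] for script in PIPELINE]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it; stop at the first failure."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit


# Example: pass the default config to the data preparation stage only.
commands = build_pipeline({"prepare-dataset.py": ["--config", "config/default.yaml"]})
run_pipeline(commands, dry_run=True)
```

With dry_run=True the script only prints the commands, which is a cheap way to check the order before committing to hours of training.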

Training Logs

If you use TensorBoardX, run the following command:

$ tensorboard --logdir logdir/[DIR]/

Wandb (Weights & Biases) is strongly recommended; just add the --enable_wandb option to the training commands above.

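Datasets

  • LJSpeech: English, female, 22050 Hz, about 24 hours
  • LibriSpeech: English, multi-speaker (only the train-clean-100 subset is used), 16000 Hz, about 1000 hours in total
  • JSUT: Japanese, female, 48000 Hz, about 10 hours
  • BiaoBei: Mandarin, female, 48000 Hz, about 12 hours
  • KSS: Korean, female, 44100 Hz, about 12 hours
  • RuLS: Russian, multi-speaker (only a single speaker's audio is used), 16000 Hz, about 98 hours in total
  • TWLSpeech (not public, lower quality): Tibetan, female (multiple speakers with similar timbre), 16000 Hz, about 23 hours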

Quality Evaluation

TODO: to be added

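Speed

Training speed: on the LJSpeech dataset with a batch size of 64, the model can be trained on a single GTX 1080 GPU with 8 GB of memory; after about 8 hours (about 300 epochs) it synthesizes fairly high-quality speech.

Synthesis speed: the following tests were run on an Intel Core i7-8550U CPU and an NVIDIA GeForce MX150 GPU; each synthesized clip is about 8 seconds long (about 20 words).

Batch size   Spec (GPU)   Audio (GPU)   Spec (CPU)   Audio (CPU)
1            0.042        0.218         0.100        2.004
2            0.046        0.453         0.209        3.922
4            0.053        0.863         0.407        7.897
8            0.062        2.386         0.878        14.599

Note that the tests were not repeated and averaged, so the results are for reference only.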
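One way to read these timings is as a real-time factor (RTF): processing time divided by the duration of the audio produced. The sketch below rests on assumptions the table does not state: that the values are in seconds, that each value covers the whole batch, and that every clip lasts exactly 8 seconds. Treat the output as a rough indication only.

```python
# Timings copied from the "Audio" columns of the synthesis speed table.
AUDIO_GPU = {1: 0.218, 2: 0.453, 4: 0.863, 8: 2.386}
AUDIO_CPU = {1: 2.004, 2: 3.922, 4: 7.897, 8: 14.599}

# Assumption: each synthesized clip is 8 seconds long (the text says "about 8 seconds").
CLIP_SECONDS = 8.0


def real_time_factor(batch_time: float, batch_size: int,
                     clip_seconds: float = CLIP_SECONDS) -> float:
    """Processing time divided by the duration of audio produced (lower is faster)."""
    return batch_time / (batch_size * clip_seconds)


for batch_size in sorted(AUDIO_GPU):
    gpu = real_time_factor(AUDIO_GPU[batch_size], batch_size)
    cpu = real_time_factor(AUDIO_CPU[batch_size], batch_size)
    print(f"batch {batch_size}: GPU RTF {gpu:.3f}, CPU RTF {cpu:.3f}")
```

An RTF below 1 means synthesis runs faster than playback under these assumptions.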

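Known Issues

  • In the wavegan branch, the vocoder code is taken from ParallelWaveGAN. Because the acoustic feature extraction is incompatible, the features must be converted; the conversion code is here.
  • The Mandarin model takes pinyin sequences as text input. Because BiaoBei's original pinyin sequences contain no punctuation and the alignment model was not fully trained, the rhythm of the synthesized speech is somewhat off.
  • No dedicated vocoder was trained for the Korean model; it uses the LJSpeech vocoder directly (both at 22050 Hz), which may slightly reduce the quality of the synthesized speech.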

References

TODO

  • Quality evaluation of synthesized speech (MOS)
  • Tests on more languages
  • Speech style transfer (timbre)

Contact

  • WeChat: Joee1995

  • QQ: 793071559
