🌎 One model to speak them all

Audio | Language | Text |
---|---|---|
▷ | Chinese | 人人生而自由,在尊严和权利上一律平等。 |
▷ | English | All human beings are born free and equal in dignity and rights. |
▷ | Japanese | すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについてびょうどうである。 |
▷ | Korean | 모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다. |
▷ | German | Alle Menschen sind frei und gleich an Würde und Rechten geboren. |
▷ | Russian | Все люди рождаются свободными и равными в своем достоинстве и правах. |
▷ | Spanish | Todos los seres humanos nacen libres e iguales en dignidad y derechos. |
▷ | Gujarati | પ્રતિષ્ઠા અને અધિકારોની દૃષ્ટિએ સર્વ માનવો જન્મથી સ્વતંત્ર અને સમાન હોય છે. |
...even when there are only 30 utterances for training | | |
▷ | Norwegian | Alle mennesker er født frie og med samme menneskeverd og menneskerettigheter. |
▷ | Romanian | Toate ființele umane se nasc libere și egale în demnitate și în drepturi. |
▷ | Greek | Όλοι οι άνθρωποι γεννιούνται ελεύθεροι και ίσοι στην αξιοπρέπεια και τα δικαιώματα. |
This is an implementation of the paper Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis, which handles 40+ languages in a single model and can learn a brand-new language from a few shots or minutes of recordings. The code is partially based on the open-source Tacotron2 and Transformer-TTS. More audio samples from the paper are available here.
Quickstart
We follow the paper's training recipe, but with open datasets instead. Using a combination of 15 speech datasets with 572 speakers in 38 languages, we can, to an extent, reach results similar to those demonstrated in the paper, as shown by the audio samples above. The datasets are listed below; their preprocessor scripts are located in `corpora/`. Download locations and details for each dataset are given in the respective preprocessor.
Name | Preprocessor script name | Languages |
---|---|---|
M-AILABS | caito | es-es, fr-fr, de-de, uk-ua, ru-ru, pl-pl, it-it, en-us, en-uk |
CSS-10 | css10 | es-es, fr-fr, ja-jp, de-de, fi-fi, hu-hu, el-gr, nl-nl, ru-ru, zh-cn |
SIWIS | siwis | fr-fr |
JSUT | jsut | ja-jp |
KSS | kss | ko-kr |
Databaker | databaker | zh-cn |
LJSpeech | ljspeech | en-us |
NST | nst | da-dk, nb-no |
TTS-Portuguese | portuguese | pt-br |
Thorsten Mueller | thorsten | de-de |
 | | bn-bd, bn-in, ca-es, eu-es, gl-es, gu-in, jv-id, km-kh, kn-in, ml-in, mr-in, my-mm, ne-np, si-lk, su-id, ta-in, te-in, yo-ng |
RuLS | lsru | ru-ru |
English Bible | enbible | en-us |
Hifi-TTS | hifitts | en-us, en-uk |
RSS | rss | ro-ro |
Preprocessing
- Please download and extract these datasets to the `dataset_path` specified in `corpora/__init__.py`. You can change `dataset_path`, `transformed_path`, and `packed_path` to your own locations.
- Run the preprocessor for each dataset given in `corpora`. The results are saved to `transformed_path`. `include_corpus` in `corpora/__init__.py` can be modified to add or remove the datasets to be used. In particular, you may refer to the preprocessors to include your own datasets in the training, and then add the dataset to `include_corpus` and `dataset_language` in `corpora/__init__.py` (see the sketch after this list).
- Run `corpora/process_corpus.py`, which filters the dataset, trims the audios, produces the metadata, generates the mel spectrograms, and packs all the features into a single zip file. The processed dataset is placed at `packed_path` and uses around 100 GB of space. See the script for details.
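As a rough sketch of the settings referenced above (the exact contents and structure of `corpora/__init__.py` may differ; the paths and the corpus selection below are placeholders):

```python
# Illustrative values only -- not the actual contents of corpora/__init__.py.
dataset_path = "/data/raw"              # where the downloaded corpora are extracted
transformed_path = "/data/transformed"  # per-corpus preprocessor outputs
packed_path = "/data/packed"            # final packed features (around 100 GB)

# Corpora to use for training, named after the preprocessor scripts above.
include_corpus = ["ljspeech", "kss", "databaker"]

# Assumed mapping from corpus name to language code; the real structure may differ.
dataset_language = {"ljspeech": "en-us", "kss": "ko-kr", "databaker": "zh-cn"}
```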
Training
Following the paper, we split the dataset into three tiers. Below are the commands to train and evaluate on each tier. Please substitute the directories with your own. The evaluation script can be run simultaneously with the training script, and you may also use it to synthesize samples from pretrained models. Please refer to the help texts of the arguments for their meanings.
In addition, to report CER you need to create `azure_key.json` with your own Azure Speech-to-Text subscription, with the content `{"subscription": "YOUR_KEY", "region": "YOUR_REGION"}`; see `utils/transcribe.py`. Due to significant differences in the datasets used, this implementation is for demonstration only and cannot fully reproduce the results in the paper.
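If it helps, the key file with the placeholder values above can be written like this (a minimal sketch; substitute your own subscription key and region):

```python
import json

# Create azure_key.json as expected by utils/transcribe.py (placeholder values).
with open("azure_key.json", "w", encoding="utf-8") as f:
    json.dump({"subscription": "YOUR_KEY", "region": "YOUR_REGION"}, f)
```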
T1
python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:ja-jp:es-es --warmup_languages=en-us --ddp=True --eval_steps=40000:100000
python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=100000 --eval_languages=en-us:de-de:ja-jp
T2
python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:fr-fr:ru-ru:en-uk:es-es:uk-ua:pl-pl:it-it:ja-jp:zh-cn --ddp=True --hparams="warmup_steps=350000" --restore_from=T1_MODEL_DIR/model.ckpt-350000 --eval_steps=400000:450000 --eval_languages=zh-cn:ru-ru:it-it
python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=400000 --eval_languages=zh-cn:ru-ru:it-it
T3
python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:fr-fr:ru-ru:en-uk:es-es:uk-ua:pl-pl:it-it:ja-jp:zh-cn:nl-nl:fi-fi:ko-kr:eu-es:pt-br:hu-hu:jv-id:gl-es:gu-in:kn-in:da-dk:su-id:ta-in:ca-es:ml-in:te-in:my-mm:yo-ng:km-kh:mr-in:ne-np:bn-bd:bn-in:si-lk --ddp=True --hparams="warmup_steps=650000,batch_frame_quad_limit=6500000" --restore_from=T2_MODEL_DIR/model.ckpt-650000 --eval_steps=700000:750000 --eval_languages=ko-kr:da-dk:te-in
python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=700000 --eval_languages=ko-kr:da-dk:te-in
Few-shot adaptation
Norwegian Bokmål (nb-no), Greek (el-gr), and Romanian (ro-ro) are excluded from the training dataset and can be used for few-shot/low-resource adaptation. The command below gives an example of adaptation to el-gr with 100 samples; you may substitute `--adapt_languages` and `--downsample_languages` with your own.
python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:fr-fr:ru-ru:en-uk:es-es:uk-ua:pl-pl:it-it:ja-jp:zh-cn:nl-nl:fi-fi:ko-kr:eu-es:pt-br:hu-hu:jv-id:gl-es:gu-in:kn-in:da-dk:su-id:ta-in:ca-es:ml-in:te-in:my-mm:yo-ng:km-kh:mr-in:ne-np:bn-bd:bn-in:si-lk --adapt_languages=el-gr --downsample_languages=el-gr:100 --ddp=True --hparams="warmup_steps=800000" --restore_from=T3_MODEL_DIR/model.ckpt-700000
python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=700000 --eval_languages=el-gr
Performance
Listed below are the best CERs for selected languages reached by the models from each tier on these open datasets, as well as the CERs for few-shot adaptation. The CERs are measured with Azure Speech-to-Text.
T1 | en-us | de-de | ja-jp |
---|---|---|---|
CER | 2.68% | 2.17% | 19.06% |

T2 | it-it | ru-ru | zh-cn |
---|---|---|---|
CER | 1.95% | 3.21% | 7.30% |

T3 | da-dk | ko-kr | te-in |
---|---|---|---|
CER | 1.31% | 0.94% | 4.41% |
Adaptation
#Samples | nb-no | el-gr | ro-ro |
---|---|---|---|
30 | 9.18% | 5.71% | 5.58% |
100 | 3.63% | 4.63% | 4.89% |
Pretrained Models
The pretrained models are available at OneDrive Link. Metadata for evaluation are also given to aid fast reproduction. The models provided are listed below.
Base models
- T1 350k steps, ready for T2
- T2 650k steps, ready for T3
- T3 700k steps, ready for adaptation
- T3 1.16M steps, which reaches satisfactory performance on most languages
Few-shot adaptation
- nb-no, 30 samples, at 710k steps
- nb-no, 100 samples, at 750k steps
- el-gr, 30 samples, at 1M steps
- el-gr, 100 samples, at 820k steps
- ro-ro, 30 samples, at 970k steps
- ro-ro, 100 samples, at 910k steps
Synthesis
To synthesize audio from the pretrained models, download the models along with the metadata files (`lang_id.json` and `spk_id.json`). Since there are no ground-truth mels, you need to create metadata with dummy mel target information, and run `eval.py` with neither `--zipfilepath` specified nor `mels.zip` present in `--data-dir`. Each line of the metadata file takes the form `SPEAKERNAME_FILEID|DUMMY_LENGTH|TEXT|LANG`. For example, you can generate the audio examples above by saving the following metadata to `script.txt`:
databaker_0|500|人人生而自由,在尊严和权利上一律平等。|zh-cn
ljspeech_0|500|All human beings are born free and equal in dignity and rights.|en-us
jsut_0|500|すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについてびょうどうである。|ja-jp
kss_0|500|모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.|ko-kr
thorsten_0|500|Alle Menschen sind frei und gleich an Würde und Rechten geboren.|de-de
hajdurova_0|500|Все люди рождаются свободными и равными в своем достоинстве и правах.|ru-ru
tux_0|500|Todos los seres humanos nacen libres e iguales en dignidad y derechos.|es-es
guf02858_0|500|પ્રતિષ્ઠા અને અધિકારોની દૃષ્ટિએ સર્વ માનવો જન્મથી સ્વતંત્ર અને સમાન હોય છે.|gu-in
Then run the command `python eval.py --model-dir=T3_MODEL_DIR --log-dir=OUTPUT_DIR --data-dir=METADATA_DIR --eval_meta=script.txt --eval_step=1160000 --no_wait=True`. You may refer to `lang_id.json` and `spk_id.json` to synthesize audio in other languages or with other speakers.
The waveforms are produced by Griffin-Lim, and the mel spectrograms are also saved to `SPEAKERNAME_FILEID.npy`, normalized to the range [-4, 4]. Pretrained vocoders such as WaveNet can be used to reach better quality. Vocoders trained with Tacotron2-style recipes should be applicable to these mels, although you need to map the mels to the range [0, 1], e.g. by `mels = (mels + 4) / 8`.
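As a small sketch of that conversion (the file name follows the example metadata above; the vocoder call itself is not shown):

```python
import numpy as np

# Load a mel spectrogram saved by eval.py; values are normalized to [-4, 4].
mels = np.load("databaker_0.npy")

# Rescale to [0, 1] before passing it to a vocoder trained on a Tacotron2-style recipe.
mels = (mels + 4) / 8
```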