🌎 One model to speak them all

Audio | Language | Text |
---|---|---|
▷ | Chinese | 人人生而自由,在尊严和权利上一律平等。 |
▷ | English | All human beings are born free and equal in dignity and rights. |
▷ | Japanese | すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについてびょうどうである。 |
▷ | Korean | 모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다. |
▷ | German | Alle Menschen sind frei und gleich an Würde und Rechten geboren. |
▷ | Russian | Все люди рождаются свободными и равными в своем достоинстве и правах. |
▷ | Spanish | Todos los seres humanos nacen libres e iguales en dignidad y derechos. |
▷ | Gujarati | પ્રતિષ્ઠા અને અધિકારોની દૃષ્ટિએ સર્વ માનવો જન્મથી સ્વતંત્ર અને સમાન હોય છે. |
...even when there are only 30 utterances for training | | |
▷ | Norwegian | Alle mennesker er født frie og med samme menneskeverd og menneskerettigheter. |
▷ | Romanian | Toate ființele umane se nasc libere și egale în demnitate și în drepturi. |
▷ | Greek | Όλοι οι άνθρωποι γεννιούνται ελεύθεροι και ίσοι στην αξιοπρέπεια και τα δικαιώματα. |
This is an implementation of the paper Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis, which handles 40+ languages in a single model and can learn a brand-new language from a few shots or minutes of recordings. The code is partially based on the open-source Tacotron2 and Transformer-TTS. More audio samples from the paper are available here.
Quickstart
We follow the paper's training recipe, but with open datasets instead. Using a combination of 15 speech datasets with 572 speakers in 38 languages, we can, to an extent, reach results similar to those demonstrated in the paper, as shown by the audio samples above. The datasets are listed below; their preprocessor scripts are located in `corpora/`. Download locations and details for each dataset are given in the respective preprocessor.
Name | Preprocessor script name | Languages |
---|---|---|
M-AILABS | caito | es-es, fr-fr, de-de, uk-ua, ru-ru, pl-pl, it-it, en-us, en-uk |
CSS-10 | css10 | es-es, fr-fr, ja-jp, de-de, fi-fi, hu-hu, el-gr, nl-nl, ru-ru, zh-cn |
SIWIS | siwis | fr-fr |
JSUT | jsut | ja-jp |
KSS | kss | ko-kr |
Databaker | databaker | zh-cn |
LJSpeech | ljspeech | en-us |
NST | nst | da-dk, nb-no |
TTS-Portuguese | portuguese | pt-br |
Thorsten Mueller | thorsten | de-de |
 | | bn-bd, bn-in, ca-es, eu-es, gl-es, gu-in, jv-id, km-kh, kn-in, ml-in, mr-in, my-mm, ne-np, si-lk, su-id, ta-in, te-in, yo-ng |
RuLS | lsru | ru-ru |
English Bible | enbible | en-us |
Hifi-TTS | hifitts | en-us, en-uk |
RSS | rss | ro-ro |
Preprocessing
- Please download and extract these datasets to the `dataset_path` specified in `corpora/__init__.py`. You can change `dataset_path`, `transformed_path`, and `packed_path` to your own locations.
- Run the preprocessor for each dataset given in `corpora`. The results are saved to `transformed_path`. `include_corpus` in `corpora/__init__.py` can be modified to add or remove the datasets to be used. In particular, you may refer to the preprocessors to include your own datasets in the training, and then add the dataset to `include_corpus` and `dataset_language` in `corpora/__init__.py` (see the sketch after this list).
- Run `corpora/process_corpus.py`, which filters the dataset, trims the audios, produces the metadata, generates the mel spectrograms, and packs all the features into a single zip file. The processed dataset is placed at `packed_path` and uses around 100 GB of space. See the script for details.
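As a rough sketch of the settings referenced above (the exact contents and structure of `corpora/__init__.py` may differ; the paths and the corpus selection below are placeholders):

```python
# Illustrative values only -- not the actual contents of corpora/__init__.py.
dataset_path = "/data/raw"              # where the downloaded corpora are extracted
transformed_path = "/data/transformed"  # per-corpus preprocessor outputs
packed_path = "/data/packed"            # final packed features (around 100 GB)

# Corpora to use for training, named after the preprocessor scripts above.
include_corpus = ["ljspeech", "kss", "databaker"]

# Assumed mapping from corpus name to language code; the real structure may differ.
dataset_language = {"ljspeech": "en-us", "kss": "ko-kr", "databaker": "zh-cn"}
```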
Training
Following the paper, we split the dataset into three tiers. Below are the commands to train and evaluate on each tier. Please substitute the directories with your own. The evaluation script can be run simultaneously with the training script, and you may also use it to synthesize samples from pretrained models. Please refer to the help texts of the arguments for their meanings.
In addition, to report CER you need to create `azure_key.json` with your own Azure Speech-to-Text subscription, with the content `{"subscription": "YOUR_KEY", "region": "YOUR_REGION"}`; see `utils/transcribe.py`. Due to significant differences in the datasets used, this implementation is for demonstration only and cannot fully reproduce the results in the paper.
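If it helps, the key file with the placeholder values above can be written like this (a minimal sketch; substitute your own subscription key and region):

```python
import json

# Create azure_key.json as expected by utils/transcribe.py (placeholder values).
with open("azure_key.json", "w", encoding="utf-8") as f:
    json.dump({"subscription": "YOUR_KEY", "region": "YOUR_REGION"}, f)
```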
T1
python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:ja-jp:es-es --warmup_languages=en-us --ddp=True --eval_steps=40000:100000
python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=100000 --eval_languages=en-us:de-de:ja-jp
T2
python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:fr-fr:ru-ru:en-uk:es-es:uk-ua:pl-pl:it-it:ja-jp:zh-cn --ddp=True --hparams="warmup_steps=350000" --restore_from=T1_MODEL_DIR/model.ckpt-350000 --eval_steps=400000:450000 --eval_languages=zh-cn:ru-ru:it-it
python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=400000 --eval_languages=zh-cn:ru-ru:it-it
T3
python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:fr-fr:ru-ru:en-uk:es-es:uk-ua:pl-pl:it-it:ja-jp:zh-cn:nl-nl:fi-fi:ko-kr:eu-es:pt-br:hu-hu:jv-id:gl-es:gu-in:kn-in:da-dk:su-id:ta-in:ca-es:ml-in:te-in:my-mm:yo-ng:km-kh:mr-in:ne-np:bn-bd:bn-in:si-lk --ddp=True --hparams="warmup_steps=650000,batch_frame_quad_limit=6500000" --restore_from=T2_MODEL_DIR/model.ckpt-650000 --eval_steps=700000:750000 --eval_languages=ko-kr:da-dk:te-in
python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=700000 --eval_languages=ko-kr:da-dk:te-in
Few-shot adaptation
Norwegian Bokmål (nb-no), Greek (el-gr), and Romanian (ro-ro) are excluded from the training dataset and can be used for few-shot/low-resource adaptation. The command below gives an example of adaptation to el-gr with 100 samples; you may substitute `--adapt_languages` and `--downsample_languages` with your own.
python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:fr-fr:ru-ru:en-uk:es-es:uk-ua:pl-pl:it-it:ja-jp:zh-cn:nl-nl:fi-fi:ko-kr:eu-es:pt-br:hu-hu:jv-id:gl-es:gu-in:kn-in:da-dk:su-id:ta-in:ca-es:ml-in:te-in:my-mm:yo-ng:km-kh:mr-in:ne-np:bn-bd:bn-in:si-lk --adapt_languages=el-gr --downsample_languages=el-gr:100 --ddp=True --hparams="warmup_steps=800000" --restore_from=T3_MODEL_DIR/model.ckpt-700000
python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=700000 --eval_languages=el-gr
Performance
Listed below are the best CERs for selected languages reached by the models from each tier on these open datasets, as well as the CERs for few-shot adaptation. The CERs are measured with Azure Speech-to-Text.
T1 | en-us | de-de | ja-jp |
---|---|---|---|
CER | 2.68% | 2.17% | 19.06% |

T2 | it-it | ru-ru | zh-cn |
---|---|---|---|
CER | 1.95% | 3.21% | 7.30% |

T3 | da-dk | ko-kr | te-in |
---|---|---|---|
CER | 1.31% | 0.94% | 4.41% |
Adaptation
#Samples | nb-no | el-gr | ro-ro |
---|---|---|---|
30 | 9.18% | 5.71% | 5.58% |
100 | 3.63% | 4.63% | 4.89% |
Pretrained Models
The pretrained models are available at OneDrive Link. Metadata for evaluation are also given to aid fast reproduction. The models provided are listed below.
Base models
- T1 350k steps, ready for T2
- T2 650k steps, ready for T3
- T3 700k steps, ready for adaptation
- T3 1.16M steps, which reaches satisfactory performance on most languages
Few-shot adaptation
- nb-no, 30 samples, at 710k steps
- nb-no, 100 samples, at 750k steps
- el-gr, 30 samples, at 1M steps
- el-gr, 100 samples, at 820k steps
- ro-ro, 30 samples, at 970k steps
- ro-ro, 100 samples, at 910k steps
Synthesis
To synthesize audio from the pretrained models, download the models along with the metadata files (`lang_id.json` and `spk_id.json`). Since there are no ground-truth mels, you need to create metadata with dummy mel target information, and run `eval.py` with neither `--zipfilepath` specified nor `mels.zip` present in `--data-dir`. Each line of the metadata file takes the form `SPEAKERNAME_FILEID|DUMMY_LENGTH|TEXT|LANG`. For example, you can generate the audio examples above by saving the following metadata to `script.txt`:
databaker_0|500|人人生而自由,在尊严和权利上一律平等。|zh-cn
ljspeech_0|500|All human beings are born free and equal in dignity and rights.|en-us
jsut_0|500|すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについてびょうどうである。|ja-jp
kss_0|500|모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.|ko-kr
thorsten_0|500|Alle Menschen sind frei und gleich an Würde und Rechten geboren.|de-de
hajdurova_0|500|Все люди рождаются свободными и равными в своем достоинстве и правах.|ru-ru
tux_0|500|Todos los seres humanos nacen libres e iguales en dignidad y derechos.|es-es
guf02858_0|500|પ્રતિષ્ઠા અને અધિકારોની દૃષ્ટિએ સર્વ માનવો જન્મથી સ્વતંત્ર અને સમાન હોય છે.|gu-in
Then run the command `python eval.py --model-dir=T3_MODEL_DIR --log-dir=OUTPUT_DIR --data-dir=METADATA_DIR --eval_meta=script.txt --eval_step=1160000 --no_wait=True`. You may refer to `lang_id.json` and `spk_id.json` to synthesize audio in other languages or with other speakers.
The waveforms are produced by Griffin-Lim, and the mel spectrograms are also saved to `SPEAKERNAME_FILEID.npy`, normalized to the range [-4, 4]. Pretrained vocoders such as WaveNet can be used to reach better quality. Vocoders trained with Tacotron2-style recipes should be applicable to these mels, although you need to map the mels to the range [0, 1], e.g. by `mels = (mels + 4) / 8`.
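As a small sketch of that conversion (the file name follows the example metadata above; the vocoder call itself is not shown):

```python
import numpy as np

# Load a mel spectrogram saved by eval.py; values are normalized to [-4, 4].
mels = np.load("databaker_0.npy")

# Rescale to [0, 1] before passing it to a vocoder trained on a Tacotron2-style recipe.
mels = (mels + 4) / 8
```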