A method to generate speech across multiple speakers

Overview

VoiceLoop

PyTorch implementation of the method described in the paper VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop.

VoiceLoop is a neural text-to-speech (TTS) system that can transform text to speech in voices that are sampled in the wild. Some demo samples can be found here.
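
At its core, the method maintains a fixed-size memory buffer (the "phonological loop") that is shifted at every output step while monotonic attention advances over the encoded phonemes. Below is a highly simplified conceptual sketch of one loop step; it is not the repo's model.py, and all names and sizes here are illustrative.

import numpy as np

# Conceptual sketch of one step of VoiceLoop's phonological loop.
# BUF: buffer slots, HID: hidden size, OUT: vocoder-feature size (all illustrative).
BUF, HID, OUT = 10, 256, 63

def loop_step(buffer, context, speaker, prev_out, U, V, W):
    # A new buffer entry is computed from the current attention context,
    # the speaker embedding, the previous output frame, and the buffer itself.
    u = np.tanh(U @ np.concatenate([buffer.ravel(), context, speaker, prev_out]))
    # The buffer shifts FIFO-style and the new entry is written at the top.
    buffer = np.vstack([u, buffer[:-1]])
    # The next vocoder-feature frame is read out from the whole buffer.
    out = W @ np.tanh(V @ buffer.ravel())
    return buffer, out

rng = np.random.default_rng(0)
U = rng.normal(0, 0.01, (HID, BUF * HID + 2 * HID + OUT))
V = rng.normal(0, 0.01, (HID, BUF * HID))
W = rng.normal(0, 0.01, (OUT, HID))
buf, frame = loop_step(np.zeros((BUF, HID)), np.zeros(HID), np.zeros(HID), np.zeros(OUT), U, V, W)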

Quick Start

Follow the instructions in Setup and then simply execute:

python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth

Results will be placed in models/vctk/results. It will generate two samples: the utterance resynthesized from the ground-truth vocoder features, and the version generated by the model.

You can also generate the same text but with a different speaker, specifically:

python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 18 --checkpoint models/vctk/bestmodel.pth

This will generate the same utterance in a different speaker's voice.

Here is the corresponding attention plot:

[Attention plots: x-axis is output time (acoustic samples); y-axis is input (text/phonemes). Left figure is speaker 10, right is speaker 14.]
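
An equivalent figure can be drawn from the attention weights the model returns. A minimal matplotlib sketch, assuming the weights were saved as an (output_steps, input_phonemes) NumPy array in a hypothetical attn.npy:

import numpy as np
import matplotlib.pyplot as plt

# Visualize a monotonic attention matrix (hypothetical dump of the model's attention).
attn = np.load('attn.npy')  # shape: (output_steps, input_phonemes)
plt.imshow(attn.T, aspect='auto', origin='lower')
plt.xlabel('output time (acoustic samples)')
plt.ylabel('input (text/phonemes)')
plt.savefig('attention.png')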

Finally, free text is also supported:

python generate.py  --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth
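
Under the hood, generate.py phonemizes free text before feeding it to the model. As a standalone illustration of that step (the exact backend and language used by the repo are assumptions here):

from phonemizer import phonemize

# Turn free text into a phoneme string; generate.py does this internally.
phones = phonemize('hello world', language='en-us', backend='espeak', strip=True)
print(phones)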

Setup

Requirements: Linux/OSX, Python 2.7, and PyTorch 0.1.12. Generation requires installing phonemizer; follow the setup instructions there. The current version of the code requires CUDA support for training; generation can be done on the CPU.

git clone https://github.com/facebookresearch/loop.git
cd loop
pip install -r scripts/requirements.txt

Data

The data used to train the models in the paper can be downloaded via:

bash scripts/download_data.sh

The script downloads and preprocesses a subset of VCTK. This subset contains speakers with an American accent.

The dataset was preprocessed using Merlin: from each audio clip we extracted vocoder features using the WORLD vocoder. After downloading, the dataset will be located under the data subfolder as follows:

loop
├── data
    └── vctk
        ├── norm_info
        │   ├── norm.dat
        ├── numpy_features
        │   ├── p294_001.npz
        │   ├── p294_002.npz
        │   └── ...
        └── numpy_features_valid
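
Each .npz file bundles the aligned text-side and audio-side features for one utterance. A quick way to inspect what a file contains (the exact key names, such as the phonemes and durations arrays mentioned in the comments below, depend on the preprocessing):

import numpy as np

# Print every array stored in one preprocessed utterance.
npz = np.load('data/vctk/numpy_features_valid/p318_212.npz')
for key in npz.files:
    print(key, npz[key].shape, npz[key].dtype)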

The preprocessing pipeline can be executed using the following script by Kyle Kastner: https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300.
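
For a rough sense of what the WORLD analysis stage produces, the standalone pyworld binding can extract per-frame F0, spectral envelope, and aperiodicity. Note this is only a conceptual illustration: the actual pipeline goes through Merlin/SPTK and stores mgc/lf0/vuv/bap streams instead.

import numpy as np
import pyworld as pw
import soundfile as sf

# Conceptual WORLD analysis of a single clip (not the repo's extraction script).
x, fs = sf.read('clip.wav')                    # mono waveform assumed
f0, sp, ap = pw.wav2world(x.astype(np.float64), fs)
print(f0.shape, sp.shape, ap.shape)            # per-frame F0, envelope, aperiodicity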

Pretrained Models

Pretrained models can be downloaded via:

bash scripts/download_models.sh

After downloading, the models will be located under the models subfolder as follows:

loop
├── data
├── models
    ├── blizzard
    ├── vctk
    │   ├── args.pth
    │   └── bestmodel.pth
    └── vctk_alt

Update 10/25/2017: Single speaker model available in models/blizzard/
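
To sanity-check a downloaded model before generating, the files can be opened with torch.load; what exactly they contain depends on how train.py saved them, so the snippet below only peeks at the structure:

import torch

# Peek inside the downloaded files; exact contents depend on how train.py saved them.
# Note: if whole model objects were pickled, this must run from inside the repo
# so that the model classes can be unpickled.
args = torch.load('models/vctk/args.pth', map_location=lambda storage, loc: storage)
print(args)
ckpt = torch.load('models/vctk/bestmodel.pth', map_location=lambda storage, loc: storage)
print(type(ckpt))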

SPTK and WORLD

Finally, speech generation requires SPTK 3.9 and the WORLD vocoder, as used in Merlin. To download the executables:

bash scripts/download_tools.sh

This results in the following subdirectories:

loop
├── data
├── models
├── tools
    ├── SPTK-3.9
    └── WORLD

Training

Single-Speaker

The single-speaker model is trained on Blizzard 2011. Data should be downloaded and prepared as described above. Once the data is ready, run:

python train.py --noise 1 --expName blizzard_init --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-5 --epochs 10

Then, continue training the model with:

python train.py --noise 1 --expName blizzard --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-4 --checkpoint checkpoints/blizzard_init/bestmodel.pth --epochs 90

Multi-Speaker

To train a new model on VCTK, first train using a noise level of 4 and an input sequence length of 100:

python train.py --expName vctk --data data/vctk --noise 4 --seq-len 100 --epochs 90

Then, continue training using a noise level of 2 on full sequences:

python train.py --expName vctk_noise_2 --data data/vctk --checkpoint checkpoints/vctk/bestmodel.pth --noise 2 --seq-len 1000 --epochs 90
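
The --seq-len flag controls truncated backpropagation through time: long utterances are cut into chunks of at most seq-len frames, with state typically carried across chunks. A minimal sketch of the splitting idea (a hypothetical helper, not the repo's TBPTTIter):

def split_for_tbptt(seq, seq_len):
    # Split a length-T sequence into consecutive chunks of at most seq_len frames.
    return [seq[i:i + seq_len] for i in range(0, len(seq), seq_len)]

# A 1000-frame utterance trained with --seq-len 100 becomes 10 chunks.
chunks = split_for_tbptt(list(range(1000)), 100)
print(len(chunks), len(chunks[0]))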

Citation

If you find this code useful in your research then please cite:

@article{taigman2017voice,
  title           = {VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop},
  author          = {Taigman, Yaniv and Wolf, Lior and Polyak, Adam and Nachmani, Eliya},
  journal         = {ArXiv e-prints},
  archivePrefix   = "arXiv",
  eprinttype      = {arxiv},
  eprint          = {1707.06588},
  primaryClass    = "cs.CL",
  year            = {2017},
  month           = {October},
}

License

Loop has a CC-BY-NC license.

Comments
  • New Dataset

    Hi, so everything worked perfectly with your preprocessed VCTK. Now I want to test with the Nancy dataset. I'm using the script you suggested, but I have two questions:

    1. When I run the script I get two files in the norm_info folder: label_norm_HTS_420.dat and norm_info_mgc_lf0_vuv_bap_63_MVN.dat. Based on the shape, the correct file is norm_info_mgc_lf0_vuv_bap_63_MVN.dat, but I want to be sure.

    2. In order to combine both datasets, should I run the script for each speaker and then somehow combine the norm files, or should I put all the data in one folder and process it together?

    Thanks.

    opened by jdbermeol 35
  • Issue for training on new Dataset.

    Hi,

    Thanks for sharing the project; I am doing some experiments with the tools. I have two questions.

    1. The npz file downloaded with download_data.sh differs from the one generated by extract_feats.py for the same wave/text file, e.g. p294_001. Why does this happen? Other arrays also differ. Downloaded: phonemes [28 22 19 41 21 3 22 31 34 11 22 5], durations [29 4 25 18 21 27 11 32 7 12 39 3]. Extracted: phonemes [28 22 19 40 21 3 22 31 33 11 22 5], durations [ 9 6 23 33 6 17 24 32 3 14 28 32].

    2. If I want to retrain the model on my data, I need to extract features to prepare the npz files. Do I need to put the training set and validation set together when running extract_feats.py to get norm.dat, or should I use only the training data to get norm.dat and then kick off training?

    Thank you for your guidance in advance. :)

    opened by hepower 10
  • Reproducing the results

    Hi, thanks for open sourcing the code!

    I am trying to reproduce your results. However, I am running into problems. I have been training:

    • sequence length: 100
    • epoch: 90
    • only American accent VCTK speaker samples
    • noise level 4

    So the problem is that only some speakers actually produce a speech signal based on the input; the majority of speakers produce only noise. Moreover, which speakers produce speech depends on the actual phoneme input. The problem seems to be that the attention does not work correctly for these samples: it basically stays at the beginning of the sequence and does not advance.

    Did you have a similar issue when training the model? Or might you have an idea what the problem could be?

    good attention with speech output: p226_009_11.pdf p225_005_4.pdf

    somewhat working: p226_009_2.pdf

    Most examples: p226_009_9.pdf p226_009_13.pdf p226_009_1.pdf

    Thanks!

    opened by pfriesch 9
  • ERROR: Failed to find norm file.

    python generate.py  --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth
    ERROR: Failed to find norm file.
    
    python2 generate.py  --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth
    Traceback (most recent call last):
      File "generate.py", line 10, in <module>
        import phonemizer
    ImportError: No module named phonemizer
    
    which python
    /Users/my_user/external_projects/text-to-speech/facebook-voiceloop/my_env/bin/python
    which python2
    /Users/my_user/external_projects/text-to-speech/facebook-voiceloop/my_env/bin/python2
    
    opened by mrgloom 7
  • Blizzard Model

    Can't use the Blizzard model without the original training data:

    Traceback (most recent call last):
      File "generate.py", line 156, in <module>
        main()
      File "generate.py", line 83, in main
        train_dataset = NpzFolder(train_args.data + '/numpy_features')
      File "/home/michael/Desktop/loop/data.py", line 84, in __init__
        self.NPZ_EXTENSION))
    RuntimeError: Found 0 npz in subfolders of: data/blizzard/numpy_features
    Supported image extensions are: npz
    

    It looks like generate.py uses parameters from the training data in order to generate.

    opened by PetrochukM 6
  • New Language

    First of all, thank you for releasing the code. I would like to know how difficult it would be to train on a speaker's data in a new language such as Turkish. As far as I saw, the generation step needs some kind of pronunciation dictionary. But what about the preprocessing steps: are Merlin and the other tools language-agnostic? Thank you in advance.

    opened by gaziway 4
  • Adding new datasets to train

    Hi,

    How can you add new datasets (voices) for training? I want to use this dataset: https://linksync-2032.kxcdn.com/wp-content/uploads/2017/06/female-voice-1.zip

    They are all .wav files, and I want to add them as a dataset so I can use that voice.

    opened by jaxlinksync 4
  • Getting errors in generate.py

    python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth

    Traceback (most recent call last):
      File "generate.py", line 153, in <module>
        main()
      File "generate.py", line 142, in main
        norm_path)
      File "/mnt/sdb1/Learning/pytorch/loop/utils.py", line 257, in generate_merlin_wav
        weight=os.path.join(gen_dir, 'weight')), shell=True)
      File "/mnt/sdb1/Learning/pytorch/loop/utils.py", line 121, in pe
        for line in execute(cmd, shell=shell):
      File "/mnt/sdb1/Learning/pytorch/loop/utils.py", line 114, in execute
        raise subprocess.CalledProcessError(return_code, cmd)
    subprocess.CalledProcessError: Command 'echo 1 1 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 | /mnt/sdb1/Learning/pytorch/loop/tools/SPTK-3.9/x2x +af > /mnt/sdb1/Learning/pytorch/loop/models/vctk/results/weight' returned non-zero exit status 127

    opened by amitabhpatil 4
  • Training Error. Epoch 5.

    This is the stack trace:

    Traceback (most recent call last):
      File "train.py", line 211, in <module>
        main()
      File "train.py", line 199, in main
        train(model, criterion, optimizer, epoch, train_losses)
      File "train.py", line 119, in train
        loss = criterion(output, target[0], target[1])
      File "/home/michael/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/michael/Desktop/loop/model.py", line 42, in forward
        mask_ = mask.expand_as(input)
      File "/home/michael/.local/lib/python2.7/site-packages/torch/autograd/variable.py", line 655, in expand_as
        return Expand(tensor.size())(self)
      File "/home/michael/.local/lib/python2.7/site-packages/torch/autograd/_functions/tensor.py", line 115, in forward
        result = i.expand(*self.sizes)
    RuntimeError: The expanded size of the tensor (21) must match the existing size (5) at non-singleton dimension 0. at /b/wheel/pytorch-src/torch/lib/TH/THStorage.c:99
    

    Any clue what is going on?

    opened by PetrochukM 3
  • Feature extraction

    I trained loop with a subset of the VCTK data (American speakers). I found that the audio for those speakers when I run generate.py with my trained model is pretty bad: I hear only a couple of words in a sentence and the rest is silence or noise.

    My guess is that something went wrong during feature extraction. When I compare the same feature-extracted file, i.e. p294_001.npz, from the given S3 bucket with the one I extracted by running extract_feats.py, I see that vuv_idx from S3 has larger numbers (range: -5 to 5) compared to mine (range: -10e-02 to 5).

    I also noticed that text_features and audio_features are of different shapes: (226, 420) for S3 vs. (540, 420) for me.

    Other features like durations and code2phone also look different.

    May I know what changes I have to make to extract_feats.py to get features similar to the ones in S3?

    opened by jayavanth 3
  • Error on generate.py execution

    I get an error upon executing:

    (gpu_13) abhinav@ubuntu11:~/.../loop$ python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth
    Traceback (most recent call last):
      File "generate.py", line 153, in <module>
        main()
      File "generate.py", line 132, in main
        out, attn = model([txt, spkr], feat)
      File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/software/LM_stash/abhinav/projects/tts/loop/model.py", line 247, in forward
        context, ident = self.encoder(src[0], src[1])
      File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/software/LM_stash/abhinav/projects/tts/loop/model.py", line 66, in forward
        outputs = self.lut_p(input)
      File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/modules/sparse.py", line 94, in forward
        self.scale_grad_by_freq, self.sparse
      File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/_functions/thnn/sparse.py", line 48, in forward
        cls._renorm(indices, weight, max_norm, norm_type)
    TypeError: _renorm() takes exactly 5 arguments (4 given)
    
    

    I have followed all the steps in the Setup section.

    opened by abhinonymous 3
  • Block on preprocessing

    Hi, I am trying to build my own dataset using this code, but when I executed it, it blocked at wav-file feature extraction while consuming all 48 CPUs. I waited a whole night and nothing happened; any idea? I am using Ubuntu 16.04 and have tried Python 3.7 and 2.7. My wav files are 48000 Hz, 16-bit.

    opened by zhang-yunke 0
  • Batch

    When you give input to the model, do you give one npz to the model at a time? It seems:

    1. train() is called for every epoch.
    2. "for full_txt, full_feat, spkr in train_enum" is called for every batch.
    3. "for txt, feat, spkr, start in batch_iter:" is called for every npz, and model.forward() is called there; the loss is then summed over the batch.

    Then what is the point of having a batch at all when the model is not called batch-wise?
    opened by hash2430 0
  • TBPTTIter.split_length() error

    TBPTTIter.split_length() gives the following error: "TypeError: mul(): argument 'other' (position 1) must be Tensor, not list" while executing "seq = [self.seq_len] * (seq_size / self.seq_len)". It only makes sense to me that this line raises the error. Does that line work for others?

    I mean, it could be caused by a PyTorch version mismatch, but it's hard to believe that line works in other versions of PyTorch.

    I'm not sure whether I should work on setting up the right environment or fix that function to resolve the error.

    opened by hash2430 1
  • Looks like it fails on the '!' character.

    Looks like it fails on the '!' character:

    python2 generate.py --text "Hello there!" --checkpoint models/blizzard/bestmodel.pth
    -bash: !": event not found
    

    However, it looks like it works if it is put in a .sh file.

    opened by mrgloom 0
  • Understanding feat tensor dimensions

    What do the feat tensor dimensions mean? https://github.com/facebookresearch/loop/blob/331cbd0ac2c5824998424095f91b80affff50d86/generate.py#L124

    It looks like when generation is done from text, the feat tensor is empty both before and after inference, so is it just a way of inputting precomputed features when loading from npz?

    opened by mrgloom 1
Owner: Facebook Archive. These projects have been archived and are generally unsupported, but are still available to view and use.