VoiceLoop

PyTorch implementation of the method described in the paper VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop.

VoiceLoop is a neural text-to-speech (TTS) system that can synthesize speech in voices sampled in the wild. Some demo samples can be found here.

Quick Start

Follow the instructions in Setup and then simply execute:

python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth

Results will be placed in models/vctk/results. The script generates two samples: the synthesized sample and the corresponding ground-truth (test) sample.

You can also generate the same text but with a different speaker, specifically:

python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 18 --checkpoint models/vctk/bestmodel.pth

This synthesizes the same utterance in the second speaker's voice.

Here are the corresponding attention plots.

Legend: the x-axis is output time (acoustic samples); the y-axis is the input (text/phonemes). The left figure is speaker 10, the right is speaker 14.
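
To reproduce such a plot from a saved attention matrix, here is a minimal matplotlib sketch (assuming the attention has been dumped as a 2-D NumPy array of shape output-frames x input-phonemes; the file name attn.npy is a placeholder, not something the repository writes):

import numpy as np
import matplotlib.pyplot as plt

# hypothetical dump of the attention weights returned by the model
attn = np.load('attn.npy')

plt.imshow(attn.T, aspect='auto', origin='lower', interpolation='nearest')
plt.xlabel('output time (acoustic samples)')
plt.ylabel('input (text/phonemes)')
plt.savefig('attention.png')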

Finally, free text is also supported:

python generate.py  --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth
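
Free-text input is first converted to phonemes via phonemizer. As a rough illustration of that step (a sketch against a recent phonemizer release with the espeak backend; the exact API of the version this repo pins may differ):

from phonemizer import phonemize

# text -> phoneme string; the model then maps phonemes to its own integer codes
print(phonemize('hello world', language='en-us', backend='espeak'))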

Setup

Requirements: Linux/OSX, Python 2.7, and PyTorch 0.1.12. Generation requires installing phonemizer; follow the setup instructions there. The current version of the code requires CUDA support for training. Generation can be done on the CPU.

git clone https://github.com/facebookresearch/loop.git
cd loop
pip install -r scripts/requirements.txt
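
Before training, a quick sanity check that the expected PyTorch build and CUDA support are visible (not part of the repository, just a convenience):

import torch

print(torch.__version__)          # the paper code targets 0.1.12
print(torch.cuda.is_available())  # True is required for training; generation runs on CPU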

Data

The data used to train the models in the paper can be downloaded via:

bash scripts/download_data.sh

The script downloads and preprocesses a subset of VCTK. This subset contains speakers with American accents.

The dataset was preprocessed using Merlin: from each audio clip we extracted vocoder features using the WORLD vocoder. After downloading, the dataset will be located under the data subfolder as follows:

loop
├── data
    └── vctk
        ├── norm_info
        │   ├── norm.dat
        ├── numpy_features
        │   ├── p294_001.npz
        │   ├── p294_002.npz
        │   └── ...
        └── numpy_features_valid

The preprocess pipeline can be executed using the following script by Kyle Kastner: https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300.
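
Each .npz bundles the features of a single utterance. To see exactly what a downloaded file contains (a minimal sketch; the file name comes from the tree above):

import numpy as np

feats = np.load('data/vctk/numpy_features/p294_001.npz')
for key in feats.files:
    print(key, feats[key].shape)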

Pretrained Models

Pretrained models can be downloaded via:

bash scripts/download_models.sh

After downloading, the models will be located under subfolder models as follows:

loop
├── data
├── models
    ├── blizzard
    ├── vctk
    │   ├── args.pth
    │   └── bestmodel.pth
    └── vctk_alt

Update 10/25/2017: Single speaker model available in models/blizzard/
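
To inspect a downloaded checkpoint without a GPU, it can be loaded onto the CPU first (a sketch; it assumes the files are standard torch-serialized objects):

import torch

# map_location forces CPU tensors even if the model was trained on CUDA
ckpt = torch.load('models/vctk/bestmodel.pth',
                  map_location=lambda storage, loc: storage)
args = torch.load('models/vctk/args.pth')
print(type(ckpt), args)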

SPTK and WORLD

Finally, speech generation requires SPTK 3.9 and the WORLD vocoder, as done in Merlin. To download the executables:

bash scripts/download_tools.sh

This results in the following subdirectories:

loop
├── data
├── models
├── tools
    ├── SPTK-3.9
    └── WORLD
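
A quick check that generate.py will find the vocoder binaries where it expects them (a sketch; the x2x path also appears in the error reports under Comments):

import os

for path in ('tools/SPTK-3.9/x2x', 'tools/WORLD'):
    print(path, 'ok' if os.path.exists(path) else 'MISSING')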

Training

Single-Speaker

The single-speaker model is trained on Blizzard 2011. The data should be downloaded and prepared as described above. Once the data is ready, run:

python train.py --noise 1 --expName blizzard_init --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-5 --epochs 10

Then, continue training the model with:

python train.py --noise 1 --expName blizzard --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-4 --checkpoint checkpoints/blizzard_init/bestmodel.pth --epochs 90

Multi-Speaker

To train a new model on VCTK, first train the model using a noise level of 4 and an input sequence length of 100:

python train.py --expName vctk --data data/vctk --noise 4 --seq-len 100 --epochs 90

Then, continue training the model using a noise level of 2 on full sequences:

python train.py --expName vctk_noise_2 --data data/vctk --checkpoint checkpoints/vctk/bestmodel.pth --noise 2 --seq-len 1000 --epochs 90
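
Both phases can be chained with a small wrapper so the low-noise run starts as soon as the first finishes (a convenience sketch built only from the two commands above):

import subprocess

# phase 1: noise level 4, short sequences
subprocess.check_call(
    'python train.py --expName vctk --data data/vctk'
    ' --noise 4 --seq-len 100 --epochs 90', shell=True)

# phase 2: resume from the best phase-1 checkpoint, noise level 2, full sequences
subprocess.check_call(
    'python train.py --expName vctk_noise_2 --data data/vctk'
    ' --checkpoint checkpoints/vctk/bestmodel.pth'
    ' --noise 2 --seq-len 1000 --epochs 90', shell=True)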

Citation

If you find this code useful in your research then please cite:

@article{taigman2017voice,
  title           = {VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop},
  author          = {Taigman, Yaniv and Wolf, Lior and Polyak, Adam and Nachmani, Eliya},
  journal         = {ArXiv e-prints},
  archivePrefix   = "arXiv",
  eprinttype      = {arxiv},
  eprint          = {1707.06588},
  primaryClass    = "cs.CL",
  year            = {2017},
  month           = oct,
}

License

Loop has a CC-BY-NC license.

Comments
  • New Dataset

    New Dataset

    Hi, everything worked perfectly with your preprocessed VCTK. Now I want to test with the Nancy dataset. I'm using the script you suggested, but I have two questions:

    1. When I run the script I get two files in the norm_info folder: label_norm_HTS_420.dat and norm_info_mgc_lf0_vuv_bap_63_MVN.dat. Based on the shape, the correct file is norm_info_mgc_lf0_vuv_bap_63_MVN.dat, but I want to be sure.

    2. In order to combine both datasets, should I run the script for each speaker and then somehow combine the norm files, or should I put all the data in one folder and process it together?

    Thanks.

    opened by jdbermeol 35
  • Issue for training on new Dataset.

    Issue for training on new Dataset.

    Hi,

    Thanks for sharing the project; I am doing some experiments with the tools. I have two questions.

    1. The npz files downloaded with download_data.sh differ from the ones generated by extract_feats.py for the same wave/text file, say p294_001. Why does this happen? The other arrays also have some differences. Downloaded: phonemes [28 22 19 41 21 3 22 31 34 11 22 5], durations [29 4 25 18 21 27 11 32 7 12 39 3]. Extracted: phonemes [28 22 19 40 21 3 22 31 33 11 22 5], durations [9 6 23 33 6 17 24 32 3 14 28 32].

    2. If I want to retrain the model on my data, I need to extract features to prepare the npz files. Do I need to put the training set and validation set together when running extract_feats.py to get norm.dat, or should I use only the training data to get norm.dat and then kick off training?

    Thank you for your guidance in advance. :)

    opened by hepower 10
  • Reproducing the results

    Reproducing the results

    Hi, thanks for open sourcing the code!

    I am trying to reproduce your results. However, I am running into problems. I have been training:

    • sequence length: 100
    • epoch: 90
    • only American accent VCTK speaker samples
    • noise level 4

    So the problem is that only some speakers actually produce a speech signal for the input; the majority of speakers produce only noise. Moreover, which speakers produce speech depends on the actual phoneme input. The problem seems to be that the attention does not work correctly for these samples: it basically stays at the beginning of the sequence and does not advance.

    Did you have a similar issue when training the model? Or do you have an idea what the problem could be?

    good attention with speech output: p226_009_11.pdf p225_005_4.pdf

    somewhat working: p226_009_2.pdf

    Most examples: p226_009_9.pdf p226_009_13.pdf p226_009_1.pdf

    Thanks!

    opened by pfriesch 9
  • ERROR: Failed to find norm file.

    ERROR: Failed to find norm file.

    python generate.py  --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth
    ERROR: Failed to find norm file.
    
    python2 generate.py  --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth
    Traceback (most recent call last):
      File "generate.py", line 10, in <module>
        import phonemizer
    ImportError: No module named phonemizer
    
    which python
    /Users/my_user/external_projects/text-to-speech/facebook-voiceloop/my_env/bin/python
    which python2
    /Users/my_user/external_projects/text-to-speech/facebook-voiceloop/my_env/bin/python2
    
    opened by mrgloom 7
  • Blizzard Model

    Blizzard Model

    Can't use the Blizzard model without the original training data:

    Traceback (most recent call last):
      File "generate.py", line 156, in <module>
        main()
      File "generate.py", line 83, in main
        train_dataset = NpzFolder(train_args.data + '/numpy_features')
      File "/home/michael/Desktop/loop/data.py", line 84, in __init__
        self.NPZ_EXTENSION))
    RuntimeError: Found 0 npz in subfolders of: data/blizzard/numpy_features
    Supported image extensions are: npz
    

    It looks like generate.py uses parameters from the training data in order to generate.

    opened by PetrochukM 6
  • New Language

    New Language

    First of all, thank you for releasing the code. I would like to know how difficult it would be to train on speaker data in a new language such as Turkish. As far as I saw, the generation step needs some kind of pronunciation dictionary. But what about the preprocessing steps: are Merlin and the other tools language-agnostic? Thank you in advance.

    opened by gaziway 4
  • Adding new datasets to train

    Adding new datasets to train

    Hi,

    How can you add new datasets (voices) for training? I want to use this dataset: https://linksync-2032.kxcdn.com/wp-content/uploads/2017/06/female-voice-1.zip

    They are all .wav files, and I want to add them as a dataset so I can use that voice.

    opened by jaxlinksync 4
  • Getting errors in generate.py

    Getting errors in generate.py

    python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth

    Traceback (most recent call last):
      File "generate.py", line 153, in <module>
        main()
      File "generate.py", line 142, in main
        norm_path)
      File "/mnt/sdb1/Learning/pytorch/loop/utils.py", line 257, in generate_merlin_wav
        weight=os.path.join(gen_dir, 'weight')), shell=True)
      File "/mnt/sdb1/Learning/pytorch/loop/utils.py", line 121, in pe
        for line in execute(cmd, shell=shell):
      File "/mnt/sdb1/Learning/pytorch/loop/utils.py", line 114, in execute
        raise subprocess.CalledProcessError(return_code, cmd)
    subprocess.CalledProcessError: Command 'echo 1 1 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 | /mnt/sdb1/Learning/pytorch/loop/tools/SPTK-3.9/x2x +af > /mnt/sdb1/Learning/pytorch/loop/models/vctk/results/weight' returned non-zero exit status 127

    opened by amitabhpatil 4
  • Training Error. Epoch 5.

    Training Error. Epoch 5.

    This is the stack trace:

    Traceback (most recent call last):
      File "train.py", line 211, in <module>
        main()
      File "train.py", line 199, in main
        train(model, criterion, optimizer, epoch, train_losses)
      File "train.py", line 119, in train
        loss = criterion(output, target[0], target[1])
      File "/home/michael/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/michael/Desktop/loop/model.py", line 42, in forward
        mask_ = mask.expand_as(input)
      File "/home/michael/.local/lib/python2.7/site-packages/torch/autograd/variable.py", line 655, in expand_as
        return Expand(tensor.size())(self)
      File "/home/michael/.local/lib/python2.7/site-packages/torch/autograd/_functions/tensor.py", line 115, in forward
        result = i.expand(*self.sizes)
    RuntimeError: The expanded size of the tensor (21) must match the existing size (5) at               non-singleton dimension 0. at /b/wheel/pytorch-src/torch/lib/TH/THStorage.c:99
    

    Any clue what is going on?

    opened by PetrochukM 3
  • Feature extraction

    Feature extraction

    I trained loop with a subset of the VCTK data (American speakers). I found that the audio for those speakers when I run generate.py with my trained model is pretty bad. I hear only a couple of words in a sentence; the rest is silence or noise.

    My guess is that something went wrong during feature extraction. When I compare the same feature-extracted file, i.e. p294_001.npz, from the given S3 bucket against the one I extracted by running extract_feats.py, I see that vuv_idx from S3 has larger numbers (range: -5 to 5) compared to mine (range: -10e-02 to 5).

    I also noticed that text_features and audio_features have different shapes: (226, 420) for S3 vs. (540, 420) for mine.

    Other features like durations and code2phone also look different.

    May I know what changes I have to make to extract_feats.py to get features similar to the ones in S3?

    opened by jayavanth 3
  • error on generate.py execution

    error on generate.py execution

    I get an error upon executing the following:

    (gpu_13) abhinav@ubuntu11:~/.../loop$ python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth
    Traceback (most recent call last):
      File "generate.py", line 153, in <module>
        main()
      File "generate.py", line 132, in main
        out, attn = model([txt, spkr], feat)
      File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/software/LM_stash/abhinav/projects/tts/loop/model.py", line 247, in forward
        context, ident = self.encoder(src[0], src[1])
      File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/software/LM_stash/abhinav/projects/tts/loop/model.py", line 66, in forward
        outputs = self.lut_p(input)
      File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/modules/sparse.py", line 94, in forward
        self.scale_grad_by_freq, self.sparse
      File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/_functions/thnn/sparse.py", line 48, in forward
        cls._renorm(indices, weight, max_norm, norm_type)
    TypeError: _renorm() takes exactly 5 arguments (4 given)
    
    

    I have followed all the steps in the Setup section.

    opened by abhinonymous 3
  • Block on preprocessing

    Block on preprocessing

    Hi, I am trying to build my own dataset using this code. But when I executed it, it somehow blocked at the wav-file feature extraction, consuming all 48 CPUs. I waited a whole night and nothing happened; any idea? I am using Ubuntu 16.04 and tried Python 3.7 and 2.7. My wav files are 48000 Hz sample rate and 16-bit.

    opened by zhang-yunke 0
  • Batch

    Batch

    When you give input to the model, do you give one npz to the model at a time? It seems:

    1. train() is called for every epoch
    2. "for full_txt, full_feat, spkr in train_enum" is called for every batch
    3. "for txt, feat, spkr, start in batch_iter:" is called for every npz ==> model.forward() is called here, then the loss is summed up for the batch

    Then what is the meaning of having a batch at all when the model is not called batch-wise?
    opened by hash2430 0
  • TBPTTIter.split_length() error

    TBPTTIter.split_length() error

    TBPTTIter.split_length() gives the following error: "TypeError: mul(): argument 'other' (position 1) must be Tensor, not list" while trying "seq = [self.seq_len] * (seq_size / self.seq_len)". It only makes sense to me that this line raises that error. Does that line work well for others?

    I mean, it could be caused by a PyTorch version mismatch, but it's hard to believe that line works well for other versions of PyTorch.

    I'm confused whether I should work on setting up the right environment or fix that function to resolve the error.
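
    For reference, a standalone reproduction and the cast that avoids it (a sketch, assuming seq_size arrives as a 0-dim tensor under a newer PyTorch):

    import torch

    seq_len = 100
    seq_size = torch.tensor(1000)  # what a newer PyTorch can hand back here
    # [seq_len] * (seq_size / seq_len) raises: list * tensor is undefined
    seq = [seq_len] * int(seq_size // seq_len)  # cast to a plain int first
    print(len(seq))  # 10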

    opened by hash2430 1
  • Looks like it fails on the '!' character.

    Looks like it fails on the '!' character.

    Looks like it fails on the '!' character:

    python2 generate.py --text "Hello there!" --checkpoint models/blizzard/bestmodel.pth
    -bash: !": event not found
    

    However, it looks like it works if it is put in a .sh file.
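
    Using single quotes (or escaping the !) disables bash history expansion, so the command also works interactively:

    python2 generate.py --text 'Hello there!' --checkpoint models/blizzard/bestmodel.pth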

    opened by mrgloom 0
  • Understanding feat tensor dimensions

    Understanding feat tensor dimensions

    What does feat tensor dimensions mean? https://github.com/facebookresearch/loop/blob/331cbd0ac2c5824998424095f91b80affff50d86/generate.py#L124

    It looks like, when generation is done from text, the feat tensor is empty before inference and filled after, so is it just a way to pass precomputed features when loading from npz?

    opened by mrgloom 1