An official reimplementation of the method described in the INTERSPEECH 2021 paper - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Facebook Research

Last update: Jan 6, 2023

Overview

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Implementation of the method described in the paper Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Abstract: We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.

Quick Links

Samples
Setup
Training
Inference

Setup

Software

Requirements:

Python >= 3.6
PyTorch v1.8

Install dependencies

git clone https://github.com/facebookresearch/speech-resynthesis.git
cd speech-resynthesis
pip install -r requirements.txt

Data

For LJSpeech:

Download the LJSpeech dataset into the data/LJSpeech-1.1 folder.

Downsample audio from 22.05 kHz to 16 kHz and pad

python ./scripts/preprocess.py \
--srcdir data/LJSpeech-1.1/wavs \
--outdir data/LJSpeech-1.1/wavs_16khz \
--pad

For VCTK:

Download the VCTK dataset into the data/VCTK-Corpus folder.

Downsample audio from 48 kHz to 16 kHz, trim trailing silences and pad

python ./scripts/preprocess.py \
--srcdir data/VCTK-Corpus/wav48 \
--outdir data/VCTK-Corpus/wav16 \
--trim --pad
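
The sample-rate arithmetic behind the two preprocessing commands can be sketched as follows. This is only an illustration, not the logic of scripts/preprocess.py: the function names are made up for this example, a real resampler may differ from the computed length by a sample, and the padding multiple of 1280 used in the usage lines is a hypothetical value chosen for demonstration.

```python
from fractions import Fraction
import math

TARGET_SR = 16_000  # target rate named in the preprocessing steps


def resample_ratio(source_sr, target_sr=TARGET_SR):
    """Exact ratio of output samples to input samples."""
    return Fraction(target_sr, source_sr)


def resampled_length(num_samples, source_sr, target_sr=TARGET_SR):
    """Approximate number of samples after resampling (rounded up)."""
    return math.ceil(num_samples * resample_ratio(source_sr, target_sr))


def padded_length(num_samples, multiple):
    """Smallest multiple of `multiple` that is >= num_samples."""
    return -(-num_samples // multiple) * multiple


if __name__ == "__main__":
    # LJSpeech: 22.05 kHz -> 16 kHz
    print("LJSpeech ratio:", resample_ratio(22_050))
    # VCTK: 48 kHz -> 16 kHz
    print("VCTK ratio:", resample_ratio(48_000))
    # One second of 22.05 kHz audio, then padding to a hypothetical multiple
    n = resampled_length(22_050, 22_050)
    print("resampled:", n, "padded:", padded_length(n + 1, 1280))
```

The ratios reduce to 320/441 for LJSpeech and 1/3 for VCTK, which is why one second of LJSpeech audio (22050 samples) becomes 16000 samples.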

Training

F0 Quantizer Model

To train F0 quantizer model, use the following command:

python -m torch.distributed.launch --nproc_per_node <NUM_GPUS> train_f0_vq.py \
--checkpoint_path checkpoints/lj_f0_vq \
--config configs/LJSpeech/f0_vqvae.json

Set <NUM_GPUS> to the number of available GPUs on your machine.
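
If you launch training from a Python script rather than the shell, the command above can be assembled programmatically. The helper below is a minimal sketch: build_launch_command is a hypothetical name, not part of this repository, and it only constructs the argument list shown above without running anything.

```python
import shlex


def build_launch_command(script, checkpoint_path, config, num_gpus):
    """Return the torch.distributed.launch command as an argument list.

    Hypothetical helper for illustration; it mirrors the README command.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(num_gpus),
        script,
        "--checkpoint_path", checkpoint_path,
        "--config", config,
    ]


if __name__ == "__main__":
    cmd = build_launch_command(
        script="train_f0_vq.py",
        checkpoint_path="checkpoints/lj_f0_vq",
        config="configs/LJSpeech/f0_vqvae.json",
        num_gpus=2,  # example value; use the GPU count of your machine
    )
    # Print a shell-safe version; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

On a machine with PyTorch installed, torch.cuda.device_count() is one way to obtain the value for num_gpus.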

Resynthesis Model

To train a resynthesis model, use the following command:

python -m torch.distributed.launch --nproc_per_node <NUM_GPUS> train.py \
--checkpoint_path checkpoints/lj_vqvae \
--config configs/LJSpeech/vqvae256_lut.json

Supported Configurations

Currently, we support the following training schemes:

Dataset	SSL Method	Dictionary Size	Config Path
LJSpeech	HuBERT	100	`configs/LJSpeech/hubert100_lut.json`
LJSpeech	CPC	100	`configs/LJSpeech/cpc100_lut.json`
LJSpeech	VQVAE	256	`configs/LJSpeech/vqvae256_lut.json`
VCTK	HuBERT	100	`configs/VCTK/hubert100_lut.json`
VCTK	CPC	100	`configs/VCTK/cpc100_lut.json`
VCTK	VQVAE	256	`configs/VCTK/vqvae256_lut.json`
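
The Dictionary Size column bounds how many bits each discrete code can carry: a dictionary of K entries needs at most log2(K) bits per code. The sketch below works through that arithmetic. The function names are illustrative, and the code rate (codes per second) is not stated in this README, so the value used in the last lines is purely hypothetical.

```python
import math


def bits_per_code(dictionary_size):
    """Upper bound on bits needed to index one dictionary entry."""
    if dictionary_size < 2:
        raise ValueError("dictionary_size must be at least 2")
    return math.log2(dictionary_size)


def bitrate_bps(dictionary_size, codes_per_second):
    """Bitrate of a code stream, before any entropy coding."""
    return bits_per_code(dictionary_size) * codes_per_second


if __name__ == "__main__":
    for name, size in [("HuBERT / CPC", 100), ("VQVAE", 256)]:
        print(f"{name}: {bits_per_code(size):.2f} bits per code")

    # Hypothetical code rate, chosen only to show how the pieces combine.
    hypothetical_codes_per_second = 50
    rate = bitrate_bps(100, hypothetical_codes_per_second)
    print(f"100 entries at {hypothetical_codes_per_second} codes/s: {rate:.1f} bps")
```

A 256-entry dictionary therefore costs 8 bits per code, and a 100-entry dictionary about 6.64 bits per code.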

Inference

To generate, simply run:

python inference.py \
--checkpoint_file checkpoints/0 \
-n 10 \
--output_dir generations

To synthesize multiple speakers:

python inference.py \
--checkpoint_file checkpoints/vctk_cpc100 \
-n 10 \
--vc \
--input_code_file datasets/VCTK/cpc100/test.txt \
--output_dir generations_multispkr

You can also generate with codes from a different dataset:

python inference.py \
--checkpoint_file checkpoints/lj_cpc100 \
-n 10 \
--input_code_file datasets/VCTK/cpc100/test.txt \
--output_dir generations_vctk_to_lj

Preprocessing New Datasets

CPC / HuBERT Coding

To quantize new datasets with CPC or HuBERT, follow the instructions described in the GSLM code.

To parse CPC output:

python scripts/parse_cpc_codes.py \
--manifest cpc_output_file \
--wav-root wav_root_dir \
--outdir parsed_cpc

To parse HuBERT output:

python parse_hubert_codes.py \
--codes hubert_output_file \
--manifest hubert_tsv_file \
--outdir parsed_hubert

VQVAE Coding

First, download the LibriLight dataset and move it to data/LibriLight.

For VQVAE, train a vqvae model using the following command:

python -m torch.distributed.launch --nproc_per_node <NUM_GPUS> train.py \
--checkpoint_path checkpoints/ll_vq \
--config configs/LibriLight/vqvae256.json

To extract VQVAE codes:

python infer_vqvae_codes.py \
--input_dir folder_with_wavs_to_code \
--output_dir vqvae_output_folder \
--checkpoint_file checkpoints/ll_vq

To parse VQVAE output:

python parse_vqvae_codes.py \
--manifest vqvae_output_file \
--outdir parsed_vqvae

License

You may find out more about the license here.

Citation

@inproceedings{polyak21_interspeech,
  author={Adam Polyak and Yossi Adi and Jade Copet and 
          Eugene Kharitonov and Kushal Lakhotia and 
          Wei-Ning Hsu and Abdelrahman Mohamed and Emmanuel Dupoux},
  title={{Speech Resynthesis from Discrete Disentangled Self-Supervised Representations}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
}

Acknowledgements

This implementation uses code from the following repos: HiFi-GAN and Jukebox, as described in our code.

Comments

For VQ-VAE coding at 800 bps, were the "small" and "medium" audio subsets (about 6000 hours) used for training?

In the paper, "The VQ-VAE model employs the HiFiGAN decoder trained on the LibriLight dataset to match the amount of data reported in [34]." How many hours of LibriLight were used in the training?

opened by xiaoli1996 6
Running yaapt on-the-fly extremely slows the training
Hi, thanks for kindly releasing the code for the paper. (Also congratulations on the acceptance in INTERSPEECH!)

While running the code, I encountered a significant issue: pYAAPT.yaapt slows training down severely. Here's how I found this speed bottleneck:

I tried to run train_f0_vq.py as specified in README.

However, training was too slow; looks like we need to train an f0 vq model for 400000 steps, but a single epoch (about 700 steps) took 2657 seconds to run. GPU util was really low, and CPUs were running like crazy. (My server has 3080 Ti with 64 CPU cores.)

I suspected pYAAPT.yaapt to be a cause for this. To test that, I forked a repository to add a caching functionality: https://github.com/seungwonpark/speech-resynthesis

After that, a single epoch after the first epoch (for an initial caching) took only 36 seconds.

So my question is, how did you manage to run yaapt on-the-fly without caching? Though I succeeded in training the model fast enough, I shall need to disable caching again since it requires the _sample_interval method to sample the same interval for each audio (i.e. disabling the data augmentation via randomly choosing the interval).
opened by seungwonpark 4
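
Using the figures in the report above (400000 steps, about 700 steps per epoch), 2657 seconds per epoch works out to roughly 17.6 days of training, versus about 5.7 hours at 36 seconds per epoch. One possible way to keep both the cache and the random interval sampling is to cache the F0 track of each whole utterance once and slice a random segment from it afterwards. The sketch below illustrates that idea only. It is not the authors' method or the code in the fork: F0Cache and its methods are hypothetical names, extract_fn stands in for a call to pYAAPT, and the approach assumes that slicing a full-utterance F0 track is an acceptable substitute for running the extractor on each segment.

```python
import random


class F0Cache:
    """Cache one full-utterance F0 track per file, then slice random segments."""

    def __init__(self, extract_fn):
        self._extract = extract_fn  # stand-in for the slow F0 extractor
        self._tracks = {}

    def full_track(self, path, audio):
        # Run the expensive extractor only the first time a file is seen.
        if path not in self._tracks:
            self._tracks[path] = self._extract(audio)
        return self._tracks[path]

    def random_segment(self, path, audio, segment_frames, rng=random):
        """Return (start_frame, f0_segment) with a freshly sampled start."""
        track = self.full_track(path, audio)
        if len(track) <= segment_frames:
            return 0, track
        start = rng.randrange(len(track) - segment_frames + 1)
        return start, track[start:start + segment_frames]


if __name__ == "__main__":
    calls = {"count": 0}

    def placeholder_extractor(audio):
        # Placeholder: returns one dummy value per input sample.
        calls["count"] += 1
        return [0.0] * len(audio)

    cache = F0Cache(placeholder_extractor)
    audio = [0.0] * 100
    for _ in range(3):
        start, segment = cache.random_segment("utt1.wav", audio, 20)
        print("start frame:", start, "segment length:", len(segment))
    print("extractor calls:", calls["count"])
```

The returned start frame would still need to be converted to a sample offset (start frame times the F0 hop size) so that the audio segment stays aligned with the F0 segment.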
VCTK dataset v0.92 not compatible with current training pipeline/scripts

The link to the VCTK dataset mentioned in the README (https://datashare.ed.ac.uk/handle/10283/3443) points to version 0.92, whose content doesn't correspond to the file paths provided in datasets/VCTK/cpc100/train.txt. VCTK 0.92 contains two microphone recordings (mic1, mic2) in FLAC format. Which mic should we use? Or should we combine them?

opened by slegroux 3
The problem about the pre-processing of the VCTK dataset

Hello, could you help me with the pre-processing of the VCTK dataset? I followed the link to CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)

and put all the data into the path: ./data/VCTK-Corpus

There are no wav48 or wav16 folders in the downloaded files. Did I download the wrong file, or did I do something wrong?

Could I ask for help? I will be grateful for any help you can provide.

opened by chullin 1
Bigger speech2unit Hubert versions

Hi,

I was just wondering if you have tried hubert-large or hubert-xtralarge as alternatives to hubert-base for speech2unit. First, I tried to train a kmeans for hubert-base and retrain the vocoder part to see if I could replicate the results obtained with the pretrained kmeans, but the results that I get are worse. I would very much appreciate it if you either released the pretrained kmeans for hubert-large or hubert-xtralarge (if you have it), or gave me some guidelines to try to replicate your results. Specifically, I want to know the number of kmeans iterations, the number of centroids, the HuBERT layer, and the batch_size used. Currently, I'm training the kmeans with 150 iterations, 100 centroids, the 6th layer outputs, and a batch_size of 10000, but I don't know if these parameters are correct.

Thank you in advance

opened by AlexFuster 1
Coding new dataset for training

Hello!

I read the corresponding section in the README, and understand that I need to download the LibriLight dataset to train the VQVAE. I downloaded the small.tar file, but upon unpacking it I don't see files with paths like /checkpoint/pem/morgane/LibriBig/3717/9/3717_3120_9_0021.wav, but rather something like LibriLight/100/sea_fairies_0812_librivox_64kb_mp3/01_baum_sea_fairies_64kb.flac. Am I downloading the correct LibriLight? If not, what can I do?

Thank you!

opened by siyan-sylvia-li 0

