Autoregressive Predictive Coding: An unsupervised autoregressive model for speech representation learning

This repository contains the official implementation (in PyTorch) of Autoregressive Predictive Coding (APC) proposed in An Unsupervised Autoregressive Model for Speech Representation Learning.

APC is a speech feature extractor trained on a large amount of unlabeled data. With an unsupervised, autoregressive training objective, representations learned by APC not only capture general acoustic characteristics such as speaker and phone information from the speech signals, but are also highly accessible to downstream models: our experimental results on phone classification show that a linear classifier taking the APC representations as input features significantly outperforms a multi-layer perceptron using the surface features.
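
To make the autoregressive objective concrete: the model reads frames up to time t and is trained to predict the frame n steps ahead. The following framework-free sketch only illustrates how inputs and targets are paired and scored, assuming an L1 loss and a shift of n frames as described in the paper. The names shift_pairs and l1_loss are hypothetical and are not functions from this repository.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def shift_pairs(frames: Sequence[Frame], n: int) -> Tuple[List[Frame], List[Frame]]:
    """Pair each input frame x[t] with the target frame x[t + n]."""
    if n < 1 or n >= len(frames):
        raise ValueError("n must satisfy 1 <= n < len(frames)")
    return list(frames[:-n]), list(frames[n:])


def l1_loss(predictions: Sequence[Frame], targets: Sequence[Frame]) -> float:
    """Mean absolute difference over all frames and feature dimensions."""
    total, count = 0.0, 0
    for pred, tgt in zip(predictions, targets):
        for p, t in zip(pred, tgt):
            total += abs(p - t)
            count += 1
    return total / count


# Toy utterance: 5 frames of 2-dimensional features (real inputs are log Mel frames).
utterance = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]
inputs, targets = shift_pairs(utterance, n=2)

# A trivial stand-in "model" that copies its input frame, used only to exercise the loss.
copy_loss = l1_loss(inputs, targets)
```

In an actual APC model the predictions come from the RNN rather than from copying the input; the pairing of x[t] with x[t + n] is the part this sketch is meant to show.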

Dependencies

  • Python 3.5
  • PyTorch 1.0

Dataset

In the paper, we used the train-clean-360 split from the LibriSpeech corpus for training the APC models, and the dev-clean split for keeping track of the training loss. We used the log Mel spectrograms, which were generated by running the Kaldi scripts, as the input acoustic features to the APC models. Of course you can generate the log Mel spectrograms yourself, but to help you better reproduce our results, here we provide the links to the data preprocessed by us that can be directly fed to the APC models. We also include other data splits that we did not use in the paper for you to explore. For example, you can try training an APC model on a larger and noisier set (e.g., train-other-500) and see if it learns more robust speech representations.

Training APC

Below we follow the paper and use train-clean-360 and dev-clean for demonstration. Once you have downloaded the data, decompress it by running:

xz -d train-clean-360.xz
xz -d dev-clean.xz

Then, create a directory librispeech_data/kaldi and move the data into it:

mkdir -p librispeech_data/kaldi
mv train-clean-360-hires-norm.blogmel librispeech_data/kaldi
mv dev-clean-hires-norm.blogmel librispeech_data/kaldi

Now we will have to transform the data into the format loadable by the PyTorch DataLoader. To do so, simply run:

# Prepare the training set
python prepare_data.py --librispeech_from_kaldi librispeech_data/kaldi/train-clean-360-hires-norm.blogmel --save_dir librispeech_data/preprocessed/train-clean-360-hires-norm.blogmel
# Prepare the validation set
python prepare_data.py --librispeech_from_kaldi librispeech_data/kaldi/dev-clean-hires-norm.blogmel --save_dir librispeech_data/preprocessed/dev-clean-hires-norm-blogmel

Once the program is done, you will see a directory preprocessed/ inside librispeech_data/ that contains all the preprocessed PyTorch tensors.

To train an APC model, simply run:

python train_apc.py

By default, the trained models will be put in logs/. You can also use TensorBoard to track the training progress. There are many other configurations you can try; check train_apc.py for more details. It is well documented and should be self-explanatory.

Feature extraction

Once you have trained your APC model, you can use it to extract speech features from your target dataset. To do so, run a forward pass of the trained model on the target dataset and retrieve the extracted features:

_, feats = model.forward(inputs, lengths)

feats is a PyTorch tensor of shape (num_layers, batch_size, seq_len, rnn_hidden_size) where:

  • num_layers is the RNN depth of your APC model
  • batch_size is your inference batch size
  • seq_len is the maximum sequence length and is determined when you run prepare_data.py. By default this value is 1600.
  • rnn_hidden_size is the dimensionality of the RNN hidden unit.

As you can see, feats is essentially the RNN hidden states in an APC model. You can think of APC as a speech version of ELMo if you are familiar with it.
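
Because seq_len is fixed when prepare_data.py runs, utterances of different durations end up in tensors of the same length, and the true number of frames has to be carried separately (presumably the role of the lengths argument above). Below is a minimal sketch of that idea, assuming zero-padding for short utterances and truncation for long ones. pad_or_truncate is a hypothetical helper for illustration, not code from this repository.

```python
from typing import List, Sequence, Tuple


def pad_or_truncate(
    frames: Sequence[Sequence[float]], max_len: int = 1600
) -> Tuple[List[List[float]], int]:
    """Return (fixed-length frames, true length) for one utterance."""
    if not frames:
        raise ValueError("utterance has no frames")
    feat_dim = len(frames[0])
    kept = [list(f) for f in frames[:max_len]]  # truncate long utterances
    length = len(kept)                           # number of real (unpadded) frames
    # Zero-pad short utterances up to max_len (padding scheme is an assumption).
    kept.extend([0.0] * feat_dim for _ in range(max_len - length))
    return kept, length


# Two real frames padded to a fixed length of 4.
padded, length = pad_or_truncate([[1.0, 2.0], [3.0, 4.0]], max_len=4)
```

When pooling or classifying over feats, the stored length can then be used to ignore the padded positions.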

There are many ways to incorporate feats into your downstream task. One of the easiest ways is to take only the outputs of the last RNN layer (i.e., feats[-1, :, :, :]) as the input features to your downstream model, which is what we did in our paper. Feel free to explore other mechanisms.
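
As an illustration, the sketch below shows the last-layer selection next to one possible alternative, an ELMo-style weighted sum over layers. It runs on a random stand-in tensor with the shape described above. The helper names last_layer and weighted_layer_sum, and the small tensor sizes, are assumptions for the example and are not part of this repository.

```python
import torch


def last_layer(feats: torch.Tensor) -> torch.Tensor:
    """Select the top RNN layer: (batch_size, seq_len, rnn_hidden_size)."""
    return feats[-1]


def weighted_layer_sum(feats: torch.Tensor, layer_logits: torch.Tensor) -> torch.Tensor:
    """ELMo-style mix: softmax-normalized weighted sum over the layer axis."""
    weights = torch.softmax(layer_logits, dim=0)            # (num_layers,)
    return (weights.view(-1, 1, 1, 1) * feats).sum(dim=0)   # (batch, seq_len, hidden)


# Random stand-in for the `feats` tensor returned by the model (sizes kept small).
num_layers, batch_size, seq_len, rnn_hidden_size = 3, 2, 8, 4
feats = torch.randn(num_layers, batch_size, seq_len, rnn_hidden_size)

top = last_layer(feats)
# All-zero logits give uniform weights, i.e. a plain average over layers.
mixed = weighted_layer_sum(feats, torch.zeros(num_layers))
```

In a downstream model, layer_logits could be a learnable parameter trained jointly with the task, while the APC weights stay frozen.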

Pre-trained models

We release the pre-trained models that were used to produce the numbers reported in the paper. load_pretrained_model.py provides a simple example of loading a pre-trained model.

Reference

Please cite our paper(s) if you find this repository useful. The first paper proposes the APC objective, while the second paper applies it to speech recognition, speech translation, and speaker identification, and provides a more systematic analysis of the learned representations. Cite both if you are kind enough!

@inproceedings{chung2019unsupervised,
  title = {An unsupervised autoregressive model for speech representation learning},
  author = {Chung, Yu-An and Hsu, Wei-Ning and Tang, Hao and Glass, James},
  booktitle = {Interspeech},
  year = {2019}
}
@inproceedings{chung2020generative,
  title = {Generative pre-training for speech with autoregressive predictive coding},
  author = {Chung, Yu-An and Glass, James},
  booktitle = {ICASSP},
  year = {2020}
}

Contact

Feel free to shoot me an email for any inquiries about the paper and this repository.

Comments
  • `blogmel` files are not available

    Hi, There's a problem with the feature files.

    After downloading train-clean-360.xz and dev-clean.xz via the Dropbox link in the README.md, I tried to unzip them.

    But xz -d just produced train-clean-360 (no file extensions), which is not readable with numpy or pickle.

    Can I get the feature files?

    Thank you :)

    opened by simon-rtzr 5
  • Replace kaldi feature with torchaudio fbank feature?

    Hello, I'm currently trying APC, but I'm not familiar with Kaldi. I found that torchaudio provides an fbank function that matches Kaldi's compute-fbank-feat. Is it possible to use fbank to create the same features? If so, how should I set the parameters?

    opened by toco2270853 2
  • Validation Loss

    Hi, do you have validation loss scores for all models (n=1,...n=20) on libri-valid-clean? I want to verify whether my results are correct. I created a test script and loaded your pre-trained models.

    I got these validation losses: n=1, 0.30188 n=3, 0.50266

    Thank you in advance.

    opened by gentaiscool 2
  • Will TransformerAPC and ASR test code be released?

    Hi Yu-An Chung, thank you for releasing your work on APC. I am doing research on music information retrieval and think your work may be helpful for us. I found that Transformer-APC is not included in the current version of the code, and neither is the test code for the ASR experiments in the paper GENERATIVE PRE-TRAINING FOR SPEECH WITH AUTOREGRESSIVE PREDICTIVE CODING. Do you plan to release them in the future? I hope to get your reply.

    opened by xinedison 2
  • Could you please share the preprocessing parameters such as window and hop lengths?

    Hi, thank you for sharing your code. Great work! I think preprocessing parameters such as hop length and window length are important for APC, and I also want to run experiments on other datasets. Could you please share them, or share a preprocessing script?

    opened by Aria-K-Alethia 1
  • Preprocessing script

    Hi, can you also share your preprocessing script to generate 80-dimensional log Mel spectrograms? Did you apply the same script to both Librispeech and WSJ? Thank you

    opened by gentaiscool 1
  • loss don't convergence

    I trained APC with my own data processing, but the loss does not seem to converge during training, and your dataset is too slow to download. What is the value of the loss when you stop training? Thanks.

    opened by hyx100e 1
  • Sample rate & license

    Hi! Thanks for the great lib and interesting paper.

    What sample rate did you use for loading audio, and is this important when computing log Mel spectrograms? Also, what is the model license?

    Thank you kindly :)

    opened by torphix 0
  • Any Plans to release code of T-APC?

    I find that T-APC (the Transformer-based version of APC in GENERATIVE PRE-TRAINING FOR SPEECH WITH AUTOREGRESSIVE PREDICTIVE CODING) is not currently included in this repository. Are there any plans to release the T-APC code? Thanks!

    opened by YuanxunLu 0
Owner
iamyuanchung
Natural language & speech processing researcher