Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

ASAPP Research

Last update: Dec 1, 2022

Related tags

Text Data & NLP sew

Overview

SEW (Squeezed and Efficient Wav2vec)

The repo contains the code of the paper "Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition" by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q Weinberger, and Yoav Artzi.

Model Checkpoints

Unsupervisedly Pre-trained on LibriSpeech 960h

Model	Pre-training updates	Dataset	Model
W2V2-tiny	100K	Librispeech 960h	download
W2V2-small	100K	Librispeech 960h	download
W2V2-mid	100K	Librispeech 960h	download
W2V2-base	100K	Librispeech 960h	download
SEW-tiny	100K	Librispeech 960h	download
SEW-small	100K	Librispeech 960h	download
SEW-mid	100K	Librispeech 960h	download
SEW-D-tiny	100K	Librispeech 960h	download
SEW-D-small	100K	Librispeech 960h	download
SEW-D-mid	100K	Librispeech 960h	download
SEW-D-mid (k127)	100K	Librispeech 960h	download
SEW-D-base	100K	Librispeech 960h	download
SEW-D-base+	100K	Librispeech 960h	download
SEW-D-mid	400K	Librispeech 960h	download
SEW-D-mid (k127)	400K	Librispeech 960h	download
SEW-D-base+	400K	Librispeech 960h	download

ASR model fine-tuned on LibriSpeech train-clean 100h

Model	Pre-training updates	Finetuning split	Model
SEW-tiny	100K	100h	download
SEW-D-tiny	100K	100h	download
SEW-D-mid	400K	100h	download
SEW-D-mid (k127)	400K	100h	download
SEW-D-base+	400K	100h	download

Usage

Dependencies

The code is tested with fairseq commit 05255f9, deberta commit bf17ca4 and the following packages.

torch==1.8.0
torchaudio==0.8.0
tqdm==4.49.0
Hydra==2.5
hydra-core==1.0.4
fvcore==0.1.5.post20210330
omegaconf==2.0.5
einops==0.3.0
fire==0.2.1

Apex

Please install NVIDIA's apex with

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

wav2letter decoder

Currently, we are decoding with wav2letter v0.2 python binding at commit 96f5f9d Please install the python binding here https://github.com/flashlight/wav2letter/tree/96f5f9d3b41e01af0a031ee0d2604acd9ef3b1b0/bindings/python The newest commit d5a93f0 in v0.2 branch leads to worse WER for wav2vec 2.0 baselines.

Installation

git clone https://github.com/asappresearch/sew.git
cd sew 
pip install -e .

Pre-training

Pre-training SEW models

Run the following command where $model_size can be tiny, small, or mid, and $ngpu is tne number of GPUs you want to use.

bash scripts/pt-sew.sh $model_size $ngpu

Pre-training SEW-D models

bash scripts/pt-sew-d.sh $model_size $ngpu

where $model_size can be tiny, small, mid, mid-k127, base, or base+.

Fine-tuning

Run the following script to fine-tune a model with the hyperparameters from wav2vec 2.0.

bash scripts/ft-model.sh $pre_trained_model $split $ngpu

where $pre_trained_model can be either a W2V2, SEW, or a SEW-D model checkpoint and $split can be 10m, 1h, 10h, or 100h.

Here we also provide a set of hyperparameters which sets all dropouts the same as the pre-training stage, and we found it to be more stable.

bash scripts/ft-model-stable.sh $pre_trained_model $split $ngpu

If you see out of GPU memory error, please scale down the dataset.max_tokens and scale up the optimization.update_freq in scripts/ft-model.sh. For example modifying these lines

  dataset.max_tokens=3200000 \
  optimization.update_freq="[$((8 / $ngpu))]" \

  dataset.max_tokens=1600000 \
  optimization.update_freq="[$((16 / $ngpu))]" \

which reduces the batch size and increases the gradient accumulation steps in order to use less GPU memory.

Evaluation

Please run this script to prepare the official LibriSpeech 4-gram language model.

bash scripts/prepare_librispeech_lm.sh $kenlm_build_bin

where $kenlm_build_bin is the folder that contains the KenLM build_binary executable file (e.g. /home/user/kenlm/build/bin).

Then run this script to evaluate a pre-trained ASR model

python tools/eval_w2v.py tunelm --subsets '["dev-clean", "dev-other", "test-clean", "test-other"]' --model $asr_checkpoint

You might also like...

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

2.2k Jan 9, 2023

PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

1k Dec 30, 2022

Simple Speech to Text, Text to Speech

Simple Speech to Text, Text to Speech 1. Download Repository Opsi 1 Download repository ini, extract di lokasi yang diinginkan Opsi 2 Jika sudah famil

5 Dec 28, 2021

Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

STEMM: Self-learning with Speech-Text Manifold Mixup for Speech Translation This is a PyTorch implementation for the ACL 2022 main conference paper ST

29 Oct 16, 2022

Unsupervised intent recognition

INTENT author: steeve LAQUITAINE description: deployment pattern: currently batch only Setup & run git clone https://github.com/slq0/intent.git bash

1 Apr 8, 2022

Implementaion of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation This is the implementaion of our paper: Bridging the

20 Dec 12, 2022

PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

109 Dec 2, 2022

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

1.5k Dec 5, 2022

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

1.4k Feb 17, 2021

Comments

8000 sample rate audio

Hello there,

I'm trying to train on 8000 Hz sample rate audio dataset. Is it enough to simply add task.sample_rate=8000 to the fairseq command or there are additional config changes that I should make?

I would much appreciate any advice

Thank you

opened by Mega4alik 0
How to train using not English Languages

Hi! Thank you for the awesome model!

We are very interested in your project and we try to use the sew for Japanese Language. When we train the model, should we use these scripts? Thanks! https://github.com/asappresearch/sew/tree/master/scripts

opened by jigenji 1
:bug: Fix padding mask calculation

This PR updates the padding mask calculation to be the same as the one in the reference Wav2Vec2 implementation (same commit as listed in SEW's README): https://github.com/pytorch/fairseq/blob/05255f96410e5b1eaf3bf59b767d5b4b7e2c3a35/fairseq/models/wav2vec/wav2vec2.py#L477

For more details on how and why it was fixed in fairseq, check out this PR by @patrickvonplaten https://github.com/pytorch/fairseq/pull/3228

opened by anton-l 0

Releases(v0.0.1)

v0.0.1(Sep 15, 2021)

First release.
Source code(tar.gz)
Source code(zip)

Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Related tags

Overview

SEW (Squeezed and Efficient Wav2vec)

Model Checkpoints

Unsupervisedly Pre-trained on LibriSpeech 960h

ASR model fine-tuned on LibriSpeech train-clean 100h

Usage

Dependencies

Apex

wav2letter decoder

Installation

Pre-training

Fine-tuning

Evaluation

You might also like...

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

Simple Speech to Text, Text to Speech

Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

Unsupervised intent recognition

Implementaion of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

Comments

8000 sample rate audio

How to train using not English Languages

:bug: Fix padding mask calculation

Releases(v0.0.1)

v0.0.1(Sep 15, 2021)

Owner

ASAPP Research

Must-read papers on improving efficiency for pre-trained language models.

Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and convert them into audio. Here I have used Google-text-to-speech library popularly known as gTTS library to convert text file to .mp3 file. Hope you like my project!

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Speech Recognition for Uyghur using Speech transformer

A Python module made to simplify the usage of Text To Speech and Speech Recognition.

Universal End2End Training Platform, including pre-training, classification tasks, machine translation, and etc.

Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.

One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

PyTorch Implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging" (Findings of ACL 2022)