A minimal Conformer ASR implementation adapted from ESPnet.

Niu Zhe

Last update: Jan 24, 2022


Conformer ASR


Introduction

I want to use the pre-trained English ASR model provided by ESPnet, but ESPnet is relatively heavy for my needs. This repo extracts only the Conformer ASR part from ESPnet so that it is easier to customize.

There are a bunch of ASR models listed here. I chose the one named:

kamo-naoyuki/librispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave

Its performance can be found [here](https://zenodo.org/record/4604066#.YbxsX5FByV4) and is reproduced below.

WER (word error rate)

dataset	Snt	Wrd	Corr	Sub	Del	Ins	Err	S.Err
decode_asr_asr_model_valid.acc.ave/dev_clean	2703	54402	97.9	1.9	0.2	0.2	2.3	28.6
decode_asr_asr_model_valid.acc.ave/dev_other	2864	50948	94.5	5.1	0.5	0.6	6.1	48.3
decode_asr_asr_model_valid.acc.ave/test_clean	2620	52576	97.7	2.1	0.2	0.3	2.6	31.4
decode_asr_asr_model_valid.acc.ave/test_other	2939	52343	94.7	4.9	0.5	0.7	6.0	49.0
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_clean	2703	54402	98.3	1.5	0.2	0.2	1.9	25.2
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_other	2864	50948	95.8	3.7	0.4	0.5	4.6	40.0
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_clean	2620	52576	98.1	1.7	0.2	0.3	2.1	26.2
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_other	2939	52343	95.8	3.7	0.5	0.5	4.7	42.4

CER (character error rate; the Wrd column counts characters)

dataset	Snt	Wrd	Corr	Sub	Del	Ins	Err	S.Err
decode_asr_asr_model_valid.acc.ave/dev_clean	2703	288456	99.4	0.3	0.2	0.2	0.8	28.6
decode_asr_asr_model_valid.acc.ave/dev_other	2864	265951	98.0	1.2	0.8	0.7	2.7	48.3
decode_asr_asr_model_valid.acc.ave/test_clean	2620	281530	99.4	0.3	0.3	0.3	0.9	31.4
decode_asr_asr_model_valid.acc.ave/test_other	2939	272758	98.2	1.0	0.7	0.7	2.5	49.0
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_clean	2703	288456	99.5	0.3	0.2	0.2	0.7	25.2
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_other	2864	265951	98.3	1.0	0.7	0.5	2.2	40.0
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_clean	2620	281530	99.5	0.3	0.3	0.2	0.7	26.2
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_other	2939	272758	98.5	0.8	0.7	0.5	2.1	42.4

TER (token error rate over BPE tokens; the Wrd column counts tokens)

dataset	Snt	Wrd	Corr	Sub	Del	Ins	Err	S.Err
decode_asr_asr_model_valid.acc.ave/dev_clean	2703	68010	97.5	1.9	0.7	0.4	2.9	28.6
decode_asr_asr_model_valid.acc.ave/dev_other	2864	63110	93.4	5.0	1.6	1.0	7.6	48.3
decode_asr_asr_model_valid.acc.ave/test_clean	2620	65818	97.2	2.0	0.8	0.4	3.3	31.4
decode_asr_asr_model_valid.acc.ave/test_other	2939	65101	93.7	4.5	1.8	0.9	7.2	49.0
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_clean	2703	68010	97.8	1.5	0.7	0.3	2.5	25.2
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_other	2864	63110	94.6	3.8	1.6	0.7	6.1	40.0
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_clean	2620	65818	97.6	1.6	0.8	0.3	2.7	26.2
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_other	2939	65101	94.7	3.5	1.8	0.7	6.0	42.4
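The columns in these score tables are related: Err is the sum of the substitution, deletion and insertion rates, and Corr is what remains after substitutions and deletions. Here is a minimal sketch that checks this against the no-LM rows of the first table above (word-level scores). The values are copied from the table; the helper names are only illustrative, and the tolerance is there because the table is rounded to one decimal.

```python
# (Corr, Sub, Del, Ins, Err) percentages copied from the first table (decoding without LM).
ROWS = {
    "dev_clean": (97.9, 1.9, 0.2, 0.2, 2.3),
    "dev_other": (94.5, 5.1, 0.5, 0.6, 6.1),
    "test_clean": (97.7, 2.1, 0.2, 0.3, 2.6),
    "test_other": (94.7, 4.9, 0.5, 0.7, 6.0),
}


def err_from_parts(sub, dele, ins):
    """Error rate = substitutions + deletions + insertions (all in percent)."""
    return sub + dele + ins


def max_rounding_gap(rows):
    """Largest difference between the reported Err and Sub + Del + Ins."""
    return max(
        abs(err_from_parts(sub, dele, ins) - err)
        for _corr, sub, dele, ins, err in rows.values()
    )


for name, (corr, sub, dele, ins, err) in ROWS.items():
    print(f"{name}: reported Err={err}, Sub+Del+Ins={err_from_parts(sub, dele, ins):.1f}")
```

Insertions are not bounded by the reference length, which is why Corr and Err need not sum to 100.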

ASR step by step

1. Setup code

pip install .

2. Download the model and unzip it

wget "https://zenodo.org/record/4604066/files/asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave.zip?download=1" -O conformer.zip
unzip conformer.zip
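If you prefer to script this step, the same download and unzip can be sketched with the Python standard library. This is an illustrative alternative, not part of this repo: `fetch_and_unpack` is a hypothetical helper name, and the URL is the one from the command above.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the wget command above.
MODEL_URL = (
    "https://zenodo.org/record/4604066/files/"
    "asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_"
    "scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave.zip"
    "?download=1"
)


def fetch_and_unpack(url, zip_path="conformer.zip", dest="."):
    """Download the archive (skipped if it already exists) and extract it.

    Hypothetical helper: returns the sorted list of paths inside the archive
    so you can see where the config, BPE model and checkpoint ended up.
    """
    zip_path = Path(zip_path)
    if not zip_path.exists():
        urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())


# Usage (downloads a large file, so it is left commented out):
# for name in fetch_and_unpack(MODEL_URL):
#     print(name)
```

Printing the extracted names is a quick way to confirm the paths used in the example of the next step.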

3. Run an example

import torch
import librosa
from mmds.utils.spectrogram import MelSpectrogram
from conformer_asr import Conformer, Tokenizer

sample_rate = 16000
cfg_path = "./exp_unnorm/asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000/config.yaml"
bpe_path = "./data/en_unnorm_token_list/bpe_unigram5000/bpe.model"
ckpt_path = "./exp_unnorm/asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000/valid.acc.ave_10best.pth"

tokenizer = Tokenizer(cfg_path, bpe_path)
conformer = Conformer(tokenizer, ckpt_path=ckpt_path)
conformer.eval()

spec_fn = MelSpectrogram(
    sample_rate,
    hop_length=256,
    f_min=0,
    f_max=8000,
    win_length=512,
    power=2,
)

w0, _ = librosa.load("./example.m4a", sr=sample_rate)
w0 = torch.from_numpy(w0)
m0 = spec_fn(w0).t()

num_frames = len(m0)

# Build a batch of inputs with different lengths (batched decoding supports this)
x = [m0, m0[: num_frames // 2], m0[: num_frames // 4]]

ref = "This is a test video for youtube-dl. For more information, contact [email protected]".lower()
hyps = conformer.decode(x, beam_width=20)

print("REF", ref)
for hyp in hyps:
    print("HYP", hyp.lower())
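To get a feel for the front-end settings used above (16 kHz audio, a 512-sample window, a 256-sample hop), the sketch below works out the frame timing. The frame count assumes a centered STFT, as in common implementations; whether `MelSpectrogram` from `mmds` pads this way is an assumption, and the function names are illustrative.

```python
def frame_shift_ms(sample_rate, hop_length):
    """Time between consecutive spectrogram frames, in milliseconds."""
    return 1000 * hop_length / sample_rate


def window_ms(sample_rate, win_length):
    """Duration covered by one analysis window, in milliseconds."""
    return 1000 * win_length / sample_rate


def count_frames(num_samples, hop_length):
    """Frame count for a centered STFT (assumed padding convention)."""
    return 1 + num_samples // hop_length


sample_rate, win_length, hop_length = 16000, 512, 256
print("frame shift (ms):", frame_shift_ms(sample_rate, hop_length))
print("window (ms):", window_ms(sample_rate, win_length))
print("frames in 1 s of audio:", count_frames(sample_rate, hop_length))
```

Under that assumption, halving or quartering `m0` along the time axis, as the batch above does, corresponds to keeping roughly the first half or quarter of the recording.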

Results

REF this is a test video for youtube-dl. for more information, contact [email protected]
HYP this is a test video for you do bl for more information -- contact the hih aging at the hihaging, not the
HYP this is a test for you d bl for more information
HYP this is a testim for you to
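To put a number on how far a hypothesis is from the reference, you can compute a word error rate with a word-level edit distance. The sketch below is a plain-Python illustration and not part of this repo: it splits on whitespace only, so punctuation handling differs from the scoring used for the tables above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists with unit costs."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def word_error_rate(ref, hyp):
    """(Sub + Del + Ins) / number of reference words."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


# Illustrative strings taken from the start of the first result above.
print(word_error_rate("this is a test video for youtube-dl",
                      "this is a test video for you do bl"))
```

Because insertions count as errors, the rate can exceed 1.0 when the hypothesis is much longer than the reference.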

Features

Supported

Batched decoding

Not supported yet

Transformer language model
Other checkpoints
