Conformer ASR
A minimal Conformer ASR implementation adapted from ESPnet.
Introduction
I want to use the pre-trained English ASR model provided by ESPnet, but ESPnet is relatively heavy for my needs. So here I extract only the Conformer ASR part from ESPnet, which makes customization easier.
ESPnet provides a bunch of pre-trained ASR models. I chose the one named:
kamo-naoyuki/librispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave
Its performance is reported [here](https://zenodo.org/record/4604066#.YbxsX5FByV4) and reproduced below.
- WER
dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
---|---|---|---|---|---|---|---|---|
decode_asr_asr_model_valid.acc.ave/dev_clean | 2703 | 54402 | 97.9 | 1.9 | 0.2 | 0.2 | 2.3 | 28.6 |
decode_asr_asr_model_valid.acc.ave/dev_other | 2864 | 50948 | 94.5 | 5.1 | 0.5 | 0.6 | 6.1 | 48.3 |
decode_asr_asr_model_valid.acc.ave/test_clean | 2620 | 52576 | 97.7 | 2.1 | 0.2 | 0.3 | 2.6 | 31.4 |
decode_asr_asr_model_valid.acc.ave/test_other | 2939 | 52343 | 94.7 | 4.9 | 0.5 | 0.7 | 6.0 | 49.0 |
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_clean | 2703 | 54402 | 98.3 | 1.5 | 0.2 | 0.2 | 1.9 | 25.2 |
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_other | 2864 | 50948 | 95.8 | 3.7 | 0.4 | 0.5 | 4.6 | 40.0 |
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_clean | 2620 | 52576 | 98.1 | 1.7 | 0.2 | 0.3 | 2.1 | 26.2 |
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_other | 2939 | 52343 | 95.8 | 3.7 | 0.5 | 0.5 | 4.7 | 42.4 |
- CER
dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
---|---|---|---|---|---|---|---|---|
decode_asr_asr_model_valid.acc.ave/dev_clean | 2703 | 288456 | 99.4 | 0.3 | 0.2 | 0.2 | 0.8 | 28.6 |
decode_asr_asr_model_valid.acc.ave/dev_other | 2864 | 265951 | 98.0 | 1.2 | 0.8 | 0.7 | 2.7 | 48.3 |
decode_asr_asr_model_valid.acc.ave/test_clean | 2620 | 281530 | 99.4 | 0.3 | 0.3 | 0.3 | 0.9 | 31.4 |
decode_asr_asr_model_valid.acc.ave/test_other | 2939 | 272758 | 98.2 | 1.0 | 0.7 | 0.7 | 2.5 | 49.0 |
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_clean | 2703 | 288456 | 99.5 | 0.3 | 0.2 | 0.2 | 0.7 | 25.2 |
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_other | 2864 | 265951 | 98.3 | 1.0 | 0.7 | 0.5 | 2.2 | 40.0 |
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_clean | 2620 | 281530 | 99.5 | 0.3 | 0.3 | 0.2 | 0.7 | 26.2 |
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_other | 2939 | 272758 | 98.5 | 0.8 | 0.7 | 0.5 | 2.1 | 42.4 |
- TER
dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
---|---|---|---|---|---|---|---|---|
decode_asr_asr_model_valid.acc.ave/dev_clean | 2703 | 68010 | 97.5 | 1.9 | 0.7 | 0.4 | 2.9 | 28.6 |
decode_asr_asr_model_valid.acc.ave/dev_other | 2864 | 63110 | 93.4 | 5.0 | 1.6 | 1.0 | 7.6 | 48.3 |
decode_asr_asr_model_valid.acc.ave/test_clean | 2620 | 65818 | 97.2 | 2.0 | 0.8 | 0.4 | 3.3 | 31.4 |
decode_asr_asr_model_valid.acc.ave/test_other | 2939 | 65101 | 93.7 | 4.5 | 1.8 | 0.9 | 7.2 | 49.0 |
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_clean | 2703 | 68010 | 97.8 | 1.5 | 0.7 | 0.3 | 2.5 | 25.2 |
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_other | 2864 | 63110 | 94.6 | 3.8 | 1.6 | 0.7 | 6.1 | 40.0 |
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_clean | 2620 | 65818 | 97.6 | 1.6 | 0.8 | 0.3 | 2.7 | 26.2 |
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_other | 2939 | 65101 | 94.7 | 3.5 | 1.8 | 0.7 | 6.0 | 42.4 |
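In the tables above, the Err column matches Sub + Del + Ins up to rounding (for example 1.9 + 0.2 + 0.2 = 2.3 in the first WER row), each expressed as a percentage of the reference tokens. The sketch below shows one way to obtain such counts from a minimum-edit-distance alignment. It is only an illustration: `edit_counts` and `error_rate` are hypothetical helper names, not part of this repository or ESPnet, and a real scoring tool may break ties between substitutions, deletions, and insertions differently.

```python
def edit_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-cost
    alignment between two token sequences."""
    # dp[i][j] holds (cost, sub, del, ins) for ref[:i] versus hyp[:j].
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            c, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, s, d, n)          # correct token
            else:
                diag = (c + 1, s + 1, d, n)  # substitution
            c, s, d, n = dp[i - 1][j]
            delete = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insert = (c + 1, s, d, n + 1)
            # Tuples compare by cost first, so the cheapest path wins.
            dp[i][j] = min(diag, delete, insert)
    _, s, d, n = dp[-1][-1]
    return s, d, n


def error_rate(ref, hyp):
    """Error rate in percent: (Sub + Del + Ins) / number of reference tokens."""
    s, d, n = edit_counts(ref, hyp)
    return 100.0 * (s + d + n) / len(ref)


ref_words = "this is a test video".split()
hyp_words = "this is a test for you".split()
print(edit_counts(ref_words, hyp_words))  # (1, 0, 1)
print(error_rate(ref_words, hyp_words))   # 40.0
```

Passing word lists gives a WER-style number; passing `list(text)` instead gives a CER-style number over characters.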
ASR step by step
1. Setup code
pip install .
2. Download the model and unzip it
wget "https://zenodo.org/record/4604066/files/asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave.zip?download=1" -O conformer.zip
unzip conformer.zip
3. Run an example
import torch
import librosa
from mmds.utils.spectrogram import MelSpectrogram
from conformer_asr import Conformer, Tokenizer
sample_rate = 16000
cfg_path = "./exp_unnorm/asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000/config.yaml"
bpe_path = "./data/en_unnorm_token_list/bpe_unigram5000/bpe.model"
ckpt_path = "./exp_unnorm/asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000/valid.acc.ave_10best.pth"
tokenizer = Tokenizer(cfg_path, bpe_path)
conformer = Conformer(tokenizer, ckpt_path=ckpt_path)
conformer.eval()
spec_fn = MelSpectrogram(
    sample_rate,
    hop_length=256,
    f_min=0,
    f_max=8000,
    win_length=512,
    power=2,
)
w0, _ = librosa.load("./example.m4a", sr=sample_rate)
w0 = torch.from_numpy(w0)
m0 = spec_fn(w0).t()
l = len(m0)
# create batch with different length audio (yes, supported)
x = [m0, m0[: l // 2], m0[: l // 4]]
ref = "This is a test video for youtube-dl. For more information, contact [email protected]".lower()
hyps = conformer.decode(x, beam_width=20)
print("REF", ref)
for hyp in hyps:
    print("HYP", hyp.lower())
- Results
REF this is a test video for youtube-dl. for more information, contact [email protected]
HYP this is a test video for you do bl for more information -- contact the hih aging at the hihaging, not the
HYP this is a test for you d bl for more information
HYP this is a testim for you to
Features
Supported
- Batched decoding
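Batched decoding of utterances with different lengths is commonly done by padding every sequence to the longest one and keeping the true lengths (or a mask) so padded frames can be ignored. The sketch below illustrates that idea in plain Python. It is an assumption about the general technique, not a description of how this repository batches internally, and `pad_batch` is a hypothetical helper name. With PyTorch tensors, `torch.nn.utils.rnn.pad_sequence` does the padding step.

```python
def pad_batch(seqs, pad_value=0.0):
    """Pad variable-length sequences of feature frames to a common length.

    seqs: list of sequences; each sequence is a list of frames and each
          frame is a list of floats with the same feature size.
    Returns (padded, lengths, mask), where mask[b][t] is True for real frames.
    """
    lengths = [len(s) for s in seqs]
    max_len = max(lengths)
    n_feats = len(seqs[0][0])
    padded = [
        # Copy the real frames, then append all-pad frames up to max_len.
        list(s) + [[pad_value] * n_feats for _ in range(max_len - len(s))]
        for s in seqs
    ]
    mask = [[t < n for t in range(max_len)] for n in lengths]
    return padded, lengths, mask


# Toy stand-in for a spectrogram: 8 frames with 2 features each.
m0 = [[float(t), float(t) + 0.5] for t in range(8)]
batch = [m0, m0[: len(m0) // 2], m0[: len(m0) // 4]]
padded, lengths, mask = pad_batch(batch)
print(lengths)                  # [8, 4, 2]
print([len(p) for p in padded])  # [8, 8, 8]
```

The mask (or the lengths) would then be passed along with the padded batch so that attention and scoring skip the padded frames.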
Not supported yet
- Transformer language model
- Other checkpoints