Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

Overview

Official implementation of the Efficient Conformer, a progressively downsampled Conformer with grouped attention for Automatic Speech Recognition.

Efficient Conformer Encoder

Inspired by previous work in Automatic Speech Recognition and Computer Vision, the Efficient Conformer encoder is composed of three encoder stages, each comprising a number of Conformer blocks that use grouped attention. The encoded sequence is progressively downsampled and projected to wider feature dimensions, lowering the amount of computation while achieving better performance. Grouped multi-head attention reduces attention complexity by grouping neighbouring time elements along the feature dimension before applying scaled dot-product attention.
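
A minimal PyTorch sketch of this grouping idea (relative positional encoding, masking and the repository's exact projections are omitted; class name and dimensions are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedMultiHeadAttention(nn.Module):
    """Group `group_size` neighbouring frames along the feature dimension,
    apply scaled dot-product attention on the shorter sequence, then ungroup."""

    def __init__(self, dim, num_heads, group_size):
        super().__init__()
        self.group_size = group_size
        self.mha = nn.MultiheadAttention(dim * group_size, num_heads, batch_first=True)

    def forward(self, x):                              # x: (batch, time, dim)
        b, t, d = x.shape
        g = self.group_size
        pad = (g - t % g) % g                          # pad the time axis to a multiple of g
        if pad:
            x = F.pad(x, (0, 0, 0, pad))
        xg = x.reshape(b, (t + pad) // g, d * g)       # (batch, time/g, dim*g)
        out, _ = self.mha(xg, xg, xg)                  # attention over a g-times shorter sequence
        return out.reshape(b, t + pad, d)[:, :t]       # ungroup and drop the padding

attn = GroupedMultiHeadAttention(dim=144, num_heads=4, group_size=3)
print(attn(torch.randn(2, 100, 144)).shape)            # torch.Size([2, 100, 144])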

Installation

Clone GitHub repository and set up environment

git clone https://github.com/burchim/EfficientConformer.git
cd EfficientConformer
pip install -r requirements.txt

Install ctcdecode
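
ctcdecode is built from source; a typical install, assuming the parlance/ctcdecode package is the one used for CTC beam search decoding:

git clone --recursive https://github.com/parlance/ctcdecode.git
cd ctcdecode && pip install .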

Download LibriSpeech

LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project and has been carefully segmented and aligned.

cd datasets
./download_LibriSpeech.sh

Running an experiment

You can run an experiment by providing a config file with the '--config_file' flag. Training checkpoints and logs will be saved in the callback folder specified in the config file. Note that the '--prepare_dataset' and '--create_tokenizer' flags may be needed for your first experiment.

python main.py --config_file configs/config_file.json
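
For example, a first run that also prepares the dataset and creates the tokenizer (using the default config shipped with the repository):

python main.py --config_file configs/EfficientConformerCTCSmall.json --prepare_dataset --create_tokenizer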

Evaluation

Models can be evaluated by selecting a validation/test mode and providing the epoch/name of the checkpoint to load with the '--initial_epoch' flag. The '--gready' flag selects greedy search instead of beam search decoding for evaluation.

python main.py --config_file configs/config_file.json --initial_epoch epoch/name --mode validation/test --gready
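
For example, assuming a checkpoint saved at epoch 100 (the epoch number is illustrative), a greedy evaluation on test-clean, and a beam-search evaluation obtained by simply omitting the '--gready' flag:

python main.py --config_file configs/EfficientConformerCTCSmall.json --initial_epoch 100 --mode test-clean --gready
python main.py --config_file configs/EfficientConformerCTCSmall.json --initial_epoch 100 --mode test-clean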

Options

-c / --config_file		type=str   default="configs/EfficientConformerCTCSmall.json"	help="Json configuration file containing model hyperparameters"
-m / --mode                	type=str   default="training"                               	help="Mode : training, validation-clean, test-clean, eval_time-dev-clean, ..."
-d / --distributed         	action="store_true"                                            	help="Distributed data parallelization"
-i / --initial_epoch  		type=str   default=None                                       	help="Load model from checkpoint"
--initial_epoch_lm         	type=str   default=None                                       	help="Load language model from checkpoint"
--initial_epoch_encoder    	type=str   default=None                                       	help="Load model encoder from encoder checkpoint"
-p / --prepare_dataset		action="store_true"                                            	help="Prepare dataset before training"
-j / --num_workers        	type=int   default=8                                          	help="Number of data loading workers"
--create_tokenizer         	action="store_true"                                            	help="Create model tokenizer"
--batch_size_eval      		type=int   default=8                                          	help="Evaluation batch size"
--verbose_val              	action="store_true"                                            	help="Evaluation verbose"
--val_steps                	type=int   default=None                                       	help="Number of validation steps"
--steps_per_epoch      		type=int   default=None                                       	help="Number of steps per epoch"
--world_size               	type=int   default=torch.cuda.device_count()                  	help="Number of available GPUs"
--cpu                      	action="store_true"                                            	help="Load model on cpu"
--show_dict            		action="store_true"                                            	help="Show model dict summary"
--swa                      	action="store_true"                                            	help="Stochastic weight averaging"
--swa_epochs               	nargs="+"   default=None                                       	help="Start epoch / end epoch for swa"
--swa_epochs_list      		nargs="+"   default=None                                       	help="List of checkpoints epochs for swa"
--swa_type                   	type=str   default="equal"                                    	help="Stochastic weight averaging type (equal/exp)"
--parallel                   	action="store_true"                                            	help="Parallelize model using data parallelization"
--rnnt_max_consec_dec_steps  	type=int   default=None                                       	help="Number of maximum consecutive transducer decoder steps during inference"
--eval_loss                  	action="store_true"                                            	help="Compute evaluation loss during evaluation"
--gready                     	action="store_true"                                            	help="Proceed to a gready search evaluation"
--saving_period              	type=int   default=1                                          	help="Model saving every 'n' epochs"
--val_period                 	type=int   default=1                                          	help="Model validation every 'n' epochs"
--profiler                   	action="store_true"                                            	help="Enable eval time profiler"
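
Flags can be combined. For instance, a distributed training run with more data-loading workers and less frequent checkpointing/validation might look like this (values are illustrative):

python main.py --config_file configs/EfficientConformerCTCSmall.json --distributed --num_workers 16 --saving_period 5 --val_period 5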

Monitor training

tensorboard --logdir callback_path

LibriSpeech Performance

Model                 Size     Type   Params (M)   greedy WER test-clean/test-other (%)   n-gram WER test-clean/test-other (%)   GPUs
Efficient Conformer   Small    CTC    13.2         3.6 / 9.0                              2.7 / 6.7                              4 x RTX 2080 Ti
Efficient Conformer   Medium   CTC    31.5         3.0 / 7.6                              2.4 / 5.8                              4 x RTX 2080 Ti
Efficient Conformer   Large    CTC    125.6        2.5 / 5.8                              2.1 / 4.7                              4 x RTX 3090

Reference

Maxime Burchi, Valentin Vielzeuf. Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition. IEEE ASRU 2021. arXiv:2109.01163.

Author

Maxime Burchi

Comments
  • mean loss inf - batch loss: inf

    During training, I have around 9,000+ batches, each containing 32 audio segments. After running the training for 1 epoch, the mean loss and batch loss haven't changed from inf. Can you give some idea of where the problem might be?

    opened by debasish-mihup 9
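
    Not necessarily the cause of this particular run, but a minimal sketch of one common source of an infinite CTC loss, a target sequence longer than the downsampled encoder output, and of PyTorch's zero_infinity option (shapes are illustrative, not this repository's code):

    import torch

    # The CTC loss is inf when the target is longer than the (downsampled) input.
    ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)   # zero_infinity masks inf losses

    log_probs = torch.randn(50, 4, 1000).log_softmax(-1)  # (time, batch, vocab)
    targets = torch.randint(1, 1000, (4, 60))             # target length 60 > 50 input frames
    input_lengths = torch.full((4,), 50)
    target_lengths = torch.full((4,), 60)

    print(ctc(log_probs, targets, input_lengths, target_lengths))  # tensor(0.) here; inf without zero_infinity
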
  • Beam Search Decoding giving poor accuracy compared to Gready Search Decoding

    Hello,

    I have trained the EfficientConformerTransducerSmall model using my own data. I used the SentencePiece tokenizer, also trained on my data. While decoding with the best model, greedy search decoding gives around 35% WER, but beam search decoding gives a higher WER (around 40%) with beam sizes 5 and 16. With beam size 1 the WER is close to 35%, as expected. I ran the experiment with and without an LM, but the behaviour is the same (i.e. the WER increases with beam sizes higher than 1). Can you please explain why the WER increases for beam search decoding with a beam size higher than 1?

    opened by Gouranga95 7
  • The loss becomes NaN after some training epochs

    When I train ConformerCTC with the same code, the loss becomes NaN after some training epochs. After looking into it, I found that the loss becomes NaN for a few batches, which causes the accumulated loss to become NaN too. So far, I have tried the following:

    1. Applied gradient clipping (both by norm and by value)
    2. Decreased the learning rate.
    3. Applied weight clipping, to check whether any weights are becoming too large.

    None of it has worked. Can you please advise on this? Thanks!

    opened by harisgulzar1 5
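
    For context, a minimal sketch of the kind of safeguards discussed above: skipping batches whose loss is non-finite and clipping gradient norms (names are illustrative; model(batch) is assumed to return a scalar loss):

    import torch

    def training_step(model, batch, optimizer, clip_norm=5.0):
        optimizer.zero_grad()
        loss = model(batch)                    # assumed to return a scalar loss
        if not torch.isfinite(loss):           # skip NaN/inf batches entirely
            return None
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()
        return loss.item()
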
  • Building ngram model from my own corpus

    Hello,

    Thanks for sharing your n-gram model; it works fine with the LibriSpeech corpus in my test. Now I want to take it one step further and train your model on my own corpus, so I'd like to train my own n-gram as well. I'm using the NeMo script, but it doesn't seem to work. Here is the command I'm using.

    python train_kenlm.py --nemo_model_file stt_en_conformer_ctc_medium.nemo --train_file MyCorpus.txt --kenlm_bin_path ../../../../kenlm/build/bin --kenlm_model_file ./isd.arpa --ngram_length 6

    Here stt_en_conformer_ctc_medium.nemo is pretrained and downloaded with NeMo, and MyCorpus.txt is the collection of ground-truth text. I didn't encode it, as I assume the script does that.

    What am I missing?

    thanks! David

    opened by davyuan 5
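
    For reference, an n-gram can also be built directly with the KenLM binaries, independently of NeMo (paths reuse the ones from the command above; the corpus is plain text, one utterance per line):

    ../../../../kenlm/build/bin/lmplz -o 6 < MyCorpus.txt > isd.arpa
    ../../../../kenlm/build/bin/build_binary isd.arpa isd.binary
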
  • Link to Pretrained Original-Conformer Models

    Thanks for the amazing work. I want to test the original Conformer CTC with this code. If you have placed the pre-trained models on Google Drive, can you please share the link, as you did for the Efficient Conformer models on the Colab page? Thanks.

    opened by harisgulzar1 2
  • Could you share your pretrained ngram LM?

    Hello,

    I'm following the link below to train my 6-gram LM for decoding. I use the downloaded LibriSpeech corpus and NeMo's Conformer CTC medium model to train it. However, I'm not seeing any improvement in WER compared to greedy search; the results actually became worse.

    https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html

    If you could share your detailed steps for training the 6-gram LM, or your pretrained model, it would be most helpful!

    David

    opened by davyuan 2
  • Demo Notebook - failing

    The original notebook fails while running the imports cell:

    import json
    import glob
    import torch
    import torchaudio
    import IPython.display as ipd
    from functions import create_model
    import matplotlib.pyplot as plt
    
    
    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-10-1e4ca53c14dc> in <module>()
          4 import torchaudio
          5 import IPython.display as ipd
    ----> 6 from functions import create_model
          7 import matplotlib.pyplot as plt
    
    4 frames
    /usr/local/lib/python3.7/dist-packages/warp_rnnt/__init__.py in <module>()
          1 import torch
    ----> 2 import warp_rnnt._C as core
          3 from typing import Optional, AnyStr
          4 from pkg_resources import get_distribution
          5 
    
    ImportError: /usr/local/lib/python3.7/dist-packages/warp_rnnt/_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor8data_ptrIfEEPT_v
    
    opened by TzurV 2
  • Memory leak during CPU inference

    I have trained an Efficient Conformer Transducer, and during inference on a CPU in a Flask-based web app I see a memory leak at

    https://github.com/burchim/EfficientConformer/blob/2f59ed2564c45c5978742fe1984cd48363064c88/models/encoders.py#L128

    x, attention, hidden = block(x, mask)

    The memory used at the above line during an inference is never released, eventually causing an OOM. The memory growth per iteration is reduced, but not eliminated, if I use jemalloc.

    opened by debasish-mihup 2
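
    Not a confirmed fix for this particular leak, but a common source of unbounded memory growth in serving code is running the forward pass with autograd enabled; a minimal sketch of an inference wrapper (names are placeholders):

    import torch

    @torch.inference_mode()            # or torch.no_grad(); avoids retaining the autograd graph
    def transcribe(model, features):
        model.eval()
        return model(features)
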
  • LM model required when beam size greater than 1?

    When I tried running your inference script in non-greedy mode (beam size greater than 1), it gave me an error because my LM model was not present. Is the LM a compulsory requirement?

    opened by debasish-mihup 2
  • Questions about variational noise.

    Hi Maxime @burchim, thanks for the released code. I recently looked into your implementation of the RNN-Transducer and Conformer and have these questions:

    1. You apply the variational noise after 20k training steps. Is there a reference for this choice, or is it a setting tuned in your experiments?
    2. Is broadcasting the variational noise across all processes needed for distributed data parallel training?

    Thanks! Huahuan.

    opened by maxwellzh 2
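
    For readers unfamiliar with the technique, a minimal sketch of variational weight noise: Gaussian noise is added to the weights for the forward/backward pass once training has passed a warm-up step count, then removed before the optimizer update (the step threshold and std are placeholders, not the repository's exact values):

    import torch

    def noisy_training_step(model, batch, optimizer, step, start_step=20000, std=0.075):
        noises = []
        if step >= start_step:
            with torch.no_grad():
                for p in model.parameters():
                    noise = torch.randn_like(p) * std
                    p.add_(noise)              # perturb the weights for this step
                    noises.append(noise)
        loss = model(batch)                    # assumed to return a scalar loss
        optimizer.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, noise in zip(model.parameters(), noises):
                p.sub_(noise)                  # restore the clean weights before updating
        optimizer.step()
        return loss.item()
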
  • The LM model expected is at word level or at token level?

    I wanted to confirm whether the LM is expected to be at the word level or the token level. A KenLM model is usually trained at the word level; in our case we are using a tokenizer (vocabulary size 1000), so should I train it at the token level or the word level?

    opened by kafan1986 1
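
    If a token-level n-gram is wanted, one option is to encode the corpus with the same SentencePiece model before running KenLM; a minimal sketch (file names are placeholders):

    import sentencepiece as spm

    # Encode the raw corpus into subword tokens so the n-gram operates at the token level.
    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
    with open("corpus.txt") as fin, open("corpus_tokens.txt", "w") as fout:
        for line in fin:
            fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
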
Owner
Maxime Burchi
Master of Engineering in Computer Science, ESIEE Paris