Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

Overview

Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented.
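
Espresso's look-ahead word-based LM fusion is more involved, but the simplest of these fusion schemes, shallow fusion, can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not Espresso's decoder API; the function name and the fixed weight are assumptions:

import torch

def fuse_scores(asr_logprobs, lm_logprobs, lm_weight=0.5):
    # Log-linear interpolation of next-token log-probabilities from the
    # end-to-end ASR model and an external language model; beam search
    # then expands hypotheses using the fused scores.
    return asr_logprobs + lm_weight * lm_logprobs

# toy usage: scores over a 5-token vocabulary for a beam of 2 hypotheses
asr = torch.log_softmax(torch.randn(2, 5), dim=-1)
lm = torch.log_softmax(torch.randn(2, 5), dim=-1)
fused = fuse_scores(asr, lm)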

We provide state-of-the-art training recipes for the following speech datasets: WSJ, LibriSpeech, and Switchboard.

What's New:

  • April 2021: On-the-fly feature extraction from raw waveforms with torchaudio is now supported. A LibriSpeech recipe is released here, with no dependency on Kaldi, using YAML files (via Hydra) to configure experiments (see the feature-extraction sketch after this list).
  • June 2020: Transformer recipes released.
  • April 2020: Both E2E LF-MMI (using PyChain) and Cross-Entropy training for hybrid ASR are now supported. WSJ recipes are provided here and here as examples, respectively.
  • March 2020: SpecAugment is supported and relevant recipes are released.
  • September 2019: We are working on isolating Espresso from fairseq, with the goal of a standalone package that can be directly pip installed.
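
A minimal sketch of what on-the-fly feature extraction with torchaudio looks like (the wav filename is hypothetical, and the recipe wires this into its data loading rather than calling it by hand like this):

import torchaudio
import torchaudio.compliance.kaldi as kaldi

# load a raw waveform and compute Kaldi-compatible 80-dim log-Mel
# filterbank features on the fly, instead of reading precomputed features
waveform, sample_rate = torchaudio.load("utt1.wav")  # hypothetical file
feats = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sample_rate)
print(feats.shape)  # (num_frames, 80)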

Requirements and Installation

  • PyTorch version >= 1.5.0
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • To install Espresso from source and develop locally:
git clone https://github.com/freewym/espresso
cd espresso
pip install --editable .

# on macOS:
# CFLAGS="-stdlib=libc++" pip install --editable ./
pip install kaldi_io sentencepiece soundfile
cd espresso/tools; make KALDI=<path/to/a/compiled/kaldi/directory>

Add the path of your Python installation to the PATH variable in examples/asr_<dataset>/path.sh; the current default is ~/anaconda3/bin.

kaldi_io is required for reading Kaldi scp files. sentencepiece is required for training/encoding subword pieces. soundfile is required for reading raw waveform files. Kaldi is required for data preparation, feature extraction, and scoring for some datasets (e.g., Switchboard), as well as for decoding for all hybrid systems.
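
For illustration, typical standalone uses of these packages look like the following (all paths and the model name are hypothetical):

import kaldi_io
import sentencepiece as spm

# read Kaldi feature matrices referenced by an scp file
for utt_id, mat in kaldi_io.read_mat_scp("data/train/feats.scp"):
    print(utt_id, mat.shape)  # numpy array of shape (num_frames, feat_dim)
    break

# encode text into subword pieces with a trained SentencePiece model
sp = spm.SentencePieceProcessor()
sp.Load("data/lang/train_unigram1000.model")
print(sp.EncodeAsPieces("HELLO WORLD"))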

  • If you want to use PyChain for LF-MMI training, you also need to install PyChain (and OpenFst):

Edit the PYTHON_DIR variable in espresso/tools/Makefile (default: ~/anaconda3/bin), and then:

cd espresso/tools; make openfst pychain
  • For faster training, install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

License

Espresso is MIT-licensed.

Citation

Please cite Espresso as:

@inproceedings{wang2019espresso,
  title = {Espresso: A Fast End-to-end Neural Speech Recognition Toolkit},
  author = {Yiming Wang and Tongfei Chen and Hainan Xu 
            and Shuoyang Ding and Hang Lv and Yiwen Shao 
            and Nanyun Peng and Lei Xie and Shinji Watanabe 
            and Sanjeev Khudanpur},
  booktitle = {2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year = {2019},
}
Comments
  • Non-ASCII characters in sample PRD and REF

    Hi, during training with the swbd recipe, the log on screen shows:

    | sample PRD: \xe2\x96\x81maybe\xe2\x96\x81c' | sample REF: b'\xe2\x96\x81[vocalized-noise]'

    Also, the WER on the swbd val set is very large. Is this normal? Thanks in advance.
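
    (For context: those escape sequences are the raw UTF-8 bytes of SentencePiece's word-boundary marker '▁' (U+2581), shown as a Python byte string instead of decoded text. For example:

    b'\xe2\x96\x81maybe'.decode('utf-8')  # -> '▁maybe'

    So the PRD/REF strings themselves are not necessarily wrong, just undecoded.)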

    opened by Shujian2015 43
  • Which recipes involve multi-level LM training and decoding? Can word + subword be used for multi-level decoding?

    What is your question?

    As stated in the subject:

    1. I want to know which recipes involve multi-level LM training and decoding.
    2. Can we use word + subword for multi-level decoding? If so, how?

    What have you tried?

    I have read the LibriSpeech and WSJ recipes, but could not find a clear way to enable multi-level (word + subword) decoding in the LSTM ASR model.

    What's your environment?

    • fairseq Version (e.g., 1.0 or master):
    • PyTorch Version : 1.4.0
    • OS (e.g., Linux): Centos7
    • How you installed fairseq (pip, source): pip
    • Python version: 3.7
    • CUDA/cuDNN version: 10.0
    question 
    opened by PhenixCFLi 30
  • SWBD Recipe Error

    Hi, I am trying to run the SWBD recipe on my local machine. I am getting errors at Stage 2 of the run script, which builds the dictionary and tokenizes the text. The error seems to come from the "tokenizing text for train/valid/test sets..." stage, which runs spm_encode.py.

    Code

    This is the full shell output:

    sentencepiece_trainer.cc(116) LOG(INFO) Running command: --bos_id=-1 --pad_id=0 --eos_id=1 --unk_id=2 --input=data/lang/input --vocab_size=1003 --character_coverage=1.0 --model_type=unigram --model_prefix=data/lang/train_nodup_unigram1000 --input_sentence_size=10000000 --user_defined_symbols=[laughter],[noise],[vocalized-noise]
    sentencepiece_trainer.cc(49) LOG(INFO) Starts training with :
    TrainerSpec {
      input: data/lang/input
      input_format:
      model_prefix: data/lang/train_nodup_unigram1000
      model_type: UNIGRAM
      vocab_size: 1003
      self_test_sample_size: 0
      character_coverage: 1
      input_sentence_size: 10000000
      shuffle_input_sentence: 1
      seed_sentencepiece_size: 1000000
      shrinking_factor: 0.75
      max_sentence_length: 4192
      num_threads: 16
      num_sub_iterations: 2
      max_sentencepiece_length: 16
      split_by_unicode_script: 1
      split_by_number: 1
      split_by_whitespace: 1
      treat_whitespace_as_suffix: 0
      user_defined_symbols: [laughter]
      user_defined_symbols: [noise]
      user_defined_symbols: [vocalized-noise]
      hard_vocab_limit: 1
      use_all_vocab: 0
      unk_id: 2
      bos_id: -1
      eos_id: 1
      pad_id: 0
      unk_piece: <unk>
      bos_piece: <s>
      eos_piece: </s>
      pad_piece: <pad>
      unk_surface:  ⁇
    }
    NormalizerSpec {
      name: nmt_nfkc
      add_dummy_prefix: 1
      remove_extra_whitespaces: 1
      escape_whitespaces: 1
      normalization_rule_tsv:
    }
    
    trainer_interface.cc(267) LOG(INFO) Loading corpus: data/lang/input
    trainer_interface.cc(139) LOG(INFO) Loaded 1000000 lines
    trainer_interface.cc(139) LOG(INFO) Loaded 2000000 lines
    trainer_interface.cc(114) LOG(WARNING) Too many sentences are loaded! (2416025), which may slow down training.
    trainer_interface.cc(116) LOG(WARNING) Consider using --input_sentence_size=<size> and --shuffle_input_sentence=true.
    trainer_interface.cc(119) LOG(WARNING) They allow to randomly sample <size> sentences from the entire corpus.
    trainer_interface.cc(315) LOG(INFO) Loaded all 2416025 sentences
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <pad>
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: </s>
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <unk>
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: [laughter]
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: [noise]
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: [vocalized-noise]
    trainer_interface.cc(335) LOG(INFO) Normalizing sentences...
    trainer_interface.cc(384) LOG(INFO) all chars count=120465092
    trainer_interface.cc(392) LOG(INFO) Done: 100% characters are covered.
    trainer_interface.cc(402) LOG(INFO) Alphabet size=43
    trainer_interface.cc(403) LOG(INFO) Final character coverage=1
    trainer_interface.cc(435) LOG(INFO) Done! preprocessed 2416025 sentences.
    unigram_model_trainer.cc(129) LOG(INFO) Making suffix array...
    unigram_model_trainer.cc(133) LOG(INFO) Extracting frequent sub strings...
    unigram_model_trainer.cc(184) LOG(INFO) Initialized 166028 seed sentencepieces
    trainer_interface.cc(441) LOG(INFO) Tokenizing input sentences with whitespace: 2416025
    trainer_interface.cc(451) LOG(INFO) Done! 69957
    unigram_model_trainer.cc(470) LOG(INFO) Using 69957 sentences for EM training
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=59852 obj=9.23769 num_tokens=130093 num_tokens/piece=2.17358
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=44412 obj=7.29956 num_tokens=132354 num_tokens/piece=2.98014
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=33308 obj=7.24442 num_tokens=141637 num_tokens/piece=4.25234
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=33303 obj=7.23651 num_tokens=141660 num_tokens/piece=4.25367
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=24977 obj=7.21871 num_tokens=158375 num_tokens/piece=6.34083
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=24977 obj=7.21644 num_tokens=158399 num_tokens/piece=6.34179
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=18732 obj=7.21162 num_tokens=175442 num_tokens/piece=9.3659
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=18732 obj=7.20821 num_tokens=175404 num_tokens/piece=9.36387
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=14049 obj=7.21798 num_tokens=192101 num_tokens/piece=13.6736
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=14049 obj=7.21295 num_tokens=192059 num_tokens/piece=13.6707
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=10536 obj=7.23918 num_tokens=207654 num_tokens/piece=19.709
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=10536 obj=7.23244 num_tokens=207609 num_tokens/piece=19.7047
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=7902 obj=7.27241 num_tokens=221580 num_tokens/piece=28.041
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=7902 obj=7.26387 num_tokens=221484 num_tokens/piece=28.0289
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=5926 obj=7.32839 num_tokens=234743 num_tokens/piece=39.6124
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=5926 obj=7.31716 num_tokens=234693 num_tokens/piece=39.6039
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=4444 obj=7.40817 num_tokens=248571 num_tokens/piece=55.9341
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=4444 obj=7.39317 num_tokens=248418 num_tokens/piece=55.8996
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=3333 obj=7.50897 num_tokens=262750 num_tokens/piece=78.8329
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=3333 obj=7.49001 num_tokens=262534 num_tokens/piece=78.7681
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=2499 obj=7.64161 num_tokens=276859 num_tokens/piece=110.788
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=2499 obj=7.61733 num_tokens=276640 num_tokens/piece=110.7
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=1874 obj=7.80273 num_tokens=292799 num_tokens/piece=156.243
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=1874 obj=7.77333 num_tokens=292543 num_tokens/piece=156.106
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=1405 obj=7.99379 num_tokens=309225 num_tokens/piece=220.089
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=1405 obj=7.95503 num_tokens=308821 num_tokens/piece=219.801
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=1103 obj=8.15973 num_tokens=321388 num_tokens/piece=291.376
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=1103 obj=8.12422 num_tokens=321274 num_tokens/piece=291.273
    trainer_interface.cc(507) LOG(INFO) Saving model: data/lang/train_nodup_unigram1000.model
    trainer_interface.cc(531) LOG(INFO) Saving vocabs: data/lang/train_nodup_unigram1000.vocab
    Traceback (most recent call last):
      File "../../scripts/spm_encode.py", line 99, in <module>
        main()
      File "../../scripts/spm_encode.py", line 90, in main
        print(" ".join(enc_line), file=output_h)
    UnicodeEncodeError: 'ascii' codec can't encode character '\u2581' in position 0: ordinal not in range(128)
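
    (A plausible reading of this error, for what it's worth: the process's stdout encoding is ASCII in this environment, so printing SentencePiece's '▁' marker fails. Forcing UTF-8 output is a common workaround, e.g.:

    PYTHONIOENCODING=utf-8 python ../../scripts/spm_encode.py ...

    where the trailing arguments are whatever the recipe already passes.)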
    

    What have you tried?

    My setup should be OK, as I have been running the WSJ recipe without issue, but I notice that a different script is used here for the tokenizing. Any help or advice would be great!

    question 
    opened by annamine 17
  • Error in fp16 training

    Hi @freewym, have you had a chance to train the model with float16 precision? I ran into this error in the swbd recipe:

    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "<path>/codebase/espresso/env/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "<path>/codebase/espresso/speech_train.py", line 354, in distributed_main
        main(args, init_distributed=True)
      File "<path>/codebase/espresso/speech_train.py", line 128, in main
        train(args, trainer, task, epoch_itr)
      File "<path>/codebase/espresso/speech_train.py", line 173, in train
        log_output = trainer.train_step(samples)
      File "<path>/codebase/espresso/fairseq/trainer.py", line 342, in train_step
        raise e
      File "<path>/codebase/espresso/fairseq/trainer.py", line 306, in train_step
        ignore_grad
      File "<path>/codebase/espresso/fairseq/tasks/fairseq_task.py", line 249, in train_step
        optimizer.backward(loss)
      File "<path>/codebase/espresso/fairseq/optim/fp16_optimizer.py", line 103, in backward
        loss.backward()
      File "<path>/codebase/espresso/env/lib64/python3.6/site-packages/torch/tensor.py", line 150, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "<path>/codebase/espresso/env/lib64/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
        allow_unreachable=True)  # allow_unreachable flag
    RuntimeError: expected scalar type Float but found Half

    opened by Shujian2015 15
  • SIGSEGV while running train.py on a multi GPU setup

    I have set up an Ubuntu 18.04 environment with 4 CPUs and 4 GPUs to run the LibriSpeech training.

    The prepare step went through fine.

    But when I launch the training using:

    python train.py ./librispeech-workdir/preprocessed-data/ \
      --save-dir ./librispeech-workdir/train-output/ \
      --max-epoch 80 --task speech_recognition_e --arch vggtransformer_2 \
      --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 \
      --clip-norm 10.0 --max-tokens 5000 --log-format json --log-interval 1 \
      --criterion cross_entropy_acc --user-dir examples/speech_recognition/

    I get the following error right at the outset:

    | model vggtransformer_2, criterion CrossEntropyWithAccCriterion
    | num. model params: 315190057 (num. trained: 315190057)
    | training on 4 GPUs
    | max tokens per GPU = 5000 and max sentences per GPU = None
    | no existing checkpoint found ./librispeech-workdir/train-output/checkpoint_last.pt
    | loading train data for epoch 0
    Traceback (most recent call last):
      File "train.py", line 343, in <module>
        cli_main()
      File "train.py", line 335, in cli_main
        nprocs=args.distributed_world_size,
      File "/home/chandraka/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
        while not spawn_context.join():
      File "/home/chandraka/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
        (error_index, name)
    Exception: process 0 terminated with signal SIGSEGV

    I am unable to proceed in the absence of any clues as to what might be causing it.

    Please help

    It starts out with:

    | distributed init (rank 3): tcp://localhost:15160
    | distributed init (rank 0): tcp://localhost:15160
    | distributed init (rank 2): tcp://localhost:15160
    | distributed init (rank 1): tcp://localhost:15160
    | initialized host espresso-2 as rank 2
    | initialized host espresso-2 as rank 1
    | initialized host espresso-2 as rank 3
    | initialized host espresso-2 as rank 0
    Namespace(adadelta_eps=1e-08, adadelta_rho=0.95, anneal_eps=False, arch='vggtransformer_2', best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=10.0, conv_dec_config='((256, 3, True),) * 4', cpu=False, criterion='cross_entropy_acc', curriculum=0, data='./librispeech-workdir/preprocessed-data/', dataset_impl=None, ddp_backend='c10d', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:15160', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=4, empty_cache_freq=0, enc_output_dim=1024, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, input_feat_per_channel=80, keep_interval_updates=-1, keep_last_epochs=-1, log_format='json', log_interval=1, lr=[1.0], lr_scheduler='fixed', lr_shrink=0.1, max_epoch=80, max_sentences=None, max_sentences_valid=None, max_tokens=5000, max_tokens_valid=5000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, num_workers=1, optimizer='adadelta', optimizer_overrides='{}', required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='./librispeech-workdir/train-output/', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, silence_token='▁', skip_invalid_size_inputs_valid_test=False, task='speech_recognition_e', tbmf_wrapper=False, tensorboard_logdir='', tgt_embed_dim=512, threshold_loss_scale=None, tokenizer=None, train_subset='train', transformer_dec_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 6', transformer_enc_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 16', update_freq=[1], use_bmuf=False, user_dir='examples/speech_recognition/', valid_subset='valid', validate_interval=1, vggblock_enc_config='[(64, 3, 2, 2, True), (128, 3, 2, 2, True)]', warmup_updates=0, weight_decay=0.0)
    | dictionary: 5001 types


    (I have had to rename the speech_recognition task to speech_recognition_e, as there is a similarly named task in the fairseq directory as well.)

    opened by chandraka 15
  • tensorized_lookahead_language_model SyntaxError

    Hi~ I was running the asr_wsj recipe and got SyntaxError: invalid syntax.

    This is the info:

    File "/share/nas165/QAQ/espresso/fairseq/models/tensorized_lookahead_language_model.py", line 61
        self.lm_decoder: FairseqIncrementalDecoder = word_lm.decoder
                       ^
    SyntaxError: invalid syntax
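
    (One likely explanation: that line uses a PEP 526 variable annotation, which only parses on Python >= 3.6, so this SyntaxError usually indicates an older interpreter. A quick check:

    python -c "x: int = 3"  # SyntaxError on Python < 3.6, fine on 3.6+

    If this one-liner fails too, the interpreter predates Python 3.6.)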
    
    

    Could anyone help me? Thanks very much.

    opened by DTDwind 10
  • Problem with Long Utterances for MALACH Corpus

    I am trying to use Espresso to decode the MALACH corpus. One characteristic of MALACH is that the training utterances are all short (< 8 seconds on the whole), but the test data contains a significant number of long utterances (> 20 seconds). I am observing that on these long utterances it produces decent output for the first 5-6 seconds, deteriorates rapidly thereafter, puts out some repeated words, and then stops decoding, resulting in many deletions. This is for a transformer model based on the WSJ recipe. MALACH has about 160 hours of training data. I would welcome some suggestions/help here; it almost looks like some parameter setting would fix things.

    Thanks, Michael

    question 
    opened by picheny-nyu 8
  • I tried to run a LibriSpeech recipe but the word error rate remains very large

    What is your question?

    I tried to run a LibriSpeech recipe (examples/asr_librispeech/run.sh), but the word error rate remains very large (around 100%) in "Stage 8: Model Training", despite 30 epochs of training. I suspect the cause is a difference in the execution environment.

    What's your environment?

    My environment is as follows.

    • fairseq Version (e.g., 1.0 or master): 0.9.0
    • PyTorch Version (e.g., 1.0): 1.4.0
    • OS (e.g., Linux): Ubuntu 18.04.3 LTS
    • How you installed fairseq (pip, source): pip
    • Python version: 3.6.5
    • CUDA/cuDNN version: 10.0.130 / libcudnn.so.7.3.0
    • GPU models and configuration: Tesla V100-SXM2-16GB
    $ python collect_env.py 
    Collecting environment information...
    PyTorch version: 1.4.0
    Is debug build: No
    CUDA used to build PyTorch: 10.0
    
    OS: Ubuntu 18.04.3 LTS
    GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
    CMake version: version 3.10.2
    
    Python version: 3.6
    Is CUDA available: Yes
    CUDA runtime version: 10.0.130
    GPU models and configuration: GPU 0: Tesla V100-SXM2-16GB
    Nvidia driver version: 440.33.01
    cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.3.0
    
    Versions of relevant libraries:
    [pip] numpy==1.18.1
    [pip] torch==1.4.0
    [conda] blas                      1.0                         mkl  
    [conda] mkl                       2020.0                      166  
    [conda] mkl-service               2.3.0            py36he904b0f_0  
    [conda] mkl_fft                   1.0.15           py36ha843d7b_0  
    [conda] mkl_random                1.1.0            py36hd6b4f25_0  
    [conda] pytorch                   1.4.0           py3.6_cuda10.0.130_cudnn7.6.3_0    pytorch
    [conda] torch                     1.4.0                    pypi_0    pypi
    

    The commit hash of Espresso is f933e8cfff38a20bd7ea76dd4cb3a1fa809eab83. The output log is as follows, but I can't find any problem in it.

    2020-03-08 22:27:56 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample REF: I SEE I MUST GET SETTLED QUICKLY SO THAT I SHALL HAVE THE POWER TO RESTRAIN YOU THEY ROLLICKED FORTH THEN AND BOUGHT SEVERAL THINGS A BIG STEAMER RUG FOR THE CAR A PAIR OF LONG GRAY MOCHA GLOVES TO MATCH THE HAND BAG A SILK UMBRELLA
    2020-03-08 22:27:56 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample PRD: AND HAVE THAT' BE A IN AND I I CAN BE TO PLEASURE OF GET THE I ARE UPED AND AND THE THERE THE OF ANDPIECEGERER ANDG AND THE SHIPS FEW OF SHOES WHITE HAIRSACKS WHICH MAN OF PAIR HANDKERCHIEF
    2020-03-08 23:13:04 | INFO | valid | epoch 025 | valid on 'valid' subset | loss 6.804 | nll_loss 5.856 | wer 109.669 | cer 100.424 | ppl 57.94 | wps 822.3 | wpb 715.6 | bsz 29.1 | num_updates 414000 | best_wer 96.7413
    2020-03-08 23:13:27 | INFO | fairseq.checkpoint_utils | saved checkpoint exp/lstm/checkpoint_25_414000.pt (epoch 25 @ 414000 updates, score 109.6687) (writing took 23.222585418028757 seconds)
    2020-03-08 23:13:28 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample REF: CONTINUED DUNCAN SPEAKING SLOWLY AND USING THE SIMPLEST FRENCH OF WHICH HE WAS THE MASTER TO BELIEVE THAT NONE OF THIS WISE AND BRAVE NATION UNDERSTAND THE LANGUAGE THAT THE GRAND MONARQUE USES WHEN HE TALKS TO HIS CHILDREN
    2020-03-08 23:13:28 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample PRD: AND THECAN WITH IN AND IING THE WORDSST WAY LANGUAGE THE HE WAS THE MOST OF BE THAT HE OF THE WAS AND IN MAN COULDS LANGUAGE OF HE WORLDESTITTSS HE ISS TO THE PEOPLE
    2020-03-08 23:23:27 | INFO | train | epoch 025:  15999 / 16601 loss=6.624, nll_loss=5.649, ppl=50.18, wps=537.4, ups=0.75, wpb=716.6, bsz=16.9, num_updates=414424, lr=1e-05, gnorm=0.399, clip=0, oom=0, train_wall=13155, wall=554591
    2020-03-08 23:34:58 | INFO | train | epoch 025 | loss 6.624 | nll_loss 5.649 | ppl 50.18 | wps 539.9 | ups 0.75 | wpb 716.2 | bsz 16.9 | num_updates 415025 | lr 1e-05 | gnorm 0.4 | clip 0 | oom 0 | train_wall 13641 | wall 555281
    2020-03-08 23:37:49 | INFO | valid | epoch 025 | valid on 'valid' subset | loss 6.803 | nll_loss 5.856 | wer 108.198 | cer 99.5523 | ppl 57.92 | wps 822.3 | wpb 715.6 | bsz 29.1 | num_updates 415025 | best_wer 96.7413
    2020-03-08 23:38:12 | INFO | fairseq.checkpoint_utils | saved checkpoint exp/lstm/checkpoint25.pt (epoch 25 @ 415025 updates, score 108.1984) (writing took 23.870150407077745 seconds)
    

    Because LibriSpeech is a very large dataset, I am struggling to debug this. Could you give me any hints?

    I think that if Espresso had a recipe for a small dataset, like the an4 recipe in ESPnet, a trial run would be easier. Do you have any plans to implement a recipe for a small dataset?

    thanks.

    question 
    opened by ken57 8
  • Error found when running librispeech recipe with latest version of espresso

    🐛 Bug

    There are two issues after installing the latest version of Espresso:

    1. A SpecAugment parameter parsing error occurs once we enable the SpecAugment function:
    2020-11-11 12:04:42 | INFO | espresso.speech_train | --max-tokens is the maximum number of input frames in a batch
    Traceback (most recent call last):
      File "/nfs/mercury-13/u20/cli/src/espresso-11112020/espresso/examples/asr_librispeech/../../espresso/speech_train.py", line 415, in <module>
        cli_main()
      File "/nfs/mercury-13/u20/cli/src/espresso-11112020/espresso/examples/asr_librispeech/../../espresso/speech_train.py", line 404, in cli_main
        cfg = convert_namespace_to_omegaconf(args)
      File "/nfs/mercury-13/u20/cli/src/espresso-11112020/espresso/fairseq/dataclass/utils.py", line 324, in convert_namespace_to_omegaconf
        composed_cfg = compose("config", overrides=overrides, strict=False)
      File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/experimental/compose.py", line 31, in compose
        cfg = gh.hydra.compose_config(
      File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 507, in compose_config
        cfg = self.config_loader.load_configuration(
      File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 151, in load_configuration
        return self._load_configuration(
      File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 180, in _load_configuration
        parsed_overrides = parser.parse_overrides(overrides=overrides)
      File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/core/override_parser/overrides_parser.py", line 95, in parse_overrides
        raise OverrideParseException(
    hydra.errors.OverrideParseException: mismatched input 'W' expecting <EOF>
    See https://hydra.cc/docs/next/advanced/override_grammar/basic for details
    
    2. It crashes in the model training step (stage 8) without any error message:
    2020-11-11 12:38:55 | INFO | espresso.speech_train | task: SpeechRecognitionEspressoTask
    2020-11-11 12:38:55 | INFO | espresso.speech_train | model: SpeechLSTMModel
    2020-11-11 12:38:55 | INFO | espresso.speech_train | criterion: LabelSmoothedCrossEntropyV2Criterion)
    2020-11-11 12:38:55 | INFO | espresso.speech_train | num. model params: 159660204 (num. trained: 159660204)
    2020-11-11 12:38:55 | INFO | fairseq.trainer | detected shared parameter: decoder.attention.query_proj.bias <- decoder.attention.value_proj.bias
    2020-11-11 12:38:55 | INFO | espresso.speech_train | training on 1 devices (GPUs/TPUs)
    2020-11-11 12:38:55 | INFO | espresso.speech_train | max tokens per GPU = 26000 and batch size per GPU = 24
    2020-11-11 12:38:55 | INFO | fairseq.trainer | no existing checkpoint found exp/lstm_wsj.specaug.bpe1k/checkpoint_last.pt
    2020-11-11 12:38:55 | INFO | fairseq.trainer | loading train data for epoch 1
    2020-11-11 12:39:05 | INFO | espresso.tasks.speech_recognition | /nfs/mercury-13/u20/cli/src/espresso.latest/espresso/examples/asr_librispeech/data-bulgarian-bpe1k/train.json 33004 examples
    ./run.sh: line 259:  4839 Segmentation fault      CUDA_VISIBLE_DEVICES=$free_gpu speech_train.py $data_dir --task speech_recognition_espresso --seed 1 --log-interval $((8000/ngpus/update_freq)) --log-format simple --print-training-sample-interval $((4000/ngpus/update_freq)) --num-workers 0 --data-buffer-size 0 --max-tokens 26000 --batch-size 24 --curriculum 1 --empty-cache-freq 50 --valid-subset $valid_subset --batch-size-valid 48 --ddp-backend no_c10d --update-freq $update_freq --distributed-world-size $ngpus --optimizer adam --lr 0.001 --weight-decay 0.0 --clip-norm 2.0 --save-dir $dir --restore-file checkpoint_last.pt --save-interval-updates $((6000/ngpus/update_freq)) --keep-interval-updates 3 --keep-last-epochs 5 --validate-interval 1 --best-checkpoint-metric wer --criterion label_smoothed_cross_entropy_v2 --label-smoothing 0.1 --smoothing-type uniform --dict $dict --bpe sentencepiece --sentencepiece-model ${sentencepiece_model}.model --max-source-positions 9999 --max-target-positions 999 $opts --specaugment-config "$specaug_config" 2>&1
    

    To Reproduce

    Steps to reproduce the behavior (always include the command you ran):

    1. Run cmd: ./run.sh
    2. See error: listed above

    Expected behavior

    Able to train model with the recipe

    Environment

    • fairseq Version (e.g., 1.0 or master): 1.0.0a0+d966482
    • PyTorch Version (e.g., 1.0): 1.4.0
    • OS (e.g., Linux): CentOS Linux release 7.7.1908 (Core)
    • How you installed fairseq (pip, source): pip install from source
    • Build command you used (if compiling from source): pip install --editable .
    • Python version: 3.8.5
    • CUDA/cuDNN version: py3.8_cuda10.0.130_cudnn7.6.3_0
    • GPU models and configuration:
    • Any other relevant information:


    bug 
    opened by PhenixCFLi 7
  • Verify WER by scoring with Kaldi

    Hi authors, I'm using the LibriSpeech run.sh recipe. I trained the acoustic model (speech_conv_lstm_librispeech) using 4 GTX 1080 Ti GPUs, but I'm facing this error while doing Kaldi scoring:

    local/score.sh data/test_clean exp/lstm/decode_test_clean_shallow_fusion
    run.pl: job failed, log is in exp/lstm/decode_test_clean_shallow_fusion/scoring_kaldi/log/score.log

    My second question: is there any documentation for using my pre-trained model to decode audio wav files? I would like to compare the decoding speed between ESPnet and Espresso (https://arxiv.org/abs/1909.08723).

    question 
    opened by ahmedalbahnasawy 7
  • Slow training...

    Hello,

    I have spent some time comparing PyChain LF-MMI in Espresso with the pychain_example, which seems to borrow some code from Espresso. I get very slow forward passes in Espresso, while they are much faster in pychain_example (I use DistributedDataParallel in both Espresso (the 'no_c10d' backend, which uses NCCL anyway?) and PyChain (with 'nccl')). I use the same TDNN model in both, with the cnn/bn/relu architecture matched from Espresso to PyChain: 6 TDNN+BN+ReLU layers, strides=(1,1,1,1,1,3), dilations=(1,1,1,3,3,3), kernels=(3,3,3,3,3,3), no residual connections. Both use curriculum learning in the first epoch and start with the shortest batches.

    Espresso code:

        s = x.size()  # record the input size for the log line below
        start = time.time()
        for i in range(len(self.tdnn)):
            if self.residual and i > 0:  # residual connection starts from the 2nd layer
                prev_x = x
            x, x_lengths, padding_mask = self.tdnn[i](x, x_lengths)
            x = self.dropout_out_module(x)
            x = x + prev_x if self.residual and i > 0 and x.size(1) == prev_x.size(1) else x
        print('6xTDNN time %.5fs' % (time.time() - start,), 'tensor_in_size', s, 'gpu', x.get_device())
    

    PyChain code:

        s = x.size()  # record the input size for the log line below
        start = time.time()
        for i in range(len(self.tdnn)):
            if self.residual and i > 0:
                x_prev = x
            x, x_lengths = self.tdnn[i](x, x_lengths)
            x = F.dropout(x, p=self.dropout, training=self.training)
            if self.residual and i > 0 and x.size(1) == x_prev.size(1):
                x += x_prev
        print('6xTDNN time %.5fs' % (time.time() - start,), 'tensor_in_size', s, 'gpu', x.get_device())
    

    So, the code is almost line-by-line the same and the architecture is the same. Yet, with DistributedDataParallel, Espresso is much slower. This was run on the same machine, same 2 GPUs, one experiment right after the other (so no load-change issues on the machine). I checked that computing the padding does not significantly affect the timing. Here are the timings for several forward passes of similar size.

    Espresso:

    6xTDNN time 2.42642s tensor_in_size torch.Size([64, 158, 40]) tensor_out_size torch.Size([64, 53, 640]) gpu 1
    6xTDNN time 2.39317s tensor_in_size torch.Size([64, 177, 40]) tensor_out_size torch.Size([64, 59, 640]) gpu 1
    6xTDNN time 1.95155s tensor_in_size torch.Size([64, 144, 40]) tensor_out_size torch.Size([64, 48, 640]) gpu 0
    6xTDNN time 2.50637s tensor_in_size torch.Size([64, 170, 40]) tensor_out_size torch.Size([64, 57, 640]) gpu 0
    6xTDNN time 1.79735s tensor_in_size torch.Size([64, 192, 40]) tensor_out_size torch.Size([64, 64, 640]) gpu 1
    6xTDNN time 2.37481s tensor_in_size torch.Size([64, 186, 40]) tensor_out_size torch.Size([64, 62, 640]) gpu 0
    ...

    PyChain:

    6xTDNN time 0.07956s tensor_in_size torch.Size([64, 170, 40]) tensor_out_size torch.Size([64, 57, 640]) gpu 0
    6xTDNN time 0.08923s tensor_in_size torch.Size([64, 194, 40]) tensor_out_size torch.Size([64, 65, 640]) gpu 1
    6xTDNN time 0.08312s tensor_in_size torch.Size([64, 211, 40]) tensor_out_size torch.Size([64, 71, 640]) gpu 0
    6xTDNN time 0.08275s tensor_in_size torch.Size([64, 224, 40]) tensor_out_size torch.Size([64, 75, 640]) gpu 1
    6xTDNN time 0.08598s tensor_in_size torch.Size([64, 233, 40]) tensor_out_size torch.Size([64, 78, 640]) gpu 0
    6xTDNN time 0.08788s tensor_in_size torch.Size([64, 241, 40]) tensor_out_size torch.Size([64, 81, 640]) gpu 1
    ...

    So, PyChain is 10-20 times faster. Espresso uses 40-50% of each GPU, while PyChain uses 85-95% when put together with the LF-MMI loss. I wonder how to make Espresso train as fast as PyChain shows is possible. Is it a matter of the DistributedDataParallel implementation in fairseq? The backend? Any help is welcome.
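
    (One caveat when interpreting timings like these: CUDA kernels execute asynchronously, so wrapping a forward pass in time.time() without synchronization can attribute queued work from elsewhere to the timed region. A minimal sketch of a fairer measurement, assuming the model and inputs are already on the GPU:

    import time
    import torch

    torch.cuda.synchronize()  # drain previously queued kernels
    start = time.time()
    # ... run the 6xTDNN forward pass here ...
    torch.cuda.synchronize()  # wait for the kernels launched above to finish
    print('6xTDNN time %.5fs' % (time.time() - start,))

    This alone can explain large apparent per-layer differences between two otherwise identical loops.)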

    question 
    opened by maff20 6
  • SHA hashes in 'main' branch are different from those in the 'origin/main'

    Hello, I am experimenting with Espresso, trying the Switchboard recipe on air-traffic-control data. I noticed my local SHA hashes on the 'main' branch are different from those on 'origin/main'. I tried to pull from 'origin', but I am getting conflicts because of that.

    • Can it be caused by the local installation via pip install --editable .?
    • Have you seen this issue before? Is it normal, or did I do something wrong?
    • How do you normally edit code, contribute, and test locally?

    Best regards, Karel

    question 
    opened by vesis84 2
  • Android Espresso not able to test fragment

    ❓ Questions and Help

    Android Espresso is not able to test a fragment. I am trying to launch a fragment as below:

    override fun onCreateOptionsMenu(menu: Menu, inflater: MenuInflater) {
        inflater.inflate(R.menu.menu_home, menu)
        menuNotification.icon = NotificationHelper.getNotificationDrawable(UserPool.userId)

    What have you tried?

    private lateinit var homeFragmentScenario: FragmentScenario

    @MockK
    lateinit var mockPool: UserPool

    @Before
    fun setUp() {
        InjectMocksRule.createMockK(this)
        ActivityScenario.launch(MainActivity::class.java)
        homeFragmentScenario = launchFragmentInContainer(themeResId = R.style.AppTheme)
        homeFragmentScenario.moveToState(Lifecycle.State.STARTED)
        Intents.init()
    }

    @Test
    fun loadScreen() {
        every { mockPool.userId } answers { "123456" }
        Espresso.onView(ViewMatchers.withId(R.id.layout_home))
            .check(ViewAssertions.matches(ViewMatchers.isDisplayed()))
    }

    question 
    opened by AbhishekArrk 0
  • hydra.errors.ConfigCompositionException: Could not override 'task.data'.

    When I run stage 7 in run_torchaudio.sh, there is always this problem:

    hydra.errors.ConfigCompositionException: Could not override 'task.data'.
    To append to your config use +task.data=/espresso/examples/asr_librispeech/data
    Key 'data' is not in struct
        full_key: task.data
        reference_type=Any
        object_type=dict

    Maybe the problem is in the Python file hydra_train.py. How can I solve it?

    opened by kai-dll 1
  • Batchnorm and masking

    It looks like the batchnorm doesn't take into account the masking:

    https://github.com/freewym/espresso/blob/6fca6cacd9d475d2676c527999e2d1bde08e7cbb/espresso/models/speech_tdnn.py#L170

    Surely this isn't right? However, I don't know how to take it into account.
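
    (To make the concern concrete: a mask-aware BatchNorm would compute batch statistics over valid frames only. A minimal sketch of such statistics, assuming x of shape (B, T, C) and a boolean padding_mask of shape (B, T) that is True at padded positions; this is not Espresso's code:

    import torch

    def masked_batchnorm_stats(x, padding_mask):
        # mean/variance over non-padded frames only, which is what a
        # masking-aware BatchNorm would normalize with
        valid = (~padding_mask).unsqueeze(-1).type_as(x)  # (B, T, 1)
        n = valid.sum()                                   # number of valid frames
        mean = (x * valid).sum(dim=(0, 1)) / n
        var = (((x - mean) * valid) ** 2).sum(dim=(0, 1)) / n
        return mean, var

    The stock torch.nn.BatchNorm1d instead averages over padded frames too, biasing the statistics when batches contain many short utterances.)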

    opened by danpovey 4
  • TIMIT Demo example

    🚀 Feature Request

    Would it be possible to upload an example for TIMIT for demonstration purposes? All the other speech recognition datasets are somewhat too large to download when just trying out this repo. Having TIMIT would allow people new to ASR to quickly try out and appreciate the convenience of this framework. Thanks.


    enhancement help wanted 
    opened by jedyang97 1