Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

Overview

Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented.
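
Espresso's look-ahead word-based LM fusion is more involved, but the simplest of these fusion schemes, shallow fusion, can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not Espresso's decoder API; the function name and the fixed weight are assumptions:

import torch

def fuse_scores(asr_logprobs, lm_logprobs, lm_weight=0.5):
    # Log-linear interpolation of next-token log-probabilities from the
    # end-to-end ASR model and an external language model; beam search
    # then expands hypotheses using the fused scores.
    return asr_logprobs + lm_weight * lm_logprobs

# toy usage: scores over a 5-token vocabulary for a beam of 2 hypotheses
asr = torch.log_softmax(torch.randn(2, 5), dim=-1)
lm = torch.log_softmax(torch.randn(2, 5), dim=-1)
fused = fuse_scores(asr, lm)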

We provide state-of-the-art training recipes for the following speech datasets: WSJ, LibriSpeech, and Switchboard.

What's New:

  • April 2021: On-the-fly feature extraction from raw waveforms with torchaudio is now supported. A LibriSpeech recipe is released here, with no dependency on Kaldi, using YAML files (via Hydra) to configure experiments (see the feature-extraction sketch after this list).
  • June 2020: Transformer recipes released.
  • April 2020: Both E2E LF-MMI (using PyChain) and Cross-Entropy training for hybrid ASR are now supported. WSJ recipes are provided here and here as examples, respectively.
  • March 2020: SpecAugment is supported and relevant recipes are released.
  • September 2019: We are working on isolating Espresso from fairseq, with the goal of a standalone package that can be directly pip installed.
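
A minimal sketch of what on-the-fly feature extraction with torchaudio looks like (the wav filename is hypothetical, and the recipe wires this into its data loading rather than calling it by hand like this):

import torchaudio
import torchaudio.compliance.kaldi as kaldi

# load a raw waveform and compute Kaldi-compatible 80-dim log-Mel
# filterbank features on the fly, instead of reading precomputed features
waveform, sample_rate = torchaudio.load("utt1.wav")  # hypothetical file
feats = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sample_rate)
print(feats.shape)  # (num_frames, 80)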

Requirements and Installation

  • PyTorch version >= 1.5.0
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • To install Espresso from source and develop locally:
git clone https://github.com/freewym/espresso
cd espresso
pip install --editable .

# on macOS:
# CFLAGS="-stdlib=libc++" pip install --editable ./
pip install kaldi_io sentencepiece soundfile
cd espresso/tools; make KALDI=<path/to/a/compiled/kaldi/directory>

Add the path of your Python installation to the PATH variable in examples/asr_<dataset>/path.sh; the current default is ~/anaconda3/bin.

kaldi_io is required for reading Kaldi scp files. sentencepiece is required for training/encoding subword pieces. soundfile is required for reading raw waveform files. Kaldi is required for data preparation, feature extraction, and scoring for some datasets (e.g., Switchboard), as well as for decoding for all hybrid systems.
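
For illustration, typical standalone uses of these packages look like the following (all paths and the model name are hypothetical):

import kaldi_io
import sentencepiece as spm

# read Kaldi feature matrices referenced by an scp file
for utt_id, mat in kaldi_io.read_mat_scp("data/train/feats.scp"):
    print(utt_id, mat.shape)  # numpy array of shape (num_frames, feat_dim)
    break

# encode text into subword pieces with a trained SentencePiece model
sp = spm.SentencePieceProcessor()
sp.Load("data/lang/train_unigram1000.model")
print(sp.EncodeAsPieces("HELLO WORLD"))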

  • If you want to use PyChain for LF-MMI training, you also need to install PyChain (and OpenFst):

Edit the PYTHON_DIR variable in espresso/tools/Makefile (default: ~/anaconda3/bin), and then:

cd espresso/tools; make openfst pychain
  • For faster training, install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

License

Espresso is MIT-licensed.

Citation

Please cite Espresso as:

@inproceedings{wang2019espresso,
  title = {Espresso: A Fast End-to-end Neural Speech Recognition Toolkit},
  author = {Yiming Wang and Tongfei Chen and Hainan Xu 
            and Shuoyang Ding and Hang Lv and Yiwen Shao 
            and Nanyun Peng and Lei Xie and Shinji Watanabe 
            and Sanjeev Khudanpur},
  booktitle = {2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year = {2019},
}
Comments
  • Non-ASCII characters in sample PRD and REF

    Hi, during training with the swbd recipe, the log on screen shows:

    | sample PRD: \xe2\x96\x81maybe\xe2\x96\x81c' | sample REF: b'\xe2\x96\x81[vocalized-noise]'

    Also, the WER on the swbd val set is very large. Is this normal? Thanks in advance.
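
    (For context: those escape sequences are the raw UTF-8 bytes of SentencePiece's word-boundary marker '▁' (U+2581), shown as a Python byte string instead of decoded text. For example:

    b'\xe2\x96\x81maybe'.decode('utf-8')  # -> '▁maybe'

    So the PRD/REF strings themselves are not necessarily wrong, just undecoded.)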

    opened by Shujian2015 43
  • Which recipes involve multi-level LM training and decoding? Can word + subword be used for multi-level decoding?

    What is your question?

    As stated in the subject:

    1. I want to know which recipes involve multi-level LM training and decoding.
    2. Can we use word + subword for multi-level decoding? If so, how?

    What have you tried?

    I have read the LibriSpeech and WSJ recipes, but could not find a clear way to enable multi-level (word + subword) decoding in the LSTM ASR model.

    What's your environment?

    • fairseq Version (e.g., 1.0 or master):
    • PyTorch Version : 1.4.0
    • OS (e.g., Linux): Centos7
    • How you installed fairseq (pip, source): pip
    • Python version: 3.7
    • CUDA/cuDNN version: 10.0
    question 
    opened by PhenixCFLi 30
  • SWBD Recipe Error

    Hi, I am trying to run the SWBD recipe on my local machine. I am getting errors at Stage 2 of the run script, which builds the dictionary and tokenizes the text. The error seems to come from the "tokenizing text for train/valid/test sets..." stage, which runs spm_encode.py.

    Code

    This is the full shell output:

    sentencepiece_trainer.cc(116) LOG(INFO) Running command: --bos_id=-1 --pad_id=0 --eos_id=1 --unk_id=2 --input=data/lang/input --vocab_size=1003 --character_coverage=1.0 --model_type=unigram --model_prefix=data/lang/train_nodup_unigram1000 --input_sentence_size=10000000 --user_defined_symbols=[laughter],[noise],[vocalized-noise]
    sentencepiece_trainer.cc(49) LOG(INFO) Starts training with :
    TrainerSpec {
      input: data/lang/input
      input_format:
      model_prefix: data/lang/train_nodup_unigram1000
      model_type: UNIGRAM
      vocab_size: 1003
      self_test_sample_size: 0
      character_coverage: 1
      input_sentence_size: 10000000
      shuffle_input_sentence: 1
      seed_sentencepiece_size: 1000000
      shrinking_factor: 0.75
      max_sentence_length: 4192
      num_threads: 16
      num_sub_iterations: 2
      max_sentencepiece_length: 16
      split_by_unicode_script: 1
      split_by_number: 1
      split_by_whitespace: 1
      treat_whitespace_as_suffix: 0
      user_defined_symbols: [laughter]
      user_defined_symbols: [noise]
      user_defined_symbols: [vocalized-noise]
      hard_vocab_limit: 1
      use_all_vocab: 0
      unk_id: 2
      bos_id: -1
      eos_id: 1
      pad_id: 0
      unk_piece: <unk>
      bos_piece: <s>
      eos_piece: </s>
      pad_piece: <pad>
      unk_surface:  ⁇
    }
    NormalizerSpec {
      name: nmt_nfkc
      add_dummy_prefix: 1
      remove_extra_whitespaces: 1
      escape_whitespaces: 1
      normalization_rule_tsv:
    }
    
    trainer_interface.cc(267) LOG(INFO) Loading corpus: data/lang/input
    trainer_interface.cc(139) LOG(INFO) Loaded 1000000 lines
    trainer_interface.cc(139) LOG(INFO) Loaded 2000000 lines
    trainer_interface.cc(114) LOG(WARNING) Too many sentences are loaded! (2416025), which may slow down training.
    trainer_interface.cc(116) LOG(WARNING) Consider using --input_sentence_size=<size> and --shuffle_input_sentence=true.
    trainer_interface.cc(119) LOG(WARNING) They allow to randomly sample <size> sentences from the entire corpus.
    trainer_interface.cc(315) LOG(INFO) Loaded all 2416025 sentences
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <pad>
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: </s>
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <unk>
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: [laughter]
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: [noise]
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: [vocalized-noise]
    trainer_interface.cc(335) LOG(INFO) Normalizing sentences...
    trainer_interface.cc(384) LOG(INFO) all chars count=120465092
    trainer_interface.cc(392) LOG(INFO) Done: 100% characters are covered.
    trainer_interface.cc(402) LOG(INFO) Alphabet size=43
    trainer_interface.cc(403) LOG(INFO) Final character coverage=1
    trainer_interface.cc(435) LOG(INFO) Done! preprocessed 2416025 sentences.
    unigram_model_trainer.cc(129) LOG(INFO) Making suffix array...
    unigram_model_trainer.cc(133) LOG(INFO) Extracting frequent sub strings...
    unigram_model_trainer.cc(184) LOG(INFO) Initialized 166028 seed sentencepieces
    trainer_interface.cc(441) LOG(INFO) Tokenizing input sentences with whitespace: 2416025
    trainer_interface.cc(451) LOG(INFO) Done! 69957
    unigram_model_trainer.cc(470) LOG(INFO) Using 69957 sentences for EM training
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=59852 obj=9.23769 num_tokens=130093 num_tokens/piece=2.17358
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=44412 obj=7.29956 num_tokens=132354 num_tokens/piece=2.98014
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=33308 obj=7.24442 num_tokens=141637 num_tokens/piece=4.25234
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=33303 obj=7.23651 num_tokens=141660 num_tokens/piece=4.25367
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=24977 obj=7.21871 num_tokens=158375 num_tokens/piece=6.34083
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=24977 obj=7.21644 num_tokens=158399 num_tokens/piece=6.34179
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=18732 obj=7.21162 num_tokens=175442 num_tokens/piece=9.3659
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=18732 obj=7.20821 num_tokens=175404 num_tokens/piece=9.36387
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=14049 obj=7.21798 num_tokens=192101 num_tokens/piece=13.6736
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=14049 obj=7.21295 num_tokens=192059 num_tokens/piece=13.6707
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=10536 obj=7.23918 num_tokens=207654 num_tokens/piece=19.709
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=10536 obj=7.23244 num_tokens=207609 num_tokens/piece=19.7047
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=7902 obj=7.27241 num_tokens=221580 num_tokens/piece=28.041
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=7902 obj=7.26387 num_tokens=221484 num_tokens/piece=28.0289
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=5926 obj=7.32839 num_tokens=234743 num_tokens/piece=39.6124
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=5926 obj=7.31716 num_tokens=234693 num_tokens/piece=39.6039
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=4444 obj=7.40817 num_tokens=248571 num_tokens/piece=55.9341
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=4444 obj=7.39317 num_tokens=248418 num_tokens/piece=55.8996
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=3333 obj=7.50897 num_tokens=262750 num_tokens/piece=78.8329
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=3333 obj=7.49001 num_tokens=262534 num_tokens/piece=78.7681
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=2499 obj=7.64161 num_tokens=276859 num_tokens/piece=110.788
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=2499 obj=7.61733 num_tokens=276640 num_tokens/piece=110.7
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=1874 obj=7.80273 num_tokens=292799 num_tokens/piece=156.243
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=1874 obj=7.77333 num_tokens=292543 num_tokens/piece=156.106
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=1405 obj=7.99379 num_tokens=309225 num_tokens/piece=220.089
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=1405 obj=7.95503 num_tokens=308821 num_tokens/piece=219.801
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=1103 obj=8.15973 num_tokens=321388 num_tokens/piece=291.376
    unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=1103 obj=8.12422 num_tokens=321274 num_tokens/piece=291.273
    trainer_interface.cc(507) LOG(INFO) Saving model: data/lang/train_nodup_unigram1000.model
    trainer_interface.cc(531) LOG(INFO) Saving vocabs: data/lang/train_nodup_unigram1000.vocab
    Traceback (most recent call last):
      File "../../scripts/spm_encode.py", line 99, in <module>
        main()
      File "../../scripts/spm_encode.py", line 90, in main
        print(" ".join(enc_line), file=output_h)
    UnicodeEncodeError: 'ascii' codec can't encode character '\u2581' in position 0: ordinal not in range(128)
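
    (A plausible reading of this error, for what it's worth: the process's stdout encoding is ASCII in this environment, so printing SentencePiece's '▁' marker fails. Forcing UTF-8 output is a common workaround, e.g.:

    PYTHONIOENCODING=utf-8 python ../../scripts/spm_encode.py ...

    where the trailing arguments are whatever the recipe already passes.)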
    

    What have you tried?

    My setup should be OK, as I have been running the WSJ recipe without issue, but I notice that a different script is used here for the tokenizing. Any help or advice would be great!

    question 
    opened by annamine 17
  • Error in fp16 training

    Hi @freewym, have you had a chance to train the model with float16 precision? I ran into this error in the swbd recipe:

    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "<path>/codebase/espresso/env/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "<path>/codebase/espresso/speech_train.py", line 354, in distributed_main
        main(args, init_distributed=True)
      File "<path>/codebase/espresso/speech_train.py", line 128, in main
        train(args, trainer, task, epoch_itr)
      File "<path>/codebase/espresso/speech_train.py", line 173, in train
        log_output = trainer.train_step(samples)
      File "<path>/codebase/espresso/fairseq/trainer.py", line 342, in train_step
        raise e
      File "<path>/codebase/espresso/fairseq/trainer.py", line 306, in train_step
        ignore_grad
      File "<path>/codebase/espresso/fairseq/tasks/fairseq_task.py", line 249, in train_step
        optimizer.backward(loss)
      File "<path>/codebase/espresso/fairseq/optim/fp16_optimizer.py", line 103, in backward
        loss.backward()
      File "<path>/codebase/espresso/env/lib64/python3.6/site-packages/torch/tensor.py", line 150, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "<path>/codebase/espresso/env/lib64/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
        allow_unreachable=True)  # allow_unreachable flag
    RuntimeError: expected scalar type Float but found Half

    opened by Shujian2015 15
  • SIGSEGV while running train.py on a multi GPU setup

    I have set up an Ubuntu 18.04 environment with 4 CPUs and 4 GPUs to run the LibriSpeech training.

    The prepare step went through fine.

    But when I launch the training using:

    python train.py ./librispeech-workdir/preprocessed-data/ \
      --save-dir ./librispeech-workdir/train-output/ \
      --max-epoch 80 --task speech_recognition_e --arch vggtransformer_2 \
      --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 \
      --clip-norm 10.0 --max-tokens 5000 --log-format json --log-interval 1 \
      --criterion cross_entropy_acc --user-dir examples/speech_recognition/

    I get the following error right at the outset:

    | model vggtransformer_2, criterion CrossEntropyWithAccCriterion
    | num. model params: 315190057 (num. trained: 315190057)
    | training on 4 GPUs
    | max tokens per GPU = 5000 and max sentences per GPU = None
    | no existing checkpoint found ./librispeech-workdir/train-output/checkpoint_last.pt
    | loading train data for epoch 0
    Traceback (most recent call last):
      File "train.py", line 343, in <module>
        cli_main()
      File "train.py", line 335, in cli_main
        nprocs=args.distributed_world_size,
      File "/home/chandraka/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
        while not spawn_context.join():
      File "/home/chandraka/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
        (error_index, name)
    Exception: process 0 terminated with signal SIGSEGV

    I am unable to proceed in the absence of any clues as to what might be causing it.

    Please help

    It starts out with:

    | distributed init (rank 3): tcp://localhost:15160
    | distributed init (rank 0): tcp://localhost:15160
    | distributed init (rank 2): tcp://localhost:15160
    | distributed init (rank 1): tcp://localhost:15160
    | initialized host espresso-2 as rank 2
    | initialized host espresso-2 as rank 1
    | initialized host espresso-2 as rank 3
    | initialized host espresso-2 as rank 0
    Namespace(adadelta_eps=1e-08, adadelta_rho=0.95, anneal_eps=False, arch='vggtransformer_2', best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=10.0, conv_dec_config='((256, 3, True),) * 4', cpu=False, criterion='cross_entropy_acc', curriculum=0, data='./librispeech-workdir/preprocessed-data/', dataset_impl=None, ddp_backend='c10d', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:15160', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=4, empty_cache_freq=0, enc_output_dim=1024, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, input_feat_per_channel=80, keep_interval_updates=-1, keep_last_epochs=-1, log_format='json', log_interval=1, lr=[1.0], lr_scheduler='fixed', lr_shrink=0.1, max_epoch=80, max_sentences=None, max_sentences_valid=None, max_tokens=5000, max_tokens_valid=5000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, num_workers=1, optimizer='adadelta', optimizer_overrides='{}', required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='./librispeech-workdir/train-output/', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, silence_token='▁', skip_invalid_size_inputs_valid_test=False, task='speech_recognition_e', tbmf_wrapper=False, tensorboard_logdir='', tgt_embed_dim=512, threshold_loss_scale=None, tokenizer=None, train_subset='train', transformer_dec_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 6', transformer_enc_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 16', update_freq=[1], use_bmuf=False, user_dir='examples/speech_recognition/', valid_subset='valid', validate_interval=1, vggblock_enc_config='[(64, 3, 2, 2, True), (128, 3, 2, 2, True)]', warmup_updates=0, weight_decay=0.0)
    | dictionary: 5001 types


    (I have had to rename the speech_recognition task to speech_recognition_e, as there is a similarly named task in the fairseq directory as well.)

    opened by chandraka 15
  • tensorized_lookahead_language_model SyntaxError

    Hi~ I was running the asr_wsj recipe and got SyntaxError: invalid syntax.

    This is the info:

    File "/share/nas165/QAQ/espresso/fairseq/models/tensorized_lookahead_language_model.py", line 61
        self.lm_decoder: FairseqIncrementalDecoder = word_lm.decoder
                       ^
    SyntaxError: invalid syntax
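
    (One likely explanation: that line uses a PEP 526 variable annotation, which only parses on Python >= 3.6, so this SyntaxError usually indicates an older interpreter. A quick check:

    python -c "x: int = 3"  # SyntaxError on Python < 3.6, fine on 3.6+

    If this one-liner fails too, the interpreter predates Python 3.6.)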
    
    

    Could anyone help me? Thanks very much.

    opened by DTDwind 10
  • Problem with Long Utterances for MALACH Corpus

    I am trying to use Espresso to decode the MALACH corpus. One characteristic of MALACH is that the training utterances are all short (< 8 seconds on the whole), but the test data contains a significant number of long utterances (> 20 seconds). I am observing that on these long utterances it produces decent output for the first 5-6 seconds, deteriorates rapidly thereafter, puts out some repeated words, and then stops decoding, resulting in many deletions. This is for a transformer model based on the WSJ recipe. MALACH has about 160 hours of training data. I would welcome some suggestions/help here; it almost looks like some parameter setting would fix things.

    Thanks, Michael

    question 
    opened by picheny-nyu 8
  • I tried to run a LibriSpeech recipe but the word error rate remains very large

    What is your question?

    I tried to run a LibriSpeech recipe (examples/asr_librispeech/run.sh), but the word error rate remains very large (around 100%) in "Stage 8: Model Training", despite 30 epochs of training. I suspect the cause is a difference in the execution environment.

    What's your environment?

    My environment is as follows.

    • fairseq Version (e.g., 1.0 or master): 0.9.0
    • PyTorch Version (e.g., 1.0): 1.4.0
    • OS (e.g., Linux): Ubuntu 18.04.3 LTS
    • How you installed fairseq (pip, source): pip
    • Python version: 3.6.5
    • CUDA/cuDNN version: 10.0.130 / libcudnn.so.7.3.0
    • GPU models and configuration: Tesla V100-SXM2-16GB
    $ python collect_env.py 
    Collecting environment information...
    PyTorch version: 1.4.0
    Is debug build: No
    CUDA used to build PyTorch: 10.0
    
    OS: Ubuntu 18.04.3 LTS
    GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
    CMake version: version 3.10.2
    
    Python version: 3.6
    Is CUDA available: Yes
    CUDA runtime version: 10.0.130
    GPU models and configuration: GPU 0: Tesla V100-SXM2-16GB
    Nvidia driver version: 440.33.01
    cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.3.0
    
    Versions of relevant libraries:
    [pip] numpy==1.18.1
    [pip] torch==1.4.0
    [conda] blas                      1.0                         mkl  
    [conda] mkl                       2020.0                      166  
    [conda] mkl-service               2.3.0            py36he904b0f_0  
    [conda] mkl_fft                   1.0.15           py36ha843d7b_0  
    [conda] mkl_random                1.1.0            py36hd6b4f25_0  
    [conda] pytorch                   1.4.0           py3.6_cuda10.0.130_cudnn7.6.3_0    pytorch
    [conda] torch                     1.4.0                    pypi_0    pypi
    

    The commit hash of Espresso is f933e8cfff38a20bd7ea76dd4cb3a1fa809eab83. The output log is as follows, but I can't find any problem in it.

    2020-03-08 22:27:56 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample REF: I SEE I MUST GET SETTLED QUICKLY SO THAT I SHALL HAVE THE POWER TO RESTRAIN YOU THEY ROLLICKED FORTH THEN AND BOUGHT SEVERAL THINGS A BIG STEAMER RUG FOR THE CAR A PAIR OF LONG GRAY MOCHA GLOVES TO MATCH THE HAND BAG A SILK UMBRELLA
    2020-03-08 22:27:56 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample PRD: AND HAVE THAT' BE A IN AND I I CAN BE TO PLEASURE OF GET THE I ARE UPED AND AND THE THERE THE OF ANDPIECEGERER ANDG AND THE SHIPS FEW OF SHOES WHITE HAIRSACKS WHICH MAN OF PAIR HANDKERCHIEF
    2020-03-08 23:13:04 | INFO | valid | epoch 025 | valid on 'valid' subset | loss 6.804 | nll_loss 5.856 | wer 109.669 | cer 100.424 | ppl 57.94 | wps 822.3 | wpb 715.6 | bsz 29.1 | num_updates 414000 | best_wer 96.7413
    2020-03-08 23:13:27 | INFO | fairseq.checkpoint_utils | saved checkpoint exp/lstm/checkpoint_25_414000.pt (epoch 25 @ 414000 updates, score 109.6687) (writing took 23.222585418028757 seconds)
    2020-03-08 23:13:28 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample REF: CONTINUED DUNCAN SPEAKING SLOWLY AND USING THE SIMPLEST FRENCH OF WHICH HE WAS THE MASTER TO BELIEVE THAT NONE OF THIS WISE AND BRAVE NATION UNDERSTAND THE LANGUAGE THAT THE GRAND MONARQUE USES WHEN HE TALKS TO HIS CHILDREN
    2020-03-08 23:13:28 | INFO | espresso.criterions.label_smoothed_cross_entropy_v2 | sample PRD: AND THECAN WITH IN AND IING THE WORDSST WAY LANGUAGE THE HE WAS THE MOST OF BE THAT HE OF THE WAS AND IN MAN COULDS LANGUAGE OF HE WORLDESTITTSS HE ISS TO THE PEOPLE
    2020-03-08 23:23:27 | INFO | train | epoch 025:  15999 / 16601 loss=6.624, nll_loss=5.649, ppl=50.18, wps=537.4, ups=0.75, wpb=716.6, bsz=16.9, num_updates=414424, lr=1e-05, gnorm=0.399, clip=0, oom=0, train_wall=13155, wall=554591
    2020-03-08 23:34:58 | INFO | train | epoch 025 | loss 6.624 | nll_loss 5.649 | ppl 50.18 | wps 539.9 | ups 0.75 | wpb 716.2 | bsz 16.9 | num_updates 415025 | lr 1e-05 | gnorm 0.4 | clip 0 | oom 0 | train_wall 13641 | wall 555281
    2020-03-08 23:37:49 | INFO | valid | epoch 025 | valid on 'valid' subset | loss 6.803 | nll_loss 5.856 | wer 108.198 | cer 99.5523 | ppl 57.92 | wps 822.3 | wpb 715.6 | bsz 29.1 | num_updates 415025 | best_wer 96.7413
    2020-03-08 23:38:12 | INFO | fairseq.checkpoint_utils | saved checkpoint exp/lstm/checkpoint25.pt (epoch 25 @ 415025 updates, score 108.1984) (writing took 23.870150407077745 seconds)
    

    Because LibriSpeech is a very large dataset, I am struggling to debug this. Could you give me any hints?

    I think that if Espresso had a recipe for a small dataset, like the an4 recipe in ESPnet, a trial run would be easier. Do you have any plans to implement a recipe for a small dataset?

    thanks.

    question 
    opened by ken57 8
  • Error found when running librispeech recipe with latest version of espresso

    🐛 Bug

    There are two issues after installing the latest version of Espresso:

    1. A SpecAugment parameter parsing error occurs once we enable the SpecAugment function:
    2020-11-11 12:04:42 | INFO | espresso.speech_train | --max-tokens is the maximum number of input frames in a batch
    Traceback (most recent call last):
      File "/nfs/mercury-13/u20/cli/src/espresso-11112020/espresso/examples/asr_librispeech/../../espresso/speech_train.py", line 415, in <module>
        cli_main()
      File "/nfs/mercury-13/u20/cli/src/espresso-11112020/espresso/examples/asr_librispeech/../../espresso/speech_train.py", line 404, in cli_main
        cfg = convert_namespace_to_omegaconf(args)
      File "/nfs/mercury-13/u20/cli/src/espresso-11112020/espresso/fairseq/dataclass/utils.py", line 324, in convert_namespace_to_omegaconf
        composed_cfg = compose("config", overrides=overrides, strict=False)
      File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/experimental/compose.py", line 31, in compose
        cfg = gh.hydra.compose_config(
      File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 507, in compose_config
        cfg = self.config_loader.load_configuration(
      File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 151, in load_configuration
        return self._load_configuration(
      File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 180, in _load_configuration
        parsed_overrides = parser.parse_overrides(overrides=overrides)
      File "/nfs/mercury-13/u20/cli/miniconda3/envs/espresso-11112020/lib/python3.8/site-packages/hydra/core/override_parser/overrides_parser.py", line 95, in parse_overrides
        raise OverrideParseException(
    hydra.errors.OverrideParseException: mismatched input 'W' expecting <EOF>
    See https://hydra.cc/docs/next/advanced/override_grammar/basic for details
    
    2. It crashes in the model training step (stage 8) without any error message:
    2020-11-11 12:38:55 | INFO | espresso.speech_train | task: SpeechRecognitionEspressoTask
    2020-11-11 12:38:55 | INFO | espresso.speech_train | model: SpeechLSTMModel
    2020-11-11 12:38:55 | INFO | espresso.speech_train | criterion: LabelSmoothedCrossEntropyV2Criterion)
    2020-11-11 12:38:55 | INFO | espresso.speech_train | num. model params: 159660204 (num. trained: 159660204)
    2020-11-11 12:38:55 | INFO | fairseq.trainer | detected shared parameter: decoder.attention.query_proj.bias <- decoder.attention.value_proj.bias
    2020-11-11 12:38:55 | INFO | espresso.speech_train | training on 1 devices (GPUs/TPUs)
    2020-11-11 12:38:55 | INFO | espresso.speech_train | max tokens per GPU = 26000 and batch size per GPU = 24
    2020-11-11 12:38:55 | INFO | fairseq.trainer | no existing checkpoint found exp/lstm_wsj.specaug.bpe1k/checkpoint_last.pt
    2020-11-11 12:38:55 | INFO | fairseq.trainer | loading train data for epoch 1
    2020-11-11 12:39:05 | INFO | espresso.tasks.speech_recognition | /nfs/mercury-13/u20/cli/src/espresso.latest/espresso/examples/asr_librispeech/data-bulgarian-bpe1k/train.json 33004 examples
    ./run.sh: line 259:  4839 Segmentation fault      CUDA_VISIBLE_DEVICES=$free_gpu speech_train.py $data_dir --task speech_recognition_espresso --seed 1 --log-interval $((8000/ngpus/update_freq)) --log-format simple --print-training-sample-interval $((4000/ngpus/update_freq)) --num-workers 0 --data-buffer-size 0 --max-tokens 26000 --batch-size 24 --curriculum 1 --empty-cache-freq 50 --valid-subset $valid_subset --batch-size-valid 48 --ddp-backend no_c10d --update-freq $update_freq --distributed-world-size $ngpus --optimizer adam --lr 0.001 --weight-decay 0.0 --clip-norm 2.0 --save-dir $dir --restore-file checkpoint_last.pt --save-interval-updates $((6000/ngpus/update_freq)) --keep-interval-updates 3 --keep-last-epochs 5 --validate-interval 1 --best-checkpoint-metric wer --criterion label_smoothed_cross_entropy_v2 --label-smoothing 0.1 --smoothing-type uniform --dict $dict --bpe sentencepiece --sentencepiece-model ${sentencepiece_model}.model --max-source-positions 9999 --max-target-positions 999 $opts --specaugment-config "$specaug_config" 2>&1
    

    To Reproduce

    Steps to reproduce the behavior (always include the command you ran):

    1. Run cmd: ./run.sh
    2. See error: listed above

    Expected behavior

    Able to train model with the recipe

    Environment

    • fairseq Version (e.g., 1.0 or master): 1.0.0a0+d966482
    • PyTorch Version (e.g., 1.0): 1.4.0
    • OS (e.g., Linux): CentOS Linux release 7.7.1908 (Core)
    • How you installed fairseq (pip, source): pip install from source
    • Build command you used (if compiling from source): pip install --editable .
    • Python version: 3.8.5
    • CUDA/cuDNN version: py3.8_cuda10.0.130_cudnn7.6.3_0
    • GPU models and configuration:
    • Any other relevant information:


    bug 
    opened by PhenixCFLi 7
  • Verify WER by scoring with Kaldi

    Hi authors, I'm using the LibriSpeech run.sh recipe. I trained the acoustic model (speech_conv_lstm_librispeech) using 4 GTX 1080 Ti GPUs, but I'm facing this error while doing Kaldi scoring:

    local/score.sh data/test_clean exp/lstm/decode_test_clean_shallow_fusion
    run.pl: job failed, log is in exp/lstm/decode_test_clean_shallow_fusion/scoring_kaldi/log/score.log

    My second question: is there any documentation for using my pre-trained model to decode audio wav files? I would like to compare the decoding speed between ESPnet and Espresso (https://arxiv.org/abs/1909.08723).

    question 
    opened by ahmedalbahnasawy 7
  • Slow training...

    Hello,

    I have spent some time comparing PyChain LF-MMI in Espresso with the pychain_example, which seems to borrow some code from Espresso. I get very slow forward passes in Espresso, while they are much faster in pychain_example (I use DistributedDataParallel in both Espresso (the 'no_c10d' backend, which uses NCCL anyway?) and PyChain (with 'nccl')). I use the same TDNN model in both, with the cnn/bn/relu architecture matched from Espresso to PyChain: 6 TDNN+BN+ReLU layers, strides=(1,1,1,1,1,3), dilations=(1,1,1,3,3,3), kernels=(3,3,3,3,3,3), no residual connections. Both use curriculum learning in the first epoch and start with the shortest batches.

    Espresso code:

        s = x.size()  # record the input size for the log line below
        start = time.time()
        for i in range(len(self.tdnn)):
            if self.residual and i > 0:  # residual connection starts from the 2nd layer
                prev_x = x
            x, x_lengths, padding_mask = self.tdnn[i](x, x_lengths)
            x = self.dropout_out_module(x)
            x = x + prev_x if self.residual and i > 0 and x.size(1) == prev_x.size(1) else x
        print('6xTDNN time %.5fs' % (time.time() - start,), 'tensor_in_size', s, 'gpu', x.get_device())
    

    PyChain code:

        s = x.size()  # record the input size for the log line below
        start = time.time()
        for i in range(len(self.tdnn)):
            if self.residual and i > 0:
                x_prev = x
            x, x_lengths = self.tdnn[i](x, x_lengths)
            x = F.dropout(x, p=self.dropout, training=self.training)
            if self.residual and i > 0 and x.size(1) == x_prev.size(1):
                x += x_prev
        print('6xTDNN time %.5fs' % (time.time() - start,), 'tensor_in_size', s, 'gpu', x.get_device())
    

    So, the code is almost line-by-line the same and the architecture is the same. Yet, with DistributedDataParallel, Espresso is much slower. This was run on the same machine, same 2 GPUs, one experiment right after the other (so no load-change issues on the machine). I checked that computing the padding does not significantly affect the timing. Here are the timings for several forward passes of similar size.

    Espresso:

    6xTDNN time 2.42642s tensor_in_size torch.Size([64, 158, 40]) tensor_out_size torch.Size([64, 53, 640]) gpu 1
    6xTDNN time 2.39317s tensor_in_size torch.Size([64, 177, 40]) tensor_out_size torch.Size([64, 59, 640]) gpu 1
    6xTDNN time 1.95155s tensor_in_size torch.Size([64, 144, 40]) tensor_out_size torch.Size([64, 48, 640]) gpu 0
    6xTDNN time 2.50637s tensor_in_size torch.Size([64, 170, 40]) tensor_out_size torch.Size([64, 57, 640]) gpu 0
    6xTDNN time 1.79735s tensor_in_size torch.Size([64, 192, 40]) tensor_out_size torch.Size([64, 64, 640]) gpu 1
    6xTDNN time 2.37481s tensor_in_size torch.Size([64, 186, 40]) tensor_out_size torch.Size([64, 62, 640]) gpu 0
    ...

    PyChain:

    6xTDNN time 0.07956s tensor_in_size torch.Size([64, 170, 40]) tensor_out_size torch.Size([64, 57, 640]) gpu 0
    6xTDNN time 0.08923s tensor_in_size torch.Size([64, 194, 40]) tensor_out_size torch.Size([64, 65, 640]) gpu 1
    6xTDNN time 0.08312s tensor_in_size torch.Size([64, 211, 40]) tensor_out_size torch.Size([64, 71, 640]) gpu 0
    6xTDNN time 0.08275s tensor_in_size torch.Size([64, 224, 40]) tensor_out_size torch.Size([64, 75, 640]) gpu 1
    6xTDNN time 0.08598s tensor_in_size torch.Size([64, 233, 40]) tensor_out_size torch.Size([64, 78, 640]) gpu 0
    6xTDNN time 0.08788s tensor_in_size torch.Size([64, 241, 40]) tensor_out_size torch.Size([64, 81, 640]) gpu 1
    ...

    So, PyChain is 10-20 times faster. Espresso uses 40-50% of each GPU, while PyChain uses 85-95% when put together with the LF-MMI loss. I wonder how to make Espresso train as fast as PyChain shows is possible. Is it a matter of the DistributedDataParallel implementation in fairseq? The backend? Any help is welcome.
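
    (One caveat when interpreting timings like these: CUDA kernels execute asynchronously, so wrapping a forward pass in time.time() without synchronization can attribute queued work from elsewhere to the timed region. A minimal sketch of a fairer measurement, assuming the model and inputs are already on the GPU:

    import time
    import torch

    torch.cuda.synchronize()  # drain previously queued kernels
    start = time.time()
    # ... run the 6xTDNN forward pass here ...
    torch.cuda.synchronize()  # wait for the kernels launched above to finish
    print('6xTDNN time %.5fs' % (time.time() - start,))

    This alone can explain large apparent per-layer differences between two otherwise identical loops.)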

    question 
    opened by maff20 6
  • SHA hashes in 'main' branch are different from those in the 'origin/main'

    Hello, I am experimenting with Espresso, trying the Switchboard recipe on air-traffic-control data. I noticed my local SHA hashes on the 'main' branch are different from those on 'origin/main'. I tried to pull from 'origin', but I am getting conflicts because of that.

    • Can it be caused by the local installation via pip install --editable .?
    • Have you seen this issue before? Is it normal, or did I do something wrong?
    • How do you normally edit code, contribute, and test locally?

    Best regards, Karel

    question 
    opened by vesis84 2
  • Android Espresso not able to test fragment

    ❓ Questions and Help

    Android Espresso is not able to test a fragment. I am trying to launch a fragment as below:

    override fun onCreateOptionsMenu(menu: Menu, inflater: MenuInflater) {
        inflater.inflate(R.menu.menu_home, menu)
        menuNotification.icon = NotificationHelper.getNotificationDrawable(UserPool.userId)

    What have you tried?

    private lateinit var homeFragmentScenario: FragmentScenario

    @MockK
    lateinit var mockPool: UserPool

    @Before
    fun setUp() {
        InjectMocksRule.createMockK(this)
        ActivityScenario.launch(MainActivity::class.java)
        homeFragmentScenario = launchFragmentInContainer(themeResId = R.style.AppTheme)
        homeFragmentScenario.moveToState(Lifecycle.State.STARTED)
        Intents.init()
    }

    @Test
    fun loadScreen() {
        every { mockPool.userId } answers { "123456" }
        Espresso.onView(ViewMatchers.withId(R.id.layout_home))
            .check(ViewAssertions.matches(ViewMatchers.isDisplayed()))
    }

    question 
    opened by AbhishekArrk 0
  • hydra.errors.ConfigCompositionException: Could not override 'task.data'.

    When I run stage 7 in run_torchaudio.sh, there is always this problem:

    hydra.errors.ConfigCompositionException: Could not override 'task.data'.
    To append to your config use +task.data=/espresso/examples/asr_librispeech/data
    Key 'data' is not in struct
        full_key: task.data
        reference_type=Any
        object_type=dict

    Maybe the problem is in the Python file hydra_train.py. How can I solve it?

    opened by kai-dll 1
  • Batchnorm and masking

    It looks like the batchnorm doesn't take into account the masking:

    https://github.com/freewym/espresso/blob/6fca6cacd9d475d2676c527999e2d1bde08e7cbb/espresso/models/speech_tdnn.py#L170

    Surely this isn't right? However, I don't know how to take it into account.
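
    (To make the concern concrete: a mask-aware BatchNorm would compute batch statistics over valid frames only. A minimal sketch of such statistics, assuming x of shape (B, T, C) and a boolean padding_mask of shape (B, T) that is True at padded positions; this is not Espresso's code:

    import torch

    def masked_batchnorm_stats(x, padding_mask):
        # mean/variance over non-padded frames only, which is what a
        # masking-aware BatchNorm would normalize with
        valid = (~padding_mask).unsqueeze(-1).type_as(x)  # (B, T, 1)
        n = valid.sum()                                   # number of valid frames
        mean = (x * valid).sum(dim=(0, 1)) / n
        var = (((x - mean) * valid) ** 2).sum(dim=(0, 1)) / n
        return mean, var

    The stock torch.nn.BatchNorm1d instead averages over padded frames too, biasing the statistics when batches contain many short utterances.)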

    opened by danpovey 4
  • TIMIT Demo example

    🚀 Feature Request

    Would it be possible to upload an example for TIMIT for demonstration purposes? All the other speech recognition datasets are somewhat too large to download when just trying out this repo. Having TIMIT would allow people new to ASR to quickly try out and appreciate the convenience of this framework. Thanks.


    enhancement help wanted 
    opened by jedyang97 1