Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.

Overview


OpenSpeech provides reference implementations of various ASR modeling papers and recipes for three languages to perform automatic speech recognition tasks. We aim to make ASR technology easier for everyone to use.

OpenSpeech is backed by two powerful libraries: PyTorch-Lightning and Hydra. These libraries provide features such as multi-GPU and TPU training, mixed-precision training, and hierarchical configuration management.

We appreciate any kind of feedback or contribution. Feel free to proceed with small issues such as bug fixes and documentation improvements. For major contributions and new features, please discuss with the collaborators in the corresponding issues.

What is OpenSpeech?

OpenSpeech is a framework for making end-to-end speech recognizers. End-to-end (E2E) automatic speech recognition (ASR) is an emerging paradigm in the field of neural network-based speech recognition that offers multiple benefits. Traditional “hybrid” ASR systems, which are comprised of an acoustic model, language model, and pronunciation model, require separate training of these components, each of which can be complex.

For example, training an acoustic model is a multi-stage process of model training and time alignment between the speech acoustic feature sequence and the output label sequence. In contrast, E2E ASR is a single integrated approach with a much simpler training pipeline and models that operate at low audio frame rates. This reduces training and decoding time and allows joint optimization with downstream processing such as natural language understanding.

Because of these advantages, many open-source projects for end-to-end speech recognition have emerged. However, many of them are built on plain PyTorch or TensorFlow, which makes it difficult to use features such as mixed-precision, multi-node, and TPU training. Frameworks such as PyTorch-Lightning make these features easy to use, so we created a speech recognition framework built on PyTorch-Lightning and Hydra.

Why should I use OpenSpeech?

  1. PyTorch-Lightning-based framework.
    • Various functions: mixed-precision, multi-node training, TPU training, etc.
    • Models become hardware agnostic.
    • Make fewer mistakes because Lightning handles the tricky engineering.
    • Lightning has dozens of integrations with popular machine learning tools.
  2. Easy experimentation with well-known ASR models.
    • Supports 20+ models and is continuously updated.
    • Low barrier to entry for educators and practitioners.
    • Save time for researchers who want to conduct various experiments.
  3. Provides recipes for three widely used languages: English, Chinese, and Korean.
    • LibriSpeech - 1,000 hours of English dataset most widely used in ASR tasks.
    • AISHELL-1 - 170 hours of Chinese Mandarin speech corpus.
    • KsponSpeech - 1,000 hours of Korean open-domain dialogue speech.
  4. Easily customize a model or a new dataset to your needs:
    • The default hparams of the supported models are provided but can be easily adjusted.
    • Easily create a custom model by combining modules that are already provided.
    • To use a new dataset, you only need to define a pl.LightningDataModule and a Tokenizer class.
  5. Audio processing
    • Representative audio features such as Spectrogram, Mel-Spectrogram, Filter-Bank, and MFCC can be used easily.
    • Provides a variety of augmentation, including SpecAugment, Noise Injection, and Audio Joining.
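
As an illustration of the custom-dataset point in item 4, here is a minimal sketch of what a character-level tokenizer could look like. The class name, the special tokens, and the `encode`/`decode` methods are assumptions made for this example; they are not taken from OpenSpeech's own Tokenizer base class, whose interface may differ.

```python
from typing import Dict, List


class CharacterTokenizer:
    """Hypothetical character-level tokenizer; NOT OpenSpeech's actual base class."""

    def __init__(self, transcripts: List[str]) -> None:
        # Reserve the first ids for special tokens, then one id per distinct character.
        specials = ["<pad>", "<sos>", "<eos>"]
        chars = sorted({ch for text in transcripts for ch in text})
        self.id_to_token: List[str] = specials + chars
        self.token_to_id: Dict[str, int] = {t: i for i, t in enumerate(self.id_to_token)}
        self.pad_id, self.sos_id, self.eos_id = 0, 1, 2

    def encode(self, text: str) -> List[int]:
        # Raises KeyError for characters that were not in the training transcripts.
        return [self.token_to_id[ch] for ch in text]

    def decode(self, ids: List[int]) -> str:
        skip = {self.pad_id, self.sos_id, self.eos_id}
        return "".join(self.id_to_token[i] for i in ids if i not in skip)


tokenizer = CharacterTokenizer(["hello world", "speech"])
ids = tokenizer.encode("hello")
print(ids, tokenizer.decode(ids))
```

A real tokenizer would also need a policy for unseen characters (for example an unknown token), which this sketch leaves out.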

Why shouldn't I use OpenSpeech?

  • This library provides code for training ASR models, but does not provide an API for using pre-trained models.
  • This library does not provide pre-trained models.

Model architectures

We support all the models below. Note that the key concepts of each model have been implemented to match the paper, but implementation details may vary.

  1. DeepSpeech2 (from Baidu Research) released with paper Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, by Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, Zhenyao Zhu.
  2. RNN-Transducer (from University of Toronto) released with paper Sequence Transduction with Recurrent Neural Networks, by Alex Graves.
  3. LSTM Language Model (from RWTH Aachen University) released with paper LSTM Neural Networks for Language Modeling, by Martin Sundermeyer, Ralf Schluter, and Hermann Ney.
  4. Listen Attend Spell (from Carnegie Mellon University and Google Brain) released with paper Listen, Attend and Spell, by William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals.
  5. Location-aware attention based Listen Attend Spell (from University of Wrocław and Jacobs University and Universite de Montreal) released with paper Attention-Based Models for Speech Recognition, by Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio.
  6. Joint CTC-Attention based Listen Attend Spell (from Mitsubishi Electric Research Laboratories and Carnegie Mellon University) released with paper Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning, by Suyoun Kim, Takaaki Hori, Shinji Watanabe.
  7. Deep CNN Encoder with Joint CTC-Attention Listen Attend Spell (from Mitsubishi Electric Research Laboratories and Massachusetts Institute of Technology and Carnegie Mellon University) released with paper Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM, by Takaaki Hori, Shinji Watanabe, Yu Zhang, William Chan.
  8. Multi-head attention based Listen Attend Spell (from Google) released with paper State-of-the-art Speech Recognition With Sequence-to-Sequence Models, by Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani.
  9. Speech-Transformer (from University of Chinese Academy of Sciences and Institute of Automation and Chinese Academy of Sciences) released with paper Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition, by Linhao Dong, Shuang Xu, Bo Xu.
  10. VGG-Transformer (from Facebook AI Research) released with paper Transformers with convolutional context for ASR, by Abdelrahman Mohamed, Dmytro Okhonko, Luke Zettlemoyer.
  11. Transformer with CTC (from NTT Communication Science Laboratories, Waseda University, Center for Language and Speech Processing, Johns Hopkins University) released with paper Improving Transformer-based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration, by Shigeki Karita, Nelson Enrique Yalta Soplin, Shinji Watanabe, Marc Delcroix, Atsunori Ogawa, Tomohiro Nakatani.
  12. Joint CTC-Attention based Transformer (from NTT Corporation) released with paper Self-Distillation for Improving CTC-Transformer-based ASR Systems, by Takafumi Moriya, Tsubasa Ochiai, Shigeki Karita, Hiroshi Sato, Tomohiro Tanaka, Takanori Ashihara, Ryo Masumura, Yusuke Shinohara, Marc Delcroix.
  13. Transformer Language Model (from Amazon Web Services) released with paper Language Models with Transformers, by Chenguang Wang, Mu Li, Alexander J. Smola.
  14. Jasper (from NVIDIA and New York University) released with paper Jasper: An End-to-End Convolutional Neural Acoustic Model, by Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, Ravi Teja Gadde.
  15. QuartzNet (from NVIDIA and Univ. of Illinois and Univ. of Saint Petersburg) released with paper QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions, by Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Yang Zhang.
  16. Transformer Transducer (from Facebook AI) released with paper Transformer-Transducer: End-to-End Speech Recognition with Self-Attention, by Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer.
  17. Conformer (from Google) released with paper Conformer: Convolution-augmented Transformer for Speech Recognition, by Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang.
  18. Conformer with CTC (from Northwestern Polytechnical University and University of Bordeaux and Johns Hopkins University and Human Dataware Lab and Kyoto University and NTT Corporation and Shanghai Jiao Tong University and Chinese Academy of Sciences) released with paper Recent Developments on ESPNET Toolkit Boosted by Conformer, by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, Yuekai Zhang.
  19. Conformer with LSTM Decoder (from IBM Research AI) released with paper On the limit of English conversational speech recognition, by Zoltán Tüske, George Saon, Brian Kingsbury.
  20. ContextNet (from Google) released with paper ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context, by Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu.

Get Started

We use Hydra to control all training configurations. If you are not familiar with Hydra, we recommend visiting the Hydra website. In short, Hydra is an open-source framework that simplifies the development of research applications by letting you compose a hierarchical configuration dynamically. To see how OpenSpeech uses Hydra, refer to the project documentation.
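
To make the idea of dotted overrides on a hierarchical configuration concrete, the sketch below applies `a.b=value` strings to a nested dictionary. It is a hand-rolled illustration and not Hydra's implementation: the function name `apply_overrides` and the example path are assumptions, values are kept as plain strings, and Hydra's config-group selection (an override such as `dataset=librispeech`) is not modeled.

```python
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply 'a.b.c=value' overrides to a nested dict in place (values stay strings)."""
    for override in overrides:
        dotted_key, value = override.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Walk down the hierarchy, creating intermediate levels when missing.
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


base = {
    "dataset": {"dataset_path": "???", "dataset_download": "False"},
    "trainer": {"name": "gpu"},
}
# "/data/LibriSpeech" is a hypothetical path used only for this example.
merged = apply_overrides(base, ["dataset.dataset_path=/data/LibriSpeech", "trainer.name=tpu"])
print(merged)
```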

Supported Datasets

We support LibriSpeech, KsponSpeech, and AISHELL-1.

LibriSpeech is a corpus of approximately 1,000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data was derived from reading audiobooks from the LibriVox project, and has been carefully segmented and aligned.

AISHELL-1 is an open-source Mandarin Chinese speech corpus published by Beijing Shell Shell Technology Co., Ltd. 400 people from different accent areas in China were invited to participate in the recording, which was conducted in a quiet indoor environment using a high-fidelity microphone and downsampled to 16kHz.

KsponSpeech is a large-scale spontaneous speech corpus of Korean. This corpus contains 969 hours of general open-domain dialog utterances, spoken by about 2,000 native Korean speakers in a clean environment. All data were constructed by recording the dialogue of two people freely conversing on a variety of topics and manually transcribing the utterances. To start training, the KsponSpeech dataset must be prepared in advance. To download KsponSpeech, you need permission from AI Hub.

Pre-processed Manifest Files

Dataset       Unit        Manifest   Vocab    SP-Model
LibriSpeech   character   [Link]     [Link]   -
LibriSpeech   subword     [Link]     [Link]   [Link]
AISHELL-1     character   [Link]     [Link]   -
KsponSpeech   character   [Link]     [Link]   -
KsponSpeech   subword     [Link]     [Link]   [Link]
KsponSpeech   grapheme    [Link]     [Link]   -

KsponSpeech requires permission from AI Hub.
Please send an e-mail including a screenshot of the approval to [email protected].

Manifest File

  • Acoustic model manifest file format:
LibriSpeech/test-other/8188/269288/8188-269288-0052.flac        ▁ANNIE ' S ▁MANNER ▁WAS ▁VERY ▁MYSTERIOUS       4039 20 5 531 17 84 2352
LibriSpeech/test-other/8188/269288/8188-269288-0053.flac        ▁ANNIE ▁DID ▁NOT ▁MEAN ▁TO ▁CONFIDE ▁IN ▁ANYONE ▁THAT ▁NIGHT ▁AND ▁THE ▁KIND EST ▁THING ▁WAS ▁TO ▁LEAVE ▁HER ▁A LONE    4039 99 35 251 9 4758 11 2454 16 199 6 4 323 200 255 17 9 370 30 10 492
LibriSpeech/test-other/8188/269288/8188-269288-0054.flac        ▁TIRED ▁OUT ▁LESLIE ▁HER SELF ▁DROPP ED ▁A SLEEP        1493 70 4708 30 115 1231 7 10 1706
LibriSpeech/test-other/8188/269288/8188-269288-0055.flac        ▁ANNIE ▁IS ▁THAT ▁YOU ▁SHE ▁CALL ED ▁OUT        4039 34 16 25 37 208 7 70
LibriSpeech/test-other/8188/269288/8188-269288-0056.flac        ▁THERE ▁WAS ▁NO ▁REPLY ▁BUT ▁THE ▁SOUND ▁OF ▁HURRY ING ▁STEPS ▁CAME ▁QUICK ER ▁AND ▁QUICK ER ▁NOW ▁AND ▁THEN ▁THEY ▁WERE ▁INTERRUPTED ▁BY ▁A ▁GROAN     57 17 56 1368 33 4 489 8 1783 14 1381 133 571 49 6 571 49 82 6 76 45 54 2351 44 10 3154
LibriSpeech/test-other/8188/269288/8188-269288-0057.flac        ▁OH ▁THIS ▁WILL ▁KILL ▁ME ▁MY ▁HEART ▁WILL ▁BREAK ▁THIS ▁WILL ▁KILL ▁ME 299 46 71 669 50 41 235 71 977 46 71 669 50
...
...
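
The sketch below shows one way to read a line of this manifest format in Python. It assumes the three fields (audio path, tokenized transcript, token ids) are separated by single tab characters, which is how the sample above appears to be laid out; `ManifestEntry` and `parse_manifest_line` are names invented for this example and are not part of OpenSpeech.

```python
from typing import List, NamedTuple


class ManifestEntry(NamedTuple):
    audio_path: str
    tokens: List[str]
    token_ids: List[int]


def parse_manifest_line(line: str) -> ManifestEntry:
    # Assumption: exactly three tab-separated fields per line.
    audio_path, transcript, ids = line.rstrip("\n").split("\t")
    return ManifestEntry(audio_path, transcript.split(), [int(i) for i in ids.split()])


sample = (
    "LibriSpeech/test-other/8188/269288/8188-269288-0055.flac\t"
    "▁ANNIE ▁IS ▁THAT ▁YOU ▁SHE ▁CALL ED ▁OUT\t"
    "4039 34 16 25 37 208 7 70"
)
entry = parse_manifest_line(sample)
print(entry.audio_path, len(entry.tokens), entry.token_ids[:3])
```

In this sample the number of subword tokens equals the number of ids, which is a cheap consistency check to run over a whole manifest before training.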

Training examples

You can simply train with LibriSpeech dataset like below:

  • Example1: Train the conformer-lstm model with filter-bank features on GPU.
$ python ./openspeech_cli/hydra_train.py \
    dataset=librispeech \
    dataset.dataset_download=True \
    dataset.dataset_path=$DATASET_PATH \
    dataset.manifest_file_path=$MANIFEST_FILE_PATH \
    tokenizer=libri_subword \
    model=conformer_lstm \
    audio=fbank \
    lr_scheduler=warmup_reduce_lr_on_plateau \
    trainer=gpu \
    criterion=joint_ctc_cross_entropy

You can simply train with KsponSpeech dataset like below:

  • Example2: Train the listen-attend-spell model with mel-spectrogram features on TPU:
$ python ./openspeech_cli/hydra_train.py \
    dataset=ksponspeech \
    dataset.dataset_path=$DATASET_PATH \
    dataset.manifest_file_path=$MANIFEST_FILE_PATH \
    dataset.test_dataset_path=$TEST_DATASET_PATH \
    dataset.test_manifest_dir=$TEST_MANIFEST_DIR \
    tokenizer=kspon_character \
    model=listen_attend_spell \
    audio=melspectrogram \
    lr_scheduler=warmup_reduce_lr_on_plateau \
    trainer=tpu \
    criterion=joint_ctc_cross_entropy

You can simply train with AISHELL-1 dataset like below:

  • Example3: Train the quartznet model with mfcc features on GPU with FP16:
$ python ./openspeech_cli/hydra_train.py \
    dataset=aishell \
    dataset.dataset_path=$DATASET_PATH \
    dataset.dataset_download=True \
    dataset.manifest_file_path=$MANIFEST_FILE_PATH \
    tokenizer=aishell_character \
    model=quartznet15x5 \
    audio=mfcc \
    lr_scheduler=warmup_reduce_lr_on_plateau \
    trainer=gpu-fp16 \
    criterion=ctc
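
The three commands above differ only in their `key=value` overrides. When launching runs from Python, one option is to build the argument list programmatically, which avoids shell line-continuation pitfalls such as whitespace after a trailing backslash. This is only a sketch: `build_train_command` is a hypothetical helper, and the two paths are placeholders rather than real locations.

```python
import shlex
from typing import Dict, List


def build_train_command(overrides: Dict[str, str]) -> List[str]:
    """Build an argv list in which each override is a single 'key=value' argument."""
    argv = ["python", "./openspeech_cli/hydra_train.py"]
    argv.extend(f"{key}={value}" for key, value in overrides.items())
    return argv


# Overrides copied from Example3; the two paths are hypothetical placeholders.
argv = build_train_command({
    "dataset": "aishell",
    "dataset.dataset_path": "/data/aishell",
    "dataset.dataset_download": "True",
    "dataset.manifest_file_path": "/data/aishell/manifest.txt",
    "tokenizer": "aishell_character",
    "model": "quartznet15x5",
    "audio": "mfcc",
    "lr_scheduler": "warmup_reduce_lr_on_plateau",
    "trainer": "gpu-fp16",
    "criterion": "ctc",
})
# Print a shell-safe rendering for logging; argv itself can go to subprocess.run.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing `argv` to `subprocess.run(argv, check=True)` would start the run without going through a shell at all.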

Evaluation examples

  • Example1: Evaluate the listen_attend_spell model:
$ python ./openspeech_cli/hydra_eval.py \
    audio=melspectrogram \
    eval.model_name=listen_attend_spell \
    eval.dataset_path=$DATASET_PATH \
    eval.checkpoint_path=$CHECKPOINT_PATH \
    eval.manifest_file_path=$MANIFEST_FILE_PATH  
  • Example2: Evaluate the listen_attend_spell and conformer_lstm models as an ensemble:
$ python ./openspeech_cli/hydra_eval.py \
    audio=melspectrogram \
    eval.model_names=(listen_attend_spell, conformer_lstm) \
    eval.dataset_path=$DATASET_PATH \
    eval.checkpoint_paths=($CHECKPOINT_PATH1, $CHECKPOINT_PATH2) \
    eval.ensemble_weights=(0.3, 0.7) \
    eval.ensemble_method=weighted \
    eval.manifest_file_path=$MANIFEST_FILE_PATH  
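
To give an intuition for what a weighted ensemble with weights (0.3, 0.7) might compute, the sketch below takes a weighted average of two models' output distributions for a single decoding step. This is an assumption about the general technique, not a description of OpenSpeech's ensemble code, and the probability values are made up for illustration.

```python
from typing import List, Sequence


def weighted_ensemble(
    probs_per_model: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted average of per-model probability distributions over the vocabulary."""
    if len(probs_per_model) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    vocab_size = len(probs_per_model[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, probs_per_model)) / total
        for i in range(vocab_size)
    ]


# Made-up distributions over a three-token vocabulary for one decoding step.
las_probs = [0.6, 0.3, 0.1]
conformer_probs = [0.2, 0.7, 0.1]
combined = weighted_ensemble([las_probs, conformer_probs], [0.3, 0.7])
best_token = max(range(len(combined)), key=combined.__getitem__)
print(combined, best_token)
```

Dividing by the sum of the weights keeps the result a valid distribution even when the weights do not add up to 1.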

Language model training example

Language model training requires only data to be prepared in the following format:

openspeech is a framework for making end-to-end speech recognizers.
end to end automatic speech recognition is an emerging paradigm in the field of neural network-based speech recognition that offers multiple benefits.
because of these advantages, many end-to-end speech recognition related open sources have emerged.
...
...

Note that you need to use the same vocabulary as the acoustic model.

  • Example: Train the lstm_lm model:
$ python ./openspeech_cli/hydra_lm_train.py \
    dataset=lm \
    dataset.dataset_path=../../../lm.txt \
    tokenizer=kspon_character \
    tokenizer.vocab_path=../../../labels.csv \
    model=lstm_lm \
    lr_scheduler=tri_stage \
    trainer=gpu \
    criterion=perplexity
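
For reference, perplexity is conventionally defined as the exponential of the mean negative log-likelihood per token. The sketch below computes it from per-token natural-log probabilities; it assumes the `perplexity` criterion follows this standard definition, which has not been checked against the OpenSpeech source.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to every token behaves like a uniform
# choice among four options, so its perplexity should come out close to 4.
uniform = [math.log(1 / 4)] * 10
print(perplexity(uniform))
```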

Installation

This project recommends Python 3.7 or higher.
We recommend creating a new virtual environment for this project (using virtual env or conda).

Prerequisites

  • numpy: pip install numpy (Refer here for problem installing Numpy).
  • pytorch: Refer to PyTorch website to install the version w.r.t. your environment.
  • librosa: conda install -c conda-forge librosa (Refer here for problem installing librosa)
  • torchaudio: pip install torchaudio==0.6.0 (Refer here for problem installing torchaudio)
  • sentencepiece: pip install sentencepiece (Refer here for problem installing sentencepiece)
  • pytorch-lightning: pip install pytorch-lightning (Refer here for problem installing pytorch-lightning)
  • hydra: pip install hydra-core --upgrade (Refer here for problem installing hydra)
  • warp-rnnt: Refer to warp-rnnt page to install the library.
  • ctcdecode: Refer to ctcdecode page to install the library.

Install from PyPI

You can install OpenSpeech from PyPI:

pip install openspeech-core

Install from source

To install from source using setuptools, check out the source code and run the
following command:

$ ./install.sh

Install Apex (for 16-bit training)

For faster training install NVIDIA's apex library:

$ git clone https://github.com/NVIDIA/apex
$ cd apex

# ------------------------
# OPTIONAL: on your cluster you might need to load CUDA 10 or 9
# depending on how you installed PyTorch

# see available modules
module avail

# load correct CUDA before install
module load cuda-10.0
# ------------------------

# make sure you've loaded a GCC version > 4.0 and < 7.0
module load gcc-6.1.0

$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Troubleshooting and Contributing

If you have any questions, bug reports, or feature requests, please open an issue on GitHub.

We appreciate any kind of feedback or contribution. Feel free to proceed with small issues such as bug fixes and documentation improvements. For major contributions and new features, please discuss with the collaborators in the corresponding issues.

Code Style

We follow PEP 8 for code style. Docstring style is especially important because the documentation is generated from docstrings.

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

Citation

If you use the system for academic work, please cite:

@GITHUB{2021-OpenSpeech,
  author       = {Kim, Soohwan and Ha, Sangchun and Cho, Soyoung},
  author email = {[email protected], [email protected], [email protected]},
  title        = {OpenSpeech: Open-Source Toolkit for End-to-End Speech Recognition},
  howpublished = {\url{https://github.com/openspeech-team/openspeech}},
  docs         = {\url{https://openspeech-team.github.io/openspeech}},
  year         = {2021}
}
Comments
  • Need help in training with hyper-parameters

    I am interested in training this model which implements the below paper: Conformer: Convolution-augmented Transformer for Speech Recognition

    I believe the example 1 with conformer-lstm would train this model. Is my assumption right?

    Also, according to paper there are 3 different models that vary according to the hyper-parameter and I am interested in the "s" version of the model having around 10M parameters. How can I select this version or provide the hyper-parameters?

    Also, do you have any script for end-to-end inference (including language model)?

    GOOD FIRST ISSUE QUESTION 
    opened by debasish-mihup 20
  • vocab keyError

    Hello. vocab_label.csv was not generated, so I substituted the aihub_character_labels.csv file I had been using with KoSpeech. However, this raises a KeyError, so I am asking here. Does running ./hydra_train.py in openspeech generate the _label.csv file automatically?

    QUESTION 
    opened by miziworld 17
  • RuntimeWarning: invalid value encountered in true_divide

    ❓ Questions & Help

    339.896   Total estimated model params size (MB)
    Epoch 0: 1%|▎ | 700/101875 [05:16<12:42:03, 2.21it/s, loss=3.19, v_num=0, train_loss=3.300, train_cross_entropy_loss=3.530, train_ctc_loss=2.750, train_wer=1.000, train_cer=1.020]
    /home/bearking/voice_recognition/openspeech/openspeech/data/audio/dataset.py:175: RuntimeWarning: invalid value encountered in true_divide
      feature /= np.std(feature)
    Epoch 0: 1%|▍ | 825/101875 [06:10<12:37:20, 2.22it/s, loss=nan, v_num=0, train_loss=nan.0, train_cross_entropy_loss=nan.0, train_ctc_loss=nan.0, train_wer=1.000, train_cer=2.610]

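    Once the warning appears, the loss becomes NaN from then on and the CER keeps increasing.

    In this situation, which part of the data should I check? Training with other libraries worked fine without this problem.

    Details

    The command is as follows.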

    python hydra_train.py
    dataset=ksponspeech
    dataset.dataset_path=/home/bearking/voice_file/Training/AI_1
    dataset.manifest_file_path=/home/bearking/voice_recognition/KoreanSTT/transcripts.txt
    dataset.test_dataset_path=/home/hdd/young_voice/Validation
    dataset.test_manifest_dir=/home/hdd/young_voice/Validation
    tokenizer=kspon_character
    model=joint_ctc_listen_attend_spell
    audio=melspectrogram
    lr_scheduler=warmup_reduce_lr_on_plateau
    trainer=gpu
    criterion=joint_ctc_cross_entropy
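
One possible cause, offered as a guess rather than a confirmed diagnosis: if a feature array is constant (for example from a silent or empty audio file), its standard deviation is zero and `feature /= np.std(feature)` divides zero by zero, which produces NaN and the warning shown in the log. The sketch below shows a guarded normalization in plain Python; the function name and the `eps` value are assumptions, and `statistics.pstdev` is used because it matches the default population standard deviation of `np.std`.

```python
import math
import statistics
from typing import List, Sequence


def normalize(feature: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Zero-mean, unit-variance scaling that stays finite for constant input."""
    if any(not math.isfinite(x) for x in feature):
        raise ValueError("feature contains NaN or inf; inspect the source audio")
    mean = sum(feature) / len(feature)
    std = statistics.pstdev(feature)  # population std, like np.std's default
    # eps keeps the denominator non-zero when every value is identical.
    return [(x - mean) / (std + eps) for x in feature]


silent = [0.0] * 5
print(normalize(silent))
print(normalize([1.0, 2.0, 3.0]))
```

Scanning the dataset for files whose features are constant or non-finite before training would show whether this is what happens here.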

    QUESTION 
    opened by JaeungHyun 14
  • Cannot set dataset.dataset_path because of LexerNoViableAltException

    ❓ Questions & Help

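    dataset.dataset_path cannot be set because of a LexerNoViableAltException. Log: https://www.toptal.com/developers/hastebin/tavosicufa.sql

    Details

    I specified the dataset location with an absolute path rather than through an environment variable. The system is Kubuntu 21.10. I looked through other issue pages to check how my arguments differ, but I cannot tell what is different. It does work if I edit hydra_train.py and define the arguments there directly.

    The only environment variable I set myself is HYDRA_FULL_ERROR=1.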

    QUESTION 
    opened by tmvkrpxl0 12
  • UnicodeDecodeError when reading ksponspeech manifest file

    Environment info

    • Platform: Ubuntu 18.04
    • Python version: 3.8.3
    • PyTorch version (GPU?): 1.7.0
    • Using GPU in script?: Yes

    Information

    Model I am using (ListenAttendSpell, Transformer, Conformer ...):

    The problem arises when using:

    • [v] the official example scripts: (give details below)
    • [x] my own modified scripts: (give details below)

    To reproduce

    Steps to reproduce the behavior:

    1. Follow the script below. I set required files accordingly based on README.
    root@elf-desktop:/home/jun1.oh/workspace/openspeech# cat train_kspon.sh
    #!/bin/bash
    python ./openspeech_cli/hydra_train.py \
            dataset=ksponspeech \
        dataset.dataset_path=/mnt/dataset/KsponSpeech/kspon/kspon/ \
        dataset.manifest_file_path=/mnt/dataset/KsponSpeech/kspon/kspon_character_manifest.txt \
        dataset.test_dataset_path=/mnt/dataset/KsponSpeech/kspon/kspon_eval/ \
        dataset.test_manifest_dir=/mnt/dataset/KsponSpeech/kspon/scripts/ \
        tokenizer=kspon_character \
        model=joint_ctc_listen_attend_spell \
        audio=melspectrogram \
        lr_scheduler=warmup_reduce_lr_on_plateau \
        trainer=gpu \
        criterion=joint_ctc_cross_entropy
    

    Expected behavior

    My code works well with LibriSpeech dataset. I am trying to run the code with KsponSpeech dataset now. However, it is giving errors while reading a kspon_character_manifest.txt file with errors below.

    [2021-09-13 10:37:41,028][openspeech.utils][INFO] - PyTorch version : 1.7.0
    Error executing job with overrides: ['dataset=ksponspeech', 'dataset.dataset_path=/mnt/dataset/KsponSpeech/kspon/kspon/', 'dataset.manifest_file_path=/mnt/dataset/KsponSpeech/kspon/kspon_character_manifest.txt', 'dataset.test_dataset_path=/mnt/dataset/KsponSpeech/kspon/kspon_eval/', 'dataset.test_manifest_dir=/mnt/dataset/KsponSpeech/kspon/scripts/', 'tokenizer=kspon_character', 'model=joint_ctc_listen_attend_spell', 'audio=melspectrogram', 'lr_scheduler=warmup_reduce_lr_on_plateau', 'trainer=gpu', 'criterion=joint_ctc_cross_entropy']
    Traceback (most recent call last):
      File "./openspeech_cli/hydra_train.py", line 60, in <module>
        hydra_main()
      File "/opt/conda/lib/python3.8/site-packages/hydra/main.py", line 48, in decorated_main
        _run_hydra(
      File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
        run_and_report(
      File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
        raise ex
      File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
        return func()
      File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
        lambda: hydra.run(
      File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
        _ = ret.return_value
      File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
        raise self._return_value
      File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
        ret.return_value = task_function(task_cfg)
      File "./openspeech_cli/hydra_train.py", line 48, in hydra_main
        data_module.setup(tokenizer=tokenizer)
      File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 428, in wrapped_fn
        fn(*args, **kwargs)
      File "/home/jun1.oh/workspace/openspeech/openspeech/datasets/ksponspeech/lit_data_module.py", line 156, in setup
        audio_paths, transcripts = self._parse_manifest_file()
      File "/home/jun1.oh/workspace/openspeech/openspeech/datasets/ksponspeech/lit_data_module.py", line 120, in _parse_manifest_file
        for idx, line in enumerate(f.readlines()):
      File "/opt/conda/lib/python3.8/codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 55: invalid start byte
    

    When I set the encoding to 'cp949' instead of 'utf-8', this error goes away but other errors appear, so I think it is better to debug this one first.
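
A byte such as 0xbe cannot start a UTF-8 sequence, and the post notes that cp949 gets past this error, so the manifest may have been saved in a legacy Korean encoding. The sketch below tries a list of candidate encodings in order; `detect_encoding` is a hypothetical helper, and the candidate list is an assumption based on the error message above.

```python
from typing import Iterable, Optional


def detect_encoding(
    raw: bytes, candidates: Iterable[str] = ("utf-8", "cp949")
) -> Optional[str]:
    """Return the first candidate encoding that decodes the bytes without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


korean = "안녕하세요"
print(detect_encoding(korean.encode("utf-8")))
print(detect_encoding(korean.encode("cp949")))
```

If the file does turn out to be cp949, re-saving it as UTF-8 once may be simpler than changing the encoding argument in every place that reads it.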

    Thanks, Daniel

    BUG QUESTION 
    opened by jun-danieloh 10
  • Can't run the training examples

    ❓ Questions & Help

    I tried to learn how to use openspeech but encountered the following error message. Thank you for your answer.

    (1) omegaconf.errors.ConfigAttributeError: Key 'model' is not in struct full_key: model object_type=dict

    (2) It reports "No such file or directory" for the settings of dataset.dataset_path and dataset.manifest_file_path, but the directory and file are in the correct location.

    dataset.dataset_path="../../../../LibriSpeech/"
    dataset.manifest_file_path="./openspeech/datasets/librispeech/libri_subword_manifest.txt" \

    Details

    Below are my training script and error message.

    ===================== training script

    python ./openspeech_cli/hydra_train.py
    dataset="librispeech"
    dataset.dataset_download=False
    dataset.dataset_path="../../../../LibriSpeech/"
    dataset.manifest_file_path="./openspeech/datasets/librispeech/libri_subword_manifest.txt"
    tokenizer=libri_subword
    model="conformer_lstm"
    audio=fbank
    lr_scheduler=warmup_reduce_lr_on_plateau
    trainer=gpu
    criterion=cross_entropy

    ===================error
    /Desktop/CodeFolder/ASR/openspeech/openspeech/utils.py:88: FutureWarning: Pass y=[ 1.0289366e-05 1.9799588e-06 2.5269967e-06 ... 4.2585389e-06 -7.8615230e-06 -1.8652887e-05] as keyword args. From version 0.10 passing these as positional arguments will result in an error
      DUMMY_FEATURES = librosa.feature.melspectrogram(DUMMY_SIGNALS, n_mels=80)
    ./openspeech_cli/hydra_train.py:37: UserWarning: The version_base parameter is not specified. Please specify a compatability version level, or None. Will assume defaults for version 1.1
      @hydra.main(config_path=os.path.join("..", "openspeech", "configs"), config_name="train")
    /home/docker/.local/lib/python3.8/site-packages/hydra/core/default_element.py:124: UserWarning: In 'train': Usage of deprecated keyword in package header '# @package group'. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/changes_to_package_header for more information
      deprecation_warning(
    /home/docker/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    augment:
      apply_spec_augment: false
      apply_noise_augment: false
      apply_joining_augment: false
      apply_time_stretch_augment: false
      freq_mask_para: 27
      freq_mask_num: 2
      time_mask_num: 4
      noise_dataset_dir: None
      noise_level: 0.7
      time_stretch_min_rate: 0.7
      time_stretch_max_rate: 1.4
    dataset:
      dataset: librispeech
      dataset_path: ???
      dataset_download: false
      manifest_file_path: ../../../LibriSpeech/libri_subword_manifest.txt
    trainer:
      seed: 1
      accelerator: dp
      accumulate_grad_batches: 1
      num_workers: 4
      batch_size: 32
      check_val_every_n_epoch: 1
      gradient_clip_val: 5.0
      logger: wandb
      max_epochs: 20
      save_checkpoint_n_steps: 10000
      auto_scale_batch_size: binsearch
      sampler: else
      name: gpu
      device: gpu
      use_cuda: true
      auto_select_gpus: true

    Global seed set to 1
    [2022-12-23 03:29:47,326][openspeech.utils][INFO] - augment:
      apply_spec_augment: false
      apply_noise_augment: false
      apply_joining_augment: false
      apply_time_stretch_augment: false
      freq_mask_para: 27
      freq_mask_num: 2
      time_mask_num: 4
      noise_dataset_dir: None
      noise_level: 0.7
      time_stretch_min_rate: 0.7
      time_stretch_max_rate: 1.4
    dataset:
      dataset: librispeech
      dataset_path: ???
      dataset_download: false
      manifest_file_path: ../../../LibriSpeech/libri_subword_manifest.txt
    trainer:
      seed: 1
      accelerator: dp
      accumulate_grad_batches: 1
      num_workers: 4
      batch_size: 32
      check_val_every_n_epoch: 1
      gradient_clip_val: 5.0
      logger: wandb
      max_epochs: 20
      save_checkpoint_n_steps: 10000
      auto_scale_batch_size: binsearch
      sampler: else
      name: gpu
      device: gpu
      use_cuda: true
      auto_select_gpus: true

    [2022-12-23 03:29:47,373][openspeech.utils][INFO] - Operating System : Linux 5.15.0-52-generic
    [2022-12-23 03:29:47,374][openspeech.utils][INFO] - Processor : x86_64
    [2022-12-23 03:29:47,375][openspeech.utils][INFO] - device : NVIDIA GeForce RTX 3090
    [2022-12-23 03:29:47,375][openspeech.utils][INFO] - device : NVIDIA GeForce RTX 3090
    [2022-12-23 03:29:47,375][openspeech.utils][INFO] - CUDA is available : True
    [2022-12-23 03:29:47,375][openspeech.utils][INFO] - CUDA version : 11.3
    [2022-12-23 03:29:47,375][openspeech.utils][INFO] - PyTorch version : 1.10.0+cu113
    Error executing job with overrides: ['dataset=librispeech', 'dataset.dataset_download=False']
    Traceback (most recent call last):
      File "./openspeech_cli/hydra_train.py", line 42, in hydra_main
        logger, num_devices = parse_configs(configs)
      File "/Desktop/CodeFolder/ASR/openspeech/openspeech/utils.py", line 217, in parse_configs
        project=f"{configs.model.model_name}-{configs.dataset.dataset}",
    omegaconf.errors.ConfigAttributeError: Key 'model' is not in struct
        full_key: model
        object_type=dict

    Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
    ./asr.sh: line 6: dataset.dataset_path=../../../../LibriSpeech/: No such file or directory
    ./asr.sh: line 8: dataset.manifest_file_path=./openspeech/datasets/librispeech/libri_subword_manifest.txt: No such file or directory
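
The log shows only the first two overrides reaching Hydra, and the shell then tries to run later lines of asr.sh as commands. That pattern is consistent with a broken line continuation, such as whitespace after a trailing backslash, although the script itself is not shown so this remains a guess. The sketch below scans a script for that specific problem; `find_broken_continuations` is a hypothetical helper, and the sample script is made up.

```python
import re
from typing import List

# A backslash followed by spaces or tabs at the end of a line no longer
# escapes the newline, so the shell ends the command there.
BROKEN_CONTINUATION = re.compile(r"\\[ \t]+$")


def find_broken_continuations(script_text: str) -> List[int]:
    """Return 1-based line numbers where a backslash is followed by trailing whitespace."""
    return [
        number
        for number, line in enumerate(script_text.splitlines(), start=1)
        if BROKEN_CONTINUATION.search(line)
    ]


script = "python train.py \\\n    dataset=librispeech \\  \n    model=conformer_lstm\n"
print(find_broken_continuations(script))
```

A missing backslash at the end of a line produces the same symptom and would not be caught by this check.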

    opened by ccyang123 9
  • AttributeError: 'NoneType' object has no attribute 'sos_id'

    ❓ Questions & Help

    I am a beginner, so please excuse any basic questions. I was training conformer-large with the model as provided by kospeech, but stopped because the loss became NaN, and decided to build a speech recognizer with openspeech instead.

    First, I ran the same command as the KsponSpeech training example in the README. (image: train_execute)

    DATASET_PATH contains the KsponSpeech_01 ... 05 folders downloaded from AIHub, and MANIFEST_FILE_PATH holds the path to the kspon_character_manifest.txt file I received by email. TEST_DATASET_PATH contains the eval_clean and eval_other folders from AIHub, and TEST_MANIFEST_DIR is set to the name of the folder that contains the same emailed kspon_character_manifest.txt.

    My first question is whether the files each of these paths points to are set up correctly.

    To list everything I did after git clone: because of a SyntaxError, I edited openspeech/callbacks.py as shown here:

    (image: callbacks)

    Running again then raised OSError: Character label file (csv format) doesn't exist ../../../aihub_labels.csv. (images: log1, log2, log3, log4)

    So in ./openspeech/tokenizers/ksponspeech/character.py I set it to an absolute path: (image: character)

    The kspon_character_labels.csv file is the same one I received after sending the AIHub approval screenshot. Running again produced the following: (image: last_log)

    My further questions are: did any of the steps above introduce an error; is the default aihub_labels.csv the same file as the kspon_character_labels.csv I configured; and if not, what is aihub_labels.csv and how can I get it?

    Finally, if nothing in the steps above is wrong, how should I handle the error about the sos_id attribute?

    Sorry for the long post. I was worried that some small detail might matter a lot, so I wrote down everything I touched. I look forward to your reply!

    QUESTION 
    opened by phm1231 9
  • Cannot test the model.

    ❓ Questions & Help

    After finishing training, I tried to test the model in hydra_train.py with trainer.test(model, data_module, ckpt_path="/home/tmvkrpxl0/STT_KO/openspeech/outputs/2022-04-02/04-05-11/None/version_None/checkpoints/0_206666.ckpt"), but a ValueError prevents the test from running. Using hydra_eval.py gives the same result: a ValueError prevents the test from running.

    The hydra_eval.py command was:

    python3 openspeech_cli/hydra_eval.py \
    audio=melspectrogram \
    eval.dataset_path=/home/tmvkrpxl0/KoSpeech_Data/speech \
    eval.checkpoint_path="/home/tmvkrpxl0/STT_KO/openspeech/outputs/2022-04-02/04-05-11/None/version_None/checkpoints/0_206666.ckpt" \
    eval.manifest_file_path=/home/tmvkrpxl0/STT_KO/openspeech/generated \
    model=deepspeech2 \
    tokenizer=kspon_character \
    tokenizer.vocab_path=/home/tmvkrpxl0/STT_KO/openspeech/aihub_labels.csv
    

    and the hydra_train.py command was:

    python3 openspeech_cli/hydra_train.py \
    dataset=ksponspeech \
    dataset.dataset_path=/home/tmvkrpxl0/KoSpeech_Data/speech \
    dataset.manifest_file_path=/home/tmvkrpxl0/STT_KO/openspeech/generated \
    dataset.test_dataset_path=/home/tmvkrpxl0/KoSpeech_Data/speech/KsponSpeech_eval/ \
    dataset.test_manifest_dir=/home/tmvkrpxl0/KoSpeech_Data/script \
    tokenizer=kspon_character \
    model=deepspeech2 \
    audio=melspectrogram \
    lr_scheduler=warmup_reduce_lr_on_plateau \
    trainer=gpu-resume \
    trainer.max_epochs=1 \
    trainer.checkpoint_path="/home/tmvkrpxl0/STT_KO/openspeech/outputs/2022-04-02/04-05-11/None/version_None/checkpoints/0_206666.ckpt" \
    trainer.batch_size=6 \
    criterion=ctc
    

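    Those are the commands I ran.

    While trying various things to solve this myself, I noticed that every test data file provided by AIHub has a size of (a multiple of 16) + 1. The training files all have sizes that are multiples of 16, but the test files are a multiple of 16, plus 1.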
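
The file-size observation above can be checked mechanically. The following is a minimal sketch that counts how many files leave each remainder when their size is divided by 16. The helper name `size_remainders` and the temporary demo files are made up for illustration; on real data you would pass the paths of the training and test audio files instead.

```python
import os
import tempfile
from collections import Counter


def size_remainders(paths, modulus=16):
    """Count how many files leave each remainder when size is divided by modulus."""
    return Counter(os.path.getsize(p) % modulus for p in paths)


# Self-contained demo: temporary files stand in for audio files.
with tempfile.TemporaryDirectory() as tmp:
    sizes = {"train_a.bin": 32, "train_b.bin": 64, "eval_a.bin": 33}
    paths = []
    for name, size in sizes.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(b"\0" * size)
        paths.append(path)
    demo_counts = size_remainders(paths)

print(dict(demo_counts))
```

A remainder of 0 for every training file and 1 for every test file would confirm the pattern described in the issue.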

    DONE 
    opened by tmvkrpxl0 8
  • __init__() missing 2 required positional arguments: 'configs' and 'tokenizer'

    When running openspeech_cli.hydra_eval.py, the load_from_checkpoint method reports this error:

    __init__() missing 2 required positional arguments: 'configs' and 'tokenizer'

    I think this happens because the checkpoint does not pickle the 'configs' and 'tokenizer' parameters of the model class.
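
The failure mode described here can be reproduced without any framework. The sketch below uses a toy class and a hypothetical `load_from_checkpoint` function that only imitates the idea of re-calling the constructor with whatever arguments were stored in the checkpoint. It is not the PyTorch Lightning or OpenSpeech implementation, and the `hyper_parameters` key is used here purely as an illustrative name.

```python
class ToyModel:
    """Stand-in for a model class whose constructor needs two objects."""

    def __init__(self, configs, tokenizer):
        self.configs = configs
        self.tokenizer = tokenizer


def load_from_checkpoint(cls, checkpoint, **init_kwargs):
    """Rebuild `cls` from saved constructor arguments plus explicit overrides.

    Hypothetical loader written for this example only.
    """
    saved = dict(checkpoint.get("hyper_parameters", {}))
    saved.update(init_kwargs)
    return cls(**saved)


# A checkpoint that stored weights but no constructor arguments.
checkpoint = {"state_dict": {}, "hyper_parameters": {}}

try:
    load_from_checkpoint(ToyModel, checkpoint)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Supplying the missing arguments explicitly lets construction succeed.
model = load_from_checkpoint(
    ToyModel, checkpoint, configs={"model_name": "toy"}, tokenizer=object()
)
```

If the constructor arguments are not stored in the checkpoint, the caller has to provide them again at load time, which is the situation the error message points to.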

    BUG QUESTION 
    opened by wl-junlin 8
  • How to preprocess datasets?

    ❓ Questions & Help

    I'm sorry to bother you, but I'm having some problems.

    I wonder how OpenSpeech preprocesses the dataset. I see preprocess, but I don't see start up. I wonder how I should use it to get Manifest.

    I sent an email, but I didn't get a reply.

    QUESTION 
    opened by wblwty 7
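  • About gpu-resume

    ❓ Questions & Help

    I have a question about gpu-resume. Training will continue from the checkpoint I specify. Suppose I originally set the number of epochs to 10 and training stopped at epoch 5: should I set the epoch count to 10 again, or should I calculate the remaining epochs and set that instead?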
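
The arithmetic behind this question can be written down directly. This sketch assumes that the trainer restores the epoch counter from the checkpoint and treats `max_epochs` as an absolute total rather than an increment, which is how PyTorch Lightning's `Trainer` is generally understood to behave; whether OpenSpeech's gpu-resume configuration follows that exactly is not confirmed here. The function name `epochs_left` is hypothetical.

```python
def epochs_left(max_epochs, completed_epochs):
    """Epochs still to run if max_epochs is the total target, not an increment."""
    return max(max_epochs - completed_epochs, 0)


# Scenario from the question: target of 10 epochs, interrupted after 5.
remaining = epochs_left(max_epochs=10, completed_epochs=5)
print(remaining)

# Setting the total to the already-completed count would leave nothing to train.
nothing_left = epochs_left(max_epochs=5, completed_epochs=5)
```

Under that assumption, keeping the setting at 10 would run the remaining five epochs, while setting it to 5 would stop immediately.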

    QUESTION DONE 
    opened by MunJeongHyeon 7
  • Unsupported criterion: CTCLoss (error)

    ❓ Questions & Help

    When I changed the "criterion" from 'cross_entropy' to 'ctc', the following error occurred (cross_entropy works fine):

    File "/Desktop/CodeFolder/ASR/openspeech/openspeech/models/openspeech_encoder_decoder_model.py", line 97, in collect_outputs
      raise ValueError(f"Unsupported criterion: {self.criterion}")
    ValueError: Unsupported criterion: CTCLoss( (ctc_loss): CTCLoss()

    Details

    My training script is as follows:
    python ./openspeech_cli/hydra_train.py dataset="librispeech" dataset.dataset_download=False dataset.dataset_path=$DATASET_PATH dataset.dataset_path="/dataSSD/" dataset.manifest_file_path=$MANIFEST_FILE_PATH dataset.manifest_file_path="/dataSSD/LibriSpeech/libri_subword_manifest.txt" tokenizer=libri_subword tokenizer.vocab_path=$VOCAB_FILE_PATH tokenizer.vocab_path="/dataSSD/LibriSpeech" model=conformer_lstm audio=fbank trainer=gpu lr_scheduler=warmup_reduce_lr_on_plateau ++trainer.accelerator=dp ++trainer.batch_size=64 ++trainer.num_workers=4 criterion=ctc

    opened by ccyang123 3
  • AttributeError: 'NoneType' object has no attribute 'sos_id'

    ❓ Questions & Help

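    After confirming that the acoustic model (AM) trains well, I am now testing language model (LM) training. However, before LM training starts, I get the error AttributeError: 'NoneType' object has no attribute 'sos_id'. I used the same labels.csv as for the AM, so what am I missing?

    Details

    (image: option) Command options

    (image: log) Error log

    (image: labels) labels.csv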

    opened by parkmy123 0
  • Questions for LibriSpeech setting

    ❓ Questions & Help

    I'm learning how to use the LibriSpeech dataset to train the squeezeformer network. After 20 epochs of training, both the evaluation WER (0.6368) and CER (0.4251) are still very high and no longer improving (training WER 0.6405, CER 0.4278).

    These results seem inconsistent with the accuracies claimed in the paper (CER, WER < 0.1), so I think there must be something wrong with my settings.

    (1) Has anyone obtained CER/WER below 0.1 using the openspeech code with the LibriSpeech dataset?

    (2) Which tokenizer should I use to get good accuracy, libri_subword or libri_character? I am currently using libri_subword.

    (3) Is my training script correct?

    Details

    (a) The training and evaluation data setting in "preprocess.py" is as follows:
    LIBRI_SPEECH_DATASETS = [
        "train-960",
        "dev-clean",
        "dev-other",
        "test-clean",
        "test-other",
    ]

    (b) My training script is as follows:
    python ./openspeech_cli/hydra_train.py dataset="librispeech" dataset.dataset_download=False dataset.dataset_path=$DATASET_PATH dataset.dataset_path="/dataSSD/" dataset.manifest_file_path=$MANIFEST_FILE_PATH dataset.manifest_file_path="/dataSSD/LibriSpeech/libri_subword_manifest.txt" tokenizer=libri_subword tokenizer.vocab_path=$VOCAB_FILE_PATH tokenizer.vocab_path="/dataSSD/LibriSpeech" model=squeezeformer_lstm audio=fbank trainer=gpu lr_scheduler=warmup_reduce_lr_on_plateau ++trainer.accelerator=dp ++trainer.batch_size=128 ++trainer.num_workers=4 criterion=cross_entropy

    (c) The training results after 19 epochs are as follows: (image) (image)
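
For readers interpreting the WER and CER figures quoted above: both are commonly defined as the edit distance between hypothesis and reference divided by the reference length, measured in words and characters respectively. The following is a minimal sketch of that standard definition, not OpenSpeech's own metric code. The function names and the sample sentences are made up, and counting spaces as characters for CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Total edit distance over total reference length, by word or character."""
    split = str.split if unit == "word" else list
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = split(ref), split(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    return errors / total if total else 0.0


# Made-up example pair: one substituted word and one deleted word.
refs = ["the cat sat on the mat"]
hyps = ["the cat sit on mat"]
wer = error_rate(refs, hyps, unit="word")
cer = error_rate(refs, hyps, unit="char")
print(f"WER={wer:.4f} CER={cer:.4f}")
```

On this scale a WER of 0.6368 corresponds to roughly two word errors for every three reference words, which is why the figures in the issue stand out against a reported value below 0.1.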

    opened by ccyang123 0
  • Could you share the environment of yours to install?

    ❓ Questions & Help

    Thanks for providing an amazing framework! Unfortunately, I have run into problems installing it. I think they are caused by version mismatches between packages. I saw that you shared the torch version and a few others, but many packages have changed a lot since last year.

    Could you share your conda and pip environment?

    opened by sdeva14 1
  • can't pickle dict_keys objects.

    I chose to use multiple GPUs for training and ran into a problem. The error is: can not pickle dict_keys objects. I don't know how to solve it. After checking, this problem seems to occur only on Windows, but I am using Ubuntu. Could you tell me how to solve it? Thank you.
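
The underlying Python behaviour is easy to reproduce on its own: a `dict_keys` view cannot be pickled, while a list built from it can. Training setups that start worker processes may need to pickle objects they hand over, so a stored `dict_keys` view is one plausible trigger. Where exactly such a view is held in this case is not known, and the `vocab` dictionary below is a made-up stand-in.

```python
import pickle

# Made-up vocabulary; any dict shows the same behaviour.
vocab = {"<sos>": 0, "<eos>": 1}

try:
    pickle.dumps(vocab.keys())
    pickle_error = None
except TypeError as exc:
    pickle_error = str(exc)
print(pickle_error)

# A list holds the same keys and survives a pickle round trip.
restored = pickle.loads(pickle.dumps(list(vocab.keys())))
print(restored)
```

Converting the view with `list(...)` before it is stored on an object that gets pickled is the usual workaround.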

    image

    opened by bobo712 1
  • Do we have training results?

    ❓ Questions & Help

    Thanks for your great work on ASR. I am wondering whether training results are available for the provided models. I think they are important for users who want to gauge model performance.

    opened by Peach-He 0
Releases(v0.4.0)
  • v0.4.0(May 22, 2022)

    What's Changed

    • Resolved #38 by @upskyy in https://github.com/openspeech-team/openspeech/pull/56
    • Add CheckpointEveryNSteps class for save checkpoint every N steps. by @sooftware in https://github.com/openspeech-team/openspeech/pull/57
    • Resolved #58 by @upskyy in https://github.com/openspeech-team/openspeech/pull/59
    • Resolved(#62) by @upskyy in https://github.com/openspeech-team/openspeech/pull/63
    • Resolved #71 by @upskyy in https://github.com/openspeech-team/openspeech/pull/72
    • Resolved #70 by @upskyy in https://github.com/openspeech-team/openspeech/pull/73
    • Resolved #76 by @upskyy in https://github.com/openspeech-team/openspeech/pull/78
    • Resolved #79 by @upskyy in https://github.com/openspeech-team/openspeech/pull/80
    • Add uniform-length batching (smart batching) [resolved #82] - Soohwan Kim by @sooftware in https://github.com/openspeech-team/openspeech/pull/83
    • openspeech/datasets/aishell/preprocess.py line137 modification by @wuxiuzhi738 in https://github.com/openspeech-team/openspeech/pull/95
    • Fix typos by @sooftware in https://github.com/openspeech-team/openspeech/pull/107
    • fix url typo error by @YongWookHa in https://github.com/openspeech-team/openspeech/pull/114
    • Resolved #116 by @upskyy in https://github.com/openspeech-team/openspeech/pull/118
    • Resolved #128 by @upskyy in https://github.com/openspeech-team/openspeech/pull/129
    • Update evaluation codes (Fixes #86) by @upskyy in https://github.com/openspeech-team/openspeech/pull/145
    • Version 0.4.0 by @upskyy in https://github.com/openspeech-team/openspeech/pull/165

    New Contributors

    • @wuxiuzhi738 made their first contribution in https://github.com/openspeech-team/openspeech/pull/95
    • @YongWookHa made their first contribution in https://github.com/openspeech-team/openspeech/pull/114

    Full Changelog: https://github.com/openspeech-team/openspeech/compare/v0.3.0...v0.4.0

    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Jul 20, 2021)

    • Vocabulary => Tokenizer class
    • Add RNN Transducer beam search
    • Add Transformer Transducer beam search
    • Re-documentation
    • Re-factoring models directory
    Source code(tar.gz)
    Source code(zip)
  • v0.2.1(Jul 18, 2021)

    Version 0.2.1

    • Add Transformer-transducer
    • Add ContextNet
    • Document update
    • Add language model training pipeline
      • Add lstm_lm model
      • Add transformer_lm model
      • Add perplexity loss function
    • Add string_to_label method to Vocabulary class
    • Fix errors
      • issue #47
    Source code(tar.gz)
    Source code(zip)
  • 0.2(Jun 7, 2021)

  • v0.1(Jun 6, 2021)

Owner
Openspeech TEAM
Open source ecosystem for automatic speech recognition.
Openspeech TEAM
Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.

Contributing to OpenSpeech OpenSpeech provides reference implementations of various ASR modeling papers and three languages recipe to perform ta

Openspeech TEAM 513 Jan 3, 2023
Modular and extensible speech recognition library leveraging pytorch-lightning and hydra.

Lightning ASR Modular and extensible speech recognition library leveraging pytorch-lightning and hydra What is Lightning ASR • Installation • Get Star

Soohwan Kim 40 Sep 19, 2022
Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra. What is Lightning Tran

Pytorch Lightning 581 Dec 21, 2022
Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

Espresso Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning libra

Yiming Wang 919 Jan 3, 2023
PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

Chung-Ming Chien 1k Dec 30, 2022
Athena is an open-source implementation of end-to-end speech processing engine.

Athena is an open-source implementation of end-to-end speech processing engine. Our vision is to empower both industrial application and academic research on end-to-end models for speech processing. To make speech processing available to everyone, we're also releasing example implementation and recipe on some opensource dataset for various tasks (Automatic Speech Recognition, Speech Synthesis, Voice Conversion, Speaker Recognition, etc).

Ke Technologies 34 Sep 8, 2022
Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge This is an implementation of the paper,

Mutian He 19 Oct 14, 2022
End-to-End Speech Processing Toolkit

ESPnet: end-to-end speech processing toolkit system/pytorch ver. 1.0.1 1.1.0 1.2.0 1.3.1 1.4.0 1.5.1 1.6.0 1.7.1 1.8.1 ubuntu18/python3.8/pip ubuntu18

ESPnet 5.9k Jan 3, 2023
SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch.

The goal is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition, speaker recognition, speech enhancement, multi-microphone signal processing and many others.

SpeechBrain 5.1k Jan 9, 2023
Mirco Ravanelli 2.3k Dec 27, 2022
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

CRNN paper:An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 1. create your ow

Tsukinousag1 3 Apr 2, 2022
Pytorch-NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

null 186 Dec 24, 2022
A PyTorch Implementation of End-to-End Models for Speech-to-Text

speech Speech is an open-source package to build end-to-end models for automatic speech recognition. Sequence-to-sequence models with attention, Conne

Awni Hannun 647 Dec 25, 2022
An open source library for deep learning end-to-end dialog systems and chatbots.

DeepPavlov is an open-source conversational AI library built on TensorFlow, Keras and PyTorch. DeepPavlov is designed for development of production re

Neural Networks and Deep Learning lab, MIPT 6k Dec 30, 2022
An open source library for deep learning end-to-end dialog systems and chatbots.

DeepPavlov is an open-source conversational AI library built on TensorFlow, Keras and PyTorch. DeepPavlov is designed for development of production re

Neural Networks and Deep Learning lab, MIPT 6k Dec 31, 2022
An open source library for deep learning end-to-end dialog systems and chatbots.

DeepPavlov is an open-source conversational AI library built on TensorFlow, Keras and PyTorch. DeepPavlov is designed for development of production re

Neural Networks and Deep Learning lab, MIPT 5k Feb 18, 2021
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

VinAI Research 109 Dec 2, 2022