Speech Recognition using DeepSpeech2.

Overview

deepspeech.pytorch


Implementation of DeepSpeech2 for PyTorch using PyTorch Lightning. The repo supports training, testing and inference with the DeepSpeech2 model. Optionally, a KenLM language model can be used at inference time.

Install

Several libraries need to be installed for training to work. The instructions below assume everything is installed into an Anaconda environment on Ubuntu, with PyTorch already installed.

Install PyTorch if you haven't already.

If you want decoding to support beam search with an optional language model, install ctcdecode:

git clone --recursive https://github.com/parlance/ctcdecode.git
cd ctcdecode && pip install .

Finally clone this repo and run this within the repo:

pip install -r requirements.txt
pip install -e . # Dev install

If you plan to use multi-node training, you'll need etcd. Below is the command to install it on Ubuntu.

sudo apt-get install etcd

Docker

To use the image with a GPU you'll need to have nvidia-docker installed.

sudo docker run -ti --gpus all -v `pwd`/data:/workspace/data --tmpfs /tmp -p 8888:8888 --net=host --ipc=host seannaren/deepspeech.pytorch:latest # Opens a Jupyter notebook, mounting the /data drive in the container

Optionally you can use the command line by changing the entrypoint:

sudo docker run -ti --gpus all -v `pwd`/data:/workspace/data --tmpfs /tmp --entrypoint=/bin/bash --net=host --ipc=host seannaren/deepspeech.pytorch:latest

Training

Datasets

Currently supports AN4, TEDLIUM, Voxforge, Common Voice and LibriSpeech. Scripts will set up the dataset and create the manifest files used in data loading. The scripts can be found in the data/ folder. Many of the scripts allow you to download the raw datasets separately if you choose to do so.

Training Commands

AN4
cd data/ && python an4.py && cd ..

python train.py +configs=an4
LibriSpeech
cd data/ && python librispeech.py && cd ..

python train.py +configs=librispeech
Common Voice
cd data/ && python common_voice.py && cd ..

python train.py +configs=commonvoice
TEDLIUM
cd data/ && python ted.py && cd ..

python train.py +configs=tedlium

Custom Dataset

To create a custom dataset you must create a JSON file containing the locations of the training/testing data. This has to be in the format of:

{
  "root_path":"path/to",
  "samples":[
    {"wav_path":"audio.wav","transcript_path":"text.txt"},
    {"wav_path":"audio2.wav","transcript_path":"text2.txt"},
    ...
  ]
}

Here root_path is the root directory, wav_path is the path to the audio file, and transcript_path is the path to a text file containing the transcript on one line. The manifest can then be used as described below.
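
For reference, below is a minimal sketch of generating such a manifest with Python. The build_manifest helper and the assumption that each transcript sits next to its audio file with the same name are illustrative only, not part of the repo:

import json
from pathlib import Path

def build_manifest(root_dir, out_path):
    """Collect wav/transcript pairs under root_dir into a V3-style JSON manifest."""
    root = Path(root_dir)
    samples = []
    for wav in sorted(root.rglob("*.wav")):
        txt = wav.with_suffix(".txt")  # assumes the transcript sits next to the audio file
        if txt.exists():
            samples.append({
                "wav_path": str(wav.relative_to(root)),
                "transcript_path": str(txt.relative_to(root)),
            })
    with open(out_path, "w") as f:
        json.dump({"root_path": str(root), "samples": samples}, f, indent=2)

build_manifest("path/to", "custom_train_manifest.json")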

Note on CSV files ...

Up until release V2.1, deepspeech.pytorch used CSV manifest files instead of JSON. These manifest files are formatted as a two-column table:

/path/to/audio.wav,/path/to/text.txt
/path/to/audio2.wav,/path/to/text2.txt
...

Note that this format is incompatible with V3.0 onwards.
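
If you still have old CSV manifests, a small one-off conversion along the lines of the sketch below can produce the new JSON format. The csv_to_json_manifest helper and its root_path handling are assumptions; check the result against your data configuration:

import csv
import json

def csv_to_json_manifest(csv_path, json_path, root_path="/"):
    """Convert a two-column CSV manifest into a V3-style JSON manifest."""
    samples = []
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if len(row) != 2:
                continue  # skip blank or malformed lines
            samples.append({"wav_path": row[0], "transcript_path": row[1]})
    with open(json_path, "w") as f:
        json.dump({"root_path": root_path, "samples": samples}, f, indent=2)

csv_to_json_manifest("old_manifest.csv", "new_manifest.json")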

Merging multiple manifest files

To create bigger manifest files (to train/test on multiple datasets at once) we can merge manifest files together like below.

cd data/
python merge_manifests.py manifest_1.json manifest_2.json --out new_manifest_dir

Modifying Training Configs

Configuration is done via Hydra.

Defaults can be seen in config.py. Below is how you can override values set already:

python train.py data.train_path=data/train_manifest.json data.val_path=data/val_manifest.json

Use python train.py --help for all parameters and options.

You can also specify a config file to keep parameters stored in a yaml file like so:

Create folder experiment/ and file experiment/an4.yaml:

data:
  train_path: data/an4_train_manifest.json
  val_path: data/an4_val_manifest.json
python train.py +experiment=an4

To see options available, check here.

Multi-GPU Training

We support single-machine multi-GPU training via PyTorch Lightning.

Below is an example command when training on a machine with 4 local GPUs:

python train.py +configs=an4 trainer.gpus=4

Multi-Node Training

Multi-machine training is also supported using TorchElastic. This requires a node to act as an explicit etcd host (it could be one of the GPU nodes, but this isn't recommended), a shared mount across your cluster to load/save checkpoints, and communication between the nodes.

Below is an example where we've set one of our GPU nodes as the etcd host. If you're scaling up, we suggest running etcd on a separate instance from your GPU nodes, as it is a single point of failure.

The example below assumes a shared drive called /share where checkpoints and data are stored.

Run on the etcd host:

PUBLIC_HOST_NAME=127.0.0.1 # Change to public host name for all nodes to connect
etcd --enable-v2 \
     --listen-client-urls http://$PUBLIC_HOST_NAME:4377 \
     --advertise-client-urls http://$PUBLIC_HOST_NAME:4377 \
     --listen-peer-urls http://$PUBLIC_HOST_NAME:4379

Run on each GPU node:

python -m torchelastic.distributed.launch \
        --nnodes=2 \
        --nproc_per_node=4 \
        --rdzv_id=123 \
        --rdzv_backend=etcd \
        --rdzv_endpoint=$PUBLIC_HOST_NAME:4377 \
        train.py data.train_path=/share/data/an4_train_manifest.json \
                 data.val_path=/share/data/an4_val_manifest.json model.precision=half \
                 data.num_workers=8 checkpoint.save_folder=/share/checkpoints/ \
                 checkpoint.checkpoint=true checkpoint.load_auto_checkpoint=true checkpointing.save_n_recent_models=3 \
                 data.batch_size=8 trainer.max_epochs=70 \
                 trainer.accelerator=ddp trainer.gpus=4 trainer.num_nodes=2

Using the load_auto_checkpoint=true flag, we can resume training from the latest saved checkpoint.

Currently it is expected that there is an NFS drive/shared mount across all nodes within the cluster to load the latest checkpoint from.

Augmentation

There is support for three different types of augmentations: SpecAugment, noise injection and random tempo/gain perturbations.

SpecAugment

Applies simple spectral augmentation techniques directly on Mel spectrogram features to make the model more robust to variations in the input data. To enable SpecAugment, use the --spec-augment flag when training.

SpecAugment implementation was adapted from this project.

Noise Injection

Dynamically adds noise into the training data to increase robustness. To use it, first fill a directory with all the noise files you want to sample from. The data loader will randomly pick samples from this directory.

To enable noise injection, use --noise-dir /path/to/noise/dir/ to specify where your noise files are. There are a few noise parameters to tweak, such as --noise-prob to set the probability that noise is added, and --noise-min and --noise-max to set the minimum and maximum noise levels added during training.

Included is a script to inject noise into an audio file to hear what different noise levels/files would sound like. Useful for curating the noise dataset.

python noise_inject.py --input-path /path/to/input.wav --noise-path /path/to/noise.wav --output-path /path/to/input_injected.wav --noise-level 0.5 # higher levels means more noise

Tempo/Gain Perturbation

Applies small changes to the tempo and gain when loading audio to increase robustness. To enable it, use the --speed-volume-perturb flag when training.

Checkpoints

Typically checkpoints are stored in lightning_logs/ in the current working directory of the script.

This can be adjusted:

python train.py checkpoint.file_path=save_dir/

To load a previously saved checkpoint:

python train.py trainer.resume_from_checkpoint=lightning_logs/deepspeech_checkpoint_epoch_N_iter_N.ckpt

This continues from the same training state.

Testing/Inference

To evaluate a trained model on a test set (which must be in the same format as the training set):

python test.py model.model_path=models/deepspeech.pth test_path=/path/to/test_manifest.json

An example script to output a transcription has been provided:

python transcribe.py model.model_path=models/deepspeech.pth audio_path=/path/to/audio.wav

If you used mixed precision or half precision when training the model, you can pass model.precision=half for a speed/memory benefit.

Inference Server

Included is a basic server script that allows POST requests to be sent to the server to transcribe files.

python server.py --host 0.0.0.0 --port 8000 # Run on one window

curl -X POST http://0.0.0.0:8000/transcribe -H "Content-type: multipart/form-data" -F "file=@/path/to/input.wav"
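
For reference, a minimal Python client for the same endpoint, assuming the /transcribe route and the multipart "file" field shown in the curl command above, could look like this:

import requests

# Post a WAV file to the running transcription server and print its response.
with open("/path/to/input.wav", "rb") as f:
    response = requests.post(
        "http://0.0.0.0:8000/transcribe",
        files={"file": ("input.wav", f, "audio/wav")},
    )
print(response.text)  # transcription returned by the server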

Using an ARPA LM

We support using KenLM-based LMs. Below are instructions on how to take the LibriSpeech LMs found here and tune the decoding parameters to get the best results on LibriSpeech.

Tuning the LibriSpeech LMs

First ensure you've set up the LibriSpeech datasets from the data/ folder. In addition, download the latest pre-trained LibriSpeech model from the releases page, as well as the ARPA model you want to tune from here. The steps below use the 3-gram ARPA model (3e-7 prune).

First we need to generate the acoustic output used to evaluate the model on the LibriSpeech validation set.

python test.py data.test_path=data/librispeech_val_manifest.json model.model_path=librispeech_pretrained_v2.pth save_output=librispeech_val_output.npy

We use a beam width of 128, which gives reasonable results. We suggest using a CPU-intensive node to carry out the grid search.

python search_lm_params.py --num-workers 16 --saved-output librispeech_val_output.npy --output-path libri_tune_output.json --lm-alpha-from 0 --lm-alpha-to 5 --lm-beta-from 0 --lm-beta-to 3 --lm-path 3-gram.pruned.3e-7.arpa  --model-path librispeech_pretrained_v2.pth --beam-width 128 --lm-workers 16

This will run a grid search across the alpha/beta parameters using a beam width of 128. Use the below script to find the best alpha/beta params:

python select_lm_params.py --input-path libri_tune_output.json

Use the alpha/beta parameters when using the beam decoder.
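
As a rough guide (the exact scoring depends on the ctcdecode version), the beam decoder ranks hypotheses by a combined score of roughly the form acoustic log-probability + alpha * language model log-probability + beta * word count, so alpha controls how strongly the LM is weighted and beta compensates for the per-word penalty the LM introduces.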

Building your own LM

To build your own LM you need to use the KenLM repo found here. Have a read of the documentation to get a sense of how to train your own LM. Once your LM is trained, the steps above can be used to find the appropriate decoding parameters.

Alternate Decoders

By default, test.py and transcribe.py use a GreedyDecoder which picks the highest-likelihood output label at each timestep. Repeated and blank symbols are then filtered to give the final output.

A beam search decoder can optionally be used by installing the ctcdecode library as described in the Install section. The test and transcribe scripts have an lm config. To use the beam decoder, add lm.decoder_type=beam. The beam decoder enables additional decoding parameters (an example command is given after the list below):

  • lm.beam_width how many beams to consider at each timestep
  • lm.lm_path optional binary KenLM language model to use for decoding
  • lm.alpha weight for language model
  • lm.beta bonus weight for words
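
As an illustration, a test run combining these options against a KenLM ARPA model could look like the following; the checkpoint, manifest and LM paths are placeholders, and the tuned LibriSpeech alpha/beta values shown are taken from the release notes below:

python test.py model.model_path=librispeech.ckpt test_path=data/libri_test_clean_manifest.json lm.decoder_type=beam lm.alpha=1.97 lm.beta=4.36 lm.beam_width=1024 lm.lm_path=3-gram.arpa lm.lm_workers=16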

Time offsets

Use the offsets=true flag to get positional information for each character in the transcription when using the transcribe.py script. The offsets are based on the size of the output tensor, which you may need to convert into the format you require. For example, based on the default parameters you can multiply the offsets by a scalar (duration of the file in seconds / size of the output) to get the offsets in seconds.
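
Below is a rough sketch of that conversion; duration_seconds, output_size and offsets are placeholder values you would obtain from the audio file and from transcribe.py, not repo APIs:

duration_seconds = 3.5    # length of the input audio file
output_size = 175         # number of timesteps in the model output
offsets = [12, 30, 48]    # example character offsets returned with offsets=true

seconds_per_step = duration_seconds / output_size
offsets_in_seconds = [o * seconds_per_step for o in offsets]
print(offsets_in_seconds)  # approximately [0.24, 0.6, 0.96]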

Pre-trained models

Pre-trained models can be found under releases here.

Acknowledgements

Thanks to Egor and Ryan for their contributions!

Comments
  • Language model to predict.py

    I am very new to this field. I ran train.py on the LibriSpeech dataset, and after training completed I ran test.py as follows -

    python test.py --model_path models/deepspeech_final.pth.tar --val_manifest data/libri_test_manifest.csv --cuda

    I got following output: Validation Summary Average WER 3.972 Average CER 0.747

    One more thing I noticed: the test.py argument is --val_manifest, which is the validation manifest file, but I think it should be --test_manifest?

    Now I wanted to test the same model on unseen data using predict.py, but how do I include language model?

    opened by mit456 27
  • Pre-trained models tracker

    On each of the datasets provided, we must train a Deepspeech model. The overall architecture is encompassed in this command:

    python train.py  --rnn_type gru --hidden_size 800 --hidden_layers 5 --checkpoint --visdom --train_manifest /path/to/train_manifest.csv --val_manifest /path/to/val_manifest.csv --epochs 100 --num_workers $(nproc) --cuda
    

    In the above command you must replace the manifest paths with the correct paths to the dataset. A few notes:

    • No noise injection for the pre-trained models, or augmentations
    • Train till convergence (should get a nice smooth training curve hopefully!)
    • For smaller datasets, you may need to reduce the learning rate annealing by adding the flag --learning_anneal and setting it to a smaller value, like 1.01. For larger datasets, the default is fine (up to around 4.5k hours from internal testing on the deepspeech.torch version)

    A release will be cut from the DeepSpeech package that will have the models, and a reference to the latest release added to the README to find latest models!

    Progress tracker for datasets:

    • [x] AN4
    • [x] TEDLium
    • [x] LibriSpeech

    Let me know if you plan on working on running any of these, and I'll update the ticket with details!

    opened by SeanNaren 24
  • Validation loss increasing while WER decreases

    overfit

    I would like to believe that the model is overfitting but why would the WER keep decreasing if it was overfitting?

    The architecture is as follows: 500 hidden size, 5 RNN layers, default LR, 1.001 annealing factor which I'm increasing by 0.001 every epoch.

    I'm training using Librispeech train-clean-100.tar.gz and validating on dev-clean.tar.gz

    opened by SiddGururani 24
  • KenLM integration (Beam search)

    To fully get deepspeech integration, there needs to be a beam search across a language model constrained to a dictionary. I know a few people have been working on this recently and this issue will monitor progress!

    In addition there is C code for KenLM beam search here for Tensorflow that should be portable from what I can see here.

    enhancement help wanted 
    opened by SeanNaren 23
  • pre-trained model trained on all 3 data sets

    This is a wishlist item, and I do not know whether it would make sense: How about a model that is trained on all 3 datasets (AN4, LibriSpeech, TEDLIUM)?

    enhancement help wanted stale 
    opened by zenogantner 19
  • Segmentation fault during training (Volta, others)

    Training on TED as extracted from python ted.py ..., on AWS p3.2xlarge instance with CUDA 9.0, CuDNN 7.0.3, Ubuntu 16.04, and Python 3.5.4 results in Segmentation fault (core dumped) at some point during the first epoch (usually around 70-80% of the way through the batches), seemingly regardless of batch size (tried 32, 26, 12, and 4; also tried with p3.8xlarge and batch size 20). Worth mentioning, but I did not install MAGMA as per the pytorch conda installation instructions:

    # Add LAPACK support for the GPU
    conda install -c soumith magma-cuda80 # or magma-cuda75 if CUDA 7.5

    as it seems that the versions mentioned there are incompatible with CUDA 9.0.

    Edit: last output from dmesg

    [14531.790543] python[2191]: segfault at 100324c2400 ip 00007f165177a04a sp 00007f15c1c28c98 error 4 in libcuda.so.384.90[7f16515b2000+b1f000]

    bug stale 
    opened by aaronzira 19
  • DeepSpeech.PyTorch stops working after installing Torch to use also DeepSpeech.Torch

    Dear friends,

    My DeepSpeech.PyTorch stopped working after installing Torch to use also DeepSpeech.Torch. See the logs below. It is very similar to another issue in this repo, and they said we should use another gcc, but I am not sure exactly what the REAL problem is.

    If I move the torch installation directory, DeepSpeech.PyTorch works again! If I move the torch installation directory back, DeepSpeech.PyTorch fails!

    > dlm@vm001nc6:~/code/deepspeech.pytorch$
    > dlm@vm001nc6:~/code/deepspeech.pytorch$
    > dlm@vm001nc6:~/code/deepspeech.pytorch$ python train.py --train_manifest data/train_manifest.csv --val_manifest data/val_manifest.csv
    > Traceback (most recent call last):
    >   File "train.py", line 9, in <module>
    >     from warpctc_pytorch import CTCLoss
    >   File "/home/dlm/anaconda3/lib/python3.6/site-packages/warpctc_pytorch/__init__.py", line 7, in <module>
    >     from ._warp_ctc import lib as _lib, ffi as _ffi
    > ImportError: /home/dlm/anaconda3/lib/python3.6/site-packages/torch/lib/../../../../libgomp.so.1: version `GOMP_4.0' not found (required by /home/dlm/torch/install/lib/libwarpctc.so)
    > dlm@vm001nc6:~/code/deepspeech.pytorch$
    
    
    opened by dlmacedo 19
  • Release V2

    Few improvements incoming. Currently waiting for models to train, and let me know if anyone is willing to help train baseline models!

    Improvements

    • Remove TorchAudio and use Scipy when loading audio based on speed comparisons and ease of installation
    • Improved implementation of Nvidia Apex to make mixed precision training easier to use
    • New pre-trained models using mixed-precision
    • Documentation and improvements on how to tune and use librispeech LMs, and results based with the 4gram model

    The changes for this can be seen on the V2 branch for anyone curious.

    opened by SeanNaren 18
  • Will padding zeros for variable length input affect batch normalization?

    Hi,

    I saw in the dataloader that, in order to form an input batch from variable length inputs, the code uses zeros to pad the short sequences. I am not sure if these zeros will affect training with batch normalization, since BN will include them when it computes the mean and variance, and might make the variance very small, which could cause problems with training?

    Thank you very much for help.

    stale 
    opened by weedwind 17
  • RuntimeError: CUDNN_STATUS_INTERNAL_ERROR

    this happened when i set cuda=True

    )
    Traceback (most recent call last):
      File "train.py", line 318, in <module>
        main()
      File "train.py", line 182, in main
        out = model(inputs)
      File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 206, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 61, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 71, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs)
      File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 45, in parallel_apply
        raise output
    RuntimeError: CUDNN_STATUS_INTERNAL_ERROR
    
    opened by demobin8 16
  • During training utterances are always processed sorted by duration

    As I remember, in the DeepSpeech paper the samples were sorted by duration only during the first epoch. Maybe it would make sense to switch from the sequential sampler to a random sampler in AudioDataLoader after the first epoch?

    enhancement 
    opened by EgorLakomkin 16
Releases(V3.0)
  • V3.0(Jan 30, 2021)

    Release of deepspeech.pytorch, where we've moved to PyTorch Lightning!

    Previous release checkpoints will not be compatible, as a lot was deprecated and cleaned up for the future. Please use V2.1 if you need compatibility.

    • Rely on PyTorch Lightning for training
    • Moved to the native CTC function, removing warp-ctc
    • Refactored model objects, cleaned up technical debt
    • Moved towards a JSON structure for manifest files

    Pre-Trained models

    AN4

    Training command:

    python train.py +configs=an4
    

    Test Command:

    python test.py model.model_path=an4_pretrained_v3.ckpt test_path=data/an4_test_manifest.json
    

    | Dataset  |  WER  |  CER  |
    |----------|:-----:|:-----:|
    | AN4 test | 9.573 | 5.515 |

    Download here.

    Librispeech

    Training command:

    python train.py +configs=librispeech
    

    Test Command:

    python test.py model.model_path=librispeech.ckpt test_path=libri_test_clean_manifest.json
    python test.py model.model_path=librispeech.ckpt test_path=libri_test_other_manifest.json
    

    | Dataset           |  WER   |  CER   |
    |-------------------|:------:|:------:|
    | Librispeech clean | 10.463 | 3.399  |
    | Librispeech other | 28.285 | 12.036 |

    With 3-Gram ARPA LM with tuned alpha/beta values (alpha=1.97, beta=4.36, beam-width=1024)

    Test Command:

    python test.py model.model_path=librispeech.ckpt test_path=data/libri_test_clean_manifest.json lm.decoder_type=beam lm.alpha=1.97 lm.beta=4.36  lm.beam_width=1024 lm.lm_path=3-gram.arpa lm.lm_workers=16
    python test.py model.model_path=librispeech.ckpt test_path=data/libri_test_other_manifest.json lm.decoder_type=beam lm.alpha=1.97 lm.beta=4.36  lm.beam_width=1024 lm.lm_path=3-gram.arpa lm.lm_workers=16
    

    | Dataset           |  WER   |  CER   |
    |-------------------|:------:|:------:|
    | Librispeech clean | 7.062  | 2.984  |
    | Librispeech other | 19.984 | 11.178 |

    Download here.

    TEDLIUM

    Training command:

    python train.py +configs=tedlium
    

    Test Command:

    python test.py model.model_path=ted_pretrained_v3.ckpt test_path=ted_test_manifest.json
    

    | Dataset  |  WER   |  CER   |
    |----------|:------:|:------:|
    | Ted test | 28.056 | 10.548 |

    Download here.

    Source code(tar.gz)
    Source code(zip)
    an4.tar.gz(61.20 MB)
    an4_pretrained_v3.ckpt(991.38 MB)
    librispeech_pretrained_v3.ckpt(991.38 MB)
    ted_pretrained_v3.ckpt(991.38 MB)
  • V2.1(Jan 29, 2021)

    This release represents the last release before the PyTorch Lightning Integration. This is important in case anyone would like to use the old code base before we pivot to Lightning.

    AN4

    Training command:

    python train.py --rnn-type lstm --hidden-size 1024 --hidden-layers 5  --train-manifest data/an4_train_manifest.csv --val-manifest data/an4_val_manifest.csv --epochs 70 --num-workers 16 --cuda  --learning-anneal 1.01 --batch-size 32 --no-sortaGrad --visdom  --opt-level O1 --loss-scale 1 --id an4 --checkpoint --save-folder deepspeech.pytorch/an4/ --model-path deepspeech.pytorch/an4/deepspeech_final.pth
    

    Test Command:

    python test.py --model-path an4_pretrained_v2.pth --test-manifest data/an4_val_manifest.csv --cuda --half
    

    | Dataset  |  WER   |  CER  |
    |----------|:------:|:-----:|
    | AN4 test | 10.349 | 7.076 |

    Download here.

    Librispeech

    Training command:

    python train.py --rnn-type lstm --hidden-size 1024 --hidden-layers 5  --train-manifest data/libri_train_manifest.csv --val-manifest data/libri_val_manifest.csv --epochs 60 --num-workers 16 --cuda  --learning-anneal 1.01 --batch-size 64 --no-sortaGrad --visdom  --opt-level O1 --loss-scale 1 --id libri --checkpoint --save-folder deepspeech.pytorch/librispeech/ --model-path deepspeech.pytorch/librispeech/deepspeech_final.pth
    

    Test Command:

    python test.py --model-path librispeech_pretrained_v2.pth --test-manifest data/libri_test_clean.csv --cuda --half
    python test.py --model-path librispeech_pretrained_v2.pth --test-manifest data/libri_test_other.csv --cuda --half
    

    | Dataset           |  WER   |  CER   |
    |-------------------|:------:|:------:|
    | Librispeech clean | 9.919  | 3.307  |
    | Librispeech other | 28.116 | 12.040 |

    With 3-Gram ARPA LM with tuned alpha/beta values (alpha=1.97, beta=4.36, beam-width=1024)

    Test Command:

    python test.py --test-manifest libri_test_clean.csv --lm-path 3-gram.pruned.3e-7.arpa --decoder beam --alpha 1.97 --beta 4.36 --model-path librispeech_pretrained_v2.pth --lm-workers 8 --num-workers 16 --cuda --half --beam-width 1024
    python test.py --test-manifest libri_test_other.csv --lm-path 3-gram.pruned.3e-7.arpa --decoder beam --alpha 1.97 --beta 4.36 --model-path librispeech_pretrained_v2.pth --lm-workers 8 --num-workers 16 --cuda --half --beam-width 1024
    

    | Dataset           |  WER   |  CER   |
    |-------------------|:------:|:------:|
    | Librispeech clean | 6.654  | 2.705  |
    | Librispeech other | 19.889 | 10.467 |

    Download here.

    TEDLIUM

    Training command:

    python train.py --rnn-type lstm --hidden-size 1024 --hidden-layers 5  --train-manifest data/ted_train_manifest.csv --val-manifest data/ted_val_manifest.csv --epochs 60 --num-workers 16 --cuda  --learning-anneal 1.01 --batch-size 64 --no-sortaGrad --visdom  --opt-level O1 --loss-scale 1 --id ted --checkpoint --save-folder deepspeech.pytorch/tedlium/ --model-path deepspeech.pytorch/tedlium/deepspeech_final.pth
    

    Test Command:

    python test.py --model-path ted_pretrained_v2.pth --test-manifest data/ted_test_manifest.csv --cuda --half
    

    | Dataset  |  WER   |  CER   |
    |----------|:------:|:------:|
    | Ted test | 30.886 | 11.196 |

    Download here.

    Source code(tar.gz)
    Source code(zip)
  • v2.0(Oct 1, 2019)

    Supplied are a set of pre-trained networks that can be used for evaluation on academic datasets. Do not expect these models to perform well on your own data! They are heavily tuned to the datasets they are trained on.

    Most results are given using 'greedy decoding', with the addition of WER/CER for LibriSpeech using a LM. Expect a well trained language model to reduce WER/CER substantially.

    Improvements:

    • Remove TorchAudio and use Scipy when loading audio based on speed comparisons and ease of installation
    • Improved implementation of Nvidia Apex to make mixed precision training easier to use
    • New pre-trained models using mixed-precision
    • Documentation and improvements on how to tune and use librispeech LMs, and results based with the 3-gram model
    • Evaluation fixes for fairer comparison

    Commit Hash used for training and testing.

    AN4

    Training command:

    python train.py --rnn-type lstm --hidden-size 1024 --hidden-layers 5  --train-manifest data/an4_train_manifest.csv --val-manifest data/an4_val_manifest.csv --epochs 70 --num-workers 16 --cuda  --learning-anneal 1.01 --batch-size 32 --no-sortaGrad --visdom  --opt-level O1 --loss-scale 1 --id an4 --checkpoint --save-folder deepspeech.pytorch/an4/ --model-path deepspeech.pytorch/an4/deepspeech_final.pth
    

    Test Command:

    python test.py --model-path an4_pretrained_v2.pth --test-manifest data/an4_val_manifest.csv --cuda --half
    

    | Dataset  |  WER   |  CER  |
    |----------|:------:|:-----:|
    | AN4 test | 10.349 | 7.076 |

    Download here.

    Librispeech

    Training command:

    python train.py --rnn-type lstm --hidden-size 1024 --hidden-layers 5  --train-manifest data/libri_train_manifest.csv --val-manifest data/libri_val_manifest.csv --epochs 60 --num-workers 16 --cuda  --learning-anneal 1.01 --batch-size 64 --no-sortaGrad --visdom  --opt-level O1 --loss-scale 1 --id libri --checkpoint --save-folder deepspeech.pytorch/librispeech/ --model-path deepspeech.pytorch/librispeech/deepspeech_final.pth
    

    Test Command:

    python test.py --model-path librispeech_pretrained_v2.pth --test-manifest data/libri_test_clean.csv --cuda --half
    python test.py --model-path librispeech_pretrained_v2.pth --test-manifest data/libri_test_other.csv --cuda --half
    

    | Dataset           |  WER   |  CER   |
    |-------------------|:------:|:------:|
    | Librispeech clean | 9.919  | 3.307  |
    | Librispeech other | 28.116 | 12.040 |

    With 3-Gram ARPA LM with tuned alpha/beta values (alpha=1.97, beta=4.36, beam-width=1024)

    Test Command:

    python test.py --test-manifest libri_test_clean.csv --lm-path 3-gram.pruned.3e-7.arpa --decoder beam --alpha 1.97 --beta 4.36 --model-path librispeech_pretrained_v2.pth --lm-workers 8 --num-workers 16 --cuda --half --beam-width 1024
    python test.py --test-manifest libri_test_other.csv --lm-path 3-gram.pruned.3e-7.arpa --decoder beam --alpha 1.97 --beta 4.36 --model-path librispeech_pretrained_v2.pth --lm-workers 8 --num-workers 16 --cuda --half --beam-width 1024
    

    | Dataset           |  WER   |  CER   |
    |-------------------|:------:|:------:|
    | Librispeech clean | 6.654  | 2.705  |
    | Librispeech other | 19.889 | 10.467 |

    Download here.

    TEDLIUM

    Training command:

    python train.py --rnn-type lstm --hidden-size 1024 --hidden-layers 5  --train-manifest data/ted_train_manifest.csv --val-manifest data/ted_val_manifest.csv --epochs 60 --num-workers 16 --cuda  --learning-anneal 1.01 --batch-size 64 --no-sortaGrad --visdom  --opt-level O1 --loss-scale 1 --id ted --checkpoint --save-folder deepspeech.pytorch/tedlium/ --model-path deepspeech.pytorch/tedlium/deepspeech_final.pth
    

    Test Command:

    python test.py --model-path ted_pretrained_v2.pth --test-manifest data/ted_test_manifest.csv --cuda --half
    

    | Dataset  |  WER   |  CER   |
    |----------|:------:|:------:|
    | Ted test | 30.886 | 11.196 |

    Download here.

    Source code(tar.gz)
    Source code(zip)
    an4_pretrained_v2.pth(660.90 MB)
    librispeech_pretrained_v2.pth(660.90 MB)
    ted_pretrained_v2.pth(660.90 MB)
  • v1.2(Apr 20, 2018)

    This release is functionally identical to the previous but includes various bugfixes. The previously released models are still compatible. Performance of the pretrained models:

    | Dataset                |  WER   |  CER  |
    |------------------------|:------:|:-----:|
    | AN4 test               | 9.573  | 3.977 |
    | Librispeech test clean | 10.239 | 2.765 |
    | Librispeech test other | 28.008 | 9.791 |

    Source code(tar.gz)
    Source code(zip)
  • v1.1(Jan 12, 2018)

    Supplied are a set of pre-trained networks that can be used for evaluation. Do not expect these models to perform well on your own data! They are heavily tuned to the datasets they are trained on.

    Results are given using greedy decoding. Expect a well trained language model to reduce WER/CER substantially.

    These models should work with later versions of deepspeech.pytorch. A note to consider is that parameters have changed from underscores to dashes (i.e. --rnn_type is now --rnn-type).

    AN4

    Commit hash: e2c2d832357a992f36e68b5f378c117dd270d6ff

    Training command:

    python train.py  --rnn_type gru --hidden_size 800 --hidden_layers 5 --checkpoint --train_manifest data/an4_train_manifest.csv --val_manifest data/an4_val_manifest.csv --epochs 100 --num_workers $(nproc) --cuda --batch_size 32 --learning_anneal 1.01 --augment
    

    | Dataset  |  WER  |  CER |
    |----------|:-----:|:----:|
    | AN4 test | 10.58 | 4.88 |

    Download here.

    Librispeech

    Commit hash: e2c2d832357a992f36e68b5f378c117dd270d6ff

    Training command:

    python train.py  --rnn_type gru --hidden_size 800 --hidden_layers 5 --checkpoint --visdom --train_manifest data/libri_train_manifest.csv --val_manifest data/libri_val_manifest.csv --epochs 15 --num_workers $(nproc) --cuda --checkpoint --batch_size 10 --learning_anneal 1.1
    

    | Dataset           |  WER  |  CER  |
    |-------------------|:-----:|:-----:|
    | Librispeech clean | 11.27 | 3.09  |
    | Librispeech other | 30.74 | 10.97 |

    Download here.

    TEDLIUM

    Commit hash: e2c2d832357a992f36e68b5f378c117dd270d6ff

    Training command:

    python train.py  --rnn_type gru --hidden_size 800 --hidden_layers 5 --checkpoint --visdom --train_manifest data/ted_train_manifest.csv --val_manifest data/ted_val_manifest.csv --epochs 15 --num_workers $(nproc) --cuda --checkpoint --batch_size 10 --learning_anneal 1.1
    

    | Dataset  |  WER  |  CER  |
    |----------|:-----:|:-----:|
    | Ted test | 31.04 | 10.00 |

    Download here.

    Source code(tar.gz)
    Source code(zip)
    an4_pretrained.pth(290.66 MB)
    librispeech_pretrained.pth(290.66 MB)
    tedlium_pretrained.pth(290.66 MB)
  • v1.0(Aug 24, 2017)

    Supplied are a set of pre-trained networks that can be used for evaluation. Do not expect these models to perform well on your own data! They are heavily tuned to the datasets they are trained on.

    Results are given using greedy decoding. Expect a well trained language model to reduce WER/CER substantially.

    AN4

    Download here.

    | Dataset  |  WER  |  CER |
    |----------|:-----:|:----:|
    | AN4 test | 10.52 | 4.78 |

    LibriSpeech

    Download here.

    | Dataset           |  WER  |  CER  |
    |-------------------|:-----:|:-----:|
    | Librispeech clean | 11.20 | 3.36  |
    | Librispeech other | 31.31 | 12.29 |

    TEDLIUM

    Download here.

    | Dataset  |  WER  |  CER  |
    |----------|:-----:|:-----:|
    | TED test | 34.01 | 13.14 |

    Source code(tar.gz)
    Source code(zip)
    an4_pretrained.pth(290.48 MB)
    librispeech_pretrained.pth(290.48 MB)
    tedlium_pretrained.pth(290.66 MB)