

The PyTorch-Kaldi Speech Recognition Toolkit

PyTorch-Kaldi is an open-source repository for developing state-of-the-art DNN/HMM speech recognition systems. The DNN part is managed by PyTorch, while feature extraction, label computation, and decoding are performed with the Kaldi toolkit.

This repository contains the latest version of the PyTorch-Kaldi toolkit (PyTorch-Kaldi-v1.0). To take a look at the previous version (PyTorch-Kaldi-v0.1), click here.

If you use this code or part of it, please cite the following paper:

M. Ravanelli, T. Parcollet, Y. Bengio, "The PyTorch-Kaldi Speech Recognition Toolkit", arXiv

@inproceedings{pytorchkaldi,
  title     = {The PyTorch-Kaldi Speech Recognition Toolkit},
  author    = {M. Ravanelli and T. Parcollet and Y. Bengio},
  booktitle = {In Proc. of ICASSP},
  year      = {2019}
}

The toolkit is released under a Creative Commons Attribution 4.0 International license. You can copy, distribute, and modify the code for research, commercial, and non-commercial purposes. We only ask that you cite our paper referenced above.

To improve the transparency and replicability of speech recognition results, we give users the possibility to release their PyTorch-Kaldi models within this repository. Feel free to contact us (or to open a pull request) for that. Moreover, if your paper uses PyTorch-Kaldi, it is also possible to advertise it in this repository.

See a short introductory video on the PyTorch-Kaldi Toolkit


We are happy to announce that the SpeechBrain project is now public! We strongly encourage users to migrate to SpeechBrain. It is a much better project that already supports several speech processing tasks, such as speech recognition, speaker recognition, SLU, speech enhancement, speech separation, multi-microphone signal processing, and many others.

The goal is to develop a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech systems for speech recognition (both end-to-end and HMM-DNN), speaker recognition, speech separation, multi-microphone signal processing (e.g, beamforming), self-supervised learning, and many others.

The project will be led by Mila and is sponsored by Samsung, NVIDIA, and Dolby. SpeechBrain will also benefit from the collaboration and expertise of other companies such as Facebook/PyTorch, IBM Research, and FluentAI.

We are actively looking for collaborators. Feel free to contact us at [email protected] if you are interested in collaborating.

Thanks to our sponsors, we are also able to hire interns working at Mila on the SpeechBrain project. The ideal candidate is a PhD student with experience with PyTorch and speech technologies (send your CV to [email protected]).

The development of SpeechBrain will require some months before we have a working repository. Meanwhile, we will continue to provide support for the PyTorch-Kaldi project.

Stay Tuned!


The PyTorch-Kaldi project aims to bridge the gap between the Kaldi and the PyTorch toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these toolkits, but it embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed to naturally plug-in user-defined acoustic models. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly-released along with rich documentation and is designed to properly work locally or on HPC clusters.

Some features of the new version of the PyTorch-Kaldi toolkit:

  • Easy interface with Kaldi.
  • Easy plug-in of user-defined models.
  • Several pre-implemented models (MLP, CNN, RNN, LSTM, GRU, Li-GRU, SincNet).
  • Natural implementation of complex models based on multiple features, labels, and neural architectures.
  • Easy and flexible configuration files.
  • Automatic recovery from the last processed chunk.
  • Automatic chunking and context expansions of the input features.
  • Multi-GPU training.
  • Designed to work locally or on HPC clusters.
  • Tutorials on TIMIT and Librispeech Datasets.


Prerequisites

  1. If not already done, install Kaldi. As suggested during the installation, do not forget to add the path of the Kaldi binaries into $HOME/.bashrc. For instance, make sure that .bashrc contains the following paths:
export KALDI_ROOT=/home/mirco/kaldi-trunk
export PATH

Remember to change the KALDI_ROOT variable to your own path. As a first test of the installation, open a bash shell, type "copy-feats" or "hmm-info", and make sure no errors appear.

  2. If not already done, install PyTorch. We tested our code on PyTorch 1.0 and PyTorch 0.4. An older version of PyTorch is likely to raise errors. To check your installation, type "python" and, once inside the console, type "import torch" and make sure no errors appear.

  3. We recommend running the code on a GPU machine. Make sure that the CUDA libraries are installed and working correctly. We tested our system on CUDA 8.0, 9.0, and 9.1. Make sure that Python is installed (the code is tested with Python 2.7 and Python 3.7). Even though it is not mandatory, we suggest using Anaconda.

Recent updates

19 Feb. 2019: updates:

  • It is now possible to dynamically change batch size, learning rate, and dropout factors during training. We thus implemented a scheduler that supports the following formalism within the config files:
batch_size_train = 128*12 | 64*10 | 32*2

The line above means: train for 12 epochs with batch size 128, then 10 epochs with batch size 64, and finally 2 epochs with batch size 32. A similar formalism can be used for learning rate and dropout scheduling. See this section for more information.
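The scheduler string can be expanded into one value per epoch; the following is a minimal sketch of such a parser (a hypothetical helper for illustration, not the toolkit's actual implementation):

```python
def expand_schedule(spec):
    """Expand a scheduler string like "128*12 | 64*10 | 32*2" into one
    value per epoch (12 epochs at 128, then 10 at 64, then 2 at 32).
    Values are returned as floats so the same parser also covers
    learning-rate and dropout schedules."""
    per_epoch = []
    for piece in spec.split("|"):
        value, n_epochs = piece.strip().split("*")
        per_epoch.extend([float(value)] * int(n_epochs))
    return per_epoch

schedule = expand_schedule("128*12 | 64*10 | 32*2")
```

Here `schedule` has 24 entries, one per epoch, matching the total number of training epochs set in the config file.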

5 Feb. 2019: updates:

  1. Our toolkit now supports parallel data loading (i.e., the next chunk is stored in memory while processing the current chunk). This allows a significant speed up.
  2. When performing monophone regularization users can now set “dnn_lay = N_lab_out_mono”. This way the number of monophones is automatically inferred by our toolkit.
  3. We integrated the kaldi-io toolkit from the kaldi-io-for-python project into data_io.py.
  4. We provided a better hyperparameter setting for SincNet (see this section)
  5. We released some baselines with the DIRHA dataset (see this section). We also provide some configuration examples for a simple autoencoder (see this section) and for a system that jointly trains a speech enhancement and a speech recognition module (see this section)
  6. We fixed some minor bugs.

Notes on the next version: In the next version, we plan to further extend the functionalities of our toolkit, supporting more models and feature formats. The goal is to make our toolkit suitable for other speech-related tasks such as end-to-end speech recognition, speaker identification, keyword spotting, speech separation, speech activity detection, speech enhancement, etc. If you would like to propose some novel functionalities, please give us your feedback by filling out this survey.

How to install

To install PyTorch-Kaldi, do the following steps:

  1. Make sure all the software recommended in the "Prerequisites" section is installed and working correctly.
  2. Clone the PyTorch-Kaldi repository:
git clone
  3. Go into the project folder and install the needed packages with:
pip install -r requirements.txt

TIMIT tutorial

In the following, we provide a short tutorial of the PyTorch-Kaldi toolkit based on the popular TIMIT dataset.

  1. Make sure you have the TIMIT dataset. If not, it can be downloaded from the LDC website.

  2. Make sure the Kaldi and PyTorch installations are fine. Also make sure that your KALDI paths are working (you should add the Kaldi paths into .bashrc as reported in the "Prerequisites" section). For instance, type "copy-feats" and "hmm-info" and make sure no errors appear.

  3. Run the Kaldi s5 baseline of TIMIT. This step is necessary to compute features and labels later used to train the PyTorch neural network. We recommend running the full timit s5 recipe (including the DNN training):

cd kaldi/egs/timit/s5
./run.sh
./local/nnet/run_dnn.sh

This way all the necessary files are created and the user can directly compare the results obtained by Kaldi with those achieved with our toolkit.

  4. Compute the alignments (i.e., the phone-state labels) for test and dev data with the following commands (go into $KALDI_ROOT/egs/timit/s5). If you want to use tri3 alignments, type:
steps/align_fmllr.sh --nj 4 data/dev data/lang exp/tri3 exp/tri3_ali_dev

steps/align_fmllr.sh --nj 4 data/test data/lang exp/tri3 exp/tri3_ali_test

If you want to use dnn alignments (as suggested), type:

steps/nnet/align.sh --nj 4 data-fmllr-tri3/train data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali

steps/nnet/align.sh --nj 4 data-fmllr-tri3/dev data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_dev

steps/nnet/align.sh --nj 4 data-fmllr-tri3/test data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_test
  5. We start this tutorial with a very simple MLP network trained on mfcc features. Before launching the experiment, take a look at the configuration file cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg. See the Description of the configuration files for a detailed description of all its fields.

  6. Change the config file according to your paths. In particular:

  • Set “fea_lst” with the path of your mfcc training list (that should be in $KALDI_ROOT/egs/timit/s5/data/train/feats.scp)
  • Add your path (e.g., $KALDI_ROOT/egs/timit/s5/data/train/utt2spk) into “--utt2spk=ark:”
  • Add your CMVN transformation (e.g., $KALDI_ROOT/egs/timit/s5/mfcc/cmvn_train.ark).
  • Add the folder where labels are stored (e.g., $KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali for training and $KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_dev for dev data).

To avoid errors, make sure that all the paths in the cfg file exist. Please avoid using paths containing bash variables, since paths are read literally and are not automatically expanded (e.g., use /home/mirco/kaldi-trunk/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali instead of $KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali).
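Since paths are read literally, a quick sanity check before launching an experiment can save time; the following is a hypothetical helper (not part of the toolkit) that flags unexpanded shell variables and missing paths:

```python
import os

def check_cfg_paths(paths):
    """Return (path, problem) pairs for cfg paths that either contain an
    unexpanded shell variable or do not exist on disk."""
    problems = []
    for p in paths:
        if "$" in p:
            problems.append((p, "contains a shell variable; paths are read literally"))
        elif not os.path.exists(p):
            problems.append((p, "does not exist"))
    return problems
```

Run it on every path copied into the cfg file before starting a multi-hour training.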

  7. Run the ASR experiment:
python run_exp.py cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg

This script starts a full ASR experiment and performs training, validation, forward, and decoding steps. A progress bar shows the evolution of all the aforementioned phases. The script progressively creates the following files in the output directory:

  • res.res: a file that summarizes training and validation performance across various validation epochs.
  • log.log: a file that contains possible errors and warnings.
  • conf.cfg: a copy of the configuration file.
  • model.svg: a picture that shows the considered model and how the various neural networks are connected. This is really useful for debugging models that are more complex than this one (e.g., models based on multiple neural networks).
  • The folder exp_files contains several files that summarize the evolution of training and validation over the various epochs. For instance, files *.info report chunk-specific information such as the chunk_loss and error and the training time. The *.cfg files are the chunk-specific configuration files (see general architecture for more details), while files *.lst report the list of features used to train each specific chunk.
  • At the end of training, a directory called generated outputs, containing plots of the loss and errors during the various training epochs, is created.

Note that you can stop the experiment at any time. If you run the script again, it will automatically start from the last chunk correctly processed. The training could take a couple of hours, depending on the available GPU. Note also that if you would like to change some parameters of the configuration file (e.g., n_chunks=,fea_lst=,batch_size_train=,..) you must specify a different output folder (output_folder=).
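The resume behaviour described above can be pictured as a scan of the exp_files folder for the chunk-specific .info files written when a chunk finishes; a simplified illustration (the helper name and logic are hypothetical, the toolkit's actual recovery code differs):

```python
import glob
import os
import re

def last_completed_chunk(exp_files_dir):
    """Return the highest (epoch, chunk) pair for which a train_* .info
    file exists in exp_files, or None if no chunk has finished yet."""
    done = []
    for path in glob.glob(os.path.join(exp_files_dir, "train_*.info")):
        m = re.search(r"ep(\d+)_ck(\d+)", os.path.basename(path))
        if m:
            done.append((int(m.group(1)), int(m.group(2))))
    return max(done) if done else None
```

On restart, training would then resume from the chunk right after the returned pair.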

Debug: If you run into some errors, we suggest to do the following checks:

  1. Take a look into the standard output.

  2. If it is not helpful, take a look into the log.log file.

  3. Take a look at the run_nn function in the core.py library. Add some prints in the various parts of the function to isolate the problem and figure out the issue.

  4. At the end of training, the phone error rate (PER%) is appended to the res.res file. To see more details on the decoding results, you can go into the "decoding_test" folder in the output folder and take a look at the various files created. For this specific example, we obtained the following res.res file:

ep=000 tr=['TIMIT_tr'] loss=3.398 err=0.721 valid=TIMIT_dev loss=2.268 err=0.591 lr_architecture1=0.080000 time(s)=86
ep=001 tr=['TIMIT_tr'] loss=2.137 err=0.570 valid=TIMIT_dev loss=1.990 err=0.541 lr_architecture1=0.080000 time(s)=87
ep=002 tr=['TIMIT_tr'] loss=1.896 err=0.524 valid=TIMIT_dev loss=1.874 err=0.516 lr_architecture1=0.080000 time(s)=87
ep=003 tr=['TIMIT_tr'] loss=1.751 err=0.494 valid=TIMIT_dev loss=1.819 err=0.504 lr_architecture1=0.080000 time(s)=88
ep=004 tr=['TIMIT_tr'] loss=1.645 err=0.472 valid=TIMIT_dev loss=1.775 err=0.494 lr_architecture1=0.080000 time(s)=89
ep=005 tr=['TIMIT_tr'] loss=1.560 err=0.453 valid=TIMIT_dev loss=1.773 err=0.493 lr_architecture1=0.080000 time(s)=88
ep=020 tr=['TIMIT_tr'] loss=0.968 err=0.304 valid=TIMIT_dev loss=1.648 err=0.446 lr_architecture1=0.002500 time(s)=89
ep=021 tr=['TIMIT_tr'] loss=0.965 err=0.304 valid=TIMIT_dev loss=1.649 err=0.446 lr_architecture1=0.002500 time(s)=90
ep=022 tr=['TIMIT_tr'] loss=0.960 err=0.302 valid=TIMIT_dev loss=1.652 err=0.447 lr_architecture1=0.001250 time(s)=88
ep=023 tr=['TIMIT_tr'] loss=0.959 err=0.301 valid=TIMIT_dev loss=1.651 err=0.446 lr_architecture1=0.000625 time(s)=88
%WER 18.1 | 192 7215 | 84.0 11.9 4.2 2.1 18.1 99.5 | -0.583 | /home/mirco/pytorch-kaldi-new/exp/TIMIT_MLP_basic5/decode_TIMIT_test_out_dnn1/score_6/ctm_39phn.filt.sys

The achieved PER(%) is 18.1%. Note that there could be some variability in the results, due to different initializations on different machines. We believe that averaging the performance obtained with different initialization seeds (i.e., change the field seed in the config file) is crucial for TIMIT since the natural performance variability might completely hide the experimental evidence. We noticed a standard deviation of about 0.2% for the TIMIT experiments.
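The res.res lines are easy to post-process, e.g., for averaging results over several seeds; the following is a small hypothetical parser (not part of the toolkit) for the line format shown above:

```python
import re

def parse_res_line(line):
    """Split one res.res line into training and validation metrics.
    Numeric key=value pairs are extracted on each side of 'valid='."""
    train_part, valid_part = line.split("valid=")
    def grab(part):
        return {k: float(v) for k, v in re.findall(r"(\w+)=([\d.]+)", part)}
    return {"train": grab(train_part), "valid": grab(valid_part)}

row = parse_res_line("ep=023 tr=['TIMIT_tr'] loss=0.959 err=0.301 "
                     "valid=TIMIT_dev loss=1.651 err=0.446 "
                     "lr_architecture1=0.000625 time(s)=88")
# row["train"]["err"] -> 0.301, row["valid"]["err"] -> 0.446
```

Applying this to every line of res.res gives per-epoch curves that can be averaged across runs with different seeds.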

If you want to change the features, you have to first compute them with the Kaldi toolkit. To compute fbank features, open $KALDI_ROOT/egs/timit/s5/run.sh and compute them with the following lines:


feadir=fbank

for x in train dev test; do
  steps/make_fbank.sh --cmd "$train_cmd" --nj $feats_nj data/$x exp/make_fbank/$x $feadir
  steps/compute_cmvn_stats.sh data/$x exp/make_fbank/$x $feadir
done

Then, change the aforementioned configuration file with the new feature list. If you have already run the full timit Kaldi recipe, you can directly find the fmllr features in $KALDI_ROOT/egs/timit/s5/data-fmllr-tri3. If you feed the neural network with such features, you should expect a substantial performance improvement thanks to speaker adaptation.

In the TIMIT_baseline folder, we propose several other examples of possible TIMIT baselines. Similarly to the previous example, you can run them by simply typing:

python run_exp.py $cfg_file

There are some examples with recurrent (TIMIT_RNN*,TIMIT_LSTM*,TIMIT_GRU*,TIMIT_LiGRU*) and CNN architectures (TIMIT_CNN*). We also propose a more advanced model (TIMIT_DNN_liGRU_DNN_mfcc+fbank+fmllr.cfg) where we used a combination of feed-forward and recurrent neural networks fed by a concatenation of mfcc, fbank, and fmllr features. Note that the latter configuration files correspond to the best architecture described in the reference paper. As you might see from the above-mentioned configuration files, we improve the ASR performance by including some tricks such as the monophone regularization (i.e., we jointly estimate both context-dependent and context-independent targets). The following table reports the results obtained by running the latter systems (average PER%):

Model                 mfcc    fbank   fMLLR
Kaldi DNN Baseline    -----   -----   18.5
MLP                   18.2    18.7    16.7
RNN                   17.7    17.2    15.9
SRU                   -----   16.6    -----
LSTM                  15.1    14.3    14.5
GRU                   16.0    15.2    14.9
Li-GRU                15.5    14.9    14.2

Results show that, as expected, fMLLR features outperform MFCC and FBANK coefficients, thanks to the speaker adaptation process. Recurrent models significantly outperform the standard MLP, especially the LSTM, GRU, and Li-GRU architectures, which effectively address gradient vanishing through multiplicative gates. The best result, PER=14.2%, is obtained with the Li-GRU model [2,3], which is based on a single gate and thus saves 33% of the computations of a standard GRU.

The best results are actually obtained with a more complex architecture that combines MFCC, FBANK, and fMLLR features (see cfg/TIMIT_baselines/TIMIT_mfcc_fbank_fmllr_liGRU_best.cfg). To the best of our knowledge, the PER=13.8% achieved by the latter system yields the best published performance on the TIMIT test set.

The Simple Recurrent Unit (SRU) is an efficient and highly parallelizable recurrent model. Its ASR performance is worse than that of standard LSTM, GRU, and Li-GRU models, but it is significantly faster. SRU is implemented here and described in the following paper:

T. Lei, Y. Zhang, S. I. Wang, H. Dai, Y. Artzi, "Simple Recurrent Units for Highly Parallelizable Recurrence", Proc. of EMNLP 2018. arXiv

To do experiments with this model, use the config file cfg/TIMIT_baselines/TIMIT_SRU_fbank.cfg. Before running it, you should install the model with pip install sru and uncomment "import sru" in neural_networks.py.

You can directly compare your results with ours by going here. In this external repository, you can find all the folders containing the generated files.

Librispeech tutorial

The steps to run PyTorch-Kaldi on the Librispeech dataset are similar to those reported above for TIMIT. The following tutorial is based on the 100h subset, but it can easily be extended to the full dataset (960h).

  1. Run the Kaldi recipe for librispeech at least until Stage 13 (included)
  2. Copy exp/tri4b/trans.* files into exp/tri4b/decode_tgsmall_train_clean_100/
mkdir exp/tri4b/decode_tgsmall_train_clean_100 && cp exp/tri4b/trans.* exp/tri4b/decode_tgsmall_train_clean_100/
  3. Compute the fmllr features by running the following script:
. ./cmd.sh ## You'll want to change cmd.sh to something that will work on your system.
. ./path.sh ## Source the tools/utils (import the queue.pl)

gmmdir=exp/tri4b

for chunk in train_clean_100 dev_clean test_clean; do
    dir=fmllr/$chunk

    steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
        --transform-dir $gmmdir/decode_tgsmall_$chunk \
            $dir data/$chunk $gmmdir $dir/log $dir/data || exit 1

    compute-cmvn-stats --spk2utt=ark:data/$chunk/spk2utt scp:fmllr/$chunk/feats.scp ark:$dir/data/cmvn_speaker.ark
done
  4. Compute the alignments using:
# alignments on train, dev, and test data
steps/align_fmllr.sh --nj 30 data/train_clean_100 data/lang exp/tri4b exp/tri4b_ali_clean_100
steps/align_fmllr.sh --nj 10 data/dev_clean data/lang exp/tri4b exp/tri4b_ali_dev_clean_100
steps/align_fmllr.sh --nj 10 data/test_clean data/lang exp/tri4b exp/tri4b_ali_test_clean_100
  5. Run the experiments with the following command:
  python run_exp.py cfg/Librispeech_baselines/libri_MLP_fmllr.cfg

If you would like to use a recurrent model, you can use libri_RNN_fmllr.cfg, libri_LSTM_fmllr.cfg, libri_GRU_fmllr.cfg, or libri_liGRU_fmllr.cfg. The training of recurrent models might take some days (depending on the adopted GPU). The performance obtained with the tgsmall graph is reported in the following table:

Model    WER%
MLP      9.6
LSTM     8.6
GRU      8.6
Li-GRU   8.6

These results are obtained without lattice rescoring (i.e., using only the tgsmall graph). You can improve the performance by adding lattice rescoring in this way (run it from the kaldi_decoding_script folder of PyTorch-Kaldi):


steps/lmrescore_const_arpa.sh $data_dir/lang_test_{tgsmall,fglarge} \
          $data_dir/test_clean $dec_dir $out_dir/decode_test_clean_fglarge || exit 1;

The final results obtained using rescoring (fglarge) are reported in the following table:

Model    WER%
MLP      6.5
LSTM     6.4
GRU      6.3
Li-GRU   6.2

You can take a look at the results obtained here.

Overview of the toolkit architecture

The main script to run an ASR experiment is run_exp.py. This Python script performs training, validation, forward, and decoding steps. Training is performed over several epochs that progressively process all the training material with the considered neural network. After each training epoch, a validation step is performed to monitor the system performance on held-out data. At the end of training, the forward phase is performed by computing the posterior probabilities of the specified test dataset. The posterior probabilities are normalized by their priors (using a count file) and stored into an ark file. A decoding step is then performed to retrieve the final sequence of words uttered by the speaker in the test sentences.

The script takes in input a global config file (e.g., cfg/TIMIT_MLP_mfcc.cfg) that specifies all the options needed to run a full experiment. The code calls another function, run_nn (see the core.py library), that performs training, validation, and forward operations on each chunk of data. The function run_nn takes in input a chunk-specific config file (e.g., exp/TIMIT_MLP_mfcc/exp_files/train_TIMIT_tr+TIMIT_dev_ep000_ck00.cfg) that specifies all the parameters needed for running a single-chunk experiment. The run_nn function outputs some info files (e.g., exp/TIMIT_MLP_mfcc/exp_files/train_TIMIT_tr+TIMIT_dev_ep000_ck00.info) that summarize losses and errors of the processed chunk.

The results are summarized into the res.res files, while errors and warnings are redirected into the log.log file.

Description of the configuration files

There are two types of config files (global and chunk-specific cfg files). They are both in INI format and are read, processed, and modified with the configparser library of Python. The global file contains several sections that specify all the main steps of a speech recognition experiment (training, validation, forward, and decoding). The structure of the config file is described in a prototype file (see for instance proto/global.proto) that not only lists all the required sections and fields but also specifies the type of each possible field. For instance, N_ep=int(1,inf) means that the field N_ep (i.e., the number of training epochs) must be an integer ranging from 1 to inf. Similarly, lr=float(0,inf) means that the lr field (i.e., the learning rate) must be a float ranging from 0 to inf. Any attempt to write a config file not compliant with these specifications will raise an error.
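A field specification such as int(1,inf) can be checked mechanically; the following is a minimal sketch of such a validator (a hypothetical helper, not the toolkit's actual parser):

```python
def check_field(spec, raw):
    """Validate a raw config value against a proto spec such as
    "int(1,inf)" or "float(0,inf)": cast it and enforce the range."""
    kind, bounds = spec.rstrip(")").split("(")
    lo_s, hi_s = bounds.split(",")
    cast = {"int": int, "float": float}[kind]
    value = cast(raw)
    lo = float(lo_s) if lo_s != "-inf" else float("-inf")  # float("1") etc.
    hi = float(hi_s) if hi_s != "inf" else float("inf")
    if not (lo <= value <= hi):
        raise ValueError(f"value {raw} violates spec {spec}")
    return value
```

For example, check_field("int(1,inf)", "24") returns 24, while check_field("int(1,inf)", "0") raises an error, mirroring the behaviour described above.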

Let's now try to open a config file (e.g., cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg) and let's describe the main sections:

[cfg_proto]
cfg_proto = proto/global.proto
cfg_proto_chunk = proto/global_chunk.proto

The current version of the config file first specifies the paths of the global and chunk-specific prototype files in the section [cfg_proto].

[exp]
cmd = 
run_nn_script = run_nn
out_folder = exp/TIMIT_MLP_basic5
seed = 1234
use_cuda = True
multi_gpu = False
save_gpumem = False
n_epochs_tr = 24

The section [exp] contains some important fields, such as the output folder (out_folder) and the path of the chunk-specific processing script run_nn (by default this function should be implemented in the core.py library). The field n_epochs_tr specifies the selected number of training epochs. Other options such as use_cuda, multi_gpu, and save_gpumem can be enabled by the user. The field cmd can be used to append a command to run the script on an HPC cluster.

[dataset1]
data_name = TIMIT_tr
fea = fea_name=mfcc
    fea_opts=apply-cmvn --utt2spk=ark:quick_test/data/train/utt2spk  ark:quick_test/mfcc/train_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
lab = lab_name=lab_cd
n_chunks = 5

[dataset2]
data_name = TIMIT_dev
fea = fea_name=mfcc
    fea_opts=apply-cmvn --utt2spk=ark:quick_test/data/dev/utt2spk  ark:quick_test/mfcc/dev_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
lab = lab_name=lab_cd
n_chunks = 1

[dataset3]
data_name = TIMIT_test
fea = fea_name=mfcc
    fea_opts=apply-cmvn --utt2spk=ark:quick_test/data/test/utt2spk  ark:quick_test/mfcc/test_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
lab = lab_name=lab_cd
n_chunks = 1

The config file contains a number of sections ([dataset1], [dataset2], [dataset3],...) that describe all the corpora used for the ASR experiment. The fields on the [dataset*] section describe all the features and labels considered in the experiment. The features, for instance, are specified in the field fea:, where fea_name contains the name given to the feature, fea_lst is the list of features (in the scp Kaldi format), fea_opts allows users to specify how to process the features (e.g., doing CMVN or adding the derivatives), while cw_left and cw_right set the characteristics of the context window (i.e., number of left and right frames to append). Note that the current version of the PyTorch-Kaldi toolkit supports the definition of multiple features streams. Indeed, as shown in cfg/TIMIT_baselines/TIMIT_mfcc_fbank_fmllr_liGRU_best.cfg multiple feature streams (e.g., mfcc, fbank, fmllr) are employed.

Similarly, the lab section contains some sub-fields. For instance, lab_name refers to the name given to the label, while lab_folder contains the folder where the alignments generated by the Kaldi recipe are stored. lab_opts allows the user to specify some options on the considered alignments. For example lab_opts="ali-to-pdf" extracts standard context-dependent phone-state labels, while lab_opts=ali-to-phones --per-frame=true can be used to extract monophone targets. lab_count_file is used to specify the file that contains the counts of the considered phone states. These counts are important in the forward phase, where the posterior probabilities computed by the neural network are divided by their priors. PyTorch-Kaldi allows users to both specify an external count file or to automatically retrieve it (using lab_count_file=auto). Users can also specify lab_count_file=none if the count file is not strictly needed, e.g., when the labels correspond to an output not used to generate the posterior probabilities used in the forward phase (see for instance the monophone targets in cfg/TIMIT_baselines/TIMIT_MLP_mfcc.cfg). lab_data_folder, instead, corresponds to the data folder created during the Kaldi data preparation. It contains several files, including the text file eventually used for the computation of the final WER. The last sub-field lab_graph is the path of the Kaldi graph used to generate the labels.

The full dataset is usually large and cannot fit into GPU/RAM memory. It should thus be split into several chunks. PyTorch-Kaldi automatically splits the dataset into the number of chunks specified in n_chunks. The number of chunks might depend on the specific dataset. In general, we suggest processing speech chunks of about 1 or 2 hours (depending on the available memory).
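Conceptually, the chunking is just an even split of the utterance list; a simplified illustration (hypothetical helper; the actual toolkit also handles feature loading and context windows):

```python
def split_into_chunks(scp_entries, n_chunks):
    """Split a list of scp entries (one utterance per element) into
    n_chunks nearly equal parts, with earlier chunks absorbing the
    remainder."""
    base, rem = divmod(len(scp_entries), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + base + (1 if i < rem else 0)
        chunks.append(scp_entries[start:end])
        start = end
    return chunks
```

Each chunk is then processed independently by the chunk-specific config files mentioned earlier.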

[data_use]
train_with = TIMIT_tr
valid_with = TIMIT_dev
forward_with = TIMIT_test

This section specifies how the data listed in the [dataset*] sections are used within the script. The first line means that we perform training with the data called TIMIT_tr. Note that this dataset name must appear in one of the dataset sections, otherwise the config parser will raise an error. Similarly, the second and third lines specify the data used for the validation and forward phases, respectively.

[batches]
batch_size_train = 128
max_seq_length_train = 1000
increase_seq_length_train = False
start_seq_len_train = 100
multply_factor_seq_len_train = 2
batch_size_valid = 128
max_seq_length_valid = 1000

batch_size_train is used to define the number of training examples in the mini-batch. The field max_seq_length_train truncates sentences longer than the specified value. When training recurrent models on very long sentences, out-of-memory issues might arise. With this option, we allow users to mitigate such memory problems by truncating long sentences. Moreover, it is possible to progressively grow the maximum sentence length during training by setting increase_seq_length_train=True. If enabled, the training starts with the maximum sentence length specified in start_seq_len_train (e.g., start_seq_len_train=100). After each epoch the maximum sentence length is multiplied by multply_factor_seq_len_train (e.g., multply_factor_seq_len_train=2). We have observed that this simple strategy generally improves the system performance, since it encourages the model to first focus on short-term dependencies and learn longer-term ones only at a later stage.
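The growing-sentence-length schedule can be written compactly; a sketch (hypothetical helper, using the field names above) of the length cap at each epoch:

```python
def max_seq_len_for_epoch(epoch, start_seq_len_train=100,
                          multply_factor_seq_len_train=2,
                          max_seq_length_train=1000):
    """Maximum sentence length at a given epoch (0-based) when
    increase_seq_length_train = True: start short and multiply after
    each epoch, never exceeding max_seq_length_train."""
    return min(start_seq_len_train * multply_factor_seq_len_train ** epoch,
               max_seq_length_train)
```

With the defaults above, the cap grows 100, 200, 400, 800, and then saturates at 1000.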

Similarly, batch_size_valid and max_seq_length_valid specify the number of examples in the mini-batches and the maximum sentence length for the dev dataset.

[architecture1]
arch_name = MLP_layers1
arch_proto = proto/MLP.proto
arch_library = neural_networks
arch_class = MLP
arch_pretrain_file = none
arch_freeze = False
arch_seq_model = False
dnn_lay = 1024,1024,1024,1024,N_out_lab_cd
dnn_drop = 0.15,0.15,0.15,0.15,0.0
dnn_use_laynorm_inp = False
dnn_use_batchnorm_inp = False
dnn_use_batchnorm = True,True,True,True,False
dnn_use_laynorm = False,False,False,False,False
dnn_act = relu,relu,relu,relu,softmax
arch_lr = 0.08
arch_halving_factor = 0.5
arch_improvement_threshold = 0.001
arch_opt = sgd
opt_momentum = 0.0
opt_weight_decay = 0.0
opt_dampening = 0.0
opt_nesterov = False

The sections [architecture*] are used to specify the architectures of the neural networks involved in the ASR experiments. The field arch_name specifies the name of the architecture. Since different neural networks can depend on different sets of hyperparameters, the user has to add the path of a proto file that contains the list of hyperparameters into the field arch_proto. For example, the prototype file for a standard MLP model contains the following fields:


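The proto content is not reproduced above; as an illustration only (the value-type names on the right are assumptions patterned on the int(...)/float(...) notation described earlier, while the dnn_* field names come from the [architecture1] section shown above), an MLP proto might look like:

```ini
[proto]
dnn_lay = str_list
dnn_drop = float_list(0.0,1.0)
dnn_use_laynorm_inp = bool
dnn_use_batchnorm_inp = bool
dnn_use_batchnorm = bool_list
dnn_use_laynorm = bool_list
dnn_act = str_list
```

Check proto/MLP.proto in the repository for the exact field list and types.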
Similarly to the other prototype files, each line defines a hyperparameter with the related value type. All the hyperparameters defined in the proto file must appear in the global configuration file under the corresponding [architecture*] section. The field arch_library specifies where the model is coded (e.g., neural_networks.py), while arch_class indicates the name of the class where the architecture is implemented (e.g., if we set arch_class=MLP we will do from neural_networks import MLP).

The field arch_pretrain_file can be used to pre-train the neural network with a previously trained architecture, while arch_freeze should be set to False if you want to train the parameters of the architecture and to True to keep the parameters fixed (i.e., frozen) during training. The field arch_seq_model indicates whether the architecture is sequential (e.g., RNNs) or non-sequential (e.g., a feed-forward MLP or CNN). The way PyTorch-Kaldi processes the input batches is different in the two cases. For recurrent neural networks (arch_seq_model=True) the sequence of features is not randomized (to preserve the elements of the sequences), while for feed-forward models (arch_seq_model=False) we randomize the features (this usually helps to improve the performance). In the case of multiple architectures, sequential processing is used if at least one of the employed architectures is marked as sequential (arch_seq_model=True).

Note that the hyperparameters starting with "arch_" and "opt_" are mandatory and must be present for all the architectures specified in the config file. The other hyperparameters (e.g., dnn_*) are specific to the considered architecture (they depend on how the class MLP is actually implemented by the user) and can define the number and typology of hidden layers, batch and layer normalizations, and other parameters. Other important parameters are related to the optimization of the considered architecture. For instance, arch_lr is the learning rate, while arch_halving_factor is used to implement learning rate annealing. In particular, when the relative performance improvement on the dev set between two consecutive epochs is smaller than the value specified in arch_improvement_threshold (e.g., arch_improvement_threshold=0.001), we multiply the learning rate by arch_halving_factor (e.g., arch_halving_factor=0.5). The field arch_opt specifies the type of optimization algorithm. We currently support SGD, Adam, and RMSprop. The other parameters are specific to the considered optimization algorithm (see the PyTorch documentation for the exact meaning of all the optimization-specific hyperparameters). Note that the different architectures defined in [architecture*] can have different optimization hyperparameters and can even use different optimization algorithms.
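The annealing rule can be expressed in a few lines; a sketch (hypothetical helper, using the field names above) of the halving logic just described:

```python
def anneal_lr(lr, prev_err, curr_err,
              arch_halving_factor=0.5, arch_improvement_threshold=0.001):
    """Multiply the learning rate by arch_halving_factor when the relative
    dev-set error improvement between two consecutive epochs falls below
    arch_improvement_threshold; otherwise keep it unchanged."""
    relative_improvement = (prev_err - curr_err) / prev_err
    if relative_improvement < arch_improvement_threshold:
        return lr * arch_halving_factor
    return lr
```

This matches the lr_architecture1 column of the res.res example shown earlier, where the rate stays at 0.08 while the dev error keeps improving and is halved once it plateaus.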

model_proto = proto/model.proto
model = out_dnn1=compute(MLP_layers1,mfcc)

The way all the various features and architectures are combined is specified in this section with a very simple and intuitive meta-language. The field model describes how features and architectures are connected to generate a set of posterior probabilities as output. The line out_dnn1=compute(MLP_layers1,mfcc) means "feed the architecture called MLP_layers1 with the features called mfcc and store the output into the variable out_dnn1". From the neural network output out_dnn1, the error and loss functions are computed using the labels called lab_cd, which have to be previously defined in the [datasets*] sections. The err_final and loss_final fields are mandatory subfields that define the final output of the model.
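For reference, a complete [model] section of a simple setup looks roughly like this (sketched after the TIMIT MLP baseline; the exact cost function names, such as cost_nll and cost_err, should be checked against the proto/model.proto shipped with your version of the toolkit):

```
[model]
model_proto = proto/model.proto
model = out_dnn1=compute(MLP_layers1,mfcc)
    loss_final=cost_nll(out_dnn1,lab_cd)
    err_final=cost_err(out_dnn1,lab_cd)
```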

A much more complex example (discussed here just to highlight the potentiality of the toolkit) is reported in cfg/TIMIT_baselines/TIMIT_mfcc_fbank_fmllr_liGRU_best.cfg:


In this case, we first concatenate the mfcc, fbank, and fmllr features and feed them to an MLP. The output of the MLP is fed into a recurrent neural network (specifically, a Li-GRU model). We then have another MLP layer (MLP_layers_second) followed by two softmax classifiers (i.e., MLP_layers_last, MLP_layers_last2). The first one estimates standard context-dependent states, while the second estimates monophone targets. The final cost function is a weighted sum of these two predictions. In this way, we implement monophone regularization, which turned out to be useful to improve the ASR performance.

The full model can be considered as a single big computational graph, where all the basic architectures used in the [model] section are jointly trained. For each mini-batch, the input features are propagated through the full model and the cost_final is computed using the specified labels. The gradient of the cost function with respect to all the learnable parameters of the architecture is then computed. All the parameters of the employed architectures are then updated together with the algorithm specified in the [architecture*] sections.

forward_out = out_dnn1
normalize_posteriors = True
normalize_with_counts_from = lab_cd
save_out_file = True
require_decoding = True

The forward section first defines which output to forward (it must be defined in the model section). If normalize_posteriors=True, these posteriors are normalized by their priors (using a count file). If save_out_file=True, the posterior file (usually a very big ark file) is stored, while if save_out_file=False this file is deleted when no longer needed. The require_decoding field is a boolean that specifies whether the specified output needs to be decoded. The field normalize_with_counts_from sets which counts to use to normalize the posterior probabilities.

decoding_script_folder = kaldi_decoding_scripts/
decoding_script =
decoding_proto = proto/decoding.proto
min_active = 200
max_active = 7000
max_mem = 50000000
beam = 13.0
latbeam = 8.0
acwt = 0.2
max_arcs = -1
skip_scoring = false
scoring_script = local/
scoring_opts = "--min-lmwt 1 --max-lmwt 10"
norm_vars = False

The decoding section reports the parameters of the decoding, i.e., the steps that allow one to pass from the sequence of context-dependent probabilities provided by the DNN to a sequence of words. The field decoding_script_folder specifies the folder where the decoding script is stored. The decoding_script field is the script used for decoding, which should be located in the decoding_script_folder specified before. The field decoding_proto reports all the parameters needed for the considered decoding script.

To make the code more flexible, the config parameters can also be specified within the command line. For example, you can run:

 python quick_test/example_newcode.cfg --optimization,lr=0.01 --batches,batch_size=4

The script will replace the learning rate in the specified cfg file with the specified lr value. The modified config file is then stored into out_folder/config.cfg.

The script automatically creates chunk-specific config files, which are used by the run_nn function to perform the training of a single chunk. The structure of chunk-specific cfg files is very similar to that of the global one. The main difference is a field to_do={train, valid, forward} that specifies the type of processing to perform on the feature chunk specified in the field fea.

Why proto files? Different neural networks, optimization algorithms, and HMM decoders might depend on different sets of hyperparameters. To address this issue, our current solution is based on the definition of some prototype files (for global, chunk, and architecture config files). In general, this approach allows a more transparent check of the fields specified in the global config file. Moreover, it allows users to easily add new parameters without changing any line of the python code. For instance, to add a user-defined model, a new proto file (e.g., user-model.proto) that specifies the hyperparameters must be written. Then, the user should only write a class (e.g., user-model) that implements the architecture.


How can I plug-in my model

The toolkit is designed to allow users to easily plug in their own acoustic models. To add a customized neural model, do the following steps:

  1. Go into the proto folder and create a new proto file (e.g., proto/myDNN.proto). The proto file is used to specify the list of hyperparameters of your model that will later be set in the configuration file. To get an idea of the information to add to your proto file, you can take a look into the MLP.proto file:
    The parameter dnn_lay must be a list of strings, dnn_drop (i.e., the dropout factors for each layer) is a list of floats ranging from 0.0 to 1.0, and dnn_use_laynorm_inp and dnn_use_batchnorm_inp are booleans that enable or disable layer or batch normalization of the input. dnn_use_batchnorm and dnn_use_laynorm are lists of booleans that decide, layer by layer, whether batch/layer normalization has to be used. The parameter dnn_act is again a list of strings that sets the activation function of each layer. Since every model is based on its own set of hyperparameters, different models have different prototype files. For instance, you can take a look into GRU.proto and see that its hyperparameter list is different from that of a standard MLP. Similarly to the previous examples, you should add your list of hyperparameters here and save the file.

  2. Write a PyTorch class implementing your model. Open the neural network library and look at some of the models already implemented. For simplicity, you can start by taking a look into the class MLP. Each class has two mandatory methods: init and forward. The first one is used to initialize the architecture, while the second specifies the list of computations to perform. The method init takes two input variables that are automatically computed within the run_nn function: inp_dim is simply the dimensionality of the neural network input, while options is a dictionary containing all the parameters specified in the architecture section of the configuration file.
    For instance, you can access the DNN activations of the various layers in this way: options['dnn_lay'].split(','). As you can see from the MLP class, the initialization method defines and initializes all the parameters of the neural network. The forward method takes as input a tensor x (i.e., the input data) and outputs another tensor containing the result of the computations. If your model is a sequence model (i.e., if there is at least one architecture with arch_seq_model=true in the cfg file), x is a tensor with shape (time_steps, batches, N_in); otherwise, it is a (batches, N_in) matrix. The forward method defines the list of computations that transform the input tensor into a corresponding output tensor. The output must have the sequential format (time_steps, batches, N_out) for recurrent models and the non-sequential format (batches, N_out) for feed-forward models. Similarly to the already-implemented models, the user should write a new class (e.g., myDNN) that implements the customized model:

class myDNN(nn.Module):
    def __init__(self, options, inp_dim):
        super(myDNN, self).__init__()
        # initialize the parameters

    def forward(self, x):
        # do some computations out=f(x)
        return out
  3. Create a configuration file. Now that you have defined your model and the list of its hyperparameters, you can create a configuration file. To create your own configuration file, you can take a look into an already existing config file (e.g., for simplicity you can consider cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg). After defining the adopted datasets with their related features and labels, the configuration file has some sections called [architecture*]. Each architecture implements a different neural network. In cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg we only have [architecture1], since the acoustic model is composed of a single neural network. To add your own neural network, you have to write an architecture section (e.g., [architecture1]) in the following way:
arch_name= mynetwork (this is a name you would like to use to refer to this architecture within the following model section)
arch_proto=proto/myDNN.proto (here is the name of the proto file defined before)
arch_library=neural_networks (this is the name of the library where myDNN is implemented)
arch_class=myDNN (This must be the name of the class you have implemented)
arch_pretrain_file=none (With this you can specify if you want to pre-train your model)
arch_freeze=False (set False if you want to update the parameters of your model)
arch_seq_model=False (set False for feed-forward models, True for recurrent models)

Then, you have to specify proper values for all the hyperparameters specified in proto/myDNN.proto. For the MLP.proto, we have:


Then, add the parameters related to the optimization of your own architecture. You can use here standard sgd, adam, or rmsprop (see cfg/TIMIT_baselines/TIMIT_LSTM_mfcc.cfg for an example with rmsprop):

  4. Save the configuration file into the cfg folder (e.g., cfg/myDNN_exp.cfg).

  5. Run the experiment with:

python cfg/myDNN_exp.cfg
  6. To debug the model, you can first take a look at the standard output. The config file is automatically parsed, and errors are raised in case of possible problems. You can also take a look into the log.log file to see additional information on possible errors.

When implementing a new model, an important debug test consists of doing an overfitting experiment (to make sure that the model is able to overfit a tiny dataset). If the model is not able to overfit, it means that there is a major bug to solve.

  7. Hyperparameter tuning. In deep learning, it is often important to play with the hyperparameters to find the proper setting for your model. This activity is usually computationally expensive and time-consuming, but is often necessary when introducing new architectures. To help with hyperparameter tuning, we developed a utility that implements a random search over the hyperparameters (see the next section for more details).

How can I tune the hyperparameters

Hyperparameter tuning is often needed in deep learning to search for proper neural architectures. To help tune the hyperparameters within PyTorch-Kaldi, we have implemented a simple utility that performs a random search. In particular, the script generates a set of random configuration files and can be run in this way:

python cfg/TIMIT_MLP_mfcc.cfg exp/TIMIT_MLP_mfcc_tuning 10 arch_lr=randfloat(0.001,0.01) batch_size_train=randint(32,256) dnn_act=choose_str{relu,relu,relu,relu,softmax|tanh,tanh,tanh,tanh,softmax}

The first parameter is the reference cfg file that we would like to modify, while the second one is the folder where the random configuration files are saved. The third parameter is the number of random config files that we would like to generate. Then there is the list of all the hyperparameters that we want to change. For instance, arch_lr=randfloat(0.001,0.01) will replace the field arch_lr with a random float ranging from 0.001 to 0.01. batch_size_train=randint(32,256) will replace batch_size_train with a random integer between 32 and 256, and so on. Once the config files are created, they can be run sequentially or in parallel with:

python $cfg_file
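Internally, the random-replacement logic amounts to drawing a value for each randfloat/randint directive and substituting it into the matching config field. A minimal, hypothetical re-implementation for illustration (the actual utility in the repository also handles full config files and the choose_str directive):

```python
import random
import re

def draw_value(spec):
    """Draw a random value from a randfloat(a,b) or randint(a,b) directive."""
    m = re.match(r"randfloat\(([^,]+),([^)]+)\)", spec)
    if m:
        return str(random.uniform(float(m.group(1)), float(m.group(2))))
    m = re.match(r"randint\(([^,]+),([^)]+)\)", spec)
    if m:
        return str(random.randint(int(m.group(1)), int(m.group(2))))
    return spec  # plain value: use it as-is

def randomize(cfg_lines, overrides):
    """Replace 'field = value' lines according to a {field: directive} dict."""
    out = []
    for line in cfg_lines:
        key = line.split("=")[0].strip()
        if key in overrides:
            out.append("%s = %s" % (key, draw_value(overrides[key])))
        else:
            out.append(line)
    return out

cfg = ["arch_lr = 0.08", "batch_size_train = 128"]
new_cfg = randomize(cfg, {"arch_lr": "randfloat(0.001,0.01)",
                          "batch_size_train": "randint(32,256)"})
```

Each generated config file would be one such randomized copy of the reference cfg.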

How can I use my own dataset

PyTorch-Kaldi can be used with any speech dataset. To use your own dataset, the steps to take are similar to those discussed in the TIMIT/Librispeech tutorials. In general, what you have to do is the following:

  1. Run the Kaldi recipe with your dataset. Please, see the Kaldi website to have more information on how to perform data preparation.
  2. Compute the alignments on training, validation, and test data.
  3. Write a PyTorch-Kaldi config file $cfg_file.
  4. Run the config file with python $cfg_file.

How can I plug-in my own features

The current version of PyTorch-Kaldi supports input features stored in the Kaldi ark format. If the user wants to perform experiments with customized features, the latter must be converted into the ark format. Take a look into the kaldi-io-for-python git repository for a detailed description of how to convert numpy arrays into ark files. Moreover, you can take a look into our utility script that generates Kaldi ark files containing raw features, which are later used to train neural networks fed directly by the raw waveform (see the section about processing audio with SincNet).
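As a rough illustration of the idea (not a substitute for kaldi-io-for-python, which also handles the binary format), a numpy matrix can be dumped in Kaldi's text ark format, which pairs an utterance key with a bracketed matrix:

```python
import numpy as np

def write_text_ark(path, feats):
    """Write a {utt_id: numpy matrix} dict in Kaldi text ark format (ark,t)."""
    with open(path, "w") as f:
        for utt_id, mat in feats.items():
            f.write(utt_id + "  [\n")
            for i, row in enumerate(mat):
                closing = " ]" if i == len(mat) - 1 else ""
                f.write("  " + " ".join("%.6f" % v for v in row) + closing + "\n")

feats = {"utt1": np.random.randn(5, 13)}  # e.g., 5 frames of 13-dim features
write_text_ark("feats_raw.ark", feats)
# the text archive can then be converted to binary with Kaldi's copy-feats:
#   copy-feats ark,t:feats_raw.ark ark:feats_raw_bin.ark
```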

How can I transcribe my own audio files

The current version of PyTorch-Kaldi supports the standard production process of using a PyTorch-Kaldi pre-trained acoustic model to transcribe one or multiple .wav files. It is important to understand that you must have a trained PyTorch-Kaldi model. While you don't need labels or alignments anymore, PyTorch-Kaldi still needs several files to transcribe a new audio file:

  1. The features and features list feats.scp (with .ark files, see #how-can-i-plug-my-own-features)
  2. The decoding graph (usually created during a previous model training, such as the triphone model training). This graph is not needed if you're not decoding.

Once you have all these files, you can add your dataset section to the global configuration file. The easiest way is to copy the cfg file used to train your acoustic model and modify it by adding a new [dataset*] section:

data_name = myWavFile
fea = fea_name=fbank
  fea_opts=apply-cmvn --utt2spk=ark:myWavFilePath/data//utt2spk  ark:myWavFilePath/cmvn_test.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |

lab = lab_name=none

train_with = TIMIT_tr
valid_with = TIMIT_dev
forward_with = myWavFile

The key setting for transcribing your audio file is lab_name=none. The none tag asks PyTorch-Kaldi to enter a production mode that only performs the forward propagation and decoding, without any labels. You don't need TIMIT_tr and TIMIT_dev to be on your production server, since PyTorch-Kaldi will skip them and go directly to the forward phase on the dataset given in the forward_with field. As you can see, the fea field requires exactly the same parameters as a standard training or testing dataset, while the lab field only requires two parameters. Please note that lab_data_folder is nothing more than the same path as fea_lst. Finally, you still need to specify the number of chunks you want to create to process this file (1 hour = 1 chunk).
In your standard .cfg, you might have used keywords such as N_out_lab_cd that cannot be used anymore. Indeed, in a production scenario, you don't want to have the training data on your machine. Therefore, all such variables in your .cfg file must be replaced by their actual values. To replace all the N_out_{mono,lab_cd} variables, you can take a look at the output of:

hmm-info /path/to/the/final.mdl/used/to/generate/the/training/ali

Then, if you normalize the posteriors as follows (check the forward section of your .cfg):

normalize_posteriors = True
normalize_with_counts_from = lab_cd

You must replace lab_cd by:

normalize_posteriors = True
normalize_with_counts_from = /path/to/ali_train_pdf.counts

This normalization step is crucial for HMM-DNN speech recognition. DNNs, in fact, provide posterior probabilities, while HMMs are generative models that work with likelihoods. To derive the required likelihoods, one can simply divide the posteriors by the prior probabilities. To create this ali_train_pdf.counts file, follow these steps:

alidir=/path/to/the/exp/tri_ali (change it with your path to the exp with the ali)
num_pdf=$(hmm-info $alidir/final.mdl | awk '/pdfs/{print $4}')
labels_tr_pdf="ark:ali-to-pdf $alidir/final.mdl \"ark:gunzip -c $alidir/ali.*.gz |\" ark:- |"
analyze-counts --verbose=1 --binary=false --counts-dim=$num_pdf "$labels_tr_pdf" ali_train_pdf.counts
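Conceptually, the normalization performed with the counts file amounts to subtracting log-priors from log-posteriors (the priors being estimated from the state counts). A minimal numpy sketch, assuming the counts vector has already been loaded:

```python
import numpy as np

def posteriors_to_pseudo_likelihoods(log_post, counts):
    """Convert DNN log-posteriors into scaled log-likelihoods.

    log_post: (frames, num_pdf) matrix of log p(state | x)
    counts:   (num_pdf,) vector of state counts from ali_train_pdf.counts
    """
    log_priors = np.log(counts / counts.sum())  # log p(state)
    # log p(x | state) + const = log p(state | x) - log p(state)
    return log_post - log_priors

post = np.full((2, 4), 0.25)  # toy example: uniform posteriors over 4 states
log_lik = posteriors_to_pseudo_likelihoods(np.log(post),
                                           np.array([10., 20., 30., 40.]))
```

Note how states with smaller counts (rarer states) receive larger scaled likelihoods, which is exactly the effect of dividing by the priors.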

Et voilà! In a production scenario, you might need to transcribe a huge number of audio files, and you don't want to create as many .cfg files as needed. To this end, after creating this initial production .cfg file (you can leave the paths blank), you can call the script with specific arguments referring to your different .wav features:

python cfg/TIMIT_baselines/TIMIT_MLP_fbank_prod.cfg --dataset4,fea,0,fea_lst="myWavFilePath/data/feats.scp" --dataset4,lab,0,lab_data_folder="myWavFilePath/data/" --dataset4,lab,0,lab_graph="myWavFilePath/exp/tri3/graph/"

This command will internally alter the configuration file with your specified paths and run with your defined features! Note that passing long arguments to the script requires a specific notation: --dataset4 specifies the name of the created section, fea is the name of the higher-level field, and fea_lst or lab_graph is the name of the lowest-level field you want to change. The 0 indicates which occurrence of the lowest-level field to alter; indeed, some configuration files may contain multiple lab_graph entries per dataset! Therefore, 0 indicates the first occurrence, 1 the second, and so on. Paths MUST be enclosed in " " to be interpreted as full strings! Note that you need to alter the data_name and forward_with fields if you don't want different .wav file transcriptions to overwrite each other (decoding files are stored according to the field data_name): --dataset4,data_name=MyNewName --data_use,forward_with=MyNewName.

Batch size, learning rate, and dropout scheduler

In order to give users more flexibility, the latest version of PyTorch-Kaldi supports scheduling of the batch size, max_seq_length_train, learning rate, and dropout factor. This means that it is now possible to change these values during training. To support this feature, we implemented the following formalisms within the config files:

batch_size_train = 128*12 | 64*10 | 32*2

In this case, our batch size will be 128 for the first 12 epochs, 64 for the following 10 epochs, and 32 for the last two epochs. By default "*" means "for N times", while "|" is used to indicate a change of the batch size. Note that if the user simply sets batch_size_train = 128, the batch size is kept fixed during all the training epochs by default.
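The expansion of this formalism into per-epoch values can be sketched as follows (a hypothetical helper for illustration; the toolkit performs an equivalent expansion internally when parsing the config):

```python
def expand_schedule(spec):
    """Expand a 'value*N | value*N | ...' schedule into one value per epoch."""
    values = []
    for block in spec.split("|"):
        value, times = block.strip().split("*")
        values.extend([value.strip()] * int(times))
    return values

batch_sizes = expand_schedule("128*12 | 64*10 | 32*2")
# 24 entries: 128 for epochs 0-11, 64 for epochs 12-21, 32 for epochs 22-23
```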

A similar formalism can be used to perform learning rate scheduling:

arch_lr = 0.08*10|0.04*5|0.02*3|0.01*2|0.005*2|0.0025*2

In this case, if the user simply sets arch_lr = 0.08 the learning rate is annealed with the new-bob procedure used in the previous version of the toolkit. In practice, we start from the specified learning rate and we multiply it by a halving factor every time that the improvement on the validation dataset is smaller than the threshold specified in the field arch_improvement_threshold.

Also the dropout factor can now be changed during training with the following formalism:

dnn_drop = 0.15*12|0.20*12,0.15,0.15*10|0.20*14,0.15,0.0

With the line above, we can set a different dropout rate for different layers and different epochs. For instance, the first hidden layer will have a dropout rate of 0.15 for the first 12 epochs and 0.20 for the following 12. The dropout factor of the second layer, instead, will remain constant at 0.15 over the whole training. The same formalism is used for all the layers. Note that "|" indicates a change in the dropout factor within the same layer, while "," separates different layers.
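The per-layer schedule can be expanded in the same spirit, splitting first on "," (layers) and then on "|" (epoch blocks). A hypothetical sketch, where a value without "*" is held constant for the remaining epochs:

```python
def expand_dropout(spec, n_epochs):
    """Expand 'sched,sched,...' into a per-layer list of per-epoch dropout rates."""
    layers = []
    for layer_spec in spec.split(","):
        rates = []
        for block in layer_spec.split("|"):
            if "*" in block:
                value, times = block.split("*")
                rates.extend([float(value)] * int(times))
            else:
                rates.extend([float(block)] * (n_epochs - len(rates)))
        layers.append(rates)
    return layers

# 3 layers: scheduled, constant 0.15, and no dropout at all
drop = expand_dropout("0.15*12|0.20*12,0.15,0.0", n_epochs=24)
```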

You can take a look at a config file where batch sizes, learning rates, and dropout factors are changed here:


or here:


How can I contribute to the project

The project is still in its initial phase, and we invite all potential contributors to participate. We hope to build a community of developers large enough to progressively maintain, improve, and expand the functionalities of our current toolkit. For instance, it would be helpful to report any bugs or suggestions to improve the current version of the code. People can also contribute by adding new neural models, which can enrich the set of currently-implemented architectures.


Speech recognition from the raw waveform with SincNet

Take a look into our video introduction to SincNet

SincNet is a convolutional neural network recently proposed to process raw audio waveforms. In particular, SincNet encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized filter-bank front-end, that only depends on some parameters with a clear physical meaning.
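To make the idea concrete, each SincNet filter is an ideal band-pass filter parametrized only by its low and high cutoff frequencies; in the time domain it is the difference of two sinc functions. A simplified numpy sketch (without the windowing and normalization used in the actual implementation):

```python
import numpy as np

def sinc_bandpass(f1, f2, kernel_size, fs=16000):
    """Band-pass FIR kernel with learnable cutoffs f1 < f2 (in Hz).

    g[n] = 2*f2*sinc(2*f2*n/fs) - 2*f1*sinc(2*f1*n/fs)
    """
    n = np.arange(-(kernel_size // 2), kernel_size // 2 + 1)
    # np.sinc(x) = sin(pi*x)/(pi*x), so we pass 2*f*n/fs directly
    low = 2 * f1 / fs * np.sinc(2 * f1 * n / fs)
    high = 2 * f2 / fs * np.sinc(2 * f2 * n / fs)
    return high - low

# a 251-tap filter passing roughly the 300-3000 Hz band
h = sinc_bandpass(f1=300.0, f2=3000.0, kernel_size=251)
```

During training, only f1 and f2 are updated for each filter, instead of all 251 taps as in a standard CNN.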

For a more detailed description of the SincNet model, please refer to the following papers:

  • M. Ravanelli, Y. Bengio, "Speaker Recognition from raw waveform with SincNet", in Proc. of SLT 2018 ArXiv

  • M. Ravanelli, Y.Bengio, "Interpretable Convolutional Filters with SincNet", in Proc. of NIPS@IRASL 2018 ArXiv

To use this model for speech recognition on TIMIT, do the following steps:

  1. Follow the steps described in the “TIMIT tutorial”.
  2. Save the raw waveform into the Kaldi ark format. To do it, you can use the utility in our repository. The script saves the input signals into a binary Kaldi archive, keeping the alignments with the pre-computed labels. You have to run it for all the data chunks (e.g., train, dev, test). You can also specify the length of the speech chunk (sig_wlen=200 # ms) composing each frame.
  3. Open the cfg/TIMIT_baselines/TIMIT_SincNet_raw.cfg, change your paths, and run:
python ./ cfg/TIMIT_baselines/TIMIT_SincNet_raw.cfg
  4. With this architecture, we have obtained a PER(%)=17.2%. A standard CNN fed with the same features gives a PER(%)=18.1%. Please, see here to take a look at our results. Our SincNet results outperform those obtained with MFCCs and FBANKs fed into standard feed-forward networks.

In the following table, we compare the results of SincNet with those of other feed-forward neural networks:

Model PER(%)
MLP -fbank 18.7
MLP -mfcc 18.2
CNN -raw 18.1
SincNet -raw 17.2

Joint training between speech enhancement and ASR

In this section, we show how to use PyTorch-Kaldi to jointly train a cascade of a speech enhancement and a speech recognition neural network. The speech enhancement network has the goal of improving the quality of the speech signal by minimizing the MSE between clean and noisy features. The enhanced features then feed another neural network that predicts context-dependent phone states.

In the following, we report a toy-task example based on a reverberated version of TIMIT, which is only intended to show how users should set up the config file to train such a combination of neural networks. Even though some implementation details (and the adopted datasets) are different, this tutorial is inspired by this paper:

  • M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, "Batch-normalized joint training for DNN-based distant speech recognition", in Proceedings of STL 2016 arXiv

To run the system do the following steps:

1- Make sure you have the standard clean version of TIMIT available.

2- Run the Kaldi s5 baseline of TIMIT. This step is necessary to compute the clean features (that will be the labels of the speech enhancement system) and the alignments (that will be the labels of the speech recognition system). We recommend running the full TIMIT s5 recipe (including the DNN training).

3- The standard TIMIT recipe uses MFCCs features. In this tutorial, instead, we use FBANK features. To compute FBANK features run the following script in $KALDI_ROOT/egs/TIMIT/s5 :


for x in train dev test; do
  steps/ --cmd "$train_cmd" --nj $feats_nj data/$x exp/make_fbank/$x $feadir
  steps/ data/$x exp/make_fbank/$x $feadir
done

Note that we use 40 FBANK features here, while Kaldi uses 23 by default. To compute 40-dimensional features, go into "$KALDI_ROOT/egs/TIMIT/conf/fbank.conf" and change the number of output filters.

4- Go to this external repository and follow the steps to generate a reverberated version of TIMIT starting from the clean one. Note that this is just a toy task that is only helpful to show how to set up a joint-training system.

5- Compute the FBANK features for the TIMIT_rev dataset. To do it, you can copy the scripts in $KALDI_ROOT/egs/TIMIT/ into $KALDI_ROOT/egs/TIMIT_rev/. Please, copy also the data folder. Note that the audio files in the TIMIT_rev folders are saved in the standard WAV format, while TIMIT is released in the SPHERE format. To bypass this issue, open the files data/train/wav.scp, data/dev/wav.scp, and data/test/wav.scp and delete the part about SPHERE reading (e.g., /home/mirco/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav). You also have to change the paths from the standard TIMIT to the reverberated one (e.g., replace /TIMIT/ with /TIMIT_rev/). Remember to remove the final pipeline symbol "|". Save the changes and compute the fbank features in this way:


for x in train dev test; do
  steps/ --cmd "$train_cmd" --nj $feats_nj data/$x exp/make_fbank/$x $feadir
  steps/ data/$x exp/make_fbank/$x $feadir
done

Remember to change the $KALDI_ROOT/egs/TIMIT_rev/conf/fbank.conf file in order to compute 40 features rather than the 23 FBANKS of the default configuration.
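The wav.scp edits described in step 5 can also be scripted. A minimal sketch, assuming each line has the form "utt_id sph2pipe-command path |" as in the standard TIMIT recipe (the utterance id and paths below are purely illustrative):

```python
import re

def fix_wav_scp_line(line):
    """Strip the sph2pipe command and trailing pipe; point paths to TIMIT_rev."""
    utt_id, cmd = line.strip().split(None, 1)
    cmd = re.sub(r"^\S*sph2pipe\S*\s+-f\s+wav\s+", "", cmd)  # drop sph2pipe prefix
    cmd = cmd.rstrip("| ").strip()                           # drop trailing pipe
    cmd = cmd.replace("/TIMIT/", "/TIMIT_rev/")              # reverberated copy
    return utt_id + " " + cmd

line = "faem0_si1392 /home/mirco/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav /data/TIMIT/train/dr2/faem0/si1392.wav |"
fixed = fix_wav_scp_line(line)
```

Applying such a function to every line of the three wav.scp files reproduces the manual edits described above.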

6- Once features are computed, open the following config file:


Remember to change the paths according to where the data are stored on your machine. As you can see, we consider two types of features. The fbank_rev features are computed from the TIMIT_rev dataset, while the fbank_clean features are derived from the standard TIMIT dataset and are used as targets for the speech enhancement neural network. As you can see in the [model] section of the config file, we have a cascade between networks performing speech enhancement and speech recognition. The speech recognition architecture jointly estimates both context-dependent and monophone targets (thus using the so-called monophone regularization). To run an experiment, type the following command:

python  cfg/TIMIT_baselines/TIMIT_rev/TIMIT_joint_training_liGRU_fbank.cfg

7- Results: With this configuration file, you should obtain a Phone Error Rate (PER)=28.1%. Note that some oscillation around this performance is perfectly natural and is due to different initializations of the neural parameters.

You can take a closer look into our results here

Distant Speech Recognition with DIRHA

In this tutorial, we use the DIRHA-English dataset to perform a distant speech recognition experiment. The DIRHA-English Dataset is a multi-microphone speech corpus developed under the EC project DIRHA. The corpus is composed of both real and simulated sequences recorded with 32 sample-synchronized microphones in a domestic environment. The database contains signals of different characteristics in terms of noise and reverberation, making it suitable for various multi-microphone signal processing and distant speech recognition tasks. The part of the dataset currently released is composed of 6 native US speakers (3 males, 3 females) uttering 409 Wall Street Journal sentences. The training data have been created using a realistic data contamination approach, based on contaminating the clean wsj-5k sentences with high-quality multi-microphone impulse responses measured in the targeted environment. For more details on this dataset, please refer to the following papers:

  • M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, M. Omologo, "The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments", in Proceedings of ASRU 2015. ArXiv

  • M. Ravanelli, P. Svaizer, M. Omologo, "Realistic Multi-Microphone Data Simulation for Distant Speech Recognition", in Proceedings of Interspeech 2016. ArXiv

In this tutorial, we use the aforementioned simulated data for training (using LA6 microphone), while test is performed using the real recordings (LA6). This task is very realistic, but also very challenging. The speech signals are characterized by a reverberation time of about 0.7 seconds. Non-stationary domestic noises (such as vacuum cleaner, steps, phone rings, etc.) are also present in the real recordings.

Let’s start now with the practical tutorial.

1- If not available, download the DIRHA dataset from the LDC website. LDC releases the full dataset for a small fee.

2- Go to this external repository. As reported there, you have to generate the contaminated WSJ dataset with the provided MATLAB script. Then, you can run the proposed Kaldi baseline to have features and labels ready for our PyTorch-Kaldi toolkit.

3- Open the following configuration file:


This configuration file implements a simple RNN model based on a Light Gated Recurrent Unit (Li-GRU). We use fMLLR features as input. Change the paths and run the following command:

python cfg/DIRHA_baselines/DIRHA_liGRU_fmllr.cfg

4- Results: The aforementioned system should provide a Word Error Rate (WER) of 23.2%. You can find the results we obtained here.

Using the other configuration files in the cfg/DIRHA_baselines folder you can perform experiments with different setups. With the provided configuration files you can obtain the following results:

Model WER(%)
MLP 26.1
GRU 25.3
Li-GRU 23.8

Training an autoencoder

The current version of the repository is mainly designed for speech recognition experiments. We are actively working on a new version, which is much more flexible and can manage inputs/outputs different from Kaldi features/labels. Even with the current version, however, it is possible to implement other systems, such as an autoencoder.

An autoencoder is a neural network whose inputs and outputs are the same. The middle layer normally contains a bottleneck that forces our representations to compress the information of the input. In this tutorial, we provide a toy example based on the TIMIT dataset. For instance, see the following configuration file:


Our inputs are the standard 40-dimensional fbank coefficients, gathered using a context window of 11 frames (i.e., the total dimensionality of our input is 440). A feed-forward neural network (called MLP_encoder) encodes our features into a 100-dimensional representation. The decoder (called MLP_decoder) is fed with the learned representation and tries to reconstruct the input. The system is trained with the Mean Squared Error (MSE) metric. Note that in the [Model] section we added the line “err_final=cost_err(dec_out,lab_cd)” at the end. The current version of the toolkit, in fact, requires that at least one label is specified (we will remove this limitation in the next version).
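The shape of the computation can be sketched with plain numpy (random, untrained weights; the layer sizes follow the 440 → 100 → 440 description above, while the actual toolkit models are PyTorch classes configured via the cfg file):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 440, 100          # 11 frames x 40 fbanks -> 100-dim bottleneck

W_enc = rng.standard_normal((n_in, n_code)) * 0.01
W_dec = rng.standard_normal((n_code, n_in)) * 0.01

x = rng.standard_normal((8, n_in))   # a mini-batch of 8 feature vectors
code = np.tanh(x @ W_enc)            # MLP_encoder: compressed representation
x_rec = code @ W_dec                 # MLP_decoder: reconstruction of the input
mse = np.mean((x - x_rec) ** 2)      # the loss minimized during training
```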

You can train the system running the following command:

python cfg/TIMIT_baselines/TIMIT_MLP_fbank_autoencoder.cfg

The results should look like this:

ep=000 tr=['TIMIT_tr'] loss=0.139 err=0.999 valid=TIMIT_dev loss=0.076 err=1.000 lr_architecture1=0.080000 lr_architecture2=0.080000 time(s)=41
ep=001 tr=['TIMIT_tr'] loss=0.098 err=0.999 valid=TIMIT_dev loss=0.062 err=1.000 lr_architecture1=0.080000 lr_architecture2=0.080000 time(s)=39
ep=002 tr=['TIMIT_tr'] loss=0.091 err=0.999 valid=TIMIT_dev loss=0.058 err=1.000 lr_architecture1=0.040000 lr_architecture2=0.040000 time(s)=39
ep=003 tr=['TIMIT_tr'] loss=0.088 err=0.999 valid=TIMIT_dev loss=0.056 err=1.000 lr_architecture1=0.020000 lr_architecture2=0.020000 time(s)=38
ep=004 tr=['TIMIT_tr'] loss=0.087 err=0.999 valid=TIMIT_dev loss=0.055 err=0.999 lr_architecture1=0.010000 lr_architecture2=0.010000 time(s)=39
ep=005 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.005000 lr_architecture2=0.005000 time(s)=39
ep=006 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.002500 lr_architecture2=0.002500 time(s)=39
ep=007 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.001250 lr_architecture2=0.001250 time(s)=39
ep=008 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=0.999 lr_architecture1=0.000625 lr_architecture2=0.000625 time(s)=41
ep=009 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=0.999 lr_architecture1=0.000313 lr_architecture2=0.000313 time(s)=38

You should only consider the field "loss=". The field "err=" does not contain useful information in this case (for the aforementioned reason). You can take a look at the generated features by typing the following command:

copy-feats ark:exp/TIMIT_MLP_fbank_autoencoder/exp_files/forward_TIMIT_test_ep009_ck00_enc_out.ark  ark,t:- | more
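The `ark,t:-` output printed by copy-feats is a simple text format: each utterance id is followed by a `[ ... ]` block with one row of floats per frame. A small stdlib-only parser (a hypothetical helper, not part of the toolkit) makes that structure explicit:

```python
def parse_text_ark(text):
    """Parse Kaldi text-archive output into {utt_id: list of float rows}."""
    feats, utt, rows = {}, None, []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.endswith('['):            # header line: "utt_id  ["
            utt, rows = line[:-1].strip(), []
        else:
            closing = line.endswith(']')  # last row ends the matrix
            line = line.rstrip(']').strip()
            if line:
                rows.append([float(v) for v in line.split()])
            if closing:
                feats[utt] = rows
    return feats

# toy excerpt in the same layout as the copy-feats output above
sample = """fdhc0_si1559  [
  0.1 0.2 0.3
  0.4 0.5 0.6 ]
"""
enc = parse_text_ark(sample)   # {'fdhc0_si1559': [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]}
```

For real work, a binary-capable reader (e.g. the kaldi_io module shipped with the toolkit) is the better choice; this sketch only illustrates the text layout.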


[1] M. Ravanelli, T. Parcollet, Y. Bengio, "The PyTorch-Kaldi Speech Recognition Toolkit", in Proceedings of ICASSP 2019. arXiv

[2] M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, "Improving speech recognition by revising gated recurrent units", in Proceedings of Interspeech 2017. ArXiv

[3] M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, "Light Gated Recurrent Units for Speech Recognition", in IEEE Transactions on Emerging Topics in Computational Intelligence. ArXiv

[4] M. Ravanelli, "Deep Learning for Distant Speech Recognition", PhD Thesis, Unitn 2017. ArXiv

[5] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, C. Trabelsi, R. De Mori, Y. Bengio, "Quaternion Recurrent Neural Networks", in Proceedings of ICLR 2019 ArXiv

[6] T. Parcollet, M. Morchid, G. Linarès, R. De Mori, "Bidirectional Quaternion Long-Short Term Memory Recurrent Neural Networks for Speech Recognition", in Proceedings of ICASSP 2019 ArXiv

  • run TIMIT_SincNet_raw error

When I ran python ./ cfg/TIMIT_baselines/TIMIT_SincNet_raw.cfg, this error occurred:

    - Reading config file......OK!
    - Chunk creation......OK!
    ------------------------------ Epoch 000 / 023 ------------------------------
    Training TIMIT_tr chunk = 1 / 10
    ERROR: training epoch 0, chunk 0 not done! File exp/TIMIT_SincNet_raw/exp_files/ does not exist.
    See exp/TIMIT_SincNet_raw/log.log 

    the logs are as follows:

    copy-feats scp:exp/TIMIT_SincNet_raw/exp_files/train_TIMIT_tr_ep000_ck00_raw.lst ark:- 
      LOG (copy-feats[5.5.166~1-013489]:main() Copied 370 feature matrices. 
      ali-to-pdf quick_test/exp_ali/tri3_ali/final.mdl ark:- ark:-           
      LOG (ali-to-pdf[5.5.166~1-013489]:main() Converted 3696 alignments to pdf sequences. 
      copy-feats scp:exp/TIMIT_SincNet_raw/exp_files/train_TIMIT_tr_ep000_ck00_raw.lst ark:- 
      LOG (copy-feats[5.5.166~1-013489]:main() Copied 370 feature matrices. 
      ali-to-phones --per-frame=true quick_test/exp_ali/tri3_ali/final.mdl ark:- ark:-
      LOG (ali-to-phones[5.5.166~1-013489]:main() Done 3696 utterances. 
      Traceback (most recent call last):
        File "", line 208, in <module>
        File "/data/zhanghao/pytorch-kaldi/", line 1584, in forward_model
        File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/", line 477, in __call__ 
          result = self.forward(*input, **kwargs)
        File "/data/zhanghao/pytorch-kaldi/", line 1363, in forward
          x = self.drop[i](self.act[i](self.ln[i](F.max_pool1d(self.conv[i](x), self.sinc_max_pool_len[i])))) 
        File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/", line 477, in __call__ 
          result = self.forward(*input, **kwargs) 
        File "/data/zhanghao/pytorch-kaldi/", line 1500, in forward
          self.filters = (band_pass * self.window_).view( 
      RuntimeError: The size of tensor a (127) must match the size of tensor b (128) at non-singleton dimension 1

My torch version is 0.4.0, on Ubuntu 16.04.

    opened by zdgithub 41
  • How to avoid alignments from test?


    First of all, thank you for this nice toolkit. Currently, we are trying to build models for Aurora 4 dataset. I am facing a problem with decoding the test set. I have a few doubts:

    1. Why are alignments required for the final test set? The dev set can be used for validation, but I don't know where the test alignments are used.
    2. Because of the test alignments in Kaldi, some of the test utterances are inevitably not aligned properly. This reduces the number of test utterances used in the PyTorch DNN decoding, so every time I decode, the WER is reported with the label [PARTIAL]. How can I avoid this problem and decode all utterances? Kindly help me with this. Thank you so much.
    opened by HardikSailor 32
  • I have some error in

I am trying the "TIMIT tutorial" now. I executed everything up to step 6 without any error, but I get an error in step 7.

Do you have any idea how to fix the error?

My test environment is as follows.

    • Ubuntu 16.04
    • Python 3.7 on Anaconda
    • PyTorch 1.0 with torchvision 0.2
    • CUDA 9.0

The error is as follows.

    $ python3 cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg
    - Reading config file......OK!
    - Chunk creation......OK!
    ------------------------------ Epoch 00 / 23 ------------------------------
    Training TIMIT_tr chunk = 2 / 5
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/home/rc19/dev/rc19/LOCAL/pyenv/versions/anaconda3-5.3.1/envs/py37-kaldi/lib/python3.7/", line 917, in _bootstrap_inner
      File "/home/rc19/dev/rc19/LOCAL/pyenv/versions/anaconda3-5.3.1/envs/py37-kaldi/lib/python3.7/", line 865, in run
        self._target(*self._args, **self._kwargs)
      File "/home/rc19/usr/dev/pytorch-kaldi/", line 229, in read_lab_fea
        [data_name_fea,data_set_fea,data_end_index_fea]=load_chunk(fea_scp,fea_opts,lab_folder,lab_opts,cw_left,cw_right,max_seq_length, output_folder, fea_only)
      File "/home/rc19/usr/dev/pytorch-kaldi/", line 145, in load_chunk
        [data_name,data_set,data_lab,end_index]=load_dataset(fea_scp,fea_opts,lab_folder,lab_opts,left,right, max_sequence_length, output_folder, fea_only)
      File "/home/rc19/usr/dev/pytorch-kaldi/", line 99, in load_dataset
        fea_conc,lab_conc = zip(*fea_sorted)
    ValueError: not enough values to unpack (expected 2, got 0)
    Traceback (most recent call last):
      File "", line 198, in <module>
      File "/home/rc19/usr/dev/pytorch-kaldi/", line 85, in run_nn
    IndexError: list index out of range

    Thank you.

    opened by yuma116 29
  • chunks not match

Hi, dear developers, I'm a student working with pytorch-kaldi. I have a weird question about the number of chunks. In the .cfg file I set n_chunks (for the train set) to 50, but in the terminal I saw this:

    ------------------------------ Epoch 00 / 23 ------------------------------
    Training tedlium_tr chunk = 1 / 51

and after running training chunk 50/51, the program is killed. I saw that the feature path in train_tedlium_tr_ep00_ck50.cfg is the same as in the original configuration file; I think this is why the program is killed. (If I set n_chunks = 50, there should not be a *_ck50.cfg, right?)

Also, in list_chunks.txt the valid and test sets have the same chunks as the train set, although I set n_chunks = 1 for valid and test.

I think changing compute_n_chunks in is not the right way to solve this problem. I think I'm using the right features and alignments. Can you help me? Thank you very much!

    opened by lizhimao 28
  • Accessing the WER

Hello, I would like to get the WER. It is supposed to be in the res.res file (when training is done) but I can't see it there. I also checked whether there is a file called ctm_39phn.filt.sys in the exp/TIMIT_MLP_basic5/decode_TIMIT_test_out_dnn1/score_* directory, but there is nothing.

    opened by sawibrah 22
  • WER is nan


I'm trying to run the program with the LRS2 dataset, but at the end I got the following result:

    Decoding eval output out_dnn2
    %WER -nan [ 0 / 0, 0 ins, 0 del, 0 sub ] [PARTIAL] /home/wentao/pytorch_kaldi/pytorch-kaldi/exp/lrs2_liGRU_fmllr/decode_eval_out_dnn2/wer_10_0.0

Did I do something wrong? How can I fix it?

    Thank you Wentao

    opened by wentaoxandry 22
  • nn alignments in TIMIT tutorial

Hi, I'm following your TIMIT tutorial and I'm stuck at step 4. I could run the first two commands, but the last two were not possible. I thought they could be ignored, but the step 7 command requires their outputs. Sorry if this is a basic issue. Can I get some comments?

    opened by dori2063 21
  • No Decoding Output

    I'm running the TIMIT LSTM on custom features, and I obtained the following error in my log.log file:


    I checked my best path file, but did not see any error messages or warnings.


    I've also double checked my cfg file, and all of the directories exist. I'm running Ubuntu 16.04, CUDA 10.2, PyTorch 1.7.1. What am I doing wrong?

    opened by kevinmchu 20
  • error while running ./ after downloading TIMIT dataset

I just downloaded the TIMIT dataset, and while running ./ I got this error. If you can, please tell me what went wrong. Here's a screenshot. Thanks

    opened by HoussBz 17
  • does not exist

    I have got the following error. Could you tell me how to solve this?

    • Reading config file......OK!
    • Chunk creation......OK!

    ------------------------------ Epoch 000 / 023 ------------------------------
    Training TIMIT_tr chunk = 1 / 5
    ERROR: training epoch 0, chunk 0 not done! File exp/TIMIT_MLP_basic/exp_files/ does not exist.
    See exp/TIMIT_MLP_basic/log.log


    apply-cmvn --utt2spk=ark:/audio/kaldi/kaldi/egs/timit/s5/data/train/utt2spk ark:/audio/kaldi/kaldi/egs/timit/s5/mfcc/cmvn_train.ark ark:- ark:-
    copy-feats scp:exp/TIMIT_MLP_basic/exp_files/train_TIMIT_tr_ep000_ck00_mfcc.lst ark:-
    add-deltas --delta-order=2 ark:- ark:-
    LOG (copy-feats[5.5.193~1-05d9a]:main() Copied 739 feature matrices.
    LOG (apply-cmvn[5.5.193~1-05d9a]:main() Applied cepstral mean normalization to 739 utterances, errors on 0
    ali-to-pdf /audio/kaldi/kaldi/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali/final.mdl ark:- ark:-
    LOG (ali-to-pdf[5.5.193~1-05d9a]:main() Converted 6648 alignments to pdf sequences.
    Traceback (most recent call last):
      File "", line 109, in <module>
        [nns,costs]=model_init(inp_out_dict,model,config,arch_dict,use_cuda,multi_gpu,to_do)
      File "/audio/kaldi/pytorch-kaldi/", line 1455, in model_init
        net=nn_class(config[arch_dict[inp1][0]],inp_dim)
      File "/audio/kaldi/pytorch-kaldi/", line 71, in __init__
        self.wx = nn.ModuleList([])
    AttributeError: 'module' object has no attribute 'ModuleList'

    opened by narcise 17
  • Problem with


    I'm trying the Librispeech dataset and I get the following error:

    (pytorch-kaldi) csanta@lilistt:~/pytorch-kaldi/pytorch-kaldi$ CUDA_VISIBLE_DEVICES=0 python3 cfg/Librispeech_baselines/libri_MLP_fmllr.cfg
    - Reading config file......OK!

    • Chunk creation......OK!

    ------------------------------ Epoch 0 / 9 ------------------------------
    Training train_clean_100 chunk = 1 / 50
    [========================================] 100% Training | (Batch 5567/5567) | L:4.613
    [... chunks 2 to 49 train normally, with the loss decreasing to about 1.55 ...]
    Training train_clean_100 chunk = 50 / 50
    Exception in thread Thread-151:
    Traceback (most recent call last):
      File "/usr/lib/python3.5/", line 914, in _bootstrap_inner
      File "/usr/lib/python3.5/", line 862, in run
        self._target(*self._args, **self._kwargs)
      File "/home/csanta/pytorch-kaldi/pytorch-kaldi/", line 458, in read_lab_fea
        [data_name_fea,data_set_fea,data_end_index_fea]=load_chunk(fea_scp,fea_opts,lab_folder,lab_opts,cw_left,cw_right,max_seq_length, output_folder, fea_only)
      File "/home/csanta/pytorch-kaldi/pytorch-kaldi/", line 208, in load_chunk
        [data_name,data_set,data_lab,end_index_fea,end_index_lab]=load_dataset(fea_scp,fea_opts,lab_folder,lab_opts,left,right, max_sequence_length, output_folder, fea_only)
      File "/home/csanta/pytorch-kaldi/pytorch-kaldi/", line 171, in load_dataset
        fea_conc, lab_conc, end_index_fea, end_index_lab = _concatenate_features_and_labels(fea_chunks, lab_chunks)
      File "/home/csanta/pytorch-kaldi/pytorch-kaldi/", line 134, in _concatenate_features_and_labels
        fea_conc, lab_conc = _sort_chunks_by_length(fea_conc, lab_conc)
      File "/home/csanta/pytorch-kaldi/pytorch-kaldi/", line 124, in _sort_chunks_by_length
        fea_conc,lab_conc = zip(*fea_sorted)
    ValueError: not enough values to unpack (expected 2, got 0)

    [========================================] 100% Training | (Batch 5602/5602) | L:1.552
    Traceback (most recent call last):
      File "", line 223, in <module>
        [data_name,data_set,data_end_index,fea_dict,lab_dict,arch_dict]=run_nn(data_name,data_set,data_end_index,fea_dict,lab_dict,arch_dict,config_chunk_file,processed_fir
      File "/home/csanta/pytorch-kaldi/pytorch-kaldi/", line 582, in run_nn
        data_name=shared_list[0]
    IndexError: list index out of range

    opened by sapinedamo 16
  • using different features instead of FMLLR


Here in this image you have the procedure to follow when using fMLLR features. But if I want to use FBANK, for example, do I also follow these steps and then change the paths in the cfg file to point to the fbank features? Should the alignments be done with fMLLR or fbank features? What is the impact in both situations? You also align the training data ("steps/ --nj 30 data/train_clean_100 data/lang exp/tri4b exp/tri4b_ali_clean_100"), but then you never use the in the cfg file, so why align it?


    Thanks a lot, Carlos

    opened by Miamoto 0
  •  err_te is 1

I am using my own dataset, and during training I am getting a test error of 100%. How can I fix this problem?

res.res is as follows:

    epoch 0, loss_tr=8.295877 err_tr=0.996953 loss_te=12.132339 err_te=1.000000 err_te_snt=1.000000
    epoch 8, loss_tr=6.212218 err_tr=0.932266 loss_te=15.733055 err_te=1.000000 err_te_snt=1.000000
    epoch 16, loss_tr=5.513986 err_tr=0.864531 loss_te=18.980804 err_te=1.000000 err_te_snt=1.000000

And when I use the test set as the train set, res.res is as follows:

    epoch 0, loss_tr=6.637052 err_tr=0.990234 loss_te=6.063669 err_te=0.984697 err_te_snt=0.973500
    epoch 8, loss_tr=3.301901 err_tr=0.710049 loss_te=3.365894 err_te=0.715733 err_te_snt=0.241827
    epoch 16, loss_tr=2.495244 err_tr=0.562266 loss_te=2.602354 err_te=0.585831 err_te_snt=0.141579
    epoch 24, loss_tr=2.032088 err_tr=0.470723 loss_te=2.049465 err_te=0.477058 err_te_snt=0.076674
    epoch 32, loss_tr=1.768699 err_tr=0.413418 loss_te=1.816929 err_te=0.428155 err_te_snt=0.064197
    epoch 40, loss_tr=1.570135 err_tr=0.373398 loss_te=1.538213 err_te=0.369554 err_te_snt=0.046786
    epoch 48, loss_tr=1.439105 err_tr=0.344004 loss_te=1.490549 err_te=0.357521 err_te_snt=0.038102
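Each res.res line packs the monitoring metrics into key=value pairs. A small stdlib-only helper (hypothetical, not part of the toolkit) can parse them, e.g. to flag an err_te stuck at 1.0:

```python
import re

def parse_res_line(line):
    """Parse one res.res line into {'epoch': int, 'loss_tr': float, ...}."""
    fields = {'epoch': int(re.search(r'epoch (\d+)', line).group(1))}
    for key, val in re.findall(r'(\w+)=([\d.]+)', line):
        fields[key] = float(val)
    return fields

r = parse_res_line("epoch 8, loss_tr=6.212218 err_tr=0.932266 "
                   "loss_te=15.733055 err_te=1.000000 err_te_snt=1.000000")
# an err_te stuck at 1.0 while err_tr decreases often indicates a
# label/feature mismatch between the training and test data definitions
stuck = r['err_te'] >= 0.999
```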

    opened by severusbunny 0
  • Before switch to SpeechBrain, how to use trained model in pytorch

Hi, I know we have the new SpeechBrain framework now (which is fantastic), but I still have an old model trained with pytorch-kaldi, and I would like to use it in plain PyTorch. When I verify the result, I find that the results from my PyTorch code and from pytorch-kaldi are very different.

My pytorch-kaldi wrapper is as below:

    import configparser
    import torch
    from torch.nn import Module
    from neural_networks import GRU, MLP   # pytorch-kaldi model classes

    class BasePhModel(Module):
        def __init__(self, options):
            super(BasePhModel, self).__init__()
            cfg_file = options["cfg"]        # the cfg file is the same one when I trained the model
            config = configparser.ConfigParser()
            config.read(cfg_file)            # without this, the sections below are empty
            config["architecture1"]["to_do"] = "forward"
            config["architecture1"]["use_cuda"] = "False"
            config["architecture2"]["to_do"] = "forward"
            config["architecture2"]["use_cuda"] = "False"
            model1_file = options["architecture1_file"]
            model2_file = options["architecture2_file"]
            self.model1 = GRU(config["architecture1"], 16)    # I use GRU + MLP in my .conf file
            self.model2 = MLP(config["architecture2"], 1024)
            # load the trained parameters (pytorch-kaldi checkpoints store them
            # under the 'model_par' key)
            self.model1.load_state_dict(torch.load(model1_file, map_location="cpu")["model_par"])
            self.model2.load_state_dict(torch.load(model2_file, map_location="cpu")["model_par"])

        def forward(self, x):
            intermediate = self.model1(x)
            y = self.model2(intermediate)
            return y

When I use the same feature input to this model, the result is very different from the one in "forward_*_decode.ark". Is there anything wrong with my code?

    Thank you very much!

    opened by sun-peach 0
  • Unable to run forwarding step on test set

I'm running the TIMIT LSTM on custom features and was able to successfully train a model. Now I'm testing the model on a custom test set, but the test features do not get forwarded through the trained model. Here is the terminal output:

    - Reading config file......OK!
    - Chunk creation......OK!
    Testing TIMIT_test chunk = 1 / 1
    Decoding TIMIT_test output out_dnn2
    kaldi_decoding_scripts// xx

Additionally, the log file only indicates that hmm-info and ali-to-pdf were run; there are no errors or warnings listed. I suspect this issue has to do with modifying the phonemap file,, and to map to 41 phones rather than 40. I reverted to 40 phones, but the features still never appear to be forwarded, and the code seems to be stuck at the decoding phase.

    Do you have any suggestions on how to solve this issue?

    opened by kevinmchu 0
  • How to train/decode on reverberant speech?

    I'd like to train a model on reverberant speech using the alignments generated from the corresponding anechoic data. Currently, I'm doing something similar to TIMIT_joint_training_liGRU_fbank.cfg, where I am using the reverberant TIMIT recipe to extract the features and the anechoic recipe for lab_folder and lab_graph. I noticed that uses the lab_graph to generate the lattice rather than the graph constructed from the reverberant acoustic model.

    What is the easiest way to specify using the anechoic alignments and reverberant graph?

    opened by kevinmchu 1
Mirco Ravanelli
Mirco Ravanelli is a post-doc at the University of Montreal, working on deep learning at the Montreal Institute for Learning Algorithms (MILA).