kaldi-asr/kaldi is the official location of the Kaldi project.

Overview

Kaldi Speech Recognition Toolkit

To build the toolkit: see ./INSTALL. These instructions are valid for UNIX systems including various flavors of Linux; Darwin; and Cygwin (has not been tested on more "exotic" varieties of UNIX). For Windows installation instructions (excluding Cygwin), see windows/INSTALL.

To run the example system builds, see egs/README.txt

If you encounter problems (and you probably will), please do not hesitate to contact the developers (see below). In addition to specific questions, please let us know if there are specific aspects of the project that you feel could be improved, that you find confusing, etc., and which missing features you most wish it had.

Kaldi information channels

For HOT news about Kaldi see the project site.

Documentation of Kaldi:

  • Info about the project, description of techniques, tutorial for C++ coding.
  • Doxygen reference of the C++ code.

Kaldi forums and mailing lists:

We have two different lists:

  • User list kaldi-help
  • Developer list kaldi-developers

To sign up to any of those mailing lists, go to http://kaldi-asr.org/forums.html.

Development pattern for contributors

  1. Create a personal fork of the main Kaldi repository in GitHub.
  2. Make your changes in a named branch different from master, e.g. you create a branch my-awesome-feature.
  3. Generate a pull request through the Web interface of GitHub.
  4. As a general rule, please follow the Google C++ Style Guide (there are a few exceptions in Kaldi). You can use Google's cpplint.py to verify that your code is free of basic mistakes.

Platform specific notes

PowerPC 64-bit little-endian (ppc64le)

Android

  • Kaldi supports cross compiling for Android using Android NDK, clang++ and OpenBLAS.
  • See this blog post for details.
Comments
  • show L2 norm of parameters during training.

    show L2 norm of parameters during training.

    In addition, set affine to false for batchnorm layers and switch to SGD optimizer.

    The training is still running and a screenshot of the L2-norms of the training parameters is as follows:

    [screenshot: L2 norms of the training parameters]

    I will post the decoding results once it is done.
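
    Not the PR's actual code, but a minimal PyTorch sketch of the ideas above (log the L2 norm of all parameters at each step, build the batchnorm layers with affine=False, and train with SGD); the model, data and hyperparameters are placeholders:

    import torch
    import torch.nn as nn

    # Toy model; batchnorm layers are created with affine=False as described above.
    model = nn.Sequential(
        nn.Linear(40, 256),
        nn.BatchNorm1d(256, affine=False),
        nn.ReLU(),
        nn.Linear(256, 10),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    def param_l2_norm(module):
        """L2 norm of all trainable parameters, flattened into one vector."""
        return torch.cat([p.detach().reshape(-1) for p in module.parameters()
                          if p.requires_grad]).norm().item()

    for step in range(100):
        x = torch.randn(32, 40)          # placeholder minibatch
        y = torch.randint(0, 10, (32,))  # placeholder labels
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        if step % 10 == 0:
            print(f"step {step}: loss={loss.item():.4f} "
                  f"param L2 norm={param_l2_norm(model):.4f}")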

    opened by csukuangfj 67
  • Wake-word detection

    Wake-word detection

    Results of the regular LF-MMI based recipes:

    Mobvoi: EER=~0.2%, FRR=1.02% at FAH=1.5 vs. FRR=3.8% at FAH=1.5 (Mobvoi paper)

    SNIPS: EER=~0.1%, FRR=0.08% at FAH=0.5 vs. FRR=0.12% at FAH=0.5 (SNIPS paper)

    E2E LF-MMI recipes are still being run to confirm the reproducibility of the previous results.
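
    For reference, FRR at a given FAH can be computed from per-example detection scores. Below is a rough sketch of that metric (not the recipe's actual scoring script), assuming you have scores for positive (wake-word) and negative examples plus the total duration of the negative audio:

    import numpy as np

    def frr_at_fah(pos_scores, neg_scores, neg_hours, target_fah):
        """False-rejection rate at the threshold that allows `target_fah`
        false alarms per hour on the negative audio."""
        pos = np.asarray(pos_scores)
        neg = np.sort(np.asarray(neg_scores))[::-1]   # descending
        n_allowed = int(target_fah * neg_hours)        # allowed false alarms
        # Fire when score > threshold; pick the threshold so that at most
        # n_allowed negative examples fire.
        threshold = neg[n_allowed] if n_allowed < len(neg) else -np.inf
        return float(np.mean(pos <= threshold))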

    opened by freewym 67
  • Multilingual using modified configs

    Multilingual using modified configs

    This is a modified multilingual setup based on the new xconfig and training scripts. In this setup, xconfig is used to create the network configuration for multilingual training. The egs generation has also been moved out of the training script, and the multilingual egs dir is passed to train_raw_dnn.py. A new script has been added for average posterior computation and prior adjustment.

    opened by pegahgh 65
  • CUDA context creation problem in nnet3 training with

    CUDA context creation problem in nnet3 training with "--use-gpu=wait" option

    I am not sure if this is a Kaldi issue but I thought someone might have an idea.

    First some context. I am trying to tune a few TDNN chain models on a workstation with 2 Maxwell Titan X 12GB cards. The data sets I am working with are fairly small (Babel full language packs with 40-80 hours audio). Initially I set the number of initial and final training jobs to 2 and trained the models with scripts adapted from babel and swbd recipes. While this worked without any issues, I noticed that the models were overtraining, so I tried tuning relu-dim, number of epochs and xent-regularize with one of the language packs to see if I could get a better model. Eventually the best model I got was with a single epoch and xent-regularize=0.25 (WER base model: 45.5% vs best model: 41.4%). To see if the training schedule might have any further effects on the model performance, I also tried training with --num-jobs-initial=2, --num-jobs-final=8 after setting the GPUs to "default" compute mode to allow the creation of multiple CUDA contexts. I added 2 seconds delay between individual jobs so that earlier jobs would start allocating device memory before a new job is scheduled on the device with the largest amount of free memory. This mostly worked fine, except towards the end when 8 jobs were distributed 5-3 between the two cards. The resulting model had 40.9% WER after 2 epochs and the log probability difference between the train and validation sets was also smaller than before. It seems like the training schedule (number of jobs, learning rate, etc. at each iteration) has an effect on the model performance in this small data scenario. Maybe averaging gradients across a larger number of jobs is beneficial, or the learning rate schedule is somehow tuned for this type of training schedule.

    Now the actual problem. Since large number of jobs seemed to work better for me, I wanted to remove the job delay hack, set GPUs back to "exclusive process" compute mode and take advantage of the --use-gpu=wait option while scheduling the training jobs. However, it seems like I am missing something. If I launch multiple training processes with the --use-gpu=wait option while GPUs are in "exclusive process" compute mode, only one process can create a CUDA context on a given GPU card even after that one process completes. My expectation was that other processes would wait for the GPUs to be available and then one by one acquire the GPUs and complete their work. I added a few debug statements to GetCudaContext function to see what the problem was. cudaDeviceSynchronize call returns "all CUDA-capable devices are busy or unavailable" even after processes running on the GPUs are long gone. Any ideas?

    opened by dogancan 63
  • Modify TransitionModel for more compact chain-model graphs

    Modify TransitionModel for more compact chain-model graphs

    Placeholder for addressing #1031. WIP log:

    1. self_loop_pdf_class added to HmmState, done
    2. self_loop_pdf added to Tuple in TransitionModel. done
    3. another branch of ContextDependencyInterface::GetPdfInfo. ugly done
    4. create test code for new structures. done
    5. backward compatibility for all read code. done
    6. normal HMM validation using RM. done
    7. chain code modification. done
    8. chain validation using RM. done
    9. iterate 2nd version of GetPdfInfo. done
    10. documents and comments. tbd...
    opened by naxingyu 63
  • add PyTorch's DistributedDataParallel training.

    add PyTorch's DistributedDataParallel training.

    Support distributed training across multiple GPUs.

    TODOs:

    • there is a lot of code duplication

    Part of the training log

    2020-02-19 13:55:10,646 INFO [ddp_train.py:160] Device (1) processing 1100/4724(23.285351%) global average objf: -0.225449 over 6165760.0 frames, current batch average objf: -0.130735 over 6400 frames, epoch 0
    2020-02-19 13:55:55,251 INFO [ddp_train.py:160] Device (0) processing 1200/4724(25.402202%) global average objf: -0.216779 over 6732672.0 frames, current batch average objf: -0.123979 over 3840 frames, epoch 0
    2020-02-19 13:55:55,252 INFO [ddp_train.py:160] Device (1) processing 1200/4724(25.402202%) global average objf: -0.216412 over 6738176.0 frames, current batch average objf: -0.132368 over 4736 frames, epoch 0
    

    The training seems to be working.
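
    Not the PR's ddp_train.py, but a minimal sketch of PyTorch DistributedDataParallel training (one process per GPU, launched with e.g. torchrun --nproc_per_node=2); the model and data are placeholders:

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        device = torch.device("cuda", local_rank)
        torch.cuda.set_device(device)

        model = DDP(nn.Linear(40, 10).to(device), device_ids=[local_rank])
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        criterion = nn.CrossEntropyLoss()

        for step in range(100):
            x = torch.randn(64, 40, device=device)          # placeholder batch
            y = torch.randint(0, 10, (64,), device=device)  # placeholder labels
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()            # DDP all-reduces the gradients here
            optimizer.step()
            if step % 10 == 0 and dist.get_rank() == 0:
                print(f"step {step}: loss={loss.item():.4f}")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()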

    opened by csukuangfj 62
  • Is there any speaker diarization documentation and already trained model?

    Is there any speaker diarization documentation and already trained model?

    Hi there, thanks for Kaldi :)

    I want to perform speaker diarization on a set of audio recordings. I believe Kaldi recently added the speaker diarization feature. I have managed to find this link; however, I have not been able to figure out how to use it since there is very little documentation. Also, may I ask whether there is an already-trained model on English conversations that I can use off-the-shelf, please?

    Thanks a lot!

    opened by bwang482 61
  • expose egs as Dataloader

    expose egs as Dataloader

    Expose egs as a DataLoader in PyTorch; training time decreased from 150 minutes to 90 minutes for 6 epochs with 4 workers.

    RESULT

    | | TDNN-F (PyTorch, Adam, delta dropout, without ivector) from @fanlu | TDNN-F (same config), this PR 2nd run | TDNN-F (same config), this PR 1st run | this PR with commit 0d8aada to make dropout go to zero at the end |
    |--|--|--|--|--|
    | dev_cer | 6.10 | 6.13 | 6.18 | 6.12 |
    | dev_wer | 13.86 | 13.89 | 13.96 | 13.92 |
    | test_cer | 7.14 | 7.19 | 7.20 | 7.26 |
    | test_wer | 15.49 | 15.54 | 15.66 | 15.63 |
    | training_time | 151 mins | 88 mins | 84 mins | |

    WER/CER increase may come from:

    • Shuffle, we do not shuffle egs-minibatch during each epoch.
    • Dropout, we use pseudo_epoch (one scp file is one pseudo-epoch) to compute data_fraction in dropout, which is much more coarse-grained than using batch_idx.

    Note that I have tried copy|shuffle|merge in the dataloader (see the code below), but it seems to take as much time as (or even a little more time than) the original approach (egs as a Dataset). I may do further experiments to look into this:

     scp_rspecifier = scp_file_to_process
     # Pipe the egs through copy -> shuffle -> merge before handing them to the DataLoader.
     egs_rspecifier = ('ark,bg:nnet3-chain-copy-egs --frame-shift .. '
                       f'scp:{scp_rspecifier} ark:- | '
                       'nnet3-chain-shuffle-egs --buffer-size .. --srand .. ark:- ark:- | '
                       'nnet3-chain-merge-egs --minibatch-size .. ark:- ark:- |')
     with SequentialNnetChainExampleReader(egs_rspecifier) as example_reader:
         for key, eg in example_reader:
             batch = self.collate_fn(eg)
             yield pseudo_epoch, batch
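
    For comparison, the main approach in this PR (exposing egs through a PyTorch DataLoader with several workers) could look roughly like the sketch below; read_chain_egs_scp and the egs directory layout are hypothetical placeholders, not the PR's actual reader:

    import glob
    from torch.utils.data import DataLoader, IterableDataset, get_worker_info

    class ChainEgsDataset(IterableDataset):
        """Streams pre-merged chain egs from a set of scp files,
        treating each scp file as one pseudo-epoch."""

        def __init__(self, egs_dir):
            self.scp_files = sorted(glob.glob(f"{egs_dir}/cegs.*.scp"))

        def __iter__(self):
            info = get_worker_info()
            # Shard the scp files across DataLoader workers so each file is read once.
            files = (self.scp_files if info is None
                     else self.scp_files[info.id::info.num_workers])
            for pseudo_epoch, scp in enumerate(files):
                for key, eg in read_chain_egs_scp(scp):   # hypothetical reader
                    yield pseudo_epoch, key, eg

    # batch_size=None: the egs are already merged into minibatches upstream.
    loader = DataLoader(ChainEgsDataset("exp/chain/egs"), batch_size=None, num_workers=4)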
    

    TODO

    • [ ] Split egs into more scp files (currently 56) to see whether this makes the dropout data_fraction more fine-grained.
    • [ ] Run further experiments and tracing on the copy|shuffle|merge dataloader approach to confirm where its bottleneck is.
    • [ ] Profile the first epoch of training to see why it takes so long; right now the first epoch accounts for most of the total training time, no matter which approach (egs as Dataset or as DataLoader) we use.
    opened by qindazhu 58
  • [src] CUDA Online/Offline pipelines + light batched nnet3 driver

    [src] CUDA Online/Offline pipelines + light batched nnet3 driver

    This is still WIP. Requires some cleaning, integrating the online mfcc into a separate PR (cf below), and some other things.

    Implementing a low-latency high-throughput pipeline designed for online. It uses the GPU decoder, the GPU mfcc/ivector, and a new lean nnet3 driver (including nnet3 context switching on device).

    • Online/Offline pipelines

    The online pipeline can be seen as taking a batch as input and then running a very regular sequence of feature extraction, nnet3, decoder, and postprocessing calls on that same batch, in a synchronous fashion (i.e. all of those steps run when DecodeBatch is called; nothing is sent to some async pipeline along the way). What happens when you run DecodeBatch is very regular, and because of that it is able to guarantee some latency constraints (because the way the code will be executed is very predictable). It also focuses on being lean, avoiding reallocations or recomputations (such as recompiling nnet3).

    The online pipeline takes care of computing [MFCC, iVectors], nnet3, decoder, postprocessing. It can either take chunks of raw audio as input (and then compute mfcc->nnet3->decoder->postprocessing), or be called directly with mfcc features/ivectors (and then compute nnet3->decoder->postprocessing). The second possibility is used by the offline wrapper when use_online_ivectors=false.

    The old offline pipeline is replaced by a new offline pipeline which is mostly a wrapper around the online pipeline. It provides an offline-friendly API (accepting full utterances as input instead of chunks) and can pre-compute ivectors on the full utterance first (use_online_ivectors = false). It then calls the online pipeline internally to do most of the work.

    The easiest way to test the online pipeline end-to-end is to call it through the offline wrapper for now, with use_online_ivectors = true. Please note that ivectors will be ignored for now in this full end-to-end online mode (i.e. when use_online_ivectors=true). That's because the GPU ivectors are not yet ready for online. However, the pipeline code is ready. The offline pipeline with use_online_ivectors=false should be fully functional and returns the same WER as before.

    • Light nnet3 driver designed for GPU and online

    It includes a new light nnet3 driver designed for the GPU. The key idea is that it's usually better to waste some flops to compute things such as partial chunks or partial batches. For example, the last chunk of an utterance (e.g. nframes=17) can be smaller than max_chunk_size (50 frames by default). In that case, compiling a new nnet3 computation for that exact chunk size is slower than just running it for a chunk size of 50 and ignoring the invalid output.

    Same idea for batch_size: the nnet3 computation always runs with a fixed minibatch size, defined as minibatch_size = std::min(max_batch_size, MAX_MINIBATCH_SIZE). MAX_MINIBATCH_SIZE is chosen to be large enough to hide the kernel launch latency and increase the arithmetic intensity of the GEMMs, but not larger, so that partial batches are not slowed down too much (i.e. avoiding running a minibatch of size 512 where only 72 utterances are valid). MAX_MINIBATCH_SIZE is currently 128. We'll then run nnet3 multiple times on the same batch if necessary: if batch_size=512, we'll run nnet3 (with minibatch_size=128) four times.
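
    As a small illustration of the batch-splitting arithmetic described above (a sketch; the constant and function names come from the description, not from the source):

    MAX_MINIBATCH_SIZE = 128  # per the description above

    def nnet3_passes(max_batch_size):
        """Number of fixed-size nnet3 runs needed to cover one batch."""
        minibatch_size = min(max_batch_size, MAX_MINIBATCH_SIZE)
        # Partial minibatches are padded and run anyway, so round up.
        return -(-max_batch_size // minibatch_size)

    print(nnet3_passes(512))  # 4 runs with minibatch_size=128, as in the example above
    print(nnet3_passes(100))  # 1 run; the invalid part of the minibatch is ignored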

    The context-switch (to restore the nnet left and right context, and ivector) is done on device. Everything that needs context-switch is using the concept of channels, to be consistent with the GPU decoder.

    Those "lean" approaches gave us better performance, and a drop in memory usage (total GPU memory usage from 15GB to 4GB for librispeech and batch size 500). It also removes the need for "high level" multithreading (i.e. cuda-control-threads).

    • Parameters simplification

    Dropping some parameters because the new code design doesn't require them (--cuda-control-threads, the drain size parameter). In theory the configuration should be greatly simplified (only --max-batch-size needs to be set, others are optional).

    • Adding batching and online to GPU mfcc

    The code in cudafeat/ modifies the GPU MFCC code. MFCC features can now be batched and processed online (restoring a few hundred frames of past audio for each new chunk). That code was implemented by @mcdavid109 (thanks!). We'll create a separate PR for this; it requires some cleaning, and a large part of the code is redundant with existing mfcc files. GPU batched online ivectors and cmvn are WIP.

    • Indicative measurements

    When used with use_online_ivectors=false, this code reaches 4,940 XRTF on librispeech/test_clean, with a latency of around 6x realtime for max_batch_size=512 (latency would be lower with a smaller max_batch_size). One use case where only latency matters (and not throughput) is, for instance, the Jetson Nano, where some initial runs of the GPU pipeline were measured at 5-10x realtime latency for a single channel (max_batch_size=1) on librispeech/clean. Those measurements are indicative only - more reliable measurements will be done in the future.

    opened by hugovbraun 56
  • Online2 NNet3 TCP server program

    Online2 NNet3 TCP server program

    Several people asked for this and I feel like it would be a nice addition to the project.

    The protocol is much simpler than the audio-server program I did a while ago - audio in -> text out.
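
    A minimal client sketch for that protocol; the host, port, and the assumption that the server expects raw 16-bit mono PCM at its configured sample rate are placeholders that must match how the server was started:

    import socket
    import sys
    import wave

    HOST, PORT = "localhost", 5050   # assumed; must match the server's options
    CHUNK_FRAMES = 16000             # ~1 second of audio at 16 kHz

    # Assumes the wav file already matches the sample rate/format the server expects.
    with wave.open(sys.argv[1], "rb") as wav, socket.create_connection((HOST, PORT)) as sock:
        while True:
            chunk = wav.readframes(CHUNK_FRAMES)
            if not chunk:
                break
            sock.sendall(chunk)                    # raw PCM samples in ...
        sock.shutdown(socket.SHUT_WR)
        while True:
            text = sock.recv(4096)                 # ... recognized text out
            if not text:
                break
            print(text.decode("utf-8", "replace"), end="")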

    The way it's made now is nice for a live demo (I added some commands to the doxygen docs), but may still lack some features for real-life use.

    The main issue I have is that the new decoder is slightly different from before. The old decoder had a way to check which part of the output is final and which is "partial". This time, I can only check the current best path every N seconds (e.g. once per second of input audio). I use endpointing to determine when to finalize decoding.

    Now, what would be really nice is to have online speech detection and speaker diarization included with this, but I know it's probably not happening too soon. What can be done (and I may do it myself if I find time) is a multithreaded version of the program with a shared acoustic model and FST. Also, I bet it could be possible to combine the grammar version with the server to allow runtime vocabulary modification.

    I also have a web interface that works with this server, but I'm not sure if it would fit the main Kaldi repo, so I'll probably make a separate repo for that (if anyone wants it).

    I'm open to comments and suggestions.

    opened by danijel3 56
  • Xvectors: DNN Embeddings for Speaker Recognition

    Xvectors: DNN Embeddings for Speaker Recognition

    Overview

    This pull request adds xvectors for speaker recognition. The system consists of a feedforward DNN with a statistics pooling layer. Training is multiclass cross entropy over the list of training speakers (we may add other methods in the future). After training, variable-length utterances are mapped to fixed-dimensional embeddings or “xvectors” and used in a PLDA backend. This is based on http://www.danielpovey.com/files/2017_interspeech_embeddings.pdf, but includes recent enhancements not in that paper, such as data augmentation.
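
    The statistics pooling layer mentioned above maps a variable-length sequence of frame-level activations to a fixed-size vector by concatenating the per-dimension mean and standard deviation; a minimal PyTorch sketch of the operation (not the nnet3 component itself):

    import torch

    def statistics_pooling(frames, eps=1e-10):
        """frames: (num_frames, feat_dim) -> (2 * feat_dim,) vector of [mean, stddev]."""
        mean = frames.mean(dim=0)
        std = torch.sqrt(frames.var(dim=0, unbiased=False) + eps)
        return torch.cat([mean, std])

    # Utterances of different lengths map to vectors of the same dimension.
    print(statistics_pooling(torch.randn(300, 512)).shape)    # torch.Size([1024])
    print(statistics_pooling(torch.randn(1200, 512)).shape)   # torch.Size([1024])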

    This PR also adds a new data augmentation script, which is important to achieve good performance in the xvector system. It is also helpful for ivectors (but only in PLDA training).
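
    The augmentation script itself is not reproduced here; as a rough sketch of the underlying idea (scale a noise signal to a target SNR relative to the speech, then mix), with all names hypothetical:

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        """Add noise to speech at the requested signal-to-noise ratio (dB)."""
        # Tile or truncate the noise to match the speech length.
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)[:len(speech)]
        speech_power = np.mean(speech ** 2) + 1e-10
        noise_power = np.mean(noise ** 2) + 1e-10
        # Scale so that 10*log10(speech_power / scaled_noise_power) == snr_db.
        scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
        return speech + scale * noise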

    This PR adds a basic SRE16 recipe to demonstrate the system. An ivector system is in v1, and an xvector system is in v2.

    Example Generation

    An example consists of a chunk of speech features and the corresponding speaker label. Within an archive, all examples have the same chunk-size, but the chunk-size varies across archives. The relevant additions:

    • sid/nnet3/xvector/get_egs.sh — Top-level script for example creation
    • sid/nnet3/xvector/allocate_egs.py — This script is responsible for deciding what is contained in the examples and what archives they belong to.
    • src/nnet3bin/nnet3-xvector-get-egs — The binary for creating the examples. It constructs examples based on the ranges.* file.

    Training

    This version of xvectors is trained with multiclass cross entropy (softmax over the training speakers). Fortunately, steps/nnet3/train_raw_dnn.py is compatible with the egs created here, so no new code is needed for training. Relevant code:

    • sre16/v1/local/nnet3/xvector/tuning/run_xvector_1a.sh — Does example creation, creates the xconfig, and trains the nnet

    Extracting XVectors

    After training, the xvectors are extracted from a specified layer of the DNN after the temporal pooling layer. Relevant additions:

    • sid/nnet3/xvector/extract_xvectors.sh — Extracts embeddings from the xvector DNN. This is analogous to extract_ivectors.sh.
    • src/nnet3bin/nnet3-xvector-compute — Does the forward computation for the xvector DNN (variable-length input, with a single output).

    Augmentation

    We’ve found that embeddings almost always benefit from augmented training data. This appears to be true even when evaluated on clean telephone speech. Relevant additions:

    • steps/data/augment_data_dir.py — Similar to reverberate_data_dir.py but only handles additive noise.
    • egs/sre16/v1/run.sh — PLDA training list is augmented with reverb and MUSAN audio
    • egs/sre16/v2/run.sh — DNN training and PLDA list are augmented with reverb and MUSAN.

    SRE16 Recipe

    The PR includes a bare bones SRE16 recipe. The goal is primarily to demonstrate how to train and evaluate an xvector system. The version in egs/sre16/v1/ is a straightforward i-vector system. The recipe in egs/sre16/v2 contains the DNN embedding recipe. Relevant additions:

    • egs/sre16/v1/local/ — A bunch of dataprep scripts
    • egs/sre16/v2/local/nnet3/xvector/prepare_feats_for_egs.sh -- A script that applies cmvn and removes silence frames and writes the results to disk. This is what the nnet examples are generated from.
    • egs/sre16/v1/run.sh — ivector top-level script
    • egs/sre16/v2/run.sh — xvector top-level script

    Results for this example:

      xvector (from v2) EER: Pooled 8.76%, Tagalog 12.73%, Cantonese 4.86%
      ivector (from v1) EER: Pooled 12.98%, Tagalog 17.8%, Cantonese 8.35%
    

    Note that the recipe is somewhat "bare bones." We could improve the results for the xvector system further by adding even more training data (e.g., Voxceleb: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/). Both systems would improve from updates to the backend such as adaptive score normalization or more effective PLDA domain adaptation techniques. However, I believe that is orthogonal to this PR.

    opened by david-ryan-snyder 55
  • Memory

    Memory "leak" of cudadecoder's arc instantiations

    Hi, I have recently been trying to track down progressive memory growth in Triton's Kaldi backend (https://github.com/NVIDIA/DeepLearningExamples/issues/1240), and in pursuit of that I've successfully tried to reproduce the issue with a bare Kaldi setup.

    I don't have any understanding of Kaldi's internals, so some of the information given here might seem vague or might be outright nonsensical, but I hope it gives the general idea.

    Basically, the issue seems to be that the cudadecoder keeps max_active arc instantiations for every computation of an audio chunk (frame computation), and these never seem to get freed until the decoder's destructor is called.

    As far as I can understand logically, the arc instantiations are relevant only for a given correlation id / audio stream, and there is no meaningful way to use these instantiations to improve the accuracy of other, unrelated audio streams / correlation ids. So, it seems fair to expect that all arc instantiations relating to a given correlation ID get freed once the last chunk for that ID has been processed. However, this doesn't seem to happen in practice.

    This becomes a huge problem in the Triton Kaldi backend since it constantly takes in new inputs from clients, and the memory usage climbs rapidly with every inference (reaching up to 30G for large WAVs).

    Steps to reproduce:

    Use this shell script to launch an inference for the LibriSpeech dataset:

    #!/bin/sh
    
    # --max-active=10
    
    /bin/time -v ./batched-wav-nnet3-cuda-online \
        --max-batch-size=1100 \
        --cuda-use-tensor-cores=true \
        --cuda-worker-threads=12 \
        --cuda-decoder-copy-threads=4 \
        --print-hypotheses \
        --cuda-use-tensor-cores=true \
        --main-q-capacity=30000 \
        --aux-q-capacity=400000 \
        --beam=10 \
        --cuda-worker-threads=10 \
        --num-channels=4000 \
        --lattice-beam=7 \
        --max-active=10000 \
        --frames-per-chunk=50 \
        --acoustic-scale=1.0 \
        --config=/data/models/LibriSpeech/conf/online.conf \
        --word-symbol-table=/data/models/LibriSpeech/words.txt \
        /data/models/LibriSpeech/final.mdl \
        /data/models/LibriSpeech/HCLG.fst \
        scp:/data/datasets/LibriSpeech/test_clean/wav_conv.scp \
        'ark:|gzip -c > /tmp/lat.gz'
    

    Notice that the memory usage keeps climbing and then remains constant after all the inferences have been performed. It only gets freed once the whole decoder object is destroyed. The expected behaviour is that the memory usage keeps fluctuating up and down as a consequence of properly releasing the memory for the arc instantiations of the correlation IDs that have been completely inferred.

    The program's memory usage caps out at around 6G in case of max_active=10:

    Command being timed: "./batched-wav-nnet3-cuda-online --max-batch-size=1100 ... **--max-active=10** ... ark:|gzip -c > /tmp/lat.gz"
            User time (seconds): 30.66
            ...
            Maximum resident set size (kbytes): **5989872**
    

    I'm showing Maximum resident set size (i.e. the peak memory usage) because the usage actually never goes down after peaking due to the leak. This can be confirmed by adding a sleep before return-ing here: https://github.com/kaldi-asr/kaldi/blob/master/src/cudadecoderbin/batched-wav-nnet3-cuda-online.cc#L316

    And at 8G in case of max_active=10000:

    Command being timed: "./batched-wav-nnet3-cuda-online --max-batch-size=1100 ... **--max-active=10000** ... ark:|gzip -c > /tmp/lat.gz"             
    
            User time (seconds): 29.87
            ...
            Maximum resident set size (kbytes): **8204936**
    

    This correlation between the memory usage and the value of max_active led me to believe that the arc instantiations are not being freed as soon as a given correlation ID's last chunk has been processed.

    bug 
    opened by git-bruh 5
  • Error: in data/train, utterance-ids extracted from utt2spk and utt2dur file

    Error: in data/train, utterance-ids extracted from utt2spk and utt2dur file

    Can anyone help me with this error when I execute the run.sh file?

    ===== FEATURES EXTRACTION =====

    steps/make_mfcc.sh --nj 1 --cmd run.pl data/train exp/make_mfcc/train mfcc
    utils/validate_data_dir.sh: Error: in data/train, utterance-ids extracted from utt2spk and utt2dur file
    utils/validate_data_dir.sh: differ, partial diff is:
    --- /tmp/kaldi.G3YQ/utts        2022-12-15 15:47:53.033696862 +0000
    +++ /tmp/kaldi.G3YQ/utts.utt2dur        2022-12-15 15:47:53.053696720 +0000
    @@ -2,28 +2,4 @@
     spk1_10
    -spk1_100
    -spk1_101
    ...
    [Lengths are /tmp/kaldi.G3YQ/utts=435 versus /tmp/kaldi.G3YQ/utts.utt2dur=142]
    steps/make_mfcc.sh --nj 1 --cmd run.pl data/test exp/make_mfcc/test mfcc
    utils/validate_data_dir.sh: Error: in data/test, utterance-ids extracted from utt2spk and utt2dur file
    utils/validate_data_dir.sh: differ, partial diff is:
    --- /tmp/kaldi.uq6T/utts        2022-12-15 15:47:53.089696463 +0000
    +++ /tmp/kaldi.uq6T/utts.utt2dur        2022-12-15 15:47:53.105696349 +0000
    @@ -2,28 +2,4 @@
     spk1_10
    -spk1_100
    -spk1_101
    ...
    [Lengths are /tmp/kaldi.uq6T/utts=435 versus /tmp/kaldi.uq6T/utts.utt2dur=142]
    steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc
    steps/compute_cmvn_stats.sh: no such file data/train/feats.scp
    steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test mfcc
    steps/compute_cmvn_stats.sh: no such file data/test/feats.scp
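
    The error means data/train/utt2spk and data/train/utt2dur list different sets of utterance-ids (435 vs 142), usually because utt2dur could only be computed for part of the recordings in wav.scp. utils/fix_data_dir.sh data/train will filter the directory down to a consistent subset once the underlying wav entries are fixed. As a quick check (not a replacement for utils/validate_data_dir.sh), something like this lists the mismatching ids:

    import sys

    def utt_ids(path):
        """First column (the utterance-id) of a Kaldi data-dir file."""
        with open(path) as f:
            return {line.split(maxsplit=1)[0] for line in f if line.strip()}

    data_dir = sys.argv[1] if len(sys.argv) > 1 else "data/train"
    utt2spk = utt_ids(f"{data_dir}/utt2spk")
    utt2dur = utt_ids(f"{data_dir}/utt2dur")
    print("in utt2spk but not utt2dur:", sorted(utt2spk - utt2dur)[:10])
    print("in utt2dur but not utt2spk:", sorted(utt2dur - utt2spk)[:10])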

    kaldi10-TODO 
    opened by ShakeelOfficials 0
  • Faster Cuda Decoder

    Faster Cuda Decoder

    There were several issues recently discovered with the cuda decoder in both offline and online mode.

    After my fixes, I can achieve 7800 RTFx throughput on librispeech test-clean and the model https://kaldi-asr.org/models/m13 with an A100-80GB PCIe card in the offline mode of computation. Previously, because of some unnoticed software regressions, this number was as low as 4000 RTFx, which isn't bad, admittedly.
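
    (RTFx here is throughput: total audio duration divided by wall-clock decoding time. A trivial sketch of the arithmetic, with made-up timings:)

    def rtfx(total_audio_seconds, wall_clock_seconds):
        """Seconds of audio decoded per second of compute."""
        return total_audio_seconds / wall_clock_seconds

    # LibriSpeech test-clean is roughly 5.4 hours of audio; decoding it in about
    # 2.5 seconds of wall-clock time corresponds to roughly 7800 RTFx.
    print(rtfx(5.4 * 3600, 2.5))   # ~7776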

    Latency is more complicated, but here is a preliminary result with this model https://kaldi-asr.org/models/m13 on librispeech test-clean:

    [figure: preliminary latency results]

    This was achieved via the following hyperparameter sweep:

    for chunk_size in 21 30 40 50; do
        for num_streaming_channels in 1000 2000 3000 4000 5000 6000; do
            max_batch_size=$((num_streaming_channels>4000 ? 4000 : num_streaming_channels))
            /home/dgalvez/scratch/code/asr/kaldi-a100-perf//src/cudadecoderbin/batched-wav-nnet3-cuda-online \
                --num-channels=$((num_streaming_channels * 2)) --cuda-use-tensor-cores=true \
                --main-q-capacity=30000 --aux-q-capacity=400000 --cuda-memory-proportion=0.5 \
                --max-batch-size=$max_batch_size --cuda-worker-threads=12 --file-limit=-1 \
                --cuda-decoder-copy-threads=4 --batching-copy-threads=8 --frame-subsampling-factor=3 \
                --frames-per-chunk=$chunk_size --max-mem=100000000 --beam=10 --lattice-beam=7 \
                --acoustic-scale=1.0 --determinize-lattice=true --max-active=10000 --iterations=10 --file-limit=-1 \
                --config=/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//conf/online.conf \
                --num-parallel-streaming-channels=$num_streaming_channels \
                --word-symbol-table=/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//words.txt \
                /home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//final.mdl \
                /home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//HCLG.fst \
                scp:/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//datasets/LibriSpeech/test_clean//wav_conv.scp \
                'ark:|gzip -c > /tmp/results/LibriSpeech/52/0/lat.gz' # 2> output.log
            cat output.log | grep -A 1 "Latencies" | grep -v "Latencies" | awk 'BEGIN { OFS = ","; ORS = ""} {print $3,$4,$5,$6}' >> $result_file
            echo ",${chunk_size},${num_streaming_channels},${max_batch_size}" >> $result_file
        done
    done

    Do note that better results can sometimes be achieved by setting the maximum batch size lower than the number of channels. Average latency is, of course, much smaller. This means users can do real-time decoding at 3000-4000 audio streams concurrently.

    This is the "compute" latency. It doesn't include the time spent waiting for the right-hand context (21 frames, or 210 ms in this case). The point is that it is incredibly fast.

    opened by galv 11
  • SRILM: allow bypassing download/extraction during automated installation

    SRILM: allow bypassing download/extraction during automated installation

    The SRILM website download procedure seems to have been broken for a while. This PR allows you to bypass downloading the archive from the SRI website and/or extracting the archive into the source tree (if you have obtained either via other means), while still taking advantage of the rest of Kaldi's automated installation script.

    opened by daanzu 0
  • openfst fails to compile on i686

    openfst fails to compile on i686

    Hello, I'm trying to package my program which uses Vosk which is built upon Kaldi. I also need to package openfst which is what is giving me trouble.

    Openfst compiles just fine on my x86_64 machine, and even when cross compiling to i686, but a test fails in continuous integration when compiling natively for i686.

    It is the float equality test from the ./configure file, which runs during the install phase (it's run again during the check phase).

    This is the output of the build system:

    configure: error: Test float equality failed! Compile with -msse -mfpmath=sse if using g++.
    See `config.log' for more details
    => ERROR: vosk-api-0.3.43_1: pre_build: './configure --host="${XBPS_TRIPLET}" ${CROSS_BUILD:+--host="${XBPS_CROSS_TRIPLET}"} --build="${XBPS_TRIPLET}" --prefix="${wrksrc}/kaldi-${_kaldi_commit}/tools/openfst" --enable-static --enable-shared --enable-far --enable-ngram-fsts --enable-lookahead-fsts --with-pic --disable-bin' exited with 1
    => ERROR:   in pre_build() at srcpkgs/vosk-api/template:57
    

    I have tried adding the options it suggests and some others, but I get the same error.

    I think it passes when cross compiling because it's actually testing my x86_64. That said, I'm concerned about the underlying float equality issue on i686, even if I skip the test when cross compiling.

    I asked the Vosk developer here but didn't get anywhere: https://github.com/alphacep/vosk-api/issues/1161

    This is my build template for Void Linux: https://github.com/void-linux/void-packages/pull/39015

    This is someone else's build template for Alpine Linux: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/41097 I see they have had the same issue and limited the architectures to 64-bit.

    I will only target 64-bit architectures if necessary, but it'd be a shame; my program seems to work well on my old i686 laptop when cross compiled.

    Any thoughts? Is this a build issue or are 32 bit architectures not supported?

    Thank you. Sorry this is more an openfst issue than a kaldi one but I saw you had accepted similar issues, and openfst doesn't seem to have a good place for issues.

    bug 
    opened by JohnGebbie 2