PIKA: a lightweight speech processing toolkit based on PyTorch and (Py)Kaldi

PIKA is a lightweight speech processing toolkit based on PyTorch and (Py)Kaldi. The first release focuses on end-to-end speech recognition. We use PyTorch as the deep learning engine and Kaldi for data formatting and feature extraction.

Key Features

  • On-the-fly data augmentation and feature extraction loader

  • TDNN-Transformer encoder and convolution- and transformer-based decoder model structures

  • RNNT training and batch decoding

  • RNNT decoding with external n-gram FSTs (on-the-fly rescoring, a.k.a. shallow fusion; see the sketch after this list)

  • RNNT Minimum Bayes Risk (MBR) training

  • LAS forward and backward rescorer for RNNT

  • Efficient BMUF (blockwise model-update filtering) based distributed training
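
To illustrate the shallow fusion mentioned above: during beam search, each hypothesis's model score is combined on the fly with the external n-gram LM score. A minimal sketch of the scoring rule (the function name and lm_scale value are illustrative assumptions, not PIKA's actual interface):

import torch

def shallow_fusion_score(rnnt_log_prob: torch.Tensor,
                         lm_log_prob: torch.Tensor,
                         lm_scale: float = 0.3) -> torch.Tensor:
    # Each beam hypothesis is rescored on the fly as
    #   score(y) = log P_rnnt(y|x) + lm_scale * log P_lm(y),
    # where P_lm comes from the external n-gram FST.
    return rnnt_log_prob + lm_scale * lm_log_prob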

Installation and Dependencies

In general, we recommend Anaconda since it comes with most dependencies. Other major dependencies include:

Pytorch

Please go to https://pytorch.org/ for PyTorch installation. The code and scripts should run against PyTorch 0.4.0 and above, but we recommend 1.0.0 or above for compatibility with the RNNT loss module (see below).

PyKaldi and Kaldi

We use Kaldi (https://github.com/kaldi-asr/kaldi) and PyKaldi (a Python wrapper for Kaldi) for data processing, feature extraction, and FST manipulation. Please go to the PyKaldi website https://github.com/pykaldi/pykaldi for installation, and make sure to build PyKaldi with Ninja for efficiency. After following the PyKaldi installation process, you should have both the Kaldi and PyKaldi dependencies ready.
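
As a quick sanity check that the PyKaldi build works, here is a minimal fbank extraction sketch (assuming a wav.scp in the current directory; the option values are illustrative, not PIKA's configuration):

from kaldi.feat.fbank import Fbank, FbankOptions
from kaldi.util.table import SequentialWaveReader

opts = FbankOptions()
opts.frame_opts.samp_freq = 16000
opts.mel_opts.num_bins = 80            # 80-dim log-mel filterbank features

fbank = Fbank(opts)
with SequentialWaveReader("scp:wav.scp") as reader:
    for utt_id, wave in reader:
        wave_1ch = wave.data()[0]      # first channel
        feats = fbank.compute_features(wave_1ch, opts.frame_opts.samp_freq,
                                       vtln_warp=1.0)
        print(utt_id, feats.num_rows, feats.num_cols)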

CUDA-Warp RNN-Transducer

For the RNNT loss module, we adopt the PyTorch binding at https://github.com/1ytic/warp-rnnt.
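
A minimal usage sketch of the binding (tensor shapes follow the warp-rnnt documentation; tensors must live on the GPU, and blank is index 0, matching PIKA's label convention):

import torch
from warp_rnnt import rnnt_loss

N, T, U, V = 2, 10, 5, 30    # batch, frames, label length, vocab size (blank = 0)
log_probs = torch.randn(N, T, U + 1, V, device="cuda",
                        requires_grad=True).log_softmax(dim=-1)
labels = torch.randint(1, V, (N, U), dtype=torch.int, device="cuda")
frames_lengths = torch.full((N,), T, dtype=torch.int, device="cuda")
labels_lengths = torch.full((N,), U, dtype=torch.int, device="cuda")

loss = rnnt_loss(log_probs, labels, frames_lengths, labels_lengths,
                 reduction="mean", blank=0)
loss.backward()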

Others

Check requirements.txt for other dependencies.

Get Started

To get started, check the training and decoding scripts located in the egs directory.

I. Data preparation and RNNT training

egs/train_transducer_bmuf_otfaug.sh contains data preparation and RNNT training. One needs to prepare the training data and specify the training data directory:

#training data dir must contain wav.scp and label.txt files
#wav.scp: standard Kaldi wav.scp file, see https://kaldi-asr.org/doc/data_prep.html
#label.txt: label text file; each line is "uttid sequence-of-integers", where each
#           integer is a one-based label index (zero is reserved for blank),
#           e.g., utt_id_1 3 5 7 10 23
train_data_dir=
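
For illustration, a hypothetical converter from a Kaldi-style text file (uttid followed by whitespace-separated tokens) to label.txt; the tokens.txt vocabulary file and its one-token-per-line layout are assumptions about your setup:

# Build a one-based token-to-index map (index 0 stays reserved for blank).
token2id = {}
with open("tokens.txt", encoding="utf-8") as f:    # one token per line
    for i, line in enumerate(f, start=1):
        token2id[line.strip()] = i

with open("text", encoding="utf-8") as fin, \
     open("label.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        uttid, *tokens = line.split()
        ids = " ".join(str(token2id[t]) for t in tokens)
        fout.write(f"{uttid} {ids}\n")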

II. Continue with MBR training

With a trained RNNT model, one can continue with MBR training using egs/train_transducer_mbr_bmuf_otfaug.sh (assuming the same training data, so data preparation is omitted). Make sure to specify the initial model:

--verbose \
--optim sgd \
--init_model $exp_dir/init.model \
--rnnt_scale 1.0 \
--sm_scale 0.8 \
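
A speculative reading of the two scales above, following the MBR recipe in [2] (the function names are illustrative, not PIKA's actual API): sm_scale smooths the softmax when scoring hypotheses for the expected-risk term, while rnnt_scale interpolates the standard RNNT loss with it:

import torch
import torch.nn.functional as F

def smoothed_log_probs(logits: torch.Tensor, sm_scale: float = 0.8) -> torch.Tensor:
    # Softmax smoothing: scale the logits before normalizing.
    return F.log_softmax(sm_scale * logits, dim=-1)

def mbr_objective(expected_risk: torch.Tensor,
                  rnnt_loss: torch.Tensor,
                  rnnt_scale: float = 1.0) -> torch.Tensor:
    # The RNNT loss acts as a regularizer alongside the expected-risk term.
    return expected_risk + rnnt_scale * rnnt_loss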

III. Training LAS forward and backward rescorer

One can train forward and backward LAS rescorers for the RNNT model using egs/train_las_rescorer_bmuf_otfaug.sh. The LAS rescorer shares the encoder with the RNNT model and adds an extra two-layer LSTM as an additional encoder; make sure to specify the encoder sharing as:

--num_batches_per_epoch 526264 \
--shared_encoder_model $exp_dir/final.model \
--num_epochs 5 \

We support bidirectional LAS rescoring, i.e., forward and backward rescoring. Backward (right-to-left) rescoring is achieved by reversing the label sequences during LAS model training. One can train a backward LAS rescorer by specifying:

--reverse_labels
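
Conceptually, the flag amounts to reversing each utterance's label sequence before training, e.g.:

def reverse_labels(labels):
    # Right-to-left targets for the backward (reverse) LAS rescorer.
    return labels[::-1]

assert reverse_labels([3, 5, 7, 10, 23]) == [23, 10, 7, 5, 3]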

IV. Decoding

egs/eval_transducer.sh is the main evaluation script, which contains the whole decoding pipeline. Forward and backward LAS rescoring can be enabled by specifying the two rescorer models:

##########configs#############
#rnn transducer model
rnnt_model=
#forward and backward las rescorer model
lasrescorer_fw=
lasrescorer_bw=
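
When both rescorers are supplied, a common way to combine them during N-best rescoring is a weighted sum of per-hypothesis log scores; a hedged sketch (the weights and names are assumptions, not PIKA's actual interface):

def combined_score(rnnt_score: float,
                   las_fw_score: float,
                   las_bw_score: float,
                   fw_weight: float = 0.5,
                   bw_weight: float = 0.5) -> float:
    # The N-best hypothesis with the highest combined log score is selected.
    return rnnt_score + fw_weight * las_fw_score + bw_weight * las_bw_score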

Caveats

All training and decoding hyper-parameters were chosen based on large-scale (e.g., 60k hours) training data and internal evaluation data; one might need to re-tune them to achieve optimal performance on other setups. Also, the WER (CER) scoring script is based on a Mandarin task; we recommend that those working on other languages rewrite the scoring scripts.

References

[1] Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition, Chao Weng, Jia Cui, Guangsen Wang, Jun Wang, Chengzhu Yu, Dan Su, Dong Yu, InterSpeech 2018

[2] Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition, Chao Weng, Chengzhu Yu, Jia Cui, Chunlei Zhang, Dong Yu, InterSpeech 2020

Citations

@inproceedings{Weng2020,
  author={Chao Weng and Chengzhu Yu and Jia Cui and Chunlei Zhang and Dong Yu},
  title={{Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={966--970},
  doi={10.21437/Interspeech.2020-1221},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1221}
}

@inproceedings{Weng2018,
  author={Chao Weng and Jia Cui and Guangsen Wang and Jun Wang and Chengzhu Yu and Dan Su and Dong Yu},
  title={Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={761--765},
  doi={10.21437/Interspeech.2018-1030},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1030}
}

Disclaimer

This is not an officially supported Tencent product.
