Production First and Production Ready End-to-End Speech Recognition Toolkit

Last update: Jan 4, 2023

Related tags

Deep Learning pytorch transformer speech-recognition automatic-speech-recognition production-ready asr conformer e2e-models

Overview

WeNet

Chinese version

We share neural Net together.

The main motivation of WeNet is to close the gap between research and production end-to-end (E2E) speech recognition models, to reduce the effort of productionizing E2E models, and to explore better E2E models for production.

Highlights

- Production first and production ready: WeNet's Python code meets the requirements of TorchScript, so a model trained with WeNet can be exported directly with Torch JIT and run with LibTorch for inference. There is no gap between the research model and the production model: neither model conversion nor additional code is required for inference.
- Unified solution for streaming and non-streaming ASR: WeNet implements the Unified Two Pass (U2) framework to achieve an accurate, fast, and unified E2E model, which is favorable for industry adoption.
- Portable runtime: Several demos will be provided to show how to host WeNet-trained models on different platforms, including x86 servers and on-device Android.
- Lightweight: WeNet is designed specifically for E2E speech recognition, with clean and simple code. It is based entirely on PyTorch and its ecosystem. It has no dependency on Kaldi, which simplifies installation and usage.

Performance Benchmark

Please see examples/$dataset/s0/README.md for benchmarks on different speech datasets.

Mandarin Chinese
- AIShell-1
- AIShell-2
- Multi-CN (combining several open source Chinese corpora)
English
- LibriSpeech
- GigaSpeech

Installation

Clone the repo

git clone https://github.com/wenet-e2e/wenet.git

Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
Create Conda env:

conda create -n wenet python=3.8
conda activate wenet
pip install -r requirements.txt
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge

Optionally, if you want to use the x86 runtime or a language model (LM), build the runtime as follows. Otherwise, you can skip this step.

# runtime build requires cmake 3.14 or above
cd runtime/server/x86
mkdir build && cd build && cmake .. && cmake --build .
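Since the runtime build states a minimum CMake version, it can be worth checking the installed version before starting a long build. The following is a minimal sketch, not part of WeNet: the helper names `parse_cmake_version` and `cmake_is_new_enough` are hypothetical, and the only assumption about CMake is that `cmake --version` prints a first line of the form `cmake version X.Y.Z`.

```python
import re
import shutil
import subprocess

# Minimum version taken from the build comment above (cmake 3.14 or above).
MIN_CMAKE = (3, 14)


def parse_cmake_version(text):
    """Extract (major, minor, patch) from the output of `cmake --version`."""
    match = re.search(r"cmake version (\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("could not find a version number in: %r" % text)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def cmake_is_new_enough(text, minimum=MIN_CMAKE):
    """Compare only major.minor against the required minimum."""
    return parse_cmake_version(text)[:2] >= minimum


if __name__ == "__main__":
    exe = shutil.which("cmake")
    if exe is None:
        print("cmake not found on PATH")
    else:
        out = subprocess.run(
            [exe, "--version"], capture_output=True, text=True
        ).stdout
        print("ok" if cmake_is_new_enough(out) else "cmake is older than 3.14")
```

Comparing tuples rather than strings avoids the usual pitfall where "3.9" sorts after "3.14" lexicographically.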

Discussion & Communication

Please visit Discussions for further discussion.

For Chinese users, you can also scan the QR code on the left to follow the official WeNet account. We created a WeChat group for better discussion and quicker response. Please scan the personal QR code on the right; that person is responsible for inviting you to the chat group.

If you cannot access the QR image, please access it on Gitee.

Or you can discuss directly on GitHub Issues.

Contributors

Acknowledgements

We borrowed a lot of code from ESPnet for transformer-based modeling.
We borrowed a lot of code from Kaldi for WFST-based decoding for LM integration.
We referred to EESEN for building the TLG-based graph for LM integration.
We referred to OpenTransformer for Python batch inference of E2E models.

Citations

@inproceedings{yao2021wenet,
  title={WeNet: Production oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit},
  author={Yao, Zhuoyuan and Wu, Di and Wang, Xiong and Zhang, Binbin and Yu, Fan and Yang, Chao and Peng, Zhendong and Chen, Xiaoyu and Xie, Lei and Lei, Xin},
  booktitle={Proc. Interspeech},
  year={2021},
  address={Brno, Czech Republic},
  organization={IEEE}
}

@article{zhang2020unified,
  title={Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition},
  author={Zhang, Binbin and Wu, Di and Yao, Zhuoyuan and Wang, Xiong and Yu, Fan and Yang, Chao and Guo, Liyong and Hu, Yaguang and Xie, Lei and Lei, Xin},
  journal={arXiv preprint arXiv:2012.05481},
  year={2020}
}

@article{wu2021u2++,
  title={U2++: Unified Two-pass Bidirectional End-to-end Model for Speech Recognition},
  author={Wu, Di and Zhang, Binbin and Yang, Chao and Peng, Zhendong and Xia, Wenjing and Chen, Xiaoyu and Lei, Xin},
  journal={arXiv preprint arXiv:2106.05642},
  year={2021}
}

Comments

BUG for ONNX inference

When I run inference with u2++_conformer on just 50 wav files, the following error is thrown:

I0515 00:10:40.909694 146005 decoder_main.cc:67] num frames 1118
I0515 00:10:41.026697 146005 decoder_main.cc:86] Partial result: 在机关
I0515 00:10:41.056061 146005 decoder_main.cc:86] Partial result: 在机关服务
I0515 00:10:41.085124 146005 decoder_main.cc:86] Partial result: 在机关围剿
I0515 00:10:41.110785 146005 decoder_main.cc:86] Partial result: 在机关围剿和
I0515 00:10:41.136417 146005 decoder_main.cc:86] Partial result: 在机关围剿和
I0515 00:10:41.176227 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程
I0515 00:10:41.217715 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的
I0515 00:10:41.251241 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中
I0515 00:10:41.282459 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中太勇敢
I0515 00:10:41.311969 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中太勇敢坚定
I0515 00:10:41.341024 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中太勇敢坚定是
I0515 00:10:41.398414 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中太勇敢坚定是一军的
I0515 00:10:41.429834 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中太勇敢坚定是一军的
I0515 00:10:41.462321 146005 decoder_main.cc:86] Partial result: 在机关围剿和工程多处的战斗中太勇敢坚定是一军的主要将领
Segmentation fault (core dumped)

This file should have been processed completely. I will dig deeper to locate the bug.

onnxruntime: 1.10.0 and 1.11.1

opened by Fred-cell 26
LibTorch gpu cmake error

Hello, when I execute "mkdir build && cd build && cmake -DGRPC=ON ..", the following error is reported. Native environment: CentOS 7.9, nvidia: 11.3, cuda version: 11

(wenet_gpu) [ZYJ@localhost build]$ cmake -DGPU=ON ..
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Populating libtorch
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/fc_base/libtorch-subbuild
[ 11%] Performing download step (download, verify and extract) for 'libtorch-populate'
-- verifying file... file='/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/fc_base/libtorch-subbuild/libtorch-populate-prefix/src/libtorch-shared-with-deps-1.10.0%2Bcu113.zip'
-- File already exists and hash match (skip download): file='/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/fc_base/libtorch-subbuild/libtorch-populate-prefix/src/libtorch-shared-with-deps-1.10.0%2Bcu113.zip' SHA256='0996a6a4ea8bbc1137b4fb0476eeca25b5efd8ed38955218dec1b73929090053'
-- extracting... src='/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/fc_base/libtorch-subbuild/libtorch-populate-prefix/src/libtorch-shared-with-deps-1.10.0%2Bcu113.zip' dst='/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/fc_base/libtorch-src'
-- extracting... [tar xfz]
-- extracting... [analysis]
-- extracting... [rename]
-- extracting... [clean up]
-- extracting... done
[ 22%] No patch step for 'libtorch-populate'
[ 33%] No update step for 'libtorch-populate'
[ 44%] No configure step for 'libtorch-populate'
[ 55%] No build step for 'libtorch-populate'
[ 66%] No install step for 'libtorch-populate'
[ 77%] No test step for 'libtorch-populate'
[ 88%] Completed 'libtorch-populate'
[100%] Built target libtorch-populate
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda-11.3 (found version "11.3")
-- Caffe2: CUDA detected: 11.3
-- Caffe2: CUDA nvcc is: /usr/local/cuda-11.3/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda-11.3
CMake Error at fc_base/libtorch-src/share/cmake/Caffe2/public/cuda.cmake:75 (message):
  Caffe2: Couldn't determine version from header: Change Dir: /home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/build/CMakeFiles/CMakeTmp

Run Build Command(s):/usr/bin/gmake cmTC_3d968/fast

/usr/bin/gmake -f CMakeFiles/cmTC_3d968.dir/build.make CMakeFiles/cmTC_3d968.dir/build

gmake[1]: Entering directory '/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/build/CMakeFiles/CMakeTmp'

Building CXX object CMakeFiles/cmTC_3d968.dir/detect_cuda_version.cc.o

/usr/bin/c++ -I/usr/local/cuda-11.3/include -std=c++14 -pthread -fPIC -o CMakeFiles/cmTC_3d968.dir/detect_cuda_version.cc.o -c /home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/build/detect_cuda_version.cc

c++: error: unrecognized command line option '-std=c++14'

gmake[1]: *** [CMakeFiles/cmTC_3d968.dir/detect_cuda_version.cc.o] Error 1

gmake[1]: Leaving directory '/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/build/CMakeFiles/CMakeTmp'

gmake: *** [cmTC_3d968/fast] Error 2

Call Stack (most recent call first):
  fc_base/libtorch-src/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  fc_base/libtorch-src/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  cmake/libtorch.cmake:52 (find_package)
  CMakeLists.txt:35 (include)

-- Configuring incomplete, errors occurred!
See also "/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/build/CMakeFiles/CMakeOutput.log".
See also "/home/ZYJ/WeNet/wenet_gpu/wenet/runtime/LibTorch/build/CMakeFiles/CMakeError.log".

What should I do?

opened by zhaoyinjiang9825 16
Streaming performance issues on upgrading to release v2.0.0

Describe the bug On updating to release v2.0.0, I've been noticing some performance issues when running real-time audio streams against a quantized e2e model (no LM) via runtime/server/x86/bin/websocket_server_main. For some stretches of time, performance may be comparable between v1 and v2, but there are points where I can expect to see upwards of 20s delay on a given response. Outside of a few minor updates related to the switch, nothing else (e.g. resource allocations) has been changed on my end.

Thus far, I haven't been able to pinpoint much of a pattern to the lag, except that it seems to consistently happen (in addition to other times) at the start of the stream. Have you observed any similar performance issues between v1 and v2, or is there some v2-specific runtime configuration I may have missed?

Expected behavior Comparable real-time performance between releases v1 and v2.

Screenshots The following graphs show the results from a single test. The x-axes represent the progression of the audio file being tested, and the y-axes represent round-trip response times from wenet minus some threshold, i.e. any data points above 0 indicate additional round-trip latency above an acceptable threshold (in my case, 500ms). As you can see, in the v1 graph responses are largely generated and returned below the threshold time (with the exception of a few final-marked transcripts). However, in the v2 graph, there are several lengthy periods during which responses take an unusually long time to return (I've capped the graph at 2s for clearer viewing, but in reality responses are taking up to 20s to return).

Wenet v1

Wenet v2
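The quantity plotted on the y-axes (round-trip time minus an acceptable threshold) is easy to compute from raw timings, which also makes it possible to locate the slow stretches programmatically instead of by eye. The sketch below is only an illustration: the function names are hypothetical, the sample numbers are invented, and only the 500 ms threshold comes from the report.

```python
def excess_latency_ms(round_trip_ms, threshold_ms=500.0):
    """Round-trip time minus the threshold; values above 0 are too slow."""
    return [rt - threshold_ms for rt in round_trip_ms]


def slow_stretches(round_trip_ms, threshold_ms=500.0):
    """Return (first_index, last_index) for each run of consecutive
    responses whose round-trip time exceeds the threshold."""
    stretches = []
    start = None
    for i, rt in enumerate(round_trip_ms):
        if rt > threshold_ms:
            if start is None:
                start = i
        elif start is not None:
            stretches.append((start, i - 1))
            start = None
    if start is not None:
        stretches.append((start, len(round_trip_ms) - 1))
    return stretches


if __name__ == "__main__":
    # Invented timings in milliseconds, for illustration only.
    samples = [300, 450, 20000, 1800, 400, 900]
    print(excess_latency_ms(samples))
    print(slow_stretches(samples))
```

Logging the stretch boundaries alongside stream offsets would make it easier to confirm whether the lag really clusters at the start of a stream.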

Additional context Both tests were run with wenet hosted via AWS ECS/EC2. So far as I've seen, increasing CPU + memory allocations to the wenet container doesn't seem to resolve the issue.

opened by kangnari 16
onnx runtime error 2: not enough space: expected 318080, got 314240
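Describe the bug
This bug may be a Triton Server issue. I deployed the GPU production service (Triton Server) provided in the code. To test the encoder module directly, I send fbank features straight to the server. With three threads issuing concurrent requests, each with a random number of steps (that is, the fbank sequences have different numbers of time steps), transcription is fairly slow but no error is raised. My guess is that because the requests have different step counts they cannot be combined into a batch, so the server-side dynamic_batching spends a long time waiting to form one. I therefore set max_queue_delay_microseconds to 70000, so that after 70 ms the server stops waiting for a batch and runs inference directly. With that setting, the client fails with some probability, with the following exception:

Traceback (most recent call last):
  File "debug_encoder.py", line 30, in input_numpy
    response = triton_client.infer("encoder",
  File "/opt/conda/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 1156, in infer
    raise_error_grpc(rpc_error)
  File "/opt/conda/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 62, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] onnx runtime error 2: not enough space: expected 318080, got 314240

In this case the three fbank features I sent had 482, 497, and 485 steps, dims was 80, and batch_size was 1. 318080 is exactly 497 × 80 × 8, so the model inexplicably ran out of space while predicting the 497-step request. The error is intermittent across repeated concurrent requests, and after it occurs, subsequent requests may still succeed. Sending requests one at a time never triggers it, and neither do concurrent requests of a fixed size. It only appears when concurrent requests have varying lengths and max_queue_delay_microseconds is relatively small.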
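The byte counts in the error message can be checked with a little arithmetic. The sketch below assumes, as the numbers in the report imply, a per-element size of 8 bytes; that size is inferred from 318080 = 497 × 80 × 8 and is not confirmed by the source. The helper names are hypothetical.

```python
def tensor_bytes(steps, dims, bytes_per_element, batch_size=1):
    """Size in bytes of a dense [batch_size, steps, dims] tensor."""
    return batch_size * steps * dims * bytes_per_element


def steps_for_bytes(n_bytes, dims, bytes_per_element, batch_size=1):
    """Invert tensor_bytes: how many time steps fit exactly in n_bytes."""
    per_step = batch_size * dims * bytes_per_element
    steps, remainder = divmod(n_bytes, per_step)
    if remainder:
        raise ValueError("byte count is not a whole number of steps")
    return steps


if __name__ == "__main__":
    # "expected 318080": matches the 497-step request from the report.
    print(tensor_bytes(497, 80, 8))
    # "got 314240": how many steps would that buffer hold?
    print(steps_for_bytes(314240, 80, 8))
```

Under the same assumption, the "got" size corresponds to 491 steps, which is none of the three request lengths listed in the report (482, 497, 485).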

Desktop (please complete the following information):

Triton Server: 21.11

Server RAM is 16 GB and GPU memory is 16 GB (T4 GPU), so insufficient GPU or host memory should not be the cause.
opened by piekey1994 15
Runtime: words containing non-ASCII characters are concatenated without space

The runtime outputs decoded words containing non-ASCII characters as concatenated with neighbouring words: e.g. "aa ää xx yy" is transformed to "aaääxx yy".

This is caused by the code block starting at https://github.com/wenet-e2e/wenet/blob/604231391c81efdf06454dbc99406bbc06cb030d/runtime/core/decoder/torch_asr_decoder.cc#L217

I understand that this is done in order to output Chinese "words" correctly (i.e., without spaces). However, this should at least be configurable, as currently it breaks wenet runtime for most other languages (i.e. those that have words with non-ASCII characters and where words are separated by spaces in the orthography).
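To make the reported behavior and the requested option concrete, here is a small sketch. It is not the code from torch_asr_decoder.cc: the joining rule (keep a space only between two all-ASCII words) is inferred from the single example in this report, and the `space_separated` switch is a hypothetical name for the kind of configuration being asked for.

```python
def join_words_reported(words):
    """Mimic the reported output: a space is kept only when both the
    current word and the previous word are pure ASCII."""
    pieces = []
    for i, word in enumerate(words):
        if i > 0 and word.isascii() and words[i - 1].isascii():
            pieces.append(" ")
        pieces.append(word)
    return "".join(pieces)


def join_words(words, space_separated=True):
    """Hypothetical configurable variant: languages written with spaces
    join on a single space; others fall back to the rule above."""
    if space_separated:
        return " ".join(words)
    return join_words_reported(words)


if __name__ == "__main__":
    words = "aa ää xx yy".split()
    print(join_words_reported(words))
    print(join_words(words, space_separated=True))
    print(join_words(["你好", "世界"], space_separated=False))
```

With a switch like this, Chinese output keeps its current form while space-delimited languages with non-ASCII letters get their word boundaries back.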

opened by alumae 14

cmake compile server/x86 error

Describe the bug

environment: centos7
gcc version 7.5.0
cmake version: 3.18.3
CUDA version: 10.2
gpu version:  Quadro RTX 8000

install steps:
$ conda create -n wenet python=3.8
$ conda activate wenet
$ pip install -r requirements.txt
$ conda install pytorch==1.6.0 cudatoolkit=10.2 torchaudio -c pytorch

$ cd wenet/runtime/server/x86/
$ mkdir build && cd build && cmake .. && cmake --build .

ERROR is as follows:

[ 50%] Linking CXX executable ctc_prefix_beam_search_test
/home4/md510/cmake-3.18.3/bin/cmake -E cmake_link_script CMakeFiles/ctc_prefix_beam_search_test.dir/link.txt --verbose=1
/home3/md510/gcc-7.5.0/bin/g++  -std=c++14 -pthread -fPIC -D_GLIBCXX_USE_CXX11_ABI=1 -DC10_USE_GLOG -L/cm/shared/apps/cuda10.2/toolkit/10.2.89/lib64 CMakeFiles/ctc_prefix_beam_search_test.dir/decoder/ctc_prefix_beam_search_test.cc.o -o ctc_prefix_beam_search_test   -L/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/build/openfst/lib  -Wl,-rpath,/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/build/openfst/lib:/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/fc_base/libtorch-src/lib lib/libgtest_main.a lib/libgmock.a libdecoder.a lib/libgtest.a ../fc_base/libtorch-src/lib/libtorch.so -Wl,--no-as-needed,/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/fc_base/libtorch-src/lib/libtorch_cpu.so -Wl,--as-needed ../fc_base/libtorch-src/lib/libc10.so -lpthread -Wl,--no-as-needed,/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/fc_base/libtorch-src/lib/libtorch.so -Wl,--as-needed ../fc_base/libtorch-src/lib/libc10.so kaldi/libkaldi-decoder.a kaldi/libkaldi-lat.a kaldi/libkaldi-util.a kaldi/libkaldi-base.a libutils.a -lfst 
/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/fc_base/libtorch-src/lib/libtorch_cpu.so: undefined reference to `lgammaf@GLIBC_2.23'
/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/fc_base/libtorch-src/lib/libtorch_cpu.so: undefined reference to `lgamma@GLIBC_2.23'
collect2: error: ld returned 1 exit status
gmake[2]: *** [ctc_prefix_beam_search_test] Error 1
gmake[2]: Leaving directory `/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/build'
gmake[1]: *** [CMakeFiles/ctc_prefix_beam_search_test.dir/all] Error 2
gmake[1]: Leaving directory `/home3/md510/w2020/wenet_20210512/wenet/runtime/server/x86/build'
gmake: *** [all] Error 2

Could you help me solve it?

opened by shanguanma 14

DLL load failed while importing _wenet: The specified module could not be found.

I installed WeNet with pip install wenet, and the installation reported success. I then ran the example program for recognition:

import sys
import wenet


def get_text_from_wav(dir, wav):
    model_dir = dir
    wav_file = wav
    decoder = wenet.Decoder(model_dir)
    ans = decoder.decode_wav(wav_file)
    print(ans)


if __name__ == '__main__':
    dir = "./models"
    wav = "./1.wav"
    get_text_from_wav(dir, wav)

Running it fails with the following error:

Traceback (most recent call last):
  File "D:\codes\speech2word\main.py", line 2, in <module>
    import wenet
  File "D:\codes\speech2word\venv\lib\site-packages\wenet\__init__.py", line 1, in <module>
    from .decoder import Decoder  # noqa
  File "D:\codes\speech2word\venv\lib\site-packages\wenet\decoder.py", line 17, in <module>
    import _wenet
ImportError: DLL load failed while importing _wenet: The specified module could not be found.

How can I solve this?

opened by billqu01 13
[Draft] Cache control v2
This is not a merge-ready PR. I am just pushing my testing code for discussion and further evaluation (such as GPU performance, ONNX export, ...).

Performance on CPU (intel i7-10510U @ 1.80GHz), RTF from 0.1 -> 0.07, about 30% improvement:
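For readers unfamiliar with the metric, the real-time factor (RTF) is processing time divided by audio duration, and the quoted improvement is the relative reduction in RTF. A minimal sketch of that arithmetic follows; the function names are hypothetical and the 7 s / 100 s figures are invented purely to illustrate an RTF of 0.07.

```python
import math


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent decoding / duration of the decoded audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def relative_reduction(before, after):
    """Fractional reduction of a lower-is-better metric."""
    return (before - after) / before


if __name__ == "__main__":
    # Invented example: 7 s of compute for 100 s of audio gives RTF 0.07.
    print(real_time_factor(7.0, 100.0))
    # RTF 0.1 -> 0.07, as reported above.
    print(round(relative_reduction(0.1, 0.07), 2))
    # The same change expressed as a speed-up factor.
    print(round(0.1 / 0.07, 2))
    assert math.isclose(relative_reduction(0.1, 0.07), 0.3)
```

Note that a 30% reduction in RTF corresponds to roughly a 1.43x speed-up, so the two ways of quoting the gain are not interchangeable.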

Detailed descriptions (in Chinese): https://horizonrobotics.feishu.cn/sheets/shtcniLh77AgP6NJAXhd5UHXDwh

Test code:

bash rtf.sh --api 1 > log.txt.1
bash rtf.sh --api 2 > log.txt.2
grep "RTF:" log.txt.1
grep "RTF:" log.txt.2

u2++_conformer.zip: https://horizonrobotics.feishu.cn/file/boxcnO50Ea8m0rR2p9FwJ8ZHEIc
words.txt: https://horizonrobotics.feishu.cn/file/boxcnBpSEOWoBSIgLdlHetsjOFd
opened by xingchensong 13

Use DDP training to get stuck

Describe the bug

Training hangs when I use DDP with my own WeNet checkout and my own data. It hangs (with GPU utilization at 100%) at the beginning of the second epoch every time. After debugging, I found that it is stuck at this position:

# wenet/utils/executor.py
with torch.cuda.amp.autocast(scaler is not None):
    loss, loss_att, loss_ctc = model(
        feats, feats_lengths, target, target_lengths)

Environment

CentOS Linux release 7.8.2003 (Core)
GPU Driver Version: 450.80.02
CUDA Version: 10.2
torch==1.8.0
torchaudio==1.8.1
torchvision==0.9.0

Some Attempts

I made some further attempts and found:
- 1 GPU: no problem; multiple GPUs: stuck
- static batch: no problem; dynamic batch: stuck
- conformer: no problem; unified_conformer: stuck

Other attempts:
- Upgrading PyTorch to 1.9.0 or 1.10.0 did not help
- Setting num_workers=0/1 did not help
- Switching from V100 to P40 did not help
- Sleeping for 1 minute after completing an epoch did not help
- With NCCL, training hangs completely without any error log
- With GLOO, the error log is:

2021-12-07 11:36:17,011 INFO Epoch 0 CV info cv_loss 115.3632936241356
2021-12-07 11:36:17,011 INFO Epoch 1 TRAIN info lr 6.08e-06
2021-12-07 11:36:17,014 INFO using accumulate grad, new batch size is 8 times larger than before
2021-12-07 11:36:17,335 INFO Epoch 0 CV info cv_loss 115.36239801458647
2021-12-07 11:36:17,335 INFO Epoch 1 TRAIN info lr 6.200000000000001e-06
2021-12-07 11:36:17,338 INFO using accumulate grad, new batch size is 8 times larger than before
2021-12-07 11:36:17,579 INFO Epoch 0 CV info cv_loss 115.36309641650827
2021-12-07 11:36:17,579 INFO Epoch 1 TRAIN info lr 5.96e-06
2021-12-07 11:36:17,582 INFO using accumulate grad, new batch size is 8 times larger than before
2021-12-07 11:36:17,926 INFO Epoch 0 CV info cv_loss 115.36275817930736
2021-12-07 11:36:17,926 INFO Checkpoint: save to checkpoint exp/conformer/0.pt
2021-12-07 11:36:18,889 INFO Epoch 1 TRAIN info lr 6.32e-06
2021-12-07 11:36:18,892 INFO using accumulate grad, new batch size is 8 times larger than before
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /opt/conda/conda-bld/pytorch_1614378062065/work/third_party/gloo/gloo/transport/tcp/pair.cc:490] op.preamble.length <= op.nbytes. 939336 vs 4
./run.sh: line 165:  7108 Aborted                 (core dumped) python wenet/bin/train.py --gpu $gpu_id --config $train_config --data_type $data_type --symbol_table $dict --train_data data/$train_set/data.list --cv_data data/dev/data.list ${checkpoint:+--checkpoint $checkpoint} --model_dir $dir --ddp.init_method $init_method --ddp.world_size $world_size --ddp.rank $rank --ddp.dist_backend $dist_backend --num_workers 8 $cmvn_opts --pin_memory
/homepath/envs/anaconda3/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/homepath/tools/wenet-uio/examples/aishell/s0/data/train/shards/shards_000000002.tar'>
  self._target(*self._args, **self._kwargs)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/homepath/envs/anaconda3/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/homepath/tools/wenet-uio/examples/aishell/s0/data/train/shards/shards_000000110.tar'>
  self._target(*self._args, **self._kwargs)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/homepath/envs/anaconda3/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/homepath/tools/wenet-uio/examples/aishell/s0/data/train/shards/shards_000000112.tar'>
  self._target(*self._args, **self._kwargs)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/homepath/envs/anaconda3/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/homepath/tools/wenet-uio/examples/aishell/s0/data/train/shards/shards_000000075.tar'>
  self._target(*self._args, **self._kwargs)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/homepath/envs/anaconda3/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/homepath/tools/wenet-uio/examples/aishell/s0/data/train/shards/shards_000000001.tar'>
  self._target(*self._args, **self._kwargs)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/homepath/envs/anaconda3/lib/python3.8/multiprocessing/process.py:108: ResourceWarning: unclosed file <_io.BufferedReader name='/homepath/tools/wenet-uio/examples/aishell/s0/data/train/shards/shards_000000086.tar'>
  self._target(*self._args, **self._kwargs)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
  File "wenet/bin/train.py", line 277, in <module>
    main()
  File "wenet/bin/train.py", line 250, in main
    executor.train(model, optimizer, scheduler, train_data_loader, device,
  File "/homepath/tools/wenet-uio/wenet/utils/executor.py", line 71, in train
    loss.backward()
  File "/homepath/envs/anaconda3/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/homepath/envs/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: [/opt/conda/conda-bld/pytorch_1614378062065/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [11.88.165.7]:54008
Traceback (most recent call last):
  File "wenet/bin/train.py", line 277, in <module>
    main()
  File "wenet/bin/train.py", line 250, in main
    executor.train(model, optimizer, scheduler, train_data_loader, device,
  File "/homepath/tools/wenet-uio/wenet/utils/executor.py", line 71, in train
    loss.backward()
  File "/homepath/envs/anaconda3/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/homepath/envs/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Application timeout caused pair closure

To Reproduce

Finally, I pulled the latest WeNet code and reproduced the above problem with the aishell recipe:

data_type=shard
train_config=conf/train_unified_conformer.yaml
cmvn=false
dynamic batch
accum_grad=8

How should this be solved? Thank you.

opened by 601222543 12

Decoding hangs when using LM rescoring

I'm following this tutorial to use LM rescoring for decoding: https://github.com/wenet-e2e/wenet/blob/23a61b212bf2c3886546925913f5574f779f474a/examples/librispeech/s0/run.sh#L234

I didn't re-train a model and instead, I use the pre-trained conformer model. I had no problem building the TLG.fst, but ./tools/decode.sh hangs forever when evaluating on the test set. Could you provide any suggestions on where the problem would be and how to debug this?

The below is the code I used for LM rescoring (I took this out from run.sh):

pretrained_model=wenet/models/20210216_conformer_exp
dict=$pretrained_model/words.txt
bpemodel=$pretrained_model/train_960_unigram5000

lm=data/local/lm
lexicon=data/local/dict/lexicon.txt
mkdir -p $lm
mkdir -p data/local/dict

# 7.1 Download & format LM
which_lm=3-gram.pruned.1e-7.arpa.gz
if [ ! -e ${lm}/${which_lm} ]; then
    wget http://www.openslr.org/resources/11/${which_lm} -P ${lm}
fi
echo "unzip lm($which_lm)..."
gunzip -k ${lm}/${which_lm} -c > ${lm}/lm.arpa
echo "Lm saved as ${lm}/lm.arpa"

# 7.2 Prepare dict
unit_file=$dict
bpemodel=$bpemodel
# use $dir/words.txt (unit_file) and $dir/train_960_unigram5000 (bpemodel)
# if you download pretrained librispeech conformer model
cp $unit_file data/local/dict/units.txt
if [ ! -e ${lm}/librispeech-lexicon.txt ]; then
    wget http://www.openslr.org/resources/11/librispeech-lexicon.txt -P ${lm}
fi
echo "build lexicon..."
tools/fst/prepare_dict.py $unit_file ${lm}/librispeech-lexicon.txt \
    $lexicon $bpemodel.model
echo "lexicon saved as '$lexicon'"

# 7.3 Build decoding TLG
tools/fst/compile_lexicon_token_fst.sh \
   data/local/dict data/local/tmp data/local/lang
tools/fst/make_tlg.sh data/local/lm data/local/lang data/lang_test || exit 1;

# 7.4 Decoding with runtime
echo "Start decoding..."
fst_dir=data/lang_test
dir=$pretrained_model
recog_set="test_clean"
for test in ${recog_set}; do
    ./tools/decode.sh --nj 2 \
        --beam 10.0 --lattice_beam 5 --max_active 7000 --blank_skip_thresh 0.98 \
        --ctc_weight 0.5 --rescoring_weight 1.0 --acoustic_scale 1.2 \
        --fst_path $fst_dir/TLG.fst \
        data/$test/wav.scp.10 data/$test/text.10 $dir/final.zip $fst_dir/words.txt \
        $dir/lm_with_runtime_${test}
    tail $dir/lm_with_runtime_${test}/wer
done

opened by boliangz 12

macOS M1 support?

[ 96%] Linking CXX shared library libwenet_api.dylib
ld: warning: ignoring file ../../../fc_base/libtorch-src/lib/libtorch.dylib, building for macOS-arm64 but attempting to link with file built for macOS-x86_64

opened by jinfagang 11
When using libtorch, gpu decoding is slower than cpu.

When decoding on GPU, GPU memory is allocated right away but GPU utilization only rises after a long time. For example, when decoding 600 utterances, progress is very slow until about the 100th, and then speeds up from the point where GPU utilization rises. Increasing the number of threads in decoder_main.cc makes it faster, but I'd like to fix the problem in the single-threaded case. What should I do?

cpu = 24 cores
gpu = rtx a5000(24gb) x 2
ubuntu 20.04.4

opened by hms1205 0
Quantized model under checkpoint mode performs quite different from the one under jit mode

I trained an ASR model and converted it into a quantized model in both JIT mode (named asr_quant.zip) and checkpoint mode (named asr_quant_checkpoint.pt). However, the results from JIT mode and checkpoint mode are quite different.

Quantized model in jit mode: test Final result: 甚至出现交易几乎停滞的情况

Quantized model in checkpoint mode: INFO BAC009S0764W0121 ▁LAWS骑钰阐易ISH燕▁CRITIC▁QUANTITY▁GOING骑燕▁MORE鲨ANSISH致▁GOING燕▁GOING燕▁GOING▁DESIRED▁GOING▁BREATH▁CRITIC俏尺骑▁GOING骑▁PERFECTION燕▁GOING燕▁SH燕▁SH谊▁PERFECTION敷唬诊▁SH定▁OVEN▁ORDERS尹O▁IGNORISH▁PRESIDENTO锣OKA▁PERFECTIONISH燕▁EIGHTEEN笛燕何▁PERFECTION▁INFORMEDLAND何骑▁PRETTY燕湿O▁PERFECTION尺O燕汐辆女何燕翼鲨O▁PERFECTION▁FIRST架燕绘翼盘锣▁THIS▁PRETTY▁SONG▁PERFECTION唬▁INFORMED障渲▁EIGHTEEN锣燕咏劈赌盘涉燕轧▁ABSORB汐O▁PERFECTION锣▁EIGHTEEN燕▁SH燕▁SH敷▁PRESIDENT书敷诊唬治唬唯轧辆▁IGNOR▁DOESN▁PERFECTION▁IGNOR洒翼O▁SAVE▁FIRST▁KISS▁PERFECTION锣▁PERFECTION备惭骑企洒▁PERFECTION洒慌▁SH▁CANDLE▁CHIN▁CANDLE企▁CHIN▁LIBERTY锣▁WEATHER▁FIRST▁COUNTRY敷▁CLERK

opened by PPGGG 2
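No output and no error message from recognition on Windows

python version = 3.8.5

First I installed the runtime: pip install wenetruntime

Then I used the following script:

import sys
import torch
import wenetruntime as wenet

wav_file = sys.argv[1]
decoder = wenet.Decoder(lang='chs')
ans = decoder.decode_wav(wav_file)
print(ans)

When I run the script on an audio.wav file, there is no output and no error message; the script simply exits. Does anyone know why? Am I missing some environment configuration?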

opened by zhhl9101 1
Efficient Conformer implementation
This PR is about our implementation of Efficient Conformer for WeNet encoder structure and runtime.

Original paper: https://arxiv.org/pdf/2109.01163.pdf

Original code: https://github.com/burchim/EfficientConformer

At 58.Com Inc, using Efficient Conformer reduces CER by 6% relative to Conformer and increases inference speed by 10% (CPU JIT runtime). Combined with int8 quantization, inference speed improves by 50~70%. More details on our work: https://mp.weixin.qq.com/s/7T1gnNrVmKIDvQ03etltGQ

Added features

[X] Efficient Conformer Encoder structure
    [X] StrideConformerEncoderLayer for "Progressive Downsampling to the Conformer encoder"
    [X] GroupedRelPositionMultiHeadedAttention for "Grouped Attention"
    [X] Conv2dSubsampling2 for 1/2 Convolution Downsampling

[X] Recognize and JIT export
    [X] forward_chunk and forward_chunk_by_chunk in wenet/efficient_conformer/encoder.py

[X] Streaming inference at JIT runtime
    [X] TorchAsrModelEfficient in runtime/core/decoder for Progressive Downsampling

[X] Configuration file of Aishell-1
    [X] train_u2++_efficonformer_v1.yaml for our online deployment
    [X] train_u2++_efficonformer_v2.yaml for Original paper

Developers

Efficient Conformer Encoder structure: ( Yaru Wang & Wei Zhou )

Recognize and JIT export: ( Wei Zhou )

Streaming inference at JIT runtime: ( Yongze Li )

Configuration file of Aishell-1: ( Wei Zhou )

TODO

[ ] ONNX export and runtime

[x] Aishell-1 experiment
opened by zwglory 2
Export ONNX fail with export_onnx_gpu.py

error.log Attached error.log is showed with verbose.

i tried with different onnxruntime versions, still gave the same errors. Simple log is as follow:

python3 wenet/bin/export_onnx_gpu.py --config=/home/ricky/heqing/8w-hours/squeezeformer-8whr-avg2/train.yaml --checkpoint=/home/ricky/heqing/8w-hours/squeezeformer-8whr-avg2/avg_10_156000_13_196000.pt --cmvn_file=/home/ricky/heqing/8w-hours/squeezeformer-8whr-avg2/global_cmvn --ctc_weight=0.5 --output_onnx_dir=/tmp
Failed to import k2 and icefall. Notice that they are necessary for hlg_onebest and hlg_rescore
Update ctc weight to 0.5
/home/ricky/wenet_train_res/wenet_tools_git/wenet/utils/mask.py:213: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  max_len = max_len if max_len > 0 else lengths.max().item()
/home/ricky/wenet_train_res/wenet_tools_git/wenet/transformer/embedding.py:96: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert offset + size < self.max_len
/home/ricky/wenet_train_res/wenet_tools_git/wenet/squeezeformer/attention.py:187: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if cache.size(0) > 0:
/home/ricky/wenet_train_res/wenet_tools_git/wenet/squeezeformer/attention.py:119: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if mask.size(2) > 0: # time2 > 0
/home/ricky/wenet_train_res/wenet_tools_git/wenet/squeezeformer/convolution.py:140: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if mask_pad.size(2) > 0: # time > 0
/home/ricky/wenet_train_res/wenet_tools_git/wenet/squeezeformer/convolution.py:171: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if mask_pad.size(2) > 0: # time > 0
/home/ricky/wenet_train_res/wenet_tools_git/wenet/squeezeformer/subsampling.py:159: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if L - T < 0:
[0, 0, 0] [-1, -1, -1]
2022-12-21 19:30:33.720608405 [W:onnxruntime:, constant_folding.cc:150 ApplyImpl] Unsupported output type of N11onnxruntime22SequenceTensorTypeBaseE. Can't constant fold SequenceEmpty node 'SequenceEmpty_2506'
2022-12-21 19:30:33.768034651 [W:onnxruntime:, constant_folding.cc:150 ApplyImpl] Unsupported output type of N11onnxruntime22SequenceTensorTypeBaseE. Can't constant fold SequenceEmpty node 'SequenceEmpty_2506'
2022-12-21 19:30:33.812875437 [W:onnxruntime:, constant_folding.cc:150 ApplyImpl] Unsupported output type of N11onnxruntime22SequenceTensorTypeBaseE. Can't constant fold SequenceEmpty node 'SequenceEmpty_2506'
2022-12-21 19:30:35.151413519 [E:onnxruntime:, sequential_executor.cc:333 Execute] Non-zero status code returned while running MatMul node. Name:'MatMul_2528' Status Message: Not satisfied: K_ == right_shape[right_num_dims - 2] || transb && K_ == right_shape[right_num_dims - 1] matmul_helper.h:42 ComputeMatMul dimension mismatch
Traceback (most recent call last):
  File "wenet/bin/export_onnx_gpu.py", line 574, in
    onnx_config = export_enc_func(model, configs, args, logger, encoder_onnx_path)
  File "wenet/bin/export_onnx_gpu.py", line 331, in export_offline_encoder
    ort_outs = ort_session.run(None, ort_inputs)
  File "/home/ricky/anaconda3/envs/wenet/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 124, in run
    return self.sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running MatMul node. Name:'MatMul_2528' Status Message: Not satisfied: K_ == right_shape[right_num_dims - 2] || transb && K_ == right_shape[right_num_dims - 1] matmul_helper.h:42 ComputeMatMul dimension mismatch
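
For readers unfamiliar with the final error, the condition quoted in the status message can be read as a plain shape check on the two MatMul operands. The following is a minimal Python sketch of that condition only, not ONNX Runtime's code; the function name `inner_dims_match` and the example shapes are hypothetical and are not taken from the failing model.

```python
def inner_dims_match(left_shape, right_shape, transb=False):
    """Mirror the check quoted in the log:
    K == right_shape[-2] or (transb and K == right_shape[-1]),
    where K is the last dimension of the left operand.
    """
    if len(left_shape) < 1 or len(right_shape) < 2:
        raise ValueError("left needs >= 1 dim and right needs >= 2 dims")
    k = left_shape[-1]
    return k == right_shape[-2] or (transb and k == right_shape[-1])


# Hypothetical shapes, chosen only to illustrate the check.
print(inner_dims_match((1, 100, 256), (256, 512)))               # inner dims agree
print(inner_dims_match((1, 100, 256), (512, 256)))               # mismatch
print(inner_dims_match((1, 100, 256), (512, 256), transb=True))  # agrees when B is transposed
```

When the check fails at run time but the export itself succeeded, one plausible cause (an assumption, not confirmed by this log) is that a shape-dependent Python branch flagged by the TracerWarning lines was frozen during tracing for a different input size.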

opened by rickychanhoyin 9
undefined value chunk_masks: in squeezeformer
Just pulled the latest wenet code and tried out Squeezeformer. Training failed with the log attached below. Any suggestions would be helpful. Thanks.

`the number of model params: 135,220,418
Traceback (most recent call last):
  File "wenet/bin/train.py", line 309, in
    main()
  File "wenet/bin/train.py", line 205, in main
    script_model = torch.jit.script(model)
  File "/home/bsen/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/jit/_script.py", line 1257, in script
    return torch.jit._recursive.create_script_module(
  File "/home/bsen/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/jit/_recursive.py", line 451, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
  File "/home/bsen/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/jit/_recursive.py", line 517, in create_script_module_impl
    create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
  File "/home/bsen/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/jit/_recursive.py", line 368, in create_methods_and_properties_from_stubs
    concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
  File "/home/bsen/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/jit/_recursive.py", line 869, in compile_unbound_method
    create_methods_and_properties_from_stubs(concrete_type, (stub,), ())
  File "/home/bsen/miniconda3/envs/wenet/lib/python3.8/site-packages/torch/jit/_recursive.py", line 368, in create_methods_and_properties_from_stubs
    concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
RuntimeError: undefined value chunk_masks:
  File "/home/bsen/wenet_new/examples/squeezformer/wenet/squeezeformer/encoder.py", line 379
    pos_emb = recover_pos_emb
    mask_pad = recover_mask_pad
    xs = xs.masked_fill(~chunk_masks[:, 0, :].unsqueeze(-1), 0.0)
                         ~~~~~~~~~~~ <--- HERE

    factor = self.calculate_downsampling_factor(i)

'SqueezeformerEncoder.forward_chunk' is being compiled since it was called from 'ASRModel.forward_encoder_chunk'
  File "/home/bsen/wenet_new/examples/squeezformer/wenet/transformer/asr_model.py", line 776

    """
    return self.encoder.forward_chunk(xs, offset, required_cache_size,
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                      att_cache, cnn_cache)
                                      ~~~~~~~~~~~~~~~~~~~~ <--- HERE`
opened by senbukai0203 2

Releases(v2.1.0)

v2.1.0(Nov 25, 2022)
What's Changed

allow instantiating multiple models in #1580

do not pack libtorch.so in python binding to reduce wheel size in #1573 and #1576

support iOS by @Ma-Dan in #1549 🛫

support HLG decode by @aluminumbox in #1521 💯

support squeezeformer by @yygle in #1519 👍

support XPU by @imoisture in #1455 🚀

and so on ...

v2.0.1(Jun 21, 2022)

This release is for hosting the wenet python binding models.
chs.tar.gz(174.96 MB)
en.tar.gz(183.94 MB)
v2.0.0(Jun 14, 2022)
The following features are stable.

[x] U2++ framework for better accuracy

[x] n-gram + WFST language model solution

[x] Context biasing(hotword) solution

[x] Very big data training support with UIO

[x] More dataset support, including WenetSpeech, GigaSpeech, HKUST and so on.

v1.0.0(Jun 21, 2021)
Model

propose and support U2++, as the following graph shows, which uses both forward and backward information at training and decoding.

support dynamic left chunk training and decoding, so we can limit the history chunks at decoding to save memory and computation.

support distributed training.
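
To make the left-chunk idea above concrete, here is a minimal pure-Python sketch of a chunk attention mask in which each frame may attend to its own chunk and to a limited number of preceding chunks. It is an illustration under stated assumptions, not WeNet's implementation: the function name `chunk_attention_mask` and its parameters are hypothetical, and a negative `num_left_chunks` is assumed to mean "unlimited history".

```python
def chunk_attention_mask(size, chunk_size, num_left_chunks=-1):
    """Build a size x size boolean mask; mask[i][j] is True when
    frame i is allowed to attend to frame j.

    Each frame sees its own chunk plus at most `num_left_chunks`
    chunks to its left (a negative value means all history).
    """
    if size < 0 or chunk_size <= 0:
        raise ValueError("size must be >= 0 and chunk_size must be > 0")
    mask = []
    for i in range(size):
        chunk_index = i // chunk_size
        if num_left_chunks < 0:
            start = 0  # unlimited history
        else:
            start = max((chunk_index - num_left_chunks) * chunk_size, 0)
        end = min((chunk_index + 1) * chunk_size, size)  # no look-ahead past own chunk
        mask.append([start <= j < end for j in range(size)])
    return mask


# 6 frames, chunks of 2 frames, at most 1 history chunk.
for row in chunk_attention_mask(6, 2, num_left_chunks=1):
    print("".join("x" if allowed else "." for allowed in row))
```

With a limited history, the number of attended frames per row is bounded by `(num_left_chunks + 1) * chunk_size`, which is where the memory and computation savings at decoding come from.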

Dataset

Now we support the following five standard speech datasets, and we achieve SOTA or close-to-SOTA results.

| Dataset | Language | Data size (h) | Test set | CER/WER | SOTA |
|-------------|----------|---------------|------------|---------|----------------|
| aishell-1 | Chinese | 200 | test | 4.36 | 4.36 (WeNet) |
| aishell-2 | Chinese | 1000 | test_ios | 5.39 | 5.39 (WeNet) |
| multi-cn | Chinese | 2385 | / | / | / |
| librispeech | English | 1000 | test_clean | 2.66 | 2.10 (ESPnet) |
| gigaspeech | English | 10000 | test | 11.0 | 10.80 (ESPnet) |

Productivity

Here are some features related to productivity.

LM support. Here is the system design for LM support. WeNet can work with or without an LM, depending on your application or scenario.

timestamp support.

n-best support.

endpoint support.

gRPC support

further refine x86 server and on-device android recipe.

v0.1.0(Feb 4, 2021)
Major Features

Joint CTC/AED model structure

U2, dynamic chunk training support

Torchaudio support

Runtime x86 and android support
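
As a rough illustration of the joint CTC/AED structure listed above, joint training is commonly expressed as a weighted sum of the CTC loss and the attention-decoder loss. The sketch below assumes that formulation; the function name `joint_ctc_aed_loss` and the example loss values are hypothetical and are not taken from the release notes.

```python
def joint_ctc_aed_loss(loss_ctc, loss_att, ctc_weight=0.5):
    """Combine a CTC loss and an attention-decoder (AED) loss.

    Assumed formulation:
        loss = ctc_weight * loss_ctc + (1 - ctc_weight) * loss_att
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att


# Hypothetical per-batch loss values.
print(joint_ctc_aed_loss(2.0, 4.0, ctc_weight=0.5))
```

Setting `ctc_weight` to 1.0 reduces this to a pure CTC objective, and 0.0 to a pure attention objective.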


NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

Torchserve server using a YoloV5 model running on docker with GPU and static batch inference to perform production ready inference.

Source code for "Progressive Transformers for End-to-End Sign Language Production" (ECCV 2020)

Contra is a lightweight, production ready Tensorflow alternative for solving time series prediction challenges with AI

A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.

African language Speech Recognition - Speech-to-Text

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

PyTorch implementation of "A Two-Stage End-to-End System for Speech-in-Noise Hearing Aid Processing"

Learning recognition/segmentation models without end-to-end training. 40%-60% less GPU memory footprint. Same training time. Better performance.

AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

This is the first released system for detecting and recognizing complex meters, implemented with computer vision techniques.

STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech

PIKA: a lightweight speech processing toolkit based on Pytorch and (Py)Kaldi

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

ERISHA is a multilingual, multispeaker expressive speech synthesis framework. It can transfer expressivity to the voice of a speaker for whom no expressive speech corpus is available.

PyTorch-LIT is the Lite Inference Toolkit (LIT) for PyTorch which focuses on easy and fast inference of large models on end-devices.

A complete end-to-end demonstration in which we collect training data in Unity and use that data to train a deep neural network to predict the pose of a cube. This model is then deployed in a simulated robotic pick-and-place task.