A 10000+ hours dataset for Chinese speech recognition

Last update: Jan 3, 2023


Overview

WenetSpeech

Official website | Paper

A 10000+ Hours Multi-domain Chinese Corpus for Speech Recognition

Download

Please visit the official website, read the license, and follow the instructions to download the data.

Benchmark

Toolkit	Dev	Test_Net	Test_Meeting	AIShell-1
Kaldi	9.07	12.83	24.72	5.41
ESPNet	9.70	8.90	15.90	3.90
WeNet	8.88	9.70	15.59	4.61

Description

Creation

All the data are collected from YouTube and Podcast. Optical character recognition (OCR) is used to label the YouTube recordings, and automatic speech recognition (ASR) is used to label the Podcast recordings. To improve the quality of the corpus, we use a novel end-to-end label error detection method to further validate and filter the data.

Set	Hours	Confidence	Usage
High Label	10005	>=0.95	Supervised Training
Weak Label	2478	[0.6, 0.95]	Semi-supervised or noise training
Unlabel	9952	/	Unsupervised training or Pre-training
In Total	22435	/	All above
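
The confidence ranges in the table can be read as a simple routing rule. The sketch below is only an illustration of that reading: the function name `label_set` is hypothetical, and because the table lists 0.95 in both the High Label and Weak Label ranges, the code assumes that a confidence of exactly 0.95 counts as High Label.

```python
def label_set(confidence):
    """Map a segment's label confidence to a set name from the table.

    `confidence` is None for audio that has no transcript at all.
    Assumption: exactly 0.95 is treated as High Label.
    """
    if confidence is None:
        return "Unlabel"
    if confidence >= 0.95:
        return "High Label"
    if confidence >= 0.6:
        return "Weak Label"
    # Confidence below 0.6 is outside every range listed in the table.
    return None


# Hours per set, copied from the table, to check the "In Total" row.
hours = {"High Label": 10005, "Weak Label": 2478, "Unlabel": 9952}
total_hours = sum(hours.values())
print(total_hours)  # 22435
```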

High Label Data

We classify the high-label data into 10 groups according to domain, speaking style, and scenario.

Domain	YouTube (hours)	Podcast (hours)	Total (hours)
audiobook	0	250.9	250.9
commentary	112.6	135.7	248.3
documentary	386.7	90.5	477.2
drama	4338.2	0	4338.2
interview	324.2	614	938.2
news	0	868	868
reading	0	1110.2	1110.2
talk	204	90.7	294.7
variety	603.3	224.5	827.8
others	144	507.5	651.5
Total	6113	3892	10005
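
As a quick sanity check, the column totals can be recomputed from the per-domain rows. The snippet below only uses the numbers from the table; the variable names are illustrative.

```python
# Hours per domain as (YouTube, Podcast), copied from the table above.
domain_hours = {
    "audiobook": (0, 250.9),
    "commentary": (112.6, 135.7),
    "documentary": (386.7, 90.5),
    "drama": (4338.2, 0),
    "interview": (324.2, 614),
    "news": (0, 868),
    "reading": (0, 1110.2),
    "talk": (204, 90.7),
    "variety": (603.3, 224.5),
    "others": (144, 507.5),
}

# Round to one decimal place to absorb floating-point noise.
youtube_total = round(sum(y for y, _ in domain_hours.values()), 1)
podcast_total = round(sum(p for _, p in domain_hours.values()), 1)
grand_total = round(youtube_total + podcast_total, 1)

print(youtube_total, podcast_total, grand_total)  # 6113.0 3892.0 10005.0

# Share of the largest domain in the high-label data.
drama_share = sum(domain_hours["drama"]) / grand_total
print(f"{drama_share:.1%}")  # 43.4%
```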

As shown in the following table, we provide 3 training subsets, namely S, M, and L, for building ASR systems at different data scales.

Training Subsets	Confidence	Hours
L	[0.95, 1.0]	10005
M	1.0	1000
S	1.0	100
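
One way to picture how a smaller subset relates to the full data is a filter on confidence followed by an hour budget. The sketch below is a toy illustration under stated assumptions, not the procedure actually used to build S and M: the function `pick_subset`, the field names `confidence` and `duration`, and the greedy in-order selection are all hypothetical.

```python
def pick_subset(segments, budget_hours, min_confidence=1.0):
    """Greedily collect qualifying segments without exceeding the hour budget.

    Each segment is a dict with hypothetical keys:
    "confidence" (float) and "duration" (seconds).
    Returns the chosen segments and their total duration in hours.
    """
    budget_sec = budget_hours * 3600
    chosen, total_sec = [], 0.0
    for seg in segments:
        if seg["confidence"] < min_confidence:
            continue  # below the confidence required for this subset
        if total_sec + seg["duration"] > budget_sec:
            continue  # adding this segment would exceed the budget
        chosen.append(seg)
        total_sec += seg["duration"]
    return chosen, total_sec / 3600


# Toy data: four half-hour segments, one with confidence below 1.0.
segments = [
    {"id": "a", "confidence": 1.0, "duration": 1800.0},
    {"id": "b", "confidence": 0.97, "duration": 1800.0},
    {"id": "c", "confidence": 1.0, "duration": 1800.0},
    {"id": "d", "confidence": 1.0, "duration": 1800.0},
]
subset, subset_hours = pick_subset(segments, budget_hours=1)
print([seg["id"] for seg in subset], subset_hours)  # ['a', 'c'] 1.0
```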

Evaluation Sets

Evaluation Sets	Hours	Source	Description
DEV	20	Internet	Designed for speech tools that require a cross-validation set during training
TEST_NET	23	Internet	Matched test
TEST_MEETING	15	Real meeting	Mismatched test: a far-field, conversational, spontaneous meeting dataset

Contributors

ACKNOWLEDGEMENTS

WenetSpeech draws heavily on the work of GigaSpeech, and we thank Jiayu Du and Guoguo Chen for their suggestions on this work.
We thank Xi'an Future AI Innovation Center for providing hosting service for WenetSpeech. We also thank MindSpore, a new deep learning computing framework, for supporting this work.
Our gratitude goes to Lianhui Zhang and Yu Mao for collecting some of the YouTube data.

Comments

How to access the weakly labeled and unlabeled data?

Hi team! Thanks for providing this dataset.

After running WenetSpeech/toolkits/kaldi/local/wenetspeech_data_prep.sh with the argument --train-subset L, it seems that the Kaldi dataset produced by this script contains only the 10k hours of high-label data. What should I do if I want to use the remaining part of the WenetSpeech dataset?

Thanks :)

opened by jctian98 2

The mismatch between the marked duration and the actual audio duration.

I am using k2 and Lhotse for WenetSpeech ASR experiments, and I ran into an error.

I then checked the actual duration of the affected sample (its marked duration is 786.44s) and found that the actual duration is 988.89s.

Can the marked duration be corrected in the original transcripts? Or should I apply a filtering function to avoid this error?
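
A filtering function of the kind the question mentions could look like the sketch below. It is a minimal illustration in plain Python and does not use the Lhotse API. The function names, the record fields, the recording IDs, and the 0.5-second tolerance are all assumptions, and the example treats 786.44 s as the marked duration and 988.89 s as the measured one.

```python
def duration_matches(marked_sec, actual_sec, tolerance_sec=0.5):
    """Return True when marked and measured durations agree within a tolerance."""
    return abs(marked_sec - actual_sec) <= tolerance_sec


def split_by_duration(recordings, tolerance_sec=0.5):
    """Separate recordings into those to keep and those to drop.

    Each recording is a dict with hypothetical keys:
    "id", "marked_sec" (from the transcript metadata),
    and "actual_sec" (measured from the audio file).
    """
    kept, dropped = [], []
    for rec in recordings:
        if duration_matches(rec["marked_sec"], rec["actual_sec"], tolerance_sec):
            kept.append(rec)
        else:
            dropped.append(rec)
    return kept, dropped


# Hypothetical IDs; the second entry uses the two durations from the report.
recordings = [
    {"id": "rec_ok", "marked_sec": 120.00, "actual_sec": 120.02},
    {"id": "rec_bad", "marked_sec": 786.44, "actual_sec": 988.89},
]
kept, dropped = split_by_duration(recordings)
print([r["id"] for r in kept], [r["id"] for r in dropped])  # ['rec_ok'] ['rec_bad']
```

Dropping mismatched recordings is the conservative option; correcting the marked duration instead would require trusting the measured value.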

opened by luomingshuang 2

Pretrained weights?

Dear authors, thanks for publishing such a large-scale and useful dataset. Have you released any pretrained weights? If so, it would save a lot of energy consumption and human resources, since the training procedure is relatively large and expensive. Thank you.

opened by dragen1860 1

[Question] about the results

Hi WeNet team, thanks for this open dataset. I have some questions about the results in https://github.com/wenet-e2e/WenetSpeech/blob/main/README.md#benchmark

The ESPnet model was trained for 50 epochs, while the WeNet model was trained for only about half of that (26 epochs). Why were the two not trained for the same number of epochs?

The ESPnet model uses an external Transformer LM in decoding. Does WeNet have results for decoding with an external LM?
opened by maxwellzh 1

Error when untarring the encrypted dataset

After downloading the whole dataset, an error occurs in the function process_downloaded_object. It seems to occur when untarring the encrypted dataset.

opened by ZihanLiao 0

utils not found

When I train on WenetSpeech using Kaldi, I get an error: ./run.sh: line 45: ./utils/parse_options.sh: No such file or directory.

In path.sh, export PATH=$PWD/utils/ adds utils to the path, but there is no utils in the toolkits/kaldi directory. Is utils missing?

opened by jiangno111 0

fix process_opus.py

Modify the file according to PR https://github.com/wenet-e2e/WenetSpeech/pull/10 to fix the lint error. We are not going to merge that PR in the future.

opened by robin1001 0

CC-BY-NC vs CC-BY

I think if you want your data to be non-commercial, the license should be CC-BY-NC (https://creativecommons.org/licenses/by-nc/4.0/) rather than CC-BY (https://creativecommons.org/licenses/by/4.0/).

opened by tshmak 0

