Overview

wav2vec-toolkit

A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models

This repository accompanies the 🤗 HuggingFace Community Paper on finetuning Wav2Vec2 XLSR for low-resource languages [link]
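
For context, finetuning here means taking the multilingual facebook/wav2vec2-large-xlsr-53 checkpoint and training a CTC head on a language-specific vocabulary. Below is a minimal sketch using 🤗 Transformers directly (not this toolkit's own API); the vocab.json file is assumed to list your language's characters:

    # Minimal XLSR finetuning setup via 🤗 Transformers (a sketch, not this
    # toolkit's API). Assumes vocab.json holds your language's character vocab.
    from transformers import (
        Wav2Vec2CTCTokenizer,
        Wav2Vec2FeatureExtractor,
        Wav2Vec2ForCTC,
        Wav2Vec2Processor,
    )

    tokenizer = Wav2Vec2CTCTokenizer(
        "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
    )
    feature_extractor = Wav2Vec2FeatureExtractor(
        feature_size=1, sampling_rate=16000, padding_value=0.0,
        do_normalize=True, return_attention_mask=True,
    )
    processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-xlsr-53",
        ctc_loss_reduction="mean",
        pad_token_id=processor.tokenizer.pad_token_id,
        vocab_size=len(processor.tokenizer),
    )
    model.freeze_feature_extractor()  # keep the pretrained convolutional encoder frozen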

How to contribute

(Mostly identical to the huggingface/datasets contributing guide)

  1. Fork the repository by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.

  2. Clone your fork to your local disk, and add the base repository as a remote:

    git clone git@github.com:<your Github handle>/wav2vec-toolkit.git
    cd wav2vec-toolkit
    git remote add upstream https://github.com/anton-l/wav2vec-toolkit.git
  3. Create a new branch to hold your development changes:

    git checkout -b a-descriptive-name-for-my-changes

    Do not work on the main branch.

  4. Set up a development environment by running the following command in a virtual environment:

    pip install -e ".[dev]"

    (If wav2vec-toolkit was already installed in the virtual environment, remove it with pip uninstall wav2vec_toolkit before reinstalling it in editable mode with the -e flag.)

  5. Develop the features on your branch.

  6. Format your code. Run black and isort so that your newly added files look nice, using the following commands:

    black --line-length 119 --target-version py36 src scripts
    isort src scripts
  7. Once you're happy with your implementation, add your changed files and make a commit to record your changes locally:

    git add .
    git commit

    It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes:

    git fetch upstream
    git rebase upstream/main

    Push the changes to your account using:

    git push -u origin a-descriptive-name-for-my-changes
  8. Once you are satisfied, go to the webpage of your fork on GitHub. Click on "Pull request" to send your changes to the project maintainers for review.

Comments
  • Reprogrammed the normalizer and langs functionality

    Hi @anton-l

    🤩 Creating a toolkit is a really good idea, and I'm willing to contribute for the sake of all possible languages. I reprogrammed the normalizer so that text preprocessing can be configured for each language separately. This makes it easy for any native speaker to add characters to filter or map, or even a custom normalization and preprocessing procedure at the word or text level. (A minimal normalizer sketch follows after the comments.)

    opened by m3hrdadfi 5
  • Regarding the Code Organization

    Recently I came across Facebook AI's MMF repository, which has a registry feature that I find really cool. Since then I have followed the same code organization in my projects; one example is https://github.com/gchhablani/toxic-spans-detection

    Basically, there is one registry (or, as I call it, a configmapper). Every run has a YAML config, and using the registry together with that config, we load the objects/classes in the script.

    I'm not sure if we can do exactly the same here, but some basic organization for experiments could follow a similar structure (see the registry sketch after the comments). Let me know your thoughts on this.

    opened by gchhablani 3
  • Create ga.py (Irish)

    Irish case folding is weird: An tUachtarán -> an t-uachtarán; Ár nAthair -> ár n-athair. (chars_to_ignore_regex is pretty small because there was never any Wikipedia scraping.) A case-folding sketch follows after the comments.

    opened by jimregan 0
  • feat: support wandb

    I've added the barebones for W&B support.

    A few notes:

    • I haven't built the training logic because I feel it's a separate issue. My own structure closely follows run_common_voice with a few tweaks, but I see we've already created a few utils in this repo.
    • I'm not using WANDB_LOG_MODEL; instead I manually log all the required files, because WANDB_LOG_MODEL would only log the model files, while we also want the preprocessing files (which are not saved by trainer.save_model()). A logging sketch follows after the comments.
    • For a hyperparameter search using W&B sweeps, we will just need a YAML like this
    • I didn't add wandb to the requirements (should it be dev-only?) since I'm not sure if this repo will be mainly for training or also for evaluation (I imagine evaluation would go directly through transformers)
    opened by borisdayma 0
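
A minimal sketch of the per-language normalizer idea from the first comment. The class names, fields, and the Persian example mapping are hypothetical, not this repo's actual API:

    # Hypothetical per-language normalizer sketch; names and fields are
    # illustrative only. Each language subclass overrides what it needs.
    import re


    class BaseNormalizer:
        # Characters stripped from every transcript before training.
        chars_to_ignore_regex = r"[\,\?\.\!\-\;\:\"]"
        # Language-specific character replacements (e.g. unifying variants).
        chars_to_map = {}

        def normalize(self, text: str) -> str:
            for src, dst in self.chars_to_map.items():
                text = text.replace(src, dst)
            text = re.sub(self.chars_to_ignore_regex, "", text)
            return text.lower().strip()


    class PersianNormalizer(BaseNormalizer):
        # Also strip Persian punctuation, and map Arabic-script variants
        # onto their canonical Persian forms.
        chars_to_ignore_regex = r"[\,\?\.\!\-\;\:\"،؟]"
        chars_to_map = {"ي": "ی", "ك": "ک"}


    print(PersianNormalizer().normalize("سلام، دنيا!"))  # -> سلام دنیا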
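A minimal sketch of the registry / configmapper pattern from the second comment (illustrative names, not existing code in this repo):

    # A tiny registry: classes register under a string key, and a YAML-style
    # per-run config picks which one to instantiate. All names are illustrative.
    registry = {}


    def register(name):
        def decorator(cls):
            registry[name] = cls
            return cls
        return decorator


    @register("common_voice")
    class CommonVoicePreprocessor:
        def __init__(self, language):
            self.language = language


    # e.g. parsed from a per-run YAML config file
    config = {"preprocessor": "common_voice", "language": "ga"}
    preprocessor = registry[config["preprocessor"]](language=config["language"])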
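The Irish case-folding rule from the ga.py comment could look like this; irish_lower is a hypothetical helper and may differ from what ga.py actually implements:

    # Hypothetical sketch of Irish-aware lowercasing: a lowercase t/n prefix
    # attached to a capitalized vowel gains a hyphen when the word is folded
    # (An tUachtarán -> an t-uachtarán, Ár nAthair -> ár n-athair).
    import re


    def irish_lower(text: str) -> str:
        # Insert the hyphen before lowercasing erases the prefix boundary.
        text = re.sub(r"\b([tn])([AEIOUÁÉÍÓÚ])", r"\1-\2", text)
        return text.lower()


    print(irish_lower("An tUachtarán"))  # -> an t-uachtarán
    print(irish_lower("Ár nAthair"))    # -> ár n-athair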
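Finally, manually logging the model together with the preprocessing files (rather than relying on WANDB_LOG_MODEL, as described in the W&B comment) might look like this. It reuses the model and processor from the Overview sketch, and the project name is an assumption:

    # Sketch: manually sync model + preprocessing files to W&B instead of
    # WANDB_LOG_MODEL, reusing `model` and `processor` from the Overview sketch.
    import wandb

    run = wandb.init(project="wav2vec-toolkit")  # hypothetical project name
    model.save_pretrained("./model")             # what trainer.save_model() would write
    processor.save_pretrained("./model")         # preprocessing files W&B would otherwise miss
    wandb.save("./model/*")                      # upload everything in the directory
    run.finish()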