Wav2Vec for speech recognition, classification, and audio classification

Overview

Soxan

In Persian, the name means "speech" (سخن).

This repository contains models, scripts, and notebooks that help you get the full benefit of Wav2Vec 2.0 in your research. Below, I'll show you how to train models for speech tasks on your own dataset and how to use the pretrained models.

How to train

This is only the beginning of the many possible speech tasks. To start, the training script covers the speech emotion recognition problem.

Training - Notebook

| Task | Notebook |
|------|----------|
| Speech Emotion Recognition (Wav2Vec 2.0) | Open In Colab |
| Speech Emotion Recognition (HuBERT) | Open In Colab |
| Audio Classification (Wav2Vec 2.0) | Open In Colab |

Training - CMD

# --model_mode can be "wav2vec2" or "hubert"
python3 run_wav2vec_clf.py \
    --pooling_mode="mean" \
    --model_name_or_path="lighteternal/wav2vec2-large-xlsr-53-greek" \
    --model_mode="wav2vec2" \
    --output_dir=/path/to/output \
    --cache_dir=/path/to/cache/ \
    --train_file=/path/to/train.csv \
    --validation_file=/path/to/dev.csv \
    --test_file=/path/to/test.csv \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --gradient_accumulation_steps=2 \
    --learning_rate=1e-4 \
    --num_train_epochs=5.0 \
    --evaluation_strategy="steps" \
    --save_steps=100 \
    --eval_steps=100 \
    --logging_steps=100 \
    --save_total_limit=2 \
    --do_eval \
    --do_train \
    --fp16 \
    --freeze_feature_extractor
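
For reference, the train/dev/test files are plain CSVs mapping audio paths to labels. Here is a minimal sketch of producing one; the column names path and emotion are assumptions based on the emotion-recognition setup, so adjust them to your task:

# A hypothetical sketch of the expected CSV layout; the column names "path"
# and "emotion" are assumptions based on the emotion-recognition notebooks.
import pandas as pd

df = pd.DataFrame(
    {
        "path": ["/path/to/clip_0001.wav", "/path/to/clip_0002.wav"],
        "emotion": ["anger", "happiness"],
    }
)
df.to_csv("/path/to/train.csv", index=False)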

Prediction

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import AutoConfig, Wav2Vec2FeatureExtractor
from src.models import Wav2Vec2ForSpeechClassification, HubertForSpeechClassification

model_name_or_path = "path/to/your-pretrained-model"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
sampling_rate = feature_extractor.sampling_rate

# For Wav2Vec 2.0 checkpoints:
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)

# For HuBERT checkpoints (use one or the other, matching your checkpoint):
model = HubertForSpeechClassification.from_pretrained(model_name_or_path).to(device)


def speech_file_to_array_fn(path, sampling_rate):
    # Load the audio file and resample it to the sampling rate the model expects.
    speech_array, _sampling_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech


def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
    inputs = {key: inputs[key].to(device) for key in inputs}

    with torch.no_grad():
        logits = model(**inputs).logits

    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [
        {"Emotion": config.id2label[i], "Score": f"{score * 100:.1f}%"}
        for i, score in enumerate(scores)
    ]
    return outputs


path = "/path/to/disgust.wav"
outputs = predict(path, sampling_rate)    

Output:

[
    {'Emotion': 'anger', 'Score': '0.0%'},
    {'Emotion': 'disgust', 'Score': '99.2%'},
    {'Emotion': 'fear', 'Score': '0.1%'},
    {'Emotion': 'happiness', 'Score': '0.3%'},
    {'Emotion': 'sadness', 'Score': '0.5%'}
]

Demos

| Demo | Link |
|------|------|
| Speech To Text With Emotion Recognition (Persian) - soon | huggingface.co/spaces/m3hrdadfi/speech-text-emotion |

Models

| Dataset | Model |
|---------|-------|
| ShEMO: a large-scale validated database for Persian speech emotion detection | m3hrdadfi/wav2vec2-xlsr-persian-speech-emotion-recognition |
| ShEMO: a large-scale validated database for Persian speech emotion detection | m3hrdadfi/hubert-base-persian-speech-emotion-recognition |
| ShEMO: a large-scale validated database for Persian speech emotion detection | m3hrdadfi/hubert-base-persian-speech-gender-recognition |
| Speech Emotion Recognition (Greek) (AESDD) | m3hrdadfi/hubert-large-greek-speech-emotion-recognition |
| Speech Emotion Recognition (Greek) (AESDD) | m3hrdadfi/hubert-base-greek-speech-emotion-recognition |
| Speech Emotion Recognition (Greek) (AESDD) | m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition |
| Eating Sound Collection | m3hrdadfi/wav2vec2-base-100k-eating-sound-collection |
| GTZAN Dataset - Music Genre Classification | m3hrdadfi/wav2vec2-base-100k-gtzan-music-genres |

Comments
  • 'Wav2Vec2FeatureExtractor' object has no attribute 'feature_extractor'

    When running wav2vec training on my dataset, the following problem occurred on line 296 of the run_wav2vec_clf.py script:

    Traceback (most recent call last):
      File "run_wav2vec_clf.py", line 490, in <module>
        main()
      File "run_wav2vec_clf.py", line 295, in main
        target_sampling_rate = feature_extractor.feature_extractor.sampling_rate
    AttributeError: 'Wav2Vec2FeatureExtractor' object has no attribute 'feature_extractor'

    The solution for me was to replace that line with:

    target_sampling_rate = feature_extractor.sampling_rate

    Hope this helps if anyone else has the same problem.

    opened by freds0 0
  • About the dataset creation and training speed

    Hello @m3hrdadfi, sorry to disturb you. I created my own train.csv (4,213 records) and dev.csv (527 records) and ran run_wav2vec_clf.py to train a music genre recognition model, but found that:

    1. The cached data is too large. My recordings are 16 kHz MP3s totaling less than 1 GB, but the files generated under ~/.cache/huggingface/datasets/csv/default-f524d204c50754f6/0.0.0/ take up more than 18 GB; did you run into this? Moreover, the train_dataset.map stage took more than 2 hours, which is very long.
    2. How can I use all 4 GPUs to train? I tried setting CUDA_VISIBLE_DEVICES=0,1,2,3, but it doesn't work and GPU utilization is very low (a launch sketch follows below).

    Thanks a lot!
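
    For question 2, a minimal launch sketch, assuming run_wav2vec_clf.py uses the standard Hugging Face Trainer: launched as a plain python process, the Trainer falls back to nn.DataParallel on multiple GPUs, which is known for low utilization; torch.distributed.launch starts one process per GPU instead. All paths and arguments below are placeholders.

    # one process per GPU (4 here); pass the same arguments as in "Training - CMD"
    CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
        run_wav2vec_clf.py \
        --output_dir=/path/to/output \
        --per_device_train_batch_size=4
        # ...remaining arguments as in the "Training - CMD" section above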

    opened by Ramlinbird 0
  • Demo of training gtzan music genre classifier

    You have not included a demo notebook for the music genre classifier. I used your pretrained model to predict, and the prediction scores seem to be correct. Could you share your training process for the GTZAN dataset? If not, could you at least tell me which pretrained model you used to model the GTZAN dataset? It couldn't have been the same one you used for modeling eating sounds, right?

    opened by 3N4N 0
  • AttributeError: 'Wav2Vec2Config' object has no attribute 'problem_type'

    Hi, can you help me with this problem? Thank you!


    AttributeError                            Traceback (most recent call last)
    in
    ----> 1 trainer.train()

    ~/anaconda3/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
       1051                 raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
       1052
    -> 1053             logger.info(f"Loading model from {resume_from_checkpoint}).")
       1054
       1055             if os.path.isfile(os.path.join(resume_from_checkpoint, CONFIG_NAME)):

    in training_step(self, model, inputs)
         45                 loss = self.compute_loss(model, inputs)
         46         else:
    ---> 47             loss = self.compute_loss(model, inputs)
         48
         49         if self.args.gradient_accumulation_steps > 1:

    ~/anaconda3/lib/python3.7/site-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
       1473         # Save model checkpoint
       1474         checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
    -> 1475
       1476         if self.hp_search_backend is not None and trial is not None:
       1477             if self.hp_search_backend == HPSearchBackend.OPTUNA:

    ~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        725             result = self._slow_forward(*input, **kwargs)
        726         else:
    --> 727             result = self.forward(*input, **kwargs)
        728         for hook in itertools.chain(
        729                 _global_forward_hooks.values(),

    in forward(self, input_values, attention_mask, output_attentions, output_hidden_states, return_dict, labels)
         83         loss = None
         84         if labels is not None:
    ---> 85             if self.config.problem_type is None:
         86                 if self.num_labels == 1:
         87                     self.config.problem_type = "regression"

    AttributeError: 'Wav2Vec2Config' object has no attribute 'problem_type'
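
    A hedged workaround sketch: problem_type was added to transformers configs in later releases, so upgrading transformers is the cleanest fix; on an older install you can also set the attribute on the config yourself before building the model. The single-label choice below is an assumption, and the model path is a placeholder:

    # Hedged workaround: set `problem_type` explicitly on an older transformers
    # config (assumes single-label classification).
    from transformers import AutoConfig
    from src.models import Wav2Vec2ForSpeechClassification

    model_name_or_path = "path/to/your-pretrained-model"
    config = AutoConfig.from_pretrained(model_name_or_path)
    config.problem_type = "single_label_classification"
    model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path, config=config)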

    opened by 245789064 0
  • how to make DataCollatorCTCWithPadding in my own train function

    This repo uses DataCollatorCTCWithPadding for padding the input waveforms, but is it possible to use this class in my own collate_fn? Or is it possible to pass other inputs to it, for example both input_values and input features?
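
    A minimal sketch of such a collate_fn, assuming each dataset item is a dict carrying an "input_values" waveform array and an integer "label"; it reuses the feature extractor's pad method, which is what the CTC data collator calls under the hood:

    # Hedged sketch: pad raw waveforms in a custom collate_fn via the feature
    # extractor's `pad` method (assumes items carry "input_values" and "label").
    import torch
    from transformers import Wav2Vec2FeatureExtractor

    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

    def collate_fn(batch):
        features = [{"input_values": item["input_values"]} for item in batch]
        padded = feature_extractor.pad(features, padding=True, return_tensors="pt")
        padded["labels"] = torch.tensor([item["label"] for item in batch], dtype=torch.long)
        return padded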

    opened by gitgaviny 0
  • Error when loading tokenizer after fine-tuning

    Hi, first of all, congrats on the repo, it's really useful! I followed the Emotion recognition in Greek speech using Wav2Vec2.ipynb notebook. After finishing the training on my own data, I am getting the following error when trying to load the processor with

    processor = Wav2Vec2Processor.from_pretrained(model_name_or_path)
    

    The error:

    OSError: Can't load tokenizer for '[/path/to/model/]checkpoint-860/'. If you were trying to load it from 'https://huggingface.co/models',
    make sure you don't have a local directory with the same name. Otherwise, make sure '[/path/to/model/]checkpoint-860/' is the correct path to a directory containing all relevant files for a Wav2Vec2CTCTokenizer tokenizer.
    

    Checking the checkpoint folder, there is no tokenizer file in there; am I missing something? This is the content of the mentioned folder: [screenshot]

    PS: the model itself loads correctly with model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name)
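
    A likely explanation, sketched under the assumption that the checkpoint came from run_wav2vec_clf.py: classification fine-tuning involves no tokenizer (unlike CTC), so the checkpoint only ships a feature-extractor config. Loading Wav2Vec2FeatureExtractor directly, as the Prediction section above does, avoids the error; the checkpoint path is a placeholder:

    from transformers import Wav2Vec2FeatureExtractor

    # Classification checkpoints save a preprocessor config but no tokenizer
    # files, so load the feature extractor instead of the full Wav2Vec2Processor.
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("/path/to/model/checkpoint-860/")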

    opened by jvel07 0
  • use_amp not defined in CTCTrainer

    2 frames
    in training_step(self, model, inputs)
         41         inputs = self._prepare_inputs(inputs)
         42
    ---> 43         if self.use_amp:
         44             with autocast():
         45                 loss = self.compute_loss(model, inputs)

    AttributeError: 'CTCTrainer' object has no attribute 'use_amp'
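
    A hedged workaround sketch: newer transformers releases removed the Trainer.use_amp attribute, so a custom training_step can gate autocast on the fp16 training argument instead. This assumes CTCTrainer subclasses transformers.Trainer and, for brevity, omits the GradScaler that a full mixed-precision step would normally use:

    from torch.cuda.amp import autocast
    from transformers import Trainer

    class CTCTrainer(Trainer):
        def training_step(self, model, inputs):
            model.train()
            inputs = self._prepare_inputs(inputs)
            # was: `if self.use_amp:` -- the attribute no longer exists on Trainer
            if self.args.fp16:
                with autocast():
                    loss = self.compute_loss(model, inputs)
            else:
                loss = self.compute_loss(model, inputs)
            if self.args.gradient_accumulation_steps > 1:
                loss = loss / self.args.gradient_accumulation_steps
            loss.backward()
            return loss.detach()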

    opened by AMEERAZAM08 1
  • m3hrdadfi/hubert-base-persian-speech-gender-recognition dataset name

    Hi @m3hrdadfi, thanks for your notable endeavors for the Persian community. I want to retrain the speech-gender recognition model; if you trained it on a dataset other than the Common Voice dataset, I would appreciate it if you could share it. I appreciate any help you can provide.

    opened by pooya-mohammadi 0
  • multi decoder speech model

    Hi there, I am planning to fine-tune XLS-R with multiple decoder heads (language detection, ASR, speech-to-IPA, gender identification, etc.). Do you know of any XLS-R, WavLM, or other speech model implementations of this, preferably in Hugging Face, that I could use to build multiple decoder heads on top of a single pretrained model so it performs all these tasks at once?

    opened by StephennFernandes 0
Owner
Mehrdad Farahani
Researcher, NLP Engineer, Deep Learning Engineer φ