# 🤗 Transformers Wav2Vec2 + PyCTCDecode
## Introduction
This repo shows how Wav2Vec2 can be combined with `pyctcdecode` and a KenLM n-gram language model to improve speech-recognition results. Included is a file to create an n-gram with KenLM as well as a simple evaluation script to compare the results of using Wav2Vec2 with `pyctcdecode` + KenLM vs. without using any language model.
**Note**: The scripts are written to be used on GPU. If you want to use a CPU instead, simply remove all `.to("cuda")` occurrences in `eval.py`.
## Installation
In a first step, one should install KenLM. For Ubuntu, it should be enough to follow the installation steps described here. The installed `kenlm` folder should be moved into this repo for `./create_ngram.py` to function correctly. Alternatively, one can also symlink the `lmplz` binary onto the `PATH` so that `lmplz` can be run directly instead of `./kenlm/build/bin/lmplz`.
Next, some Python dependencies should be installed. Assuming PyTorch is already installed, it should be sufficient to run `pip install -r requirements.txt`.
## Run evaluation
### Create ngram
In a first step one should create an n-gram. E.g. for `polish` the command would be:

```bash
./create_ngram.py --language polish --path_to_ngram polish.arpa
```
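Under the hood, building the n-gram essentially amounts to streaming a text corpus into KenLM's `lmplz` binary. Below is a minimal sketch of this step (the file `corpus.txt` is a hypothetical plain-text corpus with one transcription per line; `./create_ngram.py` assembles its corpus from a dataset instead):

```python
# Sketch of n-gram creation with KenLM; `corpus.txt` is a hypothetical
# plain-text corpus with one transcription per line.
import subprocess

with open("corpus.txt", "rb") as corpus:
    # Build a 5-gram ARPA language model using the same binary path
    # that ./create_ngram.py relies on.
    subprocess.run(
        ["./kenlm/build/bin/lmplz", "-o", "5", "--arpa", "polish.arpa"],
        stdin=corpus,
        check=True,
    )
```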
After the language model is created, one should open the file and manually add a `</s>` token, as explained below. The file should have a structure which looks more or less as follows:
```
\data\
ngram 1=86586
ngram 2=546387
ngram 3=796581
ngram 4=843999
ngram 5=850874

\1-grams:
-5.7532206	<unk>	0
0	<s>	-0.06677356
-3.4645514	drugi	-0.2088903
...
```
Now it is very important to also add a `</s>` token to the n-gram so that it can be correctly loaded. You can simply copy the line

```
0	<s>	-0.06677356
```

and change `<s>` to `</s>`. When doing this you should also increase the `ngram 1` count by 1. The new n-gram should look as follows:
```
\data\
ngram 1=86587
ngram 2=546387
ngram 3=796581
ngram 4=843999
ngram 5=850874

\1-grams:
-5.7532206	<unk>	0
0	<s>	-0.06677356
0	</s>	-0.06677356
-3.4645514	drugi	-0.2088903
...
```
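This manual edit can also be scripted. Below is a small sketch (file names are hypothetical) that copies the `<s>` unigram as a `</s>` entry and bumps the unigram count accordingly:

```python
# Sketch: add the missing </s> token to an ARPA file by duplicating the
# <s> unigram line and incrementing the "ngram 1=" count.
section = None
with open("polish.arpa", "r") as f_in, open("polish_fixed.arpa", "w") as f_out:
    for line in f_in:
        stripped = line.strip()
        if stripped.startswith("\\") and stripped.endswith(":"):
            section = stripped  # e.g. "\1-grams:"
        if stripped.startswith("ngram 1="):
            # one more unigram entry: the new </s>
            count = int(stripped.split("=")[1])
            line = f"ngram 1={count + 1}\n"
        f_out.write(line)
        # duplicate the <s> unigram as </s> right after it
        if section == "\\1-grams:" and "<s>" in line.split():
            f_out.write(line.replace("<s>", "</s>"))
```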
Now the n-gram can be correctly used with `pyctcdecode`.
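For illustration, loading the n-gram with `pyctcdecode` looks roughly as follows. This is a sketch: the checkpoint name is an assumption, and the vocabulary must be ordered by token id:

```python
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2Processor

# Checkpoint name is an assumption; use the fine-tuned model you evaluate.
processor = Wav2Vec2Processor.from_pretrained(
    "jonatasgrosman/wav2vec2-large-xlsr-53-polish"
)

# pyctcdecode expects the labels sorted by token id.
vocab_dict = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab_dict.items(), key=lambda item: item[1])]

decoder = build_ctcdecoder(labels, kenlm_model_path="polish.arpa")
```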
### Run eval
Having created the n-gram, one can run:

```bash
./eval.py --language polish --path_to_ngram polish.arpa
```

to compare Wav2Vec2 + LM vs. Wav2Vec2 without LM on Polish.
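Roughly speaking, the comparison inside `eval.py` amounts to decoding the same logits twice, once greedily and once with the beam-search decoder. A sketch (model/dataset loading omitted; `model`, `processor`, `decoder`, and the 16 kHz waveform `audio` are assumptions carried over from the sketches above):

```python
import torch

# Featurize the raw waveform (a 16 kHz float array, hypothetical here).
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

# The model is assumed to already live on GPU (see the note above).
with torch.no_grad():
    logits = model(inputs.input_values.to("cuda")).logits

# Wav2Vec2 without LM: greedy (argmax) CTC decoding.
pred_ids = torch.argmax(logits, dim=-1)
text_no_lm = processor.batch_decode(pred_ids)[0]

# Wav2Vec2 with LM: beam search over the log-probabilities with
# pyctcdecode + KenLM.
log_probs = torch.log_softmax(logits, dim=-1)[0].cpu().numpy()
text_with_lm = decoder.decode(log_probs)
```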
## Results
Without tuning any hyperparameters, the following results were obtained:

Comparison of Wav2Vec2 without language model vs. Wav2Vec2 with `pyctcdecode` + KenLM 5-gram. Fine-tuned Wav2Vec2 models were used and evaluated on MLS datasets. Take a closer look at `./eval.py` for the details of the comparison.
```
==================================================polish==================================================
polish - No LM - | WER: 0.3069742867206763 | CER: 0.06054530156286364 | Time: 58.04590034484863
polish - With LM - | WER: 0.2291299753434308 | CER: 0.06211174564528545 | Time: 191.65409898757935

==================================================portuguese==================================================
portuguese - No LM - | WER: 0.18208286674132138 | CER: 0.05016682956422096 | Time: 114.61633825302124
portuguese - With LM - | WER: 0.1487761958086706 | CER: 0.04489231909945738 | Time: 429.78511357307434

==================================================spanish==================================================
spanish - No LM - | WER: 0.2581272104769545 | CER: 0.0703088156033147 | Time: 147.8634352684021
spanish - With LM - | WER: 0.14927852292116295 | CER: 0.052034208044195916 | Time: 563.0732748508453
```
It can be seen that the word error rate (WER) improves significantly when using `pyctcdecode` + KenLM, while the character error rate (CER) improves much less, or not at all (for Polish it even degrades slightly). This is expected: a language model ensures that predicted words actually exist in the language's vocabulary. Wav2Vec2 without an LM produces many words that are more or less correct but contain a couple of spelling errors; each such word counts fully against the WER. Those words are likely to be "corrected" by Wav2Vec2 + LM, leading to an improved WER. However, Wav2Vec2 alone already has a good character error rate, as its vocabulary is composed of characters, so a "word-based" language model doesn't help much there.
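To see why, consider a single misspelled word: it counts as one full word error but only one character error. A tiny illustration using the `jiwer` library (a hypothetical example; `jiwer` is not necessarily used by this repo):

```python
from jiwer import cer, wer

reference = "drugi przyklad"
hypothesis = "drugi przyclad"  # one spelling mistake in the second word

print(wer(reference, hypothesis))  # 0.5  -> one of two words is wrong
print(cer(reference, hypothesis))  # ~0.07 -> one of 14 characters is wrong
```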
Overall, WER is probably the more important metric though, so it might make a lot of sense to add an LM to Wav2Vec2.
In terms of speed, adding an LM significantly slows down inference. However, the script is not at all optimized for speed, so using multi-processing and batched inference would significantly speed up both Wav2Vec2 with and without an LM.
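For example, `pyctcdecode` provides a batched decoding API that works with a multiprocessing pool. A minimal sketch (assuming `decoder` from above and a hypothetical list `all_logits` of per-utterance log-probability arrays):

```python
import multiprocessing

# Decode many utterances in parallel across worker processes.
# `decoder` and `all_logits` are assumptions carried over from above.
with multiprocessing.get_context("fork").Pool() as pool:
    transcripts = decoder.decode_batch(pool, all_logits)
```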