Recasing and punctuation model based on BERT
Benoit Favre 2021
This system converts a sequence of lowercase tokens without punctuation to a sequence of cased tokens with punctuation.
It is trained to predict both aspects at the token level in a multitask fashion, from fine-tuned BERT representations (a minimal sketch of this setup follows the label lists below).
The model predicts the following recasing labels:
- lower: keep lowercase
- upper: convert to upper case
- capitalize: set the first letter to upper case
- other: leave as is
And the following punctuation labels:
- o: no punctuation
- period: .
- comma: ,
- question: ?
- exclamation: !
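As a rough illustration of the multitask setup, here is a minimal sketch of a shared encoder with one token-level classification head per task. The class name, the encoder checkpoint, the hidden size and the loss helper are illustrative assumptions, not the actual code in recasepunc.py:

import torch.nn as nn
from transformers import AutoModel

# Label inventories described above.
CASE_LABELS = ["lower", "upper", "capitalize", "other"]
PUNC_LABELS = ["o", "period", "comma", "question", "exclamation"]

class RecasePuncHeads(nn.Module):
    """Sketch: shared encoder with one token-level classifier per task."""

    def __init__(self, encoder_name="flaubert/flaubert_base_uncased", hidden_size=768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # hidden_size must match the output dimension of the chosen encoder.
        self.case_head = nn.Linear(hidden_size, len(CASE_LABELS))
        self.punc_head = nn.Linear(hidden_size, len(PUNC_LABELS))

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.case_head(hidden), self.punc_head(hidden)

def multitask_loss(case_logits, punc_logits, case_labels, punc_labels, ignore_index=-100):
    """Sum of the two token-level cross-entropies; padded positions are ignored."""
    ce = nn.CrossEntropyLoss(ignore_index=ignore_index)
    loss_case = ce(case_logits.reshape(-1, len(CASE_LABELS)), case_labels.reshape(-1))
    loss_punc = ce(punc_logits.reshape(-1, len(PUNC_LABELS)), punc_labels.reshape(-1))
    return loss_case + loss_punc

Summing the two cross-entropies lets a single fine-tuned encoder serve both tasks.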
Input tokens are batched as sequences of length 256 that are processed independently without overlap.
During training, sequences shorter than 256 tokens are simulated by drawing a length uniformly at random and replacing all tokens and labels after that point with padding (a scheme called Cut-drop, sketched below).
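A minimal sketch of what Cut-drop could look like on one training sequence; the function name, padding values and calling convention are assumptions, not the actual implementation in recasepunc.py:

import random

def cut_drop(tokens, case_labels, punc_labels, max_len=256,
             pad_token=0, pad_label=-100, prob=0.1):
    """With probability prob, truncate a full-length sequence at a random point
    and fill the rest with padding, so the model also sees short inputs."""
    if random.random() < prob:
        cut = random.randint(1, max_len)  # length drawn uniformly at random
        tokens = tokens[:cut] + [pad_token] * (max_len - cut)
        case_labels = case_labels[:cut] + [pad_label] * (max_len - cut)
        punc_labels = punc_labels[:cut] + [pad_label] * (max_len - cut)
    return tokens, case_labels, punc_labels

The default prob=0.1 mirrors the Cut-drop probability reported for the released model below.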
Changelog:
- Fix generation when input is smaller than max length
Installation
Use your favourite method for installing Python requirements. For example:
python -m venv env
. env/bin/activate
pip3 install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html
Prediction
Predict from raw text:
python recasepunc.py predict checkpoint/path.iteration < input.txt > output.txt
Models
- French: fr-txt.large.19000 trained on 160M tokens from Common Crawl
  - Iterations: 19000
  - Batch size: 16
  - Max length: 256
  - Seed: 871253
  - Cut-drop probability: 0.1
  - Train loss: 0.021128975618630648
  - Valid loss: 0.015684964135289192
  - Recasing accuracy: 96.73
  - Punctuation accuracy: 95.02
  - All punctuation F-score: 67.79
  - Comma F-score: 67.94
  - Period F-score: 72.91
  - Question F-score: 57.57
  - Exclamation mark F-score: 15.78
  - Training data: First 100M words from Common Crawl
Training
Note: adjust file names accordingly. Training tensors are precomputed and loaded into CPU memory.
Stage 0: download text data
Stage 1: tokenize and normalize text with the Moses tokenizer, and extract recasing and punctuation labels
python recasepunc.py preprocess < input.txt > input.case+punc
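For illustration, here is a simplified sketch of the label extraction on an already-tokenized sentence. The real preprocessing also runs Moses tokenization and normalization; this helper is an assumption, not the code in recasepunc.py:

PUNC_MAP = {".": "period", ",": "comma", "?": "question", "!": "exclamation"}

def case_label(word):
    if word.islower():
        return "lower"
    if word.isupper():
        return "upper"
    if word[:1].isupper() and word[1:].islower():
        return "capitalize"
    return "other"

def extract_labels(words):
    """Turn cased, punctuated words into (lowercased token, case label, punc label)."""
    examples = []
    for i, word in enumerate(words):
        if word in PUNC_MAP:
            continue  # punctuation becomes a label on the preceding token
        punc = "o"
        if i + 1 < len(words) and words[i + 1] in PUNC_MAP:
            punc = PUNC_MAP[words[i + 1]]
        examples.append((word.lower(), case_label(word), punc))
    return examples

# extract_labels(["Bonjour", ",", "comment", "allez", "vous", "?"])
# -> [("bonjour", "capitalize", "comma"), ("comment", "lower", "o"),
#     ("allez", "lower", "o"), ("vous", "lower", "question")]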
Stage 2: sub-tokenize with the Flaubert tokenizer, and generate PyTorch tensors
python recasepunc.py tensorize input.case+punc input.case+punc.x input.case+punc.y
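As a rough sketch of the sub-tokenization step: the tokenizer checkpoint and the convention of keeping labels on the first sub-token only are assumptions; recasepunc.py handles the actual alignment and the packing into fixed-length tensors itself:

import torch
from transformers import AutoTokenizer

# Assumed checkpoint; the released French model is based on Flaubert.
tokenizer = AutoTokenizer.from_pretrained("flaubert/flaubert_base_uncased")

def tensorize(tokens, case_ids, punc_ids, pad_label=-100):
    """Sub-tokenize each word; keep its labels on the first sub-token only."""
    input_ids, y_case, y_punc = [], [], []
    for token, c, p in zip(tokens, case_ids, punc_ids):
        pieces = tokenizer.encode(token, add_special_tokens=False)
        input_ids.extend(pieces)
        y_case.extend([c] + [pad_label] * (len(pieces) - 1))
        y_punc.extend([p] + [pad_label] * (len(pieces) - 1))
    return torch.tensor(input_ids), torch.tensor(y_case), torch.tensor(y_punc)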
Stage 3: train the model
python recasepunc.py train train.x train.y valid.x valid.y checkpoint/path
Stage 4: evaluate performance on a test set
python recasepunc.py eval checkpoint/path.iteration test.x test.y