📝An easy-to-use package to restore punctuation of the text.

Overview

✏️ rpunct - Restore Punctuation

forthebadge

This repo contains code for Punctuation restoration.

This package is intended for direct use as a punctuation restoration model for the general English language. Alternatively, you can use this for further fine-tuning on domain-specific texts for punctuation restoration tasks. It uses HuggingFace's bert-base-uncased model weights that have been fine-tuned for Punctuation restoration.

Punctuation restoration works on arbitrarily large text. And uses GPU if it's available otherwise will default to CPU.

List of punctuations we restore:

  • Upper-casing
  • Period: .
  • Exclamation: !
  • Question Mark: ?
  • Comma: ,
  • Colon: :
  • Semi-colon: ;
  • Apostrophe: '
  • Dash: -

🚀 Usage

Below is a quick way to get up and running with the model.

  1. First, install the package.
pip install rpunct
  1. Sample python code.
from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Outputs the following:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B. 
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.

🎯 Accuracy

Here is the number of product reviews we used for finetuning the model:

Language Number of text samples
English 560,000

We found the best convergence around 3 epochs, which is what presented here and available via a download.


The fine-tuned model obtained the following accuracy on 45,990 held-out text samples:

Accuracy Overall F1 Eval Support
91% 90% 45,990

💻 🎯 Further Fine-Tuning

To start fine-tuning or training please look into training/train.py file. Running python training/train.py will replicate the results of this model.


Contact

Contact Daulet Nurmanbetov for questions, feedback and/or requests for similar models.


Comments
  • Update requirements.txt

    Update requirements.txt

    ERROR: Could not find a version that satisfies the requirement torch==1.8.1 (from rpunct) (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0) ERROR: No matching distribution found for torch==1.8.1

    opened by Rukaya-lab 0
  • Forked repo with fixes

    Forked repo with fixes

    I forked this repository (link here) to fix the outdated dependencies and incompatibility with non-CUDA machines. If anyone needs these fixes, feel free to install from the fork:

    pip install git+https://github.com/samwaterbury/rpunct.git
    

    Hopefully this repository is updated or another maintainer is assigned. And thanks to the creator @Felflare, this is a useful tool!

    opened by samwaterbury 2
  • Requirements shouldn't ask for such specific versions

    Requirements shouldn't ask for such specific versions

    First, thanks a lot for providing this package :)

    Currently, the requirements.txt, and thus the dependencies in the setup.py are for very specific versions of Pytorch etc. This shouldn't be the case if you want this package to be used as a general library (think of a second package that would do the same but ask for an incompatible version of PyTorch and would prevent any possible installation of the two together). The end user might also be needing a more recent version of PyTorch. Given that PyTorch is almost always backward compatible, and quite stable, I think the requirements for it could be changed from ==1.8.1 to >=1.8.1. I believe the same would be true for the other packages.

    opened by adefossez 2
  • Added ability to pass additional parameters to simpletransformer ner in RestorePuncts class.

    Added ability to pass additional parameters to simpletransformer ner in RestorePuncts class.

    Thanks for the great library! When running this without a GPU I had problems. I think there is a simple fix. The simple transformer NER model defaults to enabling cuda. This PR allows the user to pass a dictionary of arguments specifically for the simpletransformers NER model. So you can now run the code on a CPU by initializing rpunct like so

    rpunct = RestorePuncts(ner_args={"use_cuda": False})
    

    Before this change, when running rpunct examples on the CPU the following error occurs:

    from rpunct import RestorePuncts
    # The default language is 'english'
    rpunct = RestorePuncts()
    rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
    by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
    a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
    professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
    3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
    
    

    ValueError Traceback (most recent call last) /var/folders/hx/dhzhl_x51118fm5cd13vzh2h0000gn/T/ipykernel_10548/194907560.py in 1 from rpunct import RestorePuncts 2 # The default language is 'english' ----> 3 rpunct = RestorePuncts() 4 rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record 5 by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were

    ~/repos/rpunct/rpunct/punctuate.py in init(self, wrds_per_pred, ner_args) 19 if ner_args is None: 20 ner_args = {} ---> 21 self.model = NERModel("bert", "felflare/bert-restore-punctuation", labels=self.valid_labels, 22 args={"silent": True, "max_seq_length": 512}, **ner_args) 23

    ~/repos/transformers/transformer-env/lib/python3.8/site-packages/simpletransformers/ner/ner_model.py in init(self, model_type, model_name, labels, args, use_cuda, cuda_device, onnx_execution_provider, **kwargs) 209 self.device = torch.device(f"cuda:{cuda_device}") 210 else: --> 211 raise ValueError( 212 "'use_cuda' set to True when cuda is unavailable." 213 "Make sure CUDA is available or set use_cuda=False."

    ValueError: 'use_cuda' set to True when cuda is unavailable.Make sure CUDA is available or set use_cuda=False.

    opened by nbertagnolli 1
  • add use_cuda parameter

    add use_cuda parameter

    using the package in an environment without cuda support causes it to fail. Adding the parameter to shut it off if necessary allows it to function normall.

    opened by mjfox3 1
Releases(1.0.1)
Owner
Daulet Nurmanbetov
Deep Learning, AI and Finance
Daulet Nurmanbetov
🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

?? The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Hugging Face 15k Jan 2, 2023
Easy-to-use CPM for Chinese text generation

CPM 项目描述 CPM(Chinese Pretrained Models)模型是北京智源人工智能研究院和清华大学发布的中文大规模预训练模型。官方发布了三种规模的模型,参数量分别为109M、334M、2.6B,用户需申请与通过审核,方可下载。 由于原项目需要考虑大模型的训练和使用,需要安装较为复杂

null 382 Jan 7, 2023
An easy-to-use Python module that helps you to extract the BERT embeddings for a large text dataset (Bengali/English) efficiently.

An easy-to-use Python module that helps you to extract the BERT embeddings for a large text dataset (Bengali/English) efficiently.

Khalid Saifullah 37 Sep 5, 2022
An easy to use, user-friendly and efficient code for extracting OpenAI CLIP (Global/Grid) features from image and text respectively.

Extracting OpenAI CLIP (Global/Grid) Features from Image and Text This repo aims at providing an easy to use and efficient code for extracting image &

Jianjie(JJ) Luo 13 Jan 6, 2023
Use PaddlePaddle to reproduce the paper:mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

MT5_paddle Use PaddlePaddle to reproduce the paper:mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer English | 简体中文 mT5: A Massively

null 2 Oct 17, 2021
Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

null 186 Dec 24, 2022
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.6k Dec 27, 2022
Easy to use, state-of-the-art Neural Machine Translation for 100+ languages

EasyNMT - Easy to use, state-of-the-art Neural Machine Translation This package provides easy to use, state-of-the-art machine translation for more th

Ubiquitous Knowledge Processing Lab 748 Jan 6, 2023
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.5k Feb 11, 2021
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.5k Feb 17, 2021
Graph4nlp is the library for the easy use of Graph Neural Networks for NLP

Graph4NLP Graph4NLP is an easy-to-use library for R&D at the intersection of Deep Learning on Graphs and Natural Language Processing (i.e., DLG4NLP).

Graph4AI 1.5k Dec 23, 2022
Easy to start. Use deep nerual network to predict the sentiment of movie review.

Easy to start. Use deep nerual network to predict the sentiment of movie review. Various methods, word2vec, tf-idf and df to generate text vectors. Various models including lstm and cov1d. Achieve f1 score 92.

null 1 Nov 19, 2021
Use the state-of-the-art m2m100 to translate large data on CPU/GPU/TPU. Super Easy!

Easy-Translate is a script for translating large text files in your machine using the M2M100 models from Facebook/Meta AI. We also privide a script fo

Iker García-Ferrero 41 Dec 15, 2022
An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations

FantasyBert English | 中文 Introduction An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations. You can imp

Fan 137 Oct 26, 2022
A Python package implementing a new model for text classification with visualization tools for Explainable AI :octocat:

A Python package implementing a new model for text classification with visualization tools for Explainable AI ?? Online live demos: http://tworld.io/s

Sergio Burdisso 285 Jan 2, 2023
Python package for performing Entity and Text Matching using Deep Learning.

DeepMatcher DeepMatcher is a Python package for performing entity and text matching using deep learning. It provides built-in neural networks and util

null 461 Dec 28, 2022
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

gpt-2-simple A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifical

Max Woolf 3.1k Jan 7, 2023