A Python package for deep multilingual punctuation prediction.

Oliver Guhr

Last update: Dec 22, 2022


Overview

Deep Multilingual Punctuation Prediction

This Python library predicts the punctuation of English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.

It uses our "FullStop" model, which we trained on the Europarl dataset. Please note that this dataset consists of political speeches, so the model might perform differently on texts from other domains.

The code restores the following punctuation markers: "." "," "?" "-" ":"

Install

To get started, install the package from PyPI:

pip install deepmultilingualpunctuation

Usage

The PunctuationModel class can process texts of any length. Note that processing very long texts can be time-consuming.
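The text above does not say how long inputs are handled internally. One common approach, shown here purely as an assumption and not as the package's implementation, is to split the word list into overlapping chunks so that each chunk fits the model's input limit. The helper name `overlapping_chunks` and the default values for `chunk_size` and `overlap` are hypothetical.

```python
def overlapping_chunks(words, chunk_size=200, overlap=5):
    """Split a list of words into chunks that share `overlap` words.

    Hypothetical helper: the sizes are illustrative, not the values
    used by the package.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


demo = overlapping_chunks(list(range(10)), chunk_size=4, overlap=1)
print(demo)
```

The overlap gives the model some context on both sides of a chunk boundary, at the cost of processing a few words twice.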

Restore Punctuation

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)

output

My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

Predict Labels

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)

output

[['My', '0', 0.9999887], ['name', '0', 0.99998665], ['is', '0', 0.9998579], ['Clara', '0', 0.6752215], ['and', '0', 0.99990904], ['I', '0', 0.9999877], ['live', '0', 0.9999839], ['in', '0', 0.9999515], ['Berkeley', ',', 0.99800044], ['California', '.', 0.99534047], ['Ist', '0', 0.99998784], ['das', '0', 0.99999154], ['eine', '0', 0.9999918], ['Frage', ',', 0.99622655], ['Frau', '0', 0.9999889], ['Müller', '?', 0.99863917]]
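Each entry in the output above is a `[word, label, score]` triple, where the label `'0'` means no punctuation follows the word. As a minimal sketch of how such triples could be joined back into text, the hypothetical helper `labels_to_text` below (not part of the package) appends every non-`'0'` label to its word. How hyphens and colons should be spaced is an assumption here.

```python
def labels_to_text(labeled_words):
    """Join [word, label, score] triples into punctuated text."""
    parts = []
    for word, label, _score in labeled_words:
        # '0' marks "no punctuation after this word".
        suffix = "" if str(label) == "0" else str(label)
        parts.append(word + suffix)
    return " ".join(parts)


sample = [
    ["My", "0", 0.9999887], ["name", "0", 0.99998665],
    ["is", "0", 0.9998579], ["Clara", "0", 0.6752215],
    ["and", "0", 0.99990904], ["I", "0", 0.9999877],
    ["live", "0", 0.9999839], ["in", "0", 0.9999515],
    ["Berkeley", ",", 0.99800044], ["California", ".", 0.99534047],
    ["Ist", "0", 0.99998784], ["das", "0", 0.99999154],
    ["eine", "0", 0.9999918], ["Frage", ",", 0.99622655],
    ["Frau", "0", 0.9999889], ["Müller", "?", 0.99863917],
]
print(labels_to_text(sample))
```

The third element of each triple is the score for the predicted label, so it could also be used to drop low-confidence marks before joining.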

Results

Performance differs between punctuation markers: hyphens and colons are often optional and can be replaced by a comma or a full stop, which makes them harder to predict. The model achieves the following F1 scores for the different languages:

Label	EN	DE	FR	IT
0	0.991	0.997	0.992	0.989
.	0.948	0.961	0.945	0.942
?	0.890	0.893	0.871	0.832
,	0.819	0.945	0.831	0.798
:	0.575	0.652	0.620	0.588
-	0.425	0.435	0.431	0.421
macro average	0.775	0.814	0.782	0.762
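The macro average row is consistent with an unweighted mean of the six per-label F1 scores in each column. The short check below recomputes it from the table values; the variable names are illustrative.

```python
# Per-label F1 scores from the table, in the order: 0 . ? , : -
f1_scores = {
    "EN": [0.991, 0.948, 0.890, 0.819, 0.575, 0.425],
    "DE": [0.997, 0.961, 0.893, 0.945, 0.652, 0.435],
    "FR": [0.992, 0.945, 0.871, 0.831, 0.620, 0.431],
    "IT": [0.989, 0.942, 0.832, 0.798, 0.588, 0.421],
}


def macro_average(scores):
    """Unweighted mean: every label counts equally, however rare it is."""
    return sum(scores) / len(scores)


for language, scores in f1_scores.items():
    print(language, round(macro_average(scores), 3))
```

Because every label has the same weight, the low scores for ":" and "-" pull the macro average well below the scores for the frequent labels.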

References

Please cite us if you found this useful:

@article{guhr-EtAl:2021:fullstop,
  title={FullStop: Multilingual Deep Models for Punctuation Prediction},
  author    = {Guhr, Oliver  and  Schumann, Anne-Kathrin  and  Bahrmann, Frank  and  Böhme, Hans Joachim},
  booktitle      = {Proceedings of the Swiss Text Analytics Conference 2021},
  month          = {June},
  year           = {2021},
  address        = {Winterthur, Switzerland},
  publisher      = {CEUR Workshop Proceedings},  
  url       = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}

Comments

More languages

A fantastic model! Thank you for sharing it on Hugging Face. Is it possible to also upload a model based on xlm-roberta-base trained on all the languages in the Europarl dataset? I have noticed it generalizes well to other languages for some of the more difficult cases, such as commas at the end of subordinate clauses, but it misses some cases where the comma is always or almost always present in some languages. For example, in Bulgarian: "Отивам на село, защото искам да си почина."; "Той каза, че ще дойде по-късно"; "Мислех си кой може да свърши работата с моя помощ и се сетих, че имам един приятел, който е съдружник във фирма, която прави подобни неща." Would that significantly worsen the performance for English?

opened by orlink 8

How do you use TensorFlow GPU with this?

Is it possible to utilise a GPU when processing text with this script? I have it working fine, but it uses my CPU, and I have very large chunks of text I want to process. It takes too long with just the CPU.

opened by skintflickz 3

predict() labels each individual character instead of whole word

Code:

model = PunctuationModel()
labeled_words = model.predict(data)

Expected output: (something like this) [['My', '0', 0.9999887], ['name', '0', 0.99998665], ['is', '0', 0.9998579], ['Clara', '0', 0.6752215], ['and', '0', 0.99990904], ['I', '0', 0.9999877], ['live', '0', 0.9999839], ['in', '0', 0.9999515], ['Berkeley', ',', 0.99800044], ['California', '.', 0.99534047], ['Ist', '0', 0.99998784], ['das', '0', 0.99999154], ['eine', '0', 0.9999918], ['Frage', ',', 0.99622655], ['Frau', '0', 0.9999889], ['Müller', '?', 0.99863917]]

Actual output: (like this) [['S', '0', 0.85909706], ['p', '0', 0.79623544], ['i', '0', 0.97123384], ['d', '0', 0.9573735], ['e', '0', 0.7520275], ['r', '0', 0.83077455], ['-', '0', 0.912434], ['M', '0', 0.79773486], ['a', '0', 0.8560487], ['n', '0', 0.80509317], [' ', 0, 0.80509317], ['i', '0', 0.9247775], ['s', '0', 0.93199074], [' ', 0, 0.93199074], ['a', '0', 0.93133676]]
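The actual output is consistent with predict() receiving a plain string: iterating over a Python str yields single characters, whereas the "Predict Labels" example passes the result of preprocess(). A minimal sketch of the difference follows. The helper `as_word_list` is hypothetical and only illustrates the idea of always handing over a list of words; it is not part of the package.

```python
def as_word_list(data):
    """Return a list of words whether `data` is a string or already a list."""
    if isinstance(data, str):
        # Whitespace split; a real pipeline would also strip punctuation.
        return data.split()
    return list(data)


data = "Spider-Man is a"

# Iterating over the raw string walks through characters, spaces included.
print([item for item in data][:4])

# Splitting first yields whole words.
print(as_word_list(data))
```

In the README usage, calling model.preprocess(data) before model.predict(...) plays the role of this helper.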

opened by ghost 2

Error: “Chunk size too large, text got clipped”

Hello : )

First of all, thank you for your work! We are currently using it for a group project at university and it’s super helpful.

There is just one small issue: When we are using the restore_punctuation function on a large enumeration that is missing commas, we get a “Chunk size too large, text got clipped” error as shown below:

We can't seem to find a solution. Do you have any idea how to fix this? Thanks in advance!

Phyllis

In case you want to replicate the error, this is the text snippet we are using: "This is a list: Diindolylmethane (DIM) Dimethylaminoethanol (DMAE) Dong Quai Echinacea Eclipta Alba Egg Shell Calcium Elderberry Eleutherosides Emblica Essential Fatty Acids Energy EDTA Eurycoma Longifolia Evodia Extract Eye Health Ferulic Acid Fish Oils GABA Gamma Tocopherol Garlic Ginger Extract Ginkgo Biloba Ginseng Glucosamine Glutathione Goji Berry Gotu Kola GPC Choline Grape Seed Extract Green Coffee Bean Green Tea Gymnema Sylvestre Health Commentaries Health Questionnaires Heart Attacks Heart Health Herbs Hormone Support Huperzine Immune System Support Inflammation Joint Pain Kohki Tea L-Arginine L-Citrulline L-Cysteine L-Theanine L-Tyrosine Licorice Extract Life Extension Lipoic Acid Liver Support Longevity and Anti-Aging Lutein Luteolin Lycium Berry Lycopene Lysine Magnesium Magnolia Malic Acid Mastic Gum Medium-Chain Triglycerides Melatonin Memory Menopause Menaquinone-7 Men's Health Mental Health Metabolic Syndrome Minerals Mixed Tocopherols Multi-Vitamins Myricetin N-Acetyl Cysteine (NAC) N-Acetyl-carnosine Naringenin Nattokinase Nettle Notoginseng Nutrition Olive Leaf Oolong Tea Omega-3 Fatty Acids OKG Ornithine Oral Chelation Oral Health - Teeth and Gums Osteoporosis Overall Health Pain Relief Parkinson's Disease Passionflower Peony Persimmon Phenylalanine Pheromones Phosphatidylserine Phytosterol Pine Bark Policosanol Pomegranate Pregnenolone Probiotics Prostate Health Pueraria Mirifica Pumpkin Pygeum Africanum R-Lipoic Acid Red Wine Resveratrol Rhodiola Rosea Rosemary Salacia Reticulata Salvia Miltiorrhiza Schisandra Berry Selenium Sexual Health Silymarin Skin Hair and Nails Sleep Soy Phytosterol St John's Wort Stinging Nettle Stress Strontium Citrate Suntheanine Taurine Terminalia Chebula The Common Cold Turmeric Root Tyrosine Uridine Urinary Tract Health Urtica Dioica Root Varicose Veins Vinpocetine Vision Health"

opened by phyllisgraf 2

Bug: Comma with whitespaces leads to error

Great lib, found this small bug:

text = "das , ist fies "

leads to error

TypeError: unsupported operand type(s) for +: 'int' and 'str'

when converting the results to text.
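Until the cause is fixed in the package, one possible workaround is to remove free-standing punctuation tokens and collapse extra whitespace before calling the model. The sketch below is an assumption that has not been verified against the library, and `drop_stray_punctuation` is a hypothetical helper name.

```python
import re

# A token made up only of punctuation marks, e.g. "," or "?!".
ONLY_PUNCTUATION = re.compile(r"[.,;:!?\-]+")


def drop_stray_punctuation(text):
    """Remove tokens that consist solely of punctuation and normalise spaces."""
    words = [word for word in text.split()
             if not ONLY_PUNCTUATION.fullmatch(word)]
    return " ".join(words)


print(repr(drop_stray_punctuation("das , ist fies ")))
```

The cleaned string can then be passed to restore_punctuation() as usual. Punctuation attached to a word is left untouched by this helper.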
opened by olafthiele 2

User Warning, grouped_entities=False

I followed your tutorial and it works great! However, I get the following warning from the Hugging Face pipeline:

/usr/local/lib/python3.8/dist-packages/transformers/pipelines/token_classification.py:159: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="none"` instead. warnings.warn(

Here is the required doc: Hugging Face pipeline

The lines of code to change are both pipeline() calls in PunctuationModel (inside the __init__ method...).

Would you consider updating those lines in order to get rid of the warning? Thank you in advance!
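As the warning text itself says, `grouped_entities=False` now defaults to `aggregation_strategy="none"`. A hedged sketch of the migration follows: the helper `migrate_pipeline_kwargs` is hypothetical, and mapping `grouped_entities=True` to `"simple"` is an assumption based on how the transformers token-classification pipeline treats the old flag. The package's real pipeline() arguments are not shown in this issue, so the commented call is illustrative only.

```python
def migrate_pipeline_kwargs(kwargs):
    """Return a copy of the kwargs with the deprecated flag replaced."""
    migrated = dict(kwargs)
    if "grouped_entities" in migrated:
        grouped = migrated.pop("grouped_entities")
        # False -> "none" follows the warning text; True -> "simple" is assumed.
        migrated.setdefault("aggregation_strategy",
                            "simple" if grouped else "none")
    return migrated


old_kwargs = {"grouped_entities": False}
new_kwargs = migrate_pipeline_kwargs(old_kwargs)
print(new_kwargs)

# Illustrative use (requires transformers and a model, so left commented out):
# from transformers import pipeline
# pipe = pipeline("ner", model=model_name, **new_kwargs)
```

Passing `aggregation_strategy="none"` directly in both pipeline() calls would have the same effect as this helper.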
opened by leinaxd 0

Full stops after numbers unnoticed, extra ones predicted

Hi and thanks a lot for the great tool!

It seems that in the original punctuation removal step, punctuation in numbers is intentionally kept, perhaps because of decimal points or the way ordinal numbers are written in some languages.

This, however, results in extra punctuation being predicted when a number is at the end of a sentence: 'The Answer to the Ultimate Question of Life, the Universe, and Everything is 42.' becomes 'The Answer to the Ultimate Question of Life, the Universe, and Everything is 42..'

Not sure what would be an elegant solution to this. The punctuation-stripping regex can't tell apart ordinal marks from sentence-final full-stops. Would be nice to trust the LM to predict all the punctuation, i.e., remove all of it in the pre-processing step.
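The behaviour described in this issue can be reproduced with two illustrative regular expressions. Both are assumptions written for this sketch, not the package's actual pre-processing code. The first keeps any punctuation mark that touches a digit, so a sentence-final "42." survives. The second keeps a mark only when it sits between two digits, which protects decimals but also strips ordinal marks.

```python
import re

PUNCT = r"[.,;:!?]"


def strip_keep_near_digits(text):
    """Remove a mark only when neither neighbour is a digit."""
    return re.sub(rf"(?<!\d){PUNCT}(?!\d)", "", text)


def strip_keep_between_digits(text):
    """Remove a mark unless it has a digit on both sides."""
    return re.sub(rf"(?<!\d){PUNCT}|{PUNCT}(?!\d)", "", text)


sentence = "Everything is 42."
print(strip_keep_near_digits(sentence))
print(strip_keep_between_digits(sentence))
print(strip_keep_between_digits("pi is 3.14, roughly."))
```

With the second variant a German date such as "am 3. Mai" loses its ordinal full stop, which is the trade-off raised in this issue.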

opened by alexdiment 0

