A Python package for deep multilingual punctuation prediction.

Overview

Deep Multilingual Punctuation Prediction

This Python library predicts the punctuation of English, Italian, French, and German texts. We developed it to restore the punctuation of transcribed spoken language.

The package uses our "FullStop" model, which we trained on the Europarl dataset. Please note that this dataset consists of political speeches, so the model might perform differently on texts from other domains.

The code restores the following punctuation markers: "." "," "?" "-" ":"

Install

To get started, install the package from PyPI:

pip install deepmultilingualpunctuation

Usage

The PunctuationModel class can process texts of any length. Note that processing very long texts can be time-consuming.

Restore Punctuation

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)

output

My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

Predict Labels

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)

output

[['My', '0', 0.9999887], ['name', '0', 0.99998665], ['is', '0', 0.9998579], ['Clara', '0', 0.6752215], ['and', '0', 0.99990904], ['I', '0', 0.9999877], ['live', '0', 0.9999839], ['in', '0', 0.9999515], ['Berkeley', ',', 0.99800044], ['California', '.', 0.99534047], ['Ist', '0', 0.99998784], ['das', '0', 0.99999154], ['eine', '0', 0.9999918], ['Frage', ',', 0.99622655], ['Frau', '0', 0.9999889], ['Müller', '?', 0.99863917]]
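
Each triple above is (word, predicted label, confidence), where the label "0" means that no punctuation follows the word. As a minimal sketch (a hypothetical helper, not part of the package's API), the labels can be turned back into punctuated text like this:

def labels_to_text(labeled_words):
    # "0" means no punctuation; any other label is appended to the word
    parts = [word if label == "0" else word + label
             for word, label, _score in labeled_words]
    return " ".join(parts)

print(labels_to_text(labeled_words))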

Results

The performance differs across the individual punctuation markers, as hyphens and colons are, in many cases, optional and can be substituted by either a comma or a full stop. The model achieves the following F1 scores for the different languages:

Label          EN     DE     FR     IT
0              0.991  0.997  0.992  0.989
.              0.948  0.961  0.945  0.942
?              0.890  0.893  0.871  0.832
,              0.819  0.945  0.831  0.798
:              0.575  0.652  0.620  0.588
-              0.425  0.435  0.431  0.421
macro average  0.775  0.814  0.782  0.762

References

Please cite us if you found this useful:

@inproceedings{guhr-EtAl:2021:fullstop,
  title     = {FullStop: Multilingual Deep Models for Punctuation Prediction},
  author    = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
  booktitle = {Proceedings of the Swiss Text Analytics Conference 2021},
  month     = {June},
  year      = {2021},
  address   = {Winterthur, Switzerland},
  publisher = {CEUR Workshop Proceedings},
  url       = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}
Comments
  • More languages

    A fantastic model! Thank you for sharing it on Hugging Face. Is it possible to also upload a model based on xlm-roberta-base trained on all the languages in the Europarl dataset? I have noticed it generalizes well to other languages for some of the more difficult cases, such as commas at the end of subordinate clauses, but it misses some cases where the comma is always or almost always present in some languages. For example, in Bulgarian: "Отивам на село, защото искам да си почина." (roughly: "I'm going to the countryside because I want to rest."); "Той каза, че ще дойде по-късно" ("He said that he will come later"); "Мислех си кой може да свърши работата с моя помощ и се сетих, че имам един приятел, който е съдружник във фирма, която прави подобни неща." ("I was thinking about who could do the job with my help and remembered that I have a friend who is a partner in a company that does similar things."). Is it going to significantly worsen the performance for English?

    opened by orlink 8
  • how do you use tensorflow gpu with this ?

    Is it possible to utilise a GPU when processing text with this script? I have it working fine, but it uses my CPU, and I have very large chunks of text I want to process; it takes too long with just the CPU.
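
    One possible approach (a sketch that drives the underlying Hugging Face model directly, bypassing the package; note the model is a PyTorch model, so a CUDA-enabled PyTorch install is what matters here, not TensorFlow):

    from transformers import pipeline

    # Sketch: run the FullStop model on a CUDA device via the transformers pipeline.
    pipe = pipeline(
        "token-classification",
        model="oliverguhr/fullstop-punctuation-multilang-large",
        device=0,  # GPU index; use -1 to stay on the CPU
    )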

    opened by skintflickz 3
  • predict() labels each individual character instead of whole word

    Code:

    model = PunctuationModel()
    labeled_words = model.predict(data)

    Expected output: (something like this) [['My', '0', 0.9999887], ['name', '0', 0.99998665], ['is', '0', 0.9998579], ['Clara', '0', 0.6752215], ['and', '0', 0.99990904], ['I', '0', 0.9999877], ['live', '0', 0.9999839], ['in', '0', 0.9999515], ['Berkeley', ',', 0.99800044], ['California', '.', 0.99534047], ['Ist', '0', 0.99998784], ['das', '0', 0.99999154], ['eine', '0', 0.9999918], ['Frage', ',', 0.99622655], ['Frau', '0', 0.9999889], ['Müller', '?', 0.99863917]]

    Actual Output: (like this) [['S', '0', 0.85909706], ['p', '0', 0.79623544], ['i', '0', 0.97123384], ['d', '0', 0.9573735], ['e', '0', 0.7520275], ['r', '0', 0.83077455], ['-', '0', 0.912434], ['M', '0', 0.79773486], ['a', '0', 0.8560487], ['n', '0', 0.80509317], [' ', 0, 0.80509317], ['i', '0', 0.9247775], ['s', '0', 0.93199074], [' ', 0, 0.93199074], ['a', '0', 0.93133676]]
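
    A likely cause, judging from the documented usage: predict() expects the preprocessed list of words, and iterating over a plain string yields single characters. Running the text through preprocess() first avoids this:

    clean_text = model.preprocess(data)   # assumed to split the raw string into words
    labeled_words = model.predict(clean_text)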

    opened by ghost 2
  • Error: “Chunk size too large, text got clipped”

    Hello : )

    First of all, thank you for your work! We are currently using it for a group project at university and it’s super helpful.

    There is just one small issue: When we are using the restore_punctuation function on a large enumeration that is missing commas, we get a “Chunk size too large, text got clipped” error as shown below:

    [screenshot: the "Chunk size too large, text got clipped" error message]

    We can't seem to find a solution. Do you have any idea how to fix this? Thanks in advance!

    Phyllis

    In case you want to replicate the error, this is the text snippet we are using: "This is a list: Diindolylmethane (DIM) Dimethylaminoethanol (DMAE) Dong Quai Echinacea Eclipta Alba Egg Shell Calcium Elderberry Eleutherosides Emblica Essential Fatty Acids Energy EDTA Eurycoma Longifolia Evodia Extract Eye Health Ferulic Acid Fish Oils GABA Gamma Tocopherol Garlic Ginger Extract Ginkgo Biloba Ginseng Glucosamine Glutathione Goji Berry Gotu Kola GPC Choline Grape Seed Extract Green Coffee Bean Green Tea Gymnema Sylvestre Health Commentaries Health Questionnaires Heart Attacks Heart Health Herbs Hormone Support Huperzine Immune System Support Inflammation Joint Pain Kohki Tea L-Arginine L-Citrulline L-Cysteine L-Theanine L-Tyrosine Licorice Extract Life Extension Lipoic Acid Liver Support Longevity and Anti-Aging Lutein Luteolin Lycium Berry Lycopene Lysine Magnesium Magnolia Malic Acid Mastic Gum Medium-Chain Triglycerides Melatonin Memory Menopause Menaquinone-7 Men's Health Mental Health Metabolic Syndrome Minerals Mixed Tocopherols Multi-Vitamins Myricetin N-Acetyl Cysteine (NAC) N-Acetyl-carnosine Naringenin Nattokinase Nettle Notoginseng Nutrition Olive Leaf Oolong Tea Omega-3 Fatty Acids OKG Ornithine Oral Chelation Oral Health - Teeth and Gums Osteoporosis Overall Health Pain Relief Parkinson's Disease Passionflower Peony Persimmon Phenylalanine Pheromones Phosphatidylserine Phytosterol Pine Bark Policosanol Pomegranate Pregnenolone Probiotics Prostate Health Pueraria Mirifica Pumpkin Pygeum Africanum R-Lipoic Acid Red Wine Resveratrol Rhodiola Rosea Rosemary Salacia Reticulata Salvia Miltiorrhiza Schisandra Berry Selenium Sexual Health Silymarin Skin Hair and Nails Sleep Soy Phytosterol St John's Wort Stinging Nettle Stress Strontium Citrate Suntheanine Taurine Terminalia Chebula The Common Cold Turmeric Root Tyrosine Uridine Urinary Tract Health Urtica Dioica Root Varicose Veins Vinpocetine Vision Health"
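
    A possible workaround (a sketch with an assumed chunk size, not an official fix): split the long passage into smaller word chunks and restore each one separately. Note that this sacrifices context at the chunk boundaries:

    words = text.split()
    chunk_size = 100  # assumed safe size in words; tune for your input
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    restored = " ".join(model.restore_punctuation(chunk) for chunk in chunks)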

    opened by phyllisgraf 2
  • Bug: Comma with whitespaces leads to error

    Great lib, found this small bug:

    text = "das , ist fies "
    

    leads to error

    TypeError: unsupported operand type(s) for +: 'int' and 'str'

    when converting the results to text.
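
    Until this is fixed, a possible workaround (a sketch with a hypothetical helper, not part of the package) is to strip the stray whitespace before punctuation prior to calling restore_punctuation():

    import re

    def normalize_spacing(text):
        # remove whitespace that directly precedes a punctuation mark
        return re.sub(r"\s+([,.:;?!-])", r"\1", text).strip()

    model.restore_punctuation(normalize_spacing("das , ist fies "))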

    bug 
    opened by olafthiele 2
  • User Warning , grouped_entities=False

    I followed your tutorial and it works great! However, I get the following warning from the Hugging Face pipeline:

    /usr/local/lib/python3.8/dist-packages/transformers/pipelines/token_classification.py:159: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="none"` instead.
      warnings.warn(
    

    Here is the relevant documentation: Hugging Face pipeline

    The lines to change are the two pipeline() calls inside PunctuationModel's __init__ method.
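
    Something like the following should silence it (a sketch; the exact arguments used in the package may differ):

    from transformers import pipeline

    # old: pipeline("ner", model=..., grouped_entities=False)
    # new: pass the replacement argument named in the warning instead
    pipe = pipeline("ner",
                    model="oliverguhr/fullstop-punctuation-multilang-large",
                    aggregation_strategy="none")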

    Would you consider updating those lines in order to get rid of the warning? Thank you in advance!

    opened by leinaxd 0
  • Full stops after numbers unnoticed, extra ones predicted

    Hi and thanks a lot for the great tool!

    It seems that the punctuation-removal step intentionally keeps punctuation inside numbers, perhaps because of decimal points or ordinal-number notation in some languages.

    This, however, results in extra punctuation being predicted when a number is at the end of a sentence: 'The Answer to the Ultimate Question of Life, the Universe, and Everything is 42.' becomes 'The Answer to the Ultimate Question of Life, the Universe, and Everything is 42..'

    Not sure what an elegant solution to this would be. The punctuation-stripping regex can't tell ordinal marks apart from sentence-final full stops. It would be nice to trust the LM to predict all the punctuation, i.e., remove all of it in the pre-processing step.
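
    In the meantime, a post-processing sketch (a hypothetical helper, not something the package provides) that collapses the doubled full stop after a number:

    import re

    def collapse_double_period(text):
        # "... is 42.." -> "... is 42."
        return re.sub(r"(?<=\d)\.\.(?=\s|$)", ".", text)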

    opened by alexdiment 0
Owner
Oliver Guhr (AI, Robotics, Research)