Multilingual text (NLP) processing toolkit

Overview

polyglot

Polyglot is a natural language pipeline that supports massive multilingual applications.

Features

  • Tokenization (165 Languages)
  • Language detection (196 Languages)
  • Named Entity Recognition (40 Languages)
  • Part of Speech Tagging (16 Languages)
  • Sentiment Analysis (136 Languages)
  • Word Embeddings (137 Languages)
  • Morphological analysis (135 Languages)
  • Transliteration (69 Languages)

Developer

  • Rami Al-Rfou @ rmyeid gmail com

Quick Tutorial

import polyglot
from polyglot.text import Text, Word

Language Detection

text = Text("Bonjour, Mesdames.")
print("Language Detected: Code={}, Name={}\n".format(text.language.code, text.language.name))
Language Detected: Code=fr, Name=French

Tokenization

zen = Text("Beautiful is better than ugly. "
           "Explicit is better than implicit. "
           "Simple is better than complex.")
print(zen.words)
[u'Beautiful', u'is', u'better', u'than', u'ugly', u'.', u'Explicit', u'is', u'better', u'than', u'implicit', u'.', u'Simple', u'is', u'better', u'than', u'complex', u'.']
print(zen.sentences)
[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]

Part of Speech Tagging

text = Text(u"O primeiro uso de desobediência civil em massa ocorreu em setembro de 1906.")

print("{:<16}{}".format("Word", "POS Tag")+"\n"+"-"*30)
for word, tag in text.pos_tags:
    print(u"{:<16}{:>2}".format(word, tag))
Word            POS Tag
------------------------------
O               DET
primeiro        ADJ
uso             NOUN
de              ADP
desobediência   NOUN
civil           ADJ
em              ADP
massa           NOUN
ocorreu         ADJ
em              ADP
setembro        NOUN
de              ADP
1906            NUM
.               PUNCT

Named Entity Recognition

text = Text(u"In Großbritannien war Gandhi mit dem westlichen Lebensstil vertraut geworden")
print(text.entities)
[I-LOC([u'Gro\xdfbritannien']), I-PER([u'Gandhi'])]

Polarity

print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in zen.words[:6]:
    print("{:<16}{:>2}".format(w, w.polarity))
Word            Polarity
------------------------------
Beautiful        0
is               0
better           1
than             0
ugly            -1
.                0

Embeddings

word = Word("Obama", language="en")
print("Neighbors (Synonyms) of {}".format(word)+"\n"+"-"*30)
for w in word.neighbors:
    print("{:<16}".format(w))
print("\n\nThe first 10 dimensions out of the {} dimensions\n".format(word.vector.shape[0]))
print(word.vector[:10])
Neighbors (Synonyms) of Obama
------------------------------
Bush
Reagan
Clinton
Ahmadinejad
Nixon
Karzai
McCain
Biden
Huckabee
Lula


The first 10 dimensions out of the 256 dimensions

[-2.57382345  1.52175975  0.51070285  1.08678675 -0.74386948 -1.18616164
  2.92784619 -0.25694436 -1.40958667 -2.39675403]
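For intuition, the neighbors list above can be read as the nearest vectors in the embedding space. The following toy sketch ranks words by cosine similarity; it is not polyglot's implementation, and the three-dimensional vectors are made up (polyglot's actual vectors shown above have 256 dimensions):

```python
import numpy as np

# Made-up toy "embeddings"; real vectors are much higher-dimensional.
emb = {
    "Obama": np.array([1.0, 0.9, 0.1]),
    "Bush": np.array([0.9, 1.0, 0.2]),
    "banana": np.array([0.0, 0.1, 1.0]),
}

def neighbors(word, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    q = emb[word]
    def cos(v):
        # cosine similarity between v and the query vector
        return float(v @ q) / (np.linalg.norm(v) * np.linalg.norm(q))
    ranked = sorted((w for w in emb if w != word),
                    key=lambda w: cos(emb[w]), reverse=True)
    return ranked[:k]

print(neighbors("Obama"))  # most similar word first
```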

Morphology

word = Text("Preprocessing is an essential step.").words[0]
print(word.morphemes)
[u'Pre', u'process', u'ing']

Transliteration

from polyglot.transliteration import Transliterator
transliterator = Transliterator(source_lang="en", target_lang="ru")
print(transliterator.transliterate(u"preprocessing"))
препрокессинг

Comments
  • ImportError: No module named 'icu'

python3.4: pip install polyglot

from polyglot.text import Text, Word

---> 11 from icu import Locale
     12 import pycld2 as cld2
     13

    ImportError: No module named 'icu'

It's not listed as a module dependency, nor is it mentioned in the README.

    opened by Fiedzia 35
  • polyglot_data on windows.

Hi, I have installed polyglot on Windows with Python 3.4. After solving some library problems, I started getting this error:

downloader.download()

Polyglot Downloader

    ---------------------------------------------------------------------------

    d) Download l) List u) Update c) Config h) Help q) Quit

    ---------------------------------------------------------------------------

    Downloader> l

Collections:

Traceback (most recent call last):
  File "", line 1, in
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 649, in download
    self._interactive_download()
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 1068, in _interactive_download
    DownloaderShell(self).run()
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 1096, in run
    more_prompt=True)
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 459, in list
    for info in sorted(getattr(self, category)(), key=str):
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 495, in collections
    self._update_index()
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 832, in _update_index
    P = Package.fromcsobj(p)
  File "C:\Python34\lib\site-packages\polyglot-15.5.2-py3.4.egg\polyglot\downloader.py", line 232, in fromcsobj
    language = subdir.split(path.sep)[1]
IndexError: list index out of range

After some analysis (and a few neurons lost...) I found the problem: on Windows, "path.sep" is, as expected, "\" instead of "/". Since the packages are named (ID'ed) with "/", using path.sep makes no sense for Windows users. Or am I missing something I should have installed?

Replacing path.sep with "/" solves the problem and allows me to list and download any data I want into my polyglot installation.
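The workaround described above can be sketched as follows (a hypothetical illustration; the package ID "embeddings2/en" is an example, not taken from the actual index):

```python
# Package IDs in the polyglot data index always use "/" as a separator,
# regardless of the operating system.
subdir = "embeddings2/en"  # hypothetical example ID

# The original line in downloader.py splits on the OS path separator:
#     language = subdir.split(path.sep)[1]
# On Windows, path.sep is "\\", so the split returns a single element
# and indexing [1] raises IndexError.

# The workaround: split on "/" explicitly, which works on every OS.
language = subdir.split("/")[1]
print(language)
```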

    opened by xTomax 20
  • Where to get all models as one archive?

I'm trying to download models from http://whoisbigger.com/polyglot. But unfortunately it shows 0 bps after some time. Could you give me a link to an alternative download?

    opened by hodzanassredin 15
  • ImportError: ~/anaconda3/lib/python3.5/site-packages/_icu.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZTIN6icu_5714LEFontInstanceE

    Can't run polyglot project, please help

    ImportError: ~/anaconda3/lib/python3.5/site-packages/_icu.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZTIN6icu_5714LEFontInstanceE

    opened by bilalbayasut 14
  • Trouble Installing

    This is using "pip install polyglot".

    I've located some useful arguments that can help here, but I'm not sure how to add them to the cc command.

Complete output from command /usr/bin/python -c "import setuptools, tokenize;file='/private/var/folders/k1/6_4k217j1ng5qnm8_vrpx1b80000gp/T/pip-build-uOkJfF/PyICU/setup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /var/folders/k1/6_4k217j1ng5qnm8_vrpx1b80000gp/T/pip-EdbjO8-record/install-record.txt --single-version-externally-managed --compile:

running install
running build
running build_py
creating build
creating build/lib.macosx-10.10-intel-2.7
copying icu.py -> build/lib.macosx-10.10-intel-2.7
copying PyICU.py -> build/lib.macosx-10.10-intel-2.7
copying docs.py -> build/lib.macosx-10.10-intel-2.7
running build_ext
building '_icu' extension
creating build/temp.macosx-10.10-intel-2.7
cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/usr/local/include -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c _icu.cpp -o build/temp.macosx-10.10-intel-2.7/_icu.o -DPYICU_VER="1.9.2"
In file included from _icu.cpp:27:
./common.h:86:10: fatal error: 'unicode/utypes.h' file not found
#include <unicode/utypes.h>
1 error generated.
error: command 'cc' failed with exit status 1

    opened by iamtrask 13
  • polyglot windows installation

I can't install polyglot on Windows 7 64-bit. I have tried Python 3.4, 3.5, and 3.6 with various versions of the PyICU, numpy, and PyCld2 modules from http://www.lfd.uci.edu/~gohlke/pythonlibs/, but still without success. If somebody was successful with a polyglot installation on Windows, could you please publish the working combination of versions for Python, PyICU, numpy, and PyCld2, and the type of installation for each (wheel, GitHub, pip, etc.)? Maybe some other tips?

    I will really appreciate your help. Thank you.

    Paul

    opened by netgateseznamcz 10
installation failed

I want to use your tool polyglot NER, but I get this error. I have tried repeatedly to fix it but could not, and I really need this tool. Can you help me, please? Thank you very much.

    opened by zainjaradat 8
  • Not able to install polyglot for Windows 10, Python version 3.6.5

    Hello. I have been trying to install Polyglot on my Windows 10 machine but to no avail. I tried to solve this error through the various issues posted here, but none of them work for me.

    It seems the issue arises when trying to install PyICU which is part of Polyglot's installation. I git cloned the repo and used python setup.py install to do so (since pip install polyglot gives an encoding error from cp1252 even though my Python's default encoding is UTF-8).

    Searching for PyICU>=1.8
    Reading https://pypi.python.org/simple/PyICU/
    Downloading https://pypi.python.org/packages/bb/ef/3a7fcbba81bfd213e479131ae21445a2ddd14b46d70ef0109640b580bc5d/PyICU-2.0.3.tar.gz#md5=f2e696a3680be895170282297e036f40
    Best match: PyICU 2.0.3
    Processing PyICU-2.0.3.tar.gz
    Writing C:\Users\me\AppData\Local\Temp\easy_install-188nt5fk\PyICU-2.0.3\setup.cfg
    Running PyICU-2.0.3\setup.py -q bdist_egg --dist-dir C:\Users\me\AppData\Local\Temp\easy_install-188nt5fk\PyICU-2.0.3\egg-dist-tmp-d_y2eb25
    
    Building PyICU 2.0.3 for ICU 2.0.3
    
    _icu.cpp
    c:\users\me\appdata\local\temp\easy_install-188nt5fk\pyicu-2.0.3\common.h(105): fatal error C1083: Cannot open include file: 'unicode/utypes.h': No such file or directory
    error: Setup script exited with error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2017\\Enterprise\\VC\\Tools\\MSVC\\14.13.26128\\bin\\HostX86\\x86\\cl.exe' failed with exit status 2
    

    Please help me out; I am totally stumped and have been trying for too long to fix this!

    P.S: This is my first ever proper issue posted, so please go easy on me and let me know what else is considered helpful in such a forum.

    opened by danyal-s 7
  • outdated model for 15.04.19?

    Hi,

    I just upgraded to polyglot 15.04.19 and it seems the model needs to be updated too.

    In [1]: from polyglot.downloader import downloader
    
    In [2]: downloader.download("embeddings2.en")
    [polyglot_data] Downloading package embeddings2.en to
    [polyglot_data]     /home/ubuntu/polyglot_data...
    Out[2]: True
    
    In [3]: downloader.download("pos2.en")
    [polyglot_data] Downloading package pos2.en to
    [polyglot_data]     /home/ubuntu/polyglot_data...
    Out[3]: True
    
    In [4]: blob = """We will meet at eight o'clock on Thursday morning."""
    
    In [5]: from polyglot.text import Text
    
    In [6]: text = Text(blob)
    
    In [7]: text.pos_tags
    Out[7]:
    [(u'We', u'INTJ'),
     (u'will', u'NOUN'),
     (u'meet', u'NOUN'),
     (u'at', u'ADP'),
     (u'eight', u'DET'),
     (u"o'clock", u'PART'),
     (u'on', u'ADP'),
     (u'Thursday', u'PART'),
     (u'morning', u'PART'),
     (u'.', u'ADV')]
    

You might want to update this too.

    opened by geovedi 7
  • Named Entity Extraction does not seem to work

    I would like to use the Named Entity Extraction of Polyglot, so I'm following the documentation at http://polyglot.readthedocs.org/en/latest/NamedEntityRecognition.html, however when I execute

    print(downloader.supported_languages_table("ner2", 3)) 
    

    I get the following error:

    Traceback (most recent call last):
      File "C:/Users/text_analyzer_polyglot.py", line 22, in <module>
        main()
      File "C:/Users/text_analyzer_polyglot.py", line 18, in main
        print(downloader.supported_languages_table("ner2", 3))
      File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 963, in supported_languages_table
        languages = self.supported_languages(task)
      File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 955, in supported_languages
        collection = self.get_collection(task=task)
      File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 934, in get_collection
        if task: raise TaskNotSupported("Task {} is not supported".format(id))
    polyglot.downloader.TaskNotSupported: Task TASK:ner2 is not supported
    

    In addition, if I try to execute:

blob = """The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world"."""
text = Text(blob)
print(text.entities)
    

    I get the following error:

    Traceback (most recent call last):
      File "C:/Users/text_analyzer_polyglot.py", line 23, in <module>
        main()
      File "C:/Users/text_analyzer_polyglot.py", line 20, in main
        print (text.entities)
      File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 20, in __get__
        value = obj.__dict__[self.func.__name__] = self.func(obj)
      File "C:\Python27\lib\site-packages\polyglot\text.py", line 124, in entities
        for i, (w, tag) in enumerate(self.ne_chunker.annotate(self.words)):
      File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 20, in __get__
        value = obj.__dict__[self.func.__name__] = self.func(obj)
      File "C:\Python27\lib\site-packages\polyglot\text.py", line 96, in ne_chunker
        return get_ner_tagger(lang=self.language.code)
      File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 30, in memoizer
        cache[key] = obj(*args, **kwargs)
      File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 152, in get_ner_tagger
        return NEChunker(lang=lang)
      File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 99, in __init__
        super(NEChunker, self).__init__(lang=lang)
      File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 40, in __init__
        self.predictor = self._load_network()
      File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 104, in _load_network
        self.embeddings = load_embeddings(self.lang, type='cw')
      File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 30, in memoizer
        cache[key] = obj(*args, **kwargs)
      File "C:\Python27\lib\site-packages\polyglot\load.py", line 64, in load_embeddings
        p = locate_resource(src_dir, lang)
      File "C:\Python27\lib\site-packages\polyglot\load.py", line 47, in locate_resource
        if downloader.status(package_id) != downloader.INSTALLED:
      File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 730, in status
        info = self._info_or_id(info_or_id)
      File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 500, in _info_or_id
        return self.info(info_or_id)
      File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 918, in info
        raise ValueError('Package %r not found in index' % id)
    ValueError: Package u'embeddings2.en' not found in index
    

    Am I missing something in the documentation? Could you tell me how to successfully run the Named Entity Extraction?

    opened by valeriocos 6
  • The downloads server seems currently down

    Hello! When I issued this command:

    polyglot download embeddings2.en ner2.en 
    

    I received the following answer:

    [polyglot_data] Error loading embeddings2.en: HTTP Error 503: Service
    [polyglot_data]     Unavailable
    Error installing package. Retry? [n/y/e]
    

    This has been happening for about 3 days (as far as I know) and in all sorts of circumstances. I think your downloads server is down. Any thoughts?

    opened by georgiana-b 5
  • Licensing issue for polyglot

We are planning to use this library in our application. The library is licensed under the GNU General Public License v3.0, which poses a licensing risk for us. We are not making any modifications to the library's source code. Can our application be made proprietary while using GPL code? Can you please give more insight into this?

    Thanks in advance.

    opened by ShanmukhaSridhar 0
  • polyglot download failing

After installing polyglot from source with pip install -U git+https://github.com/aboSamoor/polyglot.git@master, I can't download models via the CLI or the Python library:

    >>> polyglot download
    Polyglot Downloader
    ---------------------------------------------------------------------------
      d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
    ---------------------------------------------------------------------------
    Downloader> l
    
    Collections:
    Error reading from server: HTTP Error 404: Not Found
    

    Any other alternative to access the models?

    opened by jspablo 2
  • [1]    7092 segmentation fault  python - Error is coming

After a lot of struggle I finally installed the polyglot dependencies such as PyICU. Everything installed correctly, but an error occurs when I simply execute the example from the documentation:

import polyglot
from polyglot.text import Text, Word
text = Text("Bonjour, Mesdames.")
print("Language Detected: Code={}, Name={}\n".format(text.language.code, text.language.name))

As soon as I run the print statement, I get the error [1] 7092 segmentation fault python and I am thrown out of Python.

I am using a MacBook Pro M1 with macOS Monterey 12.6 and Python 3.9.10.

I have no idea what this issue is or how to resolve it.

    opened by asifkhan69 0
  • Underscores make sentences detected as English?

This sentence is detected as French with 98% probability:

    Celles qui n'encouragent guère, emprises de jalousie.

Changing one character to an underscore:

    Celles qui n'encouragent gu_re, emprises de jalousie

gives English with 98% probability. Clearly some bug. Any ideas?

    opened by ndvbd 0
  • English to Japanese Transliteration

    Hi @aboSamoor,

Thanks for this amazing library! I have a question regarding the transliteration of English to Japanese. As you might know, Japanese contains three different types of tokens, namely Hiragana, Katakana, and Kanji. I wanted to know which type of token the transliteration from En to Ja produces here.

    Thanks.

    opened by tejassp2002 0
pip install polyglot error: subprocess exited with error

C:\Users\r>pip install polyglot
Collecting polyglot
  Downloading polyglot-16.7.4.tar.gz (126 kB)
     ---------------------------------------- 126.3/126.3 kB 1.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [8 lines of output]
    Traceback (most recent call last):
      File "", line 2, in
      File "", line 34, in
      File "C:\Users\r\AppData\Local\Temp\pip-install-jiuhzdjt\polyglot_bd6a0716ccdf4fd7ae7fad12136682fa\setup.py", line 15, in
        readme = readme_file.read()
      File "C:\Users\r\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4941: character maps to
    [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
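The traceback above fails inside setup.py while reading the README with cp1252, the Windows default codec. A hypothetical workaround is to make that open() call request UTF-8 explicitly. The snippet below demonstrates the difference with a throwaway file ("demo_readme.txt" is an example name, not part of polyglot):

```python
# Write a UTF-8 file containing a character that cp1252 cannot decode,
# then read it back with an explicit encoding rather than relying on
# the platform default (cp1252 on Windows), which is what crashes above.
with open("demo_readme.txt", "w", encoding="utf-8") as f:
    f.write("polyglot → UTF-8")

with open("demo_readme.txt", encoding="utf-8") as readme_file:
    readme = readme_file.read()

print(readme)
```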

    opened by ryan-seitz 0
License

Free software: GPLv3 license.