Ucto for Python
This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (https://languagemachines.github.io/ucto).
Installation
Easy
- On Arch Linux, use the python-ucto-git package from the Arch User Repository (AUR): https://aur.archlinux.org/packages/python-ucto-git/
- In all other cases, for easy installation of both python-ucto and ucto itself, please use our LaMachine distribution (https://proycon.github.io/LaMachine)
Manual (Advanced)
- Make sure to first install ucto itself (https://languagemachines.github.io/ucto) and all its dependencies.
- Install Cython if not yet available on your system:
$ sudo apt-get install cython cython3
(Debian/Ubuntu; the command may differ for other distributions)
- Clone this repository and run:
$ sudo python setup.py install
(Make sure to use the desired version of Python)
Advanced note: If the ucto libraries and includes are installed in a non-standard location, you can set the environment variables INCLUDE_DIRS and LIBRARY_DIRS to point to them prior to invocation of setup.py install.
Usage
Import and instantiate the Tokenizer class with a configuration file.
import ucto
configurationfile = "tokconfig-eng"
tokenizer = ucto.Tokenizer(configurationfile)
The configuration files supplied with ucto are named tokconfig-xxx, where xxx corresponds to a three-letter ISO 639-3 language code. There is also a tokconfig-generic configuration that has no language-specific rules. Alternatively, you can make and supply your own configuration file. Note that for older versions of ucto you may need to provide the absolute path, but the latest versions will find the configurations supplied with ucto automatically. See the ucto documentation for a list of the configurations available in the latest version.
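For example, a minimal sketch: the Dutch configuration name below is one of the configurations shipped with ucto, but the absolute path is only an assumption about where ucto might be installed on your system:

import ucto

# recent versions of ucto resolve the configuration name automatically
tokenizer = ucto.Tokenizer("tokconfig-nld")

# on older versions you may need an absolute path instead, for example
# (hypothetical, depends on your installation prefix):
#tokenizer = ucto.Tokenizer("/usr/local/share/ucto/tokconfig-nld")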
The constructor for the Tokenizer class takes the following keyword arguments (a usage example follows the list):
- lowercase (defaults to False) -- Lowercase all text
- uppercase (defaults to False) -- Uppercase all text
- sentenceperlineinput (defaults to False) -- Set this to True if each sentence in your input is on one line already and you do not require further sentence boundary detection from ucto.
- sentenceperlineoutput (defaults to False) -- Set this if you want each sentence to be output on one line. Has little effect within the context of Python.
- paragraphdetection (defaults to True) -- Do paragraph detection. Paragraphs are simply delimited by an empty line.
- quotedetection (defaults to False) -- Set this if you want to enable the experimental quote detection, which detects quoted text (enclosed within some sort of single/double quotes).
- debug (defaults to False) -- Enable verbose debug output
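For instance, a sketch of a tokenizer that lowercases all text and treats each input line as a sentence, using the keyword arguments listed above:

import ucto

tokenizer = ucto.Tokenizer("tokconfig-eng",
                           lowercase=True,
                           sentenceperlineinput=True,
                           paragraphdetection=False)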
Text is passed to the tokeniser using the process() method, which returns the number of tokens rather than the tokens themselves. It may be called multiple times in sequence. The tokens themselves are buffered in the Tokenizer instance and can be obtained by iterating over it, after which the buffer will be cleared:
#pass the text (a str) (may be called multiple times)
tokenizer.process(text)

#read the tokenised data
for token in tokenizer:
    #token is an instance of ucto.Token, serialise to string using str()
    print(str(token))

    #tokens remember whether they are followed by a space
    if token.isendofsentence():
        print()
    elif not token.nospace():
        print(" ",end="")
The process() method takes a single string (str) as its parameter. The string may contain newlines, but newlines are not necessarily sentence boundaries unless you instantiated the tokenizer with sentenceperlineinput=True.
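As a sketch of how sentenceperlineinput=True interacts with newlines in the string passed to process() (the two-sentence text is just an illustration):

import ucto

tokenizer = ucto.Tokenizer("tokconfig-eng", sentenceperlineinput=True)
tokenizer.process("This is the first sentence.\nAnd here is the second one.\n")
for token in tokenizer:
    if token.isendofsentence():
        print(str(token))           #each input line ends a sentence here
    else:
        print(str(token), end=" ")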
Each token is an instance of ucto.Token. It can be serialised to a string using str(), as shown in the example above.
The following methods are available on ucto.Token instances (a short example follows the list):
- isendofsentence() -- Returns a boolean indicating whether this is the last token of a sentence.
- nospace() -- Returns a boolean; if True, there is no space following this token in the original input text.
- isnewparagraph() -- Returns True if this token is the start of a new paragraph.
- isbeginofquote()
- isendofquote()
- tokentype -- This is an attribute, not a method. It contains the type or class of the token (e.g. a string like WORD, ABBREVIATION, PUNCTUATION, URL, EMAIL, SMILEY, etc.)
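A sketch that combines these methods and the tokentype attribute to inspect each token (the example text and output layout are purely illustrative):

import ucto

tokenizer = ucto.Tokenizer("tokconfig-eng", quotedetection=True)
tokenizer.process('She said: "That will do." Mail me at info@example.org!')
for token in tokenizer:
    #collect the properties that apply to this token
    properties = []
    if token.isnewparagraph():
        properties.append("newparagraph")
    if token.isbeginofquote():
        properties.append("beginquote")
    if token.isendofquote():
        properties.append("endquote")
    if token.isendofsentence():
        properties.append("endofsentence")
    print(str(token), token.tokentype, " ".join(properties))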
In addition to the low-level process() method, the tokenizer can also read an input file and produce an output file, in the same fashion as ucto itself does when invoked from the command line. This is achieved using the tokenize(inputfilename, outputfilename) method:
tokenizer.tokenize("input.txt","output.txt")
Input and output files may be either plain text or in the FoLiA XML format. Upon instantiation of the Tokenizer class, there are two keyword arguments to indicate this:
- xmlinput or foliainput -- A boolean that indicates whether the input is FoLiA XML (True) or plain text (False). Defaults to False.
- xmloutput or foliaoutput -- A boolean that indicates whether the output is FoLiA XML (True) or plain text (False). Defaults to False. If this option is enabled, you can set an additional keyword parameter docid (string) to set the document ID.
An example for plain text input and FoLiA output:
tokenizer = ucto.Tokenizer(configurationfile, foliaoutput=True)
tokenizer.tokenize("input.txt", "ucto_output.folia.xml")
FoLiA documents retain all the information ucto can output, unlike the plain text representation. These documents can be read and manipulated from Python using the FoLiaPy library. FoLiA is especially recommended if you intend to further enrich the document with linguistic annotation. A small example of reading ucto's FoLiA output using this library follows, but consult the documentation for more:
import folia.main as folia

doc = folia.Document(file="ucto_output.folia.xml")
for paragraph in doc.paragraphs():
    for sentence in paragraph.sentences():
        for word in sentence.words():
            print(word.text(), end="")
            if word.space:
                print(" ", end="")
        print()
    print()
Test and Example
Run and inspect example.py.