# TextDistance

TextDistance -- a Python library for comparing the distance between two or more sequences using many algorithms.

Features:

• 30+ algorithms
• Pure Python implementation
• Simple usage
• Comparison of more than two sequences
• Some algorithms have more than one implementation in one class.
• Optional numpy usage for maximum speed.

## Algorithms

### Edit based

| Algorithm | Class | Functions |
|---|---|---|
| Hamming | `Hamming` | `hamming` |
| MLIPNS | `Mlipns` | `mlipns` |
| Levenshtein | `Levenshtein` | `levenshtein` |
| Damerau-Levenshtein | `DamerauLevenshtein` | `damerau_levenshtein` |
| Jaro-Winkler | `JaroWinkler` | `jaro_winkler`, `jaro` |
| Strcmp95 | `StrCmp95` | `strcmp95` |
| Needleman-Wunsch | `NeedlemanWunsch` | `needleman_wunsch` |
| Gotoh | `Gotoh` | `gotoh` |
| Smith-Waterman | `SmithWaterman` | `smith_waterman` |

### Token based

| Algorithm | Class | Functions |
|---|---|---|
| Jaccard index | `Jaccard` | `jaccard` |
| Sørensen–Dice coefficient | `Sorensen` | `sorensen`, `sorensen_dice`, `dice` |
| Tversky index | `Tversky` | `tversky` |
| Overlap coefficient | `Overlap` | `overlap` |
| Tanimoto distance | `Tanimoto` | `tanimoto` |
| Cosine similarity | `Cosine` | `cosine` |
| Monge-Elkan | `MongeElkan` | `monge_elkan` |
| Bag distance | `Bag` | `bag` |

### Sequence based

| Algorithm | Class | Functions |
|---|---|---|
| Longest common subsequence similarity | `LCSSeq` | `lcsseq` |
| Longest common substring similarity | `LCSStr` | `lcsstr` |
| Ratcliff-Obershelp similarity | `RatcliffObershelp` | `ratcliff_obershelp` |

### Compression based

Normalized compression distance with different compression algorithms.

Classic compression algorithms:

| Algorithm | Class | Function |
|---|---|---|
| Arithmetic coding | `ArithNCD` | `arith_ncd` |
| RLE | `RLENCD` | `rle_ncd` |
| BWT RLE | `BWTRLENCD` | `bwtrle_ncd` |

Normal compression algorithms:

| Algorithm | Class | Function |
|---|---|---|
| Square Root | `SqrtNCD` | `sqrt_ncd` |
| Entropy | `EntropyNCD` | `entropy_ncd` |

Work-in-progress algorithms that compare two strings as arrays of bits:

| Algorithm | Class | Function |
|---|---|---|
| BZ2 | `BZ2NCD` | `bz2_ncd` |
| LZMA | `LZMANCD` | `lzma_ncd` |
| ZLib | `ZLIBNCD` | `zlib_ncd` |

See blog post for more details about NCD.
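
As a rough illustration of the idea behind these classes, the sketch below computes the textbook normalized compression distance, `NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))`, with the standard-library `zlib` as the compressor `C`. The helper names `compressed_size` and `ncd` are hypothetical, and textdistance's own classes may differ in details, so treat this as a sketch under those assumptions rather than the library's implementation.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (hypothetical helper)."""
    return len(zlib.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Textbook normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


repetitive = b"the quick brown fox " * 50
different = bytes(range(256)) * 4

# Inputs that share structure compress well together, so their NCD
# is lower than the NCD of unrelated inputs.
print(ncd(repetitive, repetitive))
print(ncd(repetitive, different))
```

The exact values depend on the compressor, which is why only the ordering of the two results is meaningful here.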

### Phonetic

| Algorithm | Class | Functions |
|---|---|---|
| MRA | `MRA` | `mra` |
| Editex | `Editex` | `editex` |

### Simple

| Algorithm | Class | Functions |
|---|---|---|
| Prefix similarity | `Prefix` | `prefix` |
| Postfix similarity | `Postfix` | `postfix` |
| Length distance | `Length` | `length` |
| Identity similarity | `Identity` | `identity` |
| Matrix similarity | `Matrix` | `matrix` |

## Installation

### Stable

Only pure python implementation:

`pip install textdistance`

With extra libraries for maximum speed:

`pip install "textdistance[extras]"`

With all libraries (required for benchmarking and testing):

`pip install "textdistance[benchmark]"`

With algorithm specific extras:

`pip install "textdistance[Hamming]"`

Algorithms with available extras: `DamerauLevenshtein`, `Hamming`, `Jaro`, `JaroWinkler`, `Levenshtein`.

### Dev

Via pip:

`pip install -e git+https://github.com/life4/textdistance.git#egg=textdistance`

Or clone repo and install with some extras:

```
git clone https://github.com/life4/textdistance.git
cd textdistance
pip install -e ".[benchmark]"
```

## Usage

All algorithms have 2 interfaces:

1. Class with algorithm-specific params for customizing.
2. Class instance with default params for quick and simple usage.

All algorithms have some common methods:

1. `.distance(*sequences)` -- calculate distance between sequences.
2. `.similarity(*sequences)` -- calculate similarity for sequences.
3. `.maximum(*sequences)` -- maximum possible value for distance and similarity. For any sequence: `distance + similarity == maximum`.
4. `.normalized_distance(*sequences)` -- normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal, and 1 totally different.
5. `.normalized_similarity(*sequences)` -- normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different, and 1 equal.
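
The relationship between these methods can be sketched with a hand-written Hamming distance. This is a minimal pure-Python illustration, not the library's code; all function names below are hypothetical, and positions missing from the shorter sequence are assumed to count as mismatches.

```python
from itertools import zip_longest


def hamming_distance(a, b):
    """Count positions where the sequences differ; padding counts as a mismatch."""
    sentinel = object()  # never equal to any real element
    return sum(x != y for x, y in zip_longest(a, b, fillvalue=sentinel))


def hamming_maximum(a, b):
    """Largest possible distance: every position differs."""
    return max(len(a), len(b))


def hamming_similarity(a, b):
    """Defined so that distance + similarity == maximum."""
    return hamming_maximum(a, b) - hamming_distance(a, b)


def normalized_distance(a, b):
    """Distance scaled into [0, 1]; two empty sequences are treated as equal."""
    maximum = hamming_maximum(a, b)
    return hamming_distance(a, b) / maximum if maximum else 0.0


def normalized_similarity(a, b):
    return 1 - normalized_distance(a, b)


print(hamming_distance('test', 'text'))       # 1
print(hamming_similarity('test', 'text'))     # 3
print(hamming_maximum('test', 'text'))        # 4
print(normalized_distance('test', 'text'))    # 0.25
print(normalized_similarity('test', 'text'))  # 0.75
```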

Most common init arguments:

1. `qval` -- q-value used to split sequences into q-grams. Possible values:
• 1 (default) -- compare sequences by chars.
• 2 or more -- transform sequences to q-grams.
• None -- split sequences by words.
2. `as_set` -- for token-based algorithms:
• True -- `t` and `ttt` are equal.
• False (default) -- `t` and `ttt` are different.
## Examples

For example, Hamming distance:

```
import textdistance

textdistance.hamming('test', 'text')
# 1

textdistance.hamming.distance('test', 'text')
# 1

textdistance.hamming.similarity('test', 'text')
# 3

textdistance.hamming.normalized_distance('test', 'text')
# 0.25

textdistance.hamming.normalized_similarity('test', 'text')
# 0.75

textdistance.Hamming(qval=2).distance('test', 'text')
# 2
```

All other algorithms share the same interface.

## Extra libraries

For the main algorithms, textdistance tries to call known external libraries (fastest first) when they are available (installed on your system) and applicable (the external implementation can compare the given type of sequences). Install textdistance with extras to enable this feature.

You can disable this by passing `external=False` argument on init:

```
import textdistance
hamming = textdistance.Hamming(external=False)
hamming('text', 'testit')
# 3
```
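
The "fastest available library first, pure Python otherwise" idea can be sketched with `importlib`. This is only an illustration of the pattern, not textdistance's actual dispatch code: `load_first_available`, `pure_python_hamming`, `make_hamming` and the `CANDIDATES` ordering are hypothetical, and a real dispatcher would also have to check that the external function accepts the given input.

```python
import importlib
from itertools import zip_longest


def load_first_available(candidates):
    """Return the first importable callable from (module, function) pairs, else None."""
    for module_name, func_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # library is not installed, try the next one
        func = getattr(module, func_name, None)
        if callable(func):
            return func
    return None


def pure_python_hamming(a, b):
    """Fallback: count differing positions, padding counts as a mismatch."""
    sentinel = object()
    return sum(x != y for x, y in zip_longest(a, b, fillvalue=sentinel))


# Hypothetical priority list, fastest first.
CANDIDATES = [("Levenshtein", "hamming"), ("jellyfish", "hamming_distance")]


def make_hamming(external=True):
    """Pick an external implementation if allowed and installed, else the fallback."""
    func = load_first_available(CANDIDATES) if external else None
    return func or pure_python_hamming


hamming = make_hamming(external=False)
print(hamming('text', 'testit'))  # 3
```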

Supported libraries:

Algorithms:

1. DamerauLevenshtein
2. Hamming
3. Jaro
4. JaroWinkler
5. Levenshtein

## Benchmarks

Without extras installation:

| algorithm | library | function | time |
|---|---|---|---|
| DamerauLevenshtein | jellyfish | damerau_levenshtein_distance | 0.00965294 |
| DamerauLevenshtein | pyxdameraulevenshtein | damerau_levenshtein_distance | 0.151378 |
| DamerauLevenshtein | pylev | damerau_levenshtein | 0.766461 |
| DamerauLevenshtein | textdistance | DamerauLevenshtein | 4.13463 |
| DamerauLevenshtein | abydos | damerau_levenshtein | 4.3831 |
| Hamming | Levenshtein | hamming | 0.0014428 |
| Hamming | jellyfish | hamming_distance | 0.00240262 |
| Hamming | distance | hamming | 0.036253 |
| Hamming | abydos | hamming | 0.0383933 |
| Hamming | textdistance | Hamming | 0.176781 |
| Jaro | Levenshtein | jaro | 0.00313561 |
| Jaro | jellyfish | jaro_distance | 0.0051885 |
| Jaro | py_stringmatching | jaro | 0.180628 |
| Jaro | textdistance | Jaro | 0.278917 |
| JaroWinkler | Levenshtein | jaro_winkler | 0.00319735 |
| JaroWinkler | jellyfish | jaro_winkler | 0.00540443 |
| JaroWinkler | textdistance | JaroWinkler | 0.289626 |
| Levenshtein | Levenshtein | distance | 0.00414404 |
| Levenshtein | jellyfish | levenshtein_distance | 0.00601647 |
| Levenshtein | py_stringmatching | levenshtein | 0.252901 |
| Levenshtein | pylev | levenshtein | 0.569182 |
| Levenshtein | distance | levenshtein | 1.15726 |
| Levenshtein | abydos | levenshtein | 3.68451 |
| Levenshtein | textdistance | Levenshtein | 8.63674 |

Total: 24 libs.

Yeah, so slow. Use TextDistance in production only with extras.

TextDistance uses the benchmark results to optimize algorithm selection and tries to call the fastest external library first (if possible).

You can run benchmark manually on your system:

```
pip install "textdistance[benchmark]"
python3 -m textdistance.benchmark
```

TextDistance shows a benchmark results table for your system and saves the library priorities to the `libraries.json` file in TextDistance's folder. textdistance then uses this file to call the fastest algorithm implementation. A default `libraries.json` is already included in the package.
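
The general approach of timing candidate implementations and storing them fastest first can be sketched with the standard-library `timeit` module. This is a hedged illustration only: `rank_by_speed` and the two toy Hamming functions are hypothetical, and the JSON printed at the end does not reproduce the real layout of `libraries.json`.

```python
import json
import timeit


def rank_by_speed(funcs, pairs, number=200):
    """Time each function over all input pairs and return (name, seconds), fastest first."""
    timings = []
    for name, func in funcs.items():
        def run(func=func):  # bind the current function for timeit
            for a, b in pairs:
                func(a, b)
        timings.append((name, timeit.timeit(run, number=number)))
    return sorted(timings, key=lambda item: item[1])


def zip_hamming(a, b):
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def index_hamming(a, b):
    shared = min(len(a), len(b))
    return sum(a[i] != b[i] for i in range(shared)) + abs(len(a) - len(b))


pairs = [('text', 'test'), ('qwer', 'asdf'), ('a' * 15, 'b' * 15)]
ranking = rank_by_speed({'zip': zip_hamming, 'index': index_hamming}, pairs)

# Persisting only the ordered names is enough to pick the fastest one later.
print(json.dumps([name for name, _ in ranking]))
```

Because the result depends on the machine, the order of the printed names may differ between runs.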

## Running tests

You can run tests via dephell:

```
curl -L dephell.org/install | python3
dephell venv create --env=pytest-external
dephell deps install --env=pytest-external
dephell venv run --env=pytest-external
```

## Contributing

PRs are welcome!

• Found a bug? Fix it!
• Want to add more algorithms? Sure! Just make it with the same interface as other algorithms in the lib and add some tests.
• Can make something faster? Great! Just avoid external dependencies and remember that everything should work not only with strings.
• Have something else that you think is good? Do it! Just make sure that CI passes and everything from the README is still applicable (interface, features, and so on).
• Have no time to code? Tell your friends and subscribers about `textdistance`. More users, more contributions, more amazing features.

Thank you ❤️

• #### add support for rapidfuzz

The implementation used by rapidfuzz has the following algorithms:

• Jaro/JaroWinkler (fastest by a large margin)
• Hamming (slightly slower than python-Levenshtein)
• Levenshtein (similarly fast to python-Levenshtein for very short strings and fastest for longer strings)

Additionally, it supports any sequence of hashable types (e.g. lists of strings), not only text.

Here is the benchmark result:

``````
# Faster than textdistance:

| algorithm          | library                 | function                     |        time |
|--------------------+-------------------------+------------------------------+-------------|
| DamerauLevenshtein | jellyfish               | damerau_levenshtein_distance | 0.0181046   |
| DamerauLevenshtein | pyxdameraulevenshtein   | damerau_levenshtein_distance | 0.030925    |
| Hamming            | Levenshtein             | hamming                      | 0.000351586 |
| Hamming            | rapidfuzz.string_metric | hamming                      | 0.00040442  |
| Hamming            | jellyfish               | hamming_distance             | 0.0143502   |
| Jaro               | rapidfuzz.string_metric | jaro_similarity              | 0.000749048 |
| Jaro               | jellyfish               | jaro_similarity              | 0.0152322   |
| JaroWinkler        | rapidfuzz.string_metric | jaro_winkler_similarity      | 0.000776006 |
| JaroWinkler        | jellyfish               | jaro_winkler_similarity      | 0.0157833   |
| Levenshtein        | rapidfuzz.string_metric | levenshtein                  | 0.0010058   |
| Levenshtein        | Levenshtein             | distance                     | 0.00103176  |
| Levenshtein        | jellyfish               | levenshtein_distance         | 0.0147382   |
| Levenshtein        | pylev                   | levenshtein                  | 0.14116     |
Total: 13 libs.
``````

and the benchmark results when adding slightly longer strings:

``````
STMT = """
func('text', 'test')
func('qwer', 'asdf')
func('a' * 15, 'b' * 15)
func('a' * 30, 'b' * 30)
"""
``````
``````
# Faster than textdistance:

| algorithm          | library                 | function                     |        time |
|--------------------+-------------------------+------------------------------+-------------|
| DamerauLevenshtein | jellyfish               | damerau_levenshtein_distance | 0.0323887   |
| DamerauLevenshtein | pyxdameraulevenshtein   | damerau_levenshtein_distance | 0.143235    |
| Hamming            | Levenshtein             | hamming                      | 0.000489837 |
| Hamming            | rapidfuzz.string_metric | hamming                      | 0.000517879 |
| Hamming            | jellyfish               | hamming_distance             | 0.0182341   |
| Jaro               | rapidfuzz.string_metric | jaro_similarity              | 0.00111363  |
| Jaro               | jellyfish               | jaro_similarity              | 0.0201971   |
| JaroWinkler        | rapidfuzz.string_metric | jaro_winkler_similarity      | 0.00105238  |
| JaroWinkler        | jellyfish               | jaro_winkler_similarity      | 0.0206678   |
| Levenshtein        | rapidfuzz.string_metric | levenshtein                  | 0.00138601  |
| Levenshtein        | Levenshtein             | distance                     | 0.0034889   |
| Levenshtein        | jellyfish               | levenshtein_distance         | 0.0232467   |
| Levenshtein        | pylev                   | levenshtein                  | 0.599603    |
Total: 13 libs.
``````
opened by maxbachmann 13
• #### Add new DamerauLevenshtein... classes

There are two versions of the Damerau-Levenshtein distance, as described in this Debian bug report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1018933 Some of the external libraries implement one of them, others the other.

This PR introduces two different classes: `DamerauLevenshteinRestricted` and `DamerauLevenshteinUnrestricted`, with `DamerauLevenshtein` being the unrestricted version, so that it is clear what is intended.
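
To show how the two versions differ, here is a pure-Python sketch of both. The restricted variant is commonly known as optimal string alignment (OSA) and never edits a substring more than once; the unrestricted variant allows further edits around a transposition. The function names are hypothetical and this is not the code from the PR.

```python
def restricted_distance(a, b):
    """Optimal string alignment: only adjacent, untouched pairs may be transposed."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def unrestricted_distance(a, b):
    """Damerau-Levenshtein with transpositions that may span other edits."""
    inf = len(a) + len(b)
    # Row 0 and column 0 hold a large sentinel; real values start at index 1.
    d = [[inf] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j
    last_row = {}  # last row of `a` in which each symbol was seen
    for i in range(1, len(a) + 1):
        last_col = 0  # last column in this row where the symbols matched
        for j in range(1, len(b) + 1):
            i1, j1 = last_row.get(b[j - 1], 0), last_col
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if cost == 0:
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                               # substitution or match
                d[i + 1][j] + 1,                              # insertion
                d[i][j + 1] + 1,                              # deletion
                d[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),  # transposition
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


print(restricted_distance('CA', 'ABC'))    # 3
print(unrestricted_distance('CA', 'ABC'))  # 2
```

For `'CA'` and `'ABC'` the unrestricted version can transpose `CA` to `AC` and then insert `B` in between, which the restricted version does not allow.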

opened by juliangilbey 7
• #### Ignore inconsistent timings on some comparison tests

Two particular tests have timings that differ wildly between successive runs on arm64 architectures. This might be because some libraries take a long time to load or something like that - I don't know. But this patch turns off hypothesis's timing checks for these two tests. I'm going to apply it to Debian's package; you might or might not want to apply it upstream.

opened by juliangilbey 5
• #### Modify JaroWinkler boosting to match behaviour of jellyfish algorithm

Jellyfish has recently modified its JaroWinkler algorithm to allow for boosting even when one of the strings is shorter than 4 characters: https://github.com/jamesturk/jellyfish/commit/87f9679910eba0dad6a1f6019f03cbdffba28392. It is very unclear whether this is a good idea or not. But as it is, the tests now fail, as the internal and external algorithms give different results on a pair of strings such as ":" and ":0".

This patch replicates the change that jellyfish has made, which will then allow the external tests to pass once again. It also modifies the expected value of the comparison "fog" and "frog" to match this new algorithm behaviour.

If you do not wish to apply this patch, then the external tests will need modifying to exclude the case where either of the strings has length < 4.

hacktoberfest-accepted
opened by juliangilbey 5
• #### Possible correction to Monge-Elkan calculation

Might be wrong about this, but think the code for the Monge-Elkan algorithm needs to be corrected.

If you look at the implementation in the py_stringmatching library on line 81 of https://github.com/anhaidgroup/py_stringmatching/blob/master/py_stringmatching/similarity_measure/monge_elkan.py `sim = float(sum_of_maxes) / float(len(bag1))` which is essentially the mean max.

But in the implementation for textdistance, the score is given on line 222 of https://github.com/life4/textdistance/blob/master/textdistance/algorithms/token_based.py as
`sum(maxes) / len(seq) / len(maxes)`

I think the further division by len(maxes) isn't needed, and the line should just be `sum(maxes) / len(seq)`
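
The effect of the extra division can be seen in a small standalone sketch. It uses a toy exact-match similarity instead of Levenshtein, and the function names are hypothetical, so it illustrates only the two formulas quoted above and does not reproduce either library's scores.

```python
def best_matches(tokens_a, tokens_b, sim):
    """For each token in tokens_a, the best similarity against any token in tokens_b."""
    return [max(sim(a, b) for b in tokens_b) for a in tokens_a]


def mean_of_maxes(tokens_a, tokens_b, sim):
    """sum(maxes) / len(seq): the mean of the per-token maxima."""
    maxes = best_matches(tokens_a, tokens_b, sim)
    return sum(maxes) / len(tokens_a)


def with_extra_division(tokens_a, tokens_b, sim):
    """sum(maxes) / len(seq) / len(maxes): divides by the token count twice."""
    maxes = best_matches(tokens_a, tokens_b, sim)
    return sum(maxes) / len(tokens_a) / len(maxes)


def exact_match(a, b):
    """Toy similarity: 1.0 for identical tokens, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


left = ["Good", "Times"]
right = "The Good Times and The Bad Ones".split()

print(mean_of_maxes(left, right, exact_match))        # 1.0
print(with_extra_division(left, right, exact_match))  # 0.5
```

Both tokens on the left have a perfect match on the right, yet the second formula reports 0.5 because the sum is divided by the number of tokens twice.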

The change in the code could mess up tests elsewhere, so I'm not changing anything else. But thought I should bring this to your attention.

Below is some code and differing scores I got in textdistance and py_stringmatching.

``````
# score in textdistance
from textdistance import MongeElkan, levenshtein
ALG = MongeElkan
score = ALG(algorithm=levenshtein,qval=None,symmetric=False).similarity('Good Times!', "The Good Times and The Bad Ones")
score
# Got 2.25
``````
``````
# score in py_stringmatching
from py_stringmatching import MongeElkan
from py_stringmatching import Levenshtein as Levenshtein_2
ALG_2 = MongeElkan(sim_func=Levenshtein_2().get_raw_score)
source = 'Good Times!'
source_split = source.split()
target = "The Good Times and The Bad Ones"
target_split = target.split()
score2 = ALG_2.get_raw_score(source_split, target_split)
score2
# got 5.5
``````
opened by shijithpk 3
• #### Handle newer versions of abydos and jellyfish

abydos has changed its interface for distance metrics quite significantly, and jellyfish has changed the names of the functions. This patch addresses both of these issues.

opened by juliangilbey 3
• #### Ensure that maximum normalised distance is <= 1 and ...

textdistance is currently failing its test-suite on arm64 machines with Python 3.10, which is causing me problems on Debian. I have managed to track down the first of these bugs (and there are at least two more to come): there are some algorithms that use `upper()` before comparing the strings. As noted in the code already, though these algorithms were designed for English (ASCII only), this can cause `upper()` to change the length of the string if using non-English characters. And `hypothesis` does this when testing. This can result in the normalised distance being greater than 1. This patch addresses this by ensuring that the distance returned from the relevant algorithms is no greater than `self.maximum()`.

A second issue which arose when doing this was calculating the maximum distance for `Editex()`; the current function for calculating the maximum does not give the correct answer if `match_cost > mismatch_cost`, for example. But this would be a silly situation: why would we penalise matching characters more than mismatching ones? There are two ways of resolving this: the first is to calculate the maximum distance using `max(match_cost, group_cost, mismatch_cost)`, the second is to force the inequalities `match_cost <= group_cost <= mismatch_cost`. I have gone for the latter option in this patch.
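
Both points can be sketched in a few lines. The example relies on the fact that `'ß'.upper()` is `'SS'` in Python 3, so uppercasing can lengthen a string; the helpers `clamp_distance` and `check_costs` are hypothetical illustrations of the two fixes described above, not the code from the patch.

```python
def clamp_distance(distance, maximum):
    """Never report a distance above the maximum, so distance / maximum <= 1."""
    return min(distance, maximum)


def check_costs(match_cost, group_cost, mismatch_cost):
    """Enforce match_cost <= group_cost <= mismatch_cost."""
    if not match_cost <= group_cost <= mismatch_cost:
        raise ValueError("expected match_cost <= group_cost <= mismatch_cost")


# Uppercasing can change the length of a non-ASCII string.
print(len('ß'), len('ß'.upper()))  # 1 2

# A distance computed on the longer, uppercased string may exceed a
# maximum that was computed from the original lengths.
maximum = 4
raw_distance = 5
print(raw_distance / maximum)                           # 1.25
print(clamp_distance(raw_distance, maximum) / maximum)  # 1.0

check_costs(match_cost=0, group_cost=1, mismatch_cost=2)  # passes silently
```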

All being well, there will be more patches to come in the next few weeks as I get to the bottom of them!

opened by juliangilbey 2
• #### update rapidfuzz

Update rapidfuzz to the latest version, which provides a Damerau-Levenshtein implementation. It is the fastest of the supported libraries:

``````
| algorithm          | library                               | function                     |        time |
|--------------------+---------------------------------------+------------------------------+-------------|
| DamerauLevenshtein | rapidfuzz.distance.DamerauLevenshtein | distance                     | 0.00267046  |
| DamerauLevenshtein | jellyfish                             | damerau_levenshtein_distance | 0.022479    |
| DamerauLevenshtein | pyxdameraulevenshtein                 | damerau_levenshtein_distance | 0.0393475   |
| DamerauLevenshtein | **textdistance**                      | DamerauLevenshtein           | 0.589098    |
``````

In addition it is the only implementation which only requires linear memory.

opened by maxbachmann 1
• #### Fix numpy types warnings

Basic types have been deprecated in numpy 1.20. Here are the full warnings:

``````
DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
``````
``````
DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
``````

I don’t know the code enough to assess if the specific numpy types are required though.

opened by ArchangeGabriel 1
• #### Fix a setuptools warning

UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead

opened by ArchangeGabriel 1

Hi,

Noticed that the Travis CI link was wrong. Then found a few more links that appear to reference an old repository.

This PR tries to correct the links by replacing orsinium by life4 in some URL's.

And thanks for the great project, Bruno

opened by kinow 1

## What's Changed

• Run Python 3.10 tests on CI by @orsinium in https://github.com/life4/textdistance/pull/80
• Type annotations by @orsinium in https://github.com/life4/textdistance/pull/82
• Add new DamerauLevenshtein... classes by @juliangilbey in https://github.com/life4/textdistance/pull/84

Full Changelog: https://github.com/life4/textdistance/compare/4.4.0...4.5.0


## What's Changed

• update rapidfuzz by @maxbachmann in https://github.com/life4/textdistance/pull/83

Full Changelog: https://github.com/life4/textdistance/compare/4.3.0...4.4.0


## What's Changed

• Ensure that maximum normalised distance is <= 1 and ... by @juliangilbey in https://github.com/life4/textdistance/pull/78
• Ignore inconsistent timings on some comparison tests by @juliangilbey in https://github.com/life4/textdistance/pull/79
• add support for rapidfuzz by @maxbachmann in https://github.com/life4/textdistance/pull/77

## New Contributors

• @maxbachmann made their first contribution in https://github.com/life4/textdistance/pull/77

Full Changelog: https://github.com/life4/textdistance/compare/4.2.2...4.3.0


• #### v.4.2.0(Apr 13, 2020)

• Drop Python 2 support. We follow the official Python release cycle. Now CI runs for Python 3.6+. For 3.4 and 3.5 everything should still work but consider migration, it shouldn't be hard.
• We've migrated tests on pytest+hypothesis. It helped us to find a lot of bugs.
• Some fixes: a bug in Damerau-Levenshtein, normalization in Smith-Waterman, fix support for some unicode chars in Soundex.
• All classes now accept `external` argument even if they have no known external libs support.

• #### 2.0.1(Feb 10, 2018)
