Levenshtein and Hamming distance computation

Overview

distance - Utilities for comparing sequences

This package provides helpers for computing similarities between arbitrary sequences. Included metrics are Levenshtein, Hamming, Jaccard, and Sorensen distance, plus some bonuses. All distance computations are implemented in pure Python, and most of them are also implemented in C.

Installation

If you don't want or need to use the C extension, just unpack the archive and run, as root:

# python setup.py install

For the C extension to work, you need the Python source files, and a C compiler (typically Microsoft Visual C++ 2010 on Windows, and GCC on Mac and Linux). On a Debian-like system, you can get all of these with:

# apt-get install gcc pythonX.X-dev

where X.X is the number of your Python version.

Then you should type:

# python setup.py install --with-c

Note the use of the --with-c switch.

Usage

A common use case for this module is to compare single words for similarity:

>>> distance.levenshtein("lenvestein", "levenshtein")
3
>>> distance.hamming("hamming", "hamning")
1

If there is not a one-to-one mapping between sounds and glyphs in your language, or if you want to compare not glyphs, but syllables or phonems, you can pass in tuples of characters:

>>> t1 = ("de", "ci", "si", "ve")
>>> t2 = ("de", "ri", "si", "ve")
>>> distance.levenshtein(t1, t2)
1

Comparing lists of strings can also be useful for computing similarities between sentences, paragraphs, etc.:

>>> sent1 = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
>>> sent2 = ['the', 'lazy', 'fox', 'jumps', 'over', 'the', 'crazy', 'dog']
>>> distance.levenshtein(sent1, sent2)
3

Hamming and Levenshtein distance can be normalized, so that the results of several distance measures can be meaningfully compared. Two strategies are available for Levenshtein: either the length of the shortest alignment between the sequences is taken as factor, or the length of the longer one. Example uses:

>>> distance.hamming("fat", "cat", normalized=True)
0.3333333333333333
>>> distance.nlevenshtein("abc", "acd", method=1)  # shortest alignment
0.6666666666666666
>>> distance.nlevenshtein("abc", "acd", method=2)  # longest alignment
0.5

jaccard and sorensen return a normalized value per default:

>>> distance.sorensen("decide", "resize")
0.5555555555555556
>>> distance.jaccard("decide", "resize")
0.7142857142857143

As for the bonuses, there is a fast_comp function, which computes the distance between two strings up to a value of 2 included. If the distance between the strings is higher than that, -1 is returned. This function is of limited use, but on the other hand it is quite faster than levenshtein. There is also a lcsubstrings function which can be used to find the longest common substrings in two sequences.

Finally, two convenience iterators ilevenshtein and ifast_comp are provided, which are intended to be used for filtering from a long list of sequences the ones that are close to a reference one. They both return a series of tuples (distance, sequence). Example:

>>> tokens = ["fo", "bar", "foob", "foo", "fooba", "foobar"]
>>> sorted(distance.ifast_comp("foo", tokens))
[(0, 'foo'), (1, 'fo'), (1, 'foob'), (2, 'fooba')]
>>> sorted(distance.ilevenshtein("foo", tokens, max_dist=1))
[(0, 'foo'), (1, 'fo'), (1, 'foob')]

ifast_comp is particularly efficient, and can handle 1 million tokens without a problem.

For more informations, see the functions documentation (help(funcname)).

Have fun!

Changelog

20/11/13:

  • Switched back to using the to-be-deprecated Python unicode api. Good news is that this makes the C extension compatible with Python 2.7+, and that distance computations on unicode strings is now much faster.
  • Added a C version of lcsubstrings.
  • Added a new method for computing normalized Levenshtein distance.
  • Added some tests.

12/11/13: Expanded fast_comp (formerly quick_levenshtein) so that it can handle transpositions. Fixed variable interversions in (C) levenshtein which produced sometimes strange results.

10/11/13: Added quick_levenshtein and iquick_levenshtein.

05/11/13: Added Sorensen and Jaccard metrics, fixed memory issue in Levenshtein.

Comments
  • question

    question

    hi,

    I use blockhash that returns strings of 64 characters corresponding to 256 bits (eg. 3fc03f803fc03f861f063f861f861f0e1f0f160f3e0f301f7e3f3e3f003f0038 for example) : for best results, can the string be used directly or must it be converted beforehand ?

    thanks for the advice, lacsaP.

    opened by patatetom 5
  • Feature request for a hamming circle?

    Feature request for a hamming circle?

    Would you be willing to include a function to create words of hamming distance n in a given alphabet from a target word?

    Let me know if this draft looks OK and I can create a PR with tests. I won't be able to write the C port however.

    from itertools import combinations, product
    
    def hamming_circle(word, n, alphabet):
        """Create all words of hamming distance `n` in a given alphabet from a
        target word.
    
        Examples
        ---
        >>> sorted(hamming_circle('abc', 0, 'abc'))
        ['abc']
        >>> sorted(hamming_circle('abc', 1, 'abc'))
        ['aac', 'aba', 'abb', 'acc', 'bbc', 'cbc']
        >>> sorted(hamming_circle('aaa', 2, 'ab'))
        ['abb', 'bab', 'bba']
    
        """
        for positions in combinations(range(len(word)), n):
            for replacements in product(range(len(alphabet)), repeat=n):
                skip = False
                cousin = list(word)
    
                for position, replacement in zip(positions, replacements):
                    if cousin[position] == alphabet[replacement]:
                        skip = True
                    else:
                        cousin[position] = alphabet[replacement]
    
                if not skip:
                    yield ''.join(cousin)
    
    opened by clintval 0
  • Jaccard Distance 0 for unidentical strings

    Jaccard Distance 0 for unidentical strings

    The Jaccard distance gives a value of 0 for non-identical strings. E.g.

    string = 'eastern maine medical center'
    string2 = 'cascade medical center'
    distance.jaccard(string, string2)
    0.0
    
    opened by aniruddhadave 0
  • Jaccard and Sorensen distances don't check for empty strings

    Jaccard and Sorensen distances don't check for empty strings

    If the input on these two functions is two empty strings then I get the error: ZeroDivisionError: float division by zero. So I added a check whether we get empty strings (attached .txt file). _simpledists.txt

    opened by imoutidi 0
  • error C2099:  initializer is not a constant

    error C2099: initializer is not a constant

    hello , when i tried to compile it, whith c "python setup.py install --with-c" using msvc 2017 on windows 10, i got this error ,

    Cdistance / distance.c (647): error C2099: initializer is not a constant Cdistance / distance.c (689): error C2099: initializer is not a constant Cdistance / distance.c (762): warning C4090: 'function': different 'const' qualifiers Error: command 'C: \ Program Files (x86) \ Microsoft Visual Studio \ 2017 \ BuildTools \ VC \ Tools \ MSVC \ 14.10.25017 \ bin \ HostX64 \ x64 \ cl .exe 'failed with exit status 2

    thanks

    opened by amiarSlimane 0
  • Unicode comparison warning

    Unicode comparison warning

    I get warnings using the levenshtein distance regarding unicode comparisons:

    lib/python2.7/site-packages/distance/_levenshtein.py:100: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      if seq1 == seq2:
    lib/python2.7/site-packages/distance/_levenshtein.py:39: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      if seq1 == seq2:
    lib/python2.7/site-packages/distance/_levenshtein.py:60: UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      cost = int(seq1[x - 1] != seq2[y - 1])
    
    opened by FraBle 0
Owner
null
Mirco Ravanelli 2.3k Dec 27, 2022
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

TextDistance TextDistance -- python library for comparing distance between two or more sequences by many algorithms. Features: 30+ algorithms Pure pyt

Life4 3k Jan 6, 2023
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

TextDistance TextDistance -- python library for comparing distance between two or more sequences by many algorithms. Features: 30+ algorithms Pure pyt

Life4 1.9k Feb 18, 2021
Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

null 186 Dec 24, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Dec 30, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Jan 3, 2023
C.J. Hutto 3.8k Dec 30, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 10.8k Feb 18, 2021
C.J. Hutto 2.8k Feb 18, 2021
Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

null 730 Jan 9, 2023
:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

Dedupe Python Library dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on

Dedupe.io 3.6k Jan 2, 2023
Jarvis is a simple Chatbot with a GUI capable of chatting and retrieving information and daily news from the internet for it's user.

J.A.R.V.I.S Kindly consider starring this repository if you like the program :-) What/Who is J.A.R.V.I.S? J.A.R.V.I.S is an chatbot written that is bu

Epicalable 50 Dec 31, 2022
Client library to download and publish models and other files on the huggingface.co hub

huggingface_hub Client library to download and publish models and other files on the huggingface.co hub Do you have an open source ML library? We're l

Hugging Face 644 Jan 1, 2023
Data loaders and abstractions for text and NLP

torchtext This repository consists of: torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vecto

null 3.2k Dec 30, 2022
:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

Dedupe Python Library dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on

Dedupe.io 2.9k Feb 11, 2021
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.

Kashgari Overview | Performance | Installation | Documentation | Contributing ?? ?? ?? We released the 2.0.0 version with TF2 Support. ?? ?? ?? If you

Eliyar Eziz 2.3k Dec 29, 2022
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

Hiroki Nakayama 1.5k Dec 5, 2022