🎐 a python library for doing approximate and phonetic matching of strings.

Overview

jellyfish

Jellyfish is a Python library for doing approximate and phonetic matching of strings.

Written by James Turk <[email protected]> and Michael Stephens.

See https://github.com/jamesturk/jellyfish/graphs/contributors for contributors.

See http://jellyfish.readthedocs.io for documentation.

Source is available at http://github.com/jamesturk/jellyfish.

Jellyfish >= 0.7 supports only Python 3. If you need Python 2, please use 0.6.x.

Included Algorithms

String comparison:

  • Levenshtein Distance
  • Damerau-Levenshtein Distance
  • Jaro Distance
  • Jaro-Winkler Distance
  • Match Rating Approach Comparison
  • Hamming Distance

Phonetic encoding:

  • American Soundex
  • Metaphone
  • NYSIIS (New York State Identification and Intelligence System)
  • Match Rating Codex

Example Usage

>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1
>>> jellyfish.metaphone(u'Jellyfish')
'JLFX'
>>> jellyfish.soundex(u'Jellyfish')
'J412'
>>> jellyfish.nysiis(u'Jellyfish')
'JALYF'
>>> jellyfish.match_rating_codex(u'Jellyfish')
'JLLFSH'

Running Tests

If you are interested in contributing to Jellyfish, you may want to run tests locally. Jellyfish uses tox to run tests, which you can set up and run as follows:

pip install tox
# cd jellyfish/
tox
Comments
  • added wagner_fischer_distance

    Hello,

    I added the Wagner-Fischer distance, an algorithm that finds the edit distance using dynamic programming. It gives essentially the same result as Levenshtein, but it has an advantage.

    With the Wagner-Fischer distance, you can look at the matrix and see how it works. You can traverse the minimal change path to find out exactly which characters in the string must be "edited".

    See the matrix, https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm#Calculating_distance
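The idea of inspecting the matrix can be sketched in plain Python. This is an illustrative sketch only, not the code from the pull request: the function names `wagner_fischer_matrix` and `edit_script` are hypothetical, and the backtrace picks one minimal path when several exist.

```python
def wagner_fischer_matrix(s, t):
    """Build the full (len(s)+1) x (len(t)+1) edit-distance matrix."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of s
    for j in range(cols):
        d[0][j] = j  # insert all j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d


def edit_script(s, t):
    """Walk the matrix from the bottom-right corner back to the origin
    and return one minimal list of (operation, index_in_s, old, new)."""
    d = wagner_fischer_matrix(s, t)
    i, j = len(s), len(t)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i - 1] == t[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1  # characters match, nothing to edit
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, s[i - 1], t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", i - 1, s[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", i, None, t[j - 1]))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    matrix = wagner_fischer_matrix("jellyfish", "smellyfish")
    print(matrix[-1][-1])  # the distance is the bottom-right cell
    for op in edit_script("jellyfish", "smellyfish"):
        print(op)
```

The bottom-right cell holds the distance, and the length of the returned script equals that distance, because every non-matching step on the path costs exactly one edit.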

    opened by ruchir594 19
  • stdbool.h issue during setup (Windows)

    Hey guys, I don't expect Windows to be a priority, but C99 isn't supported by Visual Studio any more, and the install of Jellyfish hits this error:

    jellyfish-master\cjellyfish\jellyfish.h(4) : fatal error C1083: Cannot open include file: 'stdbool.h': No such file or directory

    There's a lot of info out there on how to work around this issue, but maybe you've already solved it?

    Thanks,

    -James

    opened by jgentes 12
  • change to MIT license

    jellyfish is excellent, I'm using it in Julia with some success.

    I was thinking of porting it to Julia properly, e.g. creating a registered package. However, Julia is trying hard not to be constrained by the BSD license.

    Is there any possibility of changing the license to MIT?

    opened by samuelcolvin 10
  • ld returned 1 exit status

    Hello,

    I am attempting to install a module that is dependent upon jellyfish, but I can't seem to get jellyfish to install. I have tried to install using pip and from source. I get the same error every time:

    collect2.exe: error: ld returned 1 exit status
    error: command 'gcc' failed with exit status 1

    I have been unable to find a thread discussing this issue with jellyfish, and I am afraid I don't yet know enough to modify remedies used for different modules. Any thoughts?

    Please see the install output below:

    C:\Users\choct155\Python\Modules\jellyfish\jellyfish-0.2.0>gcc --version
    gcc (tdm-1) 4.7.1
    Copyright (C) 2012 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

    C:\Users\choct155\Python\Modules\jellyfish\jellyfish-0.2.0>python setup.py install
    running install
    running bdist_egg
    running egg_info
    writing jellyfish.egg-info\PKG-INFO
    writing top-level names to jellyfish.egg-info\top_level.txt
    writing dependency_links to jellyfish.egg-info\dependency_links.txt
    reading manifest file 'jellyfish.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'jellyfish.egg-info\SOURCES.txt'
    installing library code to build\bdist.win32\egg
    running install_lib
    running build_ext
    building 'jellyfish' extension
    C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c jellyfishmodule.c -o build\temp.win32-2.7\Release\jellyfishmodule.o
    jellyfishmodule.c:319:5: warning: initialization from incompatible pointer type [enabled by default]
    jellyfishmodule.c:319:5: warning: (near initialization for 'jellyfish_methods[0].ml_meth') [enabled by default]
    jellyfishmodule.c:323:5: warning: initialization from incompatible pointer type [enabled by default]
    jellyfishmodule.c:323:5: warning: (near initialization for 'jellyfish_methods[1].ml_meth') [enabled by default]
    jellyfishmodule.c:327:5: warning: initialization from incompatible pointer type [enabled by default]
    jellyfishmodule.c:327:5: warning: (near initialization for 'jellyfish_methods[2].ml_meth') [enabled by default]
    C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c jaro.c -o build\temp.win32-2.7\Release\jaro.o
    jaro.c: In function '_jaro_winkler':
    jaro.c:52:5: warning: implicit declaration of function 'alloca' [-Wimplicit-function-declaration]
    jaro.c:52:17: warning: incompatible implicit declaration of built-in function 'alloca' [enabled by default]
    C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c hamming.c -o build\temp.win32-2.7\Release\hamming.o
    C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c levenshtein.c -o build\temp.win32-2.7\Release\levenshtein.o
    C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c damerau_levenshtein.c -o build\temp.win32-2.7\Release\damerau_levenshtein.o
    C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c mra.c -o build\temp.win32-2.7\Release\mra.o
    C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c soundex.c -o build\temp.win32-2.7\Release\soundex.o
    C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c metaphone.c -o build\temp.win32-2.7\Release\metaphone.o
    C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c nysiis.c -o build\temp.win32-2.7\Release\nysiis.o
    nysiis.c: In function 'nysiis':
    nysiis.c:13:5: warning: implicit declaration of function 'alloca' [-Wimplicit-function-declaration]
    nysiis.c:13:18: warning: incompatible implicit declaration of built-in function 'alloca' [enabled by default]
    C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c porter.c -o build\temp.win32-2.7\Release\porter.o
    porter.c: In function 'step5':
    porter.c:362:7: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
    writing build\temp.win32-2.7\Release\jellyfish.def
    C:\MinGW32\bin\gcc.exe -shared -s build\temp.win32-2.7\Release\jellyfishmodule.o build\temp.win32-2.7\Release\jaro.o build\temp.win32-2.7\Release\hamming.o build\temp.win32-2.7\Release\levenshtein.o build\temp.win32-2.7\Release\damerau_levenshtein.o build\temp.win32-2.7\Release\mra.o build\temp.win32-2.7\Release\soundex.o build\temp.win32-2.7\Release\metaphone.o build\temp.win32-2.7\Release\nysiis.o build\temp.win32-2.7\Release\porter.o build\temp.win32-2.7\Release\jellyfish.def -LC:\Python27\libs -LC:\Python27\PCbuild -lpython27 -o build\lib.win32-2.7\jellyfish.pyd
    build\temp.win32-2.7\Release\jellyfishmodule.o:jellyfishmodule.c:(.text+0x188): undefined reference to `_imp___Py_TrueStruct'
    build\temp.win32-2.7\Release\jellyfishmodule.o:jellyfishmodule.c:(.text+0x191): undefined reference to `_imp___Py_ZeroStruct'
    build\temp.win32-2.7\Release\jellyfishmodule.o:jellyfishmodule.c:(.text+0x2ab): undefined reference to `_imp__PyExc_TypeError'
    build\temp.win32-2.7\Release\jellyfishmodule.o:jellyfishmodule.c:(.text+0x5f8): undefined reference to `_imp__PyExc_TypeError'
    collect2.exe: error: ld returned 1 exit status
    error: command 'gcc' failed with exit status 1

    C:\Users\choct155\Python\Modules\jellyfish\jellyfish-0.2.0>

    opened by choct155 10
  • Add property-based tests to ensure C and Python implementations of the same algorithms behave identically

    This PR adds a suite of property-based tests that ensure that the C and Python implementations of the same algorithms behave identically.

    The tests show that we have a lot of cleanup work to do, with Hypothesis uncovering failing test cases for all but 3 of Jellyfish's algorithms:

    $ pytest -v jellyfish/test_properties.py 
    ====================================================== test session starts =======================================================
    platform darwin -- Python 3.5.2, pytest-3.0.4, py-1.4.31, pluggy-0.4.0 -- python3
    cachedir: .cache
    rootdir: .../jellyfish, inifile: 
    plugins: hypothesis-3.6.1
    collected 11 items 
    
    jellyfish/test_properties.py::test_py_soundex_equals_c_soundex FAILED
    jellyfish/test_properties.py::test_py_nysiis_equals_c_nysiis FAILED
    jellyfish/test_properties.py::test_py_match_rating_codex_equals_c_match_rating_codex FAILED
    jellyfish/test_properties.py::test_py_metaphone_equals_c_metaphone FAILED
    jellyfish/test_properties.py::test_py_porter_stem_equals_c_porter_stem FAILED
    jellyfish/test_properties.py::test_py_levenshtein_distance_equals_c_levenshtein_distance PASSED
    jellyfish/test_properties.py::test_py_damerau_levenshtein_distance_equals_c_damerau_levenshtein_distance FAILED
    jellyfish/test_properties.py::test_py_hamming_distance_equals_c_hamming_distance PASSED
    jellyfish/test_properties.py::test_py_jaro_distance_equals_c_jaro_distance FAILED
    jellyfish/test_properties.py::test_py_jaro_winkler_equals_c_jaro_winkler PASSED
    jellyfish/test_properties.py::test_py_match_rating_comparison_equals_c_match_rating_comparison Segmentation fault: 11
    

    The first problem to be addressed, in my view, is the segmentation fault reported in #73, since that interrupts the test run and prevents Hypothesis from providing a clean report of failing test cases.

    Here are some of the failing test cases:

    $ pytest -s jellyfish/test_properties.py 
    ====================================================== test session starts =======================================================
    platform darwin -- Python 3.5.2, pytest-3.0.4, py-1.4.31, pluggy-0.4.0
    rootdir: .../jellyfish, inifile: 
    plugins: hypothesis-3.6.1
    collected 11 items 
    
    jellyfish/test_properties.py
    Falsifying example: test_py_soundex_equals_c_soundex(s='ı')
    Falsifying example: test_py_nysiis_equals_c_nysiis(s='\x80')
    Falsifying example: test_py_match_rating_codex_equals_c_match_rating_codex(s='\x80')
    Falsifying example: test_py_metaphone_equals_c_metaphone(s='0H')
    Falsifying example: test_py_porter_stem_equals_c_porter_stem(s='\x00')
    Falsifying example: test_py_damerau_levenshtein_distance_equals_c_damerau_levenshtein_distance(s1='Ā', s2='')
    Falsifying example: test_py_jaro_distance_equals_c_jaro_distance(s1='100200', s2='021')
    

    This PR updates the Travis CI build to run these tests, so merging it now will technically break the build. We can do that, or we can keep this PR open and periodically re-run the build to track our progress as we clean up these bugs, merging it only when everything has been taken care of.
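The general shape of such a test, comparing two implementations of the same algorithm on generated inputs, can be sketched with the standard library alone. This is not the PR's test suite: it uses `random` instead of Hypothesis (so there is no shrinking of failing cases), and both Levenshtein functions below are hypothetical stand-ins for the C and Python implementations.

```python
import random
from functools import lru_cache


def levenshtein_reference(s, t):
    """Slow, direct transcription of the recursive definition."""
    @lru_cache(maxsize=None)
    def dist(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        return min(
            dist(i - 1, j) + 1,
            dist(i, j - 1) + 1,
            dist(i - 1, j - 1) + (s[i - 1] != t[j - 1]),
        )
    return dist(len(s), len(t))


def levenshtein_two_rows(s, t):
    """Iterative version that keeps only the previous matrix row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def random_text(rng, max_len=8, alphabet="ab0H\x00\x80\u0131\u0100"):
    """Short random string that includes awkward characters on purpose."""
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))


def find_disagreement(f, g, trials=2000, seed=0):
    """Return the first (s1, s2) where f and g differ, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        s1, s2 = random_text(rng), random_text(rng)
        if f(s1, s2) != g(s1, s2):
            return s1, s2
    return None


if __name__ == "__main__":
    print(find_disagreement(levenshtein_reference, levenshtein_two_rows))
```

A fixed seed keeps a failure reproducible between runs, which matters when the build is expected to stay red until the underlying bugs are fixed.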

    Fixes #68.

    opened by nchammas 9
  • Damerau-Levenshtein distance doesn't work for higher unicode symbols

    The Damerau-Levenshtein distance cannot be calculated via the cjellyfish extension if either of the two words contains certain Unicode characters (e.g. Russian letters or accented letters such as ŭ).

    Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
    [GCC 5.3.1 20160413] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import jellyfish.cjellyfish as cjellyfish
    >>> import jellyfish._jellyfish as pyjellyfish
    >>> pyjellyfish.damerau_levenshtein_distance(u'хлеб', u'пиво')
    4
    >>> cjellyfish.damerau_levenshtein_distance(u'хлеб', u'пиво')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: Encountered unsupported code point in string.
    >>> cjellyfish.damerau_levenshtein_distance(u'tets', u'test')
    1
    >>> 
    
    $ pip show jellyfish
    
    ---
    Metadata-Version: 2.0
    Name: jellyfish
    Version: 0.5.6
    
    opened by kammala 9
  • Price-matching rockymadden's stringmetric

    • Phonetic Algorithm

      • Double Metaphone (Queued phonetic metric and algorithm)
      • NYSIIS (Phonetic metric and algorithm)
      • Refined NYSIIS (Phonetic metric and algorithm)
      • Refined Soundex (Phonetic metric and algorithm)
      • Soundex (Phonetic metric and algorithm)
    • Similarity Metrics

      • Dice / Sorensen (Similarity metric)
      • Jaccard (Similarity metric)
      • Monge-Elkan (Queued similarity metric)
      • Needleman-Wunsch (Queued similarity metric)
      • N-Gram (Similarity metric)
      • Overlap (Similarity metric)
      • Ratcliff-Obershelp (Similarity metric)
      • Tanimoto (Queued similarity metric)
      • Tversky (Queued similarity metric)
      • Smith-Waterman (Queued similarity metric)
      • Weighted Levenshtein (Similarity metric)

    Link: https://github.com/rockymadden/stringmetric
    Most edited branch: https://github.com/halfmatthalfcat/stringmetric

    opened by DonaldTsang 7
  • Any plans on providing wheels?

    Is it possible to provide wheels based on https://github.com/pypa/manylinux? That would make it easier to install this package inside Docker images.

    EDIT: I forked this repo and created wheels using manylinux: https://github.com/jezeniel/jellyfish-wheel. I will deprecate my fork once this is resolved.

    feature request 
    opened by jezeniel 6
  • jellyfish 0.6.1 is not working on python 3.6

    I have Python 3.6 on my Windows PC. I am not able to import jellyfish, and I am getting the error below.

    import jellyfish
    jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')

    Error:

    ModuleNotFoundError Traceback (most recent call last) in ()
    ----> 1 import jellyfish
          2 jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')

    ModuleNotFoundError: No module named 'jellyfish'
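A `ModuleNotFoundError` like this often means the package was installed for a different interpreter than the one running the code, although the thread does not confirm that this was the cause here. A small diagnostic sketch using only the standard library (the helper name `is_importable` is hypothetical):

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if the running interpreter can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # The interpreter path shows which Python installation is actually in use.
    print("interpreter:", sys.executable)
    print("jellyfish importable:", is_importable("jellyfish"))
```

If the second line prints `False`, installing with `python -m pip install jellyfish` from that same interpreter is one way to make sure the package lands in the matching environment.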

    opened by GuruMahesh4444 6
  • damerau_levenshtein.c segfault

    First off, thanks for the library! It's been great.

    Now the issue: I've been getting a segfault (reproduced on both Ubuntu and OS X) in damerau_levenshtein_distance, called from the Python library. The strings being compared were "mylifeoutdoors" and "нахлыст".

    Poking around in gdb, it looks like the unicode characters cause a bad lookup in the "da" array. If the Cyrillic characters are out of scope for this library, would you be opposed to a change to detect out of bounds code points?
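To illustrate why a fixed-size lookup table is the weak point, here is a pure-Python sketch of the unrestricted Damerau-Levenshtein distance in which `da` is a dictionary keyed by character, so any code point is a valid key. It is an assumption about one way to sidestep the out-of-bounds lookup, not the fix adopted by the library, and the variable names merely mirror those in the stack trace.

```python
def damerau_levenshtein_distance(s1, s2):
    """Unrestricted Damerau-Levenshtein distance over arbitrary code points."""
    len1, len2 = len(s1), len(s2)
    infinite = len1 + len2

    # da maps each character to the last row of s1 in which it was seen.
    # A dict has no upper bound on the code points it can hold.
    da = {}

    # The score matrix has one extra border row and column set to `infinite`.
    score = [[0] * (len2 + 2) for _ in range(len1 + 2)]
    score[0][0] = infinite
    for i in range(len1 + 1):
        score[i + 1][0] = infinite
        score[i + 1][1] = i
    for j in range(len2 + 1):
        score[0][j + 1] = infinite
        score[1][j + 1] = j

    for i in range(1, len1 + 1):
        db = 0  # last column in this row where the characters matched
        for j in range(1, len2 + 1):
            i1 = da.get(s2[j - 1], 0)
            j1 = db
            cost = 1
            if s1[i - 1] == s2[j - 1]:
                cost = 0
                db = j
            score[i + 1][j + 1] = min(
                score[i][j] + cost,                               # substitution
                score[i + 1][j] + 1,                              # insertion
                score[i][j + 1] + 1,                              # deletion
                score[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),  # transposition
            )
        da[s1[i - 1]] = i

    return score[len1 + 1][len2 + 1]


if __name__ == "__main__":
    print(damerau_levenshtein_distance("mylifeoutdoors", "нахлыст"))
```

A C implementation could get the same effect either by bounds-checking each code point before indexing the table and raising an error, or by replacing the table with a hash map.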

    Jellyfish version: 0.5.3

    Stack trace:

    #0 0x00007feea2c5a340 in damerau_levenshtein_distance (s1=0x7fee9aeee630, s2=0x7fee9aea3eb0, len1=14, len2=7)
    at cjellyfish/damerau_levenshtein.c:58
        infinite = 21
        cols = 9
        i = 1
        j = 7
        i1 = 140662777395312
        j1 = 0
        db = 0
        d1 = 7
        d2 = 7
        d3 = 8
        d4 = <error reading variable d4 (Cannot access memory at address 0x23fb1b90e40100)>
        result = <optimized out>
        dist = 0x18c2180
        da = 0x1c7c510
    

    #1 0x00007feea2c597b6 in jellyfish_damerau_levenshtein_distance (self=, args=)
    at cjellyfish/jellyfishmodule.c:156
        s1 = 0x7fee9aeee630
        s2 = 0x7fee9aea3eb0
        len1 = 14
        len2 = 7
        result = <optimized out>
    
    opened by tsellon 6
  • String Functions should operate on wchar_t instead of char

    First, I'd like to say that this library is awesome. It has all the greats in one place and doesn't try to do anything too fancy. The implementations are clean as well.

    Currently, the library doesn't really have support for unicode strings. If I try to submit a unicode string with a non-ascii character, I get a traceback:

    >>> jellyfish.hamming_distance(u'\u725b'.encode('utf8'), u'\u4faf')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u4faf' in position 0: ordinal not in range(128)
    

    In addition, when one does encode them so that they will appropriately convert to an encoded string (non UCS-4 unicode), the incorrect answer is given. This example should give 1, rather than 3:

    >>> jellyfish.hamming_distance(u'\u725b'.encode('utf8'), u'\u4faf'.encode('utf8'))
    3
    

    This obviously doesn't make sense for things like Soundex or other English-only algorithms, but as a general rule, Python libraries should take only unicode objects for string operations.
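The difference between comparing code points and comparing encoded bytes can be reproduced in Python 3 with a few lines. The `hamming_distance` below is a minimal stand-in written for this illustration, not the library's implementation; counting leftover length as differences is a choice made for this sketch.

```python
def hamming_distance(a, b):
    """Count differing positions; leftover length counts as differences."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


first, second = "\u725b", "\u4faf"

# Compared as text, the two strings are one code point each.
print(hamming_distance(first, second))  # 1

# Compared as UTF-8 bytes, each character becomes three bytes,
# and here all three byte positions differ.
print(hamming_distance(first.encode("utf8"), second.encode("utf8")))  # 3
```

The byte-level result measures the encoding rather than the text, which is why the answer depends on whether the function iterates over characters or over `char` values.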

    Patching this library to support unicode objects rather than string objects shouldn't be too bad. You just need to replace PyString_FromString with PyUnicode_FromWideChar and update the functions to use wchar_t. Here is the Python C API unicode reference:

    http://docs.python.org/c-api/unicode.html

    If you want, I can fork this and send you a pull request. Just let me know.

    opened by stevvooe 6