Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

SeatGeek

Last update: Jan 1, 2023

Overview

TheFuzz

Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

Requirements

Python 2.7 or higher
difflib
python-Levenshtein (optional, provides a 4-10x speedup in String Matching, though may result in differing results for certain cases)

For testing

pycodestyle
hypothesis
pytest

Installation

Using PIP via PyPI

pip install thefuzz

or the following to install python-Levenshtein too

pip install thefuzz[speedup]

Using PIP via Github

pip install git+git://github.com/seatgeek/[email protected]#egg=thefuzz

Adding to your requirements.txt file (run pip install -r requirements.txt afterwards)

git+ssh://[email protected]/seatgeek/[email protected]#egg=thefuzz

Manually via GIT

git clone git://github.com/seatgeek/thefuzz.git thefuzz
cd thefuzz
python setup.py install

Usage

>>> from thefuzz import fuzz
>>> from thefuzz import process

Simple Ratio

>>> fuzz.ratio("this is a test", "this is a test!")
    97
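
As a rough sketch of what a score like this measures, the snippet below reproduces the example with the standard library's difflib (listed under Requirements). The helper name simple_ratio is hypothetical, and this only approximates the idea; it is not presented as TheFuzz's actual implementation, and results may differ when python-Levenshtein is installed.

```python
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """Return a 0-100 similarity score for two strings (illustrative helper)."""
    # SequenceMatcher.ratio() is 2*M/T, where M is the number of matched
    # characters and T is the total number of characters in both strings.
    matcher = SequenceMatcher(None, s1, s2)
    return int(round(100 * matcher.ratio()))


# 14 characters match, 29 characters in total: 28/29 rounds to 97.
print(simple_ratio("this is a test", "this is a test!"))  # 97
```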

Partial Ratio

>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100

Token Sort Ratio

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100
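
The idea behind a token-sort comparison can be sketched as follows: split each string into words, sort the words, and compare the re-joined strings, so that word order stops mattering. The function name token_sort_ratio_sketch is hypothetical and the scoring uses difflib as a stand-in, so treat it as an illustration under those assumptions.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(s1, s2):
    """Compare two strings after sorting their whitespace-separated tokens."""
    # Lowercase, split on whitespace, and sort so word order is ignored.
    sorted1 = " ".join(sorted(s1.lower().split()))
    sorted2 = " ".join(sorted(s2.lower().split()))
    return int(round(100 * SequenceMatcher(None, sorted1, sorted2).ratio()))


# Both inputs sort to "a bear fuzzy was wuzzy", so they compare as identical.
print(token_sort_ratio_sketch("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
```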

Token Set Ratio

>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    100
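
A token-set comparison can be sketched along these lines: reduce each string to its set of unique words, then compare the shared words against each side's full set and keep the best score. This is a simplified illustration built on difflib; the name token_set_ratio_sketch and the exact combination rule are assumptions, not the library's code.

```python
from difflib import SequenceMatcher


def _ratio(s1, s2):
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def token_set_ratio_sketch(s1, s2):
    """Compare two strings by their sets of unique tokens."""
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))
    # Shared tokens followed by whatever is unique to each side.
    combined1 = (common + " " + " ".join(sorted(tokens1 - tokens2))).strip()
    combined2 = (common + " " + " ".join(sorted(tokens2 - tokens1))).strip()
    return max(_ratio(common, combined1),
               _ratio(common, combined2),
               _ratio(combined1, combined2))


# The duplicated "fuzzy" disappears once the tokens are treated as a set.
print(token_set_ratio_sketch("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100
```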

Process

>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
    [('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
    ("Dallas Cowboys", 90)
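
Conceptually, extraction scores every choice against the query and keeps the best few. The sketch below shows that shape with a plain difflib scorer. The names score and extract_sketch are hypothetical, and the scorer here is simpler than the library's default, so the scores it produces will not necessarily match the ones shown above.

```python
import heapq
from difflib import SequenceMatcher


def score(query, choice):
    """Case-insensitive 0-100 similarity (stand-in scorer for this sketch)."""
    return int(round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio()))


def extract_sketch(query, choices, limit=5, scorer=score):
    """Return the `limit` best (choice, score) pairs, highest score first."""
    scored = ((choice, scorer(query, choice)) for choice in choices)
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
top = extract_sketch("new york jets", choices, limit=2)
print(top[0])  # ('New York Jets', 100)
```

Passing a different function as scorer swaps the comparison strategy without changing the selection logic.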

You can also pass additional parameters to the extractOne method to make it use a specific scorer. A typical use case is matching file paths:

>>> process.extractOne("System of a down - Hypnotize - Heroin", songs)
    ('/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3', 86)
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)
    ("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61)

Comments

Incompatible with python-Levenshtein >=0.20

>>> process.extractOne('abc', ['abc', 'def', 'ghi'])
Traceback (most recent call last):
  File "/home/ziddey/projects/biyombo/venv_biyombo/lib/python3.10/site-packages/thefuzz/process.py", line 108, in extractWithoutOrder
    for key, choice in choices.items():
AttributeError: 'list' object has no attribute 'items'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ziddey/projects/biyombo/venv_biyombo/lib/python3.10/site-packages/thefuzz/process.py", line 168, in extract
    return heapq.nlargest(limit, sl, key=lambda i: i[1]) if limit is not None else \
  File "/usr/lib/python3.10/heapq.py", line 563, in nlargest
    result = [(key(elem), i, elem) for i, elem in zip(range(0, -n, -1), it)]
  File "/usr/lib/python3.10/heapq.py", line 563, in <listcomp>
    result = [(key(elem), i, elem) for i, elem in zip(range(0, -n, -1), it)]
  File "/home/ziddey/projects/biyombo/venv_biyombo/lib/python3.10/site-packages/thefuzz/process.py", line 117, in extractWithoutOrder
    score = scorer(processed_query, processed)
  File "/home/ziddey/projects/biyombo/venv_biyombo/lib/python3.10/site-packages/thefuzz/fuzz.py", line 276, in WRatio
    base = ratio(p1, p2)
  File "/home/ziddey/projects/biyombo/venv_biyombo/lib/python3.10/site-packages/thefuzz/utils.py", line 38, in decorator
    return func(*args, **kwargs)
  File "/home/ziddey/projects/biyombo/venv_biyombo/lib/python3.10/site-packages/thefuzz/utils.py", line 29, in decorator
    return func(*args, **kwargs)
  File "/home/ziddey/projects/biyombo/venv_biyombo/lib/python3.10/site-packages/thefuzz/utils.py", line 47, in decorator
    return func(*args, **kwargs)
  File "/home/ziddey/projects/biyombo/venv_biyombo/lib/python3.10/site-packages/thefuzz/fuzz.py", line 28, in ratio
    return utils.intr(100 * m.ratio())
  File "/home/ziddey/projects/biyombo/venv_biyombo/lib/python3.10/site-packages/thefuzz/StringMatcher.py", line 64, in ratio
    self._ratio = ratio(self._str1, self._str2)
NameError: name 'ratio' is not defined

python-Levenshtein is now a metapackage, installing Levenshtein

https://github.com/maxbachmann/Levenshtein/blob/main/src/Levenshtein/__init__.py

opened by ziddey 3

thefuzz[speedup] causes error in python-levenshtein wheel

ERROR: Failed building wheel for python-levenshtein

This seems to be a common issue with the levenshtein package, see https://stackoverflow.com/questions/37676623/cant-install-levenshtein-distance-package-on-windows-python-3-5

Full log

  Building wheel for python-levenshtein (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [30 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-3.9
      creating build/lib.linux-x86_64-3.9/Levenshtein
      copying Levenshtein/__init__.py -> build/lib.linux-x86_64-3.9/Levenshtein
      copying Levenshtein/StringMatcher.py -> build/lib.linux-x86_64-3.9/Levenshtein
      running egg_info
      writing python_Levenshtein.egg-info/PKG-INFO
      writing dependency_links to python_Levenshtein.egg-info/dependency_links.txt
      writing entry points to python_Levenshtein.egg-info/entry_points.txt
      writing namespace_packages to python_Levenshtein.egg-info/namespace_packages.txt
      writing requirements to python_Levenshtein.egg-info/requires.txt
      writing top-level names to python_Levenshtein.egg-info/top_level.txt
      reading manifest file 'python_Levenshtein.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      warning: no previously-included files matching '*pyc' found anywhere in distribution
      warning: no previously-included files matching '*so' found anywhere in distribution
      warning: no previously-included files matching '.project' found anywhere in distribution
      warning: no previously-included files matching '.pydevproject' found anywhere in distribution
      writing manifest file 'python_Levenshtein.egg-info/SOURCES.txt'
      copying Levenshtein/_levenshtein.c -> build/lib.linux-x86_64-3.9/Levenshtein
      copying Levenshtein/_levenshtein.h -> build/lib.linux-x86_64-3.9/Levenshtein
      running build_ext
      building 'Levenshtein._levenshtein' extension
      creating build/temp.linux-x86_64-3.9
      creating build/temp.linux-x86_64-3.9/Levenshtein
      x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/azureuser/virtualenvs/datatog-venv/include -I/usr/include/python3.9 -c Levenshtein/_levenshtein.c -o build/temp.linux-x86_64-3.9/Levenshtein/_levenshtein.o
      error: command 'x86_64-linux-gnu-gcc' failed: No such file or directory
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for python-levenshtein
  Running setup.py clean for python-levenshtein
Failed to build python-levenshtein
Installing collected packages: thefuzz, python-levenshtein
  Running setup.py install for python-levenshtein ... error
  error: subprocess-exited-with-error
  
  × Running setup.py install for python-levenshtein did not run successfully.
  │ exit code: 1
  ╰─> [30 lines of output]
      running install
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-3.9
      creating build/lib.linux-x86_64-3.9/Levenshtein
      copying Levenshtein/__init__.py -> build/lib.linux-x86_64-3.9/Levenshtein
      copying Levenshtein/StringMatcher.py -> build/lib.linux-x86_64-3.9/Levenshtein
      running egg_info
      writing python_Levenshtein.egg-info/PKG-INFO
      writing dependency_links to python_Levenshtein.egg-info/dependency_links.txt
      writing entry points to python_Levenshtein.egg-info/entry_points.txt
      writing namespace_packages to python_Levenshtein.egg-info/namespace_packages.txt
      writing requirements to python_Levenshtein.egg-info/requires.txt
      writing top-level names to python_Levenshtein.egg-info/top_level.txt
      reading manifest file 'python_Levenshtein.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      warning: no previously-included files matching '*pyc' found anywhere in distribution
      warning: no previously-included files matching '*so' found anywhere in distribution
      warning: no previously-included files matching '.project' found anywhere in distribution
      warning: no previously-included files matching '.pydevproject' found anywhere in distribution
      writing manifest file 'python_Levenshtein.egg-info/SOURCES.txt'
      copying Levenshtein/_levenshtein.c -> build/lib.linux-x86_64-3.9/Levenshtein
      copying Levenshtein/_levenshtein.h -> build/lib.linux-x86_64-3.9/Levenshtein
      running build_ext
      building 'Levenshtein._levenshtein' extension
      creating build/temp.linux-x86_64-3.9
      creating build/temp.linux-x86_64-3.9/Levenshtein
      x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/azureuser/virtualenvs/datatog-venv/include -I/usr/include/python3.9 -c Levenshtein/_levenshtein.c -o build/temp.linux-x86_64-3.9/Levenshtein/_levenshtein.o
      error: command 'x86_64-linux-gnu-gcc' failed: No such file or directory
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> python-levenshtein

opened by Danferno 2

Add type hints in module stub files

Add type hints in stub files. This fixes #4. The hints were added in stub files to maintain backwards compatibility. If this PR is likely to be merged, we can add a py.typed file and add mypy to the GitHub Actions to check the type hints in new PRs.

opened by Harry-Lees 2

Could not find version that satisfies the requirement thefuzz (unavailable) (from versions: 0.18.0, 0.19.)

I'm trying to install this package in my environment using Miniconda 3 (so that I can then use this environment in Spyder 5). I followed these commands: https://stackoverflow.com/questions/19042389/conda-installing-upgrading-directly-from-github

The final command was the same one you have in your README file: pip install git+git://github.com/seatgeek/[email protected]#egg=thefuzz

but it gives me these errors:

WARNING: Discarding git+git://github.com/seatgeek/[email protected]#egg=thefuzz. Command errored out with exit status 128: git clone -q git://github.com/seatgeek/thefuzz.git 'C:\Users\sergi\AppData\Local\Temp\pip-install-3jejnqto\thefuzz_b5aa560a5b744c2993afa49ee996a5df' Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement thefuzz (unavailable) (from versions: 0.18.0, 0.19.0)
ERROR: No matching distribution found for thefuzz (unavailable)

opened by serge144 1

Contributions to the new repository

@bigtoast I have two questions regarding contributions to this new repository:

The README of the old repository says PRs and issues here will need to be resubmitted to TheFuzz. In the past, SeatGeek did not really have time to review PRs. Has this changed? It would not make much sense to invest time in them when it is already known that they will probably never be reviewed.

Is there a plan for how long the package should keep supporting Python 2.7?

opened by maxbachmann 1

Search for matches in an array of complex objects.

Hello :smiley:

Is there any planned feature to find matches inside an array of complex objects? I found a library that can do this, Fuzzball, but it is for Node.js. It would be nice to have this in Python as well.

e.g.
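
A minimal sketch of what such a feature might look like, using only the standard library's difflib: each record is scored on one chosen field and the best record is returned whole. The record layout, the helper name best_record, and the key parameter are all hypothetical and are not part of TheFuzz.

```python
from difflib import SequenceMatcher


def best_record(query, records, key):
    """Return (record, score) for the record whose `key` field best matches query."""
    def field_score(record):
        text = str(record[key]).lower()
        return int(round(100 * SequenceMatcher(None, query.lower(), text).ratio()))

    best = max(records, key=field_score)
    return best, field_score(best)


# Hypothetical data: dictionaries standing in for "complex objects".
teams = [
    {"id": 1, "name": "Atlanta Falcons"},
    {"id": 2, "name": "Dallas Cowboys"},
]
record, match_score = best_record("dallas cowboys", teams, key="name")
print(record["id"], match_score)  # 2 100
```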

opened by matheusfenolio 1

Error when used with Amazon Redshift Python UDF

Hi there, I tried to use this function with Amazon Redshift as a Python UDF and I get the error: "ERROR: ImportError: No module named thefuzz."

However, it works if I copy __init__.py to the root directory, i.e. one level up. Is it possible to make that permanent? Or is there any other idea on how to fix it? My Redshift function code is:

CREATE OR REPLACE FUNCTION fuzzy_test (string_a VARCHAR, string_b VARCHAR)
RETURNS FLOAT IMMUTABLE AS $$
    from thefuzz import fuzz

    return fuzz.ratio(string_a, string_b)
$$ LANGUAGE plpythonu;

I also tried "from thefuzz.thefuzz import fuzz", with no luck.

opened by saeed2402 0

Fix tests for module-scoped logger

Fix to match changes made in https://github.com/seatgeek/thefuzz/pull/21.

test_thefuzz.py .................................................        [ 69%]
test_thefuzz_hypothesis.py .....................                         [ 98%]
test_thefuzz_pytest.py F                                                 [100%]

=================================== FAILURES ===================================
_____________________________ test_process_warning _____________________________

caplog = <_pytest.logging.LogCaptureFixture object at 0x7f9cab835070>

    def test_process_warning(caplog):
        """Check that a string reduced to 0 by processor logs a warning to stderr"""
    
        query = ':::::::'
        choices = [':::::::']
    
        _ = process.extractOne(query, choices)
    
        logstr = ("Applied processor reduces "
                  "input query to empty string, "
                  "all comparisons will have score 0. "
                  "[Query: ':::::::']")
    
        assert 1 == len(caplog.records)
        log = caplog.records[0]
    
        assert log.levelname == "WARNING"
>       assert log.name == "root"
E       AssertionError: assert 'thefuzz.process' == 'root'
E         - root
E         + thefuzz.process

test_thefuzz_pytest.py:21: AssertionError
------------------------------ Captured log call -------------------------------
WARNING  thefuzz.process:process.py:84 Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: ':::::::']

https://github.com/seatgeek/thefuzz/actions/runs/2328290584

opened by hugovk 0

Missing type hints/library stubs

When using mypy (or other type checkers), the library cannot be analyzed correctly because type hints are unavailable. Are you planning to implement type hints or stubs in the future?

Output of mypy:

error: Skipping analyzing "thefuzz": found module but no type hints or library stubs
note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports

opened by SimonWoidig 0

Add short description for each of the main functions demo'd on the README?

Greetings!

If you have the chance, do you think you could add a sentence or two for each of the usage examples on the main README describing exactly what is being done?

That would be really helpful to get started quickly with the library.

opened by khughitt 1

extractOne is sensitive to ordering

extractOne changes its result based on the ordering of the choices; my guess is that this is due to the Levenshtein distance being equal. Maybe a new distance measure is necessary?

e.g.

>>> process.extractOne("C++", ["C#", "C++", "C", ".NET"])
    ('C#', 100)
>>> process.extractOne("C++", ["C++", "C#", "C", ".NET"])
    ('C++', 100)

Maybe a heuristic needs to come in for exact match?
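
One possible explanation, assuming the strings are preprocessed by replacing non-alphanumeric characters with whitespace before scoring (an assumption about the library's default processor, not something confirmed in this thread), is that "C++", "C#" and "C" all collapse to the same text, so they tie and the first one wins. The sketch below illustrates that assumption together with a hypothetical exact-match heuristic; normalize, score and extract_one_prefer_exact are illustrative names only.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    # Assumed preprocessing: non-word characters become spaces,
    # then the text is lowercased and stripped.
    return re.sub(r"\W", " ", text).lower().strip()


def score(s1, s2):
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def extract_one_prefer_exact(query, choices, scorer=score):
    # Hypothetical heuristic: an exact, unprocessed match wins outright;
    # otherwise fall back to the first highest-scoring choice.
    if query in choices:
        return query, 100
    return max(((c, scorer(normalize(query), normalize(c))) for c in choices),
               key=lambda pair: pair[1])


print(normalize("C++"), normalize("C#"), normalize("C"))  # c c c
print(extract_one_prefer_exact("C++", ["C#", "C++", "C", ".NET"]))  # ('C++', 100)
```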

opened by rafikmatta 0

NameError: name 'matching_blocks' is not defined

I'm getting the following error when deploying my app to heroku, but not when I develop locally:

File "/app/.heroku/python/lib/python3.9/site-packages/thefuzz/utils.py", line 38, in decorator
return func(*args, **kwargs)
File "/app/.heroku/python/lib/python3.9/site-packages/thefuzz/utils.py", line 29, in decorator
return func(*args, **kwargs)
File "/app/.heroku/python/lib/python3.9/site-packages/thefuzz/utils.py", line 47, in decorator
return func(*args, **kwargs)
File "/app/.heroku/python/lib/python3.9/site-packages/thefuzz/fuzz.py", line 47, in partial_ratio
blocks = m.get_matching_blocks()
File "/app/.heroku/python/lib/python3.9/site-packages/thefuzz/StringMatcher.py", line 58, in get_matching_blocks
self._matching_blocks = matching_blocks(self.get_opcodes(),
NameError: name 'matching_blocks' is not defined

Locally I'm using python 3.9.11, on heroku I have: python-3.9.15 as well as pip 22.3.1, setuptools 63.4.3 and wheel 0.37.1. Could this have something to do with 2to3 or modernize messing something up? I'm using thefuzz==0.19.0 and levenshtein==0.20.8. If I use fuzzywuzzy==0.18.0 and python-levenshtein==0.12.2 it works fine.

opened by staab 1

ENH: use `functools.lru_cache` to speed up

The Levenshtein distance algorithm can't be vectorized, so the calculation is very slow on large data.

One idea to speed it up is caching. The accumulate example below shows the effect of a cache.

def accumulate(x):
    return sum(range(x))

Without lru_cache, accumulate(100000000) takes about 5s on my machine, whether it is the first or the second run.

from functools import lru_cache

@lru_cache
def accumulate(x):
    return sum(range(x))

After adding lru_cache, the first call to accumulate(100000000) still takes 5s, but the second call takes 0s.
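
Applied to string scoring, the same idea might look like the sketch below. It wraps a difflib-based stand-in scorer, not the library's own functions, and the name cached_ratio is hypothetical. Caching only helps when the same pair of strings is compared repeatedly.

```python
from difflib import SequenceMatcher
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_ratio(s1, s2):
    # Strings are hashable, so each (s1, s2) pair is scored at most once.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


cached_ratio("new york jets", "new york giants")  # computed
cached_ratio("new york jets", "new york giants")  # served from the cache
print(cached_ratio.cache_info().hits)  # 1
```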

opened by Zeroto521 0

BUG: Can't handle `nan`

Problem:

A nan value is not None, so it is not caught by check_for_none. And because nan is a float and has no len method, check_empty_string raises an error.

>>> from thefuzz.fuzz import ratio
>>> ratio("a", float("nan"))  # or `np.nan`
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Zero\mambaforge\envs\dtoolkit-dev\lib\site-packages\thefuzz\utils.py", line 38, in decorator
    return func(*args, **kwargs)
  File "C:\Users\Zero\mambaforge\envs\dtoolkit-dev\lib\site-packages\thefuzz\utils.py", line 29, in decorator
    return func(*args, **kwargs)
  File "C:\Users\Zero\mambaforge\envs\dtoolkit-dev\lib\site-packages\thefuzz\utils.py", line 46, in decorator
    if len(args[0]) == 0 or len(args[1]) == 0:
TypeError: object of type 'float' has no len()

Solution:

check_for_none needs to be updated, or a new decorator should be created for handling nan values.

https://github.com/seatgeek/thefuzz/blob/6e68af84e086b3e5f7253d4f9b0d6c7313e34637/thefuzz/utils.py#L33-L39
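
A new decorator of the kind proposed could be sketched as follows. The name check_for_nan and the choice to return a score of 0 are assumptions made for illustration, and the decorated ratio is a difflib stand-in, not the library's function.

```python
import functools
import math
from difflib import SequenceMatcher


def check_for_nan(func):
    """Hypothetical decorator: score 0 if either of the first two args is a float NaN."""
    @functools.wraps(func)
    def decorator(*args, **kwargs):
        for value in args[:2]:
            # float("nan") and np.nan are both float instances.
            if isinstance(value, float) and math.isnan(value):
                return 0
        return func(*args, **kwargs)
    return decorator


@check_for_nan
def ratio(s1, s2):
    # Stand-in scorer for this sketch.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


print(ratio("a", float("nan")))  # 0
```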

opened by Zeroto521 0

Creating a consensus string

Hello, I have 10 strings of different lengths, all of which mean the same thing. I want to create a consensus string using fuzzywuzzy. Is this possible? Which single function or a combination do I use?

Thanks Abhijit
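
One simple way to approach this, sketched here with the standard library's difflib as a stand-in scorer, is to pick the input string that is most similar on average to all the others (a medoid). This selects an existing string instead of synthesizing a new one, and the helper name medoid_string is hypothetical.

```python
from difflib import SequenceMatcher


def medoid_string(strings):
    """Return the input string with the highest total similarity to the others."""
    def total_similarity(index):
        return sum(
            SequenceMatcher(None, strings[index], other).ratio()
            for j, other in enumerate(strings) if j != index
        )

    best_index = max(range(len(strings)), key=total_similarity)
    return strings[best_index]


print(medoid_string(["apple", "apple", "apply"]))  # apple
```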

opened by sanyalab 0

Fuzzy String Matching in Python

FuzzyWuzzy Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

8.8k Jan 8, 2023

Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

TextDistance TextDistance -- python library for comparing distance between two or more sequences by many algorithms. Features: 30+ algorithms Pure pyt

3k Jan 2, 2023

Redlines produces a Markdown text showing the differences between two strings/text

Redlines Redlines produces a Markdown text showing the differences between two strings/text. The changes are represented with strike-throughs and unde

2 Apr 8, 2022

strbind - lapidary text converter for translate an text file to the C-style string

strbind strbind - lapidary text converter for translate an text file to the C-style string. My motivation is fast adding large text chunks to the C co

1 Oct 22, 2021

Converts a Bangla numeric string to literal words.

Bangla Number in Words Converts a Bangla numeric string to literal words. Install $ pip install banglanum2words Usage

3 Aug 29, 2022

A Python library that provides an easy way to identify devices like mobile phones, tablets and their capabilities by parsing (browser) user agent strings.

Python User Agents user_agents is a Python library that provides an easy way to identify/detect devices like mobile phones, tablets and their capabili

1.3k Dec 22, 2022

a python package that lets you add custom colors and text formatting to your scripts in a very easy way!

colormate Python script text formatting package What is colormate? colormate is a python library that lets you add text formatting to your scripts, it

2 Dec 14, 2022

A Python package to facilitate research on building and evaluating automated scoring models.

Rater Scoring Modeling Tool Introduction Automated scoring of written and spoken test responses is a growing field in educational natural language pro

59 Oct 10, 2022

A query extract python package

4 Nov 28, 2021

🍋 A Python package to process food

Pyfood is a simple Python package to process food, in different languages. Pyfood's ambition is to be the go-to library to deal with food, recipes, on

8 Apr 4, 2022

A minimal python script for generating multiple onetime use bip39 seed phrases

seed_signer_ontimes WARNING This project has mainly been used for local development, and creation should be ran on a air-gapped machine. A minimal pyt

4 Sep 12, 2022

A simple Python module for parsing human names into their individual components

Name Parser A simple Python (3.2+ & 2.6+) module for parsing human names into their individual components. hn.title hn.first hn.middle hn.last hn.suff

574 Dec 20, 2022

PyNews 📰 Simple newsletter made with python 🐍🗞️

PyNews 📰 Simple newsletter made with python Install dependencies This project has some dependencies (see requirements.txt) that are not included in t

4 Aug 21, 2022

Making simplex testing clean and simple

Making Simplex Project Testing - Clean and Simple What does this repo do? It organizes the python stack for the coding project What do I need to do in

1 Jan 30, 2022

A simple text editor for linux

wolf-editor A simple text editor for linux Installing using Deb Package Download newest package from releases CD into folder where the downloaded acka

5 Nov 30, 2021

Simple python program to auto credit your code, text, book, whatever!

Credit Simple python program to auto credit your code, text, book, whatever! Setup First change credit_text to whatever text you would like to credit

1 Jan 29, 2022

Free & simple way to encipher text

VenSipher VenSipher is a free medium through which text can be enciphered. It can convert any text into an unrecognizable secret text that can only be

3 Jan 28, 2022

🚩 A simple and clean python banner generator - Banners

🚩 A simple and clean python banner generator - Banners

12 Oct 9, 2022

The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity

Contents Maintainer wanted Introduction Installation Documentation License History Source code Authors Maintainer wanted I am looking for a new mainta

1.2k Dec 16, 2022

A tool to compare differences between dataframes and create a differences report in Excel

similarpanda A module to check for differences between pandas Dataframes, and generate a report in Excel format. This is helpful in a workplace settin

9 Sep 15, 2022