A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

Overview

Dedupe Python Library


dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on structured data.

dedupe will help you:

  • remove duplicate entries from a spreadsheet of names and addresses
  • link a list with customer information to another with order history, even without unique customer IDs
  • take a database of campaign contributions and figure out which ones were made by the same person, even if the names were entered slightly differently for each record

dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.

Important links

dedupe library consulting

If you or your organization would like professional assistance in working with the dedupe library, Dedupe.io LLC offers consulting services. Read more about pricing and available services here.

Tools built with dedupe

Dedupe.io

A cloud service powered by the dedupe library for de-duplicating and finding matches in your data. It provides a step-by-step wizard for uploading your data, setting up a model, training, clustering and reviewing the results.

Dedupe.io also supports record linkage across data sources and continuous matching and training through an API.

For more, see the Dedupe.io product site, tutorials on how to use it, and differences between it and the dedupe library.

csvdedupe

Command line tool for de-duplicating and linking CSV files. Read about it on Source Knight-Mozilla OpenNews.

Installation

Using dedupe

If you only want to use dedupe, install it this way:

pip install dedupe

Familiarize yourself with dedupe's API, and get started on your project. Need inspiration? Have a look at some examples.
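
For orientation, a minimal deduplication workflow looks roughly like the sketch below. The field definitions, sample data and threshold are illustrative only, and method names may differ across dedupe versions; check the API documentation and examples for the exact interface.

import dedupe

# Illustrative field definition; use your data's actual columns.
fields = [
    {'field': 'name', 'type': 'String'},
    {'field': 'address', 'type': 'String'},
]

# dedupe expects a dict mapping record IDs to dicts of field values.
data = {
    0: {'name': 'Acme Corp', 'address': '123 Main St'},
    1: {'name': 'ACME Corporation', 'address': '123 Main Street'},
}

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)

# Label a handful of uncertain pairs interactively, then train.
dedupe.console_label(deduper)
deduper.train()

# Group records into clusters of likely duplicates.
clusters = deduper.partition(data, threshold=0.5)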

Developing dedupe

We recommend using virtualenv and virtualenvwrapper for working in a virtualized development environment. Read how to set up virtualenv.

Once you have virtualenvwrapper set up,

mkvirtualenv dedupe
git clone git://github.com/dedupeio/dedupe.git
cd dedupe
pip install "numpy>=1.9"
pip install -r requirements.txt
cython src/*.pyx
pip install -e .

If these tests pass, then everything should have been installed correctly!

pytest

Afterwards, whenever you want to work on dedupe,

workon dedupe

Testing

Unit tests of core dedupe functions

pytest

Test using canonical dataset from Bilenko's research

Using Deduplication

python tests/canonical.py

Using Record Linkage

python tests/canonical_matching.py

Team

  • Forest Gregg, DataMade
  • Derek Eder, DataMade

Credits

Dedupe is based on Mikhail Yuryevich Bilenko's Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Errors / Bugs

If something is not behaving intuitively, it is a bug and should be reported. Report it here.

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Send us a pull request. Bonus points for topic branches.

Copyright

Copyright (c) 2019 Forest Gregg and Derek Eder. Released under the MIT License.

Third-party copyright in this distribution is noted where applicable.

Citing Dedupe

If you use Dedupe in an academic work, please give this citation:

Forest Gregg and Derek Eder. 2019. Dedupe. https://github.com/dedupeio/dedupe.

Comments
  • Proof of Concept for Data matching

    Proof of Concept for Data matching

    William W. Cohen's paper does a wonderful job of presenting the problem and a solution for the task of constrained data matching. The aim of this report is to discuss the implementation details for the constrained data matching problem.

    Cohen's paper outlines the following changes to the existing canopy clustering algorithm:

    For the two datasets A and B:

    A. Let PossibleCenters = A
    B. Let Canopy(a) = {(a, b) : b ∈ B and approxDist(a, b) < T_loose}
    C. Let T_tight = 0 (i.e. only remove a from the set of possible centers)

    The changes described below will accommodate the modifications listed above:

    Read in data from multiple sources

    First of all, the data_d structure that is passed to the blockingFunction needs to be modified so that each record in data_d identifies which dataset it belongs to.

    def readData(filenames):
        """
        Read in our data from the CSV files and create a dictionary of records,
        where the key is a unique record ID and each value is a
        [frozendict](http://code.activestate.com/recipes/414283-frozen-dictionaries/)
        (hashable dictionary) of the row fields, tagged with the dataset it came from.
        """
        data_d = {}
        row_id = 0
        for file_no, filename in enumerate(filenames):
            with open(filename) as f:
                reader = csv.DictReader(f)
                for row in reader:
                    clean_row = [(k, preProcess(v)) for (k, v) in row.items()]
                    # tag each record with the index of the file it came from
                    clean_row.append(('dataset', file_no))
                    data_d[row_id] = dedupe.core.frozendict(clean_row)
                    row_id += 1
        return data_d
    

    Step A

    To implement Step A: since, in dedupe's implementation, the center of a canopy is always selected from corpus_ids, we need to modify the invertIndex() function so that it only adds record IDs that belong to dataset A.

    def invertIndex(data, tfidf_fields, df_index=None):

        inverted_index = defaultdict(lambda: defaultdict(list))
        token_vector = defaultdict(dict)
        corpus_ids = set([])

        for record_id, record in data:
            # only records from dataset A can become canopy centers
            if record['dataset'] == 0:
                corpus_ids.add(record_id)
    

    Step B

    To implement Step B: since each record of the candidate set is compared with the center, the candidate_set should only include elements from dataset B.

    def createCanopies(field, threshold, corpus_ids, token_vector, inverted_index, data):

        candidate_set = set(doc_id
                            for token in center_tokens
                            for doc_id in field_inverted_index[token]['occurrences']
                            if data[doc_id]['dataset'] == 1)
    

    Note: we would also need to pass data into createCanopies, as in the signature above.

    Step C

    Step C will work by itself, thanks to the changes made in Step B.

    Step D

    Within each block, blocked pairs are generated for every combination of records. The blocked pairs can be restricted to cross-dataset pairs by making the following change:

    def blockedPairs(blocks):
        for block in blocks:

            block_pairs = itertools.combinations(block, 2)

            for pair in block_pairs:
                # only keep pairs that span the two datasets
                if pair[0]['dataset'] != pair[1]['dataset']:
                    yield pair
    

    The major focus of this report was to implement constrained data matching by making changes to the blocking section of the code base.

    Step E

    The suggested modifications to implement constrained data matching would not affect dedupe's active learning, so dedupe would still generate labeling pairs that may both belong to the same dataset. One possible solution:

    def dataSample(data, sample_size):
        '''Randomly sample pairs of records, one from each dataset'''

        data_list_A = []
        data_list_B = []
        for record in data.values():
            if record['dataset'] == 0:
                data_list_A.append(record)
            else:
                data_list_B.append(record)

        n_records = min(len(data_list_A), len(data_list_B))

        random_pairs = dedupe.core.randomPairs(n_records, sample_size)

        return tuple((data_list_A[int(k1)], data_list_B[int(k2)])
                     for k1, k2 in random_pairs)
    
    opened by nikitsaraf 41
  • MemoryError

    MemoryError

    Using pgsql_big_dedupe_example.py we have successfully run sets of 1 million and 10 million records, but fail at 20 million with a repeatable memory error. Via the stack trace we have found the error to occur in the scoreDuplicates method. It seems to be one of two problems happening within the mapping and reducing steps.

    1. Within fillQueue, when records are being fetched from the records_queue, the optimization step that checks current < last_rate and increases the chunk size by a constant multiple of 1.1 has no upper bound. Moreover, the chunked records are immediately loaded into a list ( chunk = list(itertools.islice(iterable, int(chunk_size))) ). Without an upper bound this can grow exponentially (see the sketch after this list). We changed the multiplier to a constant of 1 and were able to get through the mapping step in scoreDuplicates.
    2. Within mergeScores, all records from the score_queue are concatenated into a numpy array. It seems this array is needed to compute the max pair length to generate the python_type used in the memmap? If so, is it possible to determine the max length without loading all of the pairs into a numpy array? That way the scores could be written to the memmap in chunks.
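
    A minimal sketch of the cap described in point 1 (this is not dedupe's actual fillQueue, just an illustration of bounding the adaptive chunk size):

    import itertools

    MAX_CHUNK_SIZE = 20000  # illustrative upper bound

    def fill_queue_chunks(iterable, queue, chunk_size=100):
        while True:
            chunk = list(itertools.islice(iterable, int(chunk_size)))
            if not chunk:
                break
            queue.put(chunk)
            # grow the chunk size adaptively, but never past the ceiling,
            # so the list built by islice stays bounded in memory
            chunk_size = min(chunk_size * 1.1, MAX_CHUNK_SIZE)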

    Any suggestions on how to overcome this error?

    opened by tendres 34
  • Remove the requirement for the records to have the Id of the type integer

    Remove the requirement for the records to have the Id of the type integer

    As far as I can understand, the requirement for the record ID to be an integer exists in order to generate pairs of random records for the data sample.

    So, rather than imposing that restriction on the input file, we can generate integer IDs virtually with the help of Python's enumerate() function. This is a perfect use case for enumerate().
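
    A sketch of that idea (purely illustrative): keep a side mapping from the virtual integer IDs back to the original record IDs so results can be translated afterwards.

    # records keyed by arbitrary (non-integer) IDs
    records = {'a-17': {'name': 'Jane'}, 'b-03': {'name': 'John'}}

    data_d = {}
    int_to_original = {}
    for int_id, (original_id, record) in enumerate(records.items()):
        data_d[int_id] = record           # what dedupe would see
        int_to_original[int_id] = original_id

    # after deduplication, translate integer IDs back:
    # original_id = int_to_original[int_id]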

    opened by nikitsaraf 23
  • Distributed

    Distributed

    Anyone working on making this library work in a distributed environment? Seems like it would be great for a Spark library. Basically use blocking to assign partitions

    opened by lucaswiser 22
  • Streaming matching

    Streaming matching

    Extend dedupe to handle duplicate identification on an N+1 basis, allowing for on-the-fly detection? I.e. given a known set of contact data with N rows that has already undergone blocking, check whether the new (N+1)th record is a duplicate of a record in N.

    I can see two broad approaches. In the first, for every cluster of duplicate records we either choose or construct a 'representative' record. The task is then to match the new record against one of these 'representative' records. If it matches, we assign the new record to that cluster.

    In the second approach, we keep all the data and their current cluster assignments. When we get a new record, we recluster the relevant records. This approach requires keeping more information around and is more computationally expensive. It would also be possible for a new record to cause a currently clustered record to be removed from its cluster. It also seems like this latter approach could be a lot more accurate.
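
    A rough sketch of the first approach, with a hypothetical similarity function (none of this is dedupe's API):

    def assign_to_cluster(new_record, representatives, similarity, threshold):
        """Match a new record against one representative record per cluster."""
        best_cluster, best_score = None, threshold
        for cluster_id, representative in representatives.items():
            score = similarity(representative, new_record)
            if score > best_score:
                best_cluster, best_score = cluster_id, score
        return best_cluster  # None means: start a new cluster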

    Thoughts @emphanos, @nikitsaraf, @markhuberty, @michaelwick?

    enhancement 
    opened by fgregg 20
  • How large a table is dedupe feasible on?

    How large a table is dedupe feasible on?

    I have a 100 GB MySQL table that it would be great to run this on. How long do you think dedupe would take to run on it, a week?

    Apologies if this isn't the proper forum for general questions.

    opened by lminer 19
  • The dbm solution seems to make the blocking process extremely slow

    The dbm solution seems to make the blocking process extremely slow

    In my case I have 30K records to match against, and if I use the default dbm approach it takes more than 10 minutes to match, for example, a 2-entry input. During the matching you see output like this:

    [2017-06-29 20:43:56,841: INFO/PoolWorker-1] 10000, 182.5327482 seconds
    [2017-06-29 20:53:40,909: INFO/PoolWorker-1] 20000, 758.9884932 seconds
    

    which I believe is the output from https://github.com/dedupeio/dedupe/blob/master/dedupe/blocking.py#L42

    As we have enough memory for now, I changed the code here to let the blocking happen in an in-memory dictionary: https://github.com/dedupeio/dedupe/blob/master/dedupe/api.py#L1072

    Basically, instead of returning shelf, return an empty python dictionary:

    def _temp_shelve():
        fd, file_path = tempfile.mkstemp()
        os.close(fd)

        try:
            shelf = shelve.open(file_path, 'n',
                                protocol=pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            if 'db type could not be determined' in str(e):
                os.remove(file_path)
                shelf = shelve.open(file_path, 'n',
                                    protocol=pickle.HIGHEST_PROTOCOL)
            else:
                raise

        return {}, file_path  # return a plain dictionary instead of the shelf
    

    This makes the blocking and matching process take a lot of memory, but it can finish matching 2 entries against 30K records in a few seconds.

    Does this look normal?


    Also, the dbm approach does not work for large data sets on macOS, as by default there is no gdbm available for Python 3 on macOS (not exactly sure why), and it causes issues like this:

    HASH: Out of overflow pages.  Increase page size
    Traceback (most recent call last):
      File "/Users/tendres/PycharmProjects/dedupe/tests/test_shelve.py", line 25, in <module>
        shelf[k] += [(i, record, ids)]
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/shelve.py", line 125, in __setitem__
        self.dict[key.encode(self.keyencoding)] = f.getvalue()
    _dbm.error: cannot add item to database
    
    Process finished with exit code 1
    

    also mentioned here: https://github.com/dedupeio/csvdedupe/issues/67


    It would also be nice to have an option on the matching API to decide whether to use shelve (or dbm) or an in-memory dictionary, I suppose.
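
    Something like this hypothetical helper (the name and signature are illustrative, not dedupe's API) would make that choice explicit:

    import os
    import pickle
    import shelve
    import tempfile

    def _temp_blocks(in_memory=False):
        """Hypothetical: return either a plain dict or a disk-backed shelf."""
        if in_memory:
            return {}, None
        dir_path = tempfile.mkdtemp()
        file_path = os.path.join(dir_path, 'blocks')
        shelf = shelve.open(file_path, 'n', protocol=pickle.HIGHEST_PROTOCOL)
        return shelf, file_path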

    opened by liufuyang 16
  • start to use sklearn for ml algorithms

    start to use sklearn for ml algorithms

    relates to #991 and #990

    Todo

    • [x] get test passing
    • [ ] replace haversine dependency with https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html#sklearn.metrics.pairwise.haversine_distances
    • [ ] replace simple-cosine dependency with https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity

    From trying to replace the cosine distance, it's pretty clearly not worth doing that unless we can batch the calls.
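
    For reference, a batched sketch using the scikit-learn functions linked above (coordinates must be in radians; the vectorizer and sample strings are illustrative, not dedupe's current code):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity, haversine_distances

    # haversine_distances returns values in radians; multiply by the
    # Earth's radius (~6371 km) to get kilometres
    coords_a = np.radians([[41.88, -87.63]])  # Chicago
    coords_b = np.radians([[40.71, -74.01]])  # New York
    km = haversine_distances(coords_a, coords_b) * 6371

    # batching the cosine comparisons is what would make the switch worthwhile
    texts_a = ['public schools of chicago', 'city of chicago']
    texts_b = ['chicago public schools', 'new york city']
    tfidf = TfidfVectorizer().fit(texts_a + texts_b)
    sims = cosine_similarity(tfidf.transform(texts_a), tfidf.transform(texts_b))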

    Any other likely places where we could use sklearn or scipy code instead of an additional library, or replace dedupe code, @fjsj, @NickCrews?

    opened by fgregg 15
  • ChildProcessError when running deduper.partition in a Jupyter Notebook on OSX

    ChildProcessError when running deduper.partition in a Jupyter Notebook on OSX

    I am running dedupe in a Jupyter notebook on Mac. When I run this line of code:

    groups = deduper.partition(data, threshold=.7)
    

    I get this error at the same place each time, 360000:

    INFO:dedupe.blocking:340000, 10.0376252 seconds
    INFO:dedupe.blocking:350000, 10.3528052 seconds
    INFO:dedupe.blocking:360000, 10.6705842 seconds
    ---------------------------------------------------------------------------
    ChildProcessError                         Traceback (most recent call last)
     in 
    ----> 1 groups = deduper.partition(data, threshold=.7)
    
    ~/opt/anaconda3/lib/python3.7/site-packages/dedupe/api.py in partition(self, data, threshold)
        168         """
        169         pairs = self.pairs(data)
    --> 170         pair_scores = self.score(pairs)
        171         clusters = self.cluster(pair_scores, threshold)
        172 
    
    ~/opt/anaconda3/lib/python3.7/site-packages/dedupe/api.py in score(self, pairs)
        104                                            self.data_model,
        105                                            self.classifier,
    --> 106                                            self.num_cores)
        107         except RuntimeError:
        108             raise RuntimeError('''
    
    ~/opt/anaconda3/lib/python3.7/site-packages/dedupe/core.py in scoreDuplicates(record_pairs, data_model, classifier, num_cores)
        247     result = result_queue.get()
        248     if isinstance(result, Exception):
    --> 249         raise ChildProcessError
        250 
        251     if result:
    
    ChildProcessError: 
    

    It looks like the num_cores setting has something to do with it; I've tried with that setting set to None, 1, and 2, and all have the same outcome.

    I found this issue, which sounded somewhat familiar. So in case it helps here is the output of:

    import numpy
    print(numpy.__config__.__dict__)
    
    {'__name__': 'numpy.__config__', '__doc__': None, '__package__': 'numpy', '__loader__': <_frozen_importlib_external.SourceFileLoader object at 0x7fa3571b59d0>, '__spec__': ModuleSpec(name='numpy.__config__', loader=<_frozen_importlib_external.SourceFileLoader object at 0x7fa3571b59d0>, origin='/Users/calebkeller/opt/anaconda3/lib/python3.7/site-packages/numpy/__config__.py'), '__file__': '/Users/calebkeller/opt/anaconda3/lib/python3.7/site-packages/numpy/__config__.py', '__cached__': '/Users/calebkeller/opt/anaconda3/lib/python3.7/site-packages/numpy/__pycache__/__config__.cpython-37.pyc', '__builtins__': {'__name__': 'builtins', '__doc__': "Built-in functions, exceptions, and other objects.\n\nNoteworthy: None is the `nil' object; Ellipsis represents `...' in slices.", '__package__': '', '__loader__': <class '_frozen_importlib.BuiltinImporter'>, '__spec__': ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>), '__build_class__': <built-in function __build_class__>, '__import__': <built-in function __import__>, 'abs': <built-in function abs>, 'all': <built-in function all>, 'any': <built-in function any>, 'ascii': <built-in function ascii>, 'bin': <built-in function bin>, 'breakpoint': <built-in function breakpoint>, 'callable': <built-in function callable>, 'chr': <built-in function chr>, 'compile': <built-in function compile>, 'delattr': <built-in function delattr>, 'dir': <built-in function dir>, 'divmod': <built-in function divmod>, 'eval': <built-in function eval>, 'exec': <built-in function exec>, 'format': <built-in function format>, 'getattr': <built-in function getattr>, 'globals': <built-in function globals>, 'hasattr': <built-in function hasattr>, 'hash': <built-in function hash>, 'hex': <built-in function hex>, 'id': <built-in function id>, 'input': <bound method Kernel.raw_input of <ipykernel.ipkernel.IPythonKernel object at 0x7fa356643ed0>>, 'isinstance': <built-in function isinstance>, 'issubclass': <built-in function issubclass>, 'iter': <built-in function iter>, 'len': <built-in function len>, 'locals': <built-in function locals>, 'max': <built-in function max>, 'min': <built-in function min>, 'next': <built-in function next>, 'oct': <built-in function oct>, 'ord': <built-in function ord>, 'pow': <built-in function pow>, 'print': <built-in function print>, 'repr': <built-in function repr>, 'round': <built-in function round>, 'setattr': <built-in function setattr>, 'sorted': <built-in function sorted>, 'sum': <built-in function sum>, 'vars': <built-in function vars>, 'None': None, 'Ellipsis': Ellipsis, 'NotImplemented': NotImplemented, 'False': False, 'True': True, 'bool': <class 'bool'>, 'memoryview': <class 'memoryview'>, 'bytearray': <class 'bytearray'>, 'bytes': <class 'bytes'>, 'classmethod': <class 'classmethod'>, 'complex': <class 'complex'>, 'dict': <class 'dict'>, 'enumerate': <class 'enumerate'>, 'filter': <class 'filter'>, 'float': <class 'float'>, 'frozenset': <class 'frozenset'>, 'property': <class 'property'>, 'int': <class 'int'>, 'list': <class 'list'>, 'map': <class 'map'>, 'object': <class 'object'>, 'range': <class 'range'>, 'reversed': <class 'reversed'>, 'set': <class 'set'>, 'slice': <class 'slice'>, 'staticmethod': <class 'staticmethod'>, 'str': <class 'str'>, 'super': <class 'super'>, 'tuple': <class 'tuple'>, 'type': <class 'type'>, 'zip': <class 'zip'>, '__debug__': True, 'BaseException': <class 'BaseException'>, 'Exception': <class 'Exception'>, 'TypeError': <class 'TypeError'>, 'StopAsyncIteration': <class 
'StopAsyncIteration'>, 'StopIteration': <class 'StopIteration'>, 'GeneratorExit': <class 'GeneratorExit'>, 'SystemExit': <class 'SystemExit'>, 'KeyboardInterrupt': <class 'KeyboardInterrupt'>, 'ImportError': <class 'ImportError'>, 'ModuleNotFoundError': <class 'ModuleNotFoundError'>, 'OSError': <class 'OSError'>, 'EnvironmentError': <class 'OSError'>, 'IOError': <class 'OSError'>, 'EOFError': <class 'EOFError'>, 'RuntimeError': <class 'RuntimeError'>, 'RecursionError': <class 'RecursionError'>, 'NotImplementedError': <class 'NotImplementedError'>, 'NameError': <class 'NameError'>, 'UnboundLocalError': <class 'UnboundLocalError'>, 'AttributeError': <class 'AttributeError'>, 'SyntaxError': <class 'SyntaxError'>, 'IndentationError': <class 'IndentationError'>, 'TabError': <class 'TabError'>, 'LookupError': <class 'LookupError'>, 'IndexError': <class 'IndexError'>, 'KeyError': <class 'KeyError'>, 'ValueError': <class 'ValueError'>, 'UnicodeError': <class 'UnicodeError'>, 'UnicodeEncodeError': <class 'UnicodeEncodeError'>, 'UnicodeDecodeError': <class 'UnicodeDecodeError'>, 'UnicodeTranslateError': <class 'UnicodeTranslateError'>, 'AssertionError': <class 'AssertionError'>, 'ArithmeticError': <class 'ArithmeticError'>, 'FloatingPointError': <class 'FloatingPointError'>, 'OverflowError': <class 'OverflowError'>, 'ZeroDivisionError': <class 'ZeroDivisionError'>, 'SystemError': <class 'SystemError'>, 'ReferenceError': <class 'ReferenceError'>, 'MemoryError': <class 'MemoryError'>, 'BufferError': <class 'BufferError'>, 'Warning': <class 'Warning'>, 'UserWarning': <class 'UserWarning'>, 'DeprecationWarning': <class 'DeprecationWarning'>, 'PendingDeprecationWarning': <class 'PendingDeprecationWarning'>, 'SyntaxWarning': <class 'SyntaxWarning'>, 'RuntimeWarning': <class 'RuntimeWarning'>, 'FutureWarning': <class 'FutureWarning'>, 'ImportWarning': <class 'ImportWarning'>, 'UnicodeWarning': <class 'UnicodeWarning'>, 'BytesWarning': <class 'BytesWarning'>, 'ResourceWarning': <class 'ResourceWarning'>, 'ConnectionError': <class 'ConnectionError'>, 'BlockingIOError': <class 'BlockingIOError'>, 'BrokenPipeError': <class 'BrokenPipeError'>, 'ChildProcessError': <class 'ChildProcessError'>, 'ConnectionAbortedError': <class 'ConnectionAbortedError'>, 'ConnectionRefusedError': <class 'ConnectionRefusedError'>, 'ConnectionResetError': <class 'ConnectionResetError'>, 'FileExistsError': <class 'FileExistsError'>, 'FileNotFoundError': <class 'FileNotFoundError'>, 'IsADirectoryError': <class 'IsADirectoryError'>, 'NotADirectoryError': <class 'NotADirectoryError'>, 'InterruptedError': <class 'InterruptedError'>, 'PermissionError': <class 'PermissionError'>, 'ProcessLookupError': <class 'ProcessLookupError'>, 'TimeoutError': <class 'TimeoutError'>, 'open': <built-in function open>, 'copyright': Copyright (c) 2001-2019 Python Software Foundation.
    All Rights Reserved.
    
    Copyright (c) 2000 BeOpen.com.
    All Rights Reserved.
    
    Copyright (c) 1995-2001 Corporation for National Research Initiatives.
    All Rights Reserved.
    
    Copyright (c) 1991-1995 Stichting Mathematisch Centrum, Amsterdam.
    All Rights Reserved., 'credits':     Thanks to CWI, CNRI, BeOpen.com, Zope Corporation and a cast of thousands
        for supporting Python development.  See www.python.org for more information., 'license': Type license() to see the full license text, 'help': Type help() for interactive help, or help(object) for help about object., '__IPYTHON__': True, 'display': <function display at 0x7fa355065830>, '__pybind11_internals_v3_clang_libcpp_cxxabi1002__': <capsule object NULL at 0x7fa359712db0>, 'get_ipython': <bound method InteractiveShell.get_ipython of <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fa356643b90>>}, '__all__': ['get_info', 'show'], 'os': <module 'os' from '/Users/calebkeller/opt/anaconda3/lib/python3.7/os.py'>, 'sys': <module 'sys' (built-in)>, 'extra_dll_dir': '/Users/calebkeller/opt/anaconda3/lib/python3.7/site-packages/numpy/.libs', 'blas_mkl_info': {'libraries': ['mkl_rt', 'pthread'], 'library_dirs': ['/Users/calebkeller/opt/anaconda3/lib'], 'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)], 'include_dirs': ['/Users/calebkeller/opt/anaconda3/include']}, 'blas_opt_info': {'libraries': ['mkl_rt', 'pthread'], 'library_dirs': ['/Users/calebkeller/opt/anaconda3/lib'], 'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)], 'include_dirs': ['/Users/calebkeller/opt/anaconda3/include']}, 'lapack_mkl_info': {'libraries': ['mkl_rt', 'pthread'], 'library_dirs': ['/Users/calebkeller/opt/anaconda3/lib'], 'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)], 'include_dirs': ['/Users/calebkeller/opt/anaconda3/include']}, 'lapack_opt_info': {'libraries': ['mkl_rt', 'pthread'], 'library_dirs': ['/Users/calebkeller/opt/anaconda3/lib'], 'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)], 'include_dirs': ['/Users/calebkeller/opt/anaconda3/include']}, 'get_info': <function get_info at 0x7fa3571b07a0>, 'show': <function show at 0x7fa3571b0dd0>}
    
    opened by wrathagom 15
  • Uncertain pairs in active learning

    Uncertain pairs in active learning

    While using the active learning part, I call pair = deduper.uncertainPairs() and I found two issues:

    1. The returned list with uncertain pairs contains only one element.
    2. The tuple order is switched when the second dataset is smaller than the first. For example, if data_1 has 10000 entries and data_2 has 100 entries, the pair will be [(item_from_data_2, item_from_data_1)].
    opened by ADiegoCAlonso 15
  • Error while configuring openblas for Mac OSX

    Error while configuring openblas for Mac OSX

    Same error as the line at the bottom of https://readthedocs.org/builds/dedupe/2316678/:

    x86_64-linux-gnu-gcc: error: src/cpredicates.c: No such file or directory
    x86_64-linux-gnu-gcc: fatal error: no input files
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 4

    opened by asharma567 14
  • Bump pypa/cibuildwheel from 2.11.3 to 2.11.4

    Bump pypa/cibuildwheel from 2.11.3 to 2.11.4

    Bumps pypa/cibuildwheel from 2.11.3 to 2.11.4.

    Release notes

    Sourced from pypa/cibuildwheel's releases.

    v2.11.4

    • 🐛 Fix a bug that caused missing wheels on Windows when a test was skipped using CIBW_TEST_SKIP (#1377)
    • 🛠 Updates CPython 3.11 to 3.11.1 (#1371)
    • 🛠 Updates PyPy to 7.3.10, except on macOS which remains on 7.3.9 due to a bug. (#1371)
    • 📚 Added a reference to abi3audit to the docs (#1347)
    Changelog

    Sourced from pypa/cibuildwheel's changelog.

    v2.11.4

    24 Dec 2022

    • 🐛 Fix a bug that caused missing wheels on Windows when a test was skipped using CIBW_TEST_SKIP (#1377)
    • 🛠 Updates CPython 3.11 to 3.11.1 (#1371)
    • 🛠 Updates PyPy to 7.3.10, except on macOS which remains on 7.3.9 due to a bug on that platform. (#1371)
    • 📚 Added a reference to abi3audit to the docs (#1347)
    Commits
    • 27fc88e Bump version: v2.11.4
    • a7e9ece Merge pull request #1371 from pypa/update-dependencies-pr
    • b9a3ed8 Update cibuildwheel/resources/build-platforms.toml
    • 3dcc2ff fix: not skipping the tests stops the copy (Windows ARM) (#1377)
    • 1c9ec76 Merge pull request #1378 from pypa/henryiii-patch-3
    • 22b433d Merge pull request #1379 from pypa/pre-commit-ci-update-config
    • 98fdf8c [pre-commit.ci] pre-commit autoupdate
    • cefc5a5 Update dependencies
    • e53253d ci: move to ubuntu 20
    • e9ecc65 [pre-commit.ci] pre-commit autoupdate (#1374)
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies github_actions 
    opened by dependabot[bot] 0
  • Reference plugin variables using "module:class" strings

    Reference plugin variables using "module:class" strings

    Will close https://github.com/dedupeio/dedupe/issues/1085

    This could probably do with some more polishing, e.g. removing type from every Variable class, and swapping if variable_type == "Interaction" for parsing the variable type and then checking if isinstance(variable_type, dedupe.variables.InteractionType).
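
    For context, resolving a "module:class" string generally looks something like this (illustrative sketch, not necessarily this PR's implementation):

    import importlib

    def resolve(spec):
        """Resolve a 'module:class' string to the object it names."""
        module_name, _, attr_name = spec.partition(':')
        module = importlib.import_module(module_name)
        return getattr(module, attr_name)

    # e.g. resolve('datetimetype:DateTimeType')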

    opened by NickCrews 3
  • setuptools plugin solution for variables

    setuptools plugin solution for variables

    Using setuptools plugin facilities to set up variables. will close #1085

    external plugins

    external dedupe-variable plugins will look like https://github.com/dedupeio/dedupe-variable-datetime

    in particular, they will have this in their pyproject.toml (or equivalent setup.cfg/setup.py)

    [project.entry-points]
    dedupevariables = {datetimetype = "datetimetype:DateTimeType"}
    

    https://github.com/dedupeio/dedupe-variable-datetime/blob/fcd36afd000641168d9ae369623df866eeac35f9/pyproject.toml#L16
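
    For context, discovering such plugins at runtime with importlib.metadata could look roughly like this (illustrative sketch, not necessarily dedupe's implementation):

    from importlib.metadata import entry_points

    def load_variable_plugins(group='dedupevariables'):
        plugins = {}
        for ep in entry_points(group=group):  # Python 3.10+ keyword form
            plugins[ep.name] = ep.load()      # e.g. {'datetimetype': DateTimeType}
        return plugins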

    to do

    • [x] https://github.com/dedupeio/dedupe-variable-datetime
    • [ ] https://github.com/dedupeio/dedupe-variable-name
    • [ ] https://github.com/dedupeio/dedupe-variable-address
    • [ ] https://github.com/dedupeio/dedupe-variable-fuzzycategory
    opened by fgregg 8
  • Is incremental clustering supported?

    Is incremental clustering supported?

    This request would better fit the "Discussions" section of GitHub; however, as that does not seem to be used in this repository, I am posting it here (feel free to move it if you like).

    Essentially, my question is: is it possible to achieve incremental clustering starting from a base (large) data set and adding new records from time to time (without reanalysing everything from scratch)?

    Let me add a few more details: so far I have put together a simple demo for my use case following the postgres example.

    After the initial run (done in exactly the same way as in the postgres example above), when new records become available I suspect that I should do something along these lines:

    1. Update the blocking_map (initially built here) to take into account the new records. If I correctly understood the purpose of the blocking map, new blocking rules should be generated if the new incoming records do not belong to existing clusters (and probably no new blocking rules should be generated if new records match an existing cluster?).

    2. Update the existing clusters with the new records. This is the incremental counterpart of what is done at this line of the one-shot use case:

      clustered_dupes = deduper.cluster(deduper.score(record_pairs(read_cur)), threshold=0.5)
      

      If A is the set of already examined records and N is the set of new records (possibly just one), I expect that the incremental version of this step will try to match all records in N against all the other records in N and all the records in A, but will not compare pairs of records within A, as those have already been clustered (see the sketch below). Depending on how clusters are generated by the dedupe library (I still do not know!), this may cause two or more existing clusters containing records in A to be merged into one.
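
      A sketch of this pairing scheme (purely illustrative; not dedupe's API): only candidate pairs that involve at least one new record get scored.

      import itertools

      def incremental_pairs(blocks, new_record_ids):
          """Yield candidate pairs that include at least one new record.

          blocks: iterable of blocks (lists of record IDs) already updated
          with the new records; new_record_ids: the set of IDs in N.
          Pairs made only of previously clustered records are skipped.
          """
          for block in blocks:
              for a, b in itertools.combinations(block, 2):
                  if a in new_record_ids or b in new_record_ids:
                      yield a, b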

    Further questions:

    • The ultimate goal in my case is being able to quickly answer the question: "Given a (possibly never seen) record r, does it belong to an already existing cluster?". Of course, if I am able to incrementally add the record r to the data set inspected by the dedupe library, I will be able to answer this question by looking at which cluster r belongs to (possibly a singleton).

      However, depending on how the dedupe library works (which I do not know yet), checking whether r matches other already seen records may be answered in another (faster) way, without immediately adding r to the inspected dataset (that operation could be delayed to a later time). Can you elaborate on this?

    • In my use case the initial data set will probably contain ~10M records, which will probably grow to ~100M. Do you have any experience with the time and space resources needed for datasets of this size?

    Thank you a lot!

    opened by lmores 2
  • Can we overhaul internals of Variables

    Can we overhaul internals of Variables

    @fgregg how open are you to backwards-incompatible changes to the way that Variables are implemented? If we could go in there and overhaul their APIs, I think that might make https://github.com/dedupeio/dedupe/pull/1102 better. Not exactly sure what I'm looking for yet.

    I'm thinking that if someone has subclassed Variable, this might break them.

    opened by NickCrews 1
  • Blocking as a feature for scoring

    Blocking as a feature for scoring

    Right now, blocking and scoring are two distinct phases.

    All the information about how two records came to be blocked together is unused by the scorer. This is a bit silly, as the fact that two records are blocked together by multiple predicates could be a pretty good indicator of co-reference.

    I'm not really clear what the best way to take advantage of blocking information in scoring is though.

    a few ideas:

    1. Ensemble model: treat each blocking predicate as a classifier, and put them in an ensemble with the scorer.
    2. Blocking as feature: add dummy features indicating which predicate rules cover a pair; these features get fed into the scorer (a sketch follows below).

    In both cases, I'm not quite sure how to set up the training.
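
    A sketch of idea 2 (illustrative only, not dedupe's current pipeline): append one 0/1 dummy per blocking predicate to the usual distance features before they reach the scorer.

    import numpy as np

    def features_with_blocking(distances, covering_predicates, all_predicates):
        """distances: per-field distance vector for a record pair;
        covering_predicates: names of the predicates that blocked the pair together;
        all_predicates: ordered list of every predicate in the blocker."""
        dummies = [1.0 if p in covering_predicates else 0.0 for p in all_predicates]
        return np.concatenate([distances, dummies])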

    opened by fgregg 1