Deduplication is the task of combining different representations of the same real-world entity.

Last update: Nov 17, 2022


Overview

DedupliPy

Deduplication is the task of combining different representations of the same real-world entity. This package implements deduplication using active learning, which allows for rapid training without requiring a large, manually labelled dataset.
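To make the task concrete, here is a minimal sketch of deduplication in plain Python. It is not how DedupliPy works internally: it assumes a fixed similarity threshold of 0.8 and uses difflib from the standard library, whereas DedupliPy learns the decision from user labels. The function names similarity and deduplicate are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0 and 1 for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def deduplicate(records: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Group records that are linked by a pairwise similarity >= threshold."""
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare every pair and merge the groups of sufficiently similar records.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    groups: Dict[int, List[str]] = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


records = ["Jon Smith", "John Smith", "Mary Jones", "Mary Jonse", "Bob Lee"]
clusters = deduplicate(records)
for cluster in clusters:
    print(cluster)
```

Comparing every pair is quadratic in the number of records, which is why practical tools restrict the comparison to candidate pairs.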

DedupliPy is an end-to-end solution with advantages over existing solutions:

active learning; no large manually labelled dataset required
during active learning, the user is notified when the model has converged and training may be finished
works out of the box, advanced users can choose settings as desired (custom blocking rules, custom metrics, interaction features)
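As an illustration of what a custom blocking rule does, the sketch below groups values by a key and only pairs up values that share that key. It is a simplified stand-in written for this page, not DedupliPy's implementation; the helper candidate_pairs is hypothetical, and the rule first_two_characters is just one possible choice of key.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def first_two_characters(x: str) -> str:
    """Example blocking rule: values sharing this key become candidates."""
    return x[:2]


def candidate_pairs(values: List[str], rule: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Return only the pairs of values that share a blocking key."""
    blocks: Dict[str, List[str]] = defaultdict(list)
    for value in values:
        blocks[rule(value)].append(value)
    pairs: List[Tuple[str, str]] = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


names = ["andy", "andy t", "greg", "greg b", "paul"]
pairs = candidate_pairs(names, first_two_characters)
print(pairs)
```

With five names there are ten possible pairs; the rule reduces them to the two that share their first two characters.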

Developed by Frits Hermans

Documentation

Documentation can be found here

Installation

Normal installation

Install directly from PyPI:

pip install deduplipy

Install to contribute

Clone this GitHub repo and install in editable mode:

python -m pip install -e ".[dev]"
python setup.py develop

Usage

Apply deduplication to your pandas DataFrame df as follows:

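from deduplipy.deduplicator import Deduplicator

# df is a pandas DataFrame with (at least) the columns 'name' and 'address'
myDedupliPy = Deduplicator(col_names=['name', 'address'])
myDedupliPy.fit(df)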

This starts the interactive learning session, in which you indicate whether each presented pair is a match (y) or not (n). During active learning you are notified that training may be finished once the model has converged. Predictions on (new) data are obtained as follows:

result = myDedupliPy.predict(df)
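The idea behind such an interactive session can be sketched with a toy model. The example below is not DedupliPy's learner: it assumes each pair is summarised by a single similarity score, that matches always score higher than non-matches, and that scores of exactly 0 or 1 need no label. The function active_learn_threshold and the ask callback, which stands in for the user typing y or n, are hypothetical.

```python
from typing import Callable, List, Tuple


def active_learn_threshold(
    scores: List[float],
    ask: Callable[[float], bool],
) -> Tuple[float, int]:
    """Learn a match threshold with uncertainty sampling.

    ask(score) returns True for a match (y) and False otherwise (n).
    Returns the learned threshold and the number of questions asked.
    """
    low, high = 0.0, 1.0  # highest known non-match, lowest known match
    candidates = sorted(set(scores))
    questions = 0
    while True:
        # Only pairs scoring strictly between the bounds are still uncertain.
        uncertain = [s for s in candidates if low < s < high]
        if not uncertain:
            break  # converged: further labels would not change the model
        midpoint = (low + high) / 2
        # Query the pair the current model is least sure about.
        query = min(uncertain, key=lambda s: abs(s - midpoint))
        questions += 1
        if ask(query):
            high = query
        else:
            low = query
    return (low + high) / 2, questions


scores = [0.1, 0.2, 0.3, 0.49, 0.52, 0.65, 0.72, 0.85, 0.9]
threshold, questions = active_learn_threshold(scores, ask=lambda s: s >= 0.6)
print(f"threshold={threshold:.3f} after {questions} questions")
```

In this toy run the simulated user labels only some of the nine pairs before no uncertain pair remains, which is the point at which a "training may be finished" message makes sense.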

Comments

load_data() fails

I just tried installing the library and running the tutorial, per the docs.

import pandas as pd
from deduplipy.datasets import load_data

df = load_data()

This gave the following error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_3092214/1389887512.py in <module>
----> 1 df = load_data()

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/deduplipy/datasets.py in load_data(kind)
     36         return load_stoxx50()
     37     elif kind == 'voters':
---> 38         return load_voters()

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/deduplipy/datasets.py in load_voters()
     14 def load_voters() -> pd.DataFrame:
     15     file_path = resource_filename('deduplipy', os.path.join('data', 'voter_names.csv'))
---> 16     df = pd.read_csv(file_path)
     17     print("Column names: 'name', 'suburb', 'postcode'")
     18     return df

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    584     kwds.update(kwds_defaults)
    585 
--> 586     return _read(filepath_or_buffer, kwds)
    587 
    588 

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
    480 
    481     # Create the parser.
--> 482     parser = TextFileReader(filepath_or_buffer, **kwds)
    483 
    484     if chunksize or iterator:

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
    809             self.options["has_index_names"] = kwds["has_index_names"]
    810 
--> 811         self._engine = self._make_engine(self.engine)
    812 
    813     def close(self):

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/readers.py in _make_engine(self, engine)
   1038             )
   1039         # error: Too many arguments for "ParserBase"
-> 1040         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1041 
   1042     def _failover_to_python(self):

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/c_parser_wrapper.py in __init__(self, src, **kwds)
     49 
     50         # open handles
---> 51         self._open_handles(src, kwds)
     52         assert self.handles is not None
     53 

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/parsers/base_parser.py in _open_handles(self, src, kwds)
    227             memory_map=kwds.get("memory_map", False),
    228             storage_options=kwds.get("storage_options", None),
--> 229             errors=kwds.get("encoding_errors", "strict"),
    230         )
    231 

~/Development/calm-notebooks/venv/lib/python3.7/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    704                 encoding=ioargs.encoding,
    705                 errors=errors,
--> 706                 newline="",
    707             )
    708         else:

FileNotFoundError: [Errno 2] No such file or directory: '/home/vincent/Development/calm-notebooks/venv/lib/python3.7/site-packages/deduplipy/data/voter_names.csv'

Here's my watermark info:

Python implementation: CPython
Python version       : 3.7.9
IPython version      : 7.27.0

numpy       : 1.20.3
pandas      : 1.3.2
scikit-learn: 0.24.2
deduplipy   : 0.5

Compiler    : GCC 9.3.0
OS          : Linux
Release     : 5.11.0-7614-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 12
Architecture: 64bit

opened by koaning 11

Add a `conda` install option for `deduplipy`

It would be nice to have a conda install option. I have already started work on adding this library to the conda-forge channel (PR: https://github.com/conda-forge/staged-recipes/pull/17495). Once the PR is merged, you will be able to install deduplipy with:

conda install -c conda-forge deduplipy

opened by sugatoray 8

Error in MinHashSampler ("_create_minhash_pairs") when length of words within string compared are all equal to 1

Hi, and thank you for creating the package! I am exploring its applicability to a data set and have run into an error. The data come from an ERP system, so users can enter whatever they want, including erroneous values. I found that when every word in the string to be compared has length 1, MinHashSampler raises an error during fit. I reproduced the error on an artificial dataset, see below:

import pandas as pd
from deduplipy.deduplicator import Deduplicator
# `ratio` is not defined in the original report; it is presumably a string
# similarity function such as thefuzz.fuzz.ratio (assumption).

def first_two_characters(x):
    return x[:2]

df = pd.DataFrame(
    data=[['george d'], ['andy t'], ['greg b'], ['ret'], ['pam'], ['kos'], ['andy'],
          ['pamela'], ['pamla'], ['kis'], ['paul'], ['paul d'],
          ['geirge d'], ['ndy t'], ['greg'], ['retos'], ['pipo'], ['konstas'], ['grig'], ['gre'],
          ['k i']],
    columns=['name'])
field_info = {'name': [ratio]}
myDedupliPy = Deduplicator(
    field_info=field_info,
    interaction=False,
    rules={'name': [first_two_characters]},
    verbose=1)
myDedupliPy.fit(X=df[['name']], n_samples=100)

If I remove the ['k i'] row, it runs without errors. The error occurs when MinHashSampler is called, but I am not sure exactly what the function does or how to correct this.

I suppose I could perform a check before calling the function, for example looking at the word lengths in each row and omitting the affected rows, but I wanted to ask whether you have any suggestions or recipes.
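The check described above could look like the following sketch. It rests on an assumption that is not confirmed in this thread: that the tokenizer behind the sampler keeps only tokens of two or more word characters, as the default token pattern of scikit-learn's CountVectorizer does, so a value such as 'k i' yields no tokens at all. The helper split_tokenizable is hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: tokens must consist of at least two word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def split_tokenizable(values: List[str]) -> Tuple[List[str], List[str]]:
    """Split values into those with at least one token and those without."""
    keep: List[str] = []
    skip: List[str] = []
    for value in values:
        if TOKEN_PATTERN.search(value):
            keep.append(value)
        else:
            skip.append(value)
    return keep, skip


names = ["george d", "k i", "pam", "a", "ndy t"]
usable, skipped = split_tokenizable(names)
print("usable :", usable)
print("skipped:", skipped)
```

On a DataFrame the same predicate could be applied with df['name'].str.contains(r'\b\w\w+\b') before calling fit, and the skipped rows handled separately.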

Thank you very much in advance,

ValueError                                Traceback (most recent call last)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/deduplipy/deduplicator/deduplicator.py:136, in Deduplicator.fit(self, X, n_samples)
    124 def fit(self, X: pd.DataFrame, n_samples: int = 10_000) -> 'Deduplicator':
    125     """
    126     Fit the deduplicator instance
    127
   (...)
    134
    135     """
--> 136     pairs_table = self._create_pairs_table(X, n_samples)
    137     similarities = self._calculate_string_similarities(pairs_table)
    138     self.myActiveLearner.fit(similarities)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/deduplipy/deduplicator/deduplicator.py:105, in Deduplicator._create_pairs_table(self, X, n_samples)
     93 """
     94 Create sample of pairs
     95
   (...)
    102
    103 """
    104 n_samples_minhash = n_samples // 2
--> 105 minhash_pairs = MinHashSampler(self.col_names).sample(X, n_samples_minhash)
    106 # the number of minhash samples can be (much) smaller than n_samples//2, in such case take more random pairs:
    107 n_samples_naive = n_samples - len(minhash_pairs)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/deduplipy/sampling/minhash_sampling.py:128, in MinHashSampler.sample(self, X, n_samples, threshold)
    114 def sample(self, X: pd.DataFrame, n_samples: int, threshold: float = 0.2) -> pd.DataFrame:
    115     """
    116     Method to draw sample of pairs of size n_samples from dataframe X. Note that n_samples cannot be returned if
    117     the number of pairs above the threshold is too low.
   (...)
    126
    127     """
--> 128     minhash_pairs = self._create_minhash_pairs(X, threshold)
    130     stratified_sample = self._get_stratified_sample(minhash_pairs, n_samples)
    132     non_stratified_sample = self._get_non_stratified_sample(minhash_pairs, stratified_sample, n_samples)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/deduplipy/sampling/minhash_sampling.py:49, in MinHashSampler._create_minhash_pairs(self, X, threshold)
     47 minhash_pairs = pd.DataFrame()
     48 for col in self.col_names:
---> 49     minhash_result = self.MinHasher.fit_predict(df, col)
     51     # add other columns than the one used for minhashing
     52     minhash_result = (minhash_result
     53                       .merge(df.drop(columns=[col]), left_on='row_number_1', right_on='row_number')
     54                       .drop(columns=['row_number']))

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pyminhash/pyminhash.py:154, in MinHash.fit_predict(self, df, col_name)
    152 df_['row_number'] = np.arange(len(df_))
    153 df_ = self.sparse_vectorize(df, col_name)
--> 154 df_ = self.create_minhash_signatures(df)
    155 return self.create_pairs(df, col_name)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pyminhash/pyminhash.py:88, in MinHash._create_minhash_signatures(self, df)
     76 def _create_minhash_signatures(self, df: pd.DataFrame) -> pd.DataFrame:
     77     """
     78     Apply minhashing to the column sparse_vector in Pandas dataframe df in the new column minhash_signature.
     79     In addition, one column (e.g.: 'hash_{0}') per hash table is created.
   (...)
     86
     87     """
---> 88     df['minhash_signature'] = df['sparse_vector'].apply(self._create_minhash)
     89     # the following involved way of creating 'hash_' columns prevents efficiency warnings
     90     hash_df = df['minhash_signature'].apply(pd.Series)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pandas/core/series.py:4433, in Series.apply(self, func, convert_dtype, args, **kwargs)
   4323 def apply(
   4324     self,
   4325     func: AggFuncType,
   (...)
   4328     **kwargs,
   4329 ) -> DataFrame | Series:
   4330     """
   4331     Invoke function on values of Series.
   4332
   (...)
   4431     dtype: float64
   4432     """
-> 4433     return SeriesApply(self, func, convert_dtype, args, kwargs).apply()

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pandas/core/apply.py:1082, in SeriesApply.apply(self)
   1078 if isinstance(self.f, str):
   1079     # if we are a string, try to dispatch
   1080     return self.apply_str()
-> 1082 return self.apply_standard()

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pandas/core/apply.py:1137, in SeriesApply.apply_standard(self)
   1131 values = obj.astype(object)._values
   1132 # error: Argument 2 to "map_infer" has incompatible type
   1133 # "Union[Callable[..., Any], str, List[Union[Callable[..., Any], str]],
   1134 # Dict[Hashable, Union[Union[Callable[..., Any], str],
   1135 # List[Union[Callable[..., Any], str]]]]]"; expected
   1136 # "Callable[[Any], Any]"
-> 1137 mapped = lib.map_infer(
   1138     values,
   1139     f,  # type: ignore[arg-type]
   1140     convert=self.convert_dtype,
   1141 )
   1143 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1144     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1145     # See also GH#25959 regarding EA support
   1146     return obj._constructor_expanddim(list(mapped), index=obj.index)

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2870, in pandas._libs.lib.map_infer()

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/pyminhash/pyminhash.py:73, in MinHash._create_minhash(self, doc)
     71 hashes += self.b
     72 hashes %= self.next_prime
---> 73 minhashes = hashes.min(axis=0)
     74 return minhashes

File ~/miniconda2/envs/py395dd/lib/python3.9/site-packages/numpy/core/_methods.py:44, in _amin(a, axis, out, keepdims, initial, where)
     42 def _amin(a, axis=None, out=None, keepdims=False,
     43           initial=_NoValue, where=True):
---> 44     return umr_minimum(a, axis, None, out, keepdims, initial, where)

ValueError: zero-size array to reduction operation minimum which has no identity

opened by gregskol 5

Active learning on Databricks

Dear @fritshermans!

Thank you for your excellent library. Is there any way to run the interactive active learning mode on Databricks? As far as I know, a Databricks notebook cannot read shell input, so I am unable to confirm or reject matches during active learning.

Thank you for any suggestion how to proceed here.

Cheers, Andrej

opened by azachar 3

Added conda install instruction

This PR is representative of the work done to add deduplipy to conda-forge.

Conda-forge PR:

https://github.com/conda-forge/staged-recipes/pull/17495

This PR updates the readme file and closes #15.

[x] added conda-install instruction

[x] added relevant badges

opened by sugatoray 1

How is it different from https://github.com/dedupeio/dedupe

I currently use https://github.com/dedupeio/dedupe and was curious whether there are any specific pain points that this library solves over it. From a cursory look, is one of the major differences the use of the modAL library rather than the custom-built active learning solution in dedupe?

(Apologies if I'm abusing the power of opening an issue to ask this question)

opened by abhilashchowdhary 1
Fitting and Null-Values (NaN)

Having null values in a column that is part of the training df leads to an error in the sklearn package, in feature_extraction\text.py. I could of course set null values to a default string ("None", for instance), but would this not have an unwanted impact on the training itself?

If we used a default string value instead of NaN, would it be possible to exclude certain values in a series of a dataframe, to avoid blocking on them?
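The concern about blocking on a placeholder can be illustrated with a small sketch. It is not a DedupliPy feature: it assumes missing values have been converted to None (or left as empty strings) and shows a hypothetical helper, block_skipping_missing, that never uses such values as a blocking key, so rows with missing data are not all paired with each other.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Optional, Tuple


def block_skipping_missing(values: List[Optional[str]]) -> List[Tuple[int, int]]:
    """Pair up row indices that share a blocking key, ignoring missing values."""
    blocks: Dict[str, List[int]] = defaultdict(list)
    for index, value in enumerate(values):
        if value is None or not value.strip():
            continue  # missing value: never used as a blocking key
        blocks[value[:2].lower()].append(index)
    pairs: List[Tuple[int, int]] = []
    for rows in blocks.values():
        pairs.extend(combinations(rows, 2))
    return pairs


suburbs = ["Newtown", None, "newtown", None, "Redfern", ""]
pairs = block_skipping_missing(suburbs)
print(pairs)
```

Had the missing entries been replaced by a shared default string, rows 1 and 3 would have landed in the same block and been proposed as a candidate pair.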

opened by Murat-Topuz 1

Owner

GitHub https://www.deduplipy.com

A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

Dedupe Python Library dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on

2.9k Feb 11, 2021


Negative sampling for solving the unlabeled entity problem in NER. ICLR-2021 paper: Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition.

Negative Sampling for NER Unlabeled entity problem is prevalent in many NER scenarios (e.g., weakly supervised NER). Our paper in ICLR-2021 proposes u

128 Dec 29, 2022

jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese.

jel: Japanese Entity Linker jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese. Usage Currently, link and question methods

10 Jan 6, 2023

Open-World Entity Segmentation

Open-World Entity Segmentation Project Website Lu Qi*, Jason Kuen*, Yi Wang, Jiuxiang Gu, Hengshuang Zhao, Zhe Lin, Philip Torr, Jiaya Jia This projec

408 Dec 29, 2022

Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet.

Sonnet finder Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet. Usage This is a Python scrip

11 Sep 25, 2022

PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

109 Dec 2, 2022

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

GenSen Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning Sandeep Subramanian, Adam Trischler, Yoshua B

309 Oct 19, 2022

This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).

Ucto for Python This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first step in almost any Natural Language Processing task,

27 Dec 14, 2022

Using Bert as the backbone model for lime, designed for NLP task explanation (sentence pair text classification task)

Lime Comparing deep contextualized model for sentences highlighting task. In addition, take the classic explanation model "LIME" with bert-base model

2 Jan 18, 2022

SinglepassTextCluster, a text clustering tool based on the single-pass clustering algorithm that uses tfidf vectors and doc2vec and can be used for clustering your own real-time corpus. It is a small automatic text-clustering component with two built-in text vectorization methods, tfidf and doc2vec, and it automatically outputs the number of clusters, the set of documents in each cluster and the cluster sizes.

Project background SinglepassTextCluster, a TextCluster tool based on the Singlepass cluster algorithm that uses tfidf vectors and doc2vec, which can be used for individ

34 Dec 18, 2022


Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

CCKS-Title-based-large-scale-commodity-entity-retrieval-top1

Python package for performing Entity and Text Matching using Deep Learning.

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.
