Python package for performing Entity and Text Matching using Deep Learning.

Last update: Dec 28, 2022

Overview

DeepMatcher

DeepMatcher is a Python package for performing entity and text matching using deep learning. It provides built-in neural networks and utilities that enable you to train and apply state-of-the-art deep learning models for entity matching in less than 10 lines of code. The models are also easily customizable - the modular design allows any subcomponent to be altered or swapped out for a custom implementation.

As an example, DeepMatcher takes labeled tuple pairs and trains a neural network to perform matching, i.e., to predict match / non-match labels. The trained network can then be used to obtain labels for unlabeled tuple pairs.
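A minimal sketch of what labeled tuple pairs could look like as CSV. The product rows are invented for illustration, and the column names assume the `id` / `label` / `left_*` / `right_*` convention that appears in the `dm.data.process` calls quoted in the issues further down this page.

```python
import csv
import io

# Hypothetical rows, invented for illustration only.
# Column names assume the id / label / left_* / right_* convention.
rows = [
    {"id": 0, "label": 1,
     "left_title": "iPhone 8 64GB Space Gray",
     "right_title": "Apple iPhone 8 (64 GB, Space Gray)"},
    {"id": 1, "label": 0,
     "left_title": "iPhone 8 64GB Space Gray",
     "right_title": "Samsung Galaxy S8 64GB"},
]

def to_csv_text(records):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

csv_text = to_csv_text(rows)
print(csv_text)
```

Each row is one candidate pair: the `left_` and `right_` columns hold the two records being compared, and `label` is 1 for a match and 0 for a non-match.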

Paper and Data

For details on the architecture of the models used, take a look at our paper Deep Learning for Entity Matching (SIGMOD '18). All public datasets used in the paper can be downloaded from the datasets page.

Quick Start: DeepMatcher in 30 seconds

There are four main steps in using DeepMatcher:

Data processing: Load and process labeled training, validation and test CSV data.

import deepmatcher as dm
train, validation, test = dm.data.process(path='data_directory',
    train='train.csv', validation='validation.csv', test='test.csv')

Model definition: Specify neural network architecture. Uses the built-in hybrid model (as discussed in section 4.4 of our paper) by default. Can be customized to your heart's desire.

model = dm.MatchingModel()

Model training: Train neural network.

model.run_train(train, validation, best_save_path='best_model.pth')

Application: Evaluate model on test set and apply to unlabeled data.

model.run_eval(test)

unlabeled = dm.data.process_unlabeled(path='data_directory/unlabeled.csv', trained_model=model)
model.run_prediction(unlabeled)

Installation

We currently support only Python versions 3.5 and 3.6. Installing using pip is recommended:

pip install deepmatcher

Note that during installation you may see an error message that says "Failed building wheel for fasttextmirror". You can safely ignore this - it does NOT mean that there are any problems with installation.

Tutorials

Using DeepMatcher:

Getting Started: A more in-depth guide to help you get familiar with the basics of using DeepMatcher.
Data Processing: Advanced guide on what data processing involves and how to customize it.
Matching Models: Advanced guide on neural network architecture for entity matching and how to customize it.

Entity Matching Workflow:

End to End Entity Matching: A guide to develop a complete entity matching workflow. The tutorial discusses how to use DeepMatcher with Magellan to perform blocking, sampling, labeling and matching to obtain matching tuple pairs from two tables.

DeepMatcher for other matching tasks:

Question Answering with DeepMatcher: A tutorial on how to use DeepMatcher for question answering. Specifically, we will look at WikiQA, a benchmark dataset for the task of Answer Selection.

API Reference

API docs are here.

Support

Take a look at the FAQ for common issues. If you run into any issues or have questions not answered in the FAQ, please file GitHub issues and we will address them asap.

The Team

DeepMatcher was developed by University of Wisconsin-Madison grad students Sidharth Mudgal and Han Li, under the supervision of Prof. AnHai Doan and Prof. Theodoros Rekatsinas.

Comments

ModuleNotFoundError: No module named 'torchtext.legacy'

Hi,

When trying to import deepmatcher, I am facing the error: ModuleNotFoundError: No module named 'torchtext.legacy'

Steps to recreate:

!pip install deepmatcher --user
import deepmatcher as dm

Stacktrace:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-4-e017bad088ef> in <module>
----> 1 import deepmatcher as dm

~/.local/lib/python3.8/site-packages/deepmatcher/__init__.py in <module>
      8 import sys
      9 
---> 10 from .data import process as data_process
     11 from .models import modules
     12 from .models.core import (MatchingModel, AttrSummarizer, WordContextualizer,

~/.local/lib/python3.8/site-packages/deepmatcher/data/__init__.py in <module>
----> 1 from .field import MatchingField, reset_vector_cache
      2 from .dataset import MatchingDataset
      3 from .iterator import MatchingIterator
      4 from .process import process, process_unlabeled
      5 from .dataset import split

~/.local/lib/python3.8/site-packages/deepmatcher/data/field.py in <module>
     11 import fasttext
     12 import torch
---> 13 from torchtext.legacy import data, vocab
     14 from torchtext.utils import download_from_url
     15 from urllib.request import urlretrieve

ModuleNotFoundError: No module named 'torchtext.legacy'

Please let me know how to fix this, or if you need more information.

opened by jacobceles 7

Getting an error when training a model in Google Colab

I use Google Colab. When I tried to process data with fastText embeddings for French, I set it up like this:

train_set, validation_set = dm.data.process(
    path='drive/My Drive/recommandersystem/deepmatcher_model',
    cache='train_cache.pth',
    train='train.csv',
    validation='valid.csv',
    embeddings='fasttext.fr.bin',
    embeddings_cache_path='drive/My Drive/recommandersystem/deepmatcher_model',
    ignore_columns=['id', ''],
    id_attr='_id',
    label_attr='label',
    left_prefix='ltable_',
    right_prefix='rtable_')

and I get this error message:

HTTPError                                 Traceback (most recent call last)
in ()
     11     label_attr='label',
     12     left_prefix='ltable_',
---> 13     right_prefix='rtable_')

13 frames

/usr/lib/python3.6/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    648 class HTTPDefaultErrorHandler(BaseHandler):
    649     def http_error_default(self, req, fp, code, msg, hdrs):
--> 650         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    651 
    652 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

How can I solve this?

opened by walide67 6

Cannot achieve the precision/recall/F1 reported in SIGMOD18

Hi, I have just tested deepmatcher on the walmart-amazon dataset. However, I cannot achieve the precision/recall/F1 reported in SIGMOD18.

import deepmatcher as dm
import logging
import torch

logging.getLogger('deepmatcher.core').setLevel(logging.INFO)

model = dm.MatchingModel(attr_summarizer=dm.attr_summarizers.RNN(word_aggregator='birnn-last-pool'))
# model = dm.MatchingModel()
print(model)
model.initialize(train_dataset)  # Explicitly initialize model.

model = model.cuda()
lr_decay = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
batch_size = [16, 32]
pos_neg_ratio = [10, 7, 5, 4, 3, 2, 1]

best_f1 = -1
best_params = {'lr_decay': -1, 'batch_size': -1, 'pos_neg_ratio':-1}
for alpha in lr_decay:
    for b in batch_size:
        for rho in pos_neg_ratio:
            optimizer = dm.optim.Optimizer(lr_decay=alpha)
            model.run_train(
                train_dataset,
                validation_dataset,
                epochs=15,
                batch_size = b,
                pos_neg_ratio=rho,
                optimizer=optimizer,
                best_save_path='../output/walmart-amazon/rnn_model.pth')
            print(f'lr_decay: {alpha}, batch_size:{b}, pos_neg_ratio:{rho}')
            f1 = model.run_eval(test_dataset)
            if f1 > best_f1:
                best_f1 = f1
                best_params['lr_decay'] = alpha
                best_params['batch_size'] = b
                best_params['pos_neg_ratio'] = rho
                print(f'best f1-score: {best_f1}, {best_params}')
print(f'best f1-score: {best_f1}, {best_params}')

I haven't yet tested all the hyperparameters, but the available results show that f1-score is only about 37%. However, the f1 reported in SIGMOD18 is 67.6% for RNN on the structured walmart-amazon dataset. I use the data downloaded from the link in this repository.

Could you tell me what the problem is? What are the hyper-parameters selected for the experiments in SIGMOD 18 (RNN)?

opened by EliasMei 3

Can't save model state

Hello, I urgently need your help: the date for presenting my graduation project is approaching. I get an error when I try to save the model. My code is:

import deepmatcher as dm
import pandas as pd
import numpy as np
import os

train_set, validation_set = dm.data.process(
    path="deepmatcher_model/",
    cache='train_cache.pth',
    train='train.csv',
    validation='valid.csv',
    embeddings='fasttext.fr.bin',
    embeddings_cache_path="deepmatcher_model/",
    ignore_columns=['id', '', 'ltable_index', 'rtable_index'],
    id_attr='_id',
    label_attr='label',
    left_prefix='ltable_',
    right_prefix='rtable_')

model = dm.MatchingModel(attr_summarizer='hybrid')
model.initialize(train_set)
model.run_train(
    train_dataset=train_set,
    validation_dataset=validation_set,
    epochs=20,
    batch_size=16,
    best_save_path='deepmatcher_model/hybrid_model.pth',
    pos_neg_ratio=3)
model.save_state('hybrid_model.pth', include_meta=True)

The error message:

AttributeError                            Traceback (most recent call last)
in ()
----> 1 model.save_state('hybrid_model.pth',include_meta=True)

4 frames

/usr/local/lib/python3.6/dist-packages/torch/serialization.py in _save(obj, f, pickle_module, pickle_protocol)
    196     pickler = pickle_module.Pickler(f, protocol=pickle_protocol)
    197     pickler.persistent_id = persistent_id
--> 198     pickler.dump(obj)
    199 
    200     serialized_storage_keys = sorted(serialized_storages.keys())

AttributeError: Can't pickle local object 'MatchingDataset.finalize_metadata..'

Please tell me how I can solve this. I use Google Colab. Thanks, and sorry for my English. @thodrek @sidharthms @hanli91 @anhaidgroup

opened by walide67 3
Can't download fasttext slovenian binaries

Downloading fastText binaries does not work unless you select English. Looking at the code, I presume this is because the English binaries are hosted in a Google Drive folder, while the others are downloaded automatically from an AWS server that seems to be unreachable.

opened by belerico 3

error while reading csv file- invalid literal for int() with base 10: 'match'

I get an error when using dm.data.process. I set up the data and code as per the default requirements, so I am not sure what's wrong here.

The CSV file schema is as suggested: id, right_text, left_text, label. The label column contains match and non-match string entries.

train, validation, test = dm.data.process(
    path=path,
    train='train.csv',
    validation='val.csv',
    test='test.csv',
    ignore_columns=('left_id', 'right_id'),
    left_prefix='left_',
    right_prefix='right_',
    label_attr='label',
    id_attr='id')

Error:

ValueError                                Traceback (most recent call last)
<ipython-input-33-9684d5da2beb> in <module>
      8     right_prefix='right_',
      9     label_attr='label',
---> 10     id_attr='id')

~/miniconda3/envs/dm/lib/python3.7/site-packages/deepmatcher/data/process.py in process(path, train, validation, test, unlabeled, cache, check_cached_data, auto_rebuild_cache, tokenize, lowercase, embeddings, embeddings_cache_path, ignore_columns, include_lengths, id_attr, label_attr, left_prefix, right_prefix, use_magellan_convention, pca)
    218         check_cached_data,
    219         auto_rebuild_cache,
--> 220         train_pca=pca)
    221 
    222     # Save additional information to train dataset.

~/miniconda3/envs/dm/lib/python3.7/site-packages/deepmatcher/data/dataset.py in splits(cls, path, train, validation, test, fields, embeddings, embeddings_cache, column_naming, cache, check_cached_data, auto_rebuild_cache, train_pca, **kwargs)
    559             dataset_args = {'fields': fields, 'column_naming': column_naming, **kwargs}
    560             train_data = None if train is None else cls(
--> 561                 path=os.path.join(path, train), **dataset_args)
    562             val_data = None if validation is None else cls(
    563                 path=os.path.join(path, validation), **dataset_args)

~/miniconda3/envs/dm/lib/python3.7/site-packages/deepmatcher/data/dataset.py in __init__(self, fields, column_naming, path, format, examples, metadata, **kwargs)
    163                 examples = [make_example(line, fields) for line in
    164                     pyprind.prog_bar(reader, iterations=lines,
--> 165                         title='\nReading and processing data from "' + path + '"')]
    166 
    167             super(MatchingDataset, self).__init__(examples, fields, **kwargs)

~/miniconda3/envs/dm/lib/python3.7/site-packages/deepmatcher/data/dataset.py in <listcomp>(.0)
    161 
    162                 next(reader)
--> 163                 examples = [make_example(line, fields) for line in
    164                     pyprind.prog_bar(reader, iterations=lines,
    165                         title='\nReading and processing data from "' + path + '"')]

~/miniconda3/envs/dm/lib/python3.7/site-packages/torchtext/legacy/data/example.py in fromCSV(cls, data, fields, field_to_index)
     64     def fromCSV(cls, data, fields, field_to_index=None):
     65         if field_to_index is None:
---> 66             return cls.fromlist(data, fields)
     67         else:
     68             assert(isinstance(fields, dict))

~/miniconda3/envs/dm/lib/python3.7/site-packages/torchtext/legacy/data/example.py in fromlist(cls, data, fields)
     82                         setattr(ex, n, f.preprocess(val))
     83                 else:
---> 84                     setattr(ex, name, field.preprocess(val))
     85         return ex
     86 

~/miniconda3/envs/dm/lib/python3.7/site-packages/torchtext/legacy/data/field.py in preprocess(self, x)
    213             x = [w for w in x if w not in self.stop_words]
    214         if self.preprocessing is not None:
--> 215             return self.preprocessing(x)
    216         else:
    217             return x

~/miniconda3/envs/dm/lib/python3.7/site-packages/deepmatcher/data/process.py in <lambda>(x)
     62         include_lengths=include_lengths)
     63     numeric_field = MatchingField(
---> 64         sequential=False, preprocessing=lambda x: int(x), use_vocab=False)
     65     id_field = MatchingField(sequential=False, use_vocab=False, id=True)
     66 

ValueError: invalid literal for int() with base 10: 'match'

opened by sidhantls 2
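The last frame of the traceback shows the label column being passed through `int(x)`, so string labels such as `match` cannot be parsed. One possible workaround, sketched here under the assumption that the labels are exactly the strings `match` and `non-match` and that 1 should stand for a match, is to rewrite the label column as integers before calling `dm.data.process`. The helper names `encode_labels` and `convert_file` are hypothetical.

```python
import csv

# Assumed mapping: 1 = match, 0 = non-match.
LABEL_MAP = {"match": 1, "non-match": 0}

def encode_labels(rows, label_attr="label"):
    """Return copies of the rows with string labels replaced by 0/1 integers."""
    encoded = []
    for row in rows:
        new_row = dict(row)
        new_row[label_attr] = LABEL_MAP[row[label_attr].strip().lower()]
        encoded.append(new_row)
    return encoded

def convert_file(src_path, dst_path, label_attr="label"):
    """Read a CSV, encode its label column, and write the result to dst_path."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = encode_labels(list(reader), label_attr)
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

sample = [{"id": "0", "label": "match"}, {"id": "1", "label": "non-match"}]
print(encode_labels(sample))
```

An unexpected label value raises a `KeyError` rather than being silently mapped, which makes stray values in the label column easy to spot.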

How is F1 calculated (micro, macro, weighted)?

Hi

Could you please confirm that when the following method is run:

model.run_eval(test)

Is the PRF1 score calculated using micro-averaging, or macro, or weighted-macro?

Thanks

opened by ziqizhang 2
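For reference, binary F1 is the harmonic mean of precision and recall for the positive (match) class only, whereas macro and weighted variants average per-class F1 scores. Which variant `run_eval` reports is not stated on this page. As a sanity check, though, the F1 / Prec / Rec triple quoted in the reproducibility issue below (35.09 / 33.33 / 37.04) is consistent with the harmonic-mean form. The sketch below computes it from raw counts; the counts are hypothetical, chosen only so that the output reproduces those figures.

```python
def harmonic_mean(precision, recall):
    """Harmonic mean of two non-negative numbers; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def precision_recall_f1(tp, fp, fn):
    """Binary precision, recall, and F1 (as percentages) for the positive class."""
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, harmonic_mean(precision, recall)

# Hypothetical counts: 10 true positives, 20 false positives, 17 false negatives.
p, r, f1 = precision_recall_f1(tp=10, fp=20, fn=17)
print(f"F1: {f1:.2f} | Prec: {p:.2f} | Rec: {r:.2f}")
# prints "F1: 35.09 | Prec: 33.33 | Rec: 37.04"
```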
SIGMOD experiments reproducibility

Hi, I am trying to reproduce the experiments from the SIGMOD 2018 paper: http://pages.cs.wisc.edu/~anhai/papers1/deepmatcher-sigmod18.pdf. I am having a hard time finding the right setup and I get results far poorer than the ones reported in the paper for most of the datasets. Can you please give me a hint regarding the right setup? For example, what are the parameters for the hybrid setup? Using the defaults leads to poor results and following the existing guides in the repository did not help much.

As an example, for the (complete) iTunes-Amazon scenario the best I could obtain was F1: 35.09 | Prec: 33.33 | Rec: 37.04. But the paper reports better results.

Thank you!

opened by alex-bogatu 2
Why use masked_fill_?

Hi, I have just read your code. I found that many scripts use 'masked_fill_' for tensor computation, e.g., word_aggregators.py:139-140, word_comparators.py:173-175, and word_contextualizers.py:157-159.

Could you please tell me why you use masked_fill_ for tensors?

opened by fukien 2
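One common reason for this pattern, offered as general background rather than as a statement about the specific lines cited: sequences in a batch are padded to equal length, and the padded positions must not influence softmax attention or pooling. Filling them with negative infinity before a softmax gives them zero weight. In PyTorch, `tensor.masked_fill_(mask, value)` performs this fill in place; the plain-Python sketch below shows the same idea without any library. The function names are illustrative.

```python
import math

def masked_fill(values, mask, fill_value):
    """Return a copy of values with fill_value wherever mask is True."""
    return [fill_value if m else v for v, m in zip(values, mask)]

def softmax(scores):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Attention scores for a sequence of true length 2, padded to length 4.
scores = [2.0, 1.0, 0.0, 0.0]
padding_mask = [False, False, True, True]  # True marks padded positions

weights = softmax(masked_fill(scores, padding_mask, float("-inf")))
print(weights)
```

Without the mask, the two padded positions would each receive a nonzero share of the attention weight even though they carry no information.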
OverflowError: Python int too large to convert to C long

import deepmatcher as dm
import torch

torch.cuda.is_available()

import pandas as pd
pd.read_csv(r'E:\project\deepmatcher\examples\sample_data\itunes-amazon\train.csv').head()

train, validation, test = dm.data.process(
    path=r'E:\project\deepmatcher\examples\sample_data\itunes-amazon',
    train='train.csv', validation='validation.csv', test='test.csv')
model = dm.MatchingModel()

I am on a Windows system. How can I deal with this problem?

opened by TianyiDuan 2
AttributeError: module 'torchtext.data' has no attribute 'Field'

I am trying to import deepmatcher using:

import deepmatcher as dm

and I am getting the following error message:

AttributeError: module 'torchtext.data' has no attribute 'Field'

opened by nilbahaaeddine 1
dm.data.preprocess no vectors found at /root/.vector_cache/wiki.en.bin

In Colab:

After installing torchtext legacy 0.11 to bypass https://github.com/anhaidgroup/deepmatcher/issues/96,

when running dm.data.preprocess I get:

Reading and processing data from 0% [############################# ] 100% | ETA: 00:00:01
Reading and processing data from 0% [############################# ] 100% | ETA: 00:00:00
INFO:deepmatcher.data.field:Downloading vectors from https://drive.google.com/uc?export=download&id=1Vih8gAmgBnuYDxfblbT94P6WjB7s1ZSh to /root/.vector_cache/wiki.en.bin
/usr/local/lib/python3.7/dist-packages/deepmatcher/data/field.py:79: ResourceWarning: unclosed <ssl.SSLSocket fd=61, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.28.0.2', 47830), raddr=('172.253.123.113', 443)>
  self.destination = self.backup_destination
ResourceWarning: Enable tracemalloc to get the object allocation traceback
INFO:deepmatcher.data.field:Extracting vectors into /root/.vector_cache

RuntimeError                              Traceback (most recent call last)

in
      2     path='',
      3     train='train.csv',
----> 4     validation='validation.csv')

5 frames

/usr/local/lib/python3.7/dist-packages/deepmatcher/data/field.py in cache(self, name, cache, url, backup_url)
     94                 shutil.copyfileobj(infile, outfile)
     95         if not os.path.isfile(path):
---> 96             raise RuntimeError('no vectors found at {}'.format(path))
     97 
     98         self.model = fasttext.load_model(path)

RuntimeError: no vectors found at /root/.vector_cache/wiki.en.bin

I bypassed this with the solution proposed in

https://github.com/anhaidgroup/deepmatcher/issues/57

and now it:

Takes more space (crucial given Colab's limited resources)

Takes about 12 minutes just to download

But since this is not the same error as the one reported there, I wanted to make sure this is known.
opened by OneCodeToRuleThemAll 0
Can't find train.csv, validate.csv and test.csv

Hi, I am playing with "Quick Start", and got this error message:

FileNotFoundError: [Errno 2] No such file or directory: 'data_directory\train.csv'

I am not able to manually find train.csv, validate.csv and test.csv. Please help me where I can find these files. Thank you

opened by xuezhongcai 0
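The paths in the Quick Start are placeholders, so the CSV files have to be supplied by the user (another issue on this page refers to sample data under `examples/sample_data/itunes-amazon` in a checkout of the repository). A small pre-flight check, sketched below with a hypothetical helper `missing_files`, can make this kind of error easier to diagnose before calling `dm.data.process`.

```python
from pathlib import Path

def missing_files(directory, names):
    """Return the names that do not exist as files inside directory."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]

# 'data_directory' is the placeholder path used in the Quick Start.
missing = missing_files("data_directory", ["train.csv", "validation.csv", "test.csv"])
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All input files found.")
```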
Optimization of service running deepmatcher

My understanding is that deepmatcher is designed to classify whether a pair of records represents the same entity. In a typical project, say we have a product to add to a store and want to check whether it already exists there. At inference time, the product has to be scored against every item in the catalogue, which is computationally very expensive. What would you suggest to optimize this inference setup?

opened by vadimatwork 0
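A common way to avoid scoring every catalogue item is blocking: a cheap filter selects a small candidate set, and only those candidates are passed to the matcher (the End to End Entity Matching tutorial listed above covers blocking with Magellan). The sketch below is a generic token-overlap blocker built on an inverted index; it is not part of DeepMatcher, and all names in it are hypothetical.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase whitespace tokenization; deliberately simple."""
    return set(text.lower().split())

def build_index(catalogue):
    """Map each token to the set of catalogue ids whose text contains it."""
    index = defaultdict(set)
    for item_id, text in catalogue.items():
        for token in tokenize(text):
            index[token].add(item_id)
    return index

def candidates(query, index, min_overlap=2):
    """Return ids sharing at least min_overlap tokens with the query."""
    counts = defaultdict(int)
    for token in tokenize(query):
        for item_id in index.get(token, ()):
            counts[item_id] += 1
    return sorted(item_id for item_id, n in counts.items() if n >= min_overlap)

# Invented catalogue entries, for illustration only.
catalogue = {
    1: "Apple iPhone 8 64GB Space Gray",
    2: "Samsung Galaxy S8 64GB Black",
    3: "Apple MacBook Air 13 inch",
}
index = build_index(catalogue)
print(candidates("iphone 8 space gray 64gb", index))  # prints [1]
```

The index is built once for the catalogue, so each new product costs a few dictionary lookups plus matcher inference on the surviving candidates. The `min_overlap` threshold trades recall for speed and would need tuning on real data.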
Direct inference on pandas dataframe

Hi, I see that every time I want to run inference, I have to save a separate CSV and then load it by providing its path to dm.data.process_unlabeled.

Is there a way to pass a pandas DataFrame directly to this function and perform inference without creating a new CSV?

opened by rbhatia46 2
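Whether the library offers an in-memory entry point is not answered on this page. One workaround that keeps the CSV step but hides it behind a helper is to write the rows to a temporary file and pass that path to `dm.data.process_unlabeled`, as in the Quick Start. The sketch below uses only the standard library and a hypothetical helper `rows_to_temp_csv`; for a pandas DataFrame, `df.to_csv(path, index=False)` would play the same role. The DeepMatcher calls are left as comments because they need an installed package and a trained model.

```python
import csv
import os
import tempfile

def rows_to_temp_csv(rows):
    """Write a list of dict rows to a temporary CSV file and return its path."""
    handle = tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", newline="", encoding="utf-8", delete=False)
    with handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return handle.name

# Invented example row; column names assume the left_* / right_* convention.
rows = [{"id": 0, "left_title": "iPhone 8", "right_title": "Apple iPhone 8"}]
path = rows_to_temp_csv(rows)
# With DeepMatcher installed and a trained model, the Quick Start calls would be:
#   unlabeled = dm.data.process_unlabeled(path=path, trained_model=model)
#   predictions = model.run_prediction(unlabeled)
print(path)
os.remove(path)  # clean up once predictions have been obtained
```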

Negative sampling for solving the unlabeled entity problem in NER. ICLR-2021 paper: Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition.

Negative Sampling for NER Unlabeled entity problem is prevalent in many NER scenarios (e.g., weakly supervised NER). Our paper in ICLR-2021 proposes u

128 Dec 29, 2022

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU, a Chinese text classification and sequence annotation toolkit, supports multi-class and multi-label classification of long and short Chinese text, and supports sequence annotation tasks such as Chinese named entity recognition, part-of-speech tagging and word segmentation.

186 Dec 24, 2022

A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

Dedupe Python Library dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on

3.6k Jan 2, 2023

jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese.

jel: Japanese Entity Linker jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese. Usage Currently, link and question methods

10 Jan 6, 2023

Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and convert them into audio. Here I have used Google-text-to-speech library popularly known as gTTS library to convert text file to .mp3 file. Hope you like my project!

Text to speech (using Python) Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and co

19 Jun 30, 2022

Linear programming solver for paper-reviewer matching and mind-matching

Paper-Reviewer Matcher A python package for paper-reviewer matching algorithm based on topic modeling and linear programming. The algorithm is impleme

66 Jul 5, 2022

Facilitating the design, comparison and sharing of deep text matching models.

MatchZoo Facilitating the design, comparison and sharing of deep text matching models. MatchZoo is a general-purpose text matching toolkit that aims to make it easy to quickly implement, compare, and share the latest deep text matching models.

3.7k Jan 2, 2023

Text-Summarization-using-NLP - Text Summarization using NLP to fetch BBC News Article and summarize its text and also it includes custom article Summarization

Text-Summarization-using-NLP Text Summarization using NLP to fetch BBC News Arti

21 Aug 6, 2022

A text augmentation tool for named entity recognition.

neraug This python library helps you with augmenting text data for named entity recognition. Augmentation Example Reference from An Analysis of Simple

48 Oct 11, 2022

Entity Disambiguation as text extraction (ACL 2022)

ExtEnD: Extractive Entity Disambiguation This repository contains the code of ExtEnD: Extractive Entity Disambiguation, a novel approach to Entity Dis

121 Jan 3, 2023

A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

RE2 This is a pytorch implementation of the ACL 2019 paper "Simple and Effective Text Matching with Richer Alignment Features". The original Tensorflo

286 Jan 2, 2023

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

1.6k Dec 27, 2022

PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

109 Dec 2, 2022

Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

ItemSubjector Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition alg

9 Nov 17, 2022