Python package for performing Entity and Text Matching using Deep Learning.

Overview

DeepMatcher

DeepMatcher is a Python package for performing entity and text matching using deep learning. It provides built-in neural networks and utilities that enable you to train and apply state-of-the-art deep learning models for entity matching in less than 10 lines of code. The models are also easily customizable - the modular design allows any subcomponent to be altered or swapped out for a custom implementation.

As an example, given labeled tuple pairs such as the following:

https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/docs/source/_static/match_input_ex.png

DeepMatcher uses labeled tuple pairs and trains a neural network to perform matching, i.e., to predict match / non-match labels. The trained network can then be used to obtain labels for unlabeled tuple pairs.

Paper and Data

For details on the architecture of the models used, take a look at our paper Deep Learning for Entity Matching (SIGMOD '18). All public datasets used in the paper can be downloaded from the datasets page.

Quick Start: DeepMatcher in 30 seconds

There are four main steps in using DeepMatcher:

  1. Data processing: Load and process labeled training, validation and test CSV data.
import deepmatcher as dm
train, validation, test = dm.data.process(path='data_directory',
    train='train.csv', validation='validation.csv', test='test.csv')
  2. Model definition: Specify neural network architecture. Uses the built-in hybrid model (as discussed in section 4.4 of our paper) by default. Can be customized to your heart's desire.
model = dm.MatchingModel()
  3. Model training: Train neural network.
model.run_train(train, validation, best_save_path='best_model.pth')
  4. Application: Evaluate model on test set and apply to unlabeled data.
model.run_eval(test)

unlabeled = dm.data.process_unlabeled(path='data_directory/unlabeled.csv', trained_model=model)
model.run_prediction(unlabeled)
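
The quick start assumes that data_directory already contains train.csv, validation.csv and test.csv. The sketch below writes tiny placeholder files so that the expected layout is visible. The column names (id, label, and attributes prefixed with left_ and right_) are assumptions based on the id_attr, label_attr, left_prefix and right_prefix parameters that appear in the user reports further down; the rows are made up, the helper write_pairs is hypothetical, and files this small are not enough to train a useful model.

```python
import csv
import os

# Assumed column layout: an id, an integer label (1 = match, 0 = non-match),
# and the same attributes for both tuples under left_/right_ prefixes.
COLUMNS = ["id", "label", "left_name", "left_price", "right_name", "right_price"]

# Made-up example pairs, for illustration only.
ROWS = [
    [0, 1, "acme usb cable 2m", "5.99", "ACME USB Cable (2 m)", "5.99"],
    [1, 0, "acme usb cable 2m", "5.99", "acme hdmi cable 1m", "7.49"],
]


def write_pairs(directory, filename, rows):
    """Write labeled tuple pairs to directory/filename and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)
        writer.writerows(rows)
    return path


if __name__ == "__main__":
    # Hypothetical usage: create the three files the quick start refers to.
    for name in ("train.csv", "validation.csv", "test.csv"):
        print(write_pairs("data_directory", name, ROWS))
```

Keeping the left and right attribute names identical apart from the prefix makes it easy to see which columns are meant to be compared with each other.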

Installation

We currently support only Python versions 3.5 and 3.6. Installing using pip is recommended:

pip install deepmatcher

Note that during installation you may see an error message that says "Failed building wheel for fasttextmirror". You can safely ignore this - it does NOT mean that there are any problems with installation.
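
Since several reported import problems come down to the environment, it can help to check the interpreter before installing. The following is a minimal sketch, not part of DeepMatcher: the helper names python_is_supported and has_module are hypothetical, the supported versions are copied from the note above, and probing for torchtext.legacy is an assumption based on the import that shows up in user-reported tracebacks.

```python
import importlib.util
import sys

# Versions taken from the installation note above; adjust if the project updates it.
SUPPORTED_PYTHONS = {(3, 5), (3, 6)}


def python_is_supported(version_info=sys.version_info):
    """Return True if the major.minor version is one the README lists."""
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHONS


def has_module(name):
    """Return True if the module can be located (parent packages get imported)."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. torchtext) is missing or broken.
        return False


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    # Assumption: deepmatcher imports torchtext.legacy, as seen in reported tracebacks.
    print("torchtext.legacy available:", has_module("torchtext.legacy"))
```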

Tutorials

Using DeepMatcher:

  1. Getting Started: A more in-depth guide to help you get familiar with the basics of using DeepMatcher.
  2. Data Processing: Advanced guide on what data processing involves and how to customize it.
  3. Matching Models: Advanced guide on neural network architecture for entity matching and how to customize it.

Entity Matching Workflow:

End to End Entity Matching: A guide to develop a complete entity matching workflow. The tutorial discusses how to use DeepMatcher with Magellan to perform blocking, sampling, labeling and matching to obtain matching tuple pairs from two tables.

DeepMatcher for other matching tasks:

Question Answering with DeepMatcher: A tutorial on how to use DeepMatcher for question answering. Specifically, we will look at WikiQA, a benchmark dataset for the task of Answer Selection.

API Reference

API docs are here.

Support

Take a look at the FAQ for common issues. If you run into any issues or have questions not answered in the FAQ, please file GitHub issues and we will address them asap.

The Team

DeepMatcher was developed by University of Wisconsin-Madison grad students Sidharth Mudgal and Han Li, under the supervision of Prof. AnHai Doan and Prof. Theodoros Rekatsinas.

Comments
  • ModuleNotFoundError: No module named 'torchtext.legacy'

    Hi,

    When trying to import deepmatcher, I am facing the error: ModuleNotFoundError: No module named 'torchtext.legacy'

    Steps to recreate:

    !pip install deepmatcher --user
    import deepmatcher as dm
    

    Stacktrace:

    ---------------------------------------------------------------------------
    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input-4-e017bad088ef> in <module>
    ----> 1 import deepmatcher as dm
    
    ~/.local/lib/python3.8/site-packages/deepmatcher/__init__.py in <module>
          8 import sys
          9 
    ---> 10 from .data import process as data_process
         11 from .models import modules
         12 from .models.core import (MatchingModel, AttrSummarizer, WordContextualizer,
    
    ~/.local/lib/python3.8/site-packages/deepmatcher/data/__init__.py in <module>
    ----> 1 from .field import MatchingField, reset_vector_cache
          2 from .dataset import MatchingDataset
          3 from .iterator import MatchingIterator
          4 from .process import process, process_unlabeled
          5 from .dataset import split
    
    ~/.local/lib/python3.8/site-packages/deepmatcher/data/field.py in <module>
         11 import fasttext
         12 import torch
    ---> 13 from torchtext.legacy import data, vocab
         14 from torchtext.utils import download_from_url
         15 from urllib.request import urlretrieve
    
    ModuleNotFoundError: No module named 'torchtext.legacy'
    

    Please let me know how to fix this, or if you need more information.

    opened by jacobceles 7
  • Getting error when training model in Google Colab

    I use Google Colab. When I tried to process data with fastText embeddings for French, I configured it like this:

    train_set,validation_set = dm.data.process(
        path='drive/My Drive/recommandersystem/deepmatcher_model',
        cache='train_cache.pth',
        train='train.csv',
        validation='valid.csv',
        embeddings='fasttext.fr.bin', 
        embeddings_cache_path='drive/My Drive/recommandersystem/deepmatcher_model',
        ignore_columns=['id',''],
        id_attr='_id', 
        label_attr='label',
        left_prefix='ltable_', 
        right_prefix='rtable_')
    

    and I get this error message:

    HTTPError Traceback (most recent call last) in ()
         11     label_attr='label',
         12     left_prefix='ltable_',
    ---> 13     right_prefix='rtable_')

    13 frames

    /usr/lib/python3.6/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
        648 class HTTPDefaultErrorHandler(BaseHandler):
        649     def http_error_default(self, req, fp, code, msg, hdrs):
    --> 650         raise HTTPError(req.full_url, code, msg, hdrs, fp)
        651
        652 class HTTPRedirectHandler(BaseHandler):

    HTTPError: HTTP Error 403: Forbidden

    Please, how can I solve this?

    opened by walide67 6
  • Cannot achieve the precision/recall/F1 reported in SIGMOD18

    Hi, I have just tested deepmatcher on the walmart-amazon dataset. However, I cannot achieve the precision/recall/F1 reported in SIGMOD18.

    import deepmatcher as dm
    import logging
    import torch
    
    logging.getLogger('deepmatcher.core').setLevel(logging.INFO)
    
    model = dm.MatchingModel(attr_summarizer=dm.attr_summarizers.RNN(word_aggregator='birnn-last-pool'))
    # model = dm.MatchingModel()
    print(model)
    model.initialize(train_dataset)  # Explicitly initialize model.
    
    model = model.cuda()
    lr_decay = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
    batch_size = [16, 32]
    pos_neg_ratio = [10, 7, 5, 4, 3, 2, 1]
    
    best_f1 = -1
    best_params = {'lr_decay': -1, 'batch_size': -1, 'pos_neg_ratio':-1}
    for alpha in lr_decay:
        for b in batch_size:
            for rho in pos_neg_ratio:
                optimizer = dm.optim.Optimizer(lr_decay=alpha)
                model.run_train(
                    train_dataset,
                    validation_dataset,
                    epochs=15,
                    batch_size = b,
                    pos_neg_ratio=rho,
                    optimizer=optimizer,
                    best_save_path='../output/walmart-amazon/rnn_model.pth')
                print(f'lr_decay: {alpha}, batch_size:{b}, pos_neg_ratio:{rho}')
                f1 = model.run_eval(test_dataset)
                if f1 > best_f1:
                    best_f1 = f1
                    best_params['lr_decay'] = alpha
                    best_params['batch_size'] = b
                    best_params['pos_neg_ratio'] = rho
                    print(f'best f1-score: {best_f1}, {best_params}')
    print(f'best f1-score: {best_f1}, {best_params}')
    

    I haven't yet tested all the hyperparameters, but the available results show that f1-score is only about 37%. However, the f1 reported in SIGMOD18 is 67.6% for RNN on the structured walmart-amazon dataset. I use the data downloaded from the link in this repository.

    Could you tell me what the problem is? What are the hyper-parameters selected for the experiments in SIGMOD 18 (RNN)?

    opened by EliasMei 3
  • can't save model state

    Hello, I need your help; the presentation date for my graduation project is approaching. I get an error when I try to save the model. My code is:

    import deepmatcher as dm
    import pandas as pd
    import numpy as np
    import os
    train_set,validation_set = dm.data.process(
        path="deepmatcher_model/",
        cache='train_cache.pth',
        train='train.csv',
        validation='valid.csv',
        embeddings='fasttext.fr.bin',
        embeddings_cache_path="deepmatcher_model/",
        ignore_columns=['id','','ltable_index','rtable_index'],
        id_attr='_id', 
        label_attr='label',
        left_prefix='ltable_', 
        right_prefix='rtable_')
    model=dm.MatchingModel(attr_summarizer='hybrid')
    model.initialize(train_set)
    model.run_train(
        train_dataset=train_set,
        validation_dataset=validation_set,
        epochs=20,
        batch_size=16,
        best_save_path='deepmatcher_model/hybrid_model.pth',
        pos_neg_ratio=3)
    model.save_state('hybrid_model.pth',include_meta=True)
    

    The error message:

    AttributeError Traceback (most recent call last) in ()
    ----> 1 model.save_state('hybrid_model.pth',include_meta=True)

    4 frames
    /usr/local/lib/python3.6/dist-packages/torch/serialization.py in _save(obj, f, pickle_module, pickle_protocol)
        196     pickler = pickle_module.Pickler(f, protocol=pickle_protocol)
        197     pickler.persistent_id = persistent_id
    --> 198     pickler.dump(obj)
        199
        200     serialized_storage_keys = sorted(serialized_storages.keys())

    AttributeError: Can't pickle local object 'MatchingDataset.finalize_metadata..'

    Please tell me how I can solve this. I use Google Colab. Thanks, and sorry for my English. @thodrek @sidharthms @hanli91 @anhaidgroup

    opened by walide67 3
  • Can't download fasttext slovenian binaries

    Downloading fastText binaries does not work unless you select the English language. Looking at the code, I presume this is because the English binaries are hosted in a Google Drive folder, while the others are downloaded automatically from an AWS server that seems to be unreachable.

    opened by belerico 3
  • error while reading csv file- invalid literal for int() with base 10: 'match'

    I get an error when using dm.data.process. I set up the data and code as per the default requirements, so I am not sure what is wrong here.

    The CSV file schema is as suggested: id, right_text, left_text, label. The label column contains match and non-match string entries.

    train, validation, test = dm.data.process(
        path=path,
        train='train.csv',
        validation='val.csv',
        test='test.csv',
        ignore_columns=('left_id', 'right_id'),
        left_prefix='left_',
        right_prefix='right_',
        label_attr='label',
        id_attr='id')
    

    Error:

    ValueError                                Traceback (most recent call last)
    <ipython-input-33-9684d5da2beb> in <module>
          8     right_prefix='right_',
          9     label_attr='label',
    ---> 10     id_attr='id')
    
    ~/miniconda3/envs/dm/lib/python3.7/site-packages/deepmatcher/data/process.py in process(path, train, validation, test, unlabeled, cache, check_cached_data, auto_rebuild_cache, tokenize, lowercase, embeddings, embeddings_cache_path, ignore_columns, include_lengths, id_attr, label_attr, left_prefix, right_prefix, use_magellan_convention, pca)
        218         check_cached_data,
        219         auto_rebuild_cache,
    --> 220         train_pca=pca)
        221 
        222     # Save additional information to train dataset.
    
    ~/miniconda3/envs/dm/lib/python3.7/site-packages/deepmatcher/data/dataset.py in splits(cls, path, train, validation, test, fields, embeddings, embeddings_cache, column_naming, cache, check_cached_data, auto_rebuild_cache, train_pca, **kwargs)
        559             dataset_args = {'fields': fields, 'column_naming': column_naming, **kwargs}
        560             train_data = None if train is None else cls(
    --> 561                 path=os.path.join(path, train), **dataset_args)
        562             val_data = None if validation is None else cls(
        563                 path=os.path.join(path, validation), **dataset_args)
    
    ~/miniconda3/envs/dm/lib/python3.7/site-packages/deepmatcher/data/dataset.py in __init__(self, fields, column_naming, path, format, examples, metadata, **kwargs)
        163                 examples = [make_example(line, fields) for line in
        164                     pyprind.prog_bar(reader, iterations=lines,
    --> 165                         title='\nReading and processing data from "' + path + '"')]
        166 
        167             super(MatchingDataset, self).__init__(examples, fields, **kwargs)
    
    ~/miniconda3/envs/dm/lib/python3.7/site-packages/deepmatcher/data/dataset.py in <listcomp>(.0)
        161 
        162                 next(reader)
    --> 163                 examples = [make_example(line, fields) for line in
        164                     pyprind.prog_bar(reader, iterations=lines,
        165                         title='\nReading and processing data from "' + path + '"')]
    
    ~/miniconda3/envs/dm/lib/python3.7/site-packages/torchtext/legacy/data/example.py in fromCSV(cls, data, fields, field_to_index)
         64     def fromCSV(cls, data, fields, field_to_index=None):
         65         if field_to_index is None:
    ---> 66             return cls.fromlist(data, fields)
         67         else:
         68             assert(isinstance(fields, dict))
    
    ~/miniconda3/envs/dm/lib/python3.7/site-packages/torchtext/legacy/data/example.py in fromlist(cls, data, fields)
         82                         setattr(ex, n, f.preprocess(val))
         83                 else:
    ---> 84                     setattr(ex, name, field.preprocess(val))
         85         return ex
         86 
    
    ~/miniconda3/envs/dm/lib/python3.7/site-packages/torchtext/legacy/data/field.py in preprocess(self, x)
        213             x = [w for w in x if w not in self.stop_words]
        214         if self.preprocessing is not None:
    --> 215             return self.preprocessing(x)
        216         else:
        217             return x
    
    ~/miniconda3/envs/dm/lib/python3.7/site-packages/deepmatcher/data/process.py in <lambda>(x)
         62         include_lengths=include_lengths)
         63     numeric_field = MatchingField(
    ---> 64         sequential=False, preprocessing=lambda x: int(x), use_vocab=False)
         65     id_field = MatchingField(sequential=False, use_vocab=False, id=True)
         66 
    
    ValueError: invalid literal for int() with base 10: 'match'
    
    opened by sidhantls 2
  • How is F1 calculated (micro, macro, weighted)?

    Hi

    Could you please confirm that when the following method is run:

    model.run_eval(test)

    Is the PRF1 score calculated using micro-averaging, or macro, or weighted-macro?

    Thanks

    opened by ziqizhang 2
  • SIGMOD experiments reproducibility

    Hi, I am trying to reproduce the experiments from the SIGMOD 2018 paper: http://pages.cs.wisc.edu/~anhai/papers1/deepmatcher-sigmod18.pdf. I am having a hard time finding the right setup and I get results far poorer than the ones reported in the paper for most of the datasets. Can you please give me a hint regarding the right setup? For example, what are the parameters for the hybrid setup? Using the defaults leads to poor results and following the existing guides in the repository did not help much.

    As an example, for the (complete) iTunes-Amazon scenario the best I could obtain was F1: 35.09 | Prec: 33.33 | Rec: 37.04. But the paper reports better results.

    Thank you!

    opened by alex-bogatu 2
  • Why use masked_fill_?

    Hi, I have just read your code. I found that many scripts use 'masked_fill_' for tensor computation, e.g., word_aggregators.py:139-140, word_comparators.py:173-175, word_contexualizers.py:157-159.

    Could you please tell me why you use masked_fill_ for tensors?

    opened by fukien 2
  • OverflowError: Python int too large to convert to C long

    import deepmatcher as dm
    import torch

    torch.cuda.is_available()
    import pandas as pd
    pd.read_csv(r'E:\project\deepmatcher\examples\sample_data\itunes-amazon\train.csv').head()

    train, validation, test = dm.data.process(
        path=r'E:\project\deepmatcher\examples\sample_data\itunes-amazon',
        train='train.csv', validation='validation.csv', test='test.csv')
    model = dm.MatchingModel()

    I am on a Windows system. How can I deal with this problem?

    opened by TianyiDuan 2
  • AttributeError: module 'torchtext.data' has no attribute 'Field'

    I am trying to import deepmatcher using:

    import deepmatcher as dm

    and I am getting the following error message:

    AttributeError: module 'torchtext.data' has no attribute 'Field'

    opened by nilbahaaeddine 1
  • dm.data.preprocess no vectors found at /root/.vector_cache/wiki.en.bin

    In Colab:

    After installing torchtext legacy 0.11 to bypass https://github.com/anhaidgroup/deepmatcher/issues/96 ,

    When running dm.data.preprocess I get:

    Reading and processing data from
    0% [############################# ] 100% | ETA: 00:00:01
    Reading and processing data from
    0% [############################# ] 100% | ETA: 00:00:00
    INFO:deepmatcher.data.field:Downloading vectors from https://drive.google.com/uc?export=download&id=1Vih8gAmgBnuYDxfblbT94P6WjB7s1ZSh to /root/.vector_cache/wiki.en.bin
    /usr/local/lib/python3.7/dist-packages/deepmatcher/data/field.py:79: ResourceWarning: unclosed <ssl.SSLSocket fd=61, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.28.0.2', 47830), raddr=('172.253.123.113', 443)>
      self.destination = self.backup_destination
    ResourceWarning: Enable tracemalloc to get the object allocation traceback
    INFO:deepmatcher.data.field:Extracting vectors into /root/.vector_cache

    RuntimeError Traceback (most recent call last)
    in
          2     path='',
          3     train='train.csv',
    ----> 4     validation='validation.csv')

    5 frames

    /usr/local/lib/python3.7/dist-packages/deepmatcher/data/field.py in cache(self, name, cache, url, backup_url)
         94                 shutil.copyfileobj(infile, outfile)
         95         if not os.path.isfile(path):
    ---> 96             raise RuntimeError('no vectors found at {}'.format(path))
         97
         98         self.model = fasttext.load_model(path)

    RuntimeError: no vectors found at /root/.vector_cache/wiki.en.bin

    I bypassed this with the solution proposed in

    https://github.com/anhaidgroup/deepmatcher/issues/57

    and now it:

    • Takes more space (crucial given Colab's limited capacity)
    • Takes about 12 minutes just to download

    Since this is not the same error as the one reported there, I wanted to make sure it is known.

    opened by OneCodeToRuleThemAll 0
  • Can't find train.csv, validate.csv and test.csv

    Hi, I am playing with "Quick Start", and got this error message:

    FileNotFoundError: [Errno 2] No such file or directory: 'data_directory\train.csv'

    I am not able to manually find train.csv, validate.csv and test.csv. Please help me where I can find these files. Thank you

    opened by xuezhongcai 0
  • Optimization of service running deepmatcher

    My understanding is deepmatcher is designed for classification of pairs if they represent the same entity. In typical projects, say we have a product to add to the store -- we want to check if the product already exists in the store. During this inference, the product needs to be scored with respect to each item in the catalogue, which is computationally very expensive. What are the suggestions for optimization of this inference setup?

    opened by vadimatwork 0
  • Direct inference on pandas dataframe

    Hi, I see that to run inference on new data each time, I have to save a separate CSV and then load it by providing its path to dm.data.process_unlabeled.

    Is there a way to pass a pandas DataFrame directly to this function and perform inference without creating a new CSV?

    opened by rbhatia46 2
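
One of the reports above fails with "invalid literal for int() with base 10: 'match'", and its traceback shows the label column being parsed with int(x). One possible workaround, sketched below, is to rewrite string labels as integers before calling dm.data.process. The mapping (match to 1, non-match to 0) and the helper name normalize_labels are assumptions made for illustration; they are not part of the DeepMatcher API.

```python
import csv
import io

# Assumed mapping from string labels to integers that int() can parse.
LABEL_MAP = {"match": 1, "non-match": 0}


def normalize_labels(src, dst, label_attr="label"):
    """Copy CSV rows from src to dst, replacing mapped string labels with 0/1."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        raw = row[label_attr].strip().lower()
        # Values outside the mapping (e.g. already 0/1) pass through after trimming.
        row[label_attr] = LABEL_MAP.get(raw, raw)
        writer.writerow(row)


if __name__ == "__main__":
    # Made-up rows using the schema described in the report.
    source = io.StringIO(
        "id,left_text,right_text,label\n0,foo,foo,match\n1,foo,bar,non-match\n"
    )
    target = io.StringIO()
    normalize_labels(source, target)
    print(target.getvalue())
```

The function takes file-like objects, so the same code works on open files as well as on in-memory buffers.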