A spaCy wrapper of OpenTapioca for named entity linking on Wikidata

Overview

spaCyOpenTapioca

A spaCy wrapper of OpenTapioca for named entity linking on Wikidata.

Table of contents

Installation

pip install spacyopentapioca

or

git clone https://github.com/UB-Mannheim/spacyopentapioca
cd spacyopentapioca/
pip install .

How to use

After installation the OpenTapioca pipeline can be used without any other pipelines:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe('opentapioca')
doc = nlp("Christian Drosten works in Germany.")
for span in doc.ents:
    print((span.text, span.kb_id_, span.label_, span._.description, span._.score))
('Christian Drosten', 'Q1079331', 'PERSON', 'German virologist and university teacher', 3.6533377082098895)
('Germany', 'Q183', 'LOC', 'sovereign state in Central Europe', 2.1099332471902863)

The types and aliases are also available:

for span in doc.ents:
    print((span._.types, span._.aliases[0:5]))
({'Q43229': False, 'Q618123': False, 'Q5': True, 'P2427': False, 'P1566': False, 'P496': True}, ['كريستيان دروستين', 'Крістіан Дростен', 'Christian Heinrich Maria Drosten', 'کریستین دروستن', '크리스티안 드로스텐'])
({'Q43229': True, 'Q618123': True, 'Q5': False, 'P2427': False, 'P1566': True, 'P496': False}, ['IJalimani', 'R. F. A.', 'Alemania', '도이칠란트', 'Germaniya'])

The Wikidata QIDs are attached to tokens:

for token in doc:
    print((token.text, token.ent_kb_id_))
('Christian', 'Q1079331')
('Drosten', 'Q1079331')
('works', '')
('in', '')
('Germany', 'Q183')
('.', '')

The raw response of the OpenTapioca API can be accessed in the doc- and span-objects:

raw_annotations1 = doc._.annotations
raw_annotations2 = [span._.annotations for span in doc.ents]

The partial metadata for the response returned by the OpenTapioca API is

doc._.metadata

All span-extensions are:

span._.annotations
span._.description
span._.aliases
span._.rank
span._.score
span._.types
span._.label
span._.extra_aliases
span._.nb_sitelinks
span._.nb_statements

Note that spaCyOpenTapioca does a tiny processing of entities appearing in doc.ents. All entities returned by OpenTapioca can be found in doc.spans['all_entities_opentapioca'].

Local OpenTapioca

If OpenTapioca is deployed locally, specify the URL of the new OpenTapioca API in the config:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe('opentapioca', config={"url": OpenTapiocaAPI})
doc = nlp("Christian Drosten works in Germany.")

Vizualization

NER vizualization in spaCy via displaCy cannot show yet the links to entities. This can be added into spaCy as proposed in issue 9129.

Comments
  • AttributeError: 'NoneType' object has no attribute 'text' when using nlp.pipe()

    AttributeError: 'NoneType' object has no attribute 'text' when using nlp.pipe()

    Hi, when I process multiple text documents as a batch, I have failure with the error message: AttributeError: 'NoneType' object has no attribute 'text'. However, processing each text document by itself produces no such error. Here is a easy to reproduce example:

    docs = ["""String of 126 characters. String of 126 characters. String of 126 characters. String of 126 characters. String of 126 characte""","""Any string which is 93 characters. Any string which is 93 characters. Any string which is 93 """]
    nlp = spacy.blank("en")
    nlp.add_pipe("opentapioca")
    for doc in nlp.pipe(docs):
        print(doc)
    

    Fulll stack trace below:

    AttributeError                            Traceback (most recent call last)
    <command-370658210397732> in <module>
          4 nlp = spacy.blank("en")
          5 nlp.add_pipe("opentapioca")
    ----> 6 for doc in nlp.pipe(docs):
          7     print(doc)
    
    /databricks/python/lib/python3.8/site-packages/spacy/language.py in pipe(self, texts, as_tuples, batch_size, disable, component_cfg, n_process)
       1570         else:
       1571             # if n_process == 1, no processes are forked.
    -> 1572             docs = (self._ensure_doc(text) for text in texts)
       1573             for pipe in pipes:
       1574                 docs = pipe(docs)
    
    /databricks/python/lib/python3.8/site-packages/spacy/util.py in _pipe(docs, proc, name, default_error_handler, kwargs)
       1597     if hasattr(proc, "pipe"):
       1598         yield from proc.pipe(docs, **kwargs)
    -> 1599     else:
       1600         # We added some args for pipe that __call__ doesn't expect.
       1601         kwargs = dict(kwargs)
    
    /databricks/python/lib/python3.8/site-packages/spacyopentapioca/entity_linker.py in pipe(self, stream, batch_size)
        117                     self.make_request, doc): doc for doc in docs}
        118                 for doc, future in zip(docs, concurrent.futures.as_completed(future_to_url)):
    --> 119                     yield self.process_single_doc_after_call(doc, future.result())
    
    /databricks/python/lib/python3.8/site-packages/spacyopentapioca/entity_linker.py in process_single_doc_after_call(self, doc, r)
         66                                      alignment_mode='expand')
         67                 log.warning('The OpenTapioca-entity "%s" %s does not fit the span "%s" %s in spaCy. EXPANDED!',
    ---> 68                             ent['tags'][0]['label'][0], (start, end), span.text, (span.start_char, span.end_char))
         69             span._.annotations = ent
         70             span._.description = ent['tags'][0]['desc']
    
    AttributeError: 'NoneType' object has no attribute 'text'
    

    I don't know what about the lengths of the strings causes an issue, but they do seem to matter in some way. Adding or removing a couple characters from either string can resolve the issue.

    opened by coltonpeltier-db 6
  • Add methods to highlights

    Add methods to highlights

    In the same way by clicking a NER highlighting leads to a web side it would perhaps be possible to extend this functionality and pass a method to be run when clicking the highlighted NER.

    opened by joseberlines 4
  • Add CodeQL workflow for GitHub code scanning

    Add CodeQL workflow for GitHub code scanning

    Hi UB-Mannheim/spacyopentapioca!

    This is a one-off automatically generated pull request from LGTM.com :robot:. You might have heard that we’ve integrated LGTM’s underlying CodeQL analysis engine natively into GitHub. The result is GitHub code scanning!

    With LGTM fully integrated into code scanning, we are focused on improving CodeQL within the native GitHub code scanning experience. In order to take advantage of current and future improvements to our analysis capabilities, we suggest you enable code scanning on your repository. Please take a look at our blog post for more information.

    This pull request enables code scanning by adding an auto-generated codeql.yml workflow file for GitHub Actions to your repository — take a look! We tested it before opening this pull request, so all should be working :heavy_check_mark:. In fact, you might already have seen some alerts appear on this pull request!

    Where needed and if possible, we’ve adjusted the configuration to the needs of your particular repository. But of course, you should feel free to tweak it further! Check this page for detailed documentation.

    Questions? Check out the FAQ below!

    FAQ

    Click here to expand the FAQ section

    How often will the code scanning analysis run?

    By default, code scanning will trigger a scan with the CodeQL engine on the following events:

    • On every pull request — to flag up potential security problems for you to investigate before merging a PR.
    • On every push to your default branch and other protected branches — this keeps the analysis results on your repository’s Security tab up to date.
    • Once a week at a fixed time — to make sure you benefit from the latest updated security analysis even when no code was committed or PRs were opened.

    What will this cost?

    Nothing! The CodeQL engine will run inside GitHub Actions, making use of your unlimited free compute minutes for public repositories.

    What types of problems does CodeQL find?

    The CodeQL engine that powers GitHub code scanning is the exact same engine that powers LGTM.com. The exact set of rules has been tweaked slightly, but you should see almost exactly the same types of alerts as you were used to on LGTM.com: we’ve enabled the security-and-quality query suite for you.

    How do I upgrade my CodeQL engine?

    No need! New versions of the CodeQL analysis are constantly deployed on GitHub.com; your repository will automatically benefit from the most recently released version.

    The analysis doesn’t seem to be working

    If you get an error in GitHub Actions that indicates that CodeQL wasn’t able to analyze your code, please follow the instructions here to debug the analysis.

    How do I disable LGTM.com?

    If you have LGTM’s automatic pull request analysis enabled, then you can follow these steps to disable the LGTM pull request analysis. You don’t actually need to remove your repository from LGTM.com; it will automatically be removed in the next few months as part of the deprecation of LGTM.com (more info here).

    Which source code hosting platforms does code scanning support?

    GitHub code scanning is deeply integrated within GitHub itself. If you’d like to scan source code that is hosted elsewhere, we suggest that you create a mirror of that code on GitHub.

    How do I know this PR is legitimate?

    This PR is filed by the official LGTM.com GitHub App, in line with the deprecation timeline that was announced on the official GitHub Blog. The proposed GitHub Action workflow uses the official open source GitHub CodeQL Action. If you have any other questions or concerns, please join the discussion here in the official GitHub community!

    I have another question / how do I get in touch?

    Please join the discussion here to ask further questions and send us suggestions!

    opened by lgtm-com[bot] 1
  • 'ent_kb_id' referenced before assignment

    'ent_kb_id' referenced before assignment

    Hello, while trying this example : nlp("M. Knajdek"), An error occurs in the entity_linker.py file UnboundLocalError: local variable 'ent_kb_id' referenced before assignment on line 67 in the file. This is due to the . separator.

    opened by TheNizzo 1
  • Added logging & Fixed Reference Error

    Added logging & Fixed Reference Error

    Added logger to allow user to suppress logs coming from spacyopentapioca.

    Fixed thelocal variable 'etype' referenced before assignment error at line 65.

    opened by jordanparker6 1
Releases(v.0.1.6)
Owner
Universitätsbibliothek Mannheim
Mannheim University Library
Universitätsbibliothek Mannheim
Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

ItemSubjector Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition alg

Dennis Priskorn 9 Nov 17, 2022
Negative sampling for solving the unlabeled entity problem in NER. ICLR-2021 paper: Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition.

Negative Sampling for NER Unlabeled entity problem is prevalent in many NER scenarios (e.g., weakly supervised NER). Our paper in ICLR-2021 proposes u

Yangming Li 128 Dec 29, 2022
spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines

spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines spaCy-wrap is minimal library intended for wrapping fine-tuned transformers from t

Kenneth Enevoldsen 32 Dec 29, 2022
Linking data between GBIF, Biodiverse, and Open Tree of Life

GBIF-biodiverse-OpenTree Linking data between GBIF, Biodiverse, and Open Tree of Life The python scripts will rely on opentree and Dendropy. To set up

null 2 Oct 3, 2022
Experiments in converting wikidata to ftm

FollowTheMoney / Wikidata mappings This repo will contain tools for converting Wikidata entities into FtM schema. Prefixes: https://www.mediawiki.org/

Friedrich Lindenberg 2 Nov 12, 2021
A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy.

Crosslingual Coreference Coreference is amazing but the data required for training a model is very scarce. In our case, the available training for non

Pandora Intelligence 71 Jan 4, 2023
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.6k Dec 27, 2022
Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

NERDA Not only is NERDA a mesmerizing muppet-like character. NERDA is also a python package, that offers a slick easy-to-use interface for fine-tuning

Ekstra Bladet 141 Dec 30, 2022
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

CrossNER is a fully-labeled collected of named entity recognition (NER) data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for different domains.

Zihan Liu 89 Nov 10, 2022
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

VinAI Research 109 Dec 2, 2022
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

Hiroki Nakayama 1.5k Dec 5, 2022
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.5k Feb 11, 2021
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

Hiroki Nakayama 1.4k Feb 17, 2021
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.5k Feb 17, 2021
Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

Kamal Raj 1.1k Dec 25, 2022
A text augmentation tool for named entity recognition.

neraug This python library helps you with augmenting text data for named entity recognition. Augmentation Example Reference from An Analysis of Simple

Hiroki Nakayama 48 Oct 11, 2022
Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

null 186 Dec 24, 2022
Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

null 0 Feb 13, 2022
Use Google's BERT for named entity recognition (CoNLL-2003 as the dataset).

For better performance, you can try NLPGNN, see NLPGNN for more details. BERT-NER Version 2 Use Google's BERT for named entity recognition (CoNLL-2003

Kaiyinzhou 1.2k Dec 26, 2022