A full spaCy pipeline and models for scientific/biomedical documents.

AI2

Last update: Jan 3, 2023

Related tags

Text Data & NLP nlp bioinformatics spacy biomedical custom-pipes scientific-documents

Overview

This repository contains custom pipes and models related to using spaCy for scientific documents.

In particular, there is a custom tokenizer that adds tokenization rules on top of spaCy's rule-based tokenizer, a POS tagger and syntactic parser trained on biomedical data and an entity span detection model. Separately, there are also NER models for more specific tasks.

Just looking to test out the models on your data? Check out our demo.

Installation

Installing scispacy requires two steps: installing the library and intalling the models. To install the library, run:

pip install scispacy

to install a model (see our full selection of available models below), run a command like the following:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz

Note: We strongly recommend that you use an isolated Python environment (such as virtualenv or conda) to install scispacy. Take a look below in the "Setting up a virtual environment" section if you need some help with this. Additionally, scispacy uses modern features of Python and as such is only available for Python 3.6 or greater.

Setting up a virtual environment

Conda can be used set up a virtual environment with the version of Python required for scispaCy. If you already have a Python 3.6 or 3.7 environment you want to use, you can skip to the 'installing via pip' section.

Follow the installation instructions for Conda.
Create a Conda environment called "scispacy" with Python 3.6:
```
conda create -n scispacy python=3.6
```
Activate the Conda environment. You will need to activate the Conda environment in each terminal in which you want to use scispaCy.
```
source activate scispacy
```

Now you can install scispacy and one of the models using the steps above.

Once you have completed the above steps and downloaded one of the models below, you can load a scispaCy model as you would any other spaCy model. For example:

import spacy
nlp = spacy.load("en_core_sci_sm")
doc = nlp("Alterations in the hypocretin receptor 2 and preprohypocretin genes produce narcolepsy in some animals.")

Note on upgrading

If you are upgrading scispacy, you will need to download the models again, to get the model versions compatible with the version of scispacy that you have. The link to the model that you download should contain the version number of scispacy that you have.

Available Models

To install a model, click on the link below to download the model, and then run

pip install </path/to/download>

Alternatively, you can install directly from the URL by right-clicking on the link, selecting "Copy Link Address" and running

pip install CMD-V(to paste the copied URL)

Model	Description	Install URL
en_core_sci_sm	A full spaCy pipeline for biomedical data with a ~100k vocabulary.	Download
en_core_sci_md	A full spaCy pipeline for biomedical data with a ~360k vocabulary and 50k word vectors.	Download
en_core_sci_lg	A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors.	Download
en_core_sci_scibert	A full spaCy pipeline for biomedical data with a ~785k vocabulary and `allenai/scibert-base` as the transformer model.	Download
en_ner_craft_md	A spaCy NER model trained on the CRAFT corpus.	Download
en_ner_jnlpba_md	A spaCy NER model trained on the JNLPBA corpus.	Download
en_ner_bc5cdr_md	A spaCy NER model trained on the BC5CDR corpus.	Download
en_ner_bionlp13cg_md	A spaCy NER model trained on the BIONLP13CG corpus.	Download

Additional Pipeline Components

AbbreviationDetector

The AbbreviationDetector is a Spacy component which implements the abbreviation detection algorithm in "A simple algorithm for identifying abbreviation definitions in biomedical text.", (Schwartz & Hearst, 2003).

You can access the list of abbreviations via the doc._.abbreviations attribute and for a given abbreviation, you can access it's long form (which is a spacy.tokens.Span) using span._.long_form, which will point to another span in the document.

Example Usage

import spacy

from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_sci_sm")

# Add the abbreviation pipe to the spacy pipeline.
nlp.add_pipe("abbreviation_detector")

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
	print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

>>> Abbreviation	 Span	    Definition
>>> SBMA 		 (33, 34)   Spinal and bulbar muscular atrophy
>>> SBMA 	   	 (6, 7)     Spinal and bulbar muscular atrophy
>>> AR   		 (29, 30)   androgen receptor

EntityLinker

The EntityLinker is a SpaCy component which performs linking to a knowledge base. The linker simply performs a string overlap - based search (char-3grams) on named entities, comparing them with the concepts in a knowledge base using an approximate nearest neighbours search.

Currently (v2.5.0), there are 5 supported linkers:

umls: Links to the Unified Medical Language System, levels 0,1,2 and 9. This has ~3M concepts.
mesh: Links to the Medical Subject Headings. This contains a smaller set of higher quality entities, which are used for indexing in Pubmed. MeSH contains ~30k entities. NOTE: The MeSH KB is derrived directly from MeSH itself, and as such uses different unique identifiers than the other KBs.
rxnorm: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It is comprised of several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.
go: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.
hpo: Links to the Human Phenotype Ontology. The Human Phenotype Ontology contains 16k concepts focused on phenotypic abnormalities encountered in human disease.

You may want to play around with some of the parameters below to adapt to your use case (higher precision, higher recall etc).

resolve_abbreviations : bool = True, optional (default = False) Whether to resolve abbreviations identified in the Doc before performing linking. This parameter has no effect if there is no AbbreviationDetector in the spacy pipeline.
k : int, optional, (default = 30) The number of nearest neighbours to look up from the candidate generator per mention.
threshold : float, optional, (default = 0.7) The threshold that a mention candidate must reach to be added to the mention in the Doc as a mention candidate.
no_definition_threshold : float, optional, (default = 0.95) The threshold that a entity candidate must reach to be added to the mention in the Doc as a mention candidate if the entity candidate does not have a definition.
filter_for_definitions: bool, default = True Whether to filter entities that can be returned to only include those with definitions in the knowledge base.
max_entities_per_mention : int, optional, default = 5 The maximum number of entities which will be returned for a given mention, regardless of how many are nearest neighbours are found.

This class sets the ._.kb_ents attribute on spacy Spans, which consists of a List[Tuple[str, float]] corresponding to the KB concept_id and the associated score for a list of max_entities_per_mention number of entities.

You can look up more information for a given id using the kb attribute of this class:

print(linker.kb.cui_to_entity[concept_id])

Example Usage

import spacy
import scispacy

from scispacy.linking import EntityLinker

nlp = spacy.load("en_core_sci_sm")

# This line takes a while, because we have to download ~1GB of data
# and load a large JSON file (the knowledge base). Be patient!
# Thankfully it should be faster after the first time you use it, because
# the downloads are cached.
# NOTE: The resolve_abbreviations parameter is optional, and requires that
# the AbbreviationDetector pipe has already been added to the pipeline. Adding
# the AbbreviationDetector pipe and setting resolve_abbreviations to True means
# that linking will only be performed on the long form of abbreviations.
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "name": "umls"})

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

# Let's look at a random entity!
entity = doc.ents[1]

print("Name: ", entity)
>>> Name: bulbar muscular atrophy

# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
linker = nlp.get_pipe("scispacy_linker")
for umls_ent in entity._.kb_ents:
	print(linker.kb.cui_to_entity[umls_ent[0]])


>>> CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
>>> Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the
  				gene encoding the ANDROGEN RECEPTOR.
>>> TUI(s): T047
>>> Aliases (abbreviated, total: 50):
         Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal Atrophy, X-Linked, ....

>>> CUI: C0541794, Name: Skeletal muscle atrophy
>>> Definition: A process, occurring in skeletal muscle, that is characterized by a decrease in protein content,
                fiber diameter, force production and fatigue resistance in response to ...
>>> TUI(s): T046
>>> Aliases: (total: 9):
         Skeletal muscle atrophy, ATROPHY SKELETAL MUSCLE, skeletal muscle atrophy, ....

>>> CUI: C1447749, Name: AR protein, human
>>> Definition: Androgen receptor (919 aa, ~99 kDa) is encoded by the human AR gene.
                This protein plays a role in the modulation of steroid-dependent gene transcription.
>>> TUI(s): T116, T192
>>> Aliases (abbreviated, total: 16):
         AR protein, human, Androgen Receptor, Dihydrotestosterone Receptor, AR, DHTR, NR3C4, ...

Hearst Patterns (v0.3.0 and up)

This component implements Automatic Aquisition of Hyponyms from Large Text Corpora using the SpaCy Matcher component.

Passing extended=True to the HyponymDetector will use the extended set of hearst patterns, which include higher recall but lower precision hyponymy relations (e.g X compared to Y, X similar to Y, etc).

This component produces a doc level attribute on the spacy doc: doc._.hearst_patterns, which is a list containing tuples of extracted hyponym pairs. The tuples contain:

The relation rule used to extract the hyponym (type: str)
The more general concept (type: spacy.Span)
The more specific concept (type: spacy.Span)

Usage:

import spacy
from scispacy.hyponym_detector import HyponymDetector

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("hyponym_detector", last=True, config={"extended": False})

doc = nlp("Keystone plant species such as fig trees are good for the soil.")

print(doc._.hearst_patterns)
>>> [('such_as', Keystone plant species, fig trees)]

Citing

If you use ScispaCy in your research, please cite ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. Additionally, please indicate which version and model of ScispaCy you used so that your research can be reproduced.

@inproceedings{neumann-etal-2019-scispacy,
    title = "{S}cispa{C}y: {F}ast and {R}obust {M}odels for {B}iomedical {N}atural {L}anguage {P}rocessing",
    author = "Neumann, Mark  and
      King, Daniel  and
      Beltagy, Iz  and
      Ammar, Waleed",
    booktitle = "Proceedings of the 18th BioNLP Workshop and Shared Task",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-5034",
    doi = "10.18653/v1/W19-5034",
    pages = "319--327",
    eprint = {arXiv:1902.07669},
    abstract = "Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/.",
}

ScispaCy is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

Comments

pip install fails

I've created the conda env, and ran pip install scispacy see the result:

(scispacy) lucas-mbp:jats lfoppiano$ pip install scispacy
Collecting scispacy
  Using cached https://files.pythonhosted.org/packages/72/55/30b30a78abafaaf34d0d8368a090cf713964d6c97c5e912fb2016efadab0/scispacy-0.2.2-py3-none-any.whl
Collecting numpy (from scispacy)
  Downloading https://files.pythonhosted.org/packages/0f/c9/3526a357b6c35e5529158fbcfac1bb3adc8827e8809a6d254019d326d1cc/numpy-1.16.4-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (13.9MB)
     |████████████████████████████████| 13.9MB 3.5MB/s 
Collecting joblib (from scispacy)
  Using cached https://files.pythonhosted.org/packages/cd/c1/50a758e8247561e58cb87305b1e90b171b8c767b15b12a1734001f41d356/joblib-0.13.2-py2.py3-none-any.whl
Collecting spacy>=2.1.3 (from scispacy)
  Downloading https://files.pythonhosted.org/packages/cb/ef/cccdeb1ababb2cb04ae464098183bcd300b8f7e4979ce309669de8a56b9d/spacy-2.1.6-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (34.6MB)
     |████████████████████████████████| 34.6MB 33.6MB/s 
Collecting conllu (from scispacy)
  Downloading https://files.pythonhosted.org/packages/ae/54/b0ae1199f3d01666821b028cd967f7c0ac527ab162af433d3da69242cea2/conllu-1.3.1-py2.py3-none-any.whl
Collecting awscli (from scispacy)
  Using cached https://files.pythonhosted.org/packages/e6/48/8c5ac563a88239d128aa3fb67415211c19bd653fab01c7f11cecf015c343/awscli-1.16.203-py2.py3-none-any.whl
Collecting nmslib>=1.7.3.6 (from scispacy)
  Using cached https://files.pythonhosted.org/packages/b2/4d/4d110e53ff932d7a1ed9c2f23fe8794367087c29026bf9d4b4d1e27eda09/nmslib-1.8.1.tar.gz
    ERROR: Complete output from command python setup.py egg_info:
    ERROR: Download error on https://pypi.org/simple/numpy/: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
    Couldn't find index page for 'numpy' (maybe misspelled?)
    Download error on https://pypi.org/simple/: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
    No local packages or working download links found for numpy
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/mk/scd8428n18jfgh3jdthbvpz00000gn/T/pip-install-l00jm4xn/nmslib/setup.py", line 172, in <module>
        zip_safe=False,
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/__init__.py", line 144, in setup
        _install_setup_requires(attrs)
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/__init__.py", line 139, in _install_setup_requires
        dist.fetch_build_eggs(dist.setup_requires)
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/dist.py", line 717, in fetch_build_eggs
        replace_conflicting=True,
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/pkg_resources/__init__.py", line 782, in resolve
        replace_conflicting=replace_conflicting
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1065, in best_match
        return self.obtain(req, installer)
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1077, in obtain
        return installer(requirement)
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/dist.py", line 784, in fetch_build_egg
        return cmd.easy_install(req)
      File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/command/easy_install.py", line 673, in easy_install
        raise DistutilsError(msg)
    distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('numpy')
    ----------------------------------------
ERROR: Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/mk/scd8428n18jfgh3jdthbvpz00000gn/T/pip-install-l00jm4xn/nmslib/
(scispacy) lucas-mbp:jats lfoppiano$

To solve the issue I had to install numpy and nmslib:

conda install numpy
conda install -c akode nmslib

It seems to work, but maybe is not the proper way to solve it - the pip script should be updated perhaps?

opened by lfoppiano 38

Combine 'ner' model with 'core_sci' model
Hi,

I am working on a project using neuralcoref and I would like to incorporate the scispacy ner models. My hope was to use one of the ner models in combination with the core_sci tagger and dependency parser.

NeuralCoref depends on the tagger, parser, and ner.

So far I have tried this code:

cust_ner = spacy.load('en_ner_craft_md') nlp = spacy.load('en_core_sci_md') nlp.remove_pipe('ner') nlp.add_pipe(cust_ner, name="ner", last=True)

but when I pass text to the nlp object , I get the following error: TypeError: Argument 'string' has incorrect type (expected str, got spacy.tokens.doc.Doc)

When I look at the nlp.pipeline attribute after adding the cust_ner to the pipe I see the cust_ner added as a Language object rather than a EntityRecognizer object:

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fb84976eda0>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fb849516288>), ('ner', <spacy.lang.en.English object at 0x7fb853725668>)]

Before I start hacking away and writing terrible code, I thought I would reach out to see if you had any suggestions in how to accomplish what I am after?

Thanks in advance and for all that you folks do!
opened by masonedmison 26
No module named 'scispacy.custom_sentence_segmenter'; 'scispacy' is not a package

I am getting following error: Traceback (most recent call last): File "scispacy.py", line 2, in import scispacy File "/Users/shai26/office/spacy/scispacy/scispacy.py", line 5, in nlp = spacy.load("en_core_sci_sm") File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/spacy/init.py", line 21, in load return util.load_model(name, **overrides) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/spacy/util.py", line 114, in load_model return load_model_from_package(name, **overrides) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/spacy/util.py", line 134, in load_model_from_package cls = importlib.import_module(name) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/init.py", line 126, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/en_core_sci_sm/init.py", line 7, in from scispacy.custom_sentence_segmenter import combined_rule_sentence_segmenter ModuleNotFoundError: No module named 'scispacy.custom_sentence_segmenter'; 'scispacy' is not a package

opened by sakibshaik 19
[E167] Unknown morphological feature: 'ConjType'

When I run nlp(doc) I got error: [E167] Unknown morphological feature: 'ConjType' (9141427322507498425). This can happen if the tagger was trained with a different set of morphological features. If you're using a pretrained model, make sure that your models are up to date: python -m spacy validate some of the docs work while some don't.

opened by fireholder 15
kb_ents gives no results from custom KB
Following this discussion #383, where I got my custom KB to work.

I tried to test the code and for some reason it is not giving me anything. Here is the code I tested it with:

linker = CandidateGenerator(name="myCustom") text = "TR Max Velocity: 2.3 m/s" doc = nlp(text) spacy.displacy.render(doc, style = "ent", jupyter = True) entity = doc.ents[2] print("Name: ", entity) for umls_ent in entity._.kb_ents: print(umls_ent) print(linker.kb.cui_to_entity[umls_ent[0]]) print("----------------------")

This would give:

Name: m/s

there was no ---------------------- which means it did not even enter the for loop.

I was wondering why this is the case.

If this helps, this is the jsonl file that I ran this script (https://github.com/allenai/scispacy/blob/master/scripts/create_linker.py) with:

... {"concept_id": "U0013", "aliases": ["m/s"], "types": ["UN1T5"], "canonical_name": "m/s"} ...
opened by farrandi 14
aws s3 downloading

I am currently trying to train using my own corpus following the project.yml file. I try to download several files: aws s3 cp s3://ai2-s2-scispacy/data/ud_ontonotes.tar.gz assets/ud_ontonotes.tar.gz tar -xzvf assets/ud_ontonotes.tar.gz -C assets/ rm assets/ud_ontonotes.tar.gz ############################################################# aws s3 cp s3://ai2-s2-scispacy/data/med_mentions.tar.gz assets/med_mentions.tar.gz tar -xzvf assets/med_mentions.tar.gz -C assets/ rm assets/med_mentions.tar.gz ############################################################# aws s3 cp s3://ai2-s2-scispacy/data/ner/ assets --recursive --exclude '' --include '.tsv'

But it fails due to ''' fatal error: Unable to locate credentials ''' I am wondering if anyone know how to solve this problem. Thanks!!!

opened by CharlesQ9 13

Warning about incompatible spaCy models.

I get the following error when trying to load en_core_sci_sm:

UserWarning: [W031] Model 'en_core_sci_sm' (0.2.4) requires spaCy v2.2 and is incompatible with the current spaCy version (2.3.0). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)

Steps to reproduce: Create clean Conda environment and activate

conda create --name scispacy python=3.8
conda activate scispacy

Install scispacy and install the latest en_core_sci_sm model.

pip install scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz

Attempt import

(scispacy) $ python -c "import spacy; nlp=spacy.load('en_core_sci_sm')"
/home/davidw/miniconda3/envs/scispacy/lib/python3.8/site-packages/spacy/util.py:271: UserWarning: [W031] Model 'en_core_sci_sm' (0.2.4) requires spaCy v2.2 and is incompatible with the current spaCy version (2.3.0). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)

Is this warning important or can I ignore it?

Thanks,

Dave

opened by dwadden 11

DeprecationWarning from `spacy_legacy`

Hi there, I recently upgraded to spacy 3 and scispacy 0.4, but I am now getting a warning whenever I use the small scispacy model (I have not tried any other model).

I am getting a DeprecationWarning on a fresh install in python 3.8 with the latest version of scispacy and en_core_sci_sm.

Steps to reproduce:

pip install scispacy pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz

import spacy
nlp = spacy.load("en_core_sci_sm")

import warnings
warnings.filterwarnings("error")
nlp("Hello World")

Any input to the nlp model triggers the same warning:

/opt/miniconda3/envs/clean/lib/python3.8/site-packages/spacy_legacy/layers/staticvectors_v1.py in forward(model, docs, is_train)
     43     )
     44     try:
---> 45         vectors_data = model.ops.gemm(model.ops.as_contig(V[rows]), W, trans2=True)
     46     except ValueError:
     47         raise RuntimeError(Errors.E896)

DeprecationWarning: Out of bound index found. 
This was previously ignored when the indexing result contained no elements. 
In the future the index error will be raised. 
This error occurs either due to an empty slice, or if an array has zero elements even before indexing.
(Use `warnings.simplefilter('error')` to turn this DeprecationWarning into an error and get more details on the invalid index.)

Any ideas as to how to resolve this without manually ignoring the warning?

bug

opened by gautierdag 10

Span is not serializable in abbreviations - figure out a better workaround

import spacy

from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_sci_sm")

# Add the abbreviation pipe to the spacy pipeline.
nlp.add_pipe("abbreviation_detector")

test = ["Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily."]

print("Abbreviation", "\t", "Definition")
for doc in nlp.pipe(test, n_process=4):
    for abrv in doc._.abbreviations:
        print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

Running that code leads to this. The error message doesn't make a lot of sense, It could be because there are more processes than entries. If you remove n_process the solves the problem.

Abbreviation     Definition
Abbreviation     Definition
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Python38\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Python38\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Python38\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Python38\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Python38\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\alexd\Dropbox (UFL)\UFII_COVID19_RESEARCH_TOPICS\cord19\text_parsing_pipeline\test.py", line 13, in <module>
    for doc in nlp.pipe(test, n_process=4):
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1475, in pipe
    for doc in docs:
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1511, in _multiprocessing_pipe
    proc.start()
  File "C:\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python38\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Python38\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Python38\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Python38\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

This is the error message from my main piece of code with more data. It sort of makes more sense. I think it has to something to do with how the multiprocess pipe collects the results of the workers. The error pops up after a while so it's definitely running.

Process Process-1:
Traceback (most recent call last):
  File "C:\Python38\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\Python38\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1995, in _apply_pipes
    sender.send([doc.to_bytes() for doc in docs])
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1995, in <listcomp>
    sender.send([doc.to_bytes() for doc in docs])
  File "spacy\tokens\doc.pyx", line 1237, in spacy.tokens.doc.Doc.to_bytes
  File "spacy\tokens\doc.pyx", line 1296, in spacy.tokens.doc.Doc.to_dict
  File "C:\Python38\lib\site-packages\spacy\util.py", line 1134, in to_dict
    serialized[key] = getter()
  File "spacy\tokens\doc.pyx", line 1293, in spacy.tokens.doc.Doc.to_dict.lambda18
  File "C:\Python38\lib\site-packages\srsly\_msgpack_api.py", line 14, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "C:\Python38\lib\site-packages\srsly\msgpack\__init__.py", line 55, in packb
    return Packer(**kwargs).pack(o)
  File "srsly\msgpack\_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "srsly\msgpack\_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "srsly\msgpack\_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "srsly\msgpack\_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
  File "srsly\msgpack\_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'spacy.tokens.span.Span' object

Running spacy 3.0, the latest version, and on Windows 10.

bug help wanted

opened by f0lie 10

How to visualize named entities in custom colors

There's an options in Spacy which allows us to use custom colors for named entity visualization. I'm trying to use the same options in scispacy for the named entities. I simply created two lists of entities and randomly generated colors and put them in options dictionary like the following:

options = {"ents": entities, "colors": colors}

Where entities is a list of NEs in scispacy NER models and colors is a list of the same size. But using such an option in either displacy.serve or displacy.render (for jupyter) does not work. I'm using the options like the following:

displacy.serve(doc, style="ent", options=options)

I wonder if using the color option only works for predefined named entities in the Spacy or there's something wrong with the way I'm using the option?

opened by phosseini 10
What does Doc.tensor contain for non-transformer models?

Hi, we are processing large amounts of text and need to serialize Doc objects efficiently. We are using the sci_md model, and it appears that when converting a Doc to bytes, the majority of the space is taken by the Doc.tensor data. What does that data represent exactly? Is it static, and/or do I have to include it in each serialized Doc object?

opened by ldorigo 9

UserWarning: [W036] The component 'matcher' does not have any patterns defined.

Hello,

Happy Holidays!

I used the last sentence from the example on your README.md file :

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

Here's my code:

import spacy
import scispacy

from scispacy.linking import EntityLinker

nlp = spacy.load("en_ner_craft_md")
nlp.add_pipe("abbreviation_detector")
nlp.add_pipe(
            "scispacy_linker",
            config={"resolve_abbreviations": True, "linker_name":"mesh"},
        )
doc = self.nlp("SBMA can be caused by this easily.") # from the scispacy example

I get the following error:

../site-packages/scispacy/abbreviation.py:230: UserWarning: [W036] The component 'matcher' does not have any patterns defined.
  global_matches = self.global_matcher(doc)

Any guidance would be greatly appreciated!

scispacy                  0.5.1  
spacy                     3.4.4

opened by hrshdhgd 1

"Mesh" and "Hpo" linkers give the same result

Hi, I'm trying to annotate data using Scispacy. Loading "mesh" and "hpo" gives the exact same results no matter what is the input. For example:

I tried on many texts and both linkers plotted the same results.

opened by almogmor 6

incompatability error when installing en_core_sci_sm

I ran `pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz`
and got this error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scispacy 0.4.0 requires spacy<3.1.0,>=3.0.0, but you have spacy 3.4.4 which is incompatible.
en-core-web-sm 3.0.0 requires spacy<3.1.0,>=3.0.0, but you have spacy 3.4.4 which is incompatible.
docanalysis 0.2.0 requires spacy==3.0.7, but you have spacy 3.4.4 which is incompatible.

opened by EmanuelFaria 1

entity recognition doesn't recognize locations
Hi, Thank you for this wonderful library! Trying to use 'en_core_sci_lg' for simple entity recognition task, not sure if I'm missing something in the setup or it's a bug, would appreciate the help. This is the output of an example from spicy documentation.

when trying this:

import spacy nlp = spacy.load("en_core_web_sm") doc = nlp("Apple is looking at buying U.K. startup for $1 billion") for ent in doc.ents: print(ent.text, ent.start_char, ent.end_char, ent.label_)

the result is -

Apple 0 5 ORG U.K. 27 31 GPE $1 billion 44 54 MONEY

**but when trying the same code with en_core_sci_lg - **

import spacy nlp = spacy.load('en_core_sci_lg') doc = nlp("Apple is looking at buying U.K. startup for $1 billion") for ent in doc.ents: print(ent.text, ent.start_char, ent.end_char, ent.label_)

the result is -

Apple 0 5 ENTITY U.K. 27 31 ENTITY startup 32 39 ENTITY

working on google colab, installed the following - `! pip install spacy

! pip install scispacy

! pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_lg-0.5.1.tar.gz`

Thank you!
opened by maayansharon10 1
Parsed Entity linked incorrectly to UMLS concept

Hi,

I'm parsing text from clinicaltrials.gov (Trial ID NCT04837209) using scispaCy plus language model 'en_core_sci_md' and seeing 'Dostarlimab' being linked to UMLS concept C1621793 which is a bird (a Starling).

It looks like this is the result of fuzzy matching - both words have a substring ('starlit') in common - as evident by the low match probability (0.5594).

However, the biologic drug Dostarlimab is in the latest UMLS release (2022AB) as the concept C5242455. Is scispaCy linking to an older version of UMLS?

Thanks, Ron

opened by rxk2rxk 2
CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have began a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15 year old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsantized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

If you have further questions you may contact us through this projects lead researcher Kasimir Schulz.

opened by TrellixVulnTeam 0

Releases(v0.5.1)

v0.5.1(Sep 7, 2022)

Retrains the models with spacy 3.4.x to be compatible with the latest spacy version
Source code(tar.gz)
Source code(zip)
v0.5.0(Mar 10, 2022)

Updates scispacy to be compatiable with the latest spacy version (3.2.3)
Source code(tar.gz)
Source code(zip)
v0.4.0(Feb 12, 2021)

This release of scispacy is compatible with Spacy 3. It also includes a new model 🥳 , en_core_sci_scibert, which uses scibert base uncased to do parsing and POS tagging (but not NER, yet. This will come in a later release).
Source code(tar.gz)
Source code(zip)
v0.3.0(Oct 16, 2020)
New Features

Hearst Patterns

This component implements Automatic Aquisition of Hyponyms from Large Text Corpora using the SpaCy Matcher component.

Passing extended=True to the HyponymDetector will use the extended set of hearst patterns, which include higher recall but lower precision hyponymy relations (e.g X compared to Y, X similar to Y, etc).

This component produces a doc level attribute on the spacy doc: doc._.hearst_patterns, which is a list containing tuples of extracted hyponym pairs. The tuples contain:

The relation rule used to extract the hyponym (type: str)

The more general concept (type: spacy.Span)

The more specific concept (type: spacy.Span)

Usage:

import spacy from scispacy.hyponym_detector import HyponymDetector nlp = spacy.load("en_core_sci_sm") hyponym_pipe = HyponymDetector(nlp, extended=True) nlp.add_pipe(hyponym_pipe, last=True) doc = nlp("Keystone plant species such as fig trees are good for the soil.") print(doc._.hearst_patterns) >>> [('such_as', Keystone plant species, fig trees)]

Ontonotes Mixin: Clear Format > UD

Thanks to Yoav Goldberg for this fix! Yoav noticed that the dependency labels for the Onotonotes data use a different format than the converted GENIA Trees. Yoav wrote some scripts to convert between them, including normalising of some syntactic phenomena that were being treated inconsistently between the two corpora.

Bug Fixes

#252 - removed duplicated aliases in the entity linkers, reducing the size of the UMLS linker by ~10% #249 - fix the path to the rxnorm linker
Source code(tar.gz)
Source code(zip)
v0.2.5(Jul 8, 2020)
New Features 🥇

New Models

Models compatible with Spacy 2.3.0 🥳

Entity Linkers

#246, #233

Updated the UMLS KB to use the 2020AA release, categories 0,1,2,9.

umls: Links to the Unified Medical Language System, levels 0,1,2 and 9. This has ~3M concepts.

mesh: Links to the Medical Subject Headings. This contains a smaller set of higher quality entities, which are used for indexing in Pubmed. MeSH contains ~30k entities. NOTE: The MeSH KB is derrived directly from MeSH itself, and as such uses different unique identifiers than the other KBs.

rxnorm: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It is comprised of several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.

go: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.

hpo: Links to the Human Phenotype Ontology. The Human Phenotype Ontology contains 16k concepts focused on phenotypic abnormalities encountered in human disease.

Bug Fixes 🐛

#217 - Fixes a bug in the Abbreviation detector

API Changes

Entity Linkers now modify the Span._.kb_ents rather than the Span._.umls_ents to reflect the fact that we now have more than one entity linker. Span._.umls_ents will be deprecated in v1.0.

Source code(tar.gz)
Source code(zip)
v0.2.4(Oct 23, 2019)

Retrains the models to be compatible with spacy 2.2.1 and rewrites the optional sentence splitting pipe to use pysbd. This pipe is experimental at this point and may be rough around the edges.
Source code(tar.gz)
Source code(zip)
v0.2.2(Jun 3, 2019)

Adds entity linking and abbreviation detection.
Source code(tar.gz)
Source code(zip)
v0.2.0(Apr 3, 2019)

Update spacy version to 2.1.3
Source code(tar.gz)
Source code(zip)

A full spaCy pipeline and models for scientific/biomedical documents.

Related tags

Overview

Installation

Setting up a virtual environment

Note on upgrading

Available Models

Additional Pipeline Components

AbbreviationDetector

Example Usage

EntityLinker

Example Usage

Hearst Patterns (v0.3.0 and up)

Usage:

Citing

Comments

Steps to reproduce:

Patching CVE-2007-4559

Releases(v0.5.1)

v0.5.1(Sep 7, 2022)

v0.5.0(Mar 10, 2022)

v0.4.0(Feb 12, 2021)

v0.3.0(Oct 16, 2020)

New Features

Hearst Patterns

Usage:

Ontonotes Mixin: Clear Format > UD

Bug Fixes

v0.2.5(Jul 8, 2020)

New Features 🥇

New Models

Entity Linkers

Bug Fixes 🐛

API Changes

v0.2.4(Oct 23, 2019)

v0.2.2(Jun 3, 2019)

v0.2.0(Apr 3, 2019)

Owner

AI2

Spacy-ginza-ner-webapi - Named Entity Recognition API with spaCy and GiNZA

spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines

DaCy: The State of the Art Danish NLP pipeline using SpaCy

🌸 fastText + Bloom embeddings for compact, full-coverage vectors with spaCy

EMNLP'2021: Can Language Models be Biomedical Knowledge Bases?

NLP, before and after spaCy

NLP, before and after spaCy

🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

NLP, before and after spaCy

🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

🧪 Cutting-edge experimental spaCy components and features

BERN2: an advanced neural biomedical namedentity recognition and normalization tool

✨Fast Coreference Resolution in spaCy with Neural Networks

spaCy plugin for Transformers , Udify, ELmo, etc.

✨Fast Coreference Resolution in spaCy with Neural Networks

spaCy plugin for Transformers , Udify, ELmo, etc.

SpikeX - SpaCy Pipes for Knowledge Extraction

Augmenty is an augmentation library based on spaCy for augmenting texts.

A spaCy wrapper of OpenTapioca for named entity linking on Wikidata