Blue Brain text mining toolbox for semantic search and structured information extraction

Overview

Blue Brain Search

Blue Brain Search is a text mining toolbox to perform semantic literature search and structured information extraction from text sources.

This repository originated from the Blue Brain Project efforts on exploring and mining the CORD-19 dataset.

Graphical Interface

The graphical interface is composed of widgets to be used in Jupyter notebooks.

For the graphical interface to work, the steps of the Getting Started should have been completed successfully.

Find documents based on sentence semantic similarity

Search Widget

To find sentences semantically similar to the query 'Glucose is a risk factor for COVID-19' in the documents, you could just click on the blue button named Search Literature!. You could also enter the query of your choice by editing the text in the top field named Query.

The returned results are ranked by decreasing semantic similarity. This means that the first results have a similar meaning to the query. Thanks to the state-of-the-art approach based on deep learning used by Blue Brain Search, this is true even if the query and the sentences from the documents do not share the same words (e.g. they are synonyms, they have a similar meaning, ...).
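To illustrate what ranking by decreasing semantic similarity means, here is a minimal sketch. It assumes cosine similarity as the similarity measure, which this README does not state, and it uses invented 3-dimensional toy vectors instead of real sentence embeddings. The function names are hypothetical and are not part of Blue Brain Search.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, sentence_vecs):
    """Return (sentence, score) pairs sorted by decreasing similarity."""
    scored = [(s, cosine_similarity(query_vec, v)) for s, v in sentence_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings invented for illustration only (assumption, not model output).
query = [0.9, 0.1, 0.0]
sentences = {
    "High blood sugar worsens outcomes.": [0.8, 0.2, 0.1],
    "The virus spreads through droplets.": [0.1, 0.9, 0.2],
    "Hyperglycemia increases severity.": [0.9, 0.2, 0.0],
}

ranking = rank_by_similarity(query, sentences)
for sentence, score in ranking:
    print(f"{score:.3f}  {sentence}")
```

Note that the two top-ranked toy sentences share no word with each other, which is the situation described above: the ranking depends on the vectors, not on word overlap.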

Extract structured information from documents

The extraction could be done either on documents found by the search above or on the text content of a document pasted in the widget.

Found documents

Mining Widget (articles)

To extract structured information from the found documents, you could just click on the blue button named Mine Selected Articles!.

At the moment, the returned results are named entities. For each named entity, the structured information is: the mention (e.g. 'COVID-19'), the type (e.g. 'DISEASE'), and its location up to the character in the document.
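As a rough picture of such a record, the sketch below stores the mention, the type, and character offsets. The class and field names are hypothetical and chosen for illustration. They are not the schema used by Blue Brain Search, and the naive string lookup only stands in for a real NER model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedEntity:
    """One extracted entity (hypothetical field names)."""

    mention: str       # e.g. 'COVID-19'
    entity_type: str   # e.g. 'DISEASE'
    start_char: int    # offset of the first character in the document
    end_char: int      # offset just after the last character


def locate(text, mention, entity_type):
    """Build a record for the first occurrence of `mention`, or None."""
    start = text.find(mention)
    if start == -1:
        return None
    return NamedEntity(mention, entity_type, start, start + len(mention))


text = "Glucose is a risk factor for COVID-19."
entity = locate(text, "COVID-19", "DISEASE")
print(entity)
# Slicing the document with the offsets gives back the mention.
print(text[entity.start_char:entity.end_char])
```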

Pasted document content

Mining Widget (text)

It is also possible to extract structured information from the pasted content of a document. To switch to this mode, you could just click on the tab named Mine Text. Then, you could launch the extraction by just clicking on the blue button named Mine This Text!. You could also enter the content of your choice by editing the text field.

Getting Started

There are 9 steps which need to be done in the following order:

  1. Prerequisites
  2. Retrieve the documents
  3. Initialize the database server
  4. Install Blue Brain Search
  5. Create the database
  6. Compute the sentence embeddings
  7. Create the mining cache
  8. Initialize the search, mining, and notebook servers
  9. Open the example notebook

Before proceeding, four things need to be noted.

First, these instructions are to reproduce the environment and results of Blue Brain Search v0.1.0. Indeed, this is the version for which the models we have trained have been publicly released.

Second, the setup of Blue Brain Search requires the launch of 4 servers (database, search, mining, notebook). The instructions are supposed to be executed on a powerful remote machine and the notebooks are supposed to be accessed from a personal local machine through the network.

Third, the ports, the Docker image names, and the Docker container names are modified (see below) to safely test the instructions on a machine where the Docker images would have already been built, the Docker containers would already run, and the servers would already run.

Fourth, if you are in a production setting, the database password and the notebook server token should be changed, the prefix test_ should be removed from the Docker image and container names, the sed commands should be omitted, and the second digit of the ports should be replaced by 8.
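The renaming rules above could be sketched as follows. The helper names are hypothetical. The port values are the test values used in these instructions, and the random token is only one possible way to pick a new secret.

```python
import secrets


def production_port(test_port):
    """Replace the second digit of a port number by 8 (e.g. 8953 -> 8853)."""
    digits = str(test_port)
    return int(digits[0] + "8" + digits[2:])


def production_name(test_name, prefix="test_"):
    """Remove the test_ prefix from a Docker image or container name."""
    return test_name[len(prefix):] if test_name.startswith(prefix) else test_name


# Test ports as defined in the configuration of these instructions.
test_ports = {"database": 8953, "search": 8950, "mining": 8952, "notebook": 8954}
prod_ports = {name: production_port(port) for name, port in test_ports.items()}
print(prod_ports)
print(production_name("test_bbs_mysql"))

# A fresh random value that could replace the example password or token.
print(secrets.token_hex(16))
```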

Prerequisites

The instructions are written for GNU/Linux machines. However, any machine with the equivalent of git, wget, tar, cd, mv, mkdir, sed (optional), and echo could be used.

The software named Docker is also needed. To install Docker, please refer to the official Docker documentation.

An optional part is using the programming language Python and its package manager pip. To install Python and pip please refer to the official Python documentation.

Otherwise, let's start in a newly created directory.

First, download the snapshot of the DVC remote and extract it.

wget https://zenodo.org/record/4589007/files/bbs_dvc_remote.tar.gz
tar xf bbs_dvc_remote.tar.gz

Second, clone the Blue Brain Search repository for v0.1.0.

git clone --depth 1 --branch v0.1.0 https://github.com/BlueBrain/Search.git

Third, keep track of the path to the working directory, the repository directory, and the data and models directory.

export WORKING_DIRECTORY="$(pwd)"
export REPOSITORY_DIRECTORY="$WORKING_DIRECTORY/Search"
export BBS_DATA_AND_MODELS_DIR="$REPOSITORY_DIRECTORY/data_and_models"

Finally, define the configuration common to all the instructions.

export DATABASE_PORT=8953
export SEARCH_PORT=8950
export MINING_PORT=8952
export NOTEBOOK_PORT=8954

export DATABASE_PASSWORD=1234
export NOTEBOOK_TOKEN=1a2b3c4d

export USER_NAME=$(id -un)
export USER_ID=$(id -u)

export http_proxy=http://bbpproxy.epfl.ch:80/
export https_proxy=http://bbpproxy.epfl.ch:80/

Retrieve the documents

This will download and decompress the CORD-19 version corresponding to the version 73 on Kaggle. Note that the data are around 7 GB. Decompression would take around 3 minutes.

export CORD19_VERSION=2021-01-03
export CORD19_ARCHIVE=cord-19_${CORD19_VERSION}.tar.gz
export CORD19_DIRECTORY=$WORKING_DIRECTORY/$CORD19_VERSION
cd $WORKING_DIRECTORY
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/$CORD19_ARCHIVE
tar xf $CORD19_ARCHIVE
cd $CORD19_DIRECTORY
tar xf document_parses.tar.gz

CORD-19 contains more than 400,000 publications. The next sections could run for several hours, even days, depending on the power of the machine.

For testing purposes, you might want to consider a subset of CORD-19. The following code selects around 1,400 articles about glucose and risk factors:

mv metadata.csv metadata.csv.original
pip install pandas
python
import pandas as pd
metadata = pd.read_csv('metadata.csv.original')
sample = metadata[
    metadata.title.str.contains('glucose', na=False)
    | metadata.title.str.contains('risk factor', na=False)
  ]
print('The subset contains', sample.shape[0], 'articles.')
sample.to_csv('metadata.csv', index=False)
exit()
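If pandas is not available, the same selection logic could be sketched with the standard library only. The example below runs on a tiny invented CSV rather than on the real metadata.csv, and the function name is hypothetical. Like the pandas code above, the match is case-sensitive and rows without a title are ignored.

```python
import csv
import io

KEYWORDS = ("glucose", "risk factor")


def select_rows(rows, keywords=KEYWORDS):
    """Keep rows whose title contains at least one keyword."""
    selected = []
    for row in rows:
        title = row.get("title") or ""  # a missing title counts as no match
        if any(keyword in title for keyword in keywords):
            selected.append(row)
    return selected


# Toy data invented for illustration; the real file has many more columns.
toy_csv = io.StringIO(
    "cord_uid,title\n"
    "a1,Blood glucose and outcomes\n"
    "b2,\n"
    "c3,A risk factor analysis\n"
    "d4,Vaccine trial design\n"
)
rows = list(csv.DictReader(toy_csv))
subset = select_rows(rows)
print("The subset contains", len(subset), "articles.")
```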

Initialize the database server

export DATABASE_NAME=cord19
export DATABASE_URL=$HOSTNAME:$DATABASE_PORT/$DATABASE_NAME

This will build a Docker image where MySQL is installed.

cd $REPOSITORY_DIRECTORY
docker build \
  --build-arg http_proxy \
  --build-arg https_proxy  \
  -f docker/mysql.Dockerfile -t test_bbs_mysql .

NB: HTTP_PROXY and HTTPS_PROXY, in upper case, do not work here.

This will launch a MySQL server in a Docker container using this image.

docker run \
  --publish $DATABASE_PORT:3306 \
  --env MYSQL_ROOT_PASSWORD=$DATABASE_PASSWORD \
  --detach \
  --name test_bbs_mysql test_bbs_mysql
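
Since the container is started in detached mode, the server might need a moment before it accepts connections. The following is a minimal sketch of a check that the published port is reachable. The function names are hypothetical, and an open TCP port is only a hint, not a guarantee, that MySQL has finished initializing.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, attempts=30, delay=1.0):
    """Poll host:port until it accepts connections or attempts run out."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False


# Example call (not executed here), with the test port of these instructions:
# wait_for_port("localhost", 8953)
```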

You will be asked to enter the MySQL root password defined above (DATABASE_PASSWORD).

docker exec --interactive --tty test_bbs_mysql bash
mysql -u root -p

Please replace <database name> by the value of DATABASE_NAME.

CREATE DATABASE <database name>;
CREATE USER 'guest'@'%' IDENTIFIED WITH mysql_native_password BY 'guest';
GRANT SELECT ON <database name>.* TO 'guest'@'%';
exit;

Please exit the interactive session on the test_bbs_mysql container.

exit

Install Blue Brain Search

This will build a Docker image where Blue Brain Search is installed.

cd $REPOSITORY_DIRECTORY
docker build \
  --build-arg BBS_HTTP_PROXY=$http_proxy \
  --build-arg BBS_http_proxy=$http_proxy \
  --build-arg BBS_HTTPS_PROXY=$https_proxy \
  --build-arg BBS_https_proxy=$https_proxy \
  --build-arg BBS_USERS="$USER_NAME/$USER_ID" \
  -f docker/base.Dockerfile -t test_bbs_base .

NB: At the moment, HTTP_PROXY, HTTPS_PROXY, http_proxy, and https_proxy are not working here.

This will launch an interactive session in a Docker container using this image.

The immediate next sections will need to be run in this session.

docker run \
  --volume /raid:/raid \
  --env REPOSITORY_DIRECTORY \
  --env CORD19_DIRECTORY \
  --env WORKING_DIRECTORY \
  --env DATABASE_URL \
  --env BBS_DATA_AND_MODELS_DIR \
  --gpus all \
  --interactive \
  --tty \
  --rm \
  --user "$USER_NAME" \
  --name test_bbs_base test_bbs_base
cd $REPOSITORY_DIRECTORY
pip install .[data_and_models]

NB: The optional dependencies installed with the [data_and_models] option are only necessary if you want to run training or inference using DVC and the models and scripts contained under data_and_models/. If this is not the case, you can omit [data_and_models] at the end of pip install.

Then, configure DVC to work with the downloaded snapshot of the DVC remote.

dvc remote add --default local $WORKING_DIRECTORY/bbs_dvc_remote

Create the database

You will be asked to enter the MySQL root password defined above (DATABASE_PASSWORD).

If you are using the CORD-19 subset of around 1,400 articles, this would take around 3 minutes.

create_database \
  --cord-data-path $CORD19_DIRECTORY \
  --db-url $DATABASE_URL

Compute the sentence embeddings

If you are using the CORD-19 subset of around 1,400 articles, this would take around 2 minutes (on 2 Tesla V100 16 GB).

export EMBEDDING_MODEL='BioBERT NLI+STS CORD-19 v1'
export BBS_SEARCH_EMBEDDINGS_PATH=$WORKING_DIRECTORY/embeddings.h5
cd $BBS_DATA_AND_MODELS_DIR/models/sentence_embedding/
dvc pull biobert_nli_sts_cord19_v1
compute_embeddings SentTransformer $BBS_SEARCH_EMBEDDINGS_PATH \
  --checkpoint biobert_nli_sts_cord19_v1 \
  --db-url $DATABASE_URL \
  --gpus 0,1 \
  --h5-dataset-name "$EMBEDDING_MODEL" \
  --n-processes 2

NB: At the moment, compute_embeddings handles more models than the search server. The supported models for the search could be found in SearchServer._get_model(...).

Create the mining cache

cd $BBS_DATA_AND_MODELS_DIR/pipelines/ner/
dvc pull $(< dvc.yaml grep -oE '\badd_er_[0-9]+\b' | xargs)

You will be asked to enter the MySQL root password defined above (DATABASE_PASSWORD).

If you are using the CORD-19 subset of around 1,400 articles, this would take around 4 minutes.

cd $REPOSITORY_DIRECTORY
create_mining_cache \
  --db-url $DATABASE_URL \
  --target-table-name=mining_cache

NB: By default, the logging level is set to show the INFO logs. Note also that the command cd $REPOSITORY_DIRECTORY above is essential as otherwise the mining models will not be found.

Initialize the search, mining, and notebook servers

Please exit the interactive session of the test_bbs_base container.

exit
cd $REPOSITORY_DIRECTORY

Search server

sed -i 's/ bbs_/ test_bbs_/g' docker/search.Dockerfile
docker build \
  -f docker/search.Dockerfile -t test_bbs_search .

Please export also in this environment the variables EMBEDDING_MODEL and BBS_SEARCH_EMBEDDINGS_PATH.

export BBS_SEARCH_DB_URL=$DATABASE_URL
export BBS_SEARCH_MYSQL_USER=guest
export BBS_SEARCH_MYSQL_PASSWORD=guest

export BBS_SEARCH_MODELS_PATH=$BBS_DATA_AND_MODELS_DIR/models/sentence_embedding/
export BBS_SEARCH_MODELS=$EMBEDDING_MODEL
docker run \
  --publish $SEARCH_PORT:8080 \
  --volume /raid:/raid \
  --env BBS_SEARCH_DB_URL \
  --env BBS_SEARCH_MYSQL_USER \
  --env BBS_SEARCH_MYSQL_PASSWORD \
  --env BBS_SEARCH_MODELS \
  --env BBS_SEARCH_MODELS_PATH \
  --env BBS_SEARCH_EMBEDDINGS_PATH \
  --detach \
  --name test_bbs_search test_bbs_search

Mining server

sed -i 's/ bbs_/ test_bbs_/g' docker/mining.Dockerfile
docker build \
  -f docker/mining.Dockerfile -t test_bbs_mining .
export BBS_MINING_DB_TYPE=mysql
export BBS_MINING_DB_URL=$DATABASE_URL
export BBS_MINING_MYSQL_USER=guest
export BBS_MINING_MYSQL_PASSWORD=guest
docker run \
  --publish $MINING_PORT:8080 \
  --volume /raid:/raid \
  --env BBS_MINING_DB_TYPE \
  --env BBS_MINING_DB_URL \
  --env BBS_MINING_MYSQL_USER \
  --env BBS_MINING_MYSQL_PASSWORD \
  --detach \
  --name test_bbs_mining test_bbs_mining

Notebook server

The structured information searched and extracted using the text mining tools provided by Blue Brain Search can be conveniently transformed and analyzed as a knowledge graph using the tools provided by Blue Brain Graph.

To use the complete pipeline (literature search, text mining, and transformation into a knowledge graph), you should use the proof of concept notebook BBS_BBG_poc.ipynb from our dedicated repository. To use this notebook, please follow the instructions from the dedicated README.

If you want to set up the notebook in a Docker container, please create an environment variable called NOTEBOOK_DIRECTORY and launch the following command:

export NOTEBOOK_DIRECTORY="$WORKING_DIRECTORY/Search-Graph-Examples"
docker run \
  --publish $NOTEBOOK_PORT:8888 \
  --volume /raid:/raid \
  --env NOTEBOOK_TOKEN \
  --env DB_URL \
  --env SEARCH_ENGINE_URL \
  --env TEXT_MINING_URL \
  --interactive \
  --tty \
  --rm \
  --user "$USER_NAME" \
  --workdir $NOTEBOOK_DIRECTORY \
  --name test_bbs_notebook test_bbs_base

Do not hesitate to check the Blue Brain Search-Graph-Examples repository if you encounter any issues with the notebook.

Please hit CTRL+P and then CTRL+Q to detach from the Docker container.

Open the example notebook

echo "http://$HOSTNAME:$NOTEBOOK_PORT/lab/tree/BBS_BBG_poc.ipynb?token=$NOTEBOOK_TOKEN"

To open the example notebook, please open the link returned above in a browser.
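
For reference, the link has the following structure. This is a minimal sketch that rebuilds it with the standard library. The function name is hypothetical, and the host, port, and token are the example values from these instructions.

```python
from urllib.parse import urlencode, urlunsplit


def notebook_url(host, port, token, notebook="BBS_BBG_poc.ipynb"):
    """Build the JupyterLab link to the example notebook."""
    query = urlencode({"token": token})  # also escapes special characters
    return urlunsplit(("http", f"{host}:{port}", f"/lab/tree/{notebook}", query, ""))


url = notebook_url("localhost", 8954, "1a2b3c4d")
print(url)
```

Building the query string with urlencode would also keep the link valid if the token contained characters such as & or =.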

Voilà! You could now use the graphical interface.

Clean-up

Please note that this will DELETE EVERYTHING that was done in the previous sections of this Getting Started. This could be useful after having tried the instructions or when something went wrong.

export SERVERS='test_bbs_search test_bbs_mining test_bbs_mysql'
docker stop test_bbs_notebook $SERVERS
docker rm $SERVERS
docker rmi $SERVERS test_bbs_base
rm $BBS_SEARCH_EMBEDDINGS_PATH
rm -R $CORD19_DIRECTORY
rm $WORKING_DIRECTORY/$CORD19_ARCHIVE
rm -R $REPOSITORY_DIRECTORY

Installation (virtual environment)

We currently support the following Python versions. Make sure you are using one of them.

  • Python 3.7
  • Python 3.8
  • Python 3.9

Before installation, please make sure you have a recent pip installed (>=19.1).

pip install --upgrade pip

Then you can easily install bluesearch from PyPI:

pip install bluesearch[data_and_models]

You can also build from source if you prefer:

pip install .[data_and_models]

Installation (Docker)

We provide a Dockerfile, docker/Dockerfile, that allows you to build a Docker image with all dependencies of bluesearch pre-installed. Note that bluesearch itself is not installed. This needs to be done manually in each container that is spawned.

To build the Docker image, open a terminal in the root directory of the project and run the following command.

$ docker build -f docker/Dockerfile -t bbs .

Then, to spawn an interactive container session run

$ docker run -it --rm bbs

Documentation

We provide additional information on the package in the documentation. All the versions of our documentation, both stable and latest, can be found on Read the Docs.

If you want to manually build the documentation, you can do so using Sphinx. Make sure to install the bluesearch package with dev extras to get the necessary dependencies.

pip install -e .[dev]

Then, to generate the documentation run

cd docs
make clean && make html

You can open the resulting documentation in a browser by navigating to docs/_build/html/index.html.

Testing

We use tox to run all our tests. Running tox in the terminal will execute the following environments:

  • lint: code style and documentation checks
  • docs: test doc build
  • check-packaging: test packaging
  • py37: run unit tests (using pytest) with python3.7
  • py38: run unit tests (using pytest) with python3.8
  • py39: run unit tests (using pytest) with python3.9

Each of these environments can be run separately using the following syntax:

$ tox -e lint

This will only run the lint environment.

We provide several convenience tox environments that are not run automatically and have to be triggered by hand:

  • format
  • benchmarks

The format environment will reformat all source code using isort and black.

The benchmarks environment will run pre-defined pytest benchmarks. Currently these benchmarks only test various servers and therefore need to know the server URLs. The URLs can be passed to tox via the following environment variables:

export EMBEDDING_SERVER=http://<url>:<port>
export MINING_SERVER=http://<url>:<port>
export MYSQL_SERVER=<url>:<port>
export SEARCH_SERVER=http://<url>:<port>

If a server URL is not defined, then the corresponding tests will be skipped.
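
The skipping behavior could be pictured with the following sketch. It only illustrates the rule stated above and is not the actual code of the benchmarks. The function name and the example URL are hypothetical.

```python
import os

SERVER_VARIABLES = (
    "EMBEDDING_SERVER",
    "MINING_SERVER",
    "MYSQL_SERVER",
    "SEARCH_SERVER",
)


def split_servers(environ=None):
    """Separate the defined server URLs from those whose tests are skipped."""
    environ = os.environ if environ is None else environ
    defined = {name: environ[name] for name in SERVER_VARIABLES if environ.get(name)}
    skipped = [name for name in SERVER_VARIABLES if name not in defined]
    return defined, skipped


# Example with only the search server defined (hypothetical URL).
defined, skipped = split_servers({"SEARCH_SERVER": "http://localhost:8950"})
print("run benchmarks for:", sorted(defined))
print("skip benchmarks for:", skipped)
```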

It is also possible to provide additional positional arguments to pytest using the following syntax:

$ tox -e benchmarks -- <positional arguments>

for example:

$ tox -e benchmarks -- \
  --benchmark-histogram=my_histograms/benchmarks \
  --benchmark-max-time=1.5 \
  --benchmark-min-rounds=1

See pytest --help for additional options.

Funding & Acknowledgment

This project was supported by funding to the Blue Brain Project, a research center of the Ecole polytechnique fédérale de Lausanne, from the Swiss government's ETH Board of the Swiss Federal Institutes of Technology.

COPYRIGHT (c) 2021 Blue Brain Project/EPFL

Comments
  • #355 Integrate Sentence Embedding training and fine-tuning in DVC pipeline.

    Fixes #355.

    Note

    Before running the whole pipeline (evaluation@biobert_nli_sts_cord19_v1), please read this comment.

    Description

    After #343 we found out that some reproducibility issues are in fact related only to torch.save(). torch==1.9.0 has since been released, and this version contains the patch resolving this reproducibility issue.

    First step of this PR was to check that training and fine-tuning our Sentence Embedding model is now reproducible. It is indeed the case (see How to check reproducibility? to reproduce the experiment).

    As it is now reproducible, this PR handles the integration of those training and fine-tuning steps into the DVC pipeline. What is done during this PR:

    • Rename scripts to training_transformers for more clarity
    • Update the dvc.yaml file, which now contains two new steps (i.e. training_transformers and fine_tuning_transformers)
    • Update requirements.txt and setup.py with the new release of torch==1.9.0.

    Small question still to answer: Should we remove build.sh and the reference of it in the README.md as now the training is handled by DVC ?

    How to check reproducibility?

    The reproducibility of training and fine-tuning our Sentence Embedding model was checked with the following steps:

    cd data_and_models/pipelines/sentence_embedding/scripts/
    # Create a copy of sentences-filtered_11-527-877.txt with a sample of all sentences
    sed -n '1,100000p' sentences-filtered_11-527-877.txt > sentences-filtered_11-527-877_sample.txt
    
    # Make sure you have the latest version of torch (1.9.0) and the right version of transformers
    pip install --upgrade torch
    pip install transformers==3.4.0
    
    # Launch the scripts after changing the TRAIN environment variable in the script and also the output directory contained under TEMP environment variable
    ./build.sh
    

    You need to launch this script twice. Because the output directory is part of the training arguments, the final binary files that save those arguments will differ if the output directory differs between the two runs. However, the binary file containing the weights is now fully reproducible.
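
One way to compare the weight files of the two runs is to compare their checksums. The sketch below is an assumption about how this could be done and is not part of the PR. The file names are hypothetical, and the demo uses small invented byte strings instead of real model weights.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True if both files are bitwise identical (same digest)."""
    return sha256_of(path_a) == sha256_of(path_b)


# Demo on toy files standing in for the weights of separate runs.
with tempfile.TemporaryDirectory() as tmp:
    run_1 = Path(tmp) / "run_1_weights.bin"
    run_2 = Path(tmp) / "run_2_weights.bin"
    run_3 = Path(tmp) / "run_3_weights.bin"
    run_1.write_bytes(b"\x00\x01\x02")
    run_2.write_bytes(b"\x00\x01\x02")
    run_3.write_bytes(b"\x00\x01\x03")
    identical = same_content(run_1, run_2)
    different = same_content(run_1, run_3)

print("runs 1 and 2 identical:", identical)
print("runs 1 and 3 identical:", different)
```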

    How to test?

    Please provide here instructions on how to test the changes introduced by this PR. (if some changes cannot be tested by automated tests)

    Checklist

    • [x] This PR refers to an issue from the issue tracker. (if it is not the case, please create an issue first).
    • [x] Unit tests added. (if needed)
    • [x] Documentation and whatsnew.rst updated. (if needed)
    • [X] setup.py and requirements.txt updated with new dependencies. (if needed)
    • [X] Type annotations added. (if a function is added or modified)
    • [x] All CI tests pass.
    🦉 dvc 
    opened by EmilieDel 27
  • [BBS 199] Upgrade torch + transformers and investigate MP start method

    JIRA: BBS-199

    TODO

    • [x] Change requirements.txt

    • [x] Check manually whether multiprocessing inside of compute_embeddings works (we do not have unit tests that actually run multiprocessing). @pafonta feel free to find a failure case.

      • There is an issue: compute_embeddings does not work correctly with newer versions of transformers. See huggingface/transformers#8801.
    • [ ] Build base docker image (will do it once merged)

    opened by jankrepl 21
  • Add knowledge graph building process steps

    Hello,

    With @annakristinkaufmann, we have added the process steps for knowledge graph building to the BBS BBG PoC notebook.

    We are proposing the variable table as a way to have the two notebook sections connected. This variable could of course be renamed.

    opened by pafonta 20
  • Add knowledge graph data model and RDF graph

    Hello,

    This is a first iteration on building a knowledge graph from the output of the NERs and REs.

    For this first iteration, the data model represents the recognized entities and their provenance and enables semantic search over them. Real example data are represented with this RDF data model and loaded into a data structure that understands RDF and supports operations on it.

    The next iteration will improve the semantic representation of the data.

    opened by pafonta 18
  • Use individual spaCy with transformer backbones for all NER models

    🚀 Feature

    In light of what we discussed in PR #328, and in particular looking at the results shown in this comparison table, we should operate the following changes to the NER models in data_and_models/pipelines/ner.

    • [x] All NER models should use a transformer backbone. Now they are using tok2vec.
    • [x] All NER models should initialize the weights of this backbone using the pre-trained weights of CORD19 NLI+STS v1. Also, the weights should not be frozen during the fine-tuning on the NER task.
    • [x] There should be one distinct NER model (= spaCy pipeline) for each entity type we support. Note that currently this is not the case, as e.g. model2 is used to extract 3 different entity types (see table here).
    • [x] Unlike the experiments of PR #328, the spaCy pipeline should also include the rule-based entity extraction component (Note: for the moment let's keep using add_er.py, then in the future we'll improve that with #310).
    • [x] All the evaluation results (token and entity based, Prec, Rec, F1) obtained before and after this PR should be collected in a table.
    🔤 named-entity-recognition 
    opened by FrancescoCasalegno 17
  • First draft for NER models improvement processes

    Context

    As requested, it is of the highest importance not only that our NER models improve their accuracy, but also that we implement features and define processes that make it as seamless as possible to improve our NER models, by allowing users to address the two following use cases.

    1. Add support for a new entity type.
    2. Correct errors observed in predictions.

    Ideas for this process

    • Get inspired by prodigy process here:
    • Is the new entity type a sub-type of an already existing entity type? (e.g. MAMMAL is sub-type of ANIMAL)
      • If Yes, then redirect this problem to Ontology Linking and Blue Graph
    • How do we provide estimates on how many training samples will be needed? Can we do this iteratively e.g. using #276 learning curves?
    • Shall we always train a "statistical model" or consider using an EntityRuler?
    • In any case, maybe at least some training samples are needed for testing? How many?

    Actions

    • [x] Create draft of process to add support for a new entity type.
    • [x] Create draft of process to correct errors observed in predictions.
    🔤 named-entity-recognition 
    opened by FrancescoCasalegno 16
  • CI aka Jenkins config

    The config lies here: bbsearch-jenkins

    So currently our CI does (more or less) the following things:

    1. Install package via setup.py i.e. pip install --upgrade .[dev]
    2. Run unit tests pytest

    I wanted to discuss addition of multiple other "code quality" tools but at the same time I do not want to impose some overkill requirements.

    • flake8 - non-zero exit code if PEP8 not satisfied
    • pydocstyle - non-zero exit code if docstrings missing/wrongly formatted w.r.t numpydoc
    • pytest-coverage - either just printing the coverage at the end (--cov-report=term) or if we want to be more extreme we can impose a minimum coverage --cov-fail-under=MIN

    Please feel free to share your opinions on this. @Stannislav @EmilieDel @FrancescoCasalegno

    opened by jankrepl 13
  • Test reproducibility of spaCy training

    Context

    We have already seen (see BBS-198) that torch training results (i.e. model weights) are not bitwise reproducible.

    On the other hand, spaCy training has been reproducible until now. But we have been using tok2vec as a backbone, not transformer, so can this change the situation?

    Actions

    • [x] Test if spacy NER training is reproducible when using a transformer backbone instead of tok2vec.
    • [x] If No to the previous question, is the training also not reproducible with a frozen transformer?
    • [x] If training appears not to be reproducible, ask spaCy developers if this is expected, since it seems to contradict the multiple times that "reproducible" is mentioned in their docs.
    opened by FrancescoCasalegno 12
  • Compare runtimes of spaCy NER pipelines using CPU and GPU

    Description

    While adopting a transformer backbone for our spaCy NER models may be beneficial in terms of accuracy (see #335), this may also imply slower runtime with respect to using a simpler tok2vec.

    To run on GPUs using spaCy it seems that only 2 things are needed (see here for complete guide).

    1. pip install --upgrade spacy[<cuda_version>];
    2. specify spacy.require_gpu() before any spacy.load(some_model).

    The GPU version should be checked before installing the right spacy[<cuda_version>]. For instance, given

    $ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Wed_Oct_23_19:24:38_PDT_2019
    Cuda compilation tools, release 10.2, V10.2.89
    

    we should pip install --upgrade spacy[cuda102], since we have cuda v10.2.

    Actions

    Collect results of runtimes of inference using spaCy pipelines in a variety of settings:

    • [x] running on CPU vs GPU
    • [x] using backbone tok2vec vs transformer
    • [x] with pipeline = ["transformer", "ner"] vs pipeline = ["transformer", "tagger", "attribute_ruler", "lemmatizer", "parser", "ner", "entity_ruler"]
    optimization 🔤 named-entity-recognition 
    opened by FrancescoCasalegno 12
  • [BBS-293] Migrate spaCy 2.x -> 3.x

    Fixes #274.

    Description

    PR progress

    The detailed scope is available here: https://github.com/BlueBrain/Search/issues/274#issuecomment-801108330. See especially the section "Out of scope" for links to follow-up work.

    PR changes

    • Removed the installation of Prodigy in pipelines/ner/Dockerfile. Now spacy train is used.
    • Removed the patch from #268. Patched issue fixed in spaCy >= 3.0.4.
    • Migrated add_pipe(EntityRuler, ...) to spaCy 3. Now add_pipe(...) has a new API.
    • Migrated the NER training to spacy train and spaCy 3. Now Prodigy is no longer needed to train NER models.
    • Convert .jsonl annotations to .spacy files for spaCy 3. See #308 for migrating to .spacy permanently.
    • Use the default configuration of spaCy 3 for NER training. See #309 for using the config.cfg from Prodigy.
    • Upgraded en_core_web_sm and scispaCy models for spaCy 3.
    • Upgraded spaCy to >= 3.0.4 and scispaCy to 0.4.0.

    Note on Prodigy for spaCy 3

    At the moment, Prodigy has not been released for spaCy 3 yet. The progress on the release of the new Prodigy could be followed here and here.

    Note on the NER performances

    The base models from scispaCy need to be upgraded (v0.2.5 to v0.4.0) to be loadable with spaCy 3.

    Improving the performances of the trained models is not part of the PR. See #309, #294, and #295 instead.

    There are performance changes compared to master(scores for 86844fe0fa3d9731c5bb0b4f9b43a3eeb57fe80d):

    • According to the F1 score from pipeline/ner/eval.py:

      • 8 entities have an increase, especially cell_compartment (+12 %), organism (+7.2 %), and drug (+5.8 %)
      • 1 entity has a negligible decrease: protein (-0.73 %)
      • 1 entity has a concerning decrease: pathway (-19 %)
    • According to the F1 score from Prodigy or spaCy:

      • 5 entities have an increase, especially organism (+31 %)
      • 1 entity has a negligible decrease: pathway (-0.58 %)
      • 3 entities have a concerning decrease: cell_compartment (-8.7 %), protein (-7.0 %), and drug (-2.5 %)

    There are probably issues with 3 of the 5 entities mentioned above. Indeed:

    • pathway has a catastrophic decrease according to eval.py (-19 %) but not according to spaCy (-0.58 %)
    • cell_compartment has a huge increase according to eval.py (+12 %) but has the opposite according to spaCy (-8.7 %)
    • drug has an increase according to eval.py (+5.8 %) but has a decrease according to spaCy (-2.5 %)

    The changes seem to be caused by data sampling and domain adaptation issues. See #321, 3rd and 4th points.

    Here are the full changes according to the F1 score from pipeline/ner/eval.py:

    | entity           | before |  now | delta |     % |
    | ---------------- | -----: | ---: | ----: | ----: |
    | cell_compartment |   0.65 | 0.72 |  0.08 |    12 |
    | cell_type        |   0.64 | 0.64 |  0.00 |   0.0 |
    | chemical         |   0.54 | 0.55 |  0.00 |   0.4 |
    | disease          |   0.69 | 0.70 |  0.01 |   1.3 |
    | drug             |   0.60 | 0.64 |  0.03 |   5.8 |
    | organ            |   0.53 | 0.55 |  0.02 |   3.7 |
    | organism         |   0.54 | 0.58 |  0.04 |   7.2 |
    | pathway          |   0.58 | 0.47 | -0.11 |   -19 |
    | protein          |   0.53 | 0.53 | -0.00 | -0.73 |

    Here are the full changes according to the F1 score from Prodigy or spaCy:

    | entity           | before |  now | delta |     % |
    | ---------------- | -----: | ---: | ----: | ----: |
    | cell_compartment |   0.88 | 0.81 | -0.08 |  -8.7 |
    | cell_type        |   0.77 | 0.78 |  0.01 |   1.7 |
    | chemical         |   0.57 | 0.63 |  0.06 |  10.5 |
    | disease          |   0.88 | 0.92 |  0.04 |   4.6 |
    | drug             |   0.77 | 0.75 | -0.02 |  -2.5 |
    | organ            |   0.81 | 0.85 |  0.04 |   5.0 |
    | organism         |   0.68 | 0.89 |  0.21 |    31 |
    | pathway          |   0.84 | 0.84 | -0.00 | -0.58 |
    | protein          |   0.80 | 0.74 | -0.06 |  -7.0 |
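
The delta and % columns appear to be the absolute and relative change of the F1 score. The sketch below assumes % = 100 × delta / before, which the PR does not state explicitly. Since the tables show rounded scores, recomputing from them can differ slightly from the published percentages. The function name is hypothetical.

```python
def relative_change(before, now):
    """Return (delta, percent change) between two scores."""
    delta = now - before
    return delta, 100.0 * delta / before


# Rounded F1 scores taken from the Prodigy/spaCy table above.
for entity, before, now in [("organism", 0.68, 0.89), ("chemical", 0.57, 0.63)]:
    delta, percent = relative_change(before, now)
    print(f"{entity}: delta={delta:+.2f}, change={percent:+.1f} %")
```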

    Note on reproducibility

    The order of lines in model-best/vocab/strings.json changes between dvc repro -f calls. This leads to changes in pipeline/ner/dvc.lock where the NER models are declared.

    Otherwise, the outputs of the other parts of the pipeline (convert_annotations_*, add_er_*, eval_*) do not change. This therefore implies that the performances of the trained models do not change.

    A patch has been applied in d8b9f7407641f977a0091d61788a96c97b48fa4a to order strings.json deterministically. See also #327 for removing the patch once spaCy has fixed this upstream.
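The idea of ordering such a file deterministically can be sketched as follows. This is not the patch from the commit above: it is a minimal illustration that assumes strings.json holds a flat JSON list of strings, and the helper name `sort_strings_json` is hypothetical.

```python
import json
from pathlib import Path


def sort_strings_json(path):
    """Rewrite a JSON file containing a list of strings in sorted order.

    Assumption: the file holds a flat JSON list of strings.
    """
    path = Path(path)
    strings = json.loads(path.read_text(encoding="utf-8"))
    strings.sort()
    # Writing with fixed formatting keeps the file byte-identical between runs.
    serialized = json.dumps(strings, indent=2, ensure_ascii=False) + "\n"
    path.write_text(serialized, encoding="utf-8")
    return strings
```

Because the output depends only on the set of strings and not on their original order, two runs that produce the same vocabulary give the same file, so a checksum-based tool has nothing to report as changed.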

    How to test?

    1. The following should execute without errors:
    git clone https://github.com/BlueBrain/Search
    cd Search
    git checkout bbs_293
    
    docker build <options> -f data_and_models/pipelines/ner/Dockerfile -t <image> .
    docker run -it --rm <options> --name <container> <image>
    
    dvc pull data_and_models/annotations/ner/*.dvc
    
    cd data_and_models/pipelines/ner/
    # This takes around 30 mins.
    dvc repro -f
    
    2. The following, executed after the block above, should return nothing:
    git diff
    

    Checklist

    • [ ] All checkable items from https://github.com/BlueBrain/Search/issues/274#issuecomment-801108330 are checked.
    • [x] This PR refers to an issue from the issue tracker.
    • [x] Documentation and whatsnew.rst updated.
    • [x] setup.py and requirements.txt updated with new dependencies.
    • [x] All CI tests pass.

    Tests failing

    1. add_pipe API change:
    ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <spacy.pipeline.ner.EntityRecognizer object at 0x151ece910>
     - If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.
     - If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.
     - If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.
    

    Change of syntax with an entity_ruler:

    spaCy<3:

    er = spacy.pipeline.EntityRuler(model, patterns=self.to_list())
    model.add_pipe(er, **add_pipe_kwargs)
    

    spaCy>=3:

    er = model.add_pipe("entity_ruler",
                        config={'validate': True, **add_pipe_kwargs})
    er.add_patterns(self.to_list())
    
    2. deepcopy of spaCy model:
    TypeError: self.c cannot be converted to a Python object for pickling
    

    from tests/test_mining/test_attribute.py line 1225:

    extractor.ee_model = deepcopy(extractor.ee_model)
    
    new feature optimization 🦉 dvc dependencies 🔤 named-entity-recognition 
    opened by EmilieDel 12
  • [BBS-269] Parse chemprot dataset

    Description

    This PR introduces a script to parse the ChemProt dataset into TSV files compatible with the training of the BioBERT model (see GitHub repo).

    Part of the ticket BBS-269.

    opened by EmilieDel 12
  • add remote relation extraction

    Fixes #{issue-id-number}.

    Description

    Please provide here a summary of the changes introduced by this PR.

    How to test?

    Please provide here instructions on how to test the changes introduced by this PR. (if some changes cannot be tested by automated tests)

    Checklist

    • [ ] This PR refers to an issue from the issue tracker. (if it is not the case, please create an issue first).
    • [ ] Unit tests added. (if needed)
    • [ ] Documentation and whatsnew.rst updated. (if needed)
    • [ ] setup.py and requirements.txt updated with new dependencies. (if needed)
    • [ ] Type annotations added. (if a function is added or modified)
    • [ ] All CI tests pass.
    opened by drsantos89 0
  • add ner k8s

    Description

    Adds a function to perform and store the output of NER. NER is run remotely using the deployment on Kubernetes. It supports both ML and RULE-based approaches. The NER output and model version are stored in ES.

    Notes

    • One function, handle_conflits, is currently found in two repositories (this current PR and one repo on GitLab). It might be interesting to keep it only in BlueSearch and import it into the other repository.
    • The remote models are relatively slow when running 1 sample at a time (~1-2 paragraphs/s for ML and ~15 for RULE). Testing the remote model speed with locust revealed a maximum achievable performance of ~15 and ~50 paragraphs/s for the ML and RULE models, respectively. A multiprocessing option was hence added.
    • pool.apply_async copies the arguments to a new memory location. The client object is not serializable, so it needs to be created inside the function if required.
    • By default, the function only updates the paragraphs that are empty or have an outdated model version. There is an option to force the update of every paragraph.
    • The JSON output of both models is saved in an ES field of type flattened. ("This data type can be useful for indexing objects with a large or unknown number of unique keys. Only one field mapping is created for the whole JSON object, which can help prevent a mappings explosion (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html#mapping-limit-settings) from having too many distinct field mappings.")
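The note on pool.apply_async can be illustrated with a small sketch. Arguments sent to a worker process are pickled, so an object holding non-picklable state cannot be passed; a common pattern is to pass plain values and build the client inside the worker function. `FakeClient`, `is_picklable` and `run_ner` are hypothetical names used only for this illustration; they do not come from the repository, and the lock merely mimics whatever makes the real client non-serializable.

```python
import pickle
import threading


class FakeClient:
    """Hypothetical stand-in for a client that holds non-picklable state."""

    def __init__(self, url):
        self.url = url
        self._lock = threading.Lock()  # lock objects cannot be pickled

    def predict(self, text):
        # Placeholder for a remote call; returns a trivial result.
        return {"url": self.url, "n_chars": len(text)}


def is_picklable(obj):
    """Return True if obj could be sent to a worker process."""
    try:
        pickle.dumps(obj)
    except (TypeError, pickle.PicklingError):
        return False
    return True


def run_ner(url, text):
    """Worker function: receives plain values and creates its own client."""
    client = FakeClient(url)
    return client.predict(text)


print(is_picklable(FakeClient("http://localhost")))  # the client itself cannot be sent
print(is_picklable(("http://localhost", "Some paragraph.")))  # plain arguments can
```

With a `multiprocessing.Pool`, the worker would then be submitted as `pool.apply_async(run_ner, ("http://localhost", "Some paragraph."))`, so that only the URL and the text cross the process boundary.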

    How to test?

    tests/unit/k8s/test_ner.py

    Checklist

    • [x] Unit tests added. (if needed)
    • [x] Type annotations added. (if a function is added or modified)
    • [x] All CI tests pass.
    opened by drsantos89 0
  • Create JSONL configuration file for topic filtering

    Context

    • We have set up a pipeline stage that is able to determine the relevance of an article w.r.t. some user-given topics configuration.
    • Currently, the only config file we have behaves as a wildcard *.
    • We should instead have specific topic inclusion criteria for each archive type (arXiv, PMC, ...).

    Actions

    • [ ] Compile a JSON configuration file for topic filtering.
    🌪️ db-filter 
    opened by FrancescoCasalegno 0
  • feature/add-abstract-to-paragraphs-table

    Fixes #637.

    Description

    Adds the abstract to the paragraphs table so that it can be found using semantic search.

    How to test?

    The abstract field was added to tests/unit/entrypoint/database/test_add_es.py.

    Checklist

    • [x] This PR refers to an issue from the issue tracker. (if it is not the case, please create an issue first).
    • [x] Unit tests added. (if needed)
    • [ ] Documentation and whatsnew.rst updated. (if needed)
    • [x] setup.py and requirements.txt updated with new dependencies. (if needed)
    • [x] Type annotations added. (if a function is added or modified)
    • [x] All CI tests pass.
    opened by drsantos89 0
  • Unify abstract and section paragraphs

    We should be able to assign an embedding to the abstract. However, currently the abstract is a separate field/attribute of the Article class.

    Todos

    • [x] Decide on the best design
    • [x] Implement it

    Some reference

    • https://github.com/BlueBrain/Search/pull/593
    🗄️ database 
    opened by jankrepl 1
  • Feature/embedings k8s

    Fixes #623

    Description

    Adds a function that uses a local model to compute embeddings for the paragraphs that do not have one yet.

    How to test?

    Check that embeddings are present in the database. See test/unit/k8s/test_add_embeddings.py.

    Checklist

    • [x] This PR refers to an issue from the issue tracker. (if it is not the case, please create an issue first).
    • [x] Unit tests added. (if needed)
    • [x] setup.py and requirements.txt updated with new dependencies. (if needed)
    • [x] Type annotations added. (if a function is added or modified)
    • [x] All CI tests pass.
    opened by drsantos89 0
Releases(v0.0.10)
Owner
The Blue Brain Project
Open Source Software produced and used by the Blue Brain Project