UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus

Last update: Oct 25, 2022

Related tags

Deep Learning UmlsBERT

Overview

UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus

General info

This is the code that was used of the paper : UmlsBERT: Augmenting Contextual Embeddings with a Clinical Metathesaurus (NAACL 2021).

In this work, we introduced UmlsBERT, a contextual embedding model capable of integrating domain knowledge during pre-training. It was trained on biomedical corpora and uses the Unified Medical Language System (UMLS) clinical metathesaurus in two ways:

We proposed a new multi-label loss function for the pre-training of the Masked Language Modelling (Masked LM) task of UmlsBERT that considers the connections between medical words using the CUI attribute of UMLS.

We introduced a semantic group embedding that enriches the input embeddings process of UmlsBERT by forcing the model to take into consideration the association of the words that are part of the same semantic group.

Technologies

This project was created with python 3.7 and PyTorch 0.4.1 and it is based on the transformer github repo of the huggingface team

Setup

We recommend installing and running the code from within a virtual environment.

Creating a Conda Virtual Environment

First, download Anaconda from this link

Second, create a conda environment with python 3.7.

$ conda create -n umlsbert python=3.7

Upon restarting your terminal session, you can activate the conda environment:

$ conda activate umlsbert

Install the required python packages

In the project root directory, run the following to install the required packages.

pip3 install -r requirements.txt

Install from a VM

If you start a VM, please run the following command sequentially before install the required python packages. The following code example is for a vast.ai Virtual Machine.

apt-get update
apt install git-all
apt install python3-pip
apt-get install jupyter

Dowload pre-trained UmlsBERT model

In order to use pre-trained UmlsBERT model for the word embeddings (or the semantic embeddings), you need to dowload it into the folder examples/checkpoint/ from the link:

 wget -O umlsbert.tar.xz https://www.dropbox.com/s/kziiuyhv9ile00s/umlsbert.tar.xz?dl=0

into the folder examples/checkpoint/ and unzip it with the following command:

tar -xvf umlsbert.tar.xz

Reproduce UmlsBERT

Pretraining

The UmlsBERT was pretrained on the MIMIC data. Unfortunately, we cannot provide the text of the MIMIC III dataset as training course is mandatory in order to access the particular dataset.
The MIMIC III dataset can be downloaded from the following link
The pretraining an UmlsBERT model depends on data from NLTK so you'll have to download them. Run the Python interpreter (python3) and type the commands:

>>> import nltk
>>> nltk.download('punkt')

After downloading the NOTEEVENTS table in the examples/language-modeling/ folder, run the following python code that we provide in the examples/language-modeling/ folder to create the mimic_string.txt on the folder examples/language-modeling/:

python3 mimic.py

you can pre-trained a UmlsBERT model by running the following command on the examples/language-modeling/:

Example for pretraining Bio_clinicalBert:

python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1  --model_name_or_path  emilyalsentzer/Bio_ClinicalBERT  --mlm     --do_train     --learning_rate 5e-5     --max_steps 150000   --block_size 128   --save_steps 1000     --per_gpu_train_batch_size 32     --seed 42     --line_by_line      --train_data_file mimic_string.txt  --umls --config_name  config.json --med_document ./voc/vocab_updated.txt

Downstream Tasks

MedNLi task

MedNLI is available through the MIMIC-III derived data repository. Any individual certified to access MIMIC-III can access MedNLI through the following link
- Converting into an appropriate format: After downloading and unzipping the MedNLi dataset (mednli-a-natural-language-inference-dataset-for-the-clinical-domain-1.0.0.zip) on the folder examples/text-classification/dataset/mednli/, run the following python code in the examples/text-classification/dataset/mednli/ folder that we provide in order to convert the dataset into a format that is appropriate for the UmlsBERT model

python3  mednli.py

This python code will create the files: train.tsv,dev_matched.tsv and test_matched.tsv in the text-classification/dataset/mednli/mednli folder
We provide an example-notebook under the folder experiements/:
- experiements/MedNLI_task.ipynb

or directly run UmlsBert on the text-classification/ folder:

python3 run_glue.py --output_dir ./models/medicalBert-v1 --model_name_or_path  ../checkpoint/umlsbert   --data_dir  dataset/mednli/mednli  --num_train_epochs 3 --per_device_train_batch_size 32  --learning_rate 1e-4   --do_train --do_eval  --do_predict  --task_name mnli --umls --med_document ./voc/vocab_updated.txt

NER task

Due to the copyright issue of i2b2 datasets, in order to download them follow the link.
- Converting into an appropriate format: Since we wanted to directly compare with the Bio_clinical_Bert we used their code in order to convert the i2b2 dataset to a format which is appropriate for the BERT architecture which can be found in the following link: link
We provide the code for converting the i2b2 dataset with the following instruction for each dataset:
i2b2 2006:
- In the folder token-classification/dataset/i2b2_preprocessing/i2b2_2006_deid unzip the deid_surrogate_test_all_groundtruth_version2.zip and deid_surrogate_train_all_version2.zip
- run the create.sh scrip with the command ./create.sh
- The script will create the files: label.txt, dev.txt, test.txt, train.txt in the token-classification/dataset/NER/2006 folder
i2b2 2010:
- In the folder token-classification/dataset/i2b2_preprocessing/i2b2_2010_relations unzip the test_data.tar.gz, concept_assertion_relation_training_data.tar.gz and reference_standard_for_test_data.tar.gz
- Run the jupyter notebook Reformat.ipynb
- The notebook will create the files: label.txt, dev.txt, test.txt, train.txt in the token-classification/dataset/NER/2010 folder
i2b2 2012:
- In the folder token-classification/dataset/i2b2_preprocessing/i2b2_2012 unzip the 2012-07-15.original-annotation.release.tar.gz and 2012-08-08.test-data.event-timex-groundtruth.tar.gz
- Run the jupyter notebook Reformat.ipynb
- The notebook will create the files: label.txt, dev.txt, test.txt, train.txt in the token-classification/dataset/NER/2012 folder
i2b2 2014:
- In the folder token-classification/dataset/i2b2_preprocessing/i2b2_2014_deid_hf_risk unzip the 2014_training-PHI-Gold-Set1.tar.gz,training-PHI-Gold-Set2.tar.gz and testing-PHI-Gold-fixed.tar.gz
- Run the jupyter notebook Reformat.ipynb
- The notebook will create the files: label.txt, dev.txt, test.txt, train.txt in the token-classification/dataset/NER/2014 folder
We provide an example-notebook under the folder experiements/:
- experiements/NER_task.ipynb

or directly run UmlsBert on the token-classification/ folder:

python3 run_ner.py --output_dir ./models/medicalBert-v1 --model_name_or_path  ../checkpoint/umlsbert    --labels dataset/NER/2006/label.txt --data_dir  dataset/NER/2006 --do_train --num_train_epochs 20 --per_device_train_batch_size 32  --learning_rate 1e-4  --do_predict --do_eval --umls --med_document ./voc/vocab_updated.txt

If you find our work useful, can cite our paper using:

@misc{michalopoulos2020umlsbert,
      title={UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus}, 
      author={George Michalopoulos and Yuanxin Wang and Hussam Kaka and Helen Chen and Alex Wong},
      year={2020},
      eprint={2010.10391},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Comments

run_language_modeling.py
Hello, I am facing issues running the run_language_modeling.py script when running the example for pretraining Bio_clinicalbert using this line: python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1 --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --mlm --do_train --learning_rate 5e-5 --max_steps 150000 --block_size 128 --save_steps 1000 --per_gpu_train_batch_size 32 --seed 42 --line_by_line --train_data_file mimic_string.txt --umls --config_name config.json --med_document ./voc/vocab_updated.txt

issue 1 - it said the tokenizer did not have an argument called max_len. this was the error: 'AttributeError: 'BertTokenizerFast' object has no attribute 'max_len' ' Based on advice online, I updated It from 'tokenizer.max_len' to 'tokenizer.model_max_length' which seems to have resolved this issue

issue 2 - the current error message i am getting is: 'TypeError: init() got an unexpected keyword argument 'tui_ids''

When looking for answers to these online, I came across a comment on the huggingface transformers issue forum at https://github.com/huggingface/transformers/issues/8739 They said - 'It is actually due to #8604, where we removed several deprecated arguments. The run_language_modeling.py script is deprecated in favor of language-modeling/run_{clm, plm, mlm}.py.'

Does this apply to the scrpit for UmlsBERT as well? If so, how can I access the updated script? If not, how can I resolve the tui_ids issue?

I am running the scripts on google colab. This is the complete output I get when i run: python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1 --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --mlm --do_train --learning_rate 5e-5 --max_steps 150000 --block_size 128 --save_steps 1000 --per_gpu_train_batch_size 32 --seed 42 --line_by_line --train_data_file mimic_string.txt --umls --config_name config.json --med_document ./voc/vocab_updated.txt

output:

2021-07-05 09:47:55.207129: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0 07/05/2021 09:47:57 - WARNING - main - Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False 07/05/2021 09:47:57 - INFO - main - Training/evaluation parameters TrainingArguments( _n_gpu=0, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_find_unused_parameters=None, debug=[], deepspeed=None, disable_tqdm=False, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_steps=500, evaluation_strategy=IntervalStrategy.NO, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, gradient_accumulation_steps=1, greater_is_better=None, group_by_length=False, ignore_data_skip=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=-1, log_level=-1, log_level_replica=-1, log_on_each_node=True, logging_dir=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1/runs/Jul05_09-47-57_d7624bb0fdc5, logging_first_step=False, logging_steps=500, logging_strategy=IntervalStrategy.STEPS, lr_scheduler_type=SchedulerType.LINEAR, max_grad_norm=1.0, max_steps=150000, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=3.0, output_dir=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=8, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=clinicalBert-v1, push_to_hub_organization=None, push_to_hub_token=None, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1, save_on_each_node=False, save_steps=1000, save_strategy=IntervalStrategy.STEPS, save_total_limit=None, seed=42, sharded_ddp=[], skip_memory_metrics=True, tpu_metrics_debug=False, tpu_num_cores=None, use_legacy_prediction_loop=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, ) /usr/local/lib/python3.7/dist-packages/transformers/models/auto/modeling_auto.py:847: FutureWarning: The class AutoModelWithLMHead is deprecated and will be removed in a future version. Please use AutoModelForCausalLM for causal language models, AutoModelForMaskedLM for masked language models and AutoModelForSeq2SeqLM for encoder-decoder models. FutureWarning, Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']

This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).

This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Traceback (most recent call last): File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 355, in main() File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 248, in main tui_ids=tui_ids) if training_args.do_train else None File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 136, in get_dataset tui_ids=tui_ids) TypeError: init() got an unexpected keyword argument 'tui_ids'

P.S. I am not an expert programmer so do let me know if I should provide any further information as this is the first time i'm submitting an issue.

Thank you.

Best, Jaya
opened by jayachaturvedi 1
Confusing checkpoint name between UmlsBERT and CODER.

Hello, my work: CODER: Knowledge infused cross-lingual medical term embedding for term normalization (https://github.com/GanjinZero/CODER) (which is a contemporary work with your UmlSBERT) used confusing checkpoint name GanjinZero/UMLSBert_ENG in huggingface. Would you mind adding a disambiguation line in your README.md for people using my checkpoint. Thank you!

opened by GanjinZero 0
error with tui_ids

Hello, I have been trying to run the run_language_modeling.py script (even though it is deprecated, I found a workaround through some suggestions on the transformers git page). I get the following error:

Traceback (most recent call last): File "./language-modeling/run_language_modeling.py", line 361, in main() File "./language-modeling/run_language_modeling.py", line 254, in main tui_ids=tui_ids) if training_args.do_train else None File "./language-modeling/run_language_modeling.py", line 142, in get_dataset tui_ids=tui_ids) TypeError: init() got an unexpected keyword argument 'tui_ids'

I am at a loss at how to fix this or what might be causing it? Has anyone else faced this and been able to fix it?

Thank you.

Jaya

opened by jayachaturvedi 0
deprecated script

Hello, thank you for developing this model. I am very keen on using it in my project. However, I have noticed that the run_language_modeling.py script does not run anymore, and the original script from PyTorch has been deprecated. It looks like they have replaced the old script with some new ones, but of course, they are missing the UMLS component which is key to running UmlsBERT. I wondered if you have created or plan to create a new version of this script. Thank you!

opened by jayachaturvedi 0
Is it possible to have access to the NER part directly?

Hello and thank you for your great work.

I would like to test UmlsBERT on a self supervised model to learn sentence embeddings. I'm looking to simply use your checkpoint to do it. But is it possible for me to have access to the NER part? which given a tokenized sentence, classify each token in the 42 UMLS categories to then use the umls embeddings of BERT that you propose. I saw in the paper that you mention Ctakes to do it but I don't see where it appears in the repository. The best for me would be to simply download your checkpoint (which I did) and then incorporate a part of your code to my other repo to test it. I looked at the NER jupyter file but I don't see the inference part of tokenized sentence with Ctakes.

I'll be so grateful for any advice on this.

Update : I have seen that vocab_updated.txt is in the repo and corresponds to the umls tags. So if I understand right, the sentences don't go trough a custom NER network (like LSTM or other) to get the classes but if a sentence contains tokens classified as tui they will have a special embeddings right?

Alexandre

opened by alexjout 1
How to use pretrained UmlsBERT to get embeddings for all the UMLS terms?

I want to load the pretrained UmlsBERT model to generate vector representations/embeddings of all the medical terms in UMLS. Which library to use to load this model since it is a modified version of BERT? Also, which layer or combination of layers provides the best representation of the vectors?

opened by AneryPatel 3

Owner

GitHub

An integration of several popular automatic augmentation methods, including OHL (Online Hyper-Parameter Learning for Auto-Augmentation Strategy) and AWS (Improving Auto Augment via Augmentation Wise Weight Sharing) by Sensetime Research.

An integration of several popular automatic augmentation methods, including OHL (Online Hyper-Parameter Learning for Auto-Augmentation Strategy) and AWS (Improving Auto Augment via Augmentation Wise Weight Sharing) by Sensetime Research.

45 Dec 8, 2022

The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

Cutoff: A Simple Data Augmentation Approach for Natural Language This repository contains source code necessary to reproduce the results presented in

49 Dec 22, 2022

Get a Grip! - A robotic system for remote clinical environments.

Get a Grip! Within clinical environments, sterilization is an essential procedure for disinfecting surgical and medical instruments. For our engineeri

1 Jan 5, 2022

Repo for the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP"

DiLBERT Repo for the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP" Pretrained Model The pretrained model presented in the paper is

2 Dec 15, 2022

Code for CVPR2021 "Visualizing Adapted Knowledge in Domain Transfer". Visualization for domain adaptation. #explainable-ai

Visualizing Adapted Knowledge in Domain Transfer @inproceedings{hou2021visualizing, title={Visualizing Adapted Knowledge in Domain Transfer}, auth

80 Dec 25, 2022

Variational Attention: Propagating Domain-Specific Knowledge for Multi-Domain Learning in Crowd Counting (ICCV, 2021)

DKPNet ICCV 2021 Variational Attention: Propagating Domain-Specific Knowledge for Multi-Domain Learning in Crowd Counting Baseline of DKPNet is availa

19 Oct 14, 2022

Clairvoyance: a Unified, End-to-End AutoML Pipeline for Medical Time Series

Clairvoyance: A Pipeline Toolkit for Medical Time Series Authors: van der Schaar Lab This repository contains implementations of Clairvoyance: A Pipel

$van_der_Schaar \LAB$ 89 Dec 7, 2022

Pytorch Code for "Medical Transformer: Gated Axial-Attention for Medical Image Segmentation"

Medical-Transformer Pytorch Code for the paper "Medical Transformer: Gated Axial-Attention for Medical Image Segmentation" About this repo: This repo

615 Dec 25, 2022

The Medical Detection Toolkit contains 2D + 3D implementations of prevalent object detectors such as Mask R-CNN, Retina Net, Retina U-Net, as well as a training and inference framework focused on dealing with medical images.

The Medical Detection Toolkit contains 2D + 3D implementations of prevalent object detectors such as Mask R-CNN, Retina Net, Retina U-Net, as well as a training and inference framework focused on dealing with medical images.

1.2k Jan 4, 2023

Image transformations designed for Scene Text Recognition (STR) data augmentation. Published at ICCV 2021 Workshop on Interactive Labeling and Data Augmentation for Vision.

Data Augmentation for Scene Text Recognition (ICCV 2021 Workshop) (Pronounced as "strog") Paper Arxiv Why it matters? Scene Text Recognition (STR) req

152 Dec 28, 2022

TorchIO is a Medical image preprocessing and augmentation toolkit for deep learning. Part of the PyTorch Ecosystem.

Medical image preprocessing and augmentation toolkit for deep learning. Part of the PyTorch Ecosystem.

1.6k Jan 6, 2023

Code and data for ImageCoDe, a contextual vison-and-language benchmark

ImageCoDe This repository contains code and data for ImageCoDe: Image Retrieval from Contextual Descriptions. Data All collected descriptions for the

27 Dec 2, 2022

Large-scale open domain KNOwledge grounded conVERsation system based on PaddlePaddle

Knover Knover is a toolkit for knowledge grounded dialogue generation based on PaddlePaddle. Knover allows researchers and developers to carry out eff

607 Dec 31, 2022

Chinese clinical named entity recognition using pre-trained BERT model

Chinese clinical named entity recognition (CNER) using pre-trained BERT model Introduction Code for paper Chinese clinical named entity recognition wi

109 Dec 14, 2022

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

1.3k Dec 31, 2022

UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus

Related tags

Overview

UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus

General info

Technologies

Setup

Creating a Conda Virtual Environment

Install the required python packages

Install from a VM

Dowload pre-trained UmlsBERT model

Reproduce UmlsBERT

Pretraining

Downstream Tasks

Comments

run_language_modeling.py

Confusing checkpoint name between UmlsBERT and CODER.

error with tui_ids

deprecated script

Is it possible to have access to the NER part directly?

How to use pretrained UmlsBERT to get embeddings for all the UMLS terms?

Owner

An integration of several popular automatic augmentation methods, including OHL (Online Hyper-Parameter Learning for Auto-Augmentation Strategy) and AWS (Improving Auto Augment via Augmentation Wise Weight Sharing) by Sensetime Research.

The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

Get a Grip! - A robotic system for remote clinical environments.

Repo for the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP"

Code for CVPR2021 "Visualizing Adapted Knowledge in Domain Transfer". Visualization for domain adaptation. #explainable-ai

Variational Attention: Propagating Domain-Specific Knowledge for Multi-Domain Learning in Crowd Counting (ICCV, 2021)

Clairvoyance: a Unified, End-to-End AutoML Pipeline for Medical Time Series

Pytorch Code for "Medical Transformer: Gated Axial-Attention for Medical Image Segmentation"

The Medical Detection Toolkit contains 2D + 3D implementations of prevalent object detectors such as Mask R-CNN, Retina Net, Retina U-Net, as well as a training and inference framework focused on dealing with medical images.

Image transformations designed for Scene Text Recognition (STR) data augmentation. Published at ICCV 2021 Workshop on Interactive Labeling and Data Augmentation for Vision.

TorchIO is a Medical image preprocessing and augmentation toolkit for deep learning. Part of the PyTorch Ecosystem.

Code and data for ImageCoDe, a contextual vison-and-language benchmark

Large-scale open domain KNOwledge grounded conVERsation system based on PaddlePaddle

Chinese clinical named entity recognition using pre-trained BERT model

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Code for paper PairRE: Knowledge Graph Embeddings via Paired Relation Vectors.

🤖 A Python library for learning and evaluating knowledge graph embeddings

Convolutional 2D Knowledge Graph Embeddings resources

ConE: Cone Embeddings for Multi-Hop Reasoning over Knowledge Graphs