Using pretrained language models for biomedical knowledge graph completion.

Rahul Nadkarni

Last update: Nov 30, 2022

Overview

LMs for biomedical KG completion

This repository contains code to run the experiments described in:

Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study (arXiv link)
Rahul Nadkarni, David Wadden, Iz Beltagy, Noah A. Smith, Hannaneh Hajishirzi, Tom Hope

Data

The edge splits we used for our experiments can be downloaded using the following links:

Link	File size
RepoDB, transductive split	11 MB
RepoDB, inductive split	11 MB
Hetionet, transductive split	49 MB
Hetionet, inductive split	49 MB
MSI, transductive split	813 MB
MSI, inductive split	813 MB

Each of these files should be placed in the subgraph directory before running any of the experiment scripts. Please see the README.md file in the subgraph directory for more information on the edge split files. If you would like to recreate the edge splits yourself or construct new edge splits, use the scripts named script/create_*_dataset.py.
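
As a quick sanity check before launching an experiment, the sketch below lists the edge split files found in the subgraph directory. It assumes the downloaded splits keep a .pt extension, as in the file name repodb-edge-split-f0.2-neg500-ind-s42.pt quoted in the comments; the function name find_edge_splits is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_edge_splits(subgraph_dir="subgraph", pattern="*.pt"):
    """Return the sorted names of edge split files in subgraph_dir.

    Assumption: split files end in .pt; adjust pattern if yours differ.
    """
    directory = Path(subgraph_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"{directory} is not a directory")
    # Keep regular files only, so stray subdirectories are ignored.
    return sorted(p.name for p in directory.glob(pattern) if p.is_file())
```

Calling find_edge_splits() from the repository root should return the split files you downloaded; an empty list means nothing matching the pattern has been placed there yet.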

Environment

The environment.yml file lists all of the packages needed to use this code. We recommend using Anaconda/Miniconda to set up an environment, which you can do with the command:

conda env create -f environment.yml

Entity names and descriptions

The files that contain entity names and descriptions for all of the datasets can be found in the data/processed directory. If you would like to recreate these files yourself, you will need to use the scripts for each dataset found in data/script.

Pre-tokenization

The main training script for the LMs, src/lm/run.py, can take pre-tokenized entity names and descriptions as input, and several of the training scripts in script/training are set up to do so. If you would like to pre-tokenize text before fine-tuning, follow the instructions in script/pretokenize.py. Alternatively, you can pass one of the .tsv files found in data/processed to the --info_filename argument instead of a file with pre-tokenized text.
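
Before passing one of the .tsv files to --info_filename, it can help to look at a few rows. The sketch below makes no assumption about the column layout of the files in data/processed (the scripts in data/script are the authority there); preview_tsv is a hypothetical helper, not part of the repository.

```python
import csv


def preview_tsv(path, n_rows=3):
    """Return the first n_rows rows of a tab-separated file as lists of strings."""
    rows = []
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters in descriptions as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(rows) >= n_rows:
                break
            rows.append(row)
    return rows
```

Printing the result for a file in data/processed shows how many columns each row has before you commit to a long fine-tuning run.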

Training

All of the scripts for training models can be found in the src directory. The script for training all KGE models is src/kge/run.py, while the script for training LMs is src/lm/run.py. Our code for training KGE models is heavily based on code from the Open Graph Benchmark GitHub repository. Examples of how to use each of these scripts, including training with Slurm, can be found in the script/training directory. This directory includes all of the scripts we used to run the experiments for the results in the paper.
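
For launching a KGE run from Python, a minimal sketch follows. It uses only the flags that appear in the command quoted in the comments (--cuda, --do_train, --do_valid, --do_test, --dataset, --model, --save_path, --subgraph); it may not be a complete set of options, and the scripts in script/training remain the reference. The helper name build_kge_command and its defaults are hypothetical.

```python
import shlex
import sys


def build_kge_command(dataset, model, subgraph, save_path="./results",
                      script="src/kge/run.py", use_cuda=True):
    """Assemble an argument list for the KGE training script.

    Only flags seen in the command quoted in the comments are used here.
    """
    command = [sys.executable, script]
    if use_cuda:
        command.append("--cuda")
    command += [
        "--do_train", "--do_valid", "--do_test",
        "--dataset", dataset,
        "--model", model,
        "--save_path", save_path,
        "--subgraph", subgraph,
    ]
    return command


def format_command(command):
    """Render the argument list as a shell-safe string for logging."""
    return shlex.join(command)
```

The returned list can be passed to subprocess.run(command, check=True); building a list rather than a single string avoids quoting problems when the split path contains spaces.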

Comments

CUDA error: out of memory when training kge model

Hi, I'm running this experiment:

python run.py --cuda --do_train --do_valid --do_test --dataset repodb --model DistMult --save_path ./results --subgraph /absolute-path-to/repodb-edge-split-f0.2-neg500-ind-s42.pt

However, I immediately get a CUDA error: out of memory:

File "run.py", line 276, in main
    kge_model = kge_model.cuda()

Can I ask what your experimental setup was in terms of GPU and memory?

opened by giuliacassara 2

Relation types for RepoDB?

Hi, thanks for your contribution with this work! Quick question, does RepoDB have multiple relation types? I couldn't find anything about relations for RepoDB in the data/processed/ directory, and the splits you provide suggest that there's only one type of relation between entities for this dataset.

Tara

opened by tsafavi 2

Suggested small fix when loading msi model

Hi,

I was following your readme, but when I load the model for msi I get the following error:

Traceback (most recent call last):
  File "test.py", line 17, in <module>
    model.load_state_dict(state_dict)
  File "/home/gcassara/miniconda3/envs/lm-bio-kgc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1044, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for KGBERT:
    size mismatch for relation_head.weight: copying a param with shape torch.Size([6, 768]) from checkpoint, the shape in current model is torch.Size([5, 768]).
    size mismatch for relation_head.bias: copying a param with shape torch.Size([6]) from checkpoint, the shape in current model is torch.Size([5]).

I think the correct nrelations is:

nrelations = {'repodb' : 1, 'hetionet' : 4, 'msi' : 6}
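
The size mismatch above suggests a quick check: the first dimension of relation_head.weight in the checkpoint appears to be the number of relation types the model was trained with. A minimal sketch, assuming the state dict maps parameter names to tensors (anything exposing .shape works); count_relations is a hypothetical helper, not part of the repository.

```python
def count_relations(state_dict, key="relation_head.weight"):
    """Read the number of relation types from a checkpoint's relation head.

    Assumption: the relation head weight has shape (n_relations, hidden_size),
    as the shapes in the traceback suggest.
    """
    if key not in state_dict:
        raise KeyError(f"{key} not found in state dict")
    return int(state_dict[key].shape[0])
```

With PyTorch, the state dict would typically come from torch.load(checkpoint_path, map_location="cpu"); for the msi checkpoint in the traceback this check would give 6.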

opened by giuliacassara 1

Preprocess datasets and create triplets with (head, relation, tail)

Hi Rahul, many thanks for your quick responses! I would like to recreate the processed files myself using the scripts in data/script. I also want to create, for msi and hetionet, a triplet file with explicit reference to the relationship (I know that you don't support this feature; it's my own initiative). When I launch get_description_msi.py, the script expects the arguments msi_file, go_file, and entrez_file, which I don't have. The same goes for preprocess_msi.py, which expects a directory containing the msi files. Looking more closely at your code:

files = {
    ('drug', 'protein'): '1_drug_to_protein.tsv',
    ('disease', 'protein'): '2_indication_to_protein.tsv',
    ('protein', 'protein'): '3_protein_to_protein.tsv',
    ('protein', 'function'): '4_protein_to_biological_function.tsv',
    ('function', 'function'): '5_biological_function_to_biological_function.tsv',
    ('drug', 'disease'): '6_drug_indication_df.tsv',
}

I saw that these files are what I really need to build my triplet files, although I don't know how you created them. Could you please send me these files or tell me how I can reproduce them?

opened by giuliacassara 1

Owner

Rahul Nadkarni

Computer Science Ph.D. student
