Using pretrained language models for biomedical knowledge graph completion.

Overview

LMs for biomedical KG completion

This repository contains code to run the experiments described in:

Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study (arXiv link)
Rahul Nadkarni, David Wadden, Iz Beltagy, Noah A. Smith, Hannaneh Hajishirzi, Tom Hope

Data

The edge splits we used for our experiments can be downloaded using the following links:

Link                           File size
RepoDB, transductive split     11 MB
RepoDB, inductive split        11 MB
Hetionet, transductive split   49 MB
Hetionet, inductive split      49 MB
MSI, transductive split        813 MB
MSI, inductive split           813 MB

Each of these files should be placed in the subgraph directory before running any of the experiment scripts. Please see the README.md file in the subgraph directory for more information on the edge split files. If you would like to recreate the edge splits yourself or construct new ones, use the scripts named script/create_*_dataset.py.
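
Before launching an experiment, it can help to confirm that the downloaded split files are where the scripts expect them. The following is a minimal sketch and not part of this repository: the helper name find_edge_splits is hypothetical, and the .pt suffix is an assumption based on the split file name quoted in the Comments section.

```python
from pathlib import Path


def find_edge_splits(subgraph_dir="subgraph", suffix=".pt"):
    """Return the edge split files in subgraph_dir, sorted by name.

    Assumption: split files end in `.pt`; adjust `suffix` if yours differ.
    """
    root = Path(subgraph_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"directory not found: {root}")
    return sorted(p for p in root.iterdir() if p.is_file() and p.suffix == suffix)


if __name__ == "__main__":
    try:
        for path in find_edge_splits():
            print(path.name)
    except FileNotFoundError as err:
        print(err)
```

Run from the repository root, this lists whatever split files are present, or reports that the subgraph directory is missing.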

Environment

The environment.yml file lists all of the packages needed to use this code. We recommend using Anaconda/Miniconda to set up an environment, which you can do with the command:

conda env create -f environment.yml

Entity names and descriptions

The files that contain entity names and descriptions for all of the datasets can be found in the data/processed directory. If you would like to recreate these files yourself, use the per-dataset scripts in data/script.

Pre-tokenization

The main training script for the LMs, src/lm/run.py, can take pre-tokenized entity names and descriptions as input, and several of the training scripts in script/training are set up to do so. If you would like to pre-tokenize text before fine-tuning, follow the instructions in script/pretokenize.py. Alternatively, you can pass one of the .tsv files found in data/processed as the --info_filename argument instead of a file with pre-tokenized text.
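
The idea behind pre-tokenization is to tokenize each entity's text once and reuse the result across runs. A minimal sketch of that idea follows; it does not reproduce the logic of script/pretokenize.py. The function name pretokenize_tsv, the JSON cache format, and the column layout (entity identifier in the first column, text in the following columns, no header row) are all assumptions, and the tokenize argument stands in for whatever tokenizer the LM uses.

```python
import csv
import json


def pretokenize_tsv(tsv_path, out_path, tokenize, id_col=0, text_cols=(1, 2)):
    """Tokenize the text columns of a TSV once and cache the result as JSON.

    Assumptions (hypothetical layout): no header row, entity id in `id_col`,
    entity name and description in `text_cols`.
    """
    cache = {}
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:
                continue  # skip blank lines
            texts = [row[i] for i in text_cols if i < len(row)]
            cache[row[id_col]] = [tokenize(text) for text in texts]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(cache, f)
    return cache
```

Any callable that maps a string to a JSON-serializable token list can be passed as tokenize; str.split is enough to try the function out.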

Training

All of the scripts for training models can be found in the src directory. The script for training all KGE models is src/kge/run.py, while the script for training LMs is src/lm/run.py. Our code for training KGE models is heavily based on this code from the Open Graph Benchmark GitHub repository. Examples of how to use each of these scripts, including training with Slurm, can be found in the script/training directory. This directory includes all of the scripts we used to run the experiments for the results in the paper.
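
As a rough illustration of a KGE training invocation, the sketch below assembles a command line for src/kge/run.py. It is not taken from script/training: the helper name kge_command is hypothetical, and the flags are simply the ones that appear in a command quoted in the Comments section, so they have not been checked against the script's argument parser.

```python
import shlex


def kge_command(dataset, model, subgraph, save_path="./results",
                cuda=True, script="src/kge/run.py"):
    """Build an argument list for the KGE training script.

    Assumption: flag names are copied from a user-reported command,
    not from the script's own documentation.
    """
    cmd = ["python", script]
    if cuda:
        cmd.append("--cuda")
    cmd += [
        "--do_train", "--do_valid", "--do_test",
        "--dataset", dataset,
        "--model", model,
        "--save_path", save_path,
        "--subgraph", str(subgraph),
    ]
    return cmd


if __name__ == "__main__":
    # Placeholder split path; substitute the file you downloaded.
    print(shlex.join(kge_command("repodb", "DistMult", "subgraph/your-split-file.pt")))
```

The returned list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting problems with paths that contain spaces.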

Comments
  • CUDA error: out of memory when training kge model

    Hi, I'm running this experiment:

    python run.py --cuda --do_train --do_valid --do_test --dataset repodb --model DistMult --save_path ./results --subgraph /absolute-path-to/repodb-edge-split-f0.2-neg500-ind-s42.pt

    However, I immediately get "CUDA error: out of memory" at:

    File "run.py", line 276, in main
        kge_model = kge_model.cuda()

    Can I ask what your experimental setup was in terms of GPU and memory?

    opened by giuliacassara 2
  • Relation types for RepoDB?

    Hi, thanks for your contribution with this work! Quick question, does RepoDB have multiple relation types? I couldn't find anything about relations for RepoDB in the data/processed/ directory, and the splits you provide suggest that there's only one type of relation between entities for this dataset.

    Tara

    opened by tsafavi 2
  • Suggested small fix when loading msi model

    Hi,

    I was following your readme, but when I load the model for msi I get the following error:

    Traceback (most recent call last):
      File "test.py", line 17, in <module>
        model.load_state_dict(state_dict)
      File "/home/gcassara/miniconda3/envs/lm-bio-kgc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1044, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for KGBERT:
        size mismatch for relation_head.weight: copying a param with shape torch.Size([6, 768]) from checkpoint, the shape in current model is torch.Size([5, 768]).
        size mismatch for relation_head.bias: copying a param with shape torch.Size([6]) from checkpoint, the shape in current model is torch.Size([5]).

    I think the correct n__relations is

    nrelations = {'repodb' : 1, 'hetionet' : 4, 'msi' : 6}

    opened by giuliacassara 1
  • Preprocess datasets and create triplets with (head, relation, tail)

    Hi Rahul, many thanks for your quick responses! I would like to recreate the processed files myself using the scripts in data/script. I also want to create, for msi and hetionet, a triplet file with an explicit reference to the relationship (I know that you don't support this feature; it's on my own initiative). When I launch get_description_msi.py, the script expects the arguments msi_file, go_file, and entrez_file, which I don't have. The same goes for preprocess_msi.py, which expects a directory containing the msi files. Looking in depth at your code:

    files = {('drug', 'protein') : '1_drug_to_protein.tsv',
             ('disease', 'protein') : '2_indication_to_protein.tsv',
             ('protein', 'protein') : '3_protein_to_protein.tsv',
             ('protein', 'function') : '4_protein_to_biological_function.tsv',
             ('function', 'function') : '5_biological_function_to_biological_function.tsv',
             ('drug', 'disease') : '6_drug_indication_df.tsv'}

    I saw that these files are what I really need to build my triplets files, although I don't know how you created them. Can you please send me these files or tell me how I can reproduce them?

    opened by giuliacassara 1
Owner
Rahul Nadkarni
Computer Science Ph.D. student