LMs for biomedical KG completion
This repository contains code to run the experiments described in:
Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study (arXiv link)
Rahul Nadkarni, David Wadden, Iz Beltagy, Noah A. Smith, Hannaneh Hajishirzi, Tom Hope
Data
The edge splits we used for our experiments can be downloaded using the following links:
Link | File size |
---|---|
RepoDB, transductive split | 11 MB |
RepoDB, inductive split | 11 MB |
Hetionet, transductive split | 49 MB |
Hetionet, inductive split | 49 MB |
MSI, transductive split | 813 MB |
MSI, inductive split | 813 MB |
Each of these files should be placed in the subgraph directory before running any of the experiment scripts. Please see the README.md file in the subgraph directory for more information on the edge split files. If you would like to recreate the edge splits yourself or construct new edge splits, use the script/create_*_dataset.py scripts.
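Before launching an experiment, it can help to confirm that the downloaded splits actually landed in the subgraph directory. The following is a minimal sketch, not part of the repository: the helper name list_split_files is hypothetical, and because the split file names are not given here, it simply lists every regular file in the directory (including README.md) with its size so you can compare against the table above.

```python
from pathlib import Path


def list_split_files(directory="subgraph"):
    """Return (file name, size in MB) for each regular file in `directory`.

    Hypothetical helper: it makes no assumption about how the edge split
    files are named, so everything in the directory is reported.
    """
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Expected a directory at {root}")
    return [
        (path.name, round(path.stat().st_size / 1e6, 1))
        for path in sorted(root.iterdir())
        if path.is_file()
    ]
```

Calling list_split_files() from the repository root and printing the result gives a quick check that, for example, a file of roughly 813 MB is present before running the MSI experiments.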
Environment
The environment.yml file contains all of the necessary packages to use this code. We recommend using Anaconda/Miniconda to set up an environment, which you can do with the command:
conda env create -f environment.yml
Entity names and descriptions
The files that contain entity names and descriptions for all of the datasets can be found in the data/processed directory. If you would like to recreate these files yourself, use the scripts for each dataset found in data/script.
Pre-tokenization
The main training script for the LMs, src/lm/run.py, can take pre-tokenized entity names and descriptions as input, and several of the training scripts in script/training are set up to do so. If you would like to pre-tokenize text before fine-tuning, follow the instructions in script/pretokenize.py. Alternatively, you can pass one of the .tsv files found in data/processed as the --info_filename argument instead of a file with pre-tokenized text.
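As a rough illustration of the second option, the sketch below finds the .tsv files in data/processed and builds an argument list that passes one of them to src/lm/run.py. The helper names find_info_files and build_lm_command are hypothetical, and --info_filename is the only flag taken from this README; any other arguments the script requires are not shown and should be copied from the examples in script/training.

```python
from pathlib import Path


def find_info_files(processed_dir="data/processed"):
    """Return the .tsv files in `processed_dir`, sorted by path."""
    return sorted(Path(processed_dir).glob("*.tsv"))


def build_lm_command(info_file, extra_args=()):
    """Build an argument list that passes `info_file` to src/lm/run.py.

    Only --info_filename comes from the README. Any other arguments
    the script needs must be supplied by the caller via `extra_args`.
    """
    return ["python", "src/lm/run.py", "--info_filename", str(info_file), *extra_args]
```

The returned list can be handed to subprocess.run from the repository root once the remaining arguments have been filled in.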
Training
All of the scripts for training models can be found in the src directory. The script for training all KGE models is src/kge/run.py, while the script for training LMs is src/lm/run.py. Our code for training KGE models is heavily based on code from the Open Graph Benchmark GitHub repository. Examples of how to use each of these scripts, including training with Slurm, can be found in the script/training directory. This directory includes all of the scripts we used to run the experiments for the results in the paper.