EMNLP'2021: Can Language Models be Biomedical Knowledge Bases?

Overview

BioLAMA

BioLAMA is a collection of biomedical factual knowledge triples for probing biomedical LMs. The triples are collected and pre-processed from three sources: CTD, UMLS, and Wikidata. Please see our paper Can Language Models be Biomedical Knowledge Bases? (Sung et al., 2021) for more details.

* The dataset for the BioLAMA probe is available at data.tar.gz

Getting Started

After installation, you can try BioLAMA with manual prompts. For example, if the subject is "flu" and you want to probe an LM for its symptoms, the input should look like "Flu has symptoms such as [Y]."

# Set MODEL to bert-base-cased for BERT or dmis-lab/biobert-base-cased-v1.2 for BioBERT
MODEL=./RoBERTa-base-PM-Voc/RoBERTa-base-PM-Voc-hf
python ./BioLAMA/cli_demo.py \
    --model_name_or_path ${MODEL}

Result:

Please enter input (e.g., Flu has symptoms such as [Y].):
hepatocellular carcinoma has symptoms such as [Y].
-------------------------
Rank    Prob    Pred
-------------------------
1       0.648   jaundice
2       0.223   abdominal pain
3       0.127   jaundice and ascites
4       0.11    ascites
5       0.086   hepatomegaly
6       0.074   obstructive jaundice
7       0.06    abdominal pain and jaundice
8       0.059   ascites and jaundice
9       0.043   anorexia and jaundice
10      0.042   fever and jaundice
-------------------------
Top1 prediction sentence:
"hepatocellular carcinoma has symptoms such as jaundice."
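
The `[Y]` placeholder stands for the object to be predicted, which may span several tokens (note the multi-word predictions in the table above). The sketch below shows one way such a prompt might be expanded into masked inputs, one per candidate object length. The helper name `expand_prompt`, the default `[MASK]` token string, and the one-input-per-length strategy are assumptions for illustration; they are not taken from the BioLAMA code.

```python
def expand_prompt(prompt, max_masks, mask_token="[MASK]", placeholder="[Y]"):
    """Return one masked input per candidate object length (1..max_masks).

    Hypothetical helper for illustration; not part of the BioLAMA code base.
    """
    if placeholder not in prompt:
        raise ValueError(f"prompt must contain the placeholder {placeholder!r}")
    return [
        prompt.replace(placeholder, " ".join([mask_token] * n))
        for n in range(1, max_masks + 1)
    ]


if __name__ == "__main__":
    for masked in expand_prompt("Flu has symptoms such as [Y].", max_masks=3):
        print(masked)
```

The mask token depends on the tokenizer: BERT-style models use `[MASK]`, while RoBERTa-style models use `<mask>`, so `mask_token` would need to be set accordingly.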

Installation

# Install torch with conda (please check your CUDA version)
conda create -n BioLAMA python=3.7
conda activate BioLAMA
conda install pytorch=1.8.0 cudatoolkit=10.2 -c pytorch

# Install BioLAMA
git clone https://github.com/dmis-lab/BioLAMA.git
cd BioLAMA
pip install -r requirements.txt

Resources

Models

For BERT and BioBERT, we use checkpoints provided on the Hugging Face Hub:

bert-base-cased (for BERT)
dmis-lab/biobert-base-cased-v1.2 (for BioBERT)

Bio-LM is not available on the Hugging Face Hub, so we use the Bio-LM checkpoint released by its authors, downloaded with the commands below. Among the various versions of Bio-LM, we use `RoBERTa-base-PM-Voc-hf`.

wget https://dl.fbaipublicfiles.com/biolm/RoBERTa-base-PM-Voc-hf.tar.gz
tar -xzvf RoBERTa-base-PM-Voc-hf.tar.gz 
rm -rf RoBERTa-base-PM-Voc-hf.tar.gz

Datasets

The dataset will take about 78 MB of space. Download data.tar.gz and uncompress it.

tar -xzvf data.tar.gz
rm -rf data.tar.gz

The directory tree of the data looks like this:

data
├── ctd
│   ├── entities
│   ├── meta
│   ├── prompts
│   └── triples_processed
│       └── CD1
│           ├── dev.jsonl
│           ├── test.jsonl
│           └── train.jsonl
├── wikidata
│   ├── entities
│   ├── meta
│   ├── prompts
│   └── triples_processed
│       └── P2175
│           ├── dev.jsonl
│           ├── test.jsonl
│           └── train.jsonl
└── umls
    ├── meta
    └── prompts

Important: UMLS triples are not provided because of license restrictions. For those who want to probe LMs with UMLS triples, we provide pre-processing scripts for UMLS. Please follow this instruction.
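
As a rough illustration of the layout above, the sketch below collects the per-relation files of one split with a glob pattern and reads each `.jsonl` file as one JSON object per line. The helper names `load_jsonl` and `load_split` are hypothetical, and no assumption is made about the fields inside each record.

```python
import glob
import json
import os


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_split(pattern):
    """Map each relation directory name (e.g. P2175) to its list of records.

    `pattern` is a glob such as ./data/wikidata/triples_processed/*/test.jsonl.
    Hypothetical helper for illustration; not part of the BioLAMA code base.
    """
    return {
        os.path.basename(os.path.dirname(path)): load_jsonl(path)
        for path in sorted(glob.glob(pattern))
    }


if __name__ == "__main__":
    split = load_split("./data/wikidata/triples_processed/*/test.jsonl")
    for relation, records in split.items():
        print(relation, len(records))
```

For UMLS, the pattern matches nothing until the triples have been generated locally, so the returned dictionary is empty.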

Experiments

We provide two ways of probing PLMs with BioLAMA:

Manual Prompt
OptiPrompt

Manual Prompt

Manual Prompt probes PLMs using pre-defined manual prompts. Predictions and scores are written to the directory given by --output_dir (./output/${TASK}_manual in the command below).

# Set TASK to 'ctd' for CTD or 'umls' for UMLS
# Set MODEL to 'bert-base-cased' for BERT or 'dmis-lab/biobert-base-cased-v1.2' for BioBERT
TASK=wikidata
MODEL=./RoBERTa-base-PM-Voc/RoBERTa-base-PM-Voc-hf
PROMPT_PATH=./data/${TASK}/prompts/manual.jsonl
TEST_PATH=./data/${TASK}/triples_processed/*/test.jsonl

python ./BioLAMA/run_manual.py \
    --model_name_or_path ${MODEL} \
    --prompt_path ${PROMPT_PATH} \
    --test_path "${TEST_PATH}" \
    --init_method confidence \
    --iter_method none \
    --num_mask 10 \
    --max_iter 10 \
    --beam_size 5 \
    --batch_size 16 \
    --output_dir ./output/${TASK}_manual

Result:

PID     Acc@1   Acc@5
-------------------------
P2175   9.40    21.11
P2176   22.46   39.75
P2293   2.24    11.43
P4044   9.47    19.47
P780    16.30   37.85
-------------------------
MACRO   11.97   25.92
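
A small sketch of how scores like these could be computed: Acc@k counts a prompt as correct if any of the top-k predictions matches a gold object, and MACRO is the unweighted mean over relations. The function names and the case-insensitive exact matching are assumptions for illustration, not the repository's evaluation code; the per-relation numbers are copied from the table above.

```python
def acc_at_k(ranked_predictions, gold_answers, k):
    """Return 1.0 if any of the top-k predictions is a gold answer, else 0.0.

    Matching is case-insensitive exact match (an assumption for this sketch).
    """
    gold = {g.lower() for g in gold_answers}
    return float(any(p.lower() in gold for p in ranked_predictions[:k]))


def macro_average(per_relation_scores):
    """Unweighted mean over relations, so every relation counts equally."""
    return sum(per_relation_scores.values()) / len(per_relation_scores)


# Per-relation Acc@1 / Acc@5 copied from the table above.
acc1 = {"P2175": 9.40, "P2176": 22.46, "P2293": 2.24, "P4044": 9.47, "P780": 16.30}
acc5 = {"P2175": 21.11, "P2176": 39.75, "P2293": 11.43, "P4044": 19.47, "P780": 37.85}

if __name__ == "__main__":
    print(f"MACRO   {macro_average(acc1):.2f}   {macro_average(acc5):.2f}")
```

On the values above, the macro averages come out as 11.97 and 25.92, matching the MACRO row.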

OptiPrompt

OptiPrompt probes PLMs using embedding-based prompts that are initialized from the embeddings of the manual prompts. Predictions and scores are written to the directory given by --output_dir (./output/${TASK}_optiprompt in the command below).

# Set TASK to 'ctd' for CTD or 'umls' for UMLS
# Set MODEL to 'bert-base-cased' for BERT or 'dmis-lab/biobert-base-cased-v1.2' for BioBERT
TASK=wikidata
MODEL=./RoBERTa-base-PM-Voc/RoBERTa-base-PM-Voc-hf
PROMPT_PATH=./data/${TASK}/prompts/manual.jsonl
TRAIN_PATH=./data/${TASK}/triples_processed/*/train.jsonl
DEV_PATH=./data/${TASK}/triples_processed/*/dev.jsonl
TEST_PATH=./data/${TASK}/triples_processed/*/test.jsonl

python ./BioLAMA/run_optiprompt.py \
    --model_name_or_path ${MODEL} \
    --train_path "${TRAIN_PATH}" \
    --dev_path "${DEV_PATH}" \
    --test_path "${TEST_PATH}" \
    --prompt_path ${PROMPT_PATH} \
    --num_mask 10 \
    --init_method confidence \
    --iter_method none \
    --max_iter 10 \
    --beam_size 5 \
    --batch_size 16 \
    --lr 3e-3 \
    --epochs 10 \
    --seed 0 \
    --prompt_token_len 5 \
    --init_manual_template \
    --output_dir ./output/${TASK}_optiprompt

Result:

PID     Acc@1   Acc@5
-------------------------
P2175   9.47    24.94
P2176   20.14   39.57
P2293   2.90    9.21
P4044   7.53    18.58
P780    12.98   33.43
-------------------------
MACRO   7.28    18.51

Acknowledgement

Parts of the code are modified from genewikiworld, X-FACTR, and OptiPrompt. We thank the authors for open-sourcing their projects.

Citations

@inproceedings{sung2021can,
    title={Can Language Models be Biomedical Knowledge Bases?},
    author={Sung, Mujeen and Lee, Jinhyuk and Yi, Sean and Jeon, Minji and Kim, Sungdong and Kang, Jaewoo},
    booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year={2021},
}

Comments

Probing with different datasets

Hi there! This work is really interesting, and so relevant to my own. I've been creating a fact-checked dataset of 300 tweets (all COVID-19 related, biomedical themed; fact-checked using crowdsourcing) as part of my Master's thesis and would really like to probe both BERT and BioBERT, using your probe. I've constrained the dataset quite heavily, only considering 'cause' relations, and I've annotated biomedical named entities, for example:

[Moderna vaccine TREATMENT] may cause [neurodegenerative disorders MEDICAL CONDITION] like [Alzheimer's MEDICAL CONDITION]

Unfortunately, I have not linked these to puids or uuids in any database, so I am not sure I would be able to make full use of the prompt generation. Would it make sense to feed these 300 instances (with masked objects) to the cli_demo.py for this dataset? Would it be possible to somehow calculate an evaluation score on this? As far as I can see, the cli_demo.py only returns predictions, but does not do any accuracy scoring.

Any help and ideas are really appreciated. Thanks in advance, and really great work!

opened by violenil 4

triples_processed of UMLS's data is empty?

I'm very interested in your paper. But when I tried to reproduce your experiments, I found that the triples_processed directory for UMLS is empty. Could you provide the dataset?

opened by duolatx 2

data link is not available

Hi, thank you for this good work. I just found that the data link (http://nlp.dmis.korea.edu/projects/biolama/data.tar.gz) is not available. Could you help fix it? Thanks

opened by seasonyao 2

About the "length_norm_coeff"

Hi, very interesting work. I notice that you set the default "length_norm_coeff" to 0. Wouldn't that affect the decoding results, e.g., by preferring longer spans?

# length norm
length_norm_coeff = 0.0
lp = np.power(mask_len, length_norm_coeff)
prob = np.exp(log_prob / lp)

opened by c-box 2
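
For reference, a small sketch of what the quoted snippet computes for spans of different lengths. It assumes that `log_prob` is the sum of the per-token log-probabilities of the span, which the snippet itself does not show; `span_score` is a hypothetical name, and the standard-library `math` module is used in place of NumPy.

```python
import math


def span_score(token_log_probs, length_norm_coeff=0.0):
    """Score a multi-token span from its per-token log-probabilities.

    Mirrors the quoted snippet: prob = exp(log_prob / mask_len ** coeff).
    Assumes log_prob is the sum of the token log-probabilities.
    """
    mask_len = len(token_log_probs)
    log_prob = sum(token_log_probs)
    lp = math.pow(mask_len, length_norm_coeff)
    return math.exp(log_prob / lp)


short = [math.log(0.5)]                 # one token with probability 0.5
long = [math.log(0.6), math.log(0.6)]   # two tokens with probability 0.6 each

if __name__ == "__main__":
    for coeff in (0.0, 1.0):
        print(coeff, round(span_score(short, coeff), 3), round(span_score(long, coeff), 3))
```

With a coefficient of 0 the divisor is 1, so the score is the plain joint probability and each additional token can only lower it (0.5 versus 0.36 here). With a coefficient of 1 the score becomes the geometric mean of the token probabilities (0.5 versus 0.6), which removes the penalty on longer spans.
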
BERT VS RoBERTa

Hi, very interesting work. I have a question about BERT VS RoBERTa.

In the original LAMA experiments, many people found that BERT performs better than RoBERTa. However, when I tried the same comparison on BioLAMA, RoBERTa was much better than BERT in my results. I am not sure whether you tried this as well (I did not see it discussed in your paper); if so, did you get the same result (RoBERTa >> BERT)?

I tried to explain this and found one possible cause: when I initialize the model with Hugging Face's AutoModelWithLMHead, RoBERTa loads everything from its checkpoint (it uses lm_head on top), but BERT does not have 'cls.seq_relationship.weight' and 'cls.seq_relationship.bias' in its head and needs to initialize them randomly. I think this definitely influences BERT's performance, right? Do you think this is a big issue? When you tried BERT, did you see this warning, and how did you handle it, or did you just ignore it?

Warning: Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']

Thanks

opened by seasonyao 6

Owner

DMIS Laboratory - Korea University
