[ACL 2022] LinkBERT: A Knowledgeable Language Model 😎 Pretrained with Document Links

Michihiro Yasunaga

Last update: Jan 1, 2023


LinkBERT: A Knowledgeable Language Model Pretrained with Document Links

This repo provides the model, code & data of our paper: LinkBERT: Pretraining Language Models with Document Links (ACL 2022). [PDF] [HuggingFace Models]

Overview

LinkBERT is a new pretrained language model (an improvement on BERT) that captures document links, such as hyperlinks and citation links, to incorporate knowledge that spans multiple documents. Specifically, it was pretrained by feeding linked documents into the same language model context, in addition to using a single document as in BERT.
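
As a rough illustration of the idea above, the sketch below places a document and a document it links to into one BERT-style input. The toy corpus, the `links` mapping, and the helper name `make_linked_input` are hypothetical and are not taken from this repo; the actual pretraining pipeline is described in the paper.

```python
# Hypothetical toy corpus: document id -> text, plus outgoing links per document.
docs = {
    "tidal_basin": "The Tidal Basin hosts the National Cherry Blossom Festival.",
    "cherry_festival": "The festival celebrates cherry trees given by Japan.",
}
links = {"tidal_basin": ["cherry_festival"]}


def make_linked_input(anchor_id, docs, links):
    """Pair a document with the first document it links to (illustrative only)."""
    targets = links.get(anchor_id, [])
    if not targets:
        # No outgoing link: fall back to a single-document input, as in BERT.
        return f"[CLS] {docs[anchor_id]} [SEP]"
    linked_id = targets[0]
    # Both documents share one context, separated by [SEP].
    return f"[CLS] {docs[anchor_id]} [SEP] {docs[linked_id]} [SEP]"


example = make_linked_input("tidal_basin", docs, links)
print(example)
```

A document with no outgoing links simply yields the usual single-document input, so the two cases can be mixed in one training set.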

LinkBERT can be used as a drop-in replacement for BERT. It achieves better performance than BERT on general language understanding tasks (e.g. text classification), and is particularly effective for knowledge-intensive tasks (e.g. question answering) and cross-document tasks (e.g. reading comprehension, document retrieval).

1. Pretrained Models

We release pretrained LinkBERT models (-base and -large sizes) for both the general and biomedical domains. These models have the same format as the HuggingFace BERT models, so you can easily swap LinkBERT in wherever you currently use BERT.

Model	Size	Domain	Pretraining Corpus	Download Link ( 🤗 HuggingFace)
LinkBERT-base	110M parameters	General	Wikipedia with hyperlinks	michiyasunaga/LinkBERT-base
LinkBERT-large	340M parameters	General	Wikipedia with hyperlinks	michiyasunaga/LinkBERT-large
BioLinkBERT-base	110M parameters	Biomedicine	PubMed with citation links	michiyasunaga/BioLinkBERT-base
BioLinkBERT-large	340M parameters	Biomedicine	PubMed with citation links	michiyasunaga/BioLinkBERT-large

To use these models in 🤗 Transformers:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-large')
model = AutoModel.from_pretrained('michiyasunaga/LinkBERT-large')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
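
Because all four checkpoints load the same way, choosing one can be reduced to a lookup. The sketch below is only a convenience wrapper: the Hub identifiers come from the model table above, while the names `MODEL_IDS`, `model_id`, and `load_linkbert` are hypothetical. `transformers` is imported lazily so that the lookup itself works without it installed.

```python
# Hub identifiers copied from the model table; the helper names are made up.
MODEL_IDS = {
    ("general", "base"): "michiyasunaga/LinkBERT-base",
    ("general", "large"): "michiyasunaga/LinkBERT-large",
    ("biomedical", "base"): "michiyasunaga/BioLinkBERT-base",
    ("biomedical", "large"): "michiyasunaga/BioLinkBERT-large",
}


def model_id(domain="general", size="base"):
    """Return the Hub identifier for a released (domain, size) combination."""
    try:
        return MODEL_IDS[(domain, size)]
    except KeyError:
        raise ValueError(
            f"no released model for domain={domain!r}, size={size!r}"
        ) from None


def load_linkbert(domain="general", size="base"):
    """Download and return (tokenizer, model) for the chosen checkpoint."""
    from transformers import AutoModel, AutoTokenizer  # lazy import

    name = model_id(domain, size)
    return AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)
```

For example, `load_linkbert("biomedical", "base")` would fetch BioLinkBERT-base on first use.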

To fine-tune the models, see Sections 2 and 3 below. When fine-tuned on downstream tasks, LinkBERT achieves the following results.

General benchmarks (MRQA and GLUE):

	HotpotQA	TriviaQA	SearchQA	NaturalQ	NewsQA	SQuAD	GLUE
	F1	F1	F1	F1	F1	F1	Avg score
BERT-base	76.0	70.3	74.2	76.5	65.7	88.7	79.2
LinkBERT-base	78.2	73.9	76.8	78.3	69.3	90.1	79.6
BERT-large	78.1	73.7	78.3	79.0	70.9	91.1	80.7
LinkBERT-large	80.8	78.2	80.5	81.0	72.6	92.7	81.1
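
To summarize the six MRQA columns in one number, the snippet below averages the F1 scores copied from the table and reports the gain of each LinkBERT model over its BERT counterpart; it comes to roughly 2.5 F1 for both sizes. The unweighted mean is our own summary choice, not a metric reported in the paper.

```python
# F1 scores copied from the table above, in column order:
# HotpotQA, TriviaQA, SearchQA, NaturalQ, NewsQA, SQuAD.
mrqa_f1 = {
    "BERT-base": [76.0, 70.3, 74.2, 76.5, 65.7, 88.7],
    "LinkBERT-base": [78.2, 73.9, 76.8, 78.3, 69.3, 90.1],
    "BERT-large": [78.1, 73.7, 78.3, 79.0, 70.9, 91.1],
    "LinkBERT-large": [80.8, 78.2, 80.5, 81.0, 72.6, 92.7],
}


def mean(values):
    return sum(values) / len(values)


def mean_gain(baseline, model):
    """Difference in unweighted mean F1 between a model and its baseline."""
    return mean(mrqa_f1[model]) - mean(mrqa_f1[baseline])


base_gain = mean_gain("BERT-base", "LinkBERT-base")
large_gain = mean_gain("BERT-large", "LinkBERT-large")
print(f"base:  +{base_gain:.2f} F1")
print(f"large: +{large_gain:.2f} F1")
```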

Biomedical benchmarks (BLURB, MedQA, MMLU, etc): BioLinkBERT attains new state-of-the-art 😊

	BLURB score	PubMedQA	BioASQ	MedQA-USMLE
PubMedBERT-base	81.10	55.8	87.5	38.1
BioLinkBERT-base	83.39	70.2	91.4	40.0
BioLinkBERT-large	84.30	72.2	94.8	44.6

	MMLU-professional medicine
GPT-3 (175B params)	38.7
UnifiedQA (11B params)	43.2
BioLinkBERT-large (340M params)	50.7

2. Set up environment and data

Environment

Run the following commands to create a conda environment:

conda create -n linkbert python=3.8
conda activate linkbert
pip install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install transformers==4.9.1 datasets==1.11.0 fairscale==0.4.0 wandb scikit-learn seqeval
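
After installation, it can help to confirm that the pinned versions are the ones actually active in the environment. The following is a minimal sketch using only the standard library; the helper name `find_mismatches` and the `EXPECTED` dictionary are our own and are not part of this repo.

```python
from importlib import metadata

# Versions pinned by the pip command above.
EXPECTED = {"transformers": "4.9.1", "datasets": "1.11.0", "fairscale": "0.4.0"}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (wanted, found)} for every package that differs.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{package}: expected {wanted}, found {found or 'not installed'}")
```

No output means all three pinned packages match.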

Data

You can download the preprocessed datasets on which we evaluated LinkBERT from [here]. Simply download this zip file and unzip it. This includes:

MRQA question answering datasets (HotpotQA, TriviaQA, NaturalQuestions, SearchQA, NewsQA, SQuAD)
BLURB biomedical NLP datasets (PubMedQA, BioASQ, HoC, Chemprot, PICO, etc.)
MedQA-USMLE biomedical reasoning dataset
MMLU-professional medicine reasoning dataset

They are all preprocessed in the HuggingFace dataset format.

If you would like to preprocess the raw data from scratch, you can take the following steps:

First download the raw datasets from the original sources by following instructions in scripts/download_raw_data.sh
Then run the preprocessing scripts scripts/preprocess_{mrqa,blurb,medqa,mmlu}.py.

3. Fine-tune LinkBERT

Change the working directory to src/, and follow the instructions below for each dataset.

MRQA

To fine-tune for the MRQA datasets (HotpotQA, TriviaQA, NaturalQuestions, SearchQA, NewsQA, SQuAD), run commands listed in run_examples_mrqa_linkbert-{base,large}.sh.

BLURB

To fine-tune for the BLURB biomedical datasets (PubMedQA, BioASQ, HoC, Chemprot, PICO, etc.), run commands listed in run_examples_blurb_biolinkbert-{base,large}.sh.

MedQA & MMLU

To fine-tune for the MedQA-USMLE dataset, run commands listed in run_examples_medqa_biolinkbert-{base,large}.sh.

To evaluate the fine-tuned model additionally on MMLU-professional medicine, run the commands listed at the bottom of run_examples_medqa_biolinkbert-large.sh.

Reproducibility

We also provide a CodaLab worksheet on which we record our experiments. You may find it useful for replicating the experiments with the same model, code, data, and environment.

Citation

If you find our work helpful, please cite the following:

@InProceedings{yasunaga2022linkbert,
  author =  {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title =   {LinkBERT: Pretraining Language Models with Document Links},
  year =    {2022},  
  booktitle = {Association for Computational Linguistics (ACL)},  
}
