[NAACL & ACL 2021] SapBERT: Self-alignment pretraining for BERT.

Overview

SapBERT: Self-alignment pretraining for BERT

This repo holds code for the SapBERT model presented in our NAACL 2021 paper: Self-Alignment Pretraining for Biomedical Entity Representations [arxiv]; and our ACL 2021 paper: Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [PDF].

[Figure: front-page graph]

Huggingface Models

[SapBERT]

Standard SapBERT as described in [Liu et al., NAACL 2021]. Trained with UMLS 2020AA (English only), using microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext as the base model. Use the [CLS] token (before the pooler) as the representation of the input.
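
For example, entity names can be embedded via the Hugging Face transformers library roughly as follows (a minimal sketch, not taken from this repo; the example strings and max_length are arbitrary choices):

# Minimal usage sketch: encode a batch of entity names and take the
# [CLS] token of the last hidden layer (i.e. before the pooler) as the
# entity representation.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

names = ["covid-19", "coronavirus infection", "high fever"]
toks = tokenizer(names, padding=True, truncation=True,
                 max_length=25, return_tensors="pt")

with torch.no_grad():
    out = model(**toks)

cls_embeddings = out.last_hidden_state[:, 0, :]  # shape: (batch, hidden_dim)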

[SapBERT-XLMR]

Cross-lingual SapBERT as described in [Liu et al., ACL 2021]. Trained with UMLS 2020AB (all languages), using xlm-roberta-base as the base model. Use the [CLS] token (before the pooler) as the representation of the input.

[SapBERT-mean-token]

Same as the standard SapBERT but trained with mean-pooling over token embeddings instead of the [CLS] representation.
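
For this variant, the embedding is obtained by averaging the last-layer token embeddings under the attention mask, e.g. (a minimal sketch; the checkpoint id below is an assumption, check the Hugging Face hub for the exact name of the mean-token model):

import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint id for the mean-token variant (verify on the HF hub).
model_name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

toks = tokenizer(["myocardial infarction"], return_tensors="pt")
with torch.no_grad():
    hidden = model(**toks).last_hidden_state       # (batch, seq_len, hidden_dim)

mask = toks["attention_mask"].unsqueeze(-1)        # (batch, seq_len, 1)
mean_embeddings = (hidden * mask).sum(1) / mask.sum(1)  # average over real tokens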

Environment

The code is tested with Python 3.8, torch 1.7.0, and Hugging Face transformers 4.4.2. Please see requirements.txt for more details.

Train SapBERT

Prepare training data as instructed in data/generate_pretraining_data.ipynb.

Run:

cd umls_pretraining
./pretrain.sh 0,1 

where 0,1 specifies the GPU devices.

Evaluate SapBERT

Please view evaluation/README.md for details.
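
As a rough illustration of the embed-and-retrieve workflow used for entity linking (see also the "Entity Span --> CUI" question in the comments below), here is a hedged sketch, not the repo's evaluation code, that links a mention to the nearest entry of a toy dictionary; a real run would embed the full UMLS or MedMentions dictionary file:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

def embed(names):
    toks = tokenizer(names, padding=True, truncation=True,
                     max_length=25, return_tensors="pt")
    with torch.no_grad():
        return model(**toks).last_hidden_state[:, 0, :]  # [CLS] before pooler

# Toy (concept_id, concept_name) dictionary, for illustration only.
dictionary = [("C0018681", "headache"), ("C0020538", "hypertension")]
dict_emb = F.normalize(embed([name for _, name in dictionary]), dim=-1)

mention_emb = F.normalize(embed(["high blood pressure"]), dim=-1)
scores = mention_emb @ dict_emb.T           # cosine similarities
best = scores.argmax(dim=-1).item()
print(dictionary[best])                     # predicted concept id and name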

Citations

@article{liu2021self,
	title={Self-Alignment Pretraining for Biomedical Entity Representations},
	author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
	journal={arXiv preprint arXiv:2010.11784},
	year={2020}
}

Acknowledgement

Parts of the code are modified from BioSyn. We thank the BioSyn authors for open-sourcing their code.

License

SapBERT is MIT licensed. See the LICENSE file for details.


Comments
  • Entity Span --> CUI

    Hi - what's the simplest way to go from an entity span to a CUI? It seems the Hugging Face model just gives you the [CLS] hidden-state representation, and you then need to use that to find the nearest neighbour in UMLS. But it wasn't clear how to obtain that UMLS index.

    opened by griff4692 2
  • MedMentions dictionary file created

    Hello team, thank you for your great contribution. Could you please explain briefly how the MedMentions dictionary file used to run the evaluations was created?

    Thanks, Saranya

    opened by saranyakrishm 1
  • Tokenizer

    Is there any information on the tokenizer from Hugging Face?

    tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")

    I assume it's the same as PubMedBERT, which I presume uses an in-domain vocabulary. Just would love confirmation! Thanks.

    opened by griff4692 1
  • Details on fine-tuning data

    Hello, and thanks for sharing this great project.

    Regarding fine-tuning of SapBERT, the README states:

    For finetuning on your customised dataset, generate data in the format of [...] where entity_name_1 and entity_name_2 are synonym pairs (belonging to the same concept concept_id) sampled from a given labelled dataset.

    Are there any examples of how this looks exactly for the datasets (NCBI Disease, COMETA, etc.) used in the evaluation?

    opened by phlobo 0