

SciBERT

SciBERT is a BERT model trained on scientific text.

  • SciBERT is trained on papers from the corpus of semanticscholar.org. The corpus contains 1.14M papers and 3.1B tokens. We use the full text of the papers in training, not just the abstracts.

  • SciBERT has its own vocabulary (scivocab) that is built to best match the training corpus. We trained cased and uncased versions. We also include models trained on the original BERT vocabulary (basevocab) for comparison; see the tokenization sketch after this list.

  • SciBERT achieves state-of-the-art performance on a wide range of scientific-domain NLP tasks. The details of the evaluation are in the paper. Evaluation code and data are included in this repo.
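
As a quick illustration of the scivocab point above, here is a minimal sketch (assuming the Hugging Face transformers package and the allenai/scibert_scivocab_uncased and bert-base-uncased checkpoints) that compares how the two vocabularies split a piece of scientific text; scivocab typically produces fewer subword pieces for in-domain terminology:

from transformers import AutoTokenizer

# scivocab (SciBERT) vs. basevocab (original BERT)
sci_tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
base_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "immunohistochemistry of the hippocampus"
print(sci_tokenizer.tokenize(text))   # subword pieces under scivocab
print(base_tokenizer.tokenize(text))  # subword pieces under basevocab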

Downloading Trained Models

Update! SciBERT models are now installable directly within Hugging Face's transformers framework under the allenai org:

from transformers import AutoTokenizer, AutoModel

# scivocab, uncased
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

# scivocab, cased
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_cased')

We release both the TensorFlow and the PyTorch versions of the trained models. The TensorFlow version is compatible with code that works with the model from Google Research. The PyTorch version is created using the Hugging Face library, and this repo shows how to use it in AllenNLP. All combinations of scivocab and basevocab, cased and uncased models, are available below. Our evaluation shows that scivocab-uncased usually gives the best results.

Tensorflow Models

PyTorch AllenNLP Models

PyTorch HuggingFace Models

Using SciBERT in your own model

SciBERT models include all the necessary files to be plugged into your own model and are in the same format as BERT. If you are using TensorFlow, refer to Google's BERT repo; if you use PyTorch, refer to Hugging Face's repo, where detailed instructions on using BERT models are provided.
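
As a minimal PyTorch sketch of the Hugging Face route (assuming a recent transformers version; the example sentence and variable names are illustrative), contextual token embeddings can be extracted and passed to your own task-specific layers:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

text = "The patient was treated with cisplatin and radiotherapy."
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per wordpiece, shape (1, num_tokens, 768),
# ready to feed into your own downstream layers.
token_embeddings = outputs.last_hidden_state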

Training new models using AllenNLP

To run experiments on the different tasks and reproduce our results from the paper, you first need to set up the Python 3.6 environment:

pip install -r requirements.txt

which will install dependencies like AllenNLP.

Use the scibert/scripts/train_allennlp_local.sh script as an example of how to run an experiment (you'll need to modify paths and variable names like TASK and DATASET).

We include a broad set of scientific NLP datasets under the data/ directory, covering the following tasks. Each task has a sub-directory of available datasets.

├── ner
│   ├── JNLPBA
│   ├── NCBI-disease
│   ├── bc5cdr
│   └── sciie
├── parsing
│   └── genia
├── pico
│   └── ebmnlp
└── text_classification
    ├── chemprot
    ├── citation_intent
    ├── mag
    ├── rct-20k
    ├── sci-cite
    └── sciie-relation-extraction

For example, to run the model on the Named Entity Recognition (NER) task with the BC5CDR dataset (BioCreative V CDR), modify the scibert/scripts/train_allennlp_local.sh script as follows:

DATASET='bc5cdr'
TASK='ner'
...

Decompress the PyTorch model that you downloaded using:

tar -xvf scibert_scivocab_uncased.tar

This produces a scibert_scivocab_uncased directory containing two files: a vocabulary file (vocab.txt) and a weights file (weights.tar.gz). Copy the files to your desired location and then set the correct paths for BERT_WEIGHTS and BERT_VOCAB in the script:

export BERT_VOCAB=path-to/scibert_scivocab_uncased.vocab
export BERT_WEIGHTS=path-to/scibert_scivocab_uncased.tar.gz

Finally run the script:

./scibert/scripts/train_allennlp_local.sh [serialization-directory]

Where [serialization-directory] is the path to an output directory where the model files will be stored.

Citing

If you use SciBERT in your research, please cite SciBERT: Pretrained Language Model for Scientific Text.

@inproceedings{Beltagy2019SciBERT,
  title={SciBERT: Pretrained Language Model for Scientific Text},
  author={Iz Beltagy and Kyle Lo and Arman Cohan},
  year={2019},
  booktitle={EMNLP},
  Eprint={arXiv:1903.10676}
}

SciBERT is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

Comments
  • Pretrained sciroBERTa weights release in the works?

    Given the success of RoBERTa https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/ on GLUE benchmarks and the like, is training a RoBERTa model over the Semantic Scholar corpus planned for release in this repo in the foreseeable future?

    Otherwise, can someone provide hints about how to train a roBERTa model on the semantic scholar corpus and the compute time needed for the purpose? Thanks!

    opened by davidefiocco 16
  • TypeError: '_NamespacePath' object is not subscriptable

    When I finished the previous work mentioned above, I encountered this problem: in src/allennlp/allennlp/common/util.py, line 316, in import_submodules, the line path_string = ' ' if not path else path[0] raises TypeError: '_NamespacePath' object is not subscriptable.

    What should I do to deal with this problem? Thanks in advance.

    opened by Nicozwy 11
  • Pretraining SciBERT

    Hi, The repo does not seem to contain the codes to pretrain the model on semantic scholar. Do you plan to release those codes and the pretrain data? Thanks!

    Yichong

    opened by xycforgithub 8
  • Question Over Punctuation Charts in Vocab Creation

    Hi

    Interesting to see you looked at creating your own vocab. It appears that with BERT they used a special variant, and neither the code nor the exact details of what was run have been made available. In your cheatsheet I've found reference to your use of Google's SentencePiece through the Python wrapper, by the looks of things. I wondered if you had any more specific details on preparation and post-processing, as the output by default won't include custom tokens, nor does it use ## etc. I have a pretty good idea of how you likely did most of it, but it would be nice to know for sure. Also, in the cheatsheet the command used sets the vocab size to 31K, not 30K, and the length of your vocab files differs from BERT's too.

    Could you also comment on the significance of the "[unused15]" kind of entries, as the base BERT vocab has 994 of them whereas you only have 100? I haven't found any details on what these are used for or what created them.

    More significantly (maybe), I was curious as to whether you had looked at any preprocessing (additional tokenisation) before running SentencePiece. The reason being that the BERT tokeniser does whitespace and punctuation splitting before WordPiece tokenisation over the resulting tokens. It looks as though this could be part of what WordPiece does as well, based on the occurrence of lots of single-character entries (with and without ## prefixed) and the lack of entries containing punctuation characters along with letters or digits. For example, you have entries like "(1997),", whereas the BERT vocab doesn't have anything like this (with the exception of symbol characters not classed as punctuation). One issue with these entries as they stand is that when you apply the BERT tokeniser as part of task training you are never going to use such an entry, so even though you have a ~30K vocab size a portion of it will not be used, and therefore the neural model is going to have unused capacity. There is also the potential to change the number of possible occurrences of UNK tokens resulting from the tokenisation steps.

    Following on from that, there is the question of whether this has possibly impacted (negatively or positively) the results you have seen, as a side effect of this difference. So any more information on what may also have been tried in this area would be of interest to hear.

    Thanks

    Tony

    opened by antonyscerri 7
  • No weights.tar.gz in the models

    Hello,

    Following the instructions on how to set up and run SciBERT, I stumbled upon the problem that the file weights.tar.gz isn't present in the pre-trained models. However, without this file the entire execution pipeline breaks.

    opened by dalevskaya 6
  • Why use of CNN char embedding?

    "To keep things simple, we use minimal task specific architectures atop BERT-Base and SCIBERT embeddings. Each token is represented as the concatenation of its BERT embedding with a CNN-based character embedding. If the token has multiple BERT subword units, we use the first one."

    Why the use of additional CNN-based char embeddings? Many (most?) papers using BERT (or similar) use only the embedding coming out of the LM-based model.

    Was there a big additional uptick from layering in the CNN-based char embeddings?

    opened by cbockman 6
  • Pre-training parameters

    Hi,

    I'm currently training a BERT model from scratch using the same parameters as specified in scripts/cheatsheet.txt.

    @ibeltagy Could you confirm that these parameters are up-to-date 🤔

    Loss seems to be fine, but I'm just wondering why training both the 128 and 512 seq len models with 3B tokens on a v3-8 TPU is a lot faster than your reported training time.

    opened by stefan-it 5
  • SciBert Checkpoints not compatible with Bert

    As written in the GitHub repo of SciBERT, we can use the SciBERT checkpoints for TensorFlow by plugging them into the BERT code at google-research/bert ( https://github.com/google-research/bert ), but when I tried doing the same I got the following data loss error:

    DataLossError: Unable to open table file /content/scibert_data/scibert_scivocab_uncased/bert_model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

    opened by PradyumnaGupta 4
  • Generation of scivocab

    Hi scibert-team,

    I have one question regarding the generation of your own scivocab.

    In the paper, it was mentioned that SentencePiece was used. Could you provide the parameters you used for generating the vocab?

    How did you manage the conversion from the SentencePiece annotation to ##, which is used in the BERT tokenizer 🤔 I would really like to train my own BERT model, but the vocab generation seems a bit complicated...

    Thanks in advance,

    Stefan

    opened by stefan-it 4
  • How to Predict NER

    There is only a command that tells us how to train models. Could you provide some prediction commands, e.g. how do I use the trained NER model to get predicted tags? Thank you very much.

    opened by lizaigaoge550 4
  • How to get Sentence embedding using pre-trained SciBERT weights?

    The instructions to use SciBERT say this

    SciBERT models include all necessary files to be plugged in your own model and are in same format as BERT. If you are using Tensorflow, refer to Google's BERT repo and if you use PyTorch, refer to Hugging Face's repo where detailed instructions on using BERT models are provided.

    However, the instructions for using BERT include loading it from tf.hub, and I don't see SciBERT on TF Hub, so I am unable to figure out how to get a sentence embedding for SciBERT.

    This is my attempt in Python. So far I cloned the repository and loaded the weights, but I don't know how to get the sentence/paragraph vector.

    This is my code so far

    import sys
    import tensorflow as tf

    # Clone the SciBERT repo and make it importable
    !test -d SciBert_repo || git clone https://github.com/allenai/scibert SciBert_repo
    if 'SciBert_repo' not in sys.path:
        sys.path += ['SciBert_repo']

    import extract_features

    with tf.Session(graph=graph) as session:
        saver.restore(session, 'SciBert_repo.ckpt')

    I know this is based on the original BERT code. For regular BERT they have you use tf.hub, but I'm guessing the setup is pretty similar. This is my code for regular BERT

    pip install bert-tensorflow
    
    import tensorflow as tf
    import tensorflow_hub as hub
    
    import bert
    from bert import run_classifier
    from bert import optimization
    from bert import tokenization
    
    import pandas as pd
    
    from tensorflow import keras
    import os
    import re
    
    from tensorflow.keras import backend as K
    
    from bert.tokenization import FullTokenizer
    
    bert_path = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
    
    sess = tf.Session()
    
    bert_module = hub.Module(
      bert_path,
      trainable=True)
    
    #Basically this is a function to convert text into a format BERT understands
    def bertInputsFromText(text):
    .
    .
    .
    
    bert_inputs = bertInputsFromText("This is a test sentence")
    
    sentence_embedding = bert_module(inputs=bert_inputs, signature="tokens", as_dict=True)[
            "pooled_output"
    ]

    So I'm guessing my question boils down to: what do I use as an equivalent of bert_module?
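
    A minimal sketch of one way to do this without tf.hub, assuming the Hugging Face transformers package and simple mean pooling over SciBERT's last hidden layer (the pooling strategy is an assumption, not something the repo prescribes):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
    model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

    inputs = tokenizer("This is a test sentence", return_tensors='pt')
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state  # (1, num_tokens, 768)

    # Mean-pool the wordpiece vectors into a single sentence vector
    sentence_embedding = last_hidden.mean(dim=1)  # (1, 768)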

    opened by Santosh-Gupta 4
  • Running inference for PICO tasks

    I have my own dataset consisting of a few hundred abstracts and I want to see baseline performance using SciBERT's PICO functionality. Are there code snippets for easily running inference on your own dataset and just seeing how it classifies? I tried to use the Hugging Face models and the AWS deploy code, but I'm not sure how to use it for a PICO task, let alone interpret the text classification outputs it gives, which just seem to be "LABEL_0" or "LABEL_1". Any help on this would be much appreciated!

    opened by SLK121 0
  • Using SciBERT on GIEC report

    Hi everybody,

    We are a group of university students currently working on a research project in an NLP class. We extracted text from a GIEC report and built a Knowledge Graph from it. We wanted to know if it is possible to use a pre-trained version of SciBERT and train it with our own KG.

    As we are pretty new to NLP, we don't have any idea of the feasibility of this, or how to start.

    Would any of you have any pointers?

    Thanks, Hugo M

    opened by CSHugoM 0
  • Updates the model API

    The old API call model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased') raises an exception, so this updates it to model = AutoModelWithLMHead.from_pretrained('allenai/scibert_scivocab_uncased').

    opened by ghltshubh 0
  • no N/A label in relation classification

    I noticed that in the relation classification datasets there is no entity pair with an 'N/A' label (no relation between the two entities). In other words, entities without any relations are not considered in training and inference. For example, when I check the SCIIE dataset, I find that only the following relations are considered: EVALUATE-FOR, PART-OF, USED-FOR, FEATURE-OF, CONJUNCTION, COMPARE, HYPONYM-OF. So basically, in the evaluation, SciBERT takes a pair of entities that is assumed to have an existing relation and classifies it.

    But obviously SCIIE has a lot of entity pairs that do not have any of these relations, and they are not considered at inference time in this case. This is very different from other works that extract relations with gold entities, which do not assume beforehand that a pair of input entities has an existing relation. Specifically, they make inferences for all possible pairs of entities, where the relation is classified over (defined relation types + N/A relation).

    Could you clarify why the N/A relation is not included in the training and the evaluation of SciBERT? Or am I missing something?

    opened by yyzhuang1991 0
  • Domain specific terms

    Hi,
    I want to pretrain SciBERT using additional data, and I want to enlarge the vocabulary with 100 additional "domain-specific" terms which are reserved for such usage. So I've figured out a way to extract a list of terms from my data.

    Let's suppose I have the following most relevant terms:

    "polymer"
    "materials"
    "chemistry"
    "polymers"
    

    What should I do with terms such as polymer and polymers? Include them both, or keep only the singular?

    Does anybody have information or recommendation on this?

    opened by lfoppiano 0