

SciBERT

SciBERT is a BERT model trained on scientific text.

  • SciBERT is trained on papers from the corpus of semanticscholar.org. The corpus contains 1.14M papers and 3.1B tokens. We use the full text of the papers in training, not just the abstracts.

  • SciBERT has its own vocabulary (scivocab) that is built to best match the training corpus. We trained cased and uncased versions. We also include models trained on the original BERT vocabulary (basevocab) for comparison; see the tokenization sketch after this list.

  • SciBERT achieves state-of-the-art performance on a wide range of scientific-domain NLP tasks. The details of the evaluation are in the paper. Evaluation code and data are included in this repo.
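
As a quick illustration of the scivocab point above, here is a minimal sketch (assuming the Hugging Face transformers package and the allenai/scibert_scivocab_uncased and bert-base-uncased checkpoints) that compares how the two vocabularies split a piece of scientific text; scivocab typically produces fewer subword pieces for in-domain terminology:

from transformers import AutoTokenizer

# scivocab (SciBERT) vs. basevocab (original BERT)
sci_tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
base_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "immunohistochemistry of the hippocampus"
print(sci_tokenizer.tokenize(text))   # subword pieces under scivocab
print(base_tokenizer.tokenize(text))  # subword pieces under basevocab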

Downloading Trained Models

Update! SciBERT models are now installable directly within Hugging Face's transformers framework under the allenai org:

from transformers import AutoTokenizer, AutoModel

# scivocab, uncased
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

# scivocab, cased
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_cased')

We release both the TensorFlow and the PyTorch versions of the trained models. The TensorFlow version is compatible with code that works with the model from Google Research. The PyTorch version is created using the Hugging Face library, and this repo shows how to use it in AllenNLP. All combinations of scivocab and basevocab, cased and uncased models, are available below. Our evaluation shows that scivocab-uncased usually gives the best results.

Tensorflow Models

PyTorch AllenNLP Models

PyTorch HuggingFace Models

Using SciBERT in your own model

SciBERT models include all the necessary files to be plugged into your own model and are in the same format as BERT. If you are using TensorFlow, refer to Google's BERT repo; if you use PyTorch, refer to Hugging Face's repo, where detailed instructions on using BERT models are provided.
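
As a minimal PyTorch sketch of the Hugging Face route (assuming a recent transformers version; the example sentence and variable names are illustrative), contextual token embeddings can be extracted and passed to your own task-specific layers:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

text = "The patient was treated with cisplatin and radiotherapy."
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per wordpiece, shape (1, num_tokens, 768),
# ready to feed into your own downstream layers.
token_embeddings = outputs.last_hidden_state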

Training new models using AllenNLP

To run experiments on the different tasks and reproduce our results from the paper, you first need to set up the Python 3.6 environment:

pip install -r requirements.txt

which will install dependencies like AllenNLP.

Use the scibert/scripts/train_allennlp_local.sh script as an example of how to run an experiment (you'll need to modify paths and variable names like TASK and DATASET).

We include a broad set of scientific NLP datasets under the data/ directory, covering the following tasks. Each task has a sub-directory of available datasets.

├── ner
│   ├── JNLPBA
│   ├── NCBI-disease
│   ├── bc5cdr
│   └── sciie
├── parsing
│   └── genia
├── pico
│   └── ebmnlp
└── text_classification
    ├── chemprot
    ├── citation_intent
    ├── mag
    ├── rct-20k
    ├── sci-cite
    └── sciie-relation-extraction

For example, to run the model on the Named Entity Recognition (NER) task with the BC5CDR dataset (BioCreative V CDR), modify the scibert/scripts/train_allennlp_local.sh script as follows:

DATASET='bc5cdr'
TASK='ner'
...

Decompress the PyTorch model that you downloaded using:

tar -xvf scibert_scivocab_uncased.tar

This produces a scibert_scivocab_uncased directory containing two files: a vocabulary file (vocab.txt) and a weights file (weights.tar.gz). Copy the files to your desired location and then set the correct paths for BERT_WEIGHTS and BERT_VOCAB in the script:

export BERT_VOCAB=path-to/scibert_scivocab_uncased.vocab
export BERT_WEIGHTS=path-to/scibert_scivocab_uncased.tar.gz

Finally run the script:

./scibert/scripts/train_allennlp_local.sh [serialization-directory]

Where [serialization-directory] is the path to an output directory where the model files will be stored.

Citing

If you use SciBERT in your research, please cite SciBERT: Pretrained Language Model for Scientific Text.

@inproceedings{Beltagy2019SciBERT,
  title={SciBERT: Pretrained Language Model for Scientific Text},
  author={Iz Beltagy and Kyle Lo and Arman Cohan},
  year={2019},
  booktitle={EMNLP},
  Eprint={arXiv:1903.10676}
}

SciBERT is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

Comments
  • Pretrained sciroBERTa weights release in the works?

    Given the success of RoBERTa https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/ on GLUE benchmarks and the like, is training a RoBERTa model over the Semantic Scholar corpus planned for release in this repo in the foreseeable future?

    Otherwise, can someone provide hints about how to train a roBERTa model on the semantic scholar corpus and the compute time needed for the purpose? Thanks!

    opened by davidefiocco 16
  • TypeError: '_NamespacePath' object is not subscriptable

    When I finished the previous work mentioned above, I encountered this problem: in src/allennlp/allennlp/common/util.py, line 316, in import_submodules, the line path_string = ' ' if not path else path[0] raises TypeError: '_NamespacePath' object is not subscriptable.

    What should I do to deal with this problem? Thanks in advance.

    opened by Nicozwy 11
  • Pretraining SciBERT

    Hi, The repo does not seem to contain the codes to pretrain the model on semantic scholar. Do you plan to release those codes and the pretrain data? Thanks!

    Yichong

    opened by xycforgithub 8
  • Question Over Punctuation Charts in Vocab Creation

    Hi

    Interesting to see you looked at creating your own vocab. It appears that with BERT they used a special variant, and neither the code nor the exact details of what was run have been made available. In your cheatsheet I've found reference to your use of Google's SentencePiece through the Python wrapper, by the looks of things. I wondered if you had any more specific details on preparation and post-processing, as the output by default won't include custom tokens, nor does it use ## etc. I have a pretty good idea of how you likely did most of it, but it would be nice to know for sure. Also, in the cheatsheet the command used sets the vocab size to 31K, not 30K, and the length of your vocab files differs from BERT's too.

    Could you also comment on the significance of the "[unused15]" kind of entries, as the base BERT vocab has 994 of them whereas you only have 100? I haven't found any details on what these are used for or what created them.

    More significantly (maybe), I was curious as to whether you had looked at any preprocessing (additional tokenisation) before running SentencePiece. The reason being that the BERT tokeniser does whitespace and punctuation splitting before WordPiece tokenisation over the resulting tokens. It looks as though this could be part of what WordPiece does as well, based on the occurrence of lots of single-character entries (with and without ## prefixed) and the lack of entries containing punctuation characters along with letters or digits. For example, you have entries like "(1997),", whereas the BERT vocab doesn't have anything like this (with the exception of symbol characters not classed as punctuation). One issue with these entries as they stand is that when you apply the BERT tokeniser as part of task training you are never going to use such an entry, so even though you have a ~30K vocab size a portion of it will not be used, and therefore the neural model is going to have unused capacity. There is also the potential to change the number of possible occurrences of UNK tokens resulting from the tokenisation steps.

    Following on from that, there is the question of whether this has possibly impacted (negatively or positively) the results you have seen, as a side effect of this difference. So any more information on what may also have been tried in this area would be of interest to hear.

    Thanks

    Tony

    opened by antonyscerri 7
  • No weights.tar.gz in the models

    Hello,

    Following the instructions on how to set up and run SciBERT, I stumbled upon the problem that the file weights.tar.gz isn't present in the pre-trained models. However, without this file the entire execution pipeline breaks.

    opened by dalevskaya 6
  • Why use of CNN char embedding?

    "To keep things simple, we use minimal task specific architectures atop BERT-Base and SCIBERT embeddings. Each token is represented as the concatenation of its BERT embedding with a CNN-based character embedding. If the token has multiple BERT subword units, we use the first one."

    Why the use of additional CNN-based char embeddings? Many (most?) papers using BERT (or similar) use only the embedding coming out of the LM-based model.

    Was there a big additional uptick from layering in the CNN-based char embeddings?

    opened by cbockman 6
  • Pre-training parameters

    Hi,

    I'm currently training a BERT model from scratch using the same parameters as specified in scripts/cheatsheet.txt.

    @ibeltagy Could you confirm that these parameters are up-to-date 🤔

    Loss seems to be fine, but I'm just wondering why training both the 128 and 512 seq len models with 3B tokens on a v3-8 TPU is a lot faster than your reported training time.

    opened by stefan-it 5
  • SciBert Checkpoints not compatible with Bert

    As written in the GitHub repo of SciBERT, we can use the SciBERT checkpoints for TensorFlow by plugging them into the BERT code at google-research/bert ( https://github.com/google-research/bert ), but when I tried doing the same I got the following data loss error:

    DataLossError: Unable to open table file /content/scibert_data/scibert_scivocab_uncased/bert_model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

    opened by PradyumnaGupta 4
  • Generation of scivocab

    Hi scibert-team,

    I have one question regarding the generation of your own scivocab.

    In the paper, it was mentioned that SentencePiece was used. Could you provide the parameters you used for generating the vocab?

    How did you manage the conversion from the SentencePiece annotation to ##, which is used in the BERT tokenizer 🤔 I would really like to train my own BERT model, but the vocab generation seems a bit complicated...

    Thanks in advance,

    Stefan

    opened by stefan-it 4
  • How to Predict NER

    There is only a command that tells us how to train models. Could you provide some prediction commands, e.g. how do I use the trained NER model to get predicted tags? Thank you very much.

    opened by lizaigaoge550 4
  • How to get Sentence embedding using pre-trained SciBERT weights?

    The instructions to use SciBERT say this

    SciBERT models include all necessary files to be plugged in your own model and are in same format as BERT. If you are using Tensorflow, refer to Google's BERT repo and if you use PyTorch, refer to Hugging Face's repo where detailed instructions on using BERT models are provided.

    However, the instructions for using BERT include loading it from tf.hub, and I don't see SciBERT on TF Hub, so I am unable to figure out how to get a sentence embedding for SciBERT.

    This is my attempt in Python. So far I cloned the repository and loaded the weights, but I don't know how to get the sentence/paragraph vector.

    This is my code so far

    import sys
    import tensorflow as tf

    # Clone the SciBERT repo and make it importable
    !test -d SciBert_repo || git clone https://github.com/allenai/scibert SciBert_repo
    if 'SciBert_repo' not in sys.path:
        sys.path += ['SciBert_repo']

    import extract_features

    with tf.Session(graph=graph) as session:
        saver.restore(session, 'SciBert_repo.ckpt')

    I know this is based on the original BERT code. For regular BERT they have you use tf.hub, but I'm guessing the setup is pretty similar. This is my code for regular BERT

    pip install bert-tensorflow
    
    import tensorflow as tf
    import tensorflow_hub as hub
    
    import bert
    from bert import run_classifier
    from bert import optimization
    from bert import tokenization
    
    import pandas as pd
    
    from tensorflow import keras
    import os
    import re
    
    from tensorflow.keras import backend as K
    
    from bert.tokenization import FullTokenizer
    
    bert_path = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
    
    sess = tf.Session()
    
    bert_module = hub.Module(
      bert_path,
      trainable=True)
    
    #Basically this is a function to convert text into a format BERT understands
    def bertInputsFromText(text):
    .
    .
    .
    
    bert_inputs = bertInputsFromText("This is a test sentence")
    
    sentence_embedding = bert_module(inputs=bert_inputs, signature="tokens", as_dict=True)[
            "pooled_output"
    ]

    So I'm guessing my question boils down to: what do I use as an equivalent of bert_module?
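
    A minimal sketch of one way to do this without tf.hub, assuming the Hugging Face transformers package and simple mean pooling over SciBERT's last hidden layer (the pooling strategy is an assumption, not something the repo prescribes):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
    model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

    inputs = tokenizer("This is a test sentence", return_tensors='pt')
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state  # (1, num_tokens, 768)

    # Mean-pool the wordpiece vectors into a single sentence vector
    sentence_embedding = last_hidden.mean(dim=1)  # (1, 768)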

    opened by Santosh-Gupta 4
  • Running inference for PICO tasks

    I have my own dataset consisting of a few hundred abstracts and I want to see baseline performance using SciBERT's PICO functionality. Are there code snippets for easily running inference on your own dataset and just seeing how it classifies? I tried to use the Hugging Face models and the AWS deploy code, but I'm not sure how to use it for a PICO task, let alone interpret the text classification outputs it gives, which just seem to be "LABEL_0" or "LABEL_1". Any help on this would be much appreciated!

    opened by SLK121 0
  • Using SciBERT on GIEC report

    Hi everybody,

    We are a group of university students currently working on a research project in an NLP class. We extracted text from a GIEC report and built a Knowledge Graph from it. We wanted to know if it is possible to use a pre-trained version of SciBERT and train it with our own KG.

    As we are pretty new to NLP, we don't have any idea of the feasibility of this, or how to start.

    Would any of you have any pointers?

    Thanks, Hugo M

    opened by CSHugoM 0
  • Updates the model API

    The old API call model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased') raises an exception, so this updates it to model = AutoModelWithLMHead.from_pretrained('allenai/scibert_scivocab_uncased').

    opened by ghltshubh 0
  • no N/A label in relation classification

    I noticed that in the relation classification datasets there is no entity pair with an 'N/A' label (no relation between the two entities). In other words, entities without any relations are not considered in training and inference. For example, when I check the SCIIE dataset, I find that only the following relations are considered: EVALUATE-FOR, PART-OF, USED-FOR, FEATURE-OF, CONJUNCTION, COMPARE, HYPONYM-OF. So basically, in the evaluation, SciBERT takes a pair of entities that is assumed to have an existing relation and classifies it.

    But obviously SCIIE has a lot of entity pairs that do not have any of these relations, and they are not considered at inference time in this case. This is very different from other works that extract relations with gold entities, which do not assume beforehand that a pair of input entities has an existing relation. Specifically, they make inferences for all possible pairs of entities, where the relation is classified over (defined relation types + N/A relation).

    Could you clarify why the N/A relation is not included in the training and the evaluation of SciBERT? Or am I missing something?

    opened by yyzhuang1991 0
  • Domain specific terms

    Hi,
    I want to pretrain SciBERT using additional data, and I want to enlarge the vocabulary with 100 additional "domain-specific" terms which are reserved for such usage. So I've figured out a way to extract a list of terms from my data.

    Let's suppose I have the following most relevant terms:

    "polymer"
    "materials"
    "chemistry"
    "polymers"
    

    What should I do with terms such as polymer and polymers? Include them both, or keep only the singular?

    Does anybody have information or recommendation on this?

    opened by lfoppiano 0