SciFive
SciFive provides a text-to-text framework for biomedical language and natural language tasks in NLP. Built on the T5 framework and described in the paper SciFive: a text-to-text transformer model for biomedical literature, SciFive achieves state-of-the-art or competitive results on multiple biomedical natural language tasks.
Google Cloud Storage
Our base Google Cloud Storage URI is gs://scifive
As described in our paper, we make 6 versions of SciFive public, each benchmarked to achieve state-of-the-art results on different biomedical tasks. They are all available in our Google Cloud bucket, and we are working on releasing the models on HuggingFace as well.
Instructions for accessing Cloud Storage from the command line with the Python tool gsutil are described here
gsutil URIs for the 6 SciFive models:
- SciFive Pubmed+PMC Base: gs://scifive/models/pubmed_pmc/base
- SciFive Pubmed+PMC Large: gs://scifive/models/pubmed_pmc/large
- SciFive Pubmed Base: gs://scifive/models/pubmed/base
- SciFive Pubmed Large: gs://scifive/models/pubmed/large
- SciFive PMC Base: gs://scifive/models/pmc/base
- SciFive PMC Large: gs://scifive/models/pmc/large
gsutil URIs for the pretraining data:
- Pubmed: gs://scifive/pretrain/pubmed
- PMC: gs://scifive/pretrain/pmc
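The URIs above follow a regular pattern, so they can be generated rather than typed by hand. The following is a minimal sketch, not part of the SciFive codebase: the helper names (`model_uri`, `pretrain_uri`, `download_command`) and the `./models` destination are hypothetical, and it only assembles a standard `gsutil -m cp -r` command without running it. The layout of the files under each URI is not described here, so the sketch copies each prefix recursively.

```python
import shlex

BUCKET = "gs://scifive"
MODEL_CORPORA = ("pubmed_pmc", "pubmed", "pmc")
PRETRAIN_CORPORA = ("pubmed", "pmc")
SIZES = ("base", "large")


def model_uri(corpus, size):
    """Return the bucket URI of a SciFive model (hypothetical helper)."""
    if corpus not in MODEL_CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    return f"{BUCKET}/models/{corpus}/{size}"


def pretrain_uri(corpus):
    """Return the bucket URI of a pretraining corpus (hypothetical helper)."""
    if corpus not in PRETRAIN_CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    return f"{BUCKET}/pretrain/{corpus}"


def download_command(uri, dest="."):
    """Build (but do not run) a recursive, parallel gsutil copy command."""
    return ["gsutil", "-m", "cp", "-r", uri, dest]


if __name__ == "__main__":
    # Print one copy command per released model.
    for corpus in MODEL_CORPORA:
        for size in SIZES:
            cmd = download_command(model_uri(corpus, size), "./models")
            print(shlex.join(cmd))
```

To actually download, pass the returned list to `subprocess.run(cmd, check=True)` on a machine where gsutil is installed.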
Example
Below is an example of how to use SciFive from HuggingFace to generate MedNLI outputs. We also publish our SciFive model fine-tuned on MedNLI so that the experiments can be reproduced.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the SciFive checkpoint fine-tuned on MedNLI.
tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model.cuda()  # this example assumes a CUDA-capable GPU

sent_1 = "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA."
sent_2 = "The patient is hemodynamically stable"
# Inputs use the text-to-text format: a task prefix followed by both sentences.
text = f"mednli: sentence1: {sent_1} sentence2: {sent_2}"

encoding = tokenizer(text, padding="max_length", max_length=256, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")

# The predicted label is generated as a short text sequence.
outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=8,
    early_stopping=True
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)
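The prompt in the example follows a fixed template: the `mednli:` task prefix, then `sentence1:` and `sentence2:`. When scoring many pairs it can help to build the prompts with a small helper, sketched below. The function name `format_mednli` is hypothetical, and the second pair is an illustrative hypothesis written for this sketch, not an item taken from MedNLI.

```python
def format_mednli(sentence1, sentence2):
    """Format a premise/hypothesis pair with the template used in the example."""
    return f"mednli: sentence1: {sentence1} sentence2: {sentence2}"


premise = "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA."
pairs = [
    (premise, "The patient is hemodynamically stable"),
    (premise, "The patient is hypotensive"),  # illustrative only, not from the dataset
]

# One prompt string per (premise, hypothesis) pair.
prompts = [format_mednli(p, h) for p, h in pairs]

for prompt in prompts:
    print(prompt)
```

The resulting list can be passed to the tokenizer in one call, for instance `tokenizer(prompts, padding=True, return_tensors="pt")`, and the batch sent through `model.generate` as in the example above.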
HuggingFace
Datasets
All of the fine-tuning datasets, already pre-processed into text-to-text format, are also available at this link
Expected Results
Citations
If you use the SciFive models or our code in a publication, please cite:
@misc{phan2021scifive,
  title={SciFive: a text-to-text transformer model for biomedical literature},
  author={Long N. Phan and James T. Anibal and Hieu Tran and Shaurya Chanana and Erol Bahadroglu and Alec Peltekian and Grégoire Altan-Bonnet},
  year={2021},
  eprint={2106.03598},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}