SciFive: a text-text transformer model for biomedical literature

Long Phan

Last update: Dec 24, 2022

Related tags

Deep Learning nlp pubmed transformer attention transfer-learning pretrained-models pmc biomedical-language

Overview

SciFive

SciFive provides a text-to-text framework for biomedical language and natural language in NLP. Built on the T5 framework and described in the paper SciFive: a text-to-text transformer model for biomedical literature, SciFive achieves state-of-the-art and competitive results on multiple biomedical natural language tasks.

Google Cloud Storage

Our base Google Cloud Storage URI is gs://scifive

As described in our paper, we release 6 versions of SciFive, each benchmarked to achieve state-of-the-art results on different biomedical tasks. They are all available in our Google Cloud bucket, and we are also working on releasing the models on HuggingFace.

Instructions for accessing Cloud Storage from the command line with the gsutil tool are described here

gsutil URIs for the 6 SciFive models:

SciFive Pubmed+PMC Base: gs://scifive/models/pubmed_pmc/base
SciFive Pubmed+PMC Large: gs://scifive/models/pubmed_pmc/large
SciFive Pubmed Base: gs://scifive/models/pubmed/base
SciFive Pubmed Large: gs://scifive/models/pubmed/large
SciFive PMC Base: gs://scifive/models/pmc/base
SciFive PMC Large: gs://scifive/models/pmc/large

gsutil URIs for the pretraining data:

Pubmed: gs://scifive/pretrain/pubmed
PMC: gs://scifive/pretrain/pmc
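
The URIs listed above follow a regular pattern, so they can be composed programmatically. The sketch below is a minimal illustration and not part of the SciFive code: the helper names (`model_uri`, `pretrain_uri`, `gsutil_copy_command`) and the local destination directory are hypothetical, and it assumes the bucket layout is exactly as listed and that gsutil is installed. It only builds the download command; it does not run it.

```python
# Hypothetical helpers that rebuild the gsutil URIs listed above.
BUCKET = "gs://scifive"
MODEL_CORPORA = ("pubmed_pmc", "pubmed", "pmc")
MODEL_SIZES = ("base", "large")
PRETRAIN_CORPORA = ("pubmed", "pmc")


def model_uri(corpus, size):
    """Return the URI of one of the 6 released models."""
    if corpus not in MODEL_CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    if size not in MODEL_SIZES:
        raise ValueError(f"unknown size: {size!r}")
    return f"{BUCKET}/models/{corpus}/{size}"


def pretrain_uri(corpus):
    """Return the URI of one of the pretraining corpora."""
    if corpus not in PRETRAIN_CORPORA:
        raise ValueError(f"unknown pretraining corpus: {corpus!r}")
    return f"{BUCKET}/pretrain/{corpus}"


def gsutil_copy_command(uri, dest_dir):
    """Build (but do not run) a recursive, parallel gsutil copy command."""
    return ["gsutil", "-m", "cp", "-r", uri, dest_dir]


if __name__ == "__main__":
    for corpus in MODEL_CORPORA:
        for size in MODEL_SIZES:
            print(model_uri(corpus, size))
    # "./scifive_models" is an arbitrary local directory chosen for illustration.
    print(" ".join(gsutil_copy_command(model_uri("pubmed_pmc", "base"), "./scifive_models")))
```

The returned command list can be passed to `subprocess.run` once gsutil is configured on the machine.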

Example

Below, we give an example of how to use SciFive on Huggingface to generate MedNLI outputs. We also publish our SciFive finetuned on MedNLI for reproducing experiments.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model.to(device)

sent_1 = "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA."
sent_2 = "The patient is hemodynamically stable"
# MedNLI inputs use the "mednli:" task prefix followed by the two sentences.
text = f"mednli: sentence1: {sent_1} sentence2: {sent_2}"

# Tokenize and pad to a fixed length of 256 tokens.
encoding = tokenizer(text, padding='max_length', max_length=256, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)

# The label is short, so a small max_length is enough for generation.
outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=8,
    early_stopping=True
)

# Decode each generated sequence back to text and print the predicted label.
for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)
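
To score several sentence pairs at once, the same prompt format can be wrapped in a small helper. This is a sketch under stated assumptions rather than code from the SciFive repository: `format_mednli` and `predict_mednli` are hypothetical names, the prompt template is copied from the example above, and the model is only downloaded when `predict_mednli` is actually called.

```python
def format_mednli(sent_1, sent_2):
    """Build the text-to-text input used by the MedNLI example above."""
    return f"mednli: sentence1: {sent_1} sentence2: {sent_2}"


def predict_mednli(pairs, model_name="razent/SciFive-large-Pubmed_PMC-MedNLI"):
    """Generate one output string per (sentence1, sentence2) pair.

    Imports are done lazily so that format_mednli can be used
    without torch or transformers installed.
    """
    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    model.eval()

    texts = [format_mednli(s1, s2) for s1, s2 in pairs]
    # Pad to the longest input in the batch and cap inputs at 256 tokens.
    encoding = tokenizer(
        texts, padding=True, truncation=True, max_length=256, return_tensors="pt"
    ).to(device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=encoding["input_ids"],
            attention_mask=encoding["attention_mask"],
            max_length=8,
        )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `predict_mednli([(sent_1, sent_2)])` with the two sentences from the example should behave like the single-pair code, returning a list with one decoded string.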

HuggingFace

SciFive Pubmed+PMC: Base | Large
SciFive Pubmed: Base | Large
SciFive PMC: Base | Large

Datasets

All of the fine-tuning datasets, already pre-processed into text-to-text format, are also available at this link

📊 Expected Results

Citations

If you use SciFive model or our code for publications, please cite:

@misc{phan2021scifive,
      title={SciFive: a text-to-text transformer model for biomedical literature}, 
      author={Long N. Phan and James T. Anibal and Hieu Tran and Shaurya Chanana and Erol Bahadroglu and Alec Peltekian and Grégoire Altan-Bonnet},
      year={2021},
      eprint={2106.03598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Comments

the authentication error

Hello, it appears that your DDI example (SciFive/finetune/re/ddi_1.ipynb) contains some errors. Could you please double-check it? For example, the line 'tensorflow_gcs_config.configure_gcs_from_colab_auth()' produces the following authentication error:

Setting up GCS access...
Running on TPU: grpc://10.20.166.202:8470
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-3-8279b3908dbb> in <module>()
     16   auth.authenticate_user()
     17   tf.config.experimental_connect_to_host(TPU_ADDRESS)
---> 18   tensorflow_gcs_config.configure_gcs_from_colab_auth()
     19 
     20 tf.disable_v2_behavior()

/usr/local/lib/python3.7/dist-packages/tensorflow_gcs_config/__init__.py in configure_gcs_from_colab_auth(device)
    130   adc_filename = os.environ.get(
    131       "GOOGLE_APPLICATION_CREDENTIALS", "/content/adc.json")
--> 132   with open(adc_filename) as f:
    133     data = json.load(f)
    134   return configure_gcs(credentials=data, device=device)

FileNotFoundError: [Errno 2] No such file or directory: '/content/adc.json'

Please let us know if you have any suggestion to fix the issue. Thank you in advance.
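
The traceback above shows which file the failing call tries to open: the path in the GOOGLE_APPLICATION_CREDENTIALS environment variable, or /content/adc.json when that variable is unset. The sketch below mirrors only that lookup so the missing file can be checked before the call is made. It is a diagnostic aid, not a fix; the function names `adc_path` and `adc_file_exists` are hypothetical and do not belong to tensorflow_gcs_config.

```python
import os


def adc_path(environ=None):
    """Return the credentials path that the traceback shows being opened."""
    env = os.environ if environ is None else environ
    return env.get("GOOGLE_APPLICATION_CREDENTIALS", "/content/adc.json")


def adc_file_exists(environ=None):
    """Report whether the credentials file is present on disk."""
    return os.path.isfile(adc_path(environ))


if __name__ == "__main__":
    path = adc_path()
    status = "found" if adc_file_exists() else "missing"
    print(f"credentials file {path}: {status}")
```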

opened by jeonge1 6

Vocab is not accessible, it is in gs://t5-data

Hello, first of all, I want to say nice work!

When I want to reproduce your results on chemprot, I notice the following auth issue in the code

model.finetune(
    mixture_or_task_name="re_all",
    pretrained_model_dir=PRETRAINED_DIR,
    finetune_steps=FINETUNE_STEPS
)

2022-11-30 14:46:16.639835: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".

It turns out that this is caused by not being able to find the vocab, which is at 'gs://t5-data/vocabs/cc_all.32000/sentencepiece.model'. Currently only gs://scifive is accessible.

Could you please release the vocab, or share how exactly you obtained the sentencepiece vocab, so that we can reproduce the results? Thank you!

opened by cnut1648 4

SciFive pre-training not using the init checkpoint

I am a PhD student trying to use your model for a research project.

Looking at the pre-training notebooks, it seems you do not use an init checkpoint to continue the training of the t5 model. Is this because you already have checkpoints in your model directory or because you train t5 from zero instead of using an already pre-trained T5 model?

opened by JorgeGabin 4

About the question_answer?

How can I test the model on the QA task? Should I just input the text, like "How many teeth do humans have?", or do I need to add a prefix, like "QA: How many teeth do humans have?"

opened by ZYuliang 4

pubmed-pmc-large version seems wrong

Thanks for sharing the models. I found that the pubmed-pmc-large version (https://huggingface.co/razent/SciFive-large-Pubmed_PMC/tree/main) seems wrong as the fine-tuning results on MEDNLI are drastically worse than T5-large (acc 0.67+ vs 0.82+). However, the pubmed-large version gives good results.

Could you confirm this version?

opened by qiuhaolu 2

Weights for MedNLI

Is it possible to share the fine-tuned MedNLI classifier with a simple code snippet for how to perform inference given a text and premise pair? I saw the notebook for fine-tuning but was wondering if the output could be open sourced. many thanks

opened by griff4692 1

Need Config file for tensorflow checkpoint to Pytorch .bin model conversion (SciFive)

Hi there,

we want to convert your tensorflow model to pytorch model. We need the config file to do this, however we did not find the config file from this repo, https://console.cloud.google.com/storage/browser/scifive/models?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false

could you please share the config file also?

opened by Mihir3009 1

Do the gs: files in the code still exist?

I am trying out the code in scifive_pretrain_base.ipynb.

I got OSError: Unable to open file for all the gs: files in the code: 'gs://t5_training/t5-data/config/pretrained_models_google_base_operative_config.gin' and 'gs://mindxhack/bio_sentence_piece_small.txt'.

I tried looking them up using the Google Cloud Storage browser and don't see these files.

The browser does find the model files like gs://scifive/models/pubmed_pmc/base

So the question is whether this is working code as is. Do these dependent files still exist on the cloud?

opened by bhomass 2

hugging face models do not work

Hi, thanks for your great contribution to the biomedical domain. I tried all the models in the Hugging Face format and couldn't replicate any of the results or even get a reasonable output. Is there something wrong with the code or the model, or is something missing?

I ran the following code:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-base-PMC")  
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-base-PMC")
model.to(device)
sentence = "Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor ."
text =  "ncbi_ner: " + sentence + " </s>"

encoding = tokenizer.encode_plus(text, pad_to_max_length=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=256,
    early_stopping=True
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)

And this is the output:

ncbi_ner: ncbi_ner: ncbi_ner:

The expected output (based on the paper) should be as follows:

Identification of APC2 , a homologue of the entity* adenomatous polyposis coli tumour *entity suppressor .
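
The expected output above marks each entity span with an opening `entity*` and a closing `*entity` marker. A minimal sketch of reading that format back is shown here; it assumes well-formed, non-nested markers exactly as in the quoted line, and the helper names `extract_entities` and `strip_entity_tags` are hypothetical, not part of the SciFive code.

```python
import re

# Opening marker "entity*", closing marker "*entity", as in the quoted output.
ENTITY_SPAN = re.compile(r"entity\*\s*(.*?)\s*\*entity")
ENTITY_MARKER = re.compile(r"\s*(?:entity\*|\*entity)\s*")


def extract_entities(tagged):
    """Return the text of every span wrapped in entity* ... *entity."""
    return ENTITY_SPAN.findall(tagged)


def strip_entity_tags(tagged):
    """Remove the markers and return the plain sentence."""
    return ENTITY_MARKER.sub(" ", tagged).strip()


if __name__ == "__main__":
    tagged = (
        "Identification of APC2 , a homologue of the "
        "entity* adenomatous polyposis coli tumour *entity suppressor ."
    )
    print(extract_entities(tagged))
    print(strip_entity_tags(tagged))
```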

I replaced the model with all other available large, base, pubmed, pmc, pubmed+pmc models (basically all 6 hugging face variations) but I didn't get any reasonable outputs.

Could you give me a solution?

opened by MHDBST 4
