SciFive: a text-to-text transformer model for biomedical literature

Overview

SciFive


SciFive provides a text-to-text framework for biomedical language and natural language in NLP. Built on the T5 framework and described in the paper SciFive: a text-to-text transformer model for biomedical literature, SciFive achieves state-of-the-art and competitive results on multiple biomedical natural language tasks.

Google Cloud Storage

Our base Google Cloud Storage URI is at gs://scifive

As described in our paper, we make 6 versions of SciFive public, each benchmarked to achieve state-of-the-art results on different biomedical tasks. They are all available in our Google Cloud bucket, and we are working on releasing the models on HuggingFace as well.

Instructions for accessing Cloud Storage from the command line with the Python-based gsutil tool are described here
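As a rough illustration, gsutil can also be driven from Python. The sketch below only builds the command line for a recursive copy and does not run it. The `models/pubmed_pmc/base` path is taken from a user report further down this page, and `gsutil_copy_command` and the local directory name are hypothetical, so check the bucket listing before relying on either.

```python
import shlex

BUCKET_URI = "gs://scifive"  # base URI stated above


def gsutil_copy_command(remote_path: str, local_dir: str) -> list[str]:
    """Build a `gsutil -m cp -r` command for a path inside the SciFive bucket."""
    source = f"{BUCKET_URI}/{remote_path.strip('/')}"
    # -m parallelises the transfer; cp -r copies the directory recursively.
    return ["gsutil", "-m", "cp", "-r", source, local_dir]


# Assumed path and placeholder destination, for illustration only.
cmd = gsutil_copy_command("models/pubmed_pmc/base", "./scifive_base")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where gsutil is installed.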

gsutil URI for 6 SciFive models:

gsutil URI for Pretrain data:

Example

Below, we give an example of how to use SciFive on HuggingFace to generate MedNLI outputs. We also publish our SciFive model fine-tuned on MedNLI for reproducing experiments.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the SciFive checkpoint fine-tuned on MedNLI
tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Premise and hypothesis, joined with the "mednli:" task prefix
sent_1 = "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA."
sent_2 = "The patient is hemodynamically stable"
text = f"mednli: sentence1: {sent_1} sentence2: {sent_2}"

encoding = tokenizer(text, padding="max_length", max_length=256, truncation=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)

# The label is short, so a small max_length is enough
outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=8,
    early_stopping=True
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)
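To score several premise/hypothesis pairs, it can help to factor the prompt construction out of the example above. The helpers below are a sketch and not part of the SciFive code: `build_mednli_input` and `normalize_label` are hypothetical names, and the assumption that the model emits one of the three standard MedNLI labels as plain text should be checked against actual outputs.

```python
MEDNLI_LABELS = ("entailment", "contradiction", "neutral")


def build_mednli_input(premise: str, hypothesis: str) -> str:
    """Format a pair with the same task prefix used in the example above."""
    return f"mednli: sentence1: {premise.strip()} sentence2: {hypothesis.strip()}"


def normalize_label(generated: str):
    """Map decoded text to a MedNLI label, or None if it is not recognised."""
    text = generated.strip().lower()
    return text if text in MEDNLI_LABELS else None


pairs = [
    ("In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA.",
     "The patient is hemodynamically stable"),
]
texts = [build_mednli_input(premise, hypothesis) for premise, hypothesis in pairs]
print(texts[0])
```

The resulting `texts` list can be tokenized in one call with `tokenizer(texts, padding=True, return_tensors="pt")`, and the generated ids decoded with `tokenizer.batch_decode(outputs, skip_special_tokens=True)`.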

HuggingFace

Datasets

All of the fine-tuning datasets, already pre-processed into text-to-text format, are also available at this

📊   Expected Results

Citations

If you use the SciFive model or our code in a publication, please cite:

@misc{phan2021scifive,
      title={SciFive: a text-to-text transformer model for biomedical literature}, 
      author={Long N. Phan and James T. Anibal and Hieu Tran and Shaurya Chanana and Erol Bahadroglu and Alec Peltekian and Grégoire Altan-Bonnet},
      year={2021},
      eprint={2106.03598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Comments
  • the authentication error


    Hello, it appears that your DDI example (SciFive/finetune/re/ddi_1.ipynb) contains some errors. Could you please double-check that? For example, the line 'tensorflow_gcs_config.configure_gcs_from_colab_auth()' produces the following authentication error:

    Setting up GCS access...
    Running on TPU: grpc://10.20.166.202:8470
    ---------------------------------------------------------------------------
    FileNotFoundError                         Traceback (most recent call last)
    <ipython-input-3-8279b3908dbb> in <module>()
         16   auth.authenticate_user()
         17   tf.config.experimental_connect_to_host(TPU_ADDRESS)
    ---> 18   tensorflow_gcs_config.configure_gcs_from_colab_auth()
         19 
         20 tf.disable_v2_behavior()
    
    /usr/local/lib/python3.7/dist-packages/tensorflow_gcs_config/__init__.py in configure_gcs_from_colab_auth(device)
        130   adc_filename = os.environ.get(
        131       "GOOGLE_APPLICATION_CREDENTIALS", "/content/adc.json")
    --> 132   with open(adc_filename) as f:
        133     data = json.load(f)
        134   return configure_gcs(credentials=data, device=device)
    
    FileNotFoundError: [Errno 2] No such file or directory: '/content/adc.json'
    

    Please let us know if you have any suggestions for fixing the issue. Thank you in advance.

    opened by jeonge1 6
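The traceback in the comment above shows `configure_gcs_from_colab_auth` opening the file named by `GOOGLE_APPLICATION_CREDENTIALS`, falling back to `/content/adc.json`. A small check like the following can confirm whether that file exists before the call is made. This is only a diagnostic sketch based on the traceback: `resolve_adc_path` and `credentials_file_present` are hypothetical helpers, and the check does not fix the underlying authentication problem.

```python
import os

DEFAULT_ADC_PATH = "/content/adc.json"  # default path seen in the traceback


def resolve_adc_path(environ=None) -> str:
    """Return the credentials path that the traceback shows being opened."""
    env = os.environ if environ is None else environ
    return env.get("GOOGLE_APPLICATION_CREDENTIALS", DEFAULT_ADC_PATH)


def credentials_file_present(environ=None) -> bool:
    """True if the resolved credentials file exists on disk."""
    return os.path.isfile(resolve_adc_path(environ))


path = resolve_adc_path()
if not credentials_file_present():
    print(f"Credentials file not found: {path}")
```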
  • Vocab is not accessible, it is in gs://t5-data


    Hello, first of all, I want to say nice work!

    When I tried to reproduce your results on chemprot, I noticed the following auth issue in the code

    model.finetune(
        mixture_or_task_name="re_all",
        pretrained_model_dir=PRETRAINED_DIR,
        finetune_steps=FINETUNE_STEPS
    )
    
    2022-11-30 14:46:16.639835: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
    

    Turns out that this is caused by not being able to find vocab which is in 'gs://t5-data/vocabs/cc_all.32000/sentencepiece.model'. But currently only gs://scifive is accessible.

    Could you please release the vocab or share with us exactly how you obtained the sentencepiece vocab so that we can reproduce the results? Thank you!

    opened by cnut1648 4
  • SciFive pre-training not using the init checkpoint


    I am a PhD student trying to use your model for a research project.

    Looking at the pre-training notebooks, it seems you do not use an init checkpoint to continue the training of the t5 model. Is this because you already have checkpoints in your model directory or because you train t5 from zero instead of using an already pre-trained T5 model?

    opened by JorgeGabin 4
  • About the question_answer?


    How can I test the model on the QA task? Should I just input the text, like "How many teeth do humans have?", or do I need to add a prefix, like "QA: How many teeth do humans have?"

    opened by ZYuliang 4
  • pubmed-pmc-large version seems wrong


    Thanks for sharing the models. I found that the pubmed-pmc-large version (https://huggingface.co/razent/SciFive-large-Pubmed_PMC/tree/main) seems wrong as the fine-tuning results on MEDNLI are drastically worse than T5-large (acc 0.67+ vs 0.82+). However, the pubmed-large version gives good results.

    Could you confirm this version?

    opened by qiuhaolu 2
  • Weights for MedNLI


    Is it possible to share the fine-tuned MedNLI classifier with a simple code snippet for how to perform inference given a text and premise pair? I saw the notebook for fine-tuning but was wondering if the output could be open sourced. many thanks

    opened by griff4692 1
  • Need Config file for tensorflow checkpoint to Pytorch .bin model conversion (SciFive)


    Hi there,

    we want to convert your tensorflow model to pytorch model. We need the config file to do this, however we did not find the config file from this repo, https://console.cloud.google.com/storage/browser/scifive/models?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false

    could you please share the config file also?

    opened by Mihir3009 1
  • Do the gs: files in the code still exist?


    I am trying out the code in scifive_pretrain_base.ipynb.

    I got "OSError: Unable to open file" for all the gs: files in the code: 'gs://t5_training/t5-data/config/pretrained_models_google_base_operative_config.gin' and 'gs://mindxhack/bio_sentence_piece_small.txt'.

    I tried looking them up using the Google Cloud Storage browser and don't see these files.

    The browser does find the model files like gs://scifive/models/pubmed_pmc/base

    So the question is whether this is working code as is. Do these dependent files still exist on the cloud?

    opened by bhomass 2
  • hugging face models do not work


    Hi, thanks for your great contribution to the biomedical domain. I tried all the models in the Hugging Face format and I couldn't replicate any of the results or even get a reasonable output. Is there something wrong with the code or the model, or is anything missing?

    I ran the following code:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
    tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-base-PMC")  
    model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-base-PMC")
    model.to(device)
    sentence = "Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor ."
    text =  "ncbi_ner: " + sentence + " </s>"
    
    encoding = tokenizer.encode_plus(text, pad_to_max_length=True, return_tensors="pt")
    input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)
    
    outputs = model.generate(
        input_ids=input_ids, attention_mask=attention_masks,
        max_length=256,
        early_stopping=True
    )
    
    for output in outputs:
        line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        print(line)
    
    

    And this is the output:

    ncbi_ner: ncbi_ner: ncbi_ner:

    The expected output (based on the paper) should be as follows:

    Identification of APC2 , a homologue of the entity* adenomatous polyposis coli tumour *entity suppressor .

    I replaced the model with all other available large, base, pubmed, pmc, pubmed+pmc models (basically all 6 hugging face variations) but I didn't get any reasonable outputs.

    Could you give me a solution?

    opened by MHDBST 4
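The expected NER output quoted in the last comment wraps each mention in `entity*` and `*entity` markers. Assuming generated text follows exactly that markup, the mentions could be recovered with a short regular expression. The function names below are hypothetical and this is a sketch of post-processing, not code from the SciFive repository.

```python
import re

# Non-greedy match of the text between "entity*" and "*entity".
ENTITY_PATTERN = re.compile(r"entity\*\s*(.+?)\s*\*entity")


def extract_entities(tagged: str) -> list[str]:
    """Return the entity mentions marked up in a generated sentence."""
    return ENTITY_PATTERN.findall(tagged)


def strip_markup(tagged: str) -> str:
    """Remove the markers and collapse the leftover whitespace."""
    plain = tagged.replace("entity*", " ").replace("*entity", " ")
    return " ".join(plain.split())


expected = ("Identification of APC2 , a homologue of the entity* adenomatous "
            "polyposis coli tumour *entity suppressor .")
print(extract_entities(expected))
print(strip_markup(expected))
```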
Owner
Long Phan
A Computer Science student at Case Western Reserve University