RITA is a family of autoregressive protein models, developed by LightOn in collaboration with the OATML group at Oxford and the Debora Marks Lab at Harvard.

Overview

RITA: a Study on Scaling Up Generative Protein Sequence Models

Model    #Params   d_model   Layers   LM loss (UniRef-100)
Small    85M       768       12       2.31
Medium   300M      1024      24       2.01
Large    680M      1536      24       1.82
XLarge   1.2B      2048      24       1.70

Results

For full results see our preprint: https://arxiv.org/abs/2205.05789

Usage

Instantiate a model like so:

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("lightonai/RITA_s", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")

For generation, we support pipelines:

from transformers import pipeline
rita_gen = pipeline('text-generation', model=model, tokenizer=tokenizer)
sequences = rita_gen("MAB", max_length=20, do_sample=True, top_k=950, repetition_penalty=1.2, 
                     num_return_sequences=2, eos_token_id=2)
for seq in sequences:
    print(f"seq: {seq['generated_text'].replace(' ', '')}")

Or see example.py
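
If the pipeline reports that 'RITAModelForCausalLM' is not supported for text-generation (an error several users hit in the comments below), one workaround is to sample with model.generate directly. The following is a minimal sketch, assuming the model and tokenizer loaded above and the standard Hugging Face generation API; the sampling arguments simply mirror the pipeline call.

import torch

# Encode the prompt; return_tensors="pt" yields a batch of shape (1, prompt_length).
input_ids = tokenizer("MAB", return_tensors="pt").input_ids

with torch.no_grad():
    outputs = model.generate(
        input_ids,
        max_length=20,
        do_sample=True,
        top_k=950,
        repetition_penalty=1.2,
        num_return_sequences=2,
        eos_token_id=2,
    )

# Decode each sampled sequence and strip the spaces the tokenizer inserts between tokens.
for output_ids in outputs:
    print(tokenizer.decode(output_ids, skip_special_tokens=True).replace(" ", ""))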

How to cite

@article{hesslow2022rita,
  title={RITA: a Study on Scaling Up Generative Protein Sequence Models},
  author={Hesslow, Daniel and Zanichelli, Niccol{\'o} and Notin, Pascal and Poli, Iacopo and Marks, Debora},
  journal={arXiv preprint arXiv:2205.05789},
  year={2022}
}
Comments
  • Cannot run pipeline to generate sequences

    Dear Author,

    I followed the example code using pipeline:

    rita_gen = pipeline('text-generation', model=model, tokenizer=tokenizer)
    sequences = rita_gen("MAB", max_length=20, do_sample=True, top_k=950, repetition_penalty=1.2,
                         num_return_sequences=2, eos_token_id=2)

    I got: The model 'RITAModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'B......'

    My transformers version is 4.22.0.dev0.

    Please help. Thanks.

    opened by yzhang-github-pub 2
  • ESM's log_likelihood equivalent calculation for RITA?

    Hi, is there a way to calculate the equivalent of ESM's log_likelihood (protein stability, a.k.a. fitness) for a protein sequence using this repository? Thanks in advance. (A minimal scoring sketch follows this comments list.)

    opened by avilella 2
  • cannot run example

    Hi,

    I'm trying to run your example, but seeing

    /root/.cache/huggingface/modules/transformers_modules/lightonai/RITA_s/rita_modeling_ad8dfe2e2240d5a6abcfa9ea52f3868a35f07f4670c887c31d420f6bba0fdc5f_3db732a66a256d53b183fe3a079bbfbf2db700c63f7789178c37924666055f28.py in <module>()
         18 from transformers.utils import logging
         19 
    ---> 20 from .rita_configuration import RITAConfig
         21 import torch.nn.functional as F
         22 logger = logging.get_logger(__name__)
    
    ModuleNotFoundError: No module named 'transformers_modules.lightonai.RITA_s.rita_configuration'
    
    opened by lucidrains 2
  • attention_mask in RITA is different from that in other Huggingface transformer models

    The attention_mask in Huggingface transformer models is a padding mask, where 1 means not masked and 0 means masked (https://huggingface.co/docs/transformers/main/en/glossary#attention-mask). I think in the RITA model it is named padding_mask.

    However, the RITAModelForCausalLM class does not have the argument padding_mask. Instead, it passes the attention_mask argument into the DecoderLayer class like this: x = layer(x, attn_mask=attention_mask), where attn_mask in RITA is the causal mask instead of the padding mask. This will lead to unexpected results.

    So, I think it might be better to keep the same API as other Huggingface transformer models, i.e. let users pass the Huggingface-style attention_mask into the model as a padding mask.

    Thank you!

    opened by zzhongzz 2
  • What is training input data format?

    Dear Author,

    I am fine-tuning your pretrained RITA on protein family data, using the run_clm.py script from Hugging Face. I tried this format, where seq1 and seq2 are protein sequences 1 and 2 without whitespace:

    seq1 <|endoftext|> seq2 <|endoftext|> ...

    and also this format: seq1 seq2 ...

    Training seemed to be successful. However, sequences generated by the fine-tuned model contain a lot of '' tokens.

    Please advise. Thanks.

    opened by yzhang-github-pub 1
  • Cannot find model

    Hi! I'm trying to run the example and have met the following errors:

    >>> model = AutoModelForCausalLM.from_pretrained("lightonai/RITA_s, trust_remote_code=True")
    Traceback (most recent call last):
      File "/data/home/zhongkai/miniconda/envs/protein_ssf_aws/lib/python3.8/site-packages/transformers/configuration_utils.py", line 601, in _get_config_dict
        resolved_config_file = cached_path(
      File "/data/home/zhongkai/miniconda/envs/protein_ssf_aws/lib/python3.8/site-packages/transformers/utils/hub.py", line 284, in cached_path
        output_path = get_from_cache(
      File "/data/home/zhongkai/miniconda/envs/protein_ssf_aws/lib/python3.8/site-packages/transformers/utils/hub.py", line 495, in get_from_cache
        _raise_for_status(r)
      File "/data/home/zhongkai/miniconda/envs/protein_ssf_aws/lib/python3.8/site-packages/transformers/utils/hub.py", line 417, in _raise_for_status
        raise RepositoryNotFoundError(
    transformers.utils.hub.RepositoryNotFoundError: 401 Client Error: Repository not found for url: https://huggingface.co/lightonai/RITA_s,%20trust_remote_code=True/resolve/main/config.json. If the repo is private, make sure you are authenticated.
    

    What am I doing wrong here? Appreciate the help, thanks!

    opened by zzk1st 1
  • Unable to run Example Script

    Unable to run a script similar to the example

    >>> from transformers import pipeline
    >>> from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
    >>> tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")
    >>> model = AutoModelForCausalLM.from_pretrained("lightonai/RITA_s", trust_remote_code=True)
    Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
    Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
    >>> rita_gen = pipeline('text-generation', model=model, tokenizer=tokenizer)
    The model 'RITAModelForCausalLM' is not supported for text-generation. Supported models are ['XGLMForCausalLM', 'PLBartForCausalLM', 'QDQBertLMHeadModel', 'TrOCRForCausalLM', 'GPTJForCausalLM', 'RemBertForCausalLM', 'RoFormerForCausalLM', 'BigBirdPegasusForCausalLM', 'GPTNeoForCausalLM', 'BigBirdForCausalLM', 'CamembertForCausalLM', 'XLMRobertaXLForCausalLM', 'XLMRobertaForCausalLM', 'RobertaForCausalLM', 'BertLMHeadModel', 'OpenAIGPTLMHeadModel', 'GPT2LMHeadModel', 'TransfoXLLMHeadModel', 'XLNetLMHeadModel', 'XLMWithLMHeadModel', 'ElectraForCausalLM', 'CTRLLMHeadModel', 'ReformerModelWithLMHead', 'BertGenerationDecoder', 'XLMProphetNetForCausalLM', 'ProphetNetForCausalLM', 'BartForCausalLM', 'OPTForCausalLM', 'MBartForCausalLM', 'PegasusForCausalLM', 'MarianForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'MegatronBertForCausalLM', 'Speech2Text2ForCausalLM', 'Data2VecTextForCausalLM'].
    

    This is with Python 3.8, tokenizers 0.12.1, and transformers 4.19.2.

    Additionally, are there more details about your prompt tuning? Curious to know how you approached it and what prompt engineering looks like for proteins as opposed to language.

    opened by zanussbaum 1
  • How to obtain perplexity evaluation datasets?

    Dear Author,

    Thanks for releasing RITA for protein generation! However, I wonder how I can obtain the perplexity evaluation datasets used in your paper and how to calculate perplexity. Hoping for your suggestions. Thanks in advance! (A minimal scoring sketch follows this comments list.)

    opened by LGH1gh 0
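
On the log-likelihood and perplexity questions above: for a causal language model, the mean per-token negative log-likelihood of a sequence is what the Hugging Face causal-LM interface returns as its loss when labels are supplied, and perplexity is the exponential of that loss. The snippet below is a minimal sketch under the assumption that RITA's ForCausalLM head accepts labels like other Hugging Face causal LMs; it is not the evaluation script used in the paper, and the example sequence and the score_sequence helper name are ours.

import torch

def score_sequence(seq, model, tokenizer):
    # Tokenize the raw amino-acid string into model inputs.
    input_ids = tokenizer(seq, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, Hugging Face causal LMs return the mean
        # next-token cross-entropy, i.e. the negative log-likelihood per predicted token.
        loss = model(input_ids=input_ids, labels=input_ids).loss
    n_predicted = input_ids.shape[1] - 1          # positions actually predicted (targets are shifted by one)
    log_likelihood = -loss.item() * n_predicted   # total log-probability of the sequence, in nats
    perplexity = torch.exp(loss).item()
    return log_likelihood, perplexity

ll, ppl = score_sequence("MSKGEELFTGVVPILVELDGDVNGHKFSVSG", model, tokenizer)
print(f"log-likelihood: {ll:.2f} nats | perplexity: {ppl:.2f}")
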
Owner
LightOn
At LightOn, we unlock Extreme-Scale Machine Intelligence. Most repos are focused on the use of photonic hardware. LightOnMuse connects to foundation models.