RITA is a family of autoregressive protein models, developed by LightOn in collaboration with the OATML group at Oxford and the Debora Marks Lab at Harvard.

Overview

RITA: a Study on Scaling Up Generative Protein Sequence Models

Model    #Params   d_model   Layers   LM loss (UniRef-100)
Small    85M       768       12       2.31
Medium   300M      1024      24       2.01
Large    680M      1536      24       1.82
XLarge   1.2B      2048      24       1.70

Results

For full results see our preprint: https://arxiv.org/abs/2205.05789

Usage

Instantiate a model like so:

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("lightonai/RITA_s", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")

For generation, we support pipelines:

from transformers import pipeline
rita_gen = pipeline('text-generation', model=model, tokenizer=tokenizer)
sequences = rita_gen("MAB", max_length=20, do_sample=True, top_k=950, repetition_penalty=1.2, 
                     num_return_sequences=2, eos_token_id=2)
for seq in sequences:
    print(f"seq: {seq['generated_text'].replace(' ', '')}")

Or see example.py
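
If the pipeline reports that 'RITAModelForCausalLM' is not supported for text-generation (an error several users hit in the comments below), one workaround is to sample with model.generate directly. The following is a minimal sketch, assuming the model and tokenizer loaded above and the standard Hugging Face generation API; the sampling arguments simply mirror the pipeline call.

import torch

# Encode the prompt; return_tensors="pt" yields a batch of shape (1, prompt_length).
input_ids = tokenizer("MAB", return_tensors="pt").input_ids

with torch.no_grad():
    outputs = model.generate(
        input_ids,
        max_length=20,
        do_sample=True,
        top_k=950,
        repetition_penalty=1.2,
        num_return_sequences=2,
        eos_token_id=2,
    )

# Decode each sampled sequence and strip the spaces the tokenizer inserts between tokens.
for output_ids in outputs:
    print(tokenizer.decode(output_ids, skip_special_tokens=True).replace(" ", ""))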

How to cite

@article{hesslow2022rita,
  title={RITA: a Study on Scaling Up Generative Protein Sequence Models},
  author={Hesslow, Daniel and Zanichelli, Niccol{\'o} and Notin, Pascal and Poli, Iacopo and Marks, Debora},
  journal={arXiv preprint arXiv:2205.05789},
  year={2022}
}
Comments
  • Cannot run pipeline to generate sequences

    Dear Author,

    I followed the example code using pipeline:

    rita_gen = pipeline('text-generation', model=model, tokenizer=tokenizer)
    sequences = rita_gen("MAB", max_length=20, do_sample=True, top_k=950, repetition_penalty=1.2,
                         num_return_sequences=2, eos_token_id=2)

    I got: The model 'RITAModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'B......'

    My transformers version is 4.22.0.dev0.

    Please help. Thanks.

    opened by yzhang-github-pub 2
  • ESM's log_likelihood equivalent calculation for RITA?

    Hi, is there a way to calculate the equivalent of ESM's log_likelihood (protein stability, a.k.a. fitness) for a protein sequence using this repository? Thanks in advance. (A minimal scoring sketch follows this comments list.)

    opened by avilella 2
  • cannot run example

    Hi,

    I'm trying to run your example, but seeing

    /root/.cache/huggingface/modules/transformers_modules/lightonai/RITA_s/rita_modeling_ad8dfe2e2240d5a6abcfa9ea52f3868a35f07f4670c887c31d420f6bba0fdc5f_3db732a66a256d53b183fe3a079bbfbf2db700c63f7789178c37924666055f28.py in <module>()
         18 from transformers.utils import logging
         19 
    ---> 20 from .rita_configuration import RITAConfig
         21 import torch.nn.functional as F
         22 logger = logging.get_logger(__name__)
    
    ModuleNotFoundError: No module named 'transformers_modules.lightonai.RITA_s.rita_configuration'
    
    opened by lucidrains 2
  • attention_mask in RITA is different from that in other Huggingface transformer models

    The attention_mask in Huggingface transformer models is a padding mask, where 1 means not masked and 0 means masked (https://huggingface.co/docs/transformers/main/en/glossary#attention-mask). I think in the RITA model it is named padding_mask.

    However, the RITAModelForCausalLM class does not have the argument padding_mask. Instead, it passes the attention_mask argument into the DecoderLayer class like this: x = layer(x, attn_mask=attention_mask), where attn_mask in RITA is the causal mask instead of the padding mask. This will lead to unexpected results.

    So, I think it might be better to keep the same API as other Huggingface transformer models, i.e. let users pass the Huggingface-style attention_mask into the model as a padding mask.

    Thank you!

    opened by zzhongzz 2
  • What is training input data format?

    Dear Author,

    I am fine-tuning your pretrained RITA on protein family data, using the run_clm.py script from Hugging Face. I tried this format, where seq1 and seq2 are protein sequences 1 and 2 without whitespace:

    seq1 <|endoftext|> seq2 <|endoftext|> ...

    and also this format: seq1 seq2 ...

    Training seemed to be successful. However, sequences generated by the fine-tuned model contain a lot of '' tokens.

    Please advise. Thanks.

    opened by yzhang-github-pub 1
  • Cannot find model

    Hi! I'm trying to run the example and have met the following errors:

    >>> model = AutoModelForCausalLM.from_pretrained("lightonai/RITA_s, trust_remote_code=True")
    Traceback (most recent call last):
      File "/data/home/zhongkai/miniconda/envs/protein_ssf_aws/lib/python3.8/site-packages/transformers/configuration_utils.py", line 601, in _get_config_dict
        resolved_config_file = cached_path(
      File "/data/home/zhongkai/miniconda/envs/protein_ssf_aws/lib/python3.8/site-packages/transformers/utils/hub.py", line 284, in cached_path
        output_path = get_from_cache(
      File "/data/home/zhongkai/miniconda/envs/protein_ssf_aws/lib/python3.8/site-packages/transformers/utils/hub.py", line 495, in get_from_cache
        _raise_for_status(r)
      File "/data/home/zhongkai/miniconda/envs/protein_ssf_aws/lib/python3.8/site-packages/transformers/utils/hub.py", line 417, in _raise_for_status
        raise RepositoryNotFoundError(
    transformers.utils.hub.RepositoryNotFoundError: 401 Client Error: Repository not found for url: https://huggingface.co/lightonai/RITA_s,%20trust_remote_code=True/resolve/main/config.json. If the repo is private, make sure you are authenticated.
    

    What am I doing wrong here? Appreciate the help, thanks!

    opened by zzk1st 1
  • Unable to run Example Script

    Unable to run a script similar to the example

    >>> from transformers import pipeline
    >>> from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
    >>> tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")
    >>> model = AutoModelForCausalLM.from_pretrained("lightonai/RITA_s", trust_remote_code=True)
    Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
    Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
    >>> rita_gen = pipeline('text-generation', model=model, tokenizer=tokenizer)
    The model 'RITAModelForCausalLM' is not supported for text-generation. Supported models are ['XGLMForCausalLM', 'PLBartForCausalLM', 'QDQBertLMHeadModel', 'TrOCRForCausalLM', 'GPTJForCausalLM', 'RemBertForCausalLM', 'RoFormerForCausalLM', 'BigBirdPegasusForCausalLM', 'GPTNeoForCausalLM', 'BigBirdForCausalLM', 'CamembertForCausalLM', 'XLMRobertaXLForCausalLM', 'XLMRobertaForCausalLM', 'RobertaForCausalLM', 'BertLMHeadModel', 'OpenAIGPTLMHeadModel', 'GPT2LMHeadModel', 'TransfoXLLMHeadModel', 'XLNetLMHeadModel', 'XLMWithLMHeadModel', 'ElectraForCausalLM', 'CTRLLMHeadModel', 'ReformerModelWithLMHead', 'BertGenerationDecoder', 'XLMProphetNetForCausalLM', 'ProphetNetForCausalLM', 'BartForCausalLM', 'OPTForCausalLM', 'MBartForCausalLM', 'PegasusForCausalLM', 'MarianForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'MegatronBertForCausalLM', 'Speech2Text2ForCausalLM', 'Data2VecTextForCausalLM'].
    

    This is with Python 3.8, tokenizers 0.12.1, and transformers 4.19.2.

    Additionally, are there more details about your prompt tuning? Curious to know how you approached it and what prompt engineering looks like for proteins as opposed to language.

    opened by zanussbaum 1
  • How to obtain perplexity evaluation datasets?

    Dear Author,

    Thanks for releasing RITA for protein generation! However, I wonder how I can obtain the perplexity evaluation datasets used in your paper and how to calculate perplexity. Hoping for your suggestions. Thanks in advance! (A minimal scoring sketch follows this comments list.)

    opened by LGH1gh 0
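
On the log-likelihood and perplexity questions above: for a causal language model, the mean per-token negative log-likelihood of a sequence is what the Hugging Face causal-LM interface returns as its loss when labels are supplied, and perplexity is the exponential of that loss. The snippet below is a minimal sketch under the assumption that RITA's ForCausalLM head accepts labels like other Hugging Face causal LMs; it is not the evaluation script used in the paper, and the example sequence and the score_sequence helper name are ours.

import torch

def score_sequence(seq, model, tokenizer):
    # Tokenize the raw amino-acid string into model inputs.
    input_ids = tokenizer(seq, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, Hugging Face causal LMs return the mean
        # next-token cross-entropy, i.e. the negative log-likelihood per predicted token.
        loss = model(input_ids=input_ids, labels=input_ids).loss
    n_predicted = input_ids.shape[1] - 1          # positions actually predicted (targets are shifted by one)
    log_likelihood = -loss.item() * n_predicted   # total log-probability of the sequence, in nats
    perplexity = torch.exp(loss).item()
    return log_likelihood, perplexity

ll, ppl = score_sequence("MSKGEELFTGVVPILVELDGDVNGHKFSVSG", model, tokenizer)
print(f"log-likelihood: {ll:.2f} nats | perplexity: {ppl:.2f}")
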
Owner
LightOn
At LightOn, we unlock Extreme-Scale Machine Intelligence. Most repos are focused on the use of photonic hardware. LightOnMuse connects to foundation models.