Replication attempt for the Protein Folding Model

Eric Alcaide

Last update: Nov 29, 2022


Overview

RGN2-Replica (WIP)

To eventually become an unofficial, working PyTorch implementation of RGN2, a state-of-the-art model for MSA-less protein folding, of particular use when no evolutionary homologs are available (i.e. for protein design).

Install

$ pip install rgn2-replica

To load sample dataset

from datasets import load_from_disk
ds = load_from_disk("data/ur90_small")
print(ds['train'][0])

To convert to pandas for exploration

df = ds['train'].to_pandas()
df.sample(5)

To train ProteinLM

Run the following command with default parameters

python -m scripts.lmtrainer

This starts a run on CPU using the sample dataset in the repo directory.

TO-DO LIST: ordered by priority

Contribute:

Hey there! New ideas are welcome: open/close issues, fork the repo, and share your code with a Pull Request.

Currently, the main discussion about model development is happening in this Discord server, under the /self-supervised-learning channel.

Clone this project to your computer:

git clone https://github.com/EricAlcaide/pysimplechain

Please follow this guideline on open source contribution.

Citations:

@article {Chowdhury2021.08.02.454840,
    author = {Chowdhury, Ratul and Bouatta, Nazim and Biswas, Surojit and Rochereau, Charlotte and Church, George M. and Sorger, Peter K. and AlQuraishi, Mohammed},
    title = {Single-sequence protein structure prediction using language models from deep learning},
    elocation-id = {2021.08.02.454840},
    year = {2021},
    doi = {10.1101/2021.08.02.454840},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2021/08/04/2021.08.02.454840},
    eprint = {https://www.biorxiv.org/content/early/2021/08/04/2021.08.02.454840.full.pdf},
    journal = {bioRxiv}
}

@article{alquraishi_2019,
	author={AlQuraishi, Mohammed},
	title={End-to-End Differentiable Learning of Protein Structure},
	volume={8},
	DOI={10.1016/j.cels.2019.03.006},
	URL={https://www.cell.com/cell-systems/fulltext/S2405-4712(19)30076-6},
	number={4},
	journal={Cell Systems},
	year={2019},
	pages={292-301.e3}
}

Comments

AminoBERT language model loss

Loss function for AminoBERT.

You can test it by running the following:

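import torch
# AminoBretLoss is the loss class added by this PR; import it from wherever it lives in your checkout.

vocab_size = 24
bs = 20
logit_out = torch.rand(bs, 10, vocab_size)       # token logits: (batch, seq_len, vocab)
logit_chunk_perm = torch.rand(bs, 2)             # chunk-permutation logits: (batch, 2)
target = torch.randint(1, 20, (bs, 10))          # target token ids: (batch, seq_len)
chunk_perm = torch.randint(0, 2, (bs,))          # chunk-permutation labels: (batch,)

loss_func = AminoBretLoss(vocab_size=vocab_size)

loss = loss_func(logit_out, logit_chunk_perm, target, chunk_perm)
print(loss)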
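
The arguments in the test above suggest a loss with two terms: a token-level cross-entropy over logit_out / target, and a two-class classification term over logit_chunk_perm / chunk_perm. The arithmetic for a single sequence can be sketched in plain Python as follows. This is an assumption about the structure of the loss, not the AminoBretLoss code from the PR; the names two_term_loss and perm_weight are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of class `target` for one row of logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def two_term_loss(token_logits, perm_logits, tokens, perm_label, perm_weight=1.0):
    """Mean token cross-entropy plus a weighted chunk-permutation cross-entropy.

    Hypothetical sketch: the real AminoBretLoss may weight, mask, or reduce differently.
    """
    token_loss = sum(
        cross_entropy(row, t) for row, t in zip(token_logits, tokens)
    ) / len(tokens)
    perm_loss = cross_entropy(perm_logits, perm_label)
    return token_loss + perm_weight * perm_loss


# Three positions, four-token vocabulary, all logits equal (a uniform prediction).
token_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
perm_logits = [0.0, 0.0]
loss = two_term_loss(token_logits, perm_logits, tokens=[1, 2, 3], perm_label=0)
print(loss)
```

With uniform logits each term reduces to the log of its number of classes, so the value printed here equals log 4 + log 2, which is a quick sanity check for any implementation of this shape.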

opened by DrHB 3

Add structure refinement in PyRosetta
Adding two functions for protein structure refinement in PyRosetta:

quick_refine: refines the full-atom structure in Cartesian coordinates using MinMover, followed by idealization of bond lengths and angles.

relax_refine: refines the full-atom structure in both Cartesian and internal coordinates using FastRelax. Structures are minimized before and after relaxation with MinMover (similar to quick_refine), followed by idealization of bond lengths and angles.

To use these functions, obtain a license and install PyRosetta from here. Before calling either function, you must initialize PyRosetta like:

import pyrosetta
pyrosetta.init("-mute all")

PyRosetta is quite verbose so I usually mute outputs, but that can be removed for debugging.
opened by jeffreyruffolo 0

AttributeError: module 'mp_nerf.proteins' has no attribute 'ca_bb_fold'

I want to train the rgn2-replica model with the following command:

export WANDB_MODE=offline
python scripts/train_rgn2.py \
    --device cuda:1 \
    --wb_entity xxxxx \
    --wb_proj rgn2 \
    --run_name RGN2X_vanillaLSTM_full_run \
    --min_len_valid 0 \
    --xray 1

But an AttributeError occurred:

Traceback (most recent call last):
  File "scripts/train_rgn2.py", line 345, in <module>
    init_and_train(args)
  File "scripts/train_rgn2.py", line 134, in init_and_train
    results = run_train_schedule(dataloaders, embedder, config, args)
  File "scripts/train_rgn2.py", line 213, in run_train_schedule
    config=config,
  File "/home/lipan/usr/rgn2-replica/rgn2_replica/rgn2_trainers.py", line 432, in train
    config=config,
  File "/home/lipan/usr/rgn2-replica/rgn2_replica/rgn2_trainers.py", line 62, in batched_inference
    coords_rebuilt = mp_nerf.proteins.ca_bb_fold( ca_trace ) # beware extremes
AttributeError: module 'mp_nerf.proteins' has no attribute 'ca_bb_fold'

It seems that mp_nerf.proteins.ca_bb_fold is not defined in https://github.com/eleutherAI/mp_nerf
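
A mismatch like this only surfaces deep inside the training loop. One way to catch it earlier is to resolve the required attribute before any work starts. The helper below is a minimal sketch using only the standard library; require_attr is a hypothetical name and is not part of rgn2-replica or mp_nerf.

```python
import importlib


def require_attr(module_name, attr_path):
    """Import `module_name` and resolve the dotted `attr_path`, failing early if missing."""
    obj = importlib.import_module(module_name)
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            raise ImportError(
                f"{module_name} has no attribute {attr_path!r}; "
                "the installed version may differ from the one the script expects"
            )
        obj = getattr(obj, part)
    return obj


# Demonstrated on the standard library so the snippet runs anywhere.
# For the case above, one would call
# require_attr("mp_nerf.proteins", "ca_bb_fold") before starting training.
join = require_attr("os", "path.join")
print(join("data", "output"))
```

If the attribute is absent, the ImportError is raised at startup with the module and attribute named, rather than after data loading and model initialization.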

opened by lipan6461188 0

Support IPA-based refiner

What?

Instead of the En-transformer refiner, an IPA (https://github.com/lucidrains/invariant-point-attention) refiner has been integrated.

Why?

To compare the effectiveness of different refiners experimentally. IPA lies at the core of the AF2 structure module, so it is worth a try.

How?

Simply follow the existing RGN2_transformer implementation. The difference is that the output is coordinates instead of angles. Compatibility with the angles used by the losses is maintained via the discrete Frenet-Serret equations.
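
For readers unfamiliar with the discrete Frenet-Serret description: a chain of points is characterized by a bend angle between consecutive tangent vectors and a torsion angle around each bond. The sketch below recovers both from a coordinate trace in plain Python. It is only an illustration of the idea; the function names are hypothetical, and the angle and sign conventions in this PR may differ.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def bend_angle(p0, p1, p2):
    """Angle between the consecutive tangents p1 - p0 and p2 - p1, in radians."""
    t1, t2 = sub(p1, p0), sub(p2, p1)
    cos_k = dot(t1, t2) / (norm(t1) * norm(t2))
    return math.acos(max(-1.0, min(1.0, cos_k)))  # clamp against rounding error


def torsion_angle(p0, p1, p2, p3):
    """Signed dihedral angle around the bond p1 -> p2, in radians."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)  # normals of the two planes
    y = dot(b2, cross(n1, n2))
    x = norm(b2) * dot(n1, n2)
    return math.atan2(y, x)


def trace_angles(coords):
    """Bend and torsion angles along a trace of 3D points."""
    bends = [bend_angle(*coords[i:i + 3]) for i in range(len(coords) - 2)]
    torsions = [torsion_angle(*coords[i:i + 4]) for i in range(len(coords) - 3)]
    return bends, torsions


# Four points forming three mutually perpendicular unit steps.
trace = [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
bends, torsions = trace_angles(trace)
print(bends, torsions)
```

A trace of N points yields N - 2 bend angles and N - 3 torsion angles, which is why angle-based losses need padding or masking at the chain ends.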

Testing?

dRMSD can reach around 5. Experimental hyperparameters include "structure_module_depth" and other IPA parameters. A running script has been included.

Screenshots

At the moment, the xtension branch of the mp_nerf repo has been integrated as source code. Some functions (e.g. mp_nerf.ml_utils.backbone_forcefield()) have not been implemented. Not a big deal; I leave them for further improvement.

Anything Else?

Integrating the LSTM output of 4 angles has been discussed but not implemented. An LSTM can give good predictions of secondary structure, so it is worth a trial.

opened by hushuangwei 0

Implementing AlphaFold-based protein refinement

Usage example:

import sys

# The script reads its options from sys.argv, so set them before importing it.
sys.argv = [
    'scripts/rgn2_predict_fold.py',

    '--pdb_input', '/home/hypnopump/bruba/rgn2-replica/data/input/0_lolasso.pdb',
    # --- OR ---
    # '--input', '/home/hypnopump/own_research/notebooks/test_predictor/proteins.fasta',
    # '--model', '/home/hypnopump/own_research/notebooks/rgn2_models/restart_improve_vanillaLSTM_improve@_150K.pt',

    '--bidirectional', 'True',
    '--output_path', '/home/hypnopump/bruba/rgn2-replica/data/output',
    '--device', 'cpu',
    '--af2_refine', 'True'
]

from scripts.rgn2_predict_fold import predict_refine
result = predict_refine()
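
The usage example above drives the script by overwriting sys.argv, which leaves the modified value in place for the rest of the session (for instance in a notebook). A small context manager can restore it afterwards. This is a sketch using only the standard library; patched_argv and parse_demo_args are hypothetical names, and the parser is a stand-in, not the one in scripts/rgn2_predict_fold.py.

```python
import argparse
import contextlib
import sys


@contextlib.contextmanager
def patched_argv(argv):
    """Temporarily replace sys.argv, restoring the original value on exit."""
    saved = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = saved


def parse_demo_args():
    # Stand-in parser covering two of the flags from the usage example.
    parser = argparse.ArgumentParser()
    parser.add_argument("--device", default="cpu")
    parser.add_argument("--af2_refine", default="False")
    return parser.parse_args()


before = list(sys.argv)
with patched_argv(["demo.py", "--device", "cpu", "--af2_refine", "True"]):
    args = parse_demo_args()  # argparse reads sys.argv[1:] here
print(args.device, args.af2_refine)
```

The same pattern would wrap the call to predict_refine(), assuming that function parses sys.argv when it is called rather than at import time.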

opened by Serhiy-Shekhovtsov 0

Random notes
Prediction of angular features:

See this: https://discord.com/channels/729741769192767510/797547607345201162/874987318069592095

Types of recurrent layers

multiplicativeLSTM: https://florianwilhelm.info/2018/08/multiplicative_LSTM_for_sequence_based_recos/

layerNormLSTM: https://github.com/exe1023/LSTM_LN/blob/master/lstm.py

multiplicative Integration: https://papers.nips.cc/paper/2016/file/f69e505b08403ad2298b9f262659929a-Paper.pdf

other variants: https://github.com/FlorianMuellerklein/death_metal_lstm/blob/master/scripts/models.py

Novel Arch: https://arxiv.org/pdf/1911.11033.pdf

SRU (https://arxiv.org/abs/1709.02755): https://github.com/bamtercelboo/pytorch_SRU

peephole: https://towardsdatascience.com/building-a-lstm-by-hand-on-pytorch-59c02a4ec091

uRNN: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15842215.pdf

delayedLSTM: https://arxiv.org/pdf/1909.00021.pdf

osiparc: ?
opened by hypnopump 5

Owner

Eric Alcaide

And the greatest good is small; for all of life is a dream, and dreams themselves are only dreams.

