Replication attempt for the Protein Folding Model

Overview

RGN2-Replica (WIP)

To eventually become an unofficial working PyTorch implementation of RGN2, a state-of-the-art model for MSA-less protein folding, intended for use when no evolutionary homologs are available (i.e. for protein design).

Install

$ pip install rgn2-replica

To load sample dataset

from datasets import load_from_disk
ds = load_from_disk("data/ur90_small")
print(ds['train'][0])

To convert to pandas for exploration

df = ds['train'].to_pandas()
df.sample(5)

To train ProteinLM

Run the following command with default parameters

python -m scripts.lmtrainer

This starts a run on the CPU using the sample dataset in the repo directory.

TO-DO LIST: ordered by priority

  • Provide basic package and file structure

  • RGN2:

    • Contribute adaptation of RGN1 for different ops
      • Simple LSTM with:
        • Inputs (B, L, emb_dim)
        • Outputs (B, L, 4) (4 features which should be outputs of linear projections)
    • Find a good (and reproducible) training scheme
    • Benchmark regression vs classification of torsional alphabet
  • Language Model:

  • To be merged when first versions of RGN are ready:

    • Geometry module
    • Adapt functionality from MP-NeRF:
      • Sidechain building
      • Full backbone from CA
      • Fast loss functions and metrics
      • Modifications to convert LSTM cell into RGN cell
  • Contribute trainer classes / functionality.

    • Sequence preprocessing for AminoBERT
      • inverted fragments
      • sequence masking
      • loss function wrapper v1 by @DrHB
      • Sample dataset by @gurvindersingh
      • Dataloader
      • ...
  • Contribute Data Infra for training:

  • Contribute Rosetta Scripts (contact me by email ([email protected]) or Discord to get a key for Rosetta if you are interested in doing this part)

  • NOTES:

  • Use functionality provided in MP-NeRF wherever possible (avoid repetition).
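As a rough illustration of the "Simple LSTM" item in the to-do list above, here is a minimal PyTorch sketch that maps inputs of shape (B, L, emb_dim) to outputs of shape (B, L, 4) through a linear projection. The class name `SimpleTorsionLSTM`, the hidden size, the layer count and the `bidirectional` switch are illustrative assumptions, not the repo's actual RGN2 module.

```python
import torch
from torch import nn


class SimpleTorsionLSTM(nn.Module):
    """Hypothetical baseline: embeddings (B, L, emb_dim) -> features (B, L, 4)."""

    def __init__(self, emb_dim: int, hidden_dim: int = 64,
                 num_layers: int = 2, bidirectional: bool = False):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=bidirectional)
        # A bidirectional LSTM concatenates both directions, doubling the width.
        lstm_out_dim = hidden_dim * (2 if bidirectional else 1)
        # Linear projection to the 4 per-residue output features.
        self.proj = nn.Linear(lstm_out_dim, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.lstm(x)   # (B, L, lstm_out_dim)
        return self.proj(hidden)   # (B, L, 4)


model = SimpleTorsionLSTM(emb_dim=32, hidden_dim=16)
out = model(torch.randn(2, 10, 32))  # B=2, L=10, emb_dim=32
print(out.shape)  # torch.Size([2, 10, 4])
```

How the 4 output features are interpreted (for example as sin/cos pairs of two angles) is left open here, since the to-do item does not specify it.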

Contribute:

Hey there! New ideas are welcome: open/close issues, fork the repo and share your code with a Pull Request.

Currently, the main discussion about model development is happening in this Discord server under the /self-supervised-learning channel.

Clone this project to your computer:

git clone https://github.com/EricAlcaide/pysimplechain

Please follow this guideline on open source contribution.

Citations:

@article {Chowdhury2021.08.02.454840,
    author = {Chowdhury, Ratul and Bouatta, Nazim and Biswas, Surojit and Rochereau, Charlotte and Church, George M. and Sorger, Peter K. and AlQuraishi, Mohammed},
    title = {Single-sequence protein structure prediction using language models from deep learning},
    elocation-id = {2021.08.02.454840},
    year = {2021},
    doi = {10.1101/2021.08.02.454840},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2021/08/04/2021.08.02.454840},
    eprint = {https://www.biorxiv.org/content/early/2021/08/04/2021.08.02.454840.full.pdf},
    journal = {bioRxiv}
}

@article{alquraishi_2019,
	author={AlQuraishi, Mohammed},
	title={End-to-End Differentiable Learning of Protein Structure},
	volume={8},
	DOI={10.1016/j.cels.2019.03.006},
	URL={https://www.cell.com/cell-systems/fulltext/S2405-4712(19)30076-6},
	number={4},
	journal={Cell Systems},
	year={2019},
	pages={292-301.e3}
}

Comments
  • AminoBERT language model loss

    Loss function for AminoBERT.

    You can test it by running the following:

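    import torch
    # AminoBretLoss is the loss class added in this PR; import it from
    # wherever it lives in your checkout of the repo.

    vocab_size = 24
    bs = 20
    # Random stand-ins for model outputs: token logits and chunk-permutation logits.
    logit_out = torch.rand(bs, 10, vocab_size)
    logit_chunk_perm = torch.rand(bs, 2)
    # Random targets: token ids per position and a 0/1 permutation label per sequence.
    target = torch.randint(1, 20, (bs, 10))
    chunk_perm = torch.randint(0, 2, (bs,))

    loss_func = AminoBretLoss(vocab_size=vocab_size)

    loss = loss_func(logit_out, logit_chunk_perm, target, chunk_perm)
    print(loss)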
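For orientation, a loss with this call signature could look like the sketch below. It assumes the loss is a token-level cross-entropy plus a binary chunk-permutation cross-entropy, which is consistent with the tensor shapes in the test above. The class `SketchAminoBertLoss`, its `padding_token` and `perm_weight` arguments and the plain weighted sum are assumptions for illustration and may differ from the `AminoBretLoss` in the PR.

```python
import torch
from torch import nn
import torch.nn.functional as F


class SketchAminoBertLoss(nn.Module):
    """Hypothetical stand-in, not the AminoBretLoss from the PR."""

    def __init__(self, vocab_size: int, padding_token: int = 0,
                 perm_weight: float = 1.0):
        super().__init__()
        self.vocab_size = vocab_size
        self.padding_token = padding_token  # assumed id of the padding token
        self.perm_weight = perm_weight      # assumed weight of the second term

    def forward(self, logit_out, logit_chunk_perm, target, chunk_perm):
        # Token-level cross-entropy over the vocabulary, skipping padding.
        lm_loss = F.cross_entropy(
            logit_out.reshape(-1, self.vocab_size),
            target.reshape(-1),
            ignore_index=self.padding_token,
        )
        # Two-class cross-entropy: was the sequence chunk-permuted or not?
        perm_loss = F.cross_entropy(logit_chunk_perm, chunk_perm)
        return lm_loss + self.perm_weight * perm_loss


torch.manual_seed(0)
vocab_size, bs = 24, 20
loss = SketchAminoBertLoss(vocab_size=vocab_size)(
    torch.rand(bs, 10, vocab_size),    # logit_out
    torch.rand(bs, 2),                 # logit_chunk_perm
    torch.randint(1, 20, (bs, 10)),    # target
    torch.randint(0, 2, (bs,)),        # chunk_perm
)
print(loss)
```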
    
    
    opened by DrHB 3
  • Add structure refinement in PyRosetta

    Adding two functions for protein structure refinement in PyRosetta:

    quick_refine: refines full-atom structure in cartesian coordinates using MinMover

    • followed by idealization of bond lengths and angles

    relax_refine: refines full-atom structure in both cartesian and internal coordinates using FastRelax

    • structures are minimized before and after relaxation with MinMover (similar to quick_refine)
    • followed by idealization of bond lengths and angles

    To use these functions, obtain a license and install PyRosetta. Before calling either function, you must initialize PyRosetta:

    import pyrosetta
    pyrosetta.init("-mute all")
    

    PyRosetta is quite verbose so I usually mute outputs, but that can be removed for debugging.

    opened by jeffreyruffolo 0
  • AttributeError: module 'mp_nerf.proteins' has no attribute 'ca_bb_fold'

    I want to train the rgn2-replica model with the following command:

    export WANDB_MODE=offline
    python scripts/train_rgn2.py \
        --device cuda:1 \
        --wb_entity xxxxx \
        --wb_proj rgn2 \
        --run_name RGN2X_vanillaLSTM_full_run \
        --min_len_valid 0 \
        --xray 1
    

    But an AttributeError occurred:

    Traceback (most recent call last):
      File "scripts/train_rgn2.py", line 345, in <module>
        init_and_train(args)
      File "scripts/train_rgn2.py", line 134, in init_and_train
        results = run_train_schedule(dataloaders, embedder, config, args)
      File "scripts/train_rgn2.py", line 213, in run_train_schedule
        config=config,
      File "/home/lipan/usr/rgn2-replica/rgn2_replica/rgn2_trainers.py", line 432, in train
        config=config,
      File "/home/lipan/usr/rgn2-replica/rgn2_replica/rgn2_trainers.py", line 62, in batched_inference
        coords_rebuilt = mp_nerf.proteins.ca_bb_fold( ca_trace ) # beware extremes
    AttributeError: module 'mp_nerf.proteins' has no attribute 'ca_bb_fold'
    

    It seems that mp_nerf.proteins.ca_bb_fold is not defined in https://github.com/eleutherAI/mp_nerf.

    opened by lipan6461188 0
  • Support IPA-based refiner

    What?

    Instead of the En-transformer refiner, an IPA (https://github.com/lucidrains/invariant-point-attention) refiner has been integrated.

    Why?

    To compare the effectiveness of different refiners experimentally. IPA lies at the core of the AF2 structure module, so it is worth trying.

    How?

    It simply follows the existing RGN2_transformer implementation. The difference is that it outputs coordinates instead of angles. Compatibility with the angles used by the losses is handled via the discrete Frenet-Serret equations.
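The PR text does not show how coordinates are mapped back to angles. One common discrete Frenet-Serret construction, sketched here under that assumption, derives bond angles from consecutive unit tangents and torsion angles from consecutive binormals of a CA trace. The function name `ca_trace_angles` and its return convention are hypothetical.

```python
import torch
import torch.nn.functional as F


def ca_trace_angles(coords: torch.Tensor):
    """coords: (L, 3) CA positions -> (bond angles (L-2,), torsions (L-3,))."""
    # Unit tangents between consecutive points: (L-1, 3).
    t = F.normalize(coords[1:] - coords[:-1], dim=-1)
    # Turning angle between consecutive tangents: (L-2,).
    cos_bond = (t[:-1] * t[1:]).sum(-1).clamp(-1.0, 1.0)
    bond = torch.acos(cos_bond)
    # Unit binormals of each consecutive tangent pair: (L-2, 3).
    b = F.normalize(torch.cross(t[:-1], t[1:], dim=-1), dim=-1)
    # Signed angle between consecutive binormals around the shared tangent.
    sin_tor = (torch.cross(b[:-1], b[1:], dim=-1) * t[1:-1]).sum(-1)
    cos_tor = (b[:-1] * b[1:]).sum(-1)
    torsion = torch.atan2(sin_tor, cos_tor)
    return bond, torsion


coords = torch.tensor([[1.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 1.0, 1.0]])
bond, torsion = ca_trace_angles(coords)
print(bond, torsion)
```

Note that the bond angle here is the turning angle between tangents; some conventions use its supplement (pi minus this value), so check which one the losses expect.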

    Testing?

    dRMSD can reach around 5. Experimental hyperparameters include "structure_module_depth" and other IPA parameters. A run script has been included.
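For readers unfamiliar with the metric, dRMSD (distance-based RMSD) compares the pairwise distance matrices of two structures, so it needs no superposition. The sketch below is a minimal single-structure version; the function name `drmsd` and the choice to average over unique residue pairs are assumptions and may differ from the metric used in the repo or in MP-NeRF.

```python
import torch


def drmsd(pred: torch.Tensor, true: torch.Tensor) -> torch.Tensor:
    """pred, true: (L, 3) coordinates of the same L points."""
    # Pairwise Euclidean distance matrices: (L, L).
    d_pred = torch.cdist(pred, pred)
    d_true = torch.cdist(true, true)
    # Keep each unordered pair once (strict upper triangle).
    n = pred.shape[0]
    i, j = torch.triu_indices(n, n, offset=1)
    diff = d_pred[i, j] - d_true[i, j]
    # Root mean square of the distance differences.
    return diff.pow(2).mean().sqrt()


pred = torch.tensor([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
print(drmsd(pred, pred + 5.0))  # a rigid translation leaves distances unchanged
print(drmsd(pred, 2.0 * pred))  # scaling changes every pairwise distance
```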

    Screenshots

    At the moment, the xtension branch of the mp_nerf repo has been integrated via source code. Some functions (e.g. mp_nerf.ml_utils.backbone_forcefield()) have not been implemented. This is not a big deal; I leave it for further improvement.

    Anything Else?

    Integrating the LSTM output of 4 angles has been discussed but not implemented. The LSTM can give good predictions of secondary structure, so it is worth a try.

    opened by hushuangwei 0
  • Implementing AlphaFold-based protein refinement

    Usage example:

    import sys

    # Emulate the command-line arguments of the prediction script.
    sys.argv = [
        'scripts/rgn2_predict_fold.py',

        '--pdb_input', '/home/hypnopump/bruba/rgn2-replica/data/input/0_lolasso.pdb',
        # --- OR ---
        # '--input', '/home/hypnopump/own_research/notebooks/test_predictor/proteins.fasta',
        # '--model', '/home/hypnopump/own_research/notebooks/rgn2_models/restart_improve_vanillaLSTM_improve@_150K.pt',

        '--bidirectional', 'True',
        '--output_path', '/home/hypnopump/bruba/rgn2-replica/data/output',
        '--device', 'cpu',
        '--af2_refine', 'True',
    ]

    # Predict the fold and refine it (AlphaFold-based refinement is enabled by --af2_refine).
    from scripts.rgn2_predict_fold import predict_refine
    result = predict_refine()
    
    opened by Serhiy-Shekhovtsov 0
  • Random notes

    • Prediction of angular features:

      • See this: https://discord.com/channels/729741769192767510/797547607345201162/874987318069592095
    • Types of recurrent layers

      • multiplicativeLSTM: https://florianwilhelm.info/2018/08/multiplicative_LSTM_for_sequence_based_recos/
      • layerNormLSTM: https://github.com/exe1023/LSTM_LN/blob/master/lstm.py
      • multiplicative Integration: https://papers.nips.cc/paper/2016/file/f69e505b08403ad2298b9f262659929a-Paper.pdf
      • other variants: https://github.com/FlorianMuellerklein/death_metal_lstm/blob/master/scripts/models.py
      • Novel Arch: https://arxiv.org/pdf/1911.11033.pdf
      • SRU (https://arxiv.org/abs/1709.02755): https://github.com/bamtercelboo/pytorch_SRU
      • peephole: https://towardsdatascience.com/building-a-lstm-by-hand-on-pytorch-59c02a4ec091
      • uRNN: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15842215.pdf
      • delayedLSTM: https://arxiv.org/pdf/1909.00021.pdf
      • osiparc: ?
    opened by hypnopump 5
Owner
Eric Alcaide