Replication attempt for the Protein Folding Model


RGN2-Replica (WIP)

Intended to eventually become an unofficial working PyTorch implementation of RGN2, a state-of-the-art model for MSA-less protein folding, of particular use when no evolutionary homologs are available (i.e. for protein design).


$ pip install rgn2-replica

To load the sample dataset

from datasets import load_from_disk
ds = load_from_disk("data/ur90_small")

To convert to pandas for exploration

df = ds['train'].to_pandas()

To train ProteinLM

Run the following command with default parameters

python -m scripts.lmtrainer

This starts a training run on CPU using the sample dataset in the repository directory.

TO-DO LIST: ordered by priority

  • Provide basic package and file structure

  • RGN2:

    • Contribute adaptation of RGN1 for different ops
      • Simple LSTM with:
        • Inputs (B, L, emb_dim)
        • Outputs (B, L, 4) (4 features which should be outputs of linear projections)
    • Find a good (and reproducible) training scheme
    • Benchmark regression vs classification of torsional alphabet
  • Language Model:

  • To be merged when first versions of RGN are ready:

    • Geometry module
    • Adapt functionality from MP-NeRF:
      • Sidechain building
      • Full backbone from CA
      • Fast loss functions and metrics
      • Modifications to convert LSTM cell into RGN cell
  • Contribute trainer classes / functionality.

    • Sequence preprocessing for AminoBERT
      • inverted fragments
      • sequence masking
      • loss function wrapper v1 by @DrHB
      • Sample dataset by @gurvindersingh
      • Dataloader
      • ...
  • Contribute Data Infra for training:

  • Contribute Rosetta Scripts (contact me by email or Discord to get a key for Rosetta if you are interested in doing this part).

  • NOTES:

  • Use functionality provided in MP-NeRF wherever possible (avoid repetition).
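
The "Simple LSTM" item in the list above fixes only the tensor shapes: inputs of shape (B, L, emb_dim) and outputs of shape (B, L, 4) produced by linear projections. A minimal PyTorch sketch under those constraints could look like the following. The class name `SimpleAngleLSTM`, the hidden size, the number of layers and the bidirectional setting are assumptions made for illustration, not the repository's implementation.

```python
import torch
from torch import nn


class SimpleAngleLSTM(nn.Module):
    """Hypothetical sketch: LSTM over residue embeddings plus a linear head with 4 outputs."""

    def __init__(self, emb_dim: int, hidden_dim: int = 64, num_layers: int = 2):
        super().__init__()
        # batch_first=True so tensors are laid out as (B, L, features)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # Linear projection from the concatenated forward/backward states to 4 features
        self.to_feats = nn.Linear(2 * hidden_dim, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.lstm(x)      # (B, L, 2 * hidden_dim)
        return self.to_feats(hidden)  # (B, L, 4)


model = SimpleAngleLSTM(emb_dim=32)
out = model(torch.randn(3, 50, 32))  # B=3, L=50, emb_dim=32
print(out.shape)
```

How the 4 output features map to angles (for example sine/cosine pairs) is left open here, since the list only states that they come from linear projections.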


Hey there! New ideas are welcome: open/close issues, fork the repo and share your code with a Pull Request.

Currently, the main discussion about model development is happening in this Discord server, under the /self-supervised-learning channel.

Clone this project to your computer:

git clone

Please follow this guideline on open source contribution.


  • AminoBERT language model loss

    Loss function for AminoBERT.

    You can test it by running the following:

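    import torch
    # AminoBretLoss is the loss class added by this PR; import it from the module where it is defined.

    vocab_size = 24
    bs = 20  # batch size
    logit_out = torch.rand(bs, 10, vocab_size)   # per-token logits: (batch, length, vocab)
    logit_chunk_perm = torch.rand(bs, 2)         # per-sequence chunk-permutation logits
    target = torch.randint(1, 20, (bs, 10))      # token targets
    chunk_perm = torch.randint(0, 2, (bs,))      # binary chunk-permutation labels
    loss_func = AminoBretLoss(vocab_size=vocab_size)
    loss = loss_func(logit_out, logit_chunk_perm, target, chunk_perm)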
    opened by DrHB 3
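
The call in the test snippet above implies a loss with two terms: a token-level term over `logit_out` / `target` and a sequence-level term over `logit_chunk_perm` / `chunk_perm`. A minimal sketch of such a combined loss follows, assuming plain cross-entropy for both terms and a weighted sum. The actual `AminoBretLoss` in the PR may differ; `TwoTermLMLoss` and `perm_weight` are hypothetical names.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TwoTermLMLoss(nn.Module):
    """Hypothetical sketch: token cross-entropy plus chunk-permutation cross-entropy."""

    def __init__(self, vocab_size: int, perm_weight: float = 1.0):
        super().__init__()
        self.vocab_size = vocab_size
        self.perm_weight = perm_weight  # assumed weighting, not taken from the PR

    def forward(self, logit_out, logit_chunk_perm, target, chunk_perm):
        # Token term: flatten (B, L, V) -> (B*L, V) and (B, L) -> (B*L,)
        token_loss = F.cross_entropy(
            logit_out.reshape(-1, self.vocab_size), target.reshape(-1)
        )
        # Sequence term: binary classification of "was this sequence chunk-permuted?"
        perm_loss = F.cross_entropy(logit_chunk_perm, chunk_perm)
        return token_loss + self.perm_weight * perm_loss


vocab_size, bs = 24, 20
loss_func = TwoTermLMLoss(vocab_size=vocab_size)
loss = loss_func(
    torch.rand(bs, 10, vocab_size),
    torch.rand(bs, 2),
    torch.randint(1, 20, (bs, 10)),
    torch.randint(0, 2, (bs,)),
)
print(loss.item())
```

A real masked-language-model loss would normally score only the masked positions (for example via `ignore_index`); that detail is omitted here because the snippet above does not show how masking is encoded.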
  • Add structure refinement in PyRosetta

    Adding two functions for protein structure refinement in PyRosetta:

    quick_refine: refines full-atom structure in cartesian coordinates using MinMover

    • followed by idealization of bond lengths and angles

    relax_refine: refines full-atom structure in both cartesian and internal coordinates using FastRelax

    • structures are minimized before and after relaxation with MinMover (similar to quick_refine)
    • followed by idealization of bond lengths and angles

    To use these functions, obtain a license and install PyRosetta from here. Before calling either function, you must initialize PyRosetta like:

    import pyrosetta
    pyrosetta.init("-mute all")

    PyRosetta is quite verbose so I usually mute outputs, but that can be removed for debugging.

    opened by jeffreyruffolo 0
  • AttributeError: module 'mp_nerf.proteins' has no attribute 'ca_bb_fold'

    I want to train the rgn2-replica model with the following command:

    export WANDB_MODE=offline
    python scripts/ \
        --device cuda:1 \
        --wb_entity xxxxx \
        --wb_proj rgn2 \
        --run_name RGN2X_vanillaLSTM_full_run \
        --min_len_valid 0 \
        --xray 1

    But an AttributeError occurred:

    Traceback (most recent call last):
      File "scripts/", line 345, in <module>
      File "scripts/", line 134, in init_and_train
        results = run_train_schedule(dataloaders, embedder, config, args)
      File "scripts/", line 213, in run_train_schedule
      File "/home/lipan/usr/rgn2-replica/rgn2_replica/", line 432, in train
      File "/home/lipan/usr/rgn2-replica/rgn2_replica/", line 62, in batched_inference
        coords_rebuilt = mp_nerf.proteins.ca_bb_fold( ca_trace ) # beware extremes
    AttributeError: module 'mp_nerf.proteins' has no attribute 'ca_bb_fold'

    It seems that mp_nerf.proteins.ca_bb_fold is not defined in

    opened by lipan6461188 0
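
An `AttributeError` of this kind means the `mp_nerf.proteins` module that Python imported does not define `ca_bb_fold`. A quick way to confirm what the installed package exposes before launching a long run is sketched below. The helper name `has_attr_path` is made up for illustration, and the demo uses a standard-library module so that it runs without `mp_nerf` installed.

```python
import importlib


def has_attr_path(module_name: str, attr: str) -> bool:
    """Return True if `module_name` can be imported and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module (or one of its parents) is not installed / not importable
        return False
    return hasattr(module, attr)


# Demonstrated on the standard library so the snippet runs anywhere.
# For the case above, call: has_attr_path("mp_nerf.proteins", "ca_bb_fold")
print(has_attr_path("os.path", "join"))
print(has_attr_path("os.path", "ca_bb_fold"))
```

If the check returns False for `mp_nerf.proteins`, the installed version or branch of `mp_nerf` is worth comparing against the one the training script was written for.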
  • Support IPA-based refiner

    Instead of the En-transformer refiner, an IPA (invariant point attention) refiner has been integrated.


    This makes it possible to compare the effectiveness of different experiments. IPA lies at the core of the AF2 structure module, so it is worth a try.


    The implementation simply follows the existing RGN2_transformer. The difference is that it outputs coordinates instead of angles. Compatibility with the angles used by the losses is handled via the discrete Frenet-Serret equations.
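
The discrete Frenet-Serret construction recovers angles from consecutive coordinates: unit tangents between neighbouring points give the turning angle, and consecutive binormals give the signed torsion. A dependency-free sketch under that assumption is shown below. The PR's actual implementation is not reproduced here, and `frenet_angles` is an illustrative name.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def _unit(a):
    norm = math.sqrt(_dot(a, a))
    return [x / norm for x in a]


def frenet_angles(coords):
    """Return (turning_angles, torsions) in radians for a chain of 3D points.

    Assumes no two consecutive points coincide and no three consecutive
    points are collinear (otherwise a tangent or binormal is undefined).
    """
    # Unit tangents between neighbouring points: t_i = (r_{i+1} - r_i) / |...|
    tangents = [_unit(_sub(coords[i + 1], coords[i])) for i in range(len(coords) - 1)]
    # Unit binormals: b_i = (t_i x t_{i+1}) / |...|
    binormals = [_unit(_cross(tangents[i], tangents[i + 1]))
                 for i in range(len(tangents) - 1)]

    # Turning angle between consecutive tangents (pi minus the usual bond angle).
    turning = [math.acos(max(-1.0, min(1.0, _dot(tangents[i], tangents[i + 1]))))
               for i in range(len(tangents) - 1)]

    # Signed torsion between consecutive binormals, measured about the shared tangent.
    torsions = []
    for i in range(len(binormals) - 1):
        cos_tau = _dot(binormals[i], binormals[i + 1])
        sin_tau = _dot(_cross(binormals[i], binormals[i + 1]), tangents[i + 1])
        torsions.append(math.atan2(sin_tau, cos_tau))
    return turning, torsions


# Four points forming three mutually perpendicular unit steps.
turning, torsions = frenet_angles([(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)])
print(turning, torsions)
```

For this right-angled chain both turning angles and the single torsion come out as π/2. A chain of N points yields N-2 turning angles and N-3 torsions, which is why the ends of a CA trace need special handling.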


    dRMSD can reach around 5. Experimental hyperparameters include "structure_module_depth" and other IPA parameters. A run script is included.
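
For readers unfamiliar with the metric: dRMSD compares the pairwise distance matrices of two structures, so it needs no superposition. A small sketch follows, assuming the root mean square is taken over all pairs i < j. The normalization used by the repository's own metric may differ, and `drmsd` here is an illustrative helper rather than the project's function.

```python
import math
from itertools import combinations


def drmsd(pred, true):
    """Distance-based RMSD over all point pairs i < j (assumed convention)."""
    if len(pred) != len(true):
        raise ValueError("pred and true must have the same number of points")
    pairs = list(combinations(range(len(pred)), 2))
    sq_sum = 0.0
    for i, j in pairs:
        # Difference between the same pairwise distance in both structures
        diff = math.dist(pred[i], pred[j]) - math.dist(true[i], true[j])
        sq_sum += diff * diff
    return math.sqrt(sq_sum / len(pairs))


true = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
moved = [(5, 0, 0), (5, 1, 0), (5, 2, 0)]      # same shape, rotated and translated
stretched = [(0, 0, 0), (2, 0, 0), (4, 0, 0)]  # every distance doubled

print(drmsd(moved, true))
print(drmsd(stretched, true))
```

The rigidly moved copy scores 0 because all internal distances are unchanged, while the stretched copy scores sqrt(2) (distance errors of 1, 2 and 1 over three pairs).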


    At the moment the xtension branch of the mp_nerf repo is integrated as source code. Some functions (e.g. mp_nerf.ml_utils.backbone_forcefield()) have not been implemented. This is not a big deal; I leave it for further improvement.

    Anything Else?

    Integrating the LSTM output of 4 angles has been discussed but not implemented. The LSTM can give good predictions of secondary structure, so it is worth a trial.

    opened by hushuangwei 0
  • Implementing AlphaFold-based protein refinement

    Usage example:

    import sys

    # Arguments are passed through sys.argv, as in the original example.
    sys.argv = [
        '--pdb_input', '/home/hypnopump/bruba/rgn2-replica/data/input/0_lolasso.pdb',
        # --- OR ---
        # '--input', '/home/hypnopump/own_research/notebooks/test_predictor/proteins.fasta',
        # '--model', '/home/hypnopump/own_research/notebooks/rgn2_models/',
        '--bidirectional', 'True',
        '--output_path', '/home/hypnopump/bruba/rgn2-replica/data/output',
        '--device', 'cpu',
        '--af2_refine', 'True',
    ]

    from scripts.rgn2_predict_fold import predict_refine
    result = predict_refine()
    opened by Serhiy-Shekhovtsov 0
  • Random notes

    • Prediction of angular features:

      • See this:
    • Types of recurrent layers

      • multiplicativeLSTM:
      • layerNormLSTM:
      • multiplicative Integration:
      • other variants:
      • Novel Arch:
      • SRU:
      • peephole:
      • uRNN:
      • delayedLSTM:
      • osiparc: ?
    opened by hypnopump 5
