# RGN2-Replica (WIP)
To eventually become an unofficial working PyTorch implementation of RGN2, a state-of-the-art model for MSA-less protein folding, aimed particularly at cases where no evolutionary homologs are available (i.e. for protein design).
## Install

```bash
$ pip install rgn2-replica
```
### To load the sample dataset

```python
from datasets import load_from_disk

ds = load_from_disk("data/ur90_small")
print(ds['train'][0])
```
### To convert to pandas for exploration

```python
df = ds['train'].to_pandas()
df.sample(5)
```
### To train the ProteinLM

Run the following command with default parameters:

```bash
python -m scripts.lmtrainer
```

This will start a run on CPU, using the sample dataset in the repo directory.
## TO-DO LIST (ordered by priority)

- Provide basic package and file structure
- RGN2:
    - Contribute adaptation of RGN1 for different ops
        - Simple LSTM with:
            - Inputs: `(B, L, emb_dim)`
            - Outputs: `(B, L, 4)` (4 features which should be outputs of linear projections; a minimal sketch follows after this list)
    - Find a good (and reproducible) training scheme
    - Benchmark regression vs classification of the torsional alphabet
- Language Model:
    - Basic tokenizer by @gurvindersingh
    - Basic architecture by @gurvindersingh
    - Adapt for the desired outputs
    - Find a combination of pretraining losses
- To be merged when the first versions of RGN are ready:
    - Geometry module
        - Adapt functionality from MP-NeRF (see the NeRF sketch after this list):
            - Sidechain building
            - Full backbone from the CA trace
            - Fast loss functions and metrics
            - Modifications to convert the LSTM cell into an RGN cell
- Contribute trainer classes / functionality
    - Sequence preprocessing for AminoBERT (see the preprocessing sketch after this list)
        - Inverted fragments
        - Sequence masking
        - Loss function wrapper v1 by @DrHB
        - Sample dataset by @gurvindersingh
        - Dataloader
        - ...
- Contribute data infrastructure for training:
    - Sequences: UniParc sequences, etc. by @gurvindersingh
    - Structures: will use the amazing sidechainnet work by Jonathan King
- Contribute Rosetta scripts (contact me by email ([email protected]) or Discord to get a key for Rosetta if interested in doing this part)
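For orientation, the recurrent trunk described in the RGN2 item above (inputs of shape `(B, L, emb_dim)`, outputs of shape `(B, L, 4)` via a linear projection) could look roughly like the minimal sketch below. Everything in it (the class name `RGN2LSTM`, `emb_dim=1280`, `hidden_dim=512`, a plain unidirectional LSTM) is a placeholder assumption, not the final architecture or API.

```python
import torch
from torch import nn

class RGN2LSTM(nn.Module):
    """Minimal sketch of the recurrent trunk: (B, L, emb_dim) -> (B, L, 4)."""
    def __init__(self, emb_dim: int = 1280, hidden_dim: int = 512, num_layers: int = 2):
        super().__init__()
        # plain LSTM over the residue dimension; all hyperparameters are placeholders
        self.lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim,
                            num_layers=num_layers, batch_first=True)
        # 4 per-residue output features via a linear projection
        self.to_out = nn.Linear(hidden_dim, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, emb_dim) residue embeddings from the protein language model
        hidden, _ = self.lstm(x)      # (B, L, hidden_dim)
        return self.to_out(hidden)    # (B, L, 4)

if __name__ == "__main__":
    model = RGN2LSTM(emb_dim=1280)
    preds = model(torch.randn(2, 64, 1280))
    print(preds.shape)                # torch.Size([2, 64, 4])
```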
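The geometry items (sidechain building, full backbone from the CA trace) boil down to repeatedly placing a new atom from three previously placed atoms plus internal coordinates, which is exactly what MP-NeRF parallelizes. The function below is a generic, sequential NeRF (Natural Extension of Reference Frame) step written only for illustration; it is not the MP-NeRF API, and the batched kernels from that repo should be reused here rather than re-implemented.

```python
import torch

def nerf_extend(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor,
                bond_length: torch.Tensor, bond_angle: torch.Tensor,
                torsion: torch.Tensor) -> torch.Tensor:
    """Place atom D given atoms A, B, C (each (..., 3)) and the internal
    coordinates |C-D|, angle(B, C, D) and dihedral(A, B, C, D), in radians.
    Generic NeRF step for illustration; MP-NeRF provides faster batched versions."""
    cb = c - b
    cb = cb / torch.norm(cb, dim=-1, keepdim=True)
    n = torch.cross(b - a, cb, dim=-1)
    n = n / torch.norm(n, dim=-1, keepdim=True)
    m = torch.cross(n, cb, dim=-1)                # completes the orthonormal frame
    # coordinates of D in the local frame spanned by (cb, m, n)
    d_local = torch.stack([-torch.cos(bond_angle),
                           torch.sin(bond_angle) * torch.cos(torsion),
                           torch.sin(bond_angle) * torch.sin(torsion)], dim=-1)
    frame = torch.stack([cb, m, n], dim=-1)       # (..., 3, 3), columns cb, m, n
    return c + bond_length.unsqueeze(-1) * torch.einsum('...ij,...j->...i', frame, d_local)

if __name__ == "__main__":
    a, b, c = torch.randn(3, 3)                   # three previously placed atoms
    d = nerf_extend(a, b, c,
                    bond_length=torch.tensor(1.33),
                    bond_angle=torch.tensor(2.03),
                    torsion=torch.tensor(3.14))
    print(d)                                      # coordinates of the new atom
```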
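The AminoBERT preprocessing items amount to BERT-style corruptions of amino-acid strings: reversing a random chunk ("inverted fragments") and masking random residues. The toy functions below only illustrate the idea; the fragment lengths, mask rate and mask symbol are made-up values, and the actual tokenizer / loss wrapper in this repo define the real ones.

```python
import random

MASK_TOKEN = "#"   # placeholder symbol; the real tokenizer defines its own mask token

def invert_fragment(seq: str, min_len: int = 10, max_len: int = 40) -> str:
    """Reverse a random contiguous fragment of the sequence (illustrative only)."""
    if len(seq) < min_len:
        return seq
    frag_len = random.randint(min_len, min(max_len, len(seq)))
    start = random.randint(0, len(seq) - frag_len)
    end = start + frag_len
    return seq[:start] + seq[start:end][::-1] + seq[end:]

def mask_sequence(seq: str, mask_rate: float = 0.15) -> str:
    """Replace a random subset of residues with the mask token (illustrative only)."""
    chars = list(seq)
    for i in range(len(chars)):
        if random.random() < mask_rate:
            chars[i] = MASK_TOKEN
    return "".join(chars)

if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"
    print(invert_fragment(seq))
    print(mask_sequence(seq))
```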
## NOTES

- Use the functionality provided in MP-NeRF wherever possible (avoid repetition).
## Contribute

Hey there! New ideas are welcome: open/close issues, fork the repo and share your code with a Pull Request.

Currently, the main discussion about the model's development is happening in this Discord server, under the /self-supervised-learning channel.

Clone this project to your computer:

`git clone https://github.com/EricAlcaide/pysimplechain`

Please follow this guideline on open-source contribution.
## Citations

```bibtex
@article{Chowdhury2021.08.02.454840,
    author = {Chowdhury, Ratul and Bouatta, Nazim and Biswas, Surojit and Rochereau, Charlotte and Church, George M. and Sorger, Peter K. and AlQuraishi, Mohammed},
    title = {Single-sequence protein structure prediction using language models from deep learning},
    elocation-id = {2021.08.02.454840},
    year = {2021},
    doi = {10.1101/2021.08.02.454840},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2021/08/04/2021.08.02.454840},
    eprint = {https://www.biorxiv.org/content/early/2021/08/04/2021.08.02.454840.full.pdf},
    journal = {bioRxiv}
}

@article{alquraishi_2019,
    author = {AlQuraishi, Mohammed},
    title = {End-to-End Differentiable Learning of Protein Structure},
    journal = {Cell Systems},
    volume = {8},
    number = {4},
    pages = {292-301.e3},
    year = {2019},
    DOI = {10.1016/j.cels.2019.03.006},
    URL = {https://www.cell.com/cell-systems/fulltext/S2405-4712(19)30076-6}
}
```