A package to predict protein inter-residue geometries from sequence data

Overview

trRosetta

This package is part of the trRosetta protein structure prediction protocol developed in "Improved protein structure prediction using predicted inter-residue orientations". It includes tools to predict protein inter-residue geometries from a multiple sequence alignment or a single sequence.

Contact: Ivan Anishchenko, [email protected]

Updates

trRosetta2 is now available (May 20, 2021)

Requirements

tensorflow (tested on versions 1.13 and 1.14)

Download

# download package
git clone https://github.com/gjoni/trRosetta
cd trRosetta

# download pre-trained network
wget https://files.ipd.uw.edu/pub/trRosetta/model2019_07.tar.bz2
tar xf model2019_07.tar.bz2

Usage

python ./network/predict.py -m ./model2019_07 example/T1001.a3m example/T1001.npz
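predict.py writes the predicted inter-residue geometries to the given .npz file as binned probability distributions. A minimal sketch of inspecting such a file, using a synthetic stand-in so it is self-contained (the key names and bin counts match this repo's outputs, but the bin ordering, in particular that the first bin of dist means "no contact", is an assumption here):

```python
import numpy as np

# Build a synthetic stand-in for a predict.py output file so the sketch
# is self-contained (assumed bin counts: dist 37, omega 25, theta 25, phi 13).
L = 50
rng = np.random.default_rng(0)
dist = rng.random((L, L, 37)).astype(np.float32)
dist /= dist.sum(axis=-1, keepdims=True)  # normalize to a distribution
np.savez_compressed("/tmp/example_pred.npz", dist=dist,
                    omega=rng.random((L, L, 25)).astype(np.float32),
                    theta=rng.random((L, L, 25)).astype(np.float32),
                    phi=rng.random((L, L, 13)).astype(np.float32))

pred = np.load("/tmp/example_pred.npz")
# Contact probability: assuming bin 0 is the no-contact bin and bins 1..36
# cover 2-20 A in 0.5 A steps, distances < 8 A correspond to bins 1..12.
p_contact = pred["dist"][:, :, 1:13].sum(axis=-1)
print(p_contact.shape)  # (50, 50)
```

Summing the sub-8 Å distance bins in this way gives a rough per-pair contact probability map.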

References

J Yang, I Anishchenko, H Park, Z Peng, S Ovchinnikov, D Baker. Improved protein structure prediction using predicted inter-residue orientations. PNAS (2020) 117(3): 1496–1503.

Comments
  • GPU acceleration

    Hi, if you use a GPU to accelerate the prediction step, how much speedup do you see? I used a Tesla T4 with tensorflow-1.14-gpu and measured the time spent in sess.run() in predict.py. The speedup was only 1.5× over the CPU. Is this normal? How much speedup should be expected? Looking forward to your reply, thanks.

    opened by Gradie 7
  • Error when generating pdb file

    I want to test the performance of trRosetta on a bunch of proteins I have. I got the .npz files from predict.py, but when I tried to use trRosetta to output the predicted structure I got the following error:

    PyRosetta-4 2020 [Rosetta PyRosetta4.Release.python36.ubuntu 2020.50+release.1295438cd4bd2be39c9dbbfab8db669ab62415ab 2020-12-12T00:30:01]
    retrieved from: http://www.pyrosetta.org
    (C) Copyright Rosetta Commons Member Institutions.
    Created in JHU by Sergey Lyskov and PyRosetta Team.
    temp folder: /dev/shm/64q8sw5m
    dist restraints: 8485
    omega restraints: 8437
    theta restraints: 17282
    phi restraints: 17292
    mutation: G26A
    mutation: G73A
    mutation: G81A
    mutation: G89A
    mutation: G95A
    mutation: G117A
    mutation: G150A
    mutation: G151A

    ERROR: Pose::set_phi( Size const seqpos , Real const setting ): residue seqpos is not part of a protein, peptoid, or carbohydrate!
    ERROR:: Exit from: /home/benchmark/rosetta/source/src/core/pose/Pose.cc line: 1069
    Traceback (most recent call last):
      File "/home/zyxue/Record/trRosetta/trRosetta.py", line 212, in <module>
        main()
      File "/home/zyxue/Record/trRosetta/trRosetta.py", line 110, in main
        set_random_dihedral(pose)
      File "/home/zyxue/Record/trRosetta/utils_ros.py", line 152, in set_random_dihedral
        pose.set_phi(i,phi)
    RuntimeError:
      File: /home/benchmark/rosetta/source/src/core/pose/Pose.cc:1069
      [ ERROR ] UtilityExitException
      ERROR: Pose::set_phi( Size const seqpos , Real const setting ): residue seqpos is not part of a protein, peptoid, or carbohydrate!

    It seems like some residues are not protein, but that should be impossible, since the sequence I used was generated directly from the PDB structure, and the sequence alignment and the first-step constraint prediction went well. I wonder what this error means and whether there is a way to fix it.

    opened by Joey-Xue 6
  • About training!!!

    Hi Ivan, could you release the training code, including the data processing steps such as subsampling and the 300-residue crop? This has confused me for a long time. Thanks!

    opened by Maikuraky 2
  • trRosetta takes an awful long time on large MSA's, how can I speed it up?

    I'm trying to run trRosetta on about 3000 different proteins. For each protein I have generated an MSA file with HHblits.

    For the first few proteins everything goes smoothly, and it predicts the distogram/angles in about 1-3 minutes. However, at some point it reaches one of the larger MSAs. The particular protein it stalls on has a length of 1600 amino acids and roughly 3500 aligned sequences.

    So far Rosetta has been running for about 1 hour on this sequence, and has printed the following:

    2020-08-06 15:38:16.462083: W tensorflow/core/framework/allocator.cc:124] Allocation of 4498921476 exceeds 10% of system memory.
    2020-08-06 15:38:24.750222: W tensorflow/core/framework/allocator.cc:124] Allocation of 4498921476 exceeds 10% of system memory.
    2020-08-06 16:46:23.006386: W tensorflow/core/framework/allocator.cc:124] Allocation of 4498921476 exceeds 10% of system memory.
    2020-08-06 16:46:23.006592: W tensorflow/core/framework/allocator.cc:124] Allocation of 4080654400 exceeds 10% of system memory.
    2020-08-06 16:46:30.095595: W tensorflow/core/framework/allocator.cc:124] Allocation of 4080654400 exceeds 10% of system memory.
    ./data/model2019_07/model.xaa - done
    

    I'm guessing this is some sort of issue where the whole thing isn't held in memory but spills into the swap file, which would explain why it takes such an awfully long time. However, I have 32 GB of RAM on this machine and it isn't really using more than 11 GB. Furthermore, everything seems to be done on the CPU rather than the GPU; is that normal for this code?

    Finally, one thing I have been wondering about: the model seems to be loaded anew every time the code runs, which takes quite a while and seems wasteful. I would imagine the model only needs to be loaded once and then run on each of the MSAs to predict the distograms/angles. However, the graph being built seems to depend on the specific MSA used as input, which I guess makes this less trivial?

    opened by tueboesen 2
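Regarding the large-MSA slowdown above: memory and compute grow with both sequence length and alignment depth, so one common mitigation (not an official feature of this repo) is to subsample deep alignments before prediction, always keeping the query row. A minimal sketch, assuming the a3m has already been parsed into a list of sequence strings with the query first:

```python
import random

def subsample_msa(seqs, max_seqs=1000, seed=0):
    """Keep the query (row 0) plus a random subset of the remaining rows."""
    if len(seqs) <= max_seqs:
        return list(seqs)
    rng = random.Random(seed)
    return [seqs[0]] + rng.sample(seqs[1:], max_seqs - 1)

# Hypothetical deep alignment: query followed by 5000 homologs
msa = ["QUERYSEQ"] + ["HOMOLOG%05d" % i for i in range(5000)]
small = subsample_msa(msa, max_seqs=1000)
print(len(small), small[0])  # 1000 QUERYSEQ
```

The max_seqs threshold here is a hypothetical knob trading accuracy for speed; note that for a 1600-residue protein the L×L feature maps still dominate memory, so alignment subsampling alone may not fully resolve the allocation warnings.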
  • Question about the distance histogram

    Hi Ivan,

    Thanks for sharing the source code, it's a great job.

    For the distance histogram prediction, as the paper says: "The distance range (2 to 20 Å) is binned into 36 equally spaced segments, 0.5 Å each, plus one bin indicating that residues are not in contact." So the output shape of the network is L×L×37, and the last bin corresponds to distances >20 Å.

    But when converting the distance distribution to an energy potential, the article says: "For the distance distribution, the probability value for the last bin, i.e., [19.5, 20], is used as a reference state ...", which contradicts the definition of the last bin (>20 Å, not [19.5, 20]).

    Is it my misunderstanding or is there something wrong?

    opened by AndersJing 2
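One reading that reconciles the two statements in the question above: the no-contact bin is stored separately from the 36 distance bins, so the last distance bin really is [19.5, 20] Å and serves as the reference state. Under that assumption, the distance-to-potential conversion described in the paper can be sketched as:

```python
import numpy as np

def dist_potential(p, eps=1e-4):
    """Convert a 37-bin distance distribution to a potential.
    Assumes p[0] is the no-contact bin and p[1:] are 36 bins over 2-20 A,
    with the last bin ([19.5, 20] A) used as the reference state."""
    p = np.asarray(p, dtype=float)
    p_ref = p[-1]
    # -log(p_i / p_ref) over the 36 distance bins only
    return -np.log((p[1:] + eps) / (p_ref + eps))

bins = np.full(37, 1.0 / 37)       # uniform distribution as a toy input
energy = dist_potential(bins)
print(energy.shape)                # (36,)
print(bool(np.allclose(energy, 0.0)))  # uniform input -> flat potential: True
```

The eps smoothing is an assumption added here to avoid log(0); the paper additionally fits splines through these binned values before use in Rosetta minimization.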
  • Glycine Residues

    Thank you for providing this excellent resource!

    Can I ask a question about a minor detail: how did you define d, omega, theta and phi when one or both of the residues in a pair is a glycine? What is the network predicting in these cases?

    I could not find this in your paper or supplementary info - sorry if I have missed it!

    Many thanks!

    opened by HanneWhitt 2
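On the glycine question: trRosetta-family codebases typically define the Cβ-based geometries (d, ω, θ, φ) using a virtual Cβ placed from the backbone N, Cα, and C atoms, so the network predicts the same quantities for glycine pairs. A sketch of the usual ideal-geometry construction (the coefficients are the widely used values; this is offered as a likely answer, not one confirmed from the paper):

```python
import numpy as np

def virtual_cb(n, ca, c):
    """Place an idealized (virtual) C-beta from backbone N, CA, C coordinates.
    The constants are the commonly used ideal-geometry coefficients."""
    n, ca, c = (np.asarray(x, dtype=float) for x in (n, ca, c))
    b = ca - n
    cc = c - ca
    a = np.cross(b, cc)
    return -0.58273431 * a + 0.56802827 * b - 0.54067466 * cc + ca

# Toy backbone coordinates (hypothetical, just to show the call)
cb = virtual_cb([0.0, 1.46, 0.0], [0.0, 0.0, 0.0], [1.52, 0.0, 0.0])
print(cb.shape)  # (3,)
```

With realistic backbone geometry this places the virtual Cβ at the standard bond length and angles, so distances and orientations are well defined even when one or both residues are glycine.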
  • Question about trRosetta running

    Hi there.

    I want to generate structure models with trRosetta, so I ran:

    python trRosetta/trRosetta.py trRosetta/example/T1008.npz trRosetta/example/T1008.fasta model.pdb

    But the output was:

    temp folder: /dev/shm/3rsb8fq5
    dist restraints: 2877
    omega restraints: 2872
    theta restraints: 5835
    OSError: [Errno 28] No space left on device

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "trRosetta/trRosetta.py", line 212, in <module>
        main()
      File "trRosetta/trRosetta.py", line 58, in main
        rst = gen_rst(npz,tmpdir,params)
      File "/tf/trRosetta/utils_ros.py", line 137, in gen_rst
        f.close()
    OSError: [Errno 28] No space left on device

    My shm folder has 7.9M inodes (IUse% = 1%) but a size of only 64M. How can I fix this?

    opened by kch38896 1
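The "temp folder: /dev/shm/..." line shows the restraint files are written to a tmpfs that is only 64 MB here, which is why the writes fail despite plenty of free inodes. Enlarging the tmpfs (e.g. mount -o remount,size=2G /dev/shm, as root) is one fix; another is to fall back to a bigger directory when /dev/shm is small. A generic sketch of such a fallback (the exact tmpdir logic inside trRosetta.py is an assumption here):

```python
import os
import shutil
import tempfile

def make_tmpdir(preferred="/dev/shm", min_free_mb=256):
    """Create a scratch directory under `preferred` if it exists and has
    enough free space; otherwise fall back to the default temp location."""
    base = None
    if os.path.isdir(preferred):
        free_mb = shutil.disk_usage(preferred).free // (1024 * 1024)
        if free_mb >= min_free_mb:
            base = preferred
    return tempfile.mkdtemp(prefix="trR_", dir=base)

tmpdir = make_tmpdir()
created = os.path.isdir(tmpdir)
print(created)  # True
shutil.rmtree(tmpdir)  # clean up the demo directory
```

The min_free_mb threshold is a hypothetical safety margin; picking /dev/shm first keeps the speed benefit of tmpfs whenever space allows.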
  • Details on computational profile of the pipeline

    Hi --

    I'm interested in understanding the computational profile of these kinds of structure prediction tasks.

    As I understand it, there are three broad steps to go from input to a 3D structure: (a) "feature engineering" via MSAs etc. to form the L×L×526 input tensor; (b) a forward pass through the CNN to predict distances/angles; (c) structure determination to realize an actual 3D structure.

    Are you able to comment on the relative cost in terms of wall-clock time / compute required / etc of those three steps? Is one step substantially more costly than the others? Do you have a sense of how long it would take to go from raw input -> 3d structure for a single example?

    (I'm primarily concerned w/ inference-time costs here -- I see in your paper that training the CNN takes ~ 9 days on a single GPU.)

    Thanks!

    opened by bkj 1
  • how to generate npz file

    Hello!

    I have a naive question. I don't really understand the following process in the README file:

    Using the generated MSA, predict the distance and orientations by running the scripts at: https://github.com/gjoni/trRosetta

    I have an MSA in a3m format. Which scripts shall I use to predict distances and orientations?

    Thanks in advance for any help.

    Best,

    Anupam

    opened by anu-bioinfo 0
  • utils.py - setting an array element with a sequence

    Hi there,

    When I try to run predict.py I get the following error:

      File "/mnt/vdf/pepBuilder/trRosetta/trRosetta/network/utils.py", line 22, in parse_a3m
        msa = np.array([list(s) for s in seqs], dtype='|S1').view(np.uint8)
    ValueError: setting an array element with a sequence

    Should line 22 be changed to the following?

    msa = np.array([list(seqs) for s in seqs], dtype='|S1').view(np.uint8)

    Do you have an explanation for this?

    Thanks in advance.

    Kapila

    opened by kapilaGIT 0
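On the ValueError above: the proposed change would not help ([list(seqs) for s in seqs] builds the wrong rows entirely). np.array raises this error when the rows have unequal lengths, which for an a3m usually means the lowercase insertion states were not stripped before stacking. A minimal sketch of the usual cleanup (this mirrors what parse_a3m is expected to do; the exact upstream code is an assumption here):

```python
import string
import numpy as np

# Lowercase letters in a3m mark insertions relative to the query;
# deleting them restores a rectangular alignment.
strip_lower = str.maketrans(dict.fromkeys(string.ascii_lowercase))

raw = ["MKVQLS", "MKVaiQLS", "MKV-LS"]   # second row carries insertions
seqs = [s.translate(strip_lower) for s in raw]
assert len({len(s) for s in seqs}) == 1  # all rows now equal length

msa = np.array([list(s) for s in seqs], dtype="|S1").view(np.uint8)
print(msa.shape)  # (3, 6)
```

If the error persists after stripping lowercase characters, check the a3m for stray blank lines or line-wrapped sequences, either of which also produces ragged rows.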
Owner
Ivan Anishchenko
protein structure prediction, computational biology