Trainable PyTorch reproduction of AlphaFold 2

AQ Laboratory

Last update: Dec 29, 2022

Overview

OpenFold

A faithful PyTorch reproduction of DeepMind's AlphaFold 2.

Features

OpenFold carefully reproduces (almost) all of the features of the original open source inference code. The sole exception is model ensembling, which fared poorly in DeepMind's own ablation testing and is being phased out in future DeepMind experiments. It is omitted here for the sake of reducing clutter. In cases where the Nature paper differs from the source, we always defer to the latter.

OpenFold is built to support inference with AlphaFold's original JAX weights. Try it out with our Colab notebook.

Unlike DeepMind's public code, OpenFold is also trainable. It can be trained with DeepSpeed and with mixed precision. bfloat16 training is not currently supported, but will be in the future.

Installation (Linux)

Python dependencies available through pip are provided in requirements.txt. OpenFold depends on openmm==7.5.1 and pdbfixer, which are only available via conda. For producing sequence alignments, you'll also need kalign, the HH-suite, and one of {jackhmmer, MMseqs2} installed on your system. Finally, some download scripts require aria2c.
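Before running the alignment or download scripts, it can help to confirm that the external tools named above are visible on PATH. The following is a minimal sketch, not part of OpenFold: the helper names are made up for this example, and the executable names (hhblits and hhsearch for the HH-suite, mmseqs for MMseqs2) are taken from the commands shown elsewhere in this README.

```python
import shutil

REQUIRED = ["kalign", "hhblits", "hhsearch", "aria2c"]
ONE_OF = ["jackhmmer", "mmseqs"]  # at least one MSA search tool is needed


def missing_tools(names, which=shutil.which):
    """Return the names for which `which` finds no executable on PATH."""
    return [name for name in names if which(name) is None]


def check_tools(which=shutil.which):
    """Return a list of human-readable problems; empty means everything was found."""
    problems = [f"missing: {name}" for name in missing_tools(REQUIRED, which)]
    if len(missing_tools(ONE_OF, which)) == len(ONE_OF):
        problems.append("missing: one of " + " / ".join(ONE_OF))
    return problems


for problem in check_tools():
    print(problem)
```

The `which` argument exists only so the lookup can be replaced in tests; by default the real PATH is searched.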

For convenience, we provide a script that installs Miniconda locally, creates a conda virtual environment, installs all Python dependencies, and downloads useful resources (including DeepMind's pretrained parameters). Run:

scripts/install_third_party_dependencies.sh

To activate the environment, run:

source scripts/activate_conda_env.sh

To deactivate it, run:

source scripts/deactivate_conda_env.sh

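To install the HH-suite to /usr/bin, run the following with root privileges (the leading # denotes a root prompt and is not part of the command):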

# scripts/install_hh_suite.sh

Usage

To download DeepMind's pretrained parameters and common ground truth data, run:

scripts/download_data.sh data/

You have two choices for downloading protein databases, depending on whether you want to use DeepMind's MSA generation pipeline (with HMMER & HHblits) or ColabFold's, which uses the faster MMseqs2 instead. For the former, run:

scripts/download_alphafold_databases.sh data/

For the latter, run:

scripts/download_mmseqs_databases.sh data/    # downloads .tar files
scripts/prep_mmseqs_databases.sh data/        # unpacks and preps the databases

Make sure to run the latter command on the machine that will be used for MSA generation (the script estimates how the precomputed database index used by MMseqs2 should be split according to the memory available on the system).

Alternatively, you can use raw MSAs from ProteinNet. After downloading the database, use scripts/prepare_proteinnet_msas.py to convert the data into a format recognized by the OpenFold parser. The resulting directory becomes the alignment_dir used in subsequent steps. Use scripts/unpack_proteinnet.py to extract .core files from ProteinNet text files.

Inference

To run inference on a sequence or multiple sequences using a set of DeepMind's pretrained parameters, run e.g.:

python3 run_pretrained_openfold.py \
    target.fasta \
    data/uniref90/uniref90.fasta \
    data/mgnify/mgy_clusters_2018_12.fa \
    data/pdb70/pdb70 \
    data/pdb_mmcif/mmcif_files/ \
    data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --output_dir ./ \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --model_device cuda:1 \
    --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
    --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
    --hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
    --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign

where data is the same directory as in the previous step. If jackhmmer, hhblits, hhsearch and kalign are available at the default path of /usr/bin, their binary_path command-line arguments can be dropped. If you've already computed alignments for the query, you have the option to circumvent the expensive alignment computation here.

Training

After activating the OpenFold environment with source scripts/activate_conda_env.sh, install OpenFold by running

python setup.py install

To train the model, you will first need to precompute protein alignments.

You have two options. You can use the same procedure DeepMind used by running the following:

python3 scripts/precompute_alignments.py mmcif_dir/ alignment_dir/ \
    data/uniref90/uniref90.fasta \
    data/mgnify/mgy_clusters_2018_12.fa \
    data/pdb70/pdb70 \
    data/pdb_mmcif/mmcif_files/ \
    data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --cpus 16 \
    --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
    --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
    --hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
    --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign

As noted before, you can skip the binary_path arguments if these binaries are at /usr/bin. Expect this step to take a very long time, even for small numbers of proteins.

Alternatively, you can generate MSAs with the ColabFold pipeline (and templates with HHsearch) with:

python3 scripts/precompute_alignments_mmseqs.py input.fasta \
    data/mmseqs_dbs \
    uniref30_2103_db \
    alignment_dir \
    ~/MMseqs2/build/bin/mmseqs \
    /usr/bin/hhsearch \
    --env_db colabfold_envdb_202108_db \
    --pdb70 data/pdb70/pdb70

where input.fasta is a FASTA file containing one or more query sequences. To generate an input FASTA from a directory of mmCIF and/or ProteinNet .core files, we provide scripts/data_dir_to_fasta.py.
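For readers unfamiliar with the format, a FASTA file is a series of records, each consisting of a header line starting with ">" followed by one or more sequence lines. The sketch below shows one way to read such a file into (tag, sequence) pairs. It is an illustrative stand-alone parser, not the one OpenFold uses, and the function name and the example tags are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (tag, sequence) pairs."""
    records = []
    tag, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if tag is not None:
                records.append((tag, "".join(chunks)))
            parts = line[1:].split()
            tag, chunks = (parts[0] if parts else ""), []
        elif tag is not None:
            chunks.append(line)  # sequences may span several lines
    if tag is not None:
        records.append((tag, "".join(chunks)))
    return records


example = ">query_1\nMAAHKGAEHH\nHKAAEHHEQA\n>query_2\nMVLSPADKTN\n"
for tag, seq in parse_fasta(example):
    print(tag, len(seq))
```

Only the first whitespace-delimited token of each header is kept as the tag, which is a simplification chosen for this example.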

Next, generate a cache of certain datapoints in the mmCIF files:

python3 scripts/generate_mmcif_cache.py \
    mmcif_dir/ \
    mmcif_cache.json \
    --no_workers 16

This cache is used to minimize the number of mmCIF parses performed during training-time data preprocessing. Finally, call the training script:

# In multi-GPU settings, the seed must be specified (see --seed below).
python3 train_openfold.py mmcif_dir/ alignment_dir/ template_mmcif_dir/ \
    2021-10-10 \
    --template_release_dates_cache_path mmcif_cache.json \
    --precision 16 \
    --gpus 8 --replace_sampler_ddp=True \
    --seed 42 \
    --deepspeed_config_path deepspeed_config.json \
    --resume_from_ckpt ckpt_dir/

where --template_release_dates_cache_path is a path to the .json file generated in the previous step. A suitable DeepSpeed configuration file can be generated with scripts/build_deepspeed_config.py. The training script is written with PyTorch Lightning and supports the full range of training options that entails, including multi-node distributed training. For more information, consult PyTorch Lightning documentation and the --help flag of the training script.
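As a rough illustration of why a release-date cache helps, the sketch below filters entries by a cutoff date using only cached metadata, so no mmCIF file has to be parsed again. The cache layout (an ID mapped to a dictionary with a release_date key), the example IDs, and the function name are assumptions made for this example; the file written by scripts/generate_mmcif_cache.py may be organized differently.

```python
import json
from datetime import date

# Hypothetical cache layout: {"<file id>": {"release_date": "YYYY-MM-DD"}}.
# The real file written by the script may use different keys.
EXAMPLE_CACHE = json.loads(
    '{"1abc": {"release_date": "2019-05-01"},'
    ' "2xyz": {"release_date": "2022-01-15"}}'
)


def released_on_or_before(cache, cutoff):
    """Return the IDs whose cached release date is not after `cutoff`."""
    return sorted(
        file_id
        for file_id, entry in cache.items()
        if date.fromisoformat(entry["release_date"]) <= cutoff
    )


print(released_on_or_before(EXAMPLE_CACHE, date(2021, 10, 10)))
```

The lookup touches only the JSON dictionary, which is the point of precomputing the cache once instead of opening every structure file at training time.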

Testing

To run unit tests, use

scripts/run_unit_tests.sh

The script is a thin wrapper around Python's unittest suite, and recognizes unittest commands. E.g., to run a specific test verbosely:

scripts/run_unit_tests.sh -v tests.test_model

Certain tests require that AlphaFold (v. 2.0.1) be installed in the same Python environment. These run components of AlphaFold and OpenFold side by side and ensure that output activations are adequately similar. For most modules, we target a maximum difference of 1e-4.
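The comparison described above amounts to checking the largest element-wise difference between two sets of activations against a tolerance. A minimal sketch in plain Python follows; the function name and the sample values are invented for illustration, and the actual tests operate on tensors rather than lists.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


reference = [0.12345, -1.50000, 3.25000]  # stand-in for AlphaFold activations
candidate = [0.12349, -1.50003, 3.24996]  # stand-in for OpenFold activations

TOLERANCE = 1e-4  # the target quoted above
assert max_abs_diff(reference, candidate) < TOLERANCE
```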

Copyright notice

While AlphaFold's and, by extension, OpenFold's source code is licensed under the permissive Apache License, Version 2.0, DeepMind's pretrained parameters remain under the more restrictive CC BY-NC 4.0 license, a copy of which is downloaded to openfold/resources/params by the installation script. They are thereby made unavailable for commercial use.

Contributing

If you encounter problems using OpenFold, feel free to create an issue! We also welcome pull requests from the community.

Citing this work

Stay tuned for an OpenFold DOI. Any work that cites OpenFold should also cite AlphaFold.

Comments

Cuda/Pytorch/Installation Issues
Hello! So I have been struggling with a strange issue that I hope you or someone would be able to help me with. Let me start by providing some information:

OS: Ubuntu 20.04.4

GPU: NVIDIA RTX A6000

NVIDIA-SMI/Driver Version: 470.129.06

CUDA Version: 11.4

So I am not sure if this is a problem with how I am attempting to install openfold, or if something else is going on. Essentially, after cloning the repo, the first thing I do is run scripts/install_third_party_dependencies.sh. This creates an environment called openfold_venv; however, this environment does not seem to contain many of the required packages (e.g. torch is absent). Following this with scripts/activate_environment.sh seems to fail. I have alternatively tried conda env create -f environment.yml, which sets up an environment in a different location. Either way, after setting up the environment I end up with one of the following issues, either during python setup.py install or during inference:

"The detected CUDA version (10.1) mismatches the version that was used to compile PyTorch (11.2). Please make sure to use the same CUDA versions." (despite torch.version.cuda returning 11.3)

"runtimeerror: Cuda error: no kernal image is available for execution on the device"

These errors occur on clean installs with no conda or cudatoolkits installed anywhere else on the machine, so it is rather puzzling. As I said, I am not sure if this is due to performing the install sequence incorrectly, but I have tried several different solutions and they all seem to circle back to one of these errors.
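One way to investigate the first error is to compare the CUDA release reported by the nvcc compiler on PATH with the value of torch.version.cuda, since a mismatch between the two would be consistent with the message quoted above. The sketch below only parses and compares version strings. The sample text mimics the last line of `nvcc --version` output, the helper names are hypothetical, and comparing only major.minor is an assumption made for this example.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract the 'major.minor' CUDA release from `nvcc --version` text."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def versions_match(toolkit_version, torch_cuda_version):
    """Compare major.minor only (an assumption made for this sketch)."""
    return toolkit_version.split(".")[:2] == torch_cuda_version.split(".")[:2]


# Sample text in the style of nvcc's version banner, not captured from the
# machine discussed in this thread.
sample = "Cuda compilation tools, release 10.1, V10.1.243"
toolkit = parse_nvcc_release(sample)
print(toolkit, versions_match(toolkit, "11.3"))
```

In practice the first argument would come from running `nvcc --version` and the second from `torch.version.cuda` inside the activated environment.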

I apologize as I know this is rather vague, but if you can offer any sort of guidance it would be greatly appreciated!
opened by Cweb118 40

ModuleNotFoundError: No module named 'torch'

Latest version's installation fails when trying to also install FlashAttention:

(...)
Attempting to install FlashAttention
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/HazyResearch/flash-attention.git@5b838a8bef78186196244a4156ec35bbb58c337d
  Cloning https://github.com/HazyResearch/flash-attention.git (to revision 5b838a8bef78186196244a4156ec35bbb58c337d) to /tmp/pip-req-build-2hclpm0v
  Running command git clone -q https://github.com/HazyResearch/flash-attention.git /tmp/pip-req-build-2hclpm0v
  Running command git rev-parse -q --verify 'sha^5b838a8bef78186196244a4156ec35bbb58c337d'
  Running command git fetch -q https://github.com/HazyResearch/flash-attention.git 5b838a8bef78186196244a4156ec35bbb58c337d
  Running command git checkout -q 5b838a8bef78186196244a4156ec35bbb58c337d
  Resolved https://github.com/HazyResearch/flash-attention.git to commit 5b838a8bef78186196244a4156ec35bbb58c337d
  Running command git submodule update --init --recursive -q
    ERROR: Command errored out with exit status 1:
     command: /usr/local/openfold/openfold/lib/conda/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-2hclpm0v/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-2hclpm0v/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-_azdauhm
         cwd: /tmp/pip-req-build-2hclpm0v/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-2hclpm0v/setup.py", line 10, in <module>
        import torch
    ModuleNotFoundError: No module named 'torch'
    ----------------------------------------
WARNING: Discarding git+https://github.com/HazyResearch/flash-attention.git@5b838a8bef78186196244a4156ec35bbb58c337d. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
(...)

Also, it seems like aws is now also needed, so awscli should probably be added to environment.yml ?

opened by lucajovine 18

OOM with bfloat16, no speed-up

New issue based on: https://github.com/aqlaboratory/openfold/issues/34

Turning on bfloat16 in deepspeed doesn't seem to have the desired effect. Model params size remains unchanged. Hitting OOM in validation which works fine in FP16.

Training with bfloat16 in pytorch-lightning fails:

  File "openfold/openfold/utils/loss.py", line 46, in sigmoid_cross_entropy
    log_p = torch.nn.functional.logsigmoid(logits)
RuntimeError: "log_sigmoid_forward_cuda" not implemented for 'BFloat16'

Support still missing in deepspeed? https://github.com/microsoft/DeepSpeed/issues/974

Tested on A100 with torch 1.10.1+cu113
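For reference, the operation that fails here is the log-sigmoid, log σ(x) = -log(1 + exp(-x)). The sketch below is a plain-Python, numerically stable version of that formula, meant only to show what the missing kernel computes; it is not OpenFold's code. A workaround sometimes used for operations that lack a kernel for a given dtype is to cast the input to float32, apply the operation, and cast the result back, but whether that is appropriate here is not confirmed by this thread.

```python
import math


def log_sigmoid(x):
    """Stable log(sigmoid(x)): min(x, 0) - log1p(exp(-|x|)).

    Using -|x| inside exp() keeps the argument non-positive, so the
    exponential cannot overflow for large-magnitude inputs.
    """
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


for value in (-50.0, 0.0, 50.0):
    print(value, log_sigmoid(value))
```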

opened by lhatsk 14
Are there any alignment files to download?

Hi,

We're trying to reproduce the training process. However, the alignment seems to take an extremely long time.

We used 128 nodes to align 128 mmCIF files (1 file on each node), but it took 13 hours to finish the entire job.

I'm wondering if there is a tar file with precomputed alignments for all mmCIF files that we could download, which would help a lot.

Thanks

opened by Zhang690683220 13

ModuleNotFoundError: No module named 'flash_attn'

After last update (commit 9225f8725b53d19643d1469a57f7d7baea3c0625):

> python3 run_pretrained_openfold.py
Traceback (most recent call last):
  File "run_pretrained_openfold.py", line 49, in <module>
    from openfold.config import model_config, NUM_RES
  File "/usr/local/openfold/openfold/openfold/__init__.py", line 1, in <module>
    from . import model
  File "/usr/local/openfold/openfold/openfold/model/__init__.py", line 11, in <module>
    _modules = [(m, importlib.import_module("." + m, __name__)) for m in __all__]
  File "/usr/local/openfold/openfold/openfold/model/__init__.py", line 11, in <listcomp>
    _modules = [(m, importlib.import_module("." + m, __name__)) for m in __all__]
  File "/usr/local/openfold/openfold/lib/conda/envs/openfold_venv/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/openfold/openfold/openfold/model/evoformer.py", line 22, in <module>
    from openfold.model.primitives import Linear, LayerNorm
  File "/usr/local/openfold/openfold/openfold/model/primitives.py", line 21, in <module>
    from flash_attn.bert_padding import unpad_input, pad_input
ModuleNotFoundError: No module named 'flash_attn'

opened by lucajovine 12

Unusual predicted structures from pretrained OpenFold on Pascal GPU
This is most likely some kind of local configuration error, but I haven't been able to pin down the cause. If anyone has encountered this behavior before or has an idea of what might be wrong based on these output structures, any hints would be greatly appreciated!

Expected behavior:

run_pretrained_openfold.py outputs predicted structures comparable to AlphaFold or OpenFold Colab output.

I expected a structure similar to this unrelaxed prediction from OpenFold Colab model_1 with finetuning_1.pt:

Actual behavior:

My run_pretrained_openfold.py predicted structures are not similar to AlphaFold or OpenFold Colab output.

Predictions from model_1 with finetuning_1.pt (unrelaxed in tan, relaxed in blue):

Predictions from model_1 with params_model_1.npz:

Predictions from model_1 with params_model_1.npz using alignments from ColabFold MMseqs2 (ColabFold had predicted a reasonable expected structure):

Context:

4 x NVidia 1080-TI GPUs Using CUDA 11.3 (if other system data is relevant I can find it)

input/short.fasta

>query
MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH

Run command:

python3 run_pretrained_openfold.py \
    input \
    data/pdb_mmcif/mmcif_files/ \
    --output_dir output \
    --cpus 16 \
    --preset reduced_dbs \
    --uniref90_database_path data/uniref90/uniref90.fasta \
    --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path data/pdb70/pdb70 \
    --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --model_device "cuda:0" \
    --jackhmmer_binary_path $venv_bin_dir/jackhmmer \
    --hhblits_binary_path $venv_bin_dir/hhblits \
    --hhsearch_binary_path $venv_bin_dir/hhsearch \
    --kalign_binary_path $venv_bin_dir/kalign \
    --config_preset "model_1" \
    --openfold_checkpoint_path openfold/resources/openfold_params/finetuning_1.pt

Other configurations I tried, which produced similarly strange outputs:

Removing --openfold_checkpoint_path to just use the AlphaFold weights

Using --config_preset "model_1_ptm" with finetuning_ptm_2.pt

Using --use_precomputed_alignments with alignment results from a previous OpenFold output

Using --use_precomputed_alignments with .a3m results from ColabFold

Using full_dbs instead of reduced_dbs
opened by epenning 11
Custom template results in huge difference with alphafold
Hi there,

Thanks a lot for your effort to implement trainable AlphaFold in PyTorch.

I came across an interesting paper claiming that templates built with information from experimental cryo-EM density maps can improve AlphaFold's accuracy.

The authors provide a Colab notebook here. I tried the notebook, and it worked as intended.

As an example, the PDB entry 7KU7: Input fasta sequence: PLREAKDLHTALHIGPRALSKACNISMQQAREVVQTCPHCNSAPALEAGVNPRGLGPLQIWQTDFTLEPRMAPRSWLAVTVDTASSAIVVTQHGRVTSVAVQHHWATAIAVLGRPKAIKTDNGSCFTSKSTREWLARWGIAHTTGIPGNSQGQAMVERANRLLKDKIRVLAEGDGFMKRIPTSKQGELLAKAMYALNHFERGENTKTPIQKHWRPTVLTEGPPVKIRIETGEWEKGWNVLVWGRGYAAVKNRDTDKVIWVPSRKVKPDITQKDEVTKK

I supplemented a custom template in CIF format: https://drive.google.com/file/d/1DUN793nHr0aRRSp29_FwgTGUREwTHcfp/view?usp=sharing

By using this template and turning off the MSA (skip_all_msa == True, equivalent to using dummy MSA), the mean plddt score is about 90, which is higher than the case with MSA but no custom template.

When I tried to replicate the above procedure in OpenFold, however, it looked like the template didn't help. The mean plddt score was less than 40 for model_1 to 5.

To quickly reproduce the results,

I make an empty directory as the path for the use_precomputed_alignments, which will lead the data pipeline to use the dummy MSA and an empty template.

Then I load template features generated in the Colab notebook template_feature_7ku7.pkl (https://drive.google.com/file/d/1pnZ8pwQZTgcOsHTikQ6X7PQ1bqQs3tqt/view?usp=sharing)

import pickle

with open("template_feature_7ku7.pkl", "rb") as f:
    template_feature = pickle.load(f)
feature_dict = {**feature_dict, **template_feature}

The rest of the code is left intact. So, could you help me check whether there is anything wrong with my approach, or whether this is due to a bug in the template-related code within OpenFold? Thank you very much.
opened by empyriumz 11

Get low lddt score while running inference.

Excellent work!

I'm trying to run the inference process of OpenFold. My input FASTA is:

HBA_HUMAN MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

My shell command is:
python3 run_pretrained_openfold.py \
    data/fasta_dir \
    data/pdb_mmcif/mmcif_files/ \
    --uniref90_database_path data/uniref90/uniref90.fasta \
    --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path data/pdb70/pdb70 \
    --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --output_dir output/ \
    --bfd_database_path data/small_bfd/bfd-first_non_consensus_sequences.fasta \
    --model_device "cuda:0" \
    --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
    --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
    --hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
    --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign \
    --config_preset "finetuning_ptm" \
    --openfold_checkpoint_path openfold/resources/openfold_params/finetuning_ptm_2.pt

And I got the following result:

The average lDDT is pretty low, and this happens every time, even when I choose a simple sequence. Moreover, I notice that the parameter 'use_small_bfd' is set to false by default, but inference works when I set 'bfd_database_path' to data/small_bfd/bfd-first_non_consensus_sequences.fasta.

I'm wondering what happened and hope for your reply.

opened by WeixuanXiong 10

Bug with template mask for batch inference

Hello, my name is James, and I'm working on training a new AlphaFold variant using OpenFold. Thanks for the great tool!

I think I may have found a bug in how the code processes templates for batch sizes larger than 1 (either that or I'm doing something wrong, in which case help would also be appreciated!). Here's a code snippet that reproduces the problem:

import torch
import torch.nn as nn
import numpy as np

from openfold.model.model import AlphaFold
from openfold.config import model_config
from openfold.utils.tensor_utils import tensor_tree_map
from openfold.data import data_transforms

model_name = "model_1_ptm"

conf = model_config(model_name, train=True)
conf.data.common.max_recycling_iters = 0
conf.data.train.subsample_templates = False
conf.data.train.max_msa_clusters = 1
conf.data.train.max_extra_msa = 1
conf.data.train.max_templates = 1

# copied from openfold/test/data_utils.py
def random_template_feats(n_templ, n, batch_size=None):
    b = []
    if batch_size is not None:
        b.append(batch_size)
    batch = {
        "template_mask": np.random.randint(0, 2, (*b, n_templ)),
        "template_pseudo_beta_mask": np.random.randint(0, 2, (*b, n_templ, n)),
        "template_pseudo_beta": np.random.rand(*b, n_templ, n, 3),
        "template_aatype": np.random.randint(0, 22, (*b, n_templ, n)),
        "template_all_atom_mask": np.random.randint(
            0, 2, (*b, n_templ, n, 37)
        ),
        "template_all_atom_positions": 
            np.random.rand(*b, n_templ, n, 37, 3) * 10,
        "template_torsion_angles_sin_cos": 
            np.random.rand(*b, n_templ, n, 7, 2),
        "template_alt_torsion_angles_sin_cos": 
            np.random.rand(*b, n_templ, n, 7, 2),
        "template_torsion_angles_mask": 
            np.random.rand(*b, n_templ, n, 7),
    }
    batch = {k: v.astype(np.float32) for k, v in batch.items()}
    batch["template_aatype"] = batch["template_aatype"].astype(np.int64)
    return batch


def random_extra_msa_feats(n_extra, n, batch_size=None):
    b = []
    if batch_size is not None:
        b.append(batch_size)
    batch = {
        "extra_msa": np.random.randint(0, 22, (*b, n_extra, n)).astype(
            np.int64
        ),
        "extra_has_deletion": np.random.randint(0, 2, (*b, n_extra, n)).astype(
            np.float32
        ),
        "extra_deletion_value": np.random.rand(*b, n_extra, n).astype(
            np.float32
        ),
        "extra_msa_mask": np.random.randint(0, 2, (*b, n_extra, n)).astype(
            np.float32
        ),
    }
    return batch

n_templ = 1
n_res = 256
n_extra_seq = 1
n_seq = 1
bsize = 2

model = AlphaFold(conf).cuda()

batch = {}


tf = torch.randint(conf.model.input_embedder.tf_dim - 1, size=(bsize, n_res))
batch["target_feat"] = nn.functional.one_hot(tf, conf.model.input_embedder.tf_dim).float()
batch["aatype"] = torch.argmax(batch["target_feat"], dim=-1)

batch["target_feat"] = torch.rand((bsize, n_res, conf.model.input_embedder.tf_dim))
batch["residue_index"] = torch.rand((bsize, n_res))
batch["msa_feat"] = torch.rand((bsize, n_seq, n_res, conf.model.input_embedder.msa_dim))


t_feats = random_template_feats(n_templ, n_res, batch_size=bsize)
batch.update({k: torch.tensor(v) for k, v in t_feats.items()})

extra_feats = random_extra_msa_feats(n_extra_seq, n_res, batch_size=bsize)
batch.update({k: torch.tensor(v) for k, v in extra_feats.items()})

batch["msa_mask"] = torch.randint(low=0, high=2, size=(bsize, n_seq, n_res)).float()
batch["seq_mask"] = torch.randint(low=0, high=2, size=(bsize, n_res)).float()
batch.update(data_transforms.make_atom14_masks(batch))

batch["no_recycling_iters"] = torch.tensor(0.)

batch = tensor_tree_map(lambda t: t.unsqueeze(-1).cuda(), batch)

out = model(batch)

In this code I'm basically just running inference on the model with a batch size of 2, with templates enabled. For this demo I've created dummy inputs using the code from the /openfold/tests/ directory, although I've also had the same problem with a real data pipeline.

The code above crashes with the error: RuntimeError: The size of tensor a (128) must match the size of tensor b (2) at non-singleton dimension 3, which occurs on line 189 of /openfold/openfold/model/model.py:

t = t * (torch.sum(batch["template_mask"], dim=-1) > 0)

This line is basically just masking out activations from templates that don't exist according to batch["template_mask"]. However, there seems to be a dimension mismatch. If I print out the dimensions, t has shape [2, 256, 256, 128] and batch["template_mask"] has shape [2]. Based on the PyTorch broadcasting rules (https://pytorch.org/docs/stable/notes/broadcasting.html), those shapes aren't compatible to multiply. If I change the code to the following:

t = t * (torch.sum(batch["template_mask"], dim=-1) > 0).view([-1,1,1,1])

Then everything works fine. Is this a real bug in the code, or have I done something wrong to trigger this error? Thanks! For reference, my environment is the following:

Python 3.10.4
PyTorch 1.12.1
Numpy 1.23.1
Cuda 11.1
Latest OpenFold commit (6e930a6ca4accb14aa128ae40bd3f27906796589)
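The shape mismatch described in this report follows directly from the broadcasting rule: shapes are aligned from the trailing dimension, and two sizes are compatible only if they are equal or one of them is 1. The sketch below re-implements that rule in plain Python for the two shapes mentioned above. The function is an illustration written for this example and is not PyTorch's implementation.

```python
def broadcast_shape(a, b):
    """Return the broadcast of two shapes, or raise ValueError if incompatible.

    Shapes are aligned from the trailing dimension; two sizes are compatible
    when they are equal or when one of them is 1.
    """
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        x = a[-i] if i <= len(a) else 1  # missing leading dims count as 1
        y = b[-i] if i <= len(b) else 1
        if x != y and x != 1 and y != 1:
            raise ValueError(f"size {x} vs {y} at trailing dimension {i}")
        result.append(y if x == 1 else x)
    return tuple(reversed(result))


t_shape = (2, 256, 256, 128)

try:
    broadcast_shape(t_shape, (2,))  # mask of shape [2]
except ValueError as err:
    print("incompatible:", err)

print(broadcast_shape(t_shape, (2, 1, 1, 1)))  # mask after .view([-1, 1, 1, 1])
```

With shape [2], the mask's only dimension lines up against the channel dimension of size 128, which is why the error names dimension 3. Reshaping to [2, 1, 1, 1] lines it up against the batch dimension instead.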

opened by jproney 9

OpenFold on Ampere Nvidia GPUs

Hi,

I am trying to install OpenFold on a machine with two RTX A5000s, but running into issues with PyTorch not supporting cards with compute capability SM 86. I saw on a previous post that you had trained OF on A100s, which will have a similar compute capability. Is there a method for installing OpenFold on newer GPU architectures?

Many thanks!

opened by WillExeter 9
--trace_model performance

Hi, I have tested using the --trace_model mode on a small batch of sequences of the same length; I get an 80s tracing time followed by 20s inference for each sequence. If I just fold them without --trace_model it takes 18-19s for inference of each. Am I doing something wrong? There doesn't seem to be much documentation about this feature.
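Whether a one-time tracing cost is worthwhile can be framed as a break-even calculation: tracing pays off only if traced inference is faster per sequence, and then only after enough sequences to recover the tracing time. The sketch below applies that arithmetic to the timings reported above, using 18.5 s as the midpoint of the 18-19 s range. The function name and the second, faster scenario are hypothetical.

```python
import math


def break_even_count(trace_seconds, traced_seconds, untraced_seconds):
    """Number of sequences after which tracing pays for itself, or None."""
    saving = untraced_seconds - traced_seconds
    if saving <= 0:
        return None  # traced inference is not faster, so tracing never pays off
    return math.ceil(trace_seconds / saving)


print(break_even_count(80, 20, 18.5))  # timings reported in this thread
print(break_even_count(80, 10, 18.5))  # hypothetical: traced inference at 10 s
```

With the reported timings there is no per-sequence saving, so no batch size would recover the 80 s spent tracing.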

opened by mrhoag5 8

Run OpenFold on CPU

Hello,

I have issues when running openfold on a CPU.

When I execute the run_pretrained_openfold.py script with the --model_device cpu argument set, I get the following error:

Traceback (most recent call last):
  File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/run_pretrained_openfold.py", line 387, in <module>
    main(args)
  File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/run_pretrained_openfold.py", line 254, in main
    out = run_model(model, processed_feature_dict, tag, args.output_dir)
  File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/utils/script_utils.py", line 159, in run_model
    out = model(batch)
  File "/home/rjo21/anaconda3/envs/fold_serv2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/model.py", line 512, in forward
    outputs, m_1_prev, z_prev, x_prev = self.iteration(
  File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/model.py", line 366, in iteration
    z = self.extra_msa_stack(
  File "/home/rjo21/anaconda3/envs/fold_serv2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/evoformer.py", line 1007, in forward
    m, z = b(m, z)
  File "/home/rjo21/anaconda3/envs/fold_serv2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/evoformer.py", line 518, in forward
    self.msa_att_row(
  File "/home/rjo21/anaconda3/envs/fold_serv2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/msa.py", line 266, in forward
    m = self._chunk(
  File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/msa.py", line 121, in _chunk
    return chunk_layer(
  File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/utils/chunk_utils.py", line 299, in chunk_layer
    output_chunk = layer(**chunks)
  File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/msa.py", line 101, in fn
    return self.mha(
  File "/home/rjo21/anaconda3/envs/fold_serv2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/primitives.py", line 492, in forward
    o = attention_core(q, k, v, *((biases + [None] * 2)[:2]))
  File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/utils/kernel/attention_core.py", line 47, in forward
    attn_core_inplace_cuda.forward_(
RuntimeError: input must be a CUDA tensor

This tells me that I need to pass a CUDA tensor to some attention component, but I am running the code on a CPU, so there should be no CUDA involved.

This is the environment (env.txt) I'm using on a normal linux 64-bit OS.

Thanks in advance for any help. Roman

opened by Old-Shatterhand 0

Alignment error

I ran the following script and got an error during alignment.

python3 scripts/precompute_alignments.py data/pdb_mmcif/mmcif_files/ data/alignment/ \
    --uniref90_database_path data/uniref90/uniref90.fasta \
    --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path data/pdb70/pdb70 \
    --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --cpus 16 \
    --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
    --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
    --hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
    --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign

Here is the error log:

ERROR:root:- 09:17:56.463 ERROR: Error in /opt/conda/conda-bld/hhsuite_1645696999782/work/src/hhalignment.cpp:3539: MergeMasterSlave:
ERROR:root:- 09:17:56.463 ERROR:        did not find 145 match states in sequence 1 of tr|A0A1D1YLJ1|A0A1D1YLJ1_9ARAE. Sequence:
ERROR:root:GYKAPELTKMKDAGKESDIYSLGVIFLEMVTRKDTNSDFLPTWDLHLSNSLKNPVFDGKISEMISHGLLRQSREQNCITGEGLLMFLQLAIACRSPSPRLRPDIKQVLGKLEEIELWKLPNQFGGDRLPNRG
ERROR:root:HHblits stderr end
WARNING:root:Failed to run alignments for 7avy_A. Skipping...
Exception in thread Thread-1:
Traceback (most recent call last):
  File "scripts/precompute_alignments.py", line 40, in run_seq_group_alignments
    fasta_path, alignment_dir
  File "/scratch1/zx22/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/openfold-1.0.0-py3.7-linux-x86_64.egg/openfold/data/data_pipeline.py", line 485, in run
    self.hhblits_bfd_uniclust_runner.query(fasta_path)
  File "/scratch1/zx22/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/openfold-1.0.0-py3.7-linux-x86_64.egg/openfold/data/tools/hhblits.py", line 162, in query
    % (stdout.decode("utf-8"), stderr[:500_000].decode("utf-8"))
RuntimeError: HHblits failed
stdout:
stderr:

Could you help me with it? Thanks!

opened by Ottovonxu 0

Training scripts.

I have downloaded the dataset following the DeepMind style and the inference works fine. Currently, my data folder has: bfd mgnify pdb70 pdb_mmcif uniclust30 uniref90

May I ask how I should specify mmcif_dir/ here?

python3 scripts/precompute_alignments.py mmcif_dir/ alignment_dir/ \
    --uniref90_database_path data/uniref90/uniref90.fasta \
    --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path data/pdb70/pdb70 \
    --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --cpus 16 \
    --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
    --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
    --hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
    --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign

Thanks!

opened by Ottovonxu 0

openfold/np/protein.py:to_pdb(): chain_tag sometimes not set

I found what appears to be a rare case (once in millions of proteins) where the loop in to_pdb() sometimes fails to set chain_tag before closing the chain, causing an error:

Traceback (most recent call last):
  File "/pscratch/sd/f/flowers/esm/scripts/esmfold_inference.py", line 186, in <module>
    pdbs = model.output_to_pdb(output)
  File "/pscratch/sd/f/flowers/miniconda3/lib/python3.9/site-packages/esm/esmfold/v1/esmfold.py", line 303, in output_to_pdb
    return output_to_pdb(output)
  File "/pscratch/sd/f/flowers/miniconda3/lib/python3.9/site-packages/esm/esmfold/v1/misc.py", line 115, in output_to_pdb
    pdbs.append(to_pdb(pred))
  File "/pscratch/sd/f/flowers/miniconda3/lib/python3.9/site-packages/openfold/np/protein.py", line 373, in to_pdb
    f"{chain_tag:>1}{residue_index[i]:>4}"
UnboundLocalError: local variable 'chain_tag' referenced before assignment

It's possible esmfold was passing bad parameters, but adding a check to set chain_tag to "A" if not set allowed the code to run without errors.
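The kind of guard described above can be sketched as follows. This is a simplified, hypothetical stand-in for the loop in to_pdb(), not the actual OpenFold code: the function name, the residue dictionaries, and the default_chain_tag parameter are assumptions made purely for illustration. It only shows how a variable assigned inside a conditionally skipped loop body can still be unbound when the chain is closed, and how binding a default before the loop avoids that.

```python
def chain_termination_line(residues, default_chain_tag="A"):
    """Build a TER-style line for the last residue (illustrative only).

    `residues` is a hypothetical list of dicts with keys "name", "index",
    "atoms" (each atom has a "mask"), and optionally "chain".
    """
    # Bind chain_tag before the loop so it exists even if every atom is
    # masked out and the inner loop body never assigns it.
    chain_tag = default_chain_tag
    atom_index = 1

    for res in residues:
        for atom in res["atoms"]:
            if atom["mask"] < 0.5:
                continue  # masked atoms are skipped, so nothing is assigned
            chain_tag = res.get("chain", default_chain_tag)
            atom_index += 1

    last = residues[-1]
    chain_end = "TER"
    return (
        f"{chain_end:<6}{atom_index:>5}      "
        f"{last['name']:>3} "
        f"{chain_tag:>1}{last['index']:>4}"
    )


if __name__ == "__main__":
    # A single residue whose atoms are all masked: without the default
    # binding above, chain_tag would be unbound at this point.
    only_masked = [{"name": "UNK", "index": 1, "atoms": [{"mask": 0.0}]}]
    print(chain_termination_line(only_masked))
```

Whether a silent default of "A" is the right behavior, or whether an explicit error would be more useful, depends on why no atoms were written in the first place; the sketch does not address that.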

The protein in question was

MAPVKVFGPAKSRNVARVLVCLEEVGAEYEVVDMDLKALEHKSPEHLARNPFGQTPAFQDGDLLLFESRAISRYVLRKYKTNQVDLLREGNLKEAAMVDVWTEVDAHTYNPAISPVVYECLINPLVLGIPTNQKVVDESLEKLKKALEVYEAHLSKDKYLAGDFMSFADINHFPHTCSFMAAPHAVLFDSYPYVKAWWERLMARPSIKKLSASLAPPKA*

And the tail of the output pdb (when run with the modified code) was:

ATOM 1736 CB ALA A 219 -14.556 -18.156 -6.584 1.00 83.46 C
ATOM 1737 O ALA A 219 -16.753 -18.815 -4.504 1.00 84.66 O
TER 1738 UNK A 220
PARENT N/A
TER 1739 ALA A 1
END

opened by flowers9 0

Should train_chain_data_cache_path be a required argument?

Although the current argparse parser allows the user to not pass a value for train_chain_data_cache_path, the current implementation of data_modules.OpenFoldDataset (specifically the inner function, looped_samples) assumes that the cache object is not None. If the user does not supply a cache path, then the training script simply fails with a StopIteration, as it tries to get a cache entry from a None object on line 371:

https://github.com/aqlaboratory/openfold/blob/59277de16825cfdafe37033012d0530595b9ad6d/openfold/data/data_modules.py#L360-L374

It seems OpenFold's datasets were also built to support parsing structure files on the fly, so which of the two options would be preferred going forward: 1) making train_chain_data_cache_path required, so that the user does not hit an unexpected failure when the data is loaded, or 2) adding support in OpenFoldDataset/looped_samples for the case where the cache is None?
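For reference, option 1 could look roughly like the fail-fast check below. This is a minimal sketch and not the actual training script: the build_parser and parse_args helpers are hypothetical, and the `--train_chain_data_cache_path` flag spelling is assumed from the argument name above. The idea is simply to reject a missing cache path at argument-parsing time instead of letting it surface later as a StopIteration inside the data loader.

```python
import argparse


def build_parser():
    # Hypothetical, stripped-down parser; the real script has many more options.
    parser = argparse.ArgumentParser(description="Illustrative training entry point")
    parser.add_argument(
        "--train_chain_data_cache_path",
        type=str,
        default=None,
        help="Path to the training chain data cache (assumed flag name)",
    )
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Fail early with a clear message rather than failing deep in the dataset.
    if args.train_chain_data_cache_path is None:
        parser.error(
            "--train_chain_data_cache_path is required: the dataset sampler "
            "expects a chain data cache"
        )
    return args


if __name__ == "__main__":
    args = parse_args(["--train_chain_data_cache_path", "chain_data_cache.json"])
    print(args.train_chain_data_cache_path)
```

Passing `required=True` to `add_argument` would achieve the same effect more tersely; the explicit check is shown only because it leaves room for a more descriptive error message.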

Happy to help implement something either way!

opened by jonathanking 0

Releases(v1.0.1)

v1.0.1(Nov 23, 2022)
OpenFold as of the release of our manuscript. Many new features, including FP16 training + more stable training.

What's Changed

use multiple models for inference by @decarboxy in https://github.com/aqlaboratory/openfold/pull/117

Update input processing by @brianloyal in https://github.com/aqlaboratory/openfold/pull/116

adding a caption to the image in the readme by @decarboxy in https://github.com/aqlaboratory/openfold/pull/133

Properly handling file outputs when multiple models are evaluated by @decarboxy in https://github.com/aqlaboratory/openfold/pull/142

Fix for issue in download_mgnify.sh by @josemduarte in https://github.com/aqlaboratory/openfold/pull/166

Fix tag-sequence mismatch when predicting for multiple fastas by @sdvillal in https://github.com/aqlaboratory/openfold/pull/164

Support openmm >= 7.6 by @sdvillal in https://github.com/aqlaboratory/openfold/pull/163

Fixing issue in download_uniref90.sh by @josemduarte in https://github.com/aqlaboratory/openfold/pull/171

Fix propagation of use_flash for offloaded inference by @epenning in https://github.com/aqlaboratory/openfold/pull/178

Update deepspeed version to 0.5.10 by @NZ99 in https://github.com/aqlaboratory/openfold/pull/185

Fixes errors when processing .pdb files by @NZ99 in https://github.com/aqlaboratory/openfold/pull/188

fix incorrect learning rate warm-up after restarting from ckpt by @Zhang690683220 in https://github.com/aqlaboratory/openfold/pull/182

Add opencontainers image-spec to Dockerfile by @SauravMaheshkar in https://github.com/aqlaboratory/openfold/pull/128

Write inference and relaxation timings to a file by @brianloyal in https://github.com/aqlaboratory/openfold/pull/201

Minor fixes in setup scripts by @timodonnell in https://github.com/aqlaboratory/openfold/pull/202

Minor optimizations & fixes to support ESMFold by @nikitos9000 in https://github.com/aqlaboratory/openfold/pull/199

Drop chains that are missing (structure) data in training by @timodonnell in https://github.com/aqlaboratory/openfold/pull/210

adding a script for threading a sequence onto a structure by @decarboxy in https://github.com/aqlaboratory/openfold/pull/206

Set pin_memory to True in default dataloader config. by @NZ99 in https://github.com/aqlaboratory/openfold/pull/212

Fix missing subtract_plddt argument in prep_output call by @mhrmsn in https://github.com/aqlaboratory/openfold/pull/217

fp16 fixes by @beiwang2003 in https://github.com/aqlaboratory/openfold/pull/222

Set clamped vs unclamped FAPE for each sample in batch independently by @ar-nowaczynski in https://github.com/aqlaboratory/openfold/pull/223

Fix probabilities type (int -> float) by @atgctg in https://github.com/aqlaboratory/openfold/pull/225

Small fix for prep_mmseqs_dbs. by @jonathanking in https://github.com/aqlaboratory/openfold/pull/232

New Contributors

@brianloyal made their first contribution in https://github.com/aqlaboratory/openfold/pull/116

@josemduarte made their first contribution in https://github.com/aqlaboratory/openfold/pull/166

@sdvillal made their first contribution in https://github.com/aqlaboratory/openfold/pull/164

@epenning made their first contribution in https://github.com/aqlaboratory/openfold/pull/178

@NZ99 made their first contribution in https://github.com/aqlaboratory/openfold/pull/185

@Zhang690683220 made their first contribution in https://github.com/aqlaboratory/openfold/pull/182

@SauravMaheshkar made their first contribution in https://github.com/aqlaboratory/openfold/pull/128

@timodonnell made their first contribution in https://github.com/aqlaboratory/openfold/pull/202

@nikitos9000 made their first contribution in https://github.com/aqlaboratory/openfold/pull/199

@mhrmsn made their first contribution in https://github.com/aqlaboratory/openfold/pull/217

@beiwang2003 made their first contribution in https://github.com/aqlaboratory/openfold/pull/222

@ar-nowaczynski made their first contribution in https://github.com/aqlaboratory/openfold/pull/223

@atgctg made their first contribution in https://github.com/aqlaboratory/openfold/pull/225

@jonathanking made their first contribution in https://github.com/aqlaboratory/openfold/pull/232

Full Changelog: https://github.com/aqlaboratory/openfold/compare/v1.0.0...v1.0.1
v1.0.0(Jun 22, 2022)
OpenFold at the time of the release of our original model parameters and training database. Adds countless improvements over the previous beta release, including, but not limited to:

Many bugfixes contribute to stabler, more correct, and more versatile training

Options to run OpenFold using our original weights

Custom attention kernels and alternative attention implementations that greatly reduce peak memory usage

A vastly superior Colab notebook that runs inference many times faster than the original

Efficient scripts for computation of alignments, including the option to run MMseqs2's alignment pipeline

Vastly improved logging during training & inference

Careful optimizations for significantly improved speeds & memory usage during both inference and training

Opportunistic optimizations that dynamically speed up inference on short (< ~1500 residues) chains

Certain changes borrowed from updates made to the AlphaFold repo, including bugfixes, GPU relaxation, etc.

"AlphaFold-Gap" support allows inference on complexes using OpenFold and AlphaFold weights

WIP OpenFold-Multimer implementation on the multimer branch

Improved testing for the data pipeline

Partial CPU offloading extends the upper limit on inference sequence lengths

Docker support

Missing features from the original release, including learning rate schedulers, distillation set support, etc.

Full Changelog: https://github.com/aqlaboratory/openfold/compare/v0.1.0...v1.0.0
v0.1.0(Nov 18, 2021)

The initial release of OpenFold.

Full Changelog: https://github.com/aqlaboratory/openfold/commits/v0.1.0


Owner

AQ Laboratory
