OntoProtein: Protein Pretraining With Ontology Embedding

This is the implementation of the paper "OntoProtein: Protein Pretraining With Ontology Embedding". OntoProtein is an effective method that incorporates the structure of GO (Gene Ontology) into a text-enhanced protein pre-training model.

Overview

In this work we present OntoProtein, a knowledge-enhanced protein language model that jointly optimizes the KE (knowledge embedding) and MLM (masked language modeling) objectives, which brings substantial improvements to a wide range of protein tasks. We also introduce ProteinKG25, a new large-scale KG dataset, to promote research on protein language pre-training.
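
As a rough sketch in our own notation (not the paper's exact formulation), the joint pre-training objective simply combines the two losses, with the weights corresponding to the mlm_lambda and ke_lambda arguments used in pre-training:

$$ \mathcal{L} = \lambda_{\mathrm{MLM}} \, \mathcal{L}_{\mathrm{MLM}} + \lambda_{\mathrm{KE}} \, \mathcal{L}_{\mathrm{KE}} $$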

Requirements

To run our code, please install dependency packages for related steps.

Environment for pre-training data generation

python3.8 / biopython 1.37 / goatools

Environment for OntoProtein pre-training

python3.8 / pytorch 1.9 / transformers 4.5.1+ / deepspeed 0.5.1 / lmdb

Environment for protein-related tasks

python3.8 / pytorch 1.9 / transformers 4.5.1+ / lmdb

Note: for the environment configurations of some baseline models or methods in our experiments, e.g. BLAST and DeepGraphGO, we provide the related links for configuration below:

BLAST / Interproscan / DeepGraphGO / GNN-PPI

Data preparation

For pre-training OntoProtein, fine-tuning on protein-related tasks, and inference, we describe below how to acquire the related data.

Pre-training data

To incorporate Gene Ontology knowledge into language models and train OntoProtein, we construct ProteinKG25, a large-scale KG dataset with descriptions and protein sequences aligned to GO terms and protein entities respectively. There are two ways to acquire the pre-training data: 1) download our prepared ProteinKG25, or 2) generate your own pre-training data.

Download released data

We have released our prepared data ProteinKG25 in Google Drive.

The whole compressed package includes the following files:

  • go_def.txt: GO term definitions, in plain text. We concatenate each GO term name and its corresponding definition with a colon.
  • go_type.txt: The ontology type to which each specific GO term belongs. The index corresponds to the GO ID in the go2id.txt file.
  • go2id.txt: The ID mapping of GO terms.
  • go_go_triplet.txt: GO-GO triplet data. These triplets constitute the interior structure of Gene Ontology. The data format is <h r t>, where h and t are the head and tail entities respectively, both GO term nodes, and r is the relation between the two GO terms, e.g. is_a and part_of.
  • protein_seq.txt: Protein sequence data. The whole protein sequence data are used as inputs to the MLM module and as protein representations in the KE module.
  • protein2id.txt: The ID mapping of proteins.
  • protein_go_train_triplet.txt: Protein-GO triplet data. These triplets constitute the exterior structure of Gene Ontology, i.e. gene annotations. The data format is <h r t>, where h and t are the head and tail entities respectively. Unlike a GO-GO triplet, a Protein-GO triplet represents a specific gene annotation, where the head entity is a specific protein and the tail entity is the corresponding GO term, e.g. a protein binding function; r is the relation between the protein and the GO term.
  • relation2id.txt: The ID mapping of relations. We mix the relations from both kinds of triplets.
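
The following is a minimal loading sketch in Python. It only assumes the whitespace-separated <h r t> layout and the colon-separated go_def.txt format described above, and a hypothetical local ProteinKG25/ directory holding the unpacked files:

import os

# Hypothetical path to the unpacked ProteinKG25 files; adjust to your layout.
DATA_DIR = "ProteinKG25"

def load_triplets(file_name):
    # Each row is "<h r t>", i.e. head, relation, tail separated by whitespace.
    with open(os.path.join(DATA_DIR, file_name)) as f:
        return [tuple(line.split()) for line in f if line.strip()]

def load_go_definitions(file_name="go_def.txt"):
    # Each line concatenates a GO term name and its definition with a colon,
    # so we split on the first colon only.
    definitions = {}
    with open(os.path.join(DATA_DIR, file_name)) as f:
        for line in f:
            name, _, definition = line.partition(":")
            definitions[name.strip()] = definition.strip()
    return definitions

go_go_triplets = load_triplets("go_go_triplet.txt")
protein_go_triplets = load_triplets("protein_go_train_triplet.txt")
go_definitions = load_go_definitions()
print(len(go_go_triplets), "GO-GO triplets,",
      len(protein_go_triplets), "Protein-GO triplets,",
      len(go_definitions), "GO definitions")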

Generate your own pre-training data

To generate your own pre-training data, you need to download the following raw data:

  • go.obo: the structure data of Gene Ontology. See Gene Ontology for the download link and detailed format.
  • uniprot_sprot.dat: protein Swiss-Prot database. [link]
  • goa_uniprot_all.gpa: Gene Annotation data. [link]

After downloading these raw data, you can execute the following script to generate the pre-training data:

python tools/gen_onto_protein_data.py
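
If you want to sanity-check the downloaded ontology before running the full generation script, a minimal sketch with goatools (listed in the pre-training data requirements) could look like the following; note that the exact attributes exposed on each term record can vary across goatools versions:

from goatools.obo_parser import GODag

# Parse the downloaded ontology file.
go_dag = GODag("go.obo")

# GODag behaves like a dict mapping GO IDs to term records; each record carries
# the term name, its namespace (ontology type), and its parents, which is the
# kind of structure the GO-GO triplets are built from.
for go_id, term in list(go_dag.items())[:5]:
    parent_names = [parent.name for parent in term.parents]
    print(go_id, "|", term.name, "|", term.namespace, "| is_a parents:", parent_names)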

Downstream task data

Our experiments involve several protein-related downstream tasks. [Download datasets]

Protein pre-training model

You can pre-train your own OntoProtein on the pre-training dataset described above. We provide the script bash script/run_pretrain.sh to run pre-training. The detailed arguments are all listed in src/training_args.py, so you can set the pre-training hyperparameters to your needs.

Usage for protein-related tasks

Running examples

The shell files for training and evaluating every task are provided in script/ and can be run directly.

Also, you can use the running code run_downstream.py and write your own shell files according to your needs:

  • run_downstream.py: supports the {ss3, ss8, contact, remote_homology, fluorescence, stability} tasks;

Training models

Run the shell files with bash script/run_{task}.sh; the contents of the shell files are as follows:

sh run_main.sh \
    --model ./model/ss3/ProtBertModel \
    --output_file ss3-ProtBert \
    --task_name ss3 \
    --do_train True \
    --epoch 5 \
    --optimizer AdamW \
    --per_device_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --eval_step 100 \
    --eval_batchsize 4 \
    --warmup_ratio 0.08 \
    --frozen_bert False

You can set more detailed parameters in run_main.sh. The details of run_main.sh are as follows:

LR=3e-5
SEED=3
DATA_DIR=data/datasets
OUTPUT_DIR=data/output_data/$TASK_NAME-$SEED-$OI

python run_downstream.py \
  --task_name $TASK_NAME \
  --data_dir $DATA_DIR \
  --do_train $DO_TRAIN \
  --do_predict True \
  --model_name_or_path $MODEL \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $EB \
  --gradient_accumulation_steps $GS \
  --learning_rate $LR \
  --num_train_epochs $EPOCHS \
  --warmup_ratio $WR \
  --logging_steps $ES \
  --eval_steps $ES \
  --output_dir $OUTPUT_DIR \
  --seed $SEED \
  --optimizer $OPTIMIZER \
  --frozen_bert $FROZEN_BERT \
  --mean_output $MEAN_OUTPUT

Notice: the best checkpoint is saved in OUTPUT_DIR/.

Comments
  • [Confirmation] Optimal Hyperparameters and Reproducibility

    Hi there,

    Thanks for providing the nice codebase. I'm trying to reproduce the results for downstream tasks, and I have the following questions.

    • I'm wondering if the scripts under this folder are only samples? For the optimal hyperparameters for OntoProtein, we should follow Table 6 in the paper?
    • For ProtBert, are you using the same optimal hyperparameters for each downstream task?
    • Table 6 doesn't cover the optimal values for gradient_accumulation_steps and eval_step. Can you help clarify this?

    Any help is appreciated.

    question 
    opened by chao1224 19
  • run_contact.sh error

    Hi, I set up a fresh environment for running the script, and when I run run_contact.sh I get the following error in "contact-ontoprotein.out":

    ***** Running Prediction *****
      Num examples = 40
      Batch size = 1
    Traceback (most recent call last):
      File "run_downstream.py", line 286, in <module>
        main()
      File "run_downstream.py", line 281, in main
        predictions_family, input_ids_family, metrics_family = trainer.predict(test_dataset)
      File "/home/sakher/miniconda3/envs/onto2/lib/python3.8/site-packages/transformers/trainer.py", line 2358, in predict
        output = eval_loop(
      File "/data3/sakher/onto2/OntoProtein/src/benchmark/trainer.py", line 217, in evaluation_loop
        loss, logits, labels, prediction_score = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
      File "/data3/sakher/onto2/OntoProtein/src/benchmark/trainer.py", line 50, in prediction_step
        prediction_score['precision_at_l2'] = logits[3]['precision_at_l2']
    KeyError: 'precision_at_l2'

    question 
    opened by sa5r 7
  • goatools package issue

    Hello, when running the create_go_data part of gen_onto_protein_data.py, did you run into problems like the following? With goatools version 1.2.3, go_term.definition raises an error: there is no .definition attribute. With goatools version 1.0.11, it reports RecursionError: maximum recursion depth exceeded while calling a Python object.

    opened by lonetravelwolf 7
  • Issue in creating an environment for  OntoProtein pretraining

    Hello Researchers, I am finding bugs when installing deepspeed version 0.5.1. I have already installed python 3.8.13, pytorch=1.12.0 with torchvision=0.13.0, torchaudio=0.12.0, cudatoolkit=11.3.1, transformers=4.9.2, and lmdb=1.3.0. But when I install deepspeed=0.5.1, its dependencies are not installed correctly. Can you please tell me the exact versions you used for pytorch, python, and deepspeed? Below is the error I found:

    Traceback (most recent call last):
      File "", line 1, in <module>
      File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/deepspeed/__init__.py", line 15, in <module>
        from .runtime.engine import DeepSpeedEngine, DeepSpeedOptimizerCallable, DeepSpeedSchedulerCallable
      File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 20, in <module>
        from tensorboardX import SummaryWriter
      File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/__init__.py", line 5, in <module>
        from .torchvis import TorchVis
      File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/torchvis.py", line 11, in <module>
        from .writer import SummaryWriter
      File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/writer.py", line 15, in <module>
        from .event_file_writer import EventFileWriter
      File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 28, in <module>
        from .proto import event_pb2
      File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/event_pb2.py", line 15, in <module>
        from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
      File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/summary_pb2.py", line 15, in <module>
        from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
      File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/tensor_pb2.py", line 15, in <module>
        from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
      File "//mnt/user1/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 35, in <module>
        _descriptor.FieldDescriptor(
      File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 560, in __new__
        _message.Message._CheckCalledFromGeneratedFile()
    TypeError: Descriptors cannot not be created directly. If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0. If you cannot immediately regenerate your protos, some other possible workarounds are:

    1. Downgrade the protobuf package to 3.20.x or lower.
    2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower). Can you tell the exact versions which you have used?
    help wanted 
    opened by amalislam675 5
  • Question about computing resource and batch size

    Hi,

    Thanks for sharing the code. I noticed in your run_pretrain.sh, the batch size of protein-GO and protein MLM is 8, while the batch size of GO-GO is 64. Meanwhile, the num of negative samples for each positive sample is 128, or 256 for GO-GO.

    (1) Does this mean in each GO-GO pass, at most (64*2+64*256) samples of length at most 128 are fed into the GO encoder (in one batch)?

    (2) How many V100s did you use for this pretraining?

    Also, I noticed that you didn't permutate proteins for protein-GO relations.

    (3) Is this due to computing resource limit (i.e. 8*128 is just too large a number for proteins)?

    (4) Did you experiment with a lower number of negative samples while considering such protein permutation?

    Thanks in advance!

    opened by jasperhyp 4
  • ontoProtein pretrained model

    Hello, I want to use OntoProtein to compute protein embeddings. I downloaded the model from https://huggingface.co/zjukg/OntoProtein/tree/main and saved it locally, but the embeddings computed for different proteins are identical. Is this normal?

    • Downloaded the files locally, including the four files config.json, pytorch_model.bin, tokenizer_config.json, and vocab.txt
    • Script for computing the embeddings
    import logging
    import os
    from dataclasses import dataclass, field
    from typing import Dict, List, Optional, Tuple
    from torch.utils.data.dataloader import DataLoader
    import yaml
    import os
    import numpy as np
    import torch
    from tqdm import tqdm
    from transformers import (
        AutoConfig,
        AutoTokenizer,
        AutoModel,
    )
    import argparse
    import torch
    import pandas as pd
    from tqdm import tqdm
    tqdm.pandas()
    logger = logging.getLogger(__name__)
    device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
    model_name_or_path = '/data/wenyuhao/55/model/ontology'
    config = AutoConfig.from_pretrained(model_name_or_path,)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,use_fast=False,)
    model = AutoModel.from_pretrained(model_name_or_path,config=config,).to(device)
    def getArray(seq):
        input_ids = torch.tensor(tokenizer.encode(seq)).unsqueeze(0).to(device)  # Batch size 1
        with  torch.no_grad():
            outputs = model(input_ids)
        return outputs[1].cpu().numpy()
    
    • Results
    In [14]: a = getArray('VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS')
    In [15]: b = getArray('YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF')
    In [16]: a
    Out[16]: 
    array([[-0.11852779,  0.1262154 , -0.11203501, ...,  0.11941278,
             0.11056887, -0.12232994]], dtype=float32)
    In [17]: b
    Out[17]: 
    array([[-0.11852779,  0.1262154 , -0.11203501, ...,  0.11941278,
             0.11056887, -0.12232994]], dtype=float32)
    In [18]: Counter(a[0]==b[0])
    Out[18]: Counter({True: 1024})
    

    I computed embeddings for all Swiss-Prot proteins and found that they are all identical.

    In [29]: s
    Out[29]: 
    array([[-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
             0.11056883, -0.12232988],
           [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
             0.11056883, -0.12232988],
           [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
             0.11056883, -0.12232988],
           ...,
           [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
             0.11056883, -0.12232988],
           [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
             0.11056883, -0.12232988],
           [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
             0.11056883, -0.12232988]], dtype=float32)
    
    In [30]: s.shape
    Out[30]: (20083, 1024)
    
    In [31]: (s==s).all()
    Out[31]: True
    
    opened by wenyuhaokikika 4
  • run_pretrain.sh error

    I set up the deepspeed environment and then ran run_pretrain.sh, but the following error occurred:

    File "run_pretrain.py", line 135, in main() File "run_pretrain.py", line 131, in main trainer.train() File "OntoProtein/src/trainer.py", line 167, in train deepspeed_engine, optimizer, lr_scheduler = deepspeed_init( File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 405, in deepspeed_init hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps) File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 267, in trainer_config_finalize hidden_size = model.config.hidden_size File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in getattr raise AttributeError("'{}' object has no attribute '{}'".format( AttributeError: 'OntoProteinPreTrainedModel' object has no attribute 'config'

    Then I pointed the config attribute to protein_model_config and ran the part that is commented out in training_args.py, and the following error occurred:

    File "run_pretrain.py", line 135, in main() File "run_pretrain.py", line 131, in main trainer.train() File "OntoProtein/src/trainer.py", line 167, in train deepspeed_engine, optimizer, lr_scheduler = deepspeed_init( File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 437, in deepspeed_init deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs) File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/init.py", line 120, in initialize engine = DeepSpeedEngine(args=args, File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 239, in init self._configure_with_arguments(args, mpu) File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 872, in _configure_with_arguments self._config = DeepSpeedConfig(self.config, mpu) File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 875, in init self._configure_train_batch_size() File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 1051, in _configure_train_batch_size self._batch_assertion() File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 987, in _batch_assertion train_batch > 0 TypeError: '>' not supported between instances of 'str' and 'int'

    What could be causing this?

    question 
    opened by Seyfried97 4
  • Computational Resources and Time

    Can you provide a recommendation for the computational resources needed to run one of the downstream tasks, like run_contact, including fine-tuning the model (i.e. do_train = True), such as the suggested number of cores and memory and how long it is expected to take? Also, what resources were used in your experiments and how long did they take?

    I am trying to run the protein contact prediction task on 16 cores and 120GB of memory with an estimation of a week required to get the results, however, I keep getting the process killed because of the insufficient memory space.

    question 
    opened by sa5r 4
  • Rationale for choosing this loss function

    Regarding your KE loss function, could you kindly provide some intuitions on why this specific loss function was chosen (given there are so many metric learning losses on KG)? A few relevant pieces of literature that you referenced would be appreciated.

    opened by jasperhyp 2
  • ImportError in deepspeed.py

    Hi,

    Thanks for open-sourcing this really cool model.

    I'm trying to play around with pretraining it myself, but I run into this ImportError when I run the run_pretrain.sh script. I would greatly appreciate any guidance, thanks!

    (onto_env) tomcobley@compute-g-17-147:~/OntoProtein $ bash script/run_pretrain.sh 
    
    [2022-11-26 21:06:55,053] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with loc
    al resources only.
    [2022-11-26 21:06:56,224] [INFO] [runner.py:508:main] cmd = /home/tomcobley/.conda/envs/onto_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 run_pretrain.py --do_train --output_dir data/output_data/filtered_ke_text --pretrain_data_dir path/todata/ProteinKG25 --protein_seq_data_file_name swiss_seq --in_memory true --max_protein_seq_length 1024 --model_protein_seq_data true --model_protein_go_data true --model_go_go_data true --use_desc true --max_text_seq_length 128 --dataloader_protein_go_num_workers 1 --dataloader_go_go_num_workers 1 --dataloader_protein_seq_num_workers 1 --num_protein_go_neg_sample 128 --num_go_go_neg_sample 128 --negative_sampling_fn simple_random --protein_go_sample_head false --protein_go_sample_tail true --go_go_sample_head true --go_go_sample_tail true --protein_model_file_name data/model_data/ProtBERT --text_model_file_name data/model_data/PubMedBERT --go_encoder_cls bert --protein_encoder_cls bert --ke_embedding_size 512 --double_entity_embedding_size false --max_steps 60000 --per_device_train_batch_size 4 --weight_decay 0.01 --optimize_memory true --gradient_accumulation_steps 256 --lr_scheduler_type linear --mlm_lambda 1.0 --lm_learning_rate 1e-5 --lm_warmup_steps 50000 --ke_warmup_steps 50000 --ke_lambda 1.0 --ke_learning_rate 2e-5 --ke_max_score 12.0 --ke_score_fn transE --ke_warmup_ratio --seed 2021 --deepspeed dp_config.json --fp16 --dataloader_pin_memory
    
    
    Traceback (most recent call last):
      File "/home/tomcobley/.conda/envs/onto_env/lib/python3.8/runpy.py", line 185, in _run_module_as_main
        mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
      File "/home/tomcobley/.conda/envs/onto_env/lib/python3.8/runpy.py", line 111, in _get_module_details
        __import__(pkg_name)
      File "/home/tomcobley/OntoProtein/deepspeed.py", line 25, in <module>
        from .dependency_versions_check import dep_version_check
    ImportError: attempted relative import with no known parent package
    
    question 
    opened by tomcobley 2
  • Trainer problem for pre-training

    Hello researchers, thanks for your research. I have some problems with the pre-training phase.

    1. Does the pre-training code only support step-based training? When I replace the parameter "--max_steps" with "--num_train_epochs," I get an exception, so I am not sure whether trainer.py supports training by epochs.
    2. if Q1 is not, I got the following results for steps, therefore, is the loss for each step meaningful? and Another question, Why does the loss of "mlm" keep oscillating after certain steps? Could you give me any advice about this situation? {'mlm': 1.3232421875, 'protein_go_ke': 0.66796875, 'go_go_ke': 1.9619140625, 'global_step': 180, 'learning_rate': [9.965326633165831e-06, 9.965326633165831e-06, 1.9930653266331662e-05]} {'mlm': 0.77783203125, 'protein_go_ke': 0.66650390625, 'go_go_ke': 1.857421875, 'global_step': 181, 'learning_rate': [9.964824120603016e-06, 9.964824120603016e-06, 1.9929648241206033e-05]} {'mlm': 0.7373046875, 'protein_go_ke': 0.64111328125, 'go_go_ke': 1.984375, 'global_step': 182, 'learning_rate': [9.964321608040202e-06, 9.964321608040202e-06, 1.9928643216080404e-05]} {'mlm': 0.447509765625, 'protein_go_ke': 2.140625, 'go_go_ke': 2.029296875, 'global_step': 183, 'learning_rate': [9.963819095477387e-06, 9.963819095477387e-06, 1.9927638190954775e-05]} {'mlm': 1.3056640625, 'protein_go_ke': 0.64990234375, 'go_go_ke': 1.91015625, 'global_step': 184, 'learning_rate': [9.963316582914575e-06, 9.963316582914575e-06, 1.992663316582915e-05]} {'mlm': 2.1015625, 'protein_go_ke': 0.6806640625, 'go_go_ke': 1.8505859375, 'global_step': 185, 'learning_rate': [9.96281407035176e-06, 9.96281407035176e-06, 1.992562814070352e-05]} {'mlm': 1.146484375, 'protein_go_ke': 0.6494140625, 'go_go_ke': 1.9150390625, 'global_step': 186, 'learning_rate': [9.962311557788946e-06, 9.962311557788946e-06, 1.992462311557789e-05]} {'mlm': 1.3505859375, 'protein_go_ke': 0.666015625, 'go_go_ke': 1.8994140625, 'global_step': 187, 'learning_rate': [9.961809045226131e-06, 9.961809045226131e-06, 1.9923618090452263e-05]} {'mlm': 1.359375, 'protein_go_ke': 2.775390625, 'go_go_ke': 1.8330078125, 'global_step': 188, 'learning_rate': [9.961306532663317e-06, 9.961306532663317e-06, 1.9922613065326634e-05]} {'mlm': 1.0927734375, 'protein_go_ke': 0.65087890625, 'go_go_ke': 1.8271484375, 'global_step': 189, 'learning_rate': [9.960804020100502e-06, 9.960804020100502e-06, 1.9921608040201005e-05]} [2022-10-13 09:32:21,562] [INFO] [logging.py:68:log_dist] [Rank 0] step=190, skipped=11, lr=[9.96030150753769e-06, 9.96030150753769e-06, 1.992060301507538e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] [2022-10-13 09:32:21,992] [INFO] [timer.py:157:stop] 0/190, SamplesPerSec=4.138257351966589 {'mlm': 1.7353515625, 'protein_go_ke': 0.6669921875, 'go_go_ke': 1.7998046875, 'global_step': 190, 'learning_rate': [9.96030150753769e-06, 9.96030150753769e-06, 1.992060301507538e-05]} {'mlm': 1.2763671875, 'protein_go_ke': 0.6787109375, 'go_go_ke': 1.8154296875, 'global_step': 191, 'learning_rate': [9.959798994974875e-06, 9.959798994974875e-06, 1.991959798994975e-05]} {'mlm': 0.80712890625, 'protein_go_ke': 0.6708984375, 'go_go_ke': 1.876953125, 'global_step': 192, 'learning_rate': [9.95929648241206e-06, 9.95929648241206e-06, 1.991859296482412e-05]} {'mlm': 0.59716796875, 'protein_go_ke': 0.6787109375, 'go_go_ke': 1.7919921875, 'global_step': 193, 'learning_rate': [9.958793969849248e-06, 9.958793969849248e-06, 1.9917587939698496e-05]} {'mlm': 0.7734375, 'protein_go_ke': 0.6611328125, 'go_go_ke': 1.90625, 'global_step': 194, 'learning_rate': [9.958291457286433e-06, 9.958291457286433e-06, 1.9916582914572867e-05]} {'mlm': 0.77587890625, 'protein_go_ke': 0.6865234375, 'go_go_ke': 1.76171875, 'global_step': 195, 'learning_rate': [9.957788944723619e-06, 9.957788944723619e-06, 1.9915577889447238e-05]} {'mlm': 0.89404296875, 
'protein_go_ke': 0.6533203125, 'go_go_ke': 1.91015625, 'global_step': 196, 'learning_rate': [9.957286432160806e-06, 9.957286432160806e-06, 1.9914572864321612e-05]} {'mlm': 1.1416015625, 'protein_go_ke': 0.654296875, 'go_go_ke': 1.78125, 'global_step': 197, 'learning_rate': [9.956783919597992e-06, 9.956783919597992e-06, 1.9913567839195983e-05]} {'mlm': 1.0224609375, 'protein_go_ke': 0.66162109375, 'go_go_ke': 1.7841796875, 'global_step': 198, 'learning_rate': [9.956281407035177e-06, 9.956281407035177e-06, 1.9912562814070354e-05]} {'mlm': 0.56005859375, 'protein_go_ke': 0.65966796875, 'go_go_ke': 1.806640625, 'global_step': 199, 'learning_rate': [9.955778894472363e-06, 9.955778894472363e-06, 1.9911557788944725e-05]} [2022-10-13 09:33:02,238] [INFO] [logging.py:68:log_dist] [Rank 0] step=200, skipped=11, lr=[9.955276381909548e-06, 9.955276381909548e-06, 1.9910552763819096e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] [2022-10-13 09:33:02,671] [INFO] [timer.py:157:stop] 0/200, SamplesPerSec=4.141814535981229

    Best regards, Xinghao

    opened by nihaowxh 2