Protein Language Model

ProteinLM

We pretrain a protein language model based on the Megatron-LM framework and then evaluate it on TAPE (Tasks Assessing Protein Embeddings), a set of five biologically relevant semi-supervised learning tasks. Our pretrained model achieves good performance on these tasks.

Overview

Pre-trained models such as BERT have greatly advanced natural language processing, improving the performance of language models. Inspired by the similarity between amino acid sequences and text sequences, we apply language-model pre-training to biological data.

Guidance

We provide pretraining and finetuning code in two separate folders. If you use the pretrained model we provide, simply download the checkpoint and follow the finetune guide. If you want to pretrain a model yourself, refer to the pretrain guide.

Download ProteinLM

ProteinLM (200M)

For the pretrained model with 200 million parameters, you can download the model checkpoint via GoogleDrive or TsinghuaCloud.

ProteinLM (3B)

For the pretrained model with 3 billion parameters, you can download the model checkpoint from here.

Project Structure

.
├── pretrain                (protein language model pretraining)
│   ├── megatron            (model folder)
│   ├── pretrain_tools      (multi-node pretraining)
│   ├── protein_tools       (data preprocessing scripts)
└── tape
    ├── conda_env           (conda env in yaml format)
    ├── converter           (converter script and model config files)
    ├── scripts             (model generator, finetune)
    └── tape                (tape model)

Usage

As the structure above shows, the workflow consists of two stages.

  • Pretrain
    • Prepare dataset (PFAM)
    • Preprocess data
    • Pretrain
  • Finetune
    • Convert the pretrained protein model checkpoint
    • Finetune on downstream tasks

Detailed explanations are given in each folder's readme.
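
To make the "Preprocess data" step concrete, here is a minimal sketch of converting raw amino-acid sequences into the JSON-lines format used by the preprocessing scripts under pretrain/protein_tools, where each record is a {"text": ...} object with residues separated by spaces. This is an illustration, not a script shipped with the repository; the file names raw_sequences.txt and tape_corpus.json are placeholders.

```python
import json

# Sketch: convert raw amino-acid sequences (one per line) into JSON lines of
# the form {"text": "G C T V E D R ..."}, i.e. residues separated by spaces,
# matching the examples shown under pretrain/protein_tools.
# The file names below are placeholders, not part of the repository.

def to_json_lines(input_path: str, output_path: str) -> None:
    with open(input_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            seq = line.strip()
            if not seq:
                continue
            spaced = " ".join(seq)  # "GCTV..." -> "G C T V ..."
            fout.write(json.dumps({"text": spaced}) + "\n")

if __name__ == "__main__":
    to_json_lines("raw_sequences.txt", "tape_corpus.json")
```

The resulting file can then be handed to the data preprocessing scripts in pretrain/protein_tools before pretraining.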

Downstream Tasks Performance

Task                  Metric               TAPE   ProteinLM (200M)   ProteinLM (3B)
contact prediction    P@L/5                0.36   0.52               0.75
remote homology       Top 1 Accuracy       0.21   0.26               0.30
secondary structure   Accuracy (3-class)   0.73   0.75               0.79
fluorescence          Spearman's rho       0.68   0.68               0.68
stability             Spearman's rho       0.73   0.77               0.79

Contact

If you have any problems using ProteinLM, feel free to contact us.

Reference

Our work is based on the following papers.

In addition, part of the code is adapted from Megatron-LM and TAPE.

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

@article{DBLP:journals/corr/abs-1909-08053,
  author    = {Mohammad Shoeybi and
               Mostofa Patwary and
               Raul Puri and
               Patrick LeGresley and
               Jared Casper and
               Bryan Catanzaro},
  title     = {Megatron-LM: Training Multi-Billion Parameter Language Models Using
               Model Parallelism},
  journal   = {CoRR},
  volume    = {abs/1909.08053},
  year      = {2019},
  url       = {http://arxiv.org/abs/1909.08053},
  archivePrefix = {arXiv},
  eprint    = {1909.08053},
  timestamp = {Tue, 24 Sep 2019 11:33:51 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1909-08053.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Evaluating Protein Transfer Learning with TAPE

@article{DBLP:journals/corr/abs-1906-08230,
  author    = {Roshan Rao and
               Nicholas Bhattacharya and
               Neil Thomas and
               Yan Duan and
               Xi Chen and
               John F. Canny and
               Pieter Abbeel and
               Yun S. Song},
  title     = {Evaluating Protein Transfer Learning with {TAPE}},
  journal   = {CoRR},
  volume    = {abs/1906.08230},
  year      = {2019},
  url       = {http://arxiv.org/abs/1906.08230},
  archivePrefix = {arXiv},
  eprint    = {1906.08230},
  timestamp = {Sat, 23 Jan 2021 01:20:25 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1906-08230.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
Comments
  • Extracting Embedding

    Hello,

    I found a similar issue (#5), but when I try to extract the embedding following your instructions, the dimensions of the transformer_output don't seem to fit the input. The shape of transformer_output is [micro-batch-size, max seq length, hidden size], but how does this output match the input? I checked the BERT model, and its output should be a pair (output1, output2), where output2 is the embedding that matches the input. But in your code, output2 is None. I wonder how we can get the embedding corresponding to the inputs. (A pooling sketch addressing this is given after the comments list below.)

    Thanks

    opened by funihang 1
  • I tried to run, but after starting the pretraining task, the process gets killed. Can you help?

    (proteinlm) xxxx@quant:~/ProteinLM/pretrain$ sh examples/pretrain_tape.sh
    using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
    using torch.float16 for parameters ...
    WARNING: overriding default arguments for tokenizer_type:BertWordPieceLowerCase with tokenizer_type:BertWordPieceCase
    ------------------------ arguments ------------------------
    adam_beta1 .......... 0.9
    adam_beta2 .......... 0.999
    adam_eps .......... 1e-08
    adlr_autoresume .......... False
    adlr_autoresume_interval .......... 1000
    apply_query_key_layer_scaling .......... True
    apply_residual_connection_post_layernorm .......... False
    attention_dropout .......... 0.1
    attention_softmax_in_fp32 .......... False
    bert_load .......... None
    bias_dropout_fusion .......... True
    bias_gelu_fusion .......... True
    block_data_path .......... None
    checkpoint_activations .......... False
    checkpoint_num_layers .......... 1
    clip_grad .......... 1.0
    consumed_train_samples .......... 0
    consumed_valid_samples .......... 0
    data_impl .......... mmap
    data_parallel_size .......... 1
    data_path .......... ['my-tape_text_sentence']
    DDP_impl .......... local
    distribute_checkpointed_activations .......... False
    distributed_backend .......... nccl
    eod_mask_loss .......... False
    eval_interval .......... 1000
    eval_iters .......... 10
    exit_duration_in_mins .......... None
    exit_interval .......... None
    faiss_use_gpu .......... False
    finetune .......... False
    fp16 .......... True
    fp16_lm_cross_entropy .......... False
    fp32_allreduce .......... False
    fp32_residual_connection .......... False
    global_batch_size .......... 8
    hidden_dropout .......... 0.1
    hidden_size .......... 768
    hysteresis .......... 2
    ict_head_size .......... None
    ict_load .......... None
    indexer_batch_size .......... 128
    indexer_log_interval .......... 1000
    init_method_std .......... 0.02
    initial_loss_scale .......... 4294967296
    layernorm_epsilon .......... 1e-12
    lazy_mpu_init .......... None
    load .......... ./checkopoint
    local_rank .......... None
    log_interval .......... 100
    loss_scale .......... None
    loss_scale_window .......... 1000
    lr .......... 0.0001
    lr_decay_iters .......... 990000
    lr_decay_samples .......... None
    lr_decay_style .......... linear
    lr_warmup_fraction .......... 0.01
    lr_warmup_iters .......... 0
    lr_warmup_samples .......... 0
    make_vocab_size_divisible_by .......... 128
    mask_prob .......... 0.15
    max_position_embeddings .......... 2176
    merge_file .......... None
    micro_batch_size .......... 4
    min_loss_scale .......... 1.0
    min_lr .......... 1e-05
    mmap_warmup .......... False
    no_load_optim .......... False
    no_load_rng .......... False
    no_save_optim .......... False
    no_save_rng .......... False
    num_attention_heads .......... 12
    num_layers .......... 12
    num_workers .......... 2
    onnx_safe .......... None
    openai_gelu .......... False
    override_lr_scheduler .......... False
    params_dtype .......... torch.float16
    pipeline_model_parallel_size .......... 1
    query_in_block_prob .......... 0.1
    rampup_batch_size .......... None
    rank .......... 0
    report_topk_accuracies .......... []
    reset_attention_mask .......... False
    reset_position_ids .......... False
    save .......... ./checkopoint
    save_interval .......... 10000
    scaled_masked_softmax_fusion .......... True
    scaled_upper_triang_masked_softmax_fusion .......... None
    seed .......... 1234
    seq_length .......... 2176
    short_seq_prob .......... 0.1
    split .......... 32593668,1715454,44311
    tensor_model_parallel_size .......... 1
    tensorboard_dir .......... None
    titles_data_path .......... None
    tokenizer_type .......... BertWordPieceCase
    train_iters .......... 2000000
    train_samples .......... None
    use_checkpoint_lr_scheduler .......... False
    use_cpu_initialization .......... False
    use_one_sent_docs .......... False
    vocab_file .......... ./protein_tools/iupac_vocab.txt
    weight_decay .......... 0.01
    world_size .......... 1
    -------------------- end of arguments ---------------------
    setting number of micro-batches to constant 2

    building BertWordPieceCase tokenizer ... padded vocab (size: 31) with 97 dummy tokens (new size: 128)
    initializing torch distributed ...
    initializing tensor model parallel with size 1
    initializing pipeline model parallel with size 1
    setting random seeds to 1234 ...
    initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    time to initialize megatron (seconds): 74.673
    [after megatron is initialized] datetime: 2022-02-09 00:02:02
    building TAPE model ...
    number of parameters on (tensor, pipeline) model parallel rank (0, 0): 87417728
    learning rate decay style: linear
    WARNING: could not find the metadata file ./checkopoint/latest_checkpointed_iteration.txt
    will not load any checkpoints and will start from random
    time (ms) | load checkpoint: 10.21
    [after model, optimizer, and learning rate scheduler are built] datetime: 2022-02-09 00:02:02
    building train, validation, and test datasets ...
    datasets target sizes (minimum size): train: 16000000 validation: 160080 test: 80
    building train, validation, and test datasets for TAPE ...
    building dataset index ...
    reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... finished creating indexed dataset in 0.013824 seconds
    number of documents: 32593668
    dataset split:
    train: document indices in [0, 30924048) total of 30924048 documents
    validation: document indices in [30924048, 32551627) total of 1627579 documents
    test: document indices in [32551627, 32593668) total of 42041 documents
    WARNING: could not find index map files, building the indices on rank 0 ...
    last epoch number of samples (26365) is larger than 80% of number of samples per epoch (28422), setting separate_last_epoch to False
    Killed

    opened by usccolumbia 1
  • Format of the sequence JSON file: which one?

    Which format should the sequence JSON file use? Do we need to add spaces between amino acids?

    In https://github.com/THUDM/ProteinLM/tree/main/pretrain:
    {"text": "GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ"}
    {"text": "RTIKVRILHAIGFEGGLMLLTIPMVAYAMDMTLFQAILLDLSMTTCILVYTFIFQWCYDILENR"}

    In https://github.com/THUDM/ProteinLM/tree/main/pretrain/protein_tools:
    {"text": "G C T V E D R C L I G M G A I L L N G C V I G S G S L V A A G A L I T Q "}
    {"text": "A D G I N L E I P R G E W I S V I G G N G S G K S T F L K S L I R L E A V K K G R I Y L E G R E L K K W S D R T L Y E K A G F V F Q N P E L Q F I R D T V F D E I A F G A R Q R S W P E E Q V E R K T A E L L Q E F G L D G H Q K A H P F T L S L G Q K R R L S V A T M L L F D Q D L L L L D E P T F "}

    opened by usccolumbia 1
  • Data download not working

    It seems like the data download is not working.

    For example, wget http://s3.amazonaws.com/proteindata/data_pytorch/pfam.tar.gz terminates with an HTTP 403 Forbidden error,

    and the other data downloads fail in the same way.

    opened by fuxuliu 1
  • Possible checkpoint loss in TAPE pretraining scripts

    In the pretraining scripts for the TAPE model, there are two commands that remove all contents of the checkpoint folder; this may cause loss of previous checkpoints.

    https://github.com/THUDM/ProteinLM/blob/main/pretrain/examples/pretrain_tape_distributed.sh#L17
    https://github.com/THUDM/ProteinLM/blob/main/pretrain/examples/pretrain_tape.sh#L8
    
    opened by Yijia-Xiao 0
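
For the embedding question raised in the first comment above: a minimal sketch of one common way to obtain a fixed-size embedding per input sequence, assuming only that transformer_output is a tensor of shape [micro_batch_size, seq_length, hidden_size] and that a padding mask for the input tokens is available. This illustrates masked mean pooling; it is not the repository's own API, and the names pool_embeddings and padding_mask are placeholders.

```python
import torch

# Sketch: pool per-token hidden states into one embedding per sequence.
# Assumes transformer_output has shape [micro_batch_size, seq_length, hidden_size]
# and padding_mask has shape [micro_batch_size, seq_length], with 1 marking real
# tokens and 0 marking padding. Both names are placeholders for whatever the
# model actually returns.

def pool_embeddings(transformer_output: torch.Tensor,
                    padding_mask: torch.Tensor) -> torch.Tensor:
    mask = padding_mask.unsqueeze(-1).to(transformer_output.dtype)  # [B, L, 1]
    summed = (transformer_output * mask).sum(dim=1)                 # [B, H]
    counts = mask.sum(dim=1).clamp(min=1.0)                         # [B, 1]
    return summed / counts                                          # [B, H]

if __name__ == "__main__":
    # Toy tensors matching the shapes reported in the comment.
    batch, seq_len, hidden = 4, 16, 768
    hidden_states = torch.randn(batch, seq_len, hidden)
    mask = torch.ones(batch, seq_len, dtype=torch.long)
    mask[:, 10:] = 0  # pretend the trailing positions are padding
    print(pool_embeddings(hidden_states, mask).shape)  # torch.Size([4, 768])
```

Per-token embeddings that correspond to the inputs are simply the rows transformer_output[i, :length_i, :] for each sequence i, so a second output (output2) is not required for this use.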