Protein Language Model

ProteinLM

We pretrain a protein language model based on the Megatron-LM framework and then evaluate it on TAPE (Tasks Assessing Protein Embeddings), a set of five biologically relevant semi-supervised learning tasks. Our pretrained model achieves good performance on these tasks.

Overview

Pre-trained models such as BERT have greatly advanced natural language processing, improving the performance of language models. Inspired by the similarity between amino acid sequences and text sequences, we apply language-model pre-training to biological data.

Guidance

We provide pretraining and finetuning code in two separate folders. If you use the pretrained model we provide, simply download the checkpoint and follow the finetune guide. If you want to pretrain a model yourself, refer to the pretrain guide.

Download ProteinLM

ProteinLM (200M)

For the pretrained model with 200 million parameters, you can download the model checkpoint via GoogleDrive or TsinghuaCloud.

ProteinLM (3B)

For the pretrained model with 3 billion parameters, you can download the model checkpoint from here.

Project Structure

.
├── pretrain                (protein language model pretraining)
│   ├── megatron            (model folder)
│   ├── pretrain_tools      (multi-node pretraining)
│   ├── protein_tools       (data preprocessing scripts)
└── tape
    ├── conda_env           (conda env in yaml format)
    ├── converter           (converter script and model config files)
    ├── scripts             (model generator, finetune)
    └── tape                (tape model)

Usage

As the structure above shows, the workflow consists of two stages.

  • Pretrain
    • Prepare dataset (PFAM)
    • Preprocess data
    • Pretrain
  • Finetune
    • Convert the pretrained protein model checkpoint
    • Finetune on downstream tasks

Detailed explanations are given in each folder's readme.
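
To make the "Preprocess data" step concrete, here is a minimal sketch of converting raw amino-acid sequences into the JSON-lines format used by the preprocessing scripts under pretrain/protein_tools, where each record is a {"text": ...} object with residues separated by spaces. This is an illustration, not a script shipped with the repository; the file names raw_sequences.txt and tape_corpus.json are placeholders.

```python
import json

# Sketch: convert raw amino-acid sequences (one per line) into JSON lines of
# the form {"text": "G C T V E D R ..."}, i.e. residues separated by spaces,
# matching the examples shown under pretrain/protein_tools.
# The file names below are placeholders, not part of the repository.

def to_json_lines(input_path: str, output_path: str) -> None:
    with open(input_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            seq = line.strip()
            if not seq:
                continue
            spaced = " ".join(seq)  # "GCTV..." -> "G C T V ..."
            fout.write(json.dumps({"text": spaced}) + "\n")

if __name__ == "__main__":
    to_json_lines("raw_sequences.txt", "tape_corpus.json")
```

The resulting file can then be handed to the data preprocessing scripts in pretrain/protein_tools before pretraining.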

Downstream Tasks Performance

Task                  Metric               TAPE   ProteinLM (200M)   ProteinLM (3B)
contact prediction    P@L/5                0.36   0.52               0.75
remote homology       Top 1 Accuracy       0.21   0.26               0.30
secondary structure   Accuracy (3-class)   0.73   0.75               0.79
fluorescence          Spearman's rho       0.68   0.68               0.68
stability             Spearman's rho       0.73   0.77               0.79

Contact

If you have any problems using ProteinLM, feel free to contact us.

Reference

Our work is based on the following papers.

In addition, part of the code is adapted from Megatron-LM and TAPE.

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

@article{DBLP:journals/corr/abs-1909-08053,
  author    = {Mohammad Shoeybi and
               Mostofa Patwary and
               Raul Puri and
               Patrick LeGresley and
               Jared Casper and
               Bryan Catanzaro},
  title     = {Megatron-LM: Training Multi-Billion Parameter Language Models Using
               Model Parallelism},
  journal   = {CoRR},
  volume    = {abs/1909.08053},
  year      = {2019},
  url       = {http://arxiv.org/abs/1909.08053},
  archivePrefix = {arXiv},
  eprint    = {1909.08053},
  timestamp = {Tue, 24 Sep 2019 11:33:51 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1909-08053.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Evaluating Protein Transfer Learning with TAPE

@article{DBLP:journals/corr/abs-1906-08230,
  author    = {Roshan Rao and
               Nicholas Bhattacharya and
               Neil Thomas and
               Yan Duan and
               Xi Chen and
               John F. Canny and
               Pieter Abbeel and
               Yun S. Song},
  title     = {Evaluating Protein Transfer Learning with {TAPE}},
  journal   = {CoRR},
  volume    = {abs/1906.08230},
  year      = {2019},
  url       = {http://arxiv.org/abs/1906.08230},
  archivePrefix = {arXiv},
  eprint    = {1906.08230},
  timestamp = {Sat, 23 Jan 2021 01:20:25 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1906-08230.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
Comments
  • Extracting Embedding

    Hello,

    I found a similar issue (#5), but when I try to extract the embedding following your instructions, the dimensions of the transformer_output don't seem to fit the input. The shape of transformer_output is [micro-batch-size, max seq length, hidden size], but how does this output match the input? I checked the BERT model, and its output should be a pair (output1, output2), where output2 is the embedding that matches the input. But in your code, output2 is None. I wonder how we can get the embedding corresponding to the inputs. (A pooling sketch addressing this is given after the comments list below.)

    Thanks

    opened by funihang 1
  • I tried to run, but after starting the pretraining task, the process gets killed. Can you help?

    (proteinlm) xxxx@quant:~/ProteinLM/pretrain$ sh examples/pretrain_tape.sh
    using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
    using torch.float16 for parameters ...
    WARNING: overriding default arguments for tokenizer_type:BertWordPieceLowerCase with tokenizer_type:BertWordPieceCase
    ------------------------ arguments ------------------------
    adam_beta1 .......... 0.9
    adam_beta2 .......... 0.999
    adam_eps .......... 1e-08
    adlr_autoresume .......... False
    adlr_autoresume_interval .......... 1000
    apply_query_key_layer_scaling .......... True
    apply_residual_connection_post_layernorm .......... False
    attention_dropout .......... 0.1
    attention_softmax_in_fp32 .......... False
    bert_load .......... None
    bias_dropout_fusion .......... True
    bias_gelu_fusion .......... True
    block_data_path .......... None
    checkpoint_activations .......... False
    checkpoint_num_layers .......... 1
    clip_grad .......... 1.0
    consumed_train_samples .......... 0
    consumed_valid_samples .......... 0
    data_impl .......... mmap
    data_parallel_size .......... 1
    data_path .......... ['my-tape_text_sentence']
    DDP_impl .......... local
    distribute_checkpointed_activations .......... False
    distributed_backend .......... nccl
    eod_mask_loss .......... False
    eval_interval .......... 1000
    eval_iters .......... 10
    exit_duration_in_mins .......... None
    exit_interval .......... None
    faiss_use_gpu .......... False
    finetune .......... False
    fp16 .......... True
    fp16_lm_cross_entropy .......... False
    fp32_allreduce .......... False
    fp32_residual_connection .......... False
    global_batch_size .......... 8
    hidden_dropout .......... 0.1
    hidden_size .......... 768
    hysteresis .......... 2
    ict_head_size .......... None
    ict_load .......... None
    indexer_batch_size .......... 128
    indexer_log_interval .......... 1000
    init_method_std .......... 0.02
    initial_loss_scale .......... 4294967296
    layernorm_epsilon .......... 1e-12
    lazy_mpu_init .......... None
    load .......... ./checkopoint
    local_rank .......... None
    log_interval .......... 100
    loss_scale .......... None
    loss_scale_window .......... 1000
    lr .......... 0.0001
    lr_decay_iters .......... 990000
    lr_decay_samples .......... None
    lr_decay_style .......... linear
    lr_warmup_fraction .......... 0.01
    lr_warmup_iters .......... 0
    lr_warmup_samples .......... 0
    make_vocab_size_divisible_by .......... 128
    mask_prob .......... 0.15
    max_position_embeddings .......... 2176
    merge_file .......... None
    micro_batch_size .......... 4
    min_loss_scale .......... 1.0
    min_lr .......... 1e-05
    mmap_warmup .......... False
    no_load_optim .......... False
    no_load_rng .......... False
    no_save_optim .......... False
    no_save_rng .......... False
    num_attention_heads .......... 12
    num_layers .......... 12
    num_workers .......... 2
    onnx_safe .......... None
    openai_gelu .......... False
    override_lr_scheduler .......... False
    params_dtype .......... torch.float16
    pipeline_model_parallel_size .......... 1
    query_in_block_prob .......... 0.1
    rampup_batch_size .......... None
    rank .......... 0
    report_topk_accuracies .......... []
    reset_attention_mask .......... False
    reset_position_ids .......... False
    save .......... ./checkopoint
    save_interval .......... 10000
    scaled_masked_softmax_fusion .......... True
    scaled_upper_triang_masked_softmax_fusion .......... None
    seed .......... 1234
    seq_length .......... 2176
    short_seq_prob .......... 0.1
    split .......... 32593668,1715454,44311
    tensor_model_parallel_size .......... 1
    tensorboard_dir .......... None
    titles_data_path .......... None
    tokenizer_type .......... BertWordPieceCase
    train_iters .......... 2000000
    train_samples .......... None
    use_checkpoint_lr_scheduler .......... False
    use_cpu_initialization .......... False
    use_one_sent_docs .......... False
    vocab_file .......... ./protein_tools/iupac_vocab.txt
    weight_decay .......... 0.01
    world_size .......... 1
    -------------------- end of arguments ---------------------
    setting number of micro-batches to constant 2

    building BertWordPieceCase tokenizer ... padded vocab (size: 31) with 97 dummy tokens (new size: 128)
    initializing torch distributed ...
    initializing tensor model parallel with size 1
    initializing pipeline model parallel with size 1
    setting random seeds to 1234 ...
    initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    time to initialize megatron (seconds): 74.673
    [after megatron is initialized] datetime: 2022-02-09 00:02:02
    building TAPE model ...
    number of parameters on (tensor, pipeline) model parallel rank (0, 0): 87417728
    learning rate decay style: linear
    WARNING: could not find the metadata file ./checkopoint/latest_checkpointed_iteration.txt
    will not load any checkpoints and will start from random
    time (ms) | load checkpoint: 10.21
    [after model, optimizer, and learning rate scheduler are built] datetime: 2022-02-09 00:02:02
    building train, validation, and test datasets ...
    datasets target sizes (minimum size): train: 16000000 validation: 160080 test: 80
    building train, validation, and test datasets for TAPE ...
    building dataset index ...
    reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... finished creating indexed dataset in 0.013824 seconds
    number of documents: 32593668
    dataset split:
    train: document indices in [0, 30924048) total of 30924048 documents
    validation: document indices in [30924048, 32551627) total of 1627579 documents
    test: document indices in [32551627, 32593668) total of 42041 documents
    WARNING: could not find index map files, building the indices on rank 0 ...
    last epoch number of samples (26365) is larger than 80% of number of samples per epoch (28422), setting separate_last_epoch to False
    Killed

    opened by usccolumbia 1
  • Format of the sequence JSON file: which one?

    Which format should the sequence JSON file use? Do we need to add spaces between amino acids?

    In https://github.com/THUDM/ProteinLM/tree/main/pretrain:
    {"text": "GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ"}
    {"text": "RTIKVRILHAIGFEGGLMLLTIPMVAYAMDMTLFQAILLDLSMTTCILVYTFIFQWCYDILENR"}

    In https://github.com/THUDM/ProteinLM/tree/main/pretrain/protein_tools:
    {"text": "G C T V E D R C L I G M G A I L L N G C V I G S G S L V A A G A L I T Q "}
    {"text": "A D G I N L E I P R G E W I S V I G G N G S G K S T F L K S L I R L E A V K K G R I Y L E G R E L K K W S D R T L Y E K A G F V F Q N P E L Q F I R D T V F D E I A F G A R Q R S W P E E Q V E R K T A E L L Q E F G L D G H Q K A H P F T L S L G Q K R R L S V A T M L L F D Q D L L L L D E P T F "}

    opened by usccolumbia 1
  • Data download not working

    It seems like the data download is not working.

    For example, wget http://s3.amazonaws.com/proteindata/data_pytorch/pfam.tar.gz terminates with an HTTP 403 Forbidden error,

    and the other data downloads fail in the same way.

    opened by fuxuliu 1
  • Possible checkpoint loss in TAPE pretraining scripts

    In the pretraining scripts for the TAPE model, there are two commands that remove all contents of the checkpoint folder; this may cause loss of previous checkpoints.

    https://github.com/THUDM/ProteinLM/blob/main/pretrain/examples/pretrain_tape_distributed.sh#L17
    https://github.com/THUDM/ProteinLM/blob/main/pretrain/examples/pretrain_tape.sh#L8
    
    opened by Yijia-Xiao 0
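
For the embedding question raised in the first comment above: a minimal sketch of one common way to obtain a fixed-size embedding per input sequence, assuming only that transformer_output is a tensor of shape [micro_batch_size, seq_length, hidden_size] and that a padding mask for the input tokens is available. This illustrates masked mean pooling; it is not the repository's own API, and the names pool_embeddings and padding_mask are placeholders.

```python
import torch

# Sketch: pool per-token hidden states into one embedding per sequence.
# Assumes transformer_output has shape [micro_batch_size, seq_length, hidden_size]
# and padding_mask has shape [micro_batch_size, seq_length], with 1 marking real
# tokens and 0 marking padding. Both names are placeholders for whatever the
# model actually returns.

def pool_embeddings(transformer_output: torch.Tensor,
                    padding_mask: torch.Tensor) -> torch.Tensor:
    mask = padding_mask.unsqueeze(-1).to(transformer_output.dtype)  # [B, L, 1]
    summed = (transformer_output * mask).sum(dim=1)                 # [B, H]
    counts = mask.sum(dim=1).clamp(min=1.0)                         # [B, 1]
    return summed / counts                                          # [B, H]

if __name__ == "__main__":
    # Toy tensors matching the shapes reported in the comment.
    batch, seq_len, hidden = 4, 16, 768
    hidden_states = torch.randn(batch, seq_len, hidden)
    mask = torch.ones(batch, seq_len, dtype=torch.long)
    mask[:, 10:] = 0  # pretend the trailing positions are padding
    print(pool_embeddings(hidden_states, mask).shape)  # torch.Size([4, 768])
```

Per-token embeddings that correspond to the inputs are simply the rows transformer_output[i, :length_i, :] for each sequence i, so a second output (output2) is not required for this use.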