ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

Overview

What is ProteinBERT?

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in a matter of minutes. Based on our experiments with a wide range of benchmarks, ProteinBERT usually achieves state-of-the-art performance. ProteinBERT is built on TensorFlow/Keras.

ProteinBERT's deep-learning architecture is inspired by BERT, but it contains several innovations, such as global-attention layers that scale only linearly with sequence length (compared to self-attention's quadratic growth). As a result, the model can process protein sequences of almost any length, including extremely long sequences of over tens of thousands of amino acids.

The model takes protein sequences as inputs, and can also take protein GO annotations as additional inputs (to help the model infer the function of the input protein and update its internal representations and outputs accordingly). This package provides seamless access to a pretrained state produced by training the model for 28 days over ~670M records (i.e. ~6.4 iterations over the entire training dataset of ~106M records). For users interested in pretraining the model from scratch, the package also includes scripts for doing so.
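As a quick illustration of the Python API, here is a minimal sketch of loading the pretrained model and extracting per-residue (local) and per-protein (global) representations. The sequences, seq_len and batch_size values are placeholders; the calls mirror the usage examples shown in the comments further down this page.

# Minimal sketch: load the pretrained model and compute local (per-residue) and
# global (per-protein) representations for a batch of sequences.
from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

seqs = ['MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ']  # placeholder sequences
seq_len = 512    # placeholder; should cover the longest input sequence (plus special tokens)
batch_size = 32  # placeholder

pretrained_model_generator, input_encoder = load_pretrained_model()
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))
X = input_encoder.encode_X(seqs, seq_len)
local_representations, global_representations = model.predict(X, batch_size = batch_size)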

Installation

Dependencies

ProteinBERT requires Python 3.

Below are the Python packages required by ProteinBERT, which are automatically installed with it (and the versions of these packages that were tested with ProteinBERT 1.0.0):

  • tensorflow (2.4.0)
  • tensorflow_addons (0.12.1)
  • numpy (1.20.1)
  • pandas (1.2.3)
  • h5py (3.2.1)
  • lxml (4.3.2)
  • pyfaidx (0.5.8)

Install ProteinBERT

Just run:

pip install protein-bert

Alternatively, clone this repository and run:

python setup.py install

Using ProteinBERT

Fine-tuning ProteinBERT is very easy. You can see some working examples in this notebook.
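For orientation, the following hedged sketch outlines the general flow for a binary per-protein label. The FinetuningModelGenerator arguments follow the usage shown in the comments below; the exact keyword arguments of finetune() are assumptions, so treat the notebook as the authoritative reference.

# Hedged sketch of fine-tuning on a binary per-protein label; the finetune() keyword
# arguments are assumptions -- consult the demo notebook for the authoritative example.
from proteinbert import OutputType, OutputSpec, FinetuningModelGenerator, load_pretrained_model, finetune
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

# Placeholder data: protein sequences and one binary label per sequence.
train_seqs, train_labels = ['MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ', 'MAVSVTPIRDTKWLTLEVCREFQRGTCSRPDTE'], [0, 1]
valid_seqs, valid_labels = ['MSASSRFIPEHRRQNYKGKGTFQADELRRRRET'], [1]

OUTPUT_TYPE = OutputType(False, 'binary')      # a single binary label per protein
OUTPUT_SPEC = OutputSpec(OUTPUT_TYPE, [0, 1])  # the set of possible labels

pretrained_model_generator, input_encoder = load_pretrained_model()
model_generator = FinetuningModelGenerator(pretrained_model_generator, OUTPUT_SPEC,
        pretraining_model_manipulation_function = get_model_with_hidden_layers_as_outputs, dropout_rate = 0.5)

finetune(model_generator, input_encoder, OUTPUT_SPEC, train_seqs, train_labels, valid_seqs, valid_labels,
        seq_len = 512, batch_size = 32, max_epochs_per_stage = 5)  # assumed keyword names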

Pretraining ProteinBERT from scratch

If, instead of using the existing pretrained model weights, you would like to train the model from scratch, follow the steps below. Be warned, however, that this is a long process (we pretrained the current model for a whole month), and it also requires a lot of storage (>1TB).

Step 1: Create the UniRef dataset

ProteinBERT is pretrained on a dataset derived from UniRef90. Follow these steps to produce this dataset:

  1. First, choose a working directory with sufficient (>1TB) free storage.
cd /some/workdir
  2. Download the metadata of GO from CAFA and extract it.
wget https://www.biofunctionprediction.org/cafa-targets/cafa4ontologies.zip
mkdir cafa4ontologies
unzip cafa4ontologies.zip -d cafa4ontologies/
  3. Download UniRef90, as both XML and FASTA.
wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.xml.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
gunzip uniref90.fasta.gz
  4. Use the create_uniref_db script provided by ProteinBERT to extract the GO annotations associated with UniRef's records into an SQLite database (and a CSV file with the metadata of these GO annotations). Since this is a long process (which can take up to a few days), it is recommended to run this in the background (e.g. using nohup).
nohup create_uniref_db --uniref-xml-gz-file=./uniref90.xml.gz --go-annotations-meta-file=./cafa4ontologies/go.txt --output-sqlite-file=./uniref_proteins_and_annotations.db --output-go-annotations-meta-csv-file=./go_annotations.csv >&! ./log_create_uniref_db.txt &
  5. Create the final dataset (in the H5 format) by merging the database of GO annotations with the protein sequences, using the create_uniref_h5_dataset script provided by ProteinBERT. This is also a long process that should be left to run in the background (a quick way to sanity-check the resulting file is sketched after this list).
nohup create_uniref_h5_dataset --protein-annotations-sqlite-db-file=./uniref_proteins_and_annotations.db --protein-fasta-file=./uniref90.fasta --go-annotations-meta-csv-file=./go_annotations.csv --output-h5-dataset-file=./dataset.h5 --min-records-to-keep-annotation=100 >&! ./log_create_uniref_h5_dataset.txt &
  6. Finally, use ProteinBERT's set_h5_testset script to designate which of the dataset records will be considered part of the test set (so that their GO annotations are not used during pretraining). If you are planning to evaluate your model on certain downstream benchmarks, it is recommended that any UniRef record similar to a test-set protein in those benchmarks be designated part of the pretraining test set. You can use BLAST to find all of these UniRef records and provide them to set_h5_testset through the flag --uniprot-ids-file=./uniref_90_seqs_matching_test_set_seqs.txt, where the provided text file contains the UniProt IDs of the relevant records, one per line (e.g. A0A009EXK6_ACIBA).
set_h5_testset --h5-dataset-file=./dataset.h5
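As promised above, here is a small sanity-check sketch for the generated file. It assumes only that dataset.h5 is a valid HDF5 file and uses the h5py dependency to list its contents.

# Sanity-check sketch: list the datasets stored in dataset.h5 and their shapes.
# This assumes nothing about the file's internal layout beyond it being valid HDF5.
import h5py

with h5py.File('./dataset.h5', 'r') as f:
    def describe(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(describe)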

Step 2: Pretrain ProteinBERT on the UniRef dataset

Once you have the dataset ready, the pretrain_proteinbert script will train a ProteinBERT model on that dataset.

Basic use of the pretraining script looks as follows:

mkdir -p ~/proteinbert_models/new
nohup pretrain_proteinbert --dataset-file=./dataset.h5 --autosave-dir=~/proteinbert_models/new >&! ~/proteinbert_models/log_new_pretraining.txt &

When run this way, ProteinBERT will continue to train indefinitely, so make sure to run it in the background (e.g. using nohup). After every given number of epochs (an epoch is defined as 100 batches), the model state is automatically saved into the specified autosave directory. If the process is interrupted and you wish to resume pretraining from a given snapshot (e.g. the most recent state file in the autosave dir), use the --resume-from flag and provide it the state file you wish to resume from.

pretrain_proteinbert has MANY options and hyper-parameters that are worth checking out:

pretrain_proteinbert --help

Step 3: Use your pretrained model state when fine-tuning ProteinBERT

Normally, the load_pretrained_model function is used to load the existing pretrained model state. To load your own pretrained model state instead, use the load_pretrained_model_from_dump function.
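For illustration, here is a hedged sketch of what this might look like. The positional arguments (the path to a saved state file and the model-creation function from conv_and_global_attention_model) are assumptions based on how load_pretrained_model wraps this function, so check proteinbert/model_generation.py for the actual signature.

# Hedged sketch -- the argument list is an assumption; see proteinbert/model_generation.py.
from proteinbert.model_generation import load_pretrained_model_from_dump
from proteinbert.conv_and_global_attention_model import create_model  # assumed model-creation function

# Path to a snapshot saved by pretrain_proteinbert into the autosave directory (placeholder path).
state_file = '/path/to/proteinbert_models/new/some_saved_state.pkl'
pretrained_model_generator, input_encoder = load_pretrained_model_from_dump(state_file, create_model)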

License

ProteinBERT is a free open-source project available under the MIT License.

Cite us

If you use ProteinBERT as part of a work contributing to a scientific publication, we ask that you cite our paper: Brandes, N., Ofer, D., Peleg, Y., Rappoport, N. & Linial, M. ProteinBERT: A universal deep-learning model of protein sequence and function. bioRxiv (2021). https://doi.org/10.1101/2021.05.24.445464

Comments
  • How to extract protein embeddings/representations learned by the model?

    Hi, could you please tell me how I can extract the embeddings learned by the model, similar to FAIR's ESM model? How do I pass a .fasta file to the model and extract the representations?

    Thanks

    opened by xinformatics 36
  • save finetuned model

    Hi, I have a question about saving a fine-tuned model, please.

    After fine-tuning, I'm using save_weights like this:

    finetune(model_generator,  ...)
    ...
    model_generator.save_weights('fine_tuned_model.h5')
    

    and if I want to use the model later:

    pretrained_model_generator, input_encoder = load_pretrained_model(local_model_dump_dir='')
    model_generator = FinetuningModelGenerator(
        pretrained_model_generator, 
        OUTPUT_SPEC, 
        pretraining_model_manipulation_function = get_model_with_hidden_layers_as_outputs,
        dropout_rate = 0.5)
    fine_tuned_model = model_generator.create_model(512)
    fine_tuned_model.load_weights('fine_tuned_model.h5')
    

    where OUTPUT_SPEC is the same one I used to fine-tune the model.

    Is this ok?

    opened by mtinti 9
  • How to customize different fine-tuning task like protein-protein interaction (PPI) ?

    Hello: As mentioned in the title, I wonder how to customize a fine-tuning task like protein-protein interaction. I am not sure which part of finetuning.py (or perhaps FinetuningModelGenerator in model_generation.py) I should change; I checked all the fine-tuning tasks in this repo, and each always has a single protein sequence per sample. The following CSV format is the dataset I want to use for my task. Many thanks.

    | label | seq_A | seq_B |
    |-------|-------|-------|
    | 0 | MAVSVTPIRDTKWLTLEVCREFQRGTCSRPDTECKFAHPSKSCQVENGRVIACFDSLKGRCSRENCKYLH | MATTNSFTILIFMILATTSSTFATLGEMVTVLSIDGGGIKGIIPATILEFLEGQLQEVDNNTDARLADYF |
    | 1 | MCCEKWSRVAEMFLFIEEREDCKILCLCSRAFVEDRKLYNLGLKGYYIRDSGNNSGDQATEEEEGGYSCG | MSASSRFIPEHRRQNYKGKGTFQADELRRRRETQQIEIRKQKREENLNKRRNLVDVQEPAEETIPLEQDK |

    opened by zrhsu0911 4
  • Hello World

    Can you please post a hello world code example? I need to use the pre-trained model to predict one missing residue in a single sequence. I have spent hours trying to figure it out using the demo examples and code snippets in the issues, but haven't had any success. I would be really grateful for any help with this.

    opened by BiochemStudent2 4
  • chunk_size value

    I figured out that model.fit takes batch_size * batches_per_epoch samples. However, we load 100,000 samples each time we need new data (chunk_size). Can we reduce this number to batch_size * batches_per_epoch samples so that memory usage decreases (in the case of a fixed batch_size=64)?

    opened by dsaeedeh 4
  • Error whilst evaluating fine-tuned model with categorical GO terms

    Hello,

    I am specifically looking to use the model to predict GO terms on sequences less than 150 AAs in length, so I am attempting to fine-tune the model on my dataset of small sequences. The fine-tuning process seems successful; however, when I run the evaluate_by_len method in finetuning.py I get an out-of-memory error (see below). The error originates on line 90, from y_pred = model.predict(X, batch_size = batch_size). I have reduced the batch_size right down to 2 in an attempt to prevent the error, but have had no success. I just wanted to check that I am inputting the data correctly and that it isn't a problem with my code.

    Code & inputs into methods
     save_data = {
          "benchmark_name": BENCHMARK_NAME,
          "samples": [('Training-set', (train_set_seqs, train_set_labels)), ('Validation-set', (valid_set_seqs, valid_set_labels))],
          "model_generator": model_generator,
          "input_encoder": input_encoder,
          "output_spec": output_spec,
          "start_seq_len" : settings['seq_len'],
          "start_batch_size": settings['batch_size']
        }
    
    for dataset_name, dataset in saved_data["samples"]:
              log('*** %s performance: ***' % dataset_name)
              log('batch size: ', saved_data["start_batch_size"])
              results, confusion_matrix = evaluate_by_len(saved_data["model_generator"], saved_data["input_encoder"], saved_data["output_spec"], dataset[0], dataset[1], \
                      start_seq_len = saved_data["start_seq_len"], start_batch_size = saved_data["start_batch_size"])
    

    (These are entered as dataframes, which I assume is the correct input format, but I just want to check.) dataset[0] (output of the .head() method):

    P82299                               GAYGQGQNIGQLFVNILIFLFY
    O46577    SVVKSEDFSLPAYMDRRDHPLPEVAHVKHLSASQKALKEKEKASWS...
    A0QKT3    MPTYAPKAGDTTRSWYVIDATDVVLGRLAVAAANLLRGKHKPTFAP...
    Q4QM28    MYAVFQSGGKQHRVSEGQVVRLEKLELATGATVEFDSVLMVVNGED...
    A7GR09    MKWWKLSGQILLLFCFAWTGEWIAKQVHLPIPGSIIGIFLLLISLK...
    Name: seq, dtype: object
    

    dataset[1]:

    P82299                             [GO:0005615, GO:0005344]
    O46577    [GO:0016021, GO:0005743, GO:0005751, GO:000412...
    A0QKT3                 [GO:0005840, GO:0003735, GO:0006412]
    Q4QM28     [GO:0005840, GO:0019843, GO:0003735, GO:0006412]
    A7GR09     [GO:0016021, GO:0005886, GO:0019835, GO:0012501]
    
    OOM Error
    2022-03-01 15:33:19.980713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: 
    pciBusID: 0000:84:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
    coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
    2022-03-01 15:33:19.980756: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
    2022-03-01 15:33:19.980797: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
    2022-03-01 15:33:19.980817: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
    2022-03-01 15:33:19.980840: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
    2022-03-01 15:33:19.980859: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
    2022-03-01 15:33:19.980881: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
    2022-03-01 15:33:19.980904: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
    2022-03-01 15:33:19.980948: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
    2022-03-01 15:33:19.981759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
    2022-03-01 15:33:19.981796: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
    2022-03-01 15:33:20.725621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
    2022-03-01 15:33:20.725680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1 
    2022-03-01 15:33:20.725691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N N 
    2022-03-01 15:33:20.725697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   N N 
    2022-03-01 15:33:20.726963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11119 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:08:00.0, compute capability: 6.0)
    2022-03-01 15:33:20.728077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11119 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-12GB, pci bus id: 0000:84:00.0, compute capability: 6.0)
    2022-03-01 15:33:22.452796: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
    2022-03-01 15:33:22.453398: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3196210000 Hz
    2022-03-01 15:33:23.836672: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
    2022-03-01 15:33:24.074444: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
    2022-03-01 15:33:24.080029: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
    2022-03-01 15:33:24.943426: W tensorflow/stream_executor/gpu/asm_compiler.cc:63] Running ptxas --version returned 256
    2022-03-01 15:33:24.995202: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 256, output: 
    Relying on driver to perform ptx compilation. 
    Modify $PATH to customize ptxas location.
    This message will be only logged once.
    2022-03-01 15:33:54.356200: W tensorflow/core/common_runtime/bfc_allocator.cc:433] Allocator (GPU_0_bfc) ran out of memory trying to allocate 128.0KiB (rounded to 131072)requested by op model_2/global-attention-block5/transpose_3
    Current allocation summary follows.
    2022-03-01 15:33:54.356891: I tensorflow/core/common_runtime/bfc_allocator.cc:972] BFCAllocator dump for GPU_0_bfc
    2022-03-01 15:33:54.356918: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (256): 	Total Chunks: 31, Chunks in use: 31. 7.8KiB allocated for chunks. 7.8KiB in use in bin. 240B client-requested in use in bin.
    2022-03-01 15:33:54.356930: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (512): 	Total Chunks: 50, Chunks in use: 50. 25.5KiB allocated for chunks. 25.5KiB in use in bin. 25.2KiB client-requested in use in bin.
    2022-03-01 15:33:54.356939: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (1024): 	Total Chunks: 2, Chunks in use: 1. 2.8KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
    2022-03-01 15:33:54.356948: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (2048): 	Total Chunks: 38, Chunks in use: 38. 76.5KiB allocated for chunks. 76.5KiB in use in bin. 76.0KiB client-requested in use in bin.
    2022-03-01 15:33:54.356956: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (4096): 	Total Chunks: 1, Chunks in use: 0. 5.2KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
    2022-03-01 15:33:54.356965: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (8192): 	Total Chunks: 2, Chunks in use: 1. 26.0KiB allocated for chunks. 13.0KiB in use in bin. 13.0KiB client-requested in use in bin.
    2022-03-01 15:33:54.356981: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (16384): 	Total Chunks: 3, Chunks in use: 1. 71.5KiB allocated for chunks. 20.5KiB in use in bin. 20.3KiB client-requested in use in bin.
    2022-03-01 15:33:54.356990: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (32768): 	Total Chunks: 1, Chunks in use: 1. 51.0KiB allocated for chunks. 51.0KiB in use in bin. 34.9KiB client-requested in use in bin.
    2022-03-01 15:33:54.356999: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (65536): 	Total Chunks: 8, Chunks in use: 8. 661.2KiB allocated for chunks. 661.2KiB in use in bin. 557.5KiB client-requested in use in bin.
    2022-03-01 15:33:54.357008: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (131072): 	Total Chunks: 11, Chunks in use: 11. 1.50MiB allocated for chunks. 1.50MiB in use in bin. 1.32MiB client-requested in use in bin.
    2022-03-01 15:33:54.357016: I tensorflow/core/common_runtime/bfc_allocator.cc:979] Bin (262144): 	Total Chunks: 17, Chunks in use: 17. 4.41MiB allocated for chunks. 4.41MiB in use in bin. 4.25MiB client-requested in use in bin.
    ..... (A lot more chunk messages)
    memory_limit_: 11659697216 available bytes: 64 curr_region_allocation_bytes_: 23319394816
    2022-03-01 15:34:14.429783: I tensorflow/core/common_runtime/bfc_allocator.cc:1048] Stats: 
    Limit:                     11659697216
    InUse:                     11659624704
    MaxInUse:                  11659624704
    NumAllocs:                     1112241
    MaxAllocSize:                592642048
    Reserved:                            0
    PeakReserved:                        0
    LargestFreeBlock:                    0
    Traceback (most recent call last):
      File "/home/mah51/files/protein_bert/evaluate_go.py", line 37, in <module>
        run_job()
      File "/home/mah51/files/protein_bert/evaluate_go.py", line 27, in run_job
        start_seq_len = saved_data["start_seq_len"], start_batch_size = saved_data["start_batch_size"])
      File "/shared/home/mah51/files/protein_bert/proteinbert/finetuning.py", line 93, in evaluate_by_len
        y_pred = model.predict(X, batch_size = 1)
      File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1629, in predict
        tmp_batch_outputs = self.predict_function(iterator)
      File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
        result = self._call(*args, **kwds)
      File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 862, in _call
        results = self._stateful_fn(*args, **kwds)
      File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2943, in __call__
        filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
      File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
        ctx, args, cancellation_manager=cancellation_manager))
      File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 560, in call
        ctx=ctx)
      File "/home/mah51/anaconda/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
        inputs, attrs, num_outputs)
    tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[128,4,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    	 [[node model_2/global-attention-block5/transpose_3 (defined at shared/home/mah51/files/protein_bert/proteinbert/conv_and_global_attention_model.py:77) ]]
    Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
     [Op:__inference_predict_function_6175]
    
    Function call stack:
    predict_function
    

    Thank you in advance for any help!

    opened by mah51 4
  • WARNING:tensorflow:`evaluate()` received a value for `sample_weight`, but `weighted_metrics` were not provided

    Hi,

    I'm getting the following warning when running the "Run all benchmarks" section of the demo notebook with some of my own data. Is this something to worry about? The fine-tuning finishes and yields results for training/test/validation. Could it be related to using CPUs rather than GPUs?

    WARNING:tensorflow:`evaluate()` received a value for `sample_weight`, but `weighted_metrics` were not provided. Did you mean to pass metrics to `weighted_metrics` in `compile() ? If this is intentional you can pass `weighted_metrics=[]` to `compile()` in order to silence this warning.

    Also, while at it, would you recommend experimenting with any of the settings in that block (lr, dropout, batch size, etc.)? I wasn't sure whether any parameters are being varied automatically given that a validation set is present, or whether changing something manually might be of benefit.

    Thank you!

    opened by das22 3
  • The annotation number does not match your pretrained model

    Hi Protein_bert team,

    Thanks for providing such a useful model. I found one weird thing about your pretrained model: after I created the UniRef90 database and merged it with the GO database through the pipeline you provided, I got 9211 annotation records. This differs from the number in your manuscript, 8943. Would you mind looking into this?

    opened by zhongguojie1998 3
  • Inputs types while pretraining alone

    Hi @nadavbra,

    I'm trying to pretrain the model on an antibodies dataset (without annotations). Is the screenshot below a good example of inputs to the model, in terms of types (for example, should seqs be an array of strings, or maybe an array of arrays)?

    [screenshot of the example inputs omitted]

    Thank you very much for your help, Alon

    opened by fridalon 3
  • Pretraining from scratch - Error while creating the h5 dataset file?

    Hi,

    I'm trying to pretrain from scratch with UniRef50. Up until running the create_uniref_h5_dataset script (step 1 (5)), everything seems fine. After running this script, the output log looks like it did some work (it took 5.5 hours), except that the last 2 lines are 'Finished. Failed finding the sequence for 51333317 of 51333317 records.' and 'Done', and the dataset.h5 file weighs only 290 KB. Is this really an error? Is this size too small for the file?

    Thank you for your help, Alon

    opened by fridalon 3
  • cannot load pkl when loading the pre-trained protein_bert model

    Hi nadavbra,

    I'm running into the same issue as WangShixianChina (commented on Sep 29). When I try to load the pre-trained model in my protein_bert.py script:


    from tensorflow import keras
    from proteinbert import OutputType, OutputSpec, FinetuningModelGenerator, load_pretrained_model, finetune, evaluate_by_len
    from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

    pretrained_model_generator, input_encoder = load_pretrained_model()


    I get an error message stating: pickle.UnpicklingError: pickle data was truncated.

    I need help unloading/unpickling the epoch_92400_sample_23500000.pkl file. I tried to install the exact or closest Python packages required by ProteinBERT. I also show the stack trace when I try to load the pretrained model (see below).

    $ pip install tensorflow==2.4.0
    Collecting tensorflow==2.4.0
    Successfully installed gast-0.3.3 grpcio-1.32.0 numpy-1.19.5 tensorflow-2.4.0 tensorflow-estimator-2.4.0

    $ pip install tensorflow_addons==0.12.1
    Successfully installed tensorflow_addons==0.12.1

    $ pip install numpy==1.20.1
    ERROR: Could not find a version that satisfies the requirement numpy==1.20.1 (from versions: 1.3.0, 1.4.1, 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.7.1, 1.7.2, 1.8.0, 1.8.1, 1.8.2, 1.9.0, 1.9.1, 1.9.2, 1.9.3, 1.10.0.post2, 1.10.1, 1.10.2, 1.10.4, 1.11.0, 1.11.1, 1.11.2, 1.11.3, 1.12.0, 1.12.1, 1.13.0rc1, 1.13.0rc2, 1.13.0, 1.13.1, 1.13.3, 1.14.0rc1, 1.14.0, 1.14.1, 1.14.2, 1.14.3, 1.14.4, 1.14.5, 1.14.6, 1.15.0rc1, 1.15.0rc2, 1.15.0, 1.15.1, 1.15.2, 1.15.3, 1.15.4, 1.16.0rc1, 1.16.0rc2, 1.16.0, 1.16.1, 1.16.2, 1.16.3, 1.16.4, 1.16.5, 1.16.6, 1.17.0rc1, 1.17.0rc2, 1.17.0, 1.17.1, 1.17.2, 1.17.3, 1.17.4, 1.17.5, 1.18.0rc1, 1.18.0, 1.18.1, 1.18.2, 1.18.3, 1.18.4, 1.18.5, 1.19.0rc1, 1.19.0rc2, 1.19.0, 1.19.1, 1.19.2, 1.19.3, 1.19.4, 1.19.5)
    ERROR: No matching distribution found for numpy==1.20.1
    $ pip install numpy==1.19.5
    Requirement already satisfied: numpy==1.19.5

    $ pip install pandas==1.2.3
    ERROR: Could not find a version that satisfies the requirement pandas==1.2.3 (from versions: 0.1, 0.2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.21.1, 0.22.0, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.25.2, 0.25.3, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.0.5, 1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5)
    ERROR: No matching distribution found for pandas==1.2.3
    $ pip install pandas==1.1.5
    Successfully installed pandas-1.1.5

    $ pip install h5py==3.2.1
    ERROR: Could not find a version that satisfies the requirement h5py==3.2.1 (from versions: 2.2.1, 2.3.0b1, 2.3.0, 2.3.1, 2.4.0b1, 2.4.0, 2.5.0, 2.6.0, 2.7.0rc2, 2.7.0, 2.7.1, 2.8.0rc1, 2.8.0, 2.9.0rc1, 2.9.0, 2.10.0, 3.0.0rc1, 3.0.0, 3.1.0)
    ERROR: No matching distribution found for h5py==3.2.1
    $ pip install h5py==3.1.0
    Successfully installed cached-property-1.5.2 h5py-3.1.0

    $ pip install lxml==4.3.2
    Successfully installed lxml-4.3.2

    $ pip install pyfaidx==0.5.8
    Successfully installed pyfaidx-0.5.8

    $ python protein_bert.py
    2021-12-02 08:33:15.585243: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
    2021-12-02 08:33:15.585275: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    Traceback (most recent call last):
      File "protein_bert.py", line 40, in <module>
        pretrained_model_generator, input_encoder = load_pretrained_model()
      File "/home/williamsawran/anaconda3/envs/protein_bert/lib/python3.6/site-packages/proteinbert/existing_model_loading.py", line 53, in load_pretrained_model
        other_optimizer_kwargs = other_optimizer_kwargs, annots_loss_weight = annots_loss_weight, load_optimizer_weights = load_optimizer_weights)
      File "/home/williamsawran/anaconda3/envs/protein_bert/lib/python3.6/site-packages/proteinbert/model_generation.py", line 159, in load_pretrained_model_from_dump
        n_annotations, model_weights, optimizer_weights = pickle.load(f)
    _pickle.UnpicklingError: pickle data was truncated

    opened by WilliamSawran 3
  • What to do with the local_representations and global_representations

    Hello everyone,

    After using the model I have two arrays: the local_representations and the global_representations.

    # After parsing the sequences from the FASTA file into 'seqs' and choosing 'seq_len' (e.g. 512) and 'batch_size' (e.g. 32)
    
    from proteinbert import load_pretrained_model
    from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs
    
    pretrained_model_generator, input_encoder = load_pretrained_model()
    model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))
    X = input_encoder.encode_X(seqs, seq_len)
    local_representations, global_representations= model.predict(X, batch_size = batch_size)
    

    But now I don't know what to do to get the GO annotations of my sequences?

    all the best

    opened by rdenise 1