Overview

dont-stop-pretraining

Code associated with the Don't Stop Pretraining ACL 2020 paper

Citation

@inproceedings{dontstoppretraining2020,
 author = {Suchin Gururangan and Ana Marasović and Swabha Swayamdipta and Kyle Lo and Iz Beltagy and Doug Downey and Noah A. Smith},
 title = {Don't Stop Pretraining: Adapt Language Models to Domains and Tasks},
 year = {2020},
 booktitle = {Proceedings of ACL},
}

Installation

conda env create -f environment.yml
conda activate domains

Working with the latest allennlp version

This repository works with a pinned allennlp version for reproducibility purposes. The pinned version of allennlp relies on pytorch-transformers==1.2.0, which requires you to manually download custom transformer models to disk.

To run this code with the latest allennlp/transformers versions (and use the Hugging Face model hub to its full capacity), check out the latest-allennlp branch. Note that we haven't tested all models on that branch, so your results may vary from what we report in the paper.

If you'd like to use the pinned allennlp version, read on. Otherwise, check out latest-allennlp.

Available Pretrained Models

We've uploaded DAPT and TAPT models to huggingface.

DAPT models

Available DAPT models:

allenai/cs_roberta_base
allenai/biomed_roberta_base
allenai/reviews_roberta_base
allenai/news_roberta_base

TAPT models

Available TAPT models:

allenai/dsp_roberta_base_dapt_news_tapt_ag_115K
allenai/dsp_roberta_base_tapt_ag_115K
allenai/dsp_roberta_base_dapt_reviews_tapt_amazon_helpfulness_115K
allenai/dsp_roberta_base_tapt_amazon_helpfulness_115K
allenai/dsp_roberta_base_dapt_biomed_tapt_chemprot_4169
allenai/dsp_roberta_base_tapt_chemprot_4169
allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
allenai/dsp_roberta_base_tapt_citation_intent_1688
allenai/dsp_roberta_base_dapt_news_tapt_hyperpartisan_news_5015
allenai/dsp_roberta_base_dapt_news_tapt_hyperpartisan_news_515
allenai/dsp_roberta_base_tapt_hyperpartisan_news_5015
allenai/dsp_roberta_base_tapt_hyperpartisan_news_515
allenai/dsp_roberta_base_dapt_reviews_tapt_imdb_20000
allenai/dsp_roberta_base_dapt_reviews_tapt_imdb_70000
allenai/dsp_roberta_base_tapt_imdb_20000
allenai/dsp_roberta_base_tapt_imdb_70000
allenai/dsp_roberta_base_dapt_biomed_tapt_rct_180K
allenai/dsp_roberta_base_tapt_rct_180K
allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500
allenai/dsp_roberta_base_tapt_rct_500
allenai/dsp_roberta_base_dapt_cs_tapt_sciie_3219
allenai/dsp_roberta_base_tapt_sciie_3219

The final number in each model name above is the dataset size. Models with larger dataset sizes (e.g. imdb_70000 vs. imdb_20000) are curated TAPT models; these only exist for imdb, rct, and hyperpartisan_news.
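
If you just want to use one of these checkpoints outside of this repository, they also load directly with a recent transformers release. The snippet below is a minimal sketch (it is not tied to the pinned environment above, and the example sentence is arbitrary):

from transformers import AutoModel, AutoTokenizer

model_name = "allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688"

# Download the tokenizer and encoder weights from the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode one sentence and take the representation of the first (<s>) token.
inputs = tokenizer("We follow the approach of prior work.", return_tensors="pt")
outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])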

Downloading Pretrained models

You can download a pretrained model using the scripts/download_model.py script.

Just supply a model type and serialization directory, like so:

python -m scripts.download_model \
        --model allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
        --serialization_dir $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688

This will download the allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 model for the Citation Intent corpus into $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688.

Downloading data

All task data is available at a public S3 URL; check environments/datasets.py.

If you run the scripts/train.py command (see next step), the relevant dataset(s) will be downloaded automatically using the URLs in environments/datasets.py. However, if you'd like to download the data for use outside of this repository, you will have to curl each file individually, e.g. for chemprot:

curl -Lo train.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/train.jsonl
curl -Lo dev.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/dev.jsonl
curl -Lo test.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/test.jsonl
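
If you'd rather script the downloads, here is a minimal Python sketch that mirrors the curl calls above. It assumes the other datasets follow the same <dataset>/<split>.jsonl URL pattern as chemprot; environments/datasets.py remains the source of truth for the exact URLs.

import os
import urllib.request

# Assumption: datasets follow the same <dataset>/<split>.jsonl layout as the
# chemprot example above; check environments/datasets.py for the exact URLs.
BASE_URL = "https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data"

def download_dataset(name, out_dir="."):
    os.makedirs(out_dir, exist_ok=True)
    for split in ("train", "dev", "test"):
        url = f"{BASE_URL}/{name}/{split}.jsonl"
        dest = os.path.join(out_dir, f"{split}.jsonl")
        print(f"Downloading {url} -> {dest}")
        urllib.request.urlretrieve(url, dest)

download_dataset("chemprot", out_dir="data/chemprot")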

Example commands

Run basic RoBERTa model

The following command will train a RoBERTa classifier on the Citation Intent corpus. Check environments/datasets.py for other datasets you can pass to the --dataset flag.

python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/citation_intent_base \
        --hyperparameters ROBERTA_CLASSIFIER_SMALL \
        --dataset citation_intent \
        --model roberta-base \
        --device 0 \
        --perf +f1 \
        --evaluate_on_test

You can supply other downloaded models to this script by providing a path to the model:

python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/citation-intent-dapt-dapt \
        --hyperparameters ROBERTA_CLASSIFIER_SMALL \
        --dataset citation_intent \
        --model $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
        --device 0 \
        --perf +f1 \
        --evaluate_on_test

Perform hyperparameter search

First, install allentune: https://github.com/allenai/allentune

Modify search_space/classifier.jsonnet accordingly.

Then run:

allentune search \
            --experiment-name ag_search \
            --num-cpus 56 \
            --num-gpus 4 \
            --search-space search_space/classifier.jsonnet \
            --num-samples 100 \
            --base-config training_config/classifier.jsonnet  \
            --include-package dont_stop_pretraining

Modify --num-gpus and --num-samples accordingly.

Comments
  • How is CS and BioMed corpus filtered from S2ORC dataset

    Hi team, I'm wondering how the CS/BioMed corpora are filtered from the S2ORC dataset? I didn't find details on this in the original paper; could you shed some light on this? Thanks!

    question 
    opened by stevezheng23 24
  • allennlp.common.checks.ConfigurationError: Extra parameters passed to PretrainedTransformerIndexer: {'do_lowercase': False}

    Hi, I have set up the conda environment and ran the scripts, i.e. first running

    python -m scripts.download_model --model allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 --serialization_dir $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688

    and then

    python -m scripts.train --config training_config/classifier.jsonnet --serialization_dir model_logs/citation-intent-dapt-dapt --hyperparameters ROBERTA_CLASSIFIER_SMALL --dataset citation_intent --model $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 --device 0 --perf +f1 --evaluate_on_test

    but I am getting this error now:

    /home/mikeleatila/anaconda3/envs/domains/bin/python /home/mikeleatila/dont_stop_pretraining_master/scripts/train.py --config training_config/classifier.jsonnet --serialization_dir model_logs/citation-intent-dapt-dapt --hyperparameters ROBERTA_CLASSIFIER_SMALL --dataset citation_intent --model /home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 --device 0 --perf +f1 --evaluate_on_test

    2022-11-24 09:59:10,204 - INFO - transformers.file_utils - PyTorch version 1.13.0 available.
    2022-11-24 09:59:10,816 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
    Traceback (most recent call last):
      File "/home/mikeleatila/anaconda3/envs/domains/bin/allennlp", line 8, in <module>
        sys.exit(run())
      [... allennlp from_params call stack elided ...]
      File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/params.py", line 421, in assert_empty
        "Extra parameters passed to {}: {}".format(class_name, self.params)
    allennlp.common.checks.ConfigurationError: Extra parameters passed to PretrainedTransformerIndexer: {'do_lowercase': False}

    [... allennlp.common.params INFO log of the full training configuration elided ...]

    Traceback (most recent call last):
      File "/home/mikeleatila/dont_stop_pretraining_master/scripts/train.py", line 142, in <module>
        main()
      File "/home/mikeleatila/dont_stop_pretraining_master/scripts/train.py", line 139, in main
        subprocess.run(" ".join(allennlp_command), shell=True, check=True)
      File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/subprocess.py", line 512, in run
        output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation-intent-dapt-dapt' returned non-zero exit status 1.

    Many thanks in advance!

    opened by mikeleatila 6
  • Error due to "AllenNLP" library.

    Hi, I was trying to run the command:

    python -m scripts.train \
            --config training_config/classifier.jsonnet \
            --serialization_dir model_logs/citation_intent_base \
            --hyperparameters ROBERTA_CLASSIFIER_SMALL \
            --dataset citation_intent \
            --model $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
            --device 0 \
            --perf +f1 \
            --evaluate_on_test
    

    Before running this command, I did the following two steps, in order:

         pip install pytorch-transformers
         pip install transformers
         pip install git+https://github.com/kernelmachine/allennlp.git@4ae123d2c3bfb1ea3ce7362cb6c5bca3d094ffa7
    
          python -m scripts.download_model \
            --model allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
            --serialization_dir $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
    

    After these two steps, when I run the scripts.train command, I get the error shown below.

    2020-07-29 10:13:11,360 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
    2020-07-29 10:13:12,114 - INFO - pytorch_transformers.modeling_bert - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
    2020-07-29 10:13:12,117 - INFO - pytorch_transformers.modeling_xlnet - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
    2020-07-29 10:13:12,709 - INFO - allennlp.common.params - random_seed = 278011
    2020-07-29 10:13:12,710 - INFO - allennlp.common.params - numpy_seed = 278011
    2020-07-29 10:13:12,710 - INFO - allennlp.common.params - pytorch_seed = 278011
    2020-07-29 10:13:12,780 - INFO - allennlp.common.checks - Pytorch version: 1.5.1+cu101
    2020-07-29 10:13:12,782 - INFO - allennlp.common.params - evaluate_on_test = True
    2020-07-29 10:13:12,782 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.dataset_readers.dataset_reader.DatasetReader'> from params {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': ['</s>'], 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': ['<s>'], 'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'} and extras set()
    2020-07-29 10:13:12,782 - INFO - allennlp.common.params - dataset_reader.type = text_classification_json_with_sampling
    2020-07-29 10:13:12,782 - INFO - allennlp.common.from_params - instantiating class <class 'dont_stop_pretraining.data.dataset_readers.text_classification_json_reader_with_sampling.TextClassificationJsonReaderWithSampling'> from params {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': ['</s>'], 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': ['<s>'], 'type': 'pretrained_transformer'}} and extras set()
    2020-07-29 10:13:12,783 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.token_indexer.TokenIndexer from params {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'} and extras set()
    2020-07-29 10:13:12,783 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.type = pretrained_transformer
    2020-07-29 10:13:12,783 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.pretrained_transformer_indexer.PretrainedTransformerIndexer from params {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688'} and extras set()
    2020-07-29 10:13:12,783 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.model_name = /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
    2020-07-29 10:13:12,783 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.do_lowercase = False
    2020-07-29 10:13:12,784 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.namespace = tags
    2020-07-29 10:13:12,784 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.token_min_padding_length = 0
    2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - Model name '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc). Assuming '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688' is a path or url to a directory containing tokenizer files.
    2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - Didn't find file /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688/vocab.txt. We won't load it.
    2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - Didn't find file /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688/added_tokens.json. We won't load it.
    2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - loading file None
    2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - loading file None
    2020-07-29 10:13:12,785 - INFO - pytorch_transformers.tokenization_utils - loading file /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688/special_tokens_map.json
    Traceback (most recent call last):
      File "/usr/local/bin/allennlp", line 8, in <module>
        sys.exit(run())
      File "/usr/local/lib/python3.6/dist-packages/allennlp/run.py", line 18, in run
        main(prog="allennlp")
      File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/__init__.py", line 120, in main
        args.func(args)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 150, in train_model_from_args
        args.cache_prefix,
      File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 199, in train_model_from_file
        cache_prefix,
      File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 257, in train_model
        params, serialization_dir, recover, cache_directory, cache_prefix
      File "/usr/local/lib/python3.6/dist-packages/allennlp/training/trainer_pieces.py", line 45, in from_params
        all_datasets = training_util.datasets_from_params(params, cache_directory, cache_prefix)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/training/util.py", line 169, in datasets_from_params
        dataset_reader = DatasetReader.from_params(dataset_reader_params)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 377, in from_params
        return subclass.from_params(params=params, **extras)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 398, in from_params
        kwargs = create_kwargs(cls, params, **extras)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 140, in create_kwargs
        kwargs[name] = construct_arg(cls, name, annotation, param.default, params, **extras)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 265, in construct_arg
        value_dict[key] = value_cls.from_params(params=value_params, **subextras)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 377, in from_params
        return subclass.from_params(params=params, **extras)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 400, in from_params
        return cls(**kwargs)  # type: ignore
      File "/usr/local/lib/python3.6/dist-packages/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 58, in __init__
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=do_lowercase)
      File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_auto.py", line 89, in from_pretrained
        return BertTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_bert.py", line 216, in from_pretrained
        return super(BertTokenizer, cls)._from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_utils.py", line 327, in _from_pretrained
        tokenizer = cls(*inputs, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_bert.py", line 128, in __init__
        if not os.path.isfile(vocab_file):
      File "/usr/lib/python3.6/genericpath.py", line 30, in isfile
        st = os.stat(path)
    TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
    Traceback (most recent call last):
      File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/content/dont-stop-pretraining/scripts/train.py", line 143, in <module>
        main()
      File "/content/dont-stop-pretraining/scripts/train.py", line 140, in main
        subprocess.run(" ".join(allennlp_command), shell=True, check=True)
      File "/usr/lib/python3.6/subprocess.py", line 438, in run
        output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation_intent_base' returned non-zero exit status 1.
    

    I am running the code in Google Colab. I will be very grateful if anyone can help me in understanding where I am going wrong. Thanks.

    opened by nandinib1999 6
  • Reproduce the result of Chemprot using RoBERTa

    Has anyone tried to reproduce the result on ChemProt using RoBERTa?

    I used the command provided, but I only got about half of the F-score reported in the paper.

    Command I used:

    python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/chemprot-ROBERTA_CLASSIFIER_BIG-202010271621 \
        --hyperparameters ROBERTA_CLASSIFIER_BIG \
        --dataset chemprot \
        --model roberta-base \
        --device 0 \
        --perf +f1 \
        --evaluate_on_test \
        --seed 0

    Result I got:

    2020-10-28 15:47:32,735 - INFO - allennlp.models.archival - archiving weights and vocabulary to model_logs/chemprot-ROBERTA_CLASSIFIER_BIG-202010271621/model.tar.gz
    2020-10-28 15:48:00,526 - INFO - allennlp.common.util - Metrics: {
      "best_epoch": 2,
      "peak_cpu_memory_MB": 4431.752,
      "peak_gpu_0_memory_MB": 13629,
      "peak_gpu_1_memory_MB": 10,
      "training_duration": "0:05:36.203710",
      "training_start_epoch": 0,
      "training_epochs": 2,
      "epoch": 2,
      "training_f1": 0.5388954075483176,
      "training_accuracy": 0.8424082513792276,
      "training_loss": 0.528517140297649,
      "training_cpu_memory_MB": 4431.752,
      "training_gpu_0_memory_MB": 13629,
      "training_gpu_1_memory_MB": 10,
      "validation_f1": 0.5084102337176983,
      "validation_accuracy": 0.8026370004120313,
      "validation_loss": 0.6763799888523001,
      "best_validation_f1": 0.5084102337176983,
      "best_validation_accuracy": 0.8026370004120313,
      "best_validation_loss": 0.6763799888523001,
      "test_f1": 0.4786599434625644,
      "test_accuracy": 0.7999423464975497,
      "test_loss": 0.679223679412495
    }

    The result reported in the paper: [screenshot of the paper's results table omitted]

    opened by zhutixiaojie0120 4
  • Are codes for pretraining available?

    It seems that this repository only contains the code for fine-tuning a pretrained RoBERTa. Is the code for pretraining available now? Could you possibly add some example commands for doing TAPT? Any advice or explanation would be highly appreciated. Thanks in advance!

    opened by PrettyMeng 3
  • IMDB train/dev split

    How did you split the IMDB dataset into train and dev parts (25,000 -> 20,000 + 5,000)? Is this some kind of standard split, or did you split randomly?

    opened by Hapiny 3
  • How long does it take for the training process?

    Hi, I am doing DAPT on the CS domain with 38 GB of CS data on a single TPU v3-8. It is estimated that one epoch will take 20-24 hours. I see from the paper that you also use a TPU v3-8, but I could not find timing information in the paper. Would you share how much time you needed for pretraining? Thanks!

    opened by shizhediao 2
  • About Datasets

    Hi. First of all, thank you for your great work on task adaptation!

    Since I want to do some research on task adaptation of language models, it would be great if I could use the datasets that you used.

    As far as I can see, the S3 links to the datasets are set to private, so other people cannot download them.

    Am I missing something, or is it actually possible to download the datasets from the given links?

    If not, do you have any plan to release the datasets you used to the public?

    I suppose it may be difficult, since some datasets have copyright restrictions...

    Thank you for reading my issues!

    opened by Nardien 2
  • How is News corpus filtered from RealNews dataset

    Hi @kernelmachine / @kyleclo, I'm wondering how the News corpus is filtered from the RealNews dataset? I tried to extract docs from the RealNews dataset, but got 32.80M docs instead of the 11.90M docs mentioned in the paper. Is there any additional filtering applied? Thanks!

    opened by stevezheng23 2
  • Pre-train commands: where is the `ADAPTIVE_PRETRAINING.md` file for DAPT/TAPT commands?

    Hi there, check the ADAPTIVE_PRETRAINING.md file for DAPT/TAPT commands

    Originally posted by @kernelmachine in https://github.com/allenai/dont-stop-pretraining/issues/10#issuecomment-668235314

    I cannot find the ADAPTIVE_PRETRAINING.md file. Thank you!

    opened by gghhoosstt 1
  • TAPT dataset

    I am trying to understand the method for TAPT. For chemprot, for example, are you using the same train set that is used for fine-tuning, just augmented by "randomly masking different tokens across epochs, using the masking probability of 0.15"? Or is there some other unlabeled dataset used for chemprot when doing TAPT, with the open-sourced labeled chemprot data only used for fine-tuning on the downstream task?

    opened by aabid0193 1
  • Does DAPT lead to forgetting over the original LM domain or overfitting over the target domain?

    Further DAPT was run on each domain for 12.5K steps with unlabeled data from the target domain only. I am wondering whether not adding unlabeled data from the original LM domain leads to detrimental forgetting or overfitting.

    opened by dr-GitHub-account 0
  • /bin/sh: 1: allennlp: not found ("Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation-intent-base' returned non-zero exit status 127")

    Hello! Why do I keep running into strange problems when I run this? It shows /bin/sh: 1: allennlp: not found, and "Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation-intent-base' returned non-zero exit status 127".

    opened by Shajiu 0
  • How to preprocess the data?

    Hi, after downloading the dataset, I want to know whether any post-processing is needed.

    These are the keys of each dataset [screenshot omitted]. In the ag dataset, should the text and headline be concatenated for classification?

    opened by Hannibal046 0
  • TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

    Hi,

    I am trying to train the biomed_roberta_base model on the chemprot dataset using the provided scripts.train Python command and encounter the issue below.

    [screenshot of the error traceback omitted]

    The above dataset and models have been downloaded as stated in the README of the master branch. Also, since the mentioned environment wasn't working for me, I am using the conda environment below.

    [screenshot of the conda environment omitted]

    Please let me know how to solve the above issue. It seems like the tokenizer asks for a vocab file, but I am not sure how to provide one.

    opened by akashgupta97 2
  • When doing domain-adaptive pretraining, it seems the vocabulary cannot be extended?

    After using my own corpus for domain-adaptive pretraining, the vocab.txt is the same size as that of the initialization model (BERT-base). In short, does domain-adaptive pretraining not extend the vocabulary to the new domain? That is, domain-specific vocabulary from the new domain still does not appear in the resulting vocab.txt. Is that right?

    opened by MrRace 0
  • Fail to reproduce the work

    Could you please check the implementation steps you provided in the README file?

    I followed your instructions but found it very hard to reproduce this work; errors kept coming up, such as a version inconsistency between allennlp and transformers, which then leads to an error like:

    subprocess.CalledProcessError: Command 'allennlp train training_config/classifier.jsonnet --include-package dont_stop_pretraining -s model_logs\citation_intent_base' returned non-zero exit status 1.

    Or are there perhaps some wrong steps in my implementation? It is really confusing and frustrating.

    opened by muyuhuatang 4