Overview

dont-stop-pretraining

Code associated with the Don't Stop Pretraining ACL 2020 paper

Citation

@inproceedings{dontstoppretraining2020,
 author = {Suchin Gururangan and Ana Marasović and Swabha Swayamdipta and Kyle Lo and Iz Beltagy and Doug Downey and Noah A. Smith},
 title = {Don't Stop Pretraining: Adapt Language Models to Domains and Tasks},
 year = {2020},
 booktitle = {Proceedings of ACL},
}

Installation

conda env create -f environment.yml
conda activate domains

Working with the latest allennlp version

This repository works with a pinned allennlp version for reproducibility purposes. The pinned version of allennlp relies on pytorch-transformers==1.2.0, which requires you to manually download custom transformer models to disk.

To run this code with the latest allennlp/transformers versions (and use the Hugging Face model hub to its full capacity), check out the latest-allennlp branch. Note that we haven't tested all models on that branch, so your results may vary from what we report in the paper.

If you'd like to use the pinned allennlp version, read on. Otherwise, check out latest-allennlp.

Available Pretrained Models

We've uploaded DAPT and TAPT models to huggingface.

DAPT models

Available DAPT models:

allenai/cs_roberta_base
allenai/biomed_roberta_base
allenai/reviews_roberta_base
allenai/news_roberta_base

TAPT models

Available TAPT models:

allenai/dsp_roberta_base_dapt_news_tapt_ag_115K
allenai/dsp_roberta_base_tapt_ag_115K
allenai/dsp_roberta_base_dapt_reviews_tapt_amazon_helpfulness_115K
allenai/dsp_roberta_base_tapt_amazon_helpfulness_115K
allenai/dsp_roberta_base_dapt_biomed_tapt_chemprot_4169
allenai/dsp_roberta_base_tapt_chemprot_4169
allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
allenai/dsp_roberta_base_tapt_citation_intent_1688
allenai/dsp_roberta_base_dapt_news_tapt_hyperpartisan_news_5015
allenai/dsp_roberta_base_dapt_news_tapt_hyperpartisan_news_515
allenai/dsp_roberta_base_tapt_hyperpartisan_news_5015
allenai/dsp_roberta_base_tapt_hyperpartisan_news_515
allenai/dsp_roberta_base_dapt_reviews_tapt_imdb_20000
allenai/dsp_roberta_base_dapt_reviews_tapt_imdb_70000
allenai/dsp_roberta_base_tapt_imdb_20000
allenai/dsp_roberta_base_tapt_imdb_70000
allenai/dsp_roberta_base_dapt_biomed_tapt_rct_180K
allenai/dsp_roberta_base_tapt_rct_180K
allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500
allenai/dsp_roberta_base_tapt_rct_500
allenai/dsp_roberta_base_dapt_cs_tapt_sciie_3219
allenai/dsp_roberta_base_tapt_sciie_3219

The final number in each model name above is the dataset size. Models with larger dataset sizes (e.g. imdb_70000 vs. imdb_20000) are curated TAPT models; these only exist for imdb, rct, and hyperpartisan_news.
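
If you just want to use one of these checkpoints outside of this repository, they also load directly with a recent transformers release. The snippet below is a minimal sketch (it is not tied to the pinned environment above, and the example sentence is arbitrary):

from transformers import AutoModel, AutoTokenizer

model_name = "allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688"

# Download the tokenizer and encoder weights from the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode one sentence and take the representation of the first (<s>) token.
inputs = tokenizer("We follow the approach of prior work.", return_tensors="pt")
outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])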

Downloading Pretrained models

You can download a pretrained model using the scripts/download_model.py script.

Just supply a model type and serialization directory, like so:

python -m scripts.download_model \
        --model allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
        --serialization_dir $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688

This will download the allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 model for the Citation Intent corpus into $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688.

Downloading data

All task data is available at a public S3 URL; check environments/datasets.py.

If you run the scripts/train.py command (see next step), the relevant dataset(s) will be downloaded automatically using the URLs in environments/datasets.py. However, if you'd like to download the data for use outside of this repository, you will have to curl each file individually, e.g. for chemprot:

curl -Lo train.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/train.jsonl
curl -Lo dev.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/dev.jsonl
curl -Lo test.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/test.jsonl
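
If you'd rather script the downloads, here is a minimal Python sketch that mirrors the curl calls above. It assumes the other datasets follow the same <dataset>/<split>.jsonl URL pattern as chemprot; environments/datasets.py remains the source of truth for the exact URLs.

import os
import urllib.request

# Assumption: datasets follow the same <dataset>/<split>.jsonl layout as the
# chemprot example above; check environments/datasets.py for the exact URLs.
BASE_URL = "https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data"

def download_dataset(name, out_dir="."):
    os.makedirs(out_dir, exist_ok=True)
    for split in ("train", "dev", "test"):
        url = f"{BASE_URL}/{name}/{split}.jsonl"
        dest = os.path.join(out_dir, f"{split}.jsonl")
        print(f"Downloading {url} -> {dest}")
        urllib.request.urlretrieve(url, dest)

download_dataset("chemprot", out_dir="data/chemprot")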

Example commands

Run basic RoBERTa model

The following command will train a RoBERTa classifier on the Citation Intent corpus. Check environments/datasets.py for other datasets you can pass to the --dataset flag.

python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/citation_intent_base \
        --hyperparameters ROBERTA_CLASSIFIER_SMALL \
        --dataset citation_intent \
        --model roberta-base \
        --device 0 \
        --perf +f1 \
        --evaluate_on_test

You can supply other downloaded models to this script by providing a path to the model:

python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/citation-intent-dapt-dapt \
        --hyperparameters ROBERTA_CLASSIFIER_SMALL \
        --dataset citation_intent \
        --model $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
        --device 0 \
        --perf +f1 \
        --evaluate_on_test

Perform hyperparameter search

First, install allentune: https://github.com/allenai/allentune

Modify search_space/classifier.jsonnet accordingly.

Then run:

allentune search \
            --experiment-name ag_search \
            --num-cpus 56 \
            --num-gpus 4 \
            --search-space search_space/classifier.jsonnet \
            --num-samples 100 \
            --base-config training_config/classifier.jsonnet  \
            --include-package dont_stop_pretraining

Modify --num-gpus and --num-samples accordingly.

Comments
  • How is CS and BioMed corpus filtered from S2ORC dataset

    Hi team, I'm wondering how the CS/BioMed corpora are filtered from the S2ORC dataset? I didn't find details on this in the original paper; could you shed some light on this? Thanks!

    question 
    opened by stevezheng23 24
  • allennlp.common.checks.ConfigurationError: Extra parameters passed to PretrainedTransformerIndexer: {'do_lowercase': False}

    Hi, I have set up the conda environment and ran the scripts, i.e. first running

    python -m scripts.download_model --model allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 --serialization_dir $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688

    and then

    python -m scripts.train --config training_config/classifier.jsonnet --serialization_dir model_logs/citation-intent-dapt-dapt --hyperparameters ROBERTA_CLASSIFIER_SMALL --dataset citation_intent --model $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 --device 0 --perf +f1 --evaluate_on_test

    but I am getting this error now:

    /home/mikeleatila/anaconda3/envs/domains/bin/python /home/mikeleatila/dont_stop_pretraining_master/scripts/train.py --config training_config/classifier.jsonnet --serialization_dir model_logs/citation-intent-dapt-dapt --hyperparameters ROBERTA_CLASSIFIER_SMALL --dataset citation_intent --model /home/mikeleatila/dont_stop_pretraining_master/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 --device 0 --perf +f1 --evaluate_on_test

    2022-11-24 09:59:10,204 - INFO - transformers.file_utils - PyTorch version 1.13.0 available.
    2022-11-24 09:59:10,816 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
    Traceback (most recent call last):
      File "/home/mikeleatila/anaconda3/envs/domains/bin/allennlp", line 8, in <module>
        sys.exit(run())
      [... allennlp from_params call stack elided ...]
      File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/site-packages/allennlp/common/params.py", line 421, in assert_empty
        "Extra parameters passed to {}: {}".format(class_name, self.params)
    allennlp.common.checks.ConfigurationError: Extra parameters passed to PretrainedTransformerIndexer: {'do_lowercase': False}

    [... allennlp.common.params INFO log of the full training configuration elided ...]

    Traceback (most recent call last):
      File "/home/mikeleatila/dont_stop_pretraining_master/scripts/train.py", line 142, in <module>
        main()
      File "/home/mikeleatila/dont_stop_pretraining_master/scripts/train.py", line 139, in main
        subprocess.run(" ".join(allennlp_command), shell=True, check=True)
      File "/home/mikeleatila/anaconda3/envs/domains/lib/python3.7/subprocess.py", line 512, in run
        output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation-intent-dapt-dapt' returned non-zero exit status 1.

    Many thanks in advance!

    opened by mikeleatila 6
  • Error due to "AllenNLP" library.

    Hi, I was trying to run the command:

    python -m scripts.train \
            --config training_config/classifier.jsonnet \
            --serialization_dir model_logs/citation_intent_base \
            --hyperparameters ROBERTA_CLASSIFIER_SMALL \
            --dataset citation_intent \
            --model $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
            --device 0 \
            --perf +f1 \
            --evaluate_on_test
    

    Before running this command, I did the following two steps, in order:

         pip install pytorch-transformers
         pip install transformers
         pip install git+https://github.com/kernelmachine/allennlp.git@4ae123d2c3bfb1ea3ce7362cb6c5bca3d094ffa7
    
          python -m scripts.download_model \
            --model allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
            --serialization_dir $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
    

    After these two steps, when I run the scripts.train command, I get the error shown below.

    2020-07-29 10:13:11,360 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
    2020-07-29 10:13:12,114 - INFO - pytorch_transformers.modeling_bert - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
    2020-07-29 10:13:12,117 - INFO - pytorch_transformers.modeling_xlnet - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
    2020-07-29 10:13:12,709 - INFO - allennlp.common.params - random_seed = 278011
    2020-07-29 10:13:12,710 - INFO - allennlp.common.params - numpy_seed = 278011
    2020-07-29 10:13:12,710 - INFO - allennlp.common.params - pytorch_seed = 278011
    2020-07-29 10:13:12,780 - INFO - allennlp.common.checks - Pytorch version: 1.5.1+cu101
    2020-07-29 10:13:12,782 - INFO - allennlp.common.params - evaluate_on_test = True
    2020-07-29 10:13:12,782 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.dataset_readers.dataset_reader.DatasetReader'> from params {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': ['</s>'], 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': ['<s>'], 'type': 'pretrained_transformer'}, 'type': 'text_classification_json_with_sampling'} and extras set()
    2020-07-29 10:13:12,782 - INFO - allennlp.common.params - dataset_reader.type = text_classification_json_with_sampling
    2020-07-29 10:13:12,782 - INFO - allennlp.common.from_params - instantiating class <class 'dont_stop_pretraining.data.dataset_readers.text_classification_json_reader_with_sampling.TextClassificationJsonReaderWithSampling'> from params {'lazy': False, 'max_sequence_length': 512, 'token_indexers': {'roberta': {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'}}, 'tokenizer': {'do_lowercase': False, 'end_tokens': ['</s>'], 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'start_tokens': ['<s>'], 'type': 'pretrained_transformer'}} and extras set()
    2020-07-29 10:13:12,783 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.token_indexer.TokenIndexer from params {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688', 'type': 'pretrained_transformer'} and extras set()
    2020-07-29 10:13:12,783 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.type = pretrained_transformer
    2020-07-29 10:13:12,783 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.pretrained_transformer_indexer.PretrainedTransformerIndexer from params {'do_lowercase': False, 'model_name': '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688'} and extras set()
    2020-07-29 10:13:12,783 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.model_name = /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
    2020-07-29 10:13:12,783 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.do_lowercase = False
    2020-07-29 10:13:12,784 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.namespace = tags
    2020-07-29 10:13:12,784 - INFO - allennlp.common.params - dataset_reader.token_indexers.roberta.token_min_padding_length = 0
    2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - Model name '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc). Assuming '/content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688' is a path or url to a directory containing tokenizer files.
    2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - Didn't find file /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688/vocab.txt. We won't load it.
    2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - Didn't find file /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688/added_tokens.json. We won't load it.
    2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - loading file None
    2020-07-29 10:13:12,784 - INFO - pytorch_transformers.tokenization_utils - loading file None
    2020-07-29 10:13:12,785 - INFO - pytorch_transformers.tokenization_utils - loading file /content/dont-stop-pretraining/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688/special_tokens_map.json
    Traceback (most recent call last):
      File "/usr/local/bin/allennlp", line 8, in <module>
        sys.exit(run())
      File "/usr/local/lib/python3.6/dist-packages/allennlp/run.py", line 18, in run
        main(prog="allennlp")
      File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/__init__.py", line 120, in main
        args.func(args)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 150, in train_model_from_args
        args.cache_prefix,
      File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 199, in train_model_from_file
        cache_prefix,
      File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 257, in train_model
        params, serialization_dir, recover, cache_directory, cache_prefix
      File "/usr/local/lib/python3.6/dist-packages/allennlp/training/trainer_pieces.py", line 45, in from_params
        all_datasets = training_util.datasets_from_params(params, cache_directory, cache_prefix)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/training/util.py", line 169, in datasets_from_params
        dataset_reader = DatasetReader.from_params(dataset_reader_params)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 377, in from_params
        return subclass.from_params(params=params, **extras)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 398, in from_params
        kwargs = create_kwargs(cls, params, **extras)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 140, in create_kwargs
        kwargs[name] = construct_arg(cls, name, annotation, param.default, params, **extras)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 265, in construct_arg
        value_dict[key] = value_cls.from_params(params=value_params, **subextras)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 377, in from_params
        return subclass.from_params(params=params, **extras)
      File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 400, in from_params
        return cls(**kwargs)  # type: ignore
      File "/usr/local/lib/python3.6/dist-packages/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 58, in __init__
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=do_lowercase)
      File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_auto.py", line 89, in from_pretrained
        return BertTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_bert.py", line 216, in from_pretrained
        return super(BertTokenizer, cls)._from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_utils.py", line 327, in _from_pretrained
        tokenizer = cls(*inputs, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/pytorch_transformers/tokenization_bert.py", line 128, in __init__
        if not os.path.isfile(vocab_file):
      File "/usr/lib/python3.6/genericpath.py", line 30, in isfile
        st = os.stat(path)
    TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
    Traceback (most recent call last):
      File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/content/dont-stop-pretraining/scripts/train.py", line 143, in <module>
        main()
      File "/content/dont-stop-pretraining/scripts/train.py", line 140, in main
        subprocess.run(" ".join(allennlp_command), shell=True, check=True)
      File "/usr/lib/python3.6/subprocess.py", line 438, in run
        output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation_intent_base' returned non-zero exit status 1.
    

    I am running the code in Google Colab. I will be very grateful if anyone can help me in understanding where I am going wrong. Thanks.

    opened by nandinib1999 6
  • Reproduce the result of Chemprot using RoBERTa

    Has anyone tried to reproduce the result on ChemProt using RoBERTa?

    I used the command provided, but I only got about half of the F-score reported in the paper.

    Command I used:

    python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/chemprot-ROBERTA_CLASSIFIER_BIG-202010271621 \
        --hyperparameters ROBERTA_CLASSIFIER_BIG \
        --dataset chemprot \
        --model roberta-base \
        --device 0 \
        --perf +f1 \
        --evaluate_on_test \
        --seed 0

    Result I got:

    2020-10-28 15:47:32,735 - INFO - allennlp.models.archival - archiving weights and vocabulary to model_logs/chemprot-ROBERTA_CLASSIFIER_BIG-202010271621/model.tar.gz
    2020-10-28 15:48:00,526 - INFO - allennlp.common.util - Metrics: {
      "best_epoch": 2,
      "peak_cpu_memory_MB": 4431.752,
      "peak_gpu_0_memory_MB": 13629,
      "peak_gpu_1_memory_MB": 10,
      "training_duration": "0:05:36.203710",
      "training_start_epoch": 0,
      "training_epochs": 2,
      "epoch": 2,
      "training_f1": 0.5388954075483176,
      "training_accuracy": 0.8424082513792276,
      "training_loss": 0.528517140297649,
      "training_cpu_memory_MB": 4431.752,
      "training_gpu_0_memory_MB": 13629,
      "training_gpu_1_memory_MB": 10,
      "validation_f1": 0.5084102337176983,
      "validation_accuracy": 0.8026370004120313,
      "validation_loss": 0.6763799888523001,
      "best_validation_f1": 0.5084102337176983,
      "best_validation_accuracy": 0.8026370004120313,
      "best_validation_loss": 0.6763799888523001,
      "test_f1": 0.4786599434625644,
      "test_accuracy": 0.7999423464975497,
      "test_loss": 0.679223679412495
    }

    The result reported in the paper: [screenshot of the paper's results table omitted]

    opened by zhutixiaojie0120 4
  • Are codes for pretraining available?

    It seems that this repository only contains the code for fine-tuning a pretrained RoBERTa. Is the code for pretraining available now? Could you possibly add some example commands for doing TAPT? Any advice or explanation would be highly appreciated. Thanks in advance!

    opened by PrettyMeng 3
  • IMDB train/dev split

    How did you split the IMDB dataset into train and dev parts (25,000 -> 20,000 + 5,000)? Is this some kind of standard split, or did you split randomly?

    opened by Hapiny 3
  • How long does it take for the training process?

    Hi, I am doing DAPT on the CS domain with 38 GB of CS data on a single TPU v3-8. It is estimated that one epoch will take 20-24 hours. I see from the paper that you also use a TPU v3-8, but I could not find timing information in the paper. Would you share how much time you needed for pretraining? Thanks!

    opened by shizhediao 2
  • About Datasets

    Hi. First of all, thank you for your great work on task adaptation!

    Since I want to do some research on task adaptation of language models, it would be great if I could use the datasets that you used.

    As far as I can see, the S3 links to the datasets are set to private, so other people cannot download them.

    Am I missing something, or is it actually possible to download the datasets from the given links?

    If not, do you have any plan to release the datasets you used to the public?

    I suppose it may be difficult, since some datasets have copyright restrictions...

    Thank you for reading my issues!

    opened by Nardien 2
  • How is News corpus filtered from RealNews dataset

    Hi @kernelmachine / @kyleclo, I'm wondering how the News corpus is filtered from the RealNews dataset? I tried to extract docs from the RealNews dataset, but got 32.80M docs instead of the 11.90M docs mentioned in the paper. Is there any additional filtering applied? Thanks!

    opened by stevezheng23 2
  • Pre-train commands: where is the `ADAPTIVE_PRETRAINING.md` file for DAPT/TAPT commands?

    Hi there, check the ADAPTIVE_PRETRAINING.md file for DAPT/TAPT commands

    Originally posted by @kernelmachine in https://github.com/allenai/dont-stop-pretraining/issues/10#issuecomment-668235314

    I cannot find the ADAPTIVE_PRETRAINING.md file. Thank you!

    opened by gghhoosstt 1
  • TAPT dataset

    I am trying to understand the method for TAPT. For chemprot, for example, are you using the same train set that is used for fine-tuning, just augmented by "randomly masking different tokens across epochs, using the masking probability of 0.15"? Or is there some other unlabeled dataset used for chemprot when doing TAPT, with the open-sourced labeled chemprot data only used for fine-tuning on the downstream task?

    opened by aabid0193 1
  • Does DAPT lead to forgetting over the original LM domain or overfitting over the target domain?

    Further DAPT was run on each domain for 12.5K steps with unlabeled data from the target domain only. I am wondering whether not adding unlabeled data from the original LM domain leads to detrimental forgetting or overfitting.

    opened by dr-GitHub-account 0
  • /bin/sh: 1: allennlp: not found ("Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation-intent-base' returned non-zero exit status 127")

    Hello! Why do I keep running into strange problems when I run this? It shows /bin/sh: 1: allennlp: not found, and "Command 'allennlp train --include-package dont_stop_pretraining training_config/classifier.jsonnet -s model_logs/citation-intent-base' returned non-zero exit status 127".

    opened by Shajiu 0
  • How to preprocess the data?

    Hi, after downloading the dataset, I want to know whether any post-processing is needed.

    These are the keys of each dataset [screenshot omitted]. In the ag dataset, should the text and headline be concatenated for classification?

    opened by Hannibal046 0
  • TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

    Hi,

    I am trying to train the biomed_roberta_base model on the chemprot dataset using the provided scripts.train Python command and encounter the issue below.

    [screenshot of the error traceback omitted]

    The above dataset and models have been downloaded as stated in the README of the master branch. Also, since the mentioned environment wasn't working for me, I am using the conda environment below.

    [screenshot of the conda environment omitted]

    Please let me know how to solve the above issue. It seems like the tokenizer asks for a vocab file, but I am not sure how to provide one.

    opened by akashgupta97 2
  • When doing domain-adaptive pretraining, it seems the vocabulary cannot be extended?

    After using my own corpus for domain-adaptive pretraining, the vocab.txt is the same size as that of the initialization model (BERT-base). In short, does domain-adaptive pretraining not extend the vocabulary to the new domain? That is, domain-specific vocabulary from the new domain still does not appear in the resulting vocab.txt. Is that right?

    opened by MrRace 0
  • Fail to reproduce the work

    Could you please check the implementation steps you provided in the README file?

    I followed your instructions but found it very hard to reproduce this work; errors kept coming up, such as a version inconsistency between allennlp and transformers, which then leads to an error like:

    subprocess.CalledProcessError: Command 'allennlp train training_config/classifier.jsonnet --include-package dont_stop_pretraining -s model_logs\citation_intent_base' returned non-zero exit status 1.

    Or are there perhaps some wrong steps in my implementation? It is really confusing and frustrating.

    opened by muyuhuatang 4