Splinter

Overview

This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection", published at ACL 2021.

Our pretraining code is based on TensorFlow (tested with 1.15), while fine-tuning is based on PyTorch (1.7.1) and Transformers (2.9.0). Note that each has its own requirements file: pretraining/requirements.txt and finetuning/requirements.txt.

Data

Downloading Few-Shot MRQA Splits

curl -L https://www.dropbox.com/sh/pfg8j6yfpjltwdx/AAC8Oky0w8ZS-S3S5zSSAuQma?dl=1 > mrqa-few-shot.zip
unzip mrqa-few-shot.zip -d mrqa-few-shot

Pretrained Model

Command for downloading Splinter
curl -L https://www.dropbox.com/sh/h63xx2l2fjq8bsz/AAC5_Z_F2zBkJgX87i3IlvGca?dl=1 > splinter.zip
unzip splinter.zip -d splinter 

Pretraining

Create a virtual environment and execute

cd pretraining
pip install -r requirements.txt  # or requirements-gpu.txt for a GPU version

Then download the raw data (our pretraining was based on Wikipedia and BookCorpus). We support two data formats:

  • For Wikipedia, a <doc> tag starts a new article and a </doc> tag ends it.
  • For BookCorpus, we process an already-tokenized file where tokens are separated by whitespace; a newline starts a new book. Both formats are illustrated in the sketch below.
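
To make the two formats concrete, here is a minimal sketch (ours, not part of the repo; file and directory names are placeholders) that writes one toy input file per format:

import os

os.makedirs("raw_data", exist_ok=True)

# Wikipedia format: a <doc> tag opens an article, </doc> closes it.
with open("raw_data/wiki_sample.txt", "w") as f:
    f.write('<doc id="1" title="Example Article">\n')
    f.write("Some article text ...\n")
    f.write("</doc>\n")

# BookCorpus format: pre-tokenized text, tokens separated by whitespace,
# with a newline starting a new book.
with open("raw_data/books_sample.txt", "w") as f:
    f.write("first book tokens , separated by whitespace .\n")
    f.write("second book starts on a new line .\n")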
Command for creating the pretraining data

This command takes as input a set of files ($INPUT_PATTERN) and creates a tensorized dataset for pretraining. It supports several masking schemes; Splinter uses recurring span selection.

Command for creating the data for Splinter (recurring span selection)
cd pretraining
python create_pretraining_data.py \
    --input_file=$INPUT_PATTERN \
    --output_dir=$OUTPUT_DIR \
    --vocab_file=vocabs/bert-cased-vocab.txt \
    --do_lower_case=False \
    --do_whole_word_mask=False \
    --max_seq_length=512 \
    --num_processes=63 \
    --dupe_factor=5 \
    --max_span_length=10 \
    --recurring_span_selection=True \
    --only_recurring_span_selection=True \
    --max_questions_per_seq=30

N-gram statistics are written to ngrams.txt in the output directory.

Command for pretraining Splinter
cd pretraining
python run_pretraining.py \
    --bert_config_file=configs/bert-base-cased-config.json \
    --input_file=$INPUT_FILE \
    --output_dir=$OUTPUT_DIR \
    --max_seq_length=512 \
    --recurring_span_selection=True \
    --only_recurring_span_selection=True \
    --max_questions_per_seq=30 \
    --do_train \
    --train_batch_size=256 \
    --learning_rate=1e-4 \
    --num_train_steps=2400000 \
    --num_warmup_steps=10000 \
    --save_checkpoints_steps=10000 \
    --keep_checkpoint_max=240 \
    --use_tpu \
    --num_tpu_cores=8 \
    --tpu_name=$TPU_NAME

Pretraining can also be run on GPUs by dropping the --use_tpu flag (although it was tested mainly on TPUs).

Convert TensorFlow Model to PyTorch

In order to fine-tune the TF model you pretrained with run_pretraining.py, you will first need to convert it to PyTorch:

cd model_conversion
pip install -r requirements.txt
python convert_tf_to_pytorch.py --tf_checkpoint_path $TF_MODEL_PATH --pytorch_dump_path $OUTPUT_PATH
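
As a quick sanity check, the converted checkpoint should load with the Transformers version pinned for fine-tuning. A minimal sketch (ours), assuming the dump directory at $OUTPUT_PATH also contains a config.json and the vocab file:

import os
from transformers import BertConfig, BertModel

output_path = os.environ["OUTPUT_PATH"]  # same path as --pytorch_dump_path above
config = BertConfig.from_pretrained(output_path)
model = BertModel.from_pretrained(output_path, config=config)
print(model.config.hidden_size)  # 768 for the bert-base-cased configuration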

Fine-tuning

Fine-tuning has different requirements than pretraining, as it uses HuggingFace's Transformers library. Create a virtual environment and execute

cd finetuning
pip install -r requirements.txt

Please note: If you want to reproduce results from the paper, or to run with a QASS head in general, questions need to be augmented with a [QUESTION] token. In order to do so, please run

cd finetuning
python qass_preprocess.py --path "../mrqa-few-shot/*/*.jsonl"

This will add a [MASK] token to each question in the training data, which will later be replaced by a [QUESTION] token automatically by the QASS layer implementation.
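
For intuition, here is a minimal re-implementation sketch (ours; the real logic lives in finetuning/qass_preprocess.py and handles more details, e.g. the tokenized question fields), assuming each non-header line of an MRQA jsonl file is a JSON object with a qas list:

import glob
import json

for path in glob.glob("../mrqa-few-shot/*/*.jsonl"):
    if path.endswith("_qass.jsonl"):
        continue  # already preprocessed
    with open(path) as f_in, open(path.replace(".jsonl", "_qass.jsonl"), "w") as f_out:
        for line in f_in:
            example = json.loads(line)
            if "header" in example:  # the first line of MRQA files is a header
                f_out.write(line)
                continue
            for qa in example.get("qas", []):
                # Append the [MASK] placeholder; the QASS layer later swaps
                # it for the [QUESTION] token.
                qa["question"] = qa["question"] + " [MASK]"
            f_out.write(json.dumps(example) + "\n")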

Then fine-tune Splinter by

cd finetuning
export MODEL="../splinter"
export OUTPUT_DIR="output"
python run_mrqa.py \
    --model_type=bert \
    --model_name_or_path=$MODEL \
    --qass_head=True \
    --tokenizer_name=$MODEL \
    --output_dir=$OUTPUT_DIR \
    --train_file="../mrqa-few-shot/squad/squad-train-seed-42-num-examples-16_qass.jsonl" \
    --predict_file="../mrqa-few-shot/squad/dev_qass.jsonl" \
    --do_train \
    --do_eval \
    --max_seq_length=384 \
    --doc_stride=128 \
    --threads=4 \
    --save_steps=50000 \
    --per_gpu_train_batch_size=12 \
    --per_gpu_eval_batch_size=16 \
    --learning_rate=3e-5 \
    --max_answer_length=10 \
    --warmup_ratio=0.1 \
    --min_steps=200 \
    --num_train_epochs=10 \
    --seed=42 \
    --use_cache=False \
    --evaluate_every_epoch=False 

In order to train with automatic mixed precision, install apex and add the --fp16 flag.

See an example script for fine-tuning SpanBERT (rather than Splinter) here.

Citation

If you find this work helpful, please cite our paper:

@inproceedings{ram-etal-2021-shot,
    title = "Few-Shot Question Answering by Pretraining Span Selection",
    author = "Ram, Ori  and
      Kirstain, Yuval  and
      Berant, Jonathan  and
      Globerson, Amir  and
      Levy, Omer",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.239",
    pages = "3066--3079",
}

Acknowledgements

We would like to thank the European Research Council (ERC) for funding the project, and Google's TPU Research Cloud (TRC) for their support in providing TPUs.

Comments
  • Thank You & Reproducing Baselines

    Thank you very much for posting this code! It is extremely helpful in reproducing the results.

    I wanted to inquire whether you can share details about reproducing the baselines provided in the paper, as we've been having some trouble reproducing those numbers, specifically for the roberta-base vanilla experiment.

    See here for more details. Thanks!

    opened by ednussi 9
  • 128 samples in TextbookQA

    Hello, I have a problem when evaluating on TextbookQA: the results are frustrating. I use the Splinter model from your Dropbox. Here are my eval_result and finetune_splinter.sh.

    Final Eval:
    exact = 23.750
    **f1 = 29.341**
    total = 400
    HasAns_exact = 23.750
    HasAns_f1 = 29.341
    HasAns_total = 400
    best_exact = 23.750
    best_exact_thresh = 0.000
    best_f1 = 29.341
    best_f1_thresh = 0.000
    
    export MODEL="../../splinter"
    export dataset="textbookqa"
    export OUTPUT_DIR="../../output_model/${dataset}_128"
    python ../run_mrqa.py \
        --model_type=bert \
        --model_name_or_path=$MODEL \
        --qass_head=True \
        --tokenizer_name=$MODEL \
        --output_dir=$OUTPUT_DIR \
        --train_file="../../mrqa-few-shot/${dataset}/${dataset}-train-seed-42-num-examples-128.jsonl" \
        --predict_file="../../mrqa-few-shot/${dataset}/dev.jsonl" \
        --do_train \
        --do_eval \
        --max_seq_length=512 \
        --doc_stride=128 \
        --threads=4 \
        --save_steps=50000 \
        --per_gpu_train_batch_size=16 \
        --per_gpu_eval_batch_size=16 \
        --learning_rate=3e-5 \
        --max_answer_length=10 \
        --warmup_ratio=0.1 \
        --min_steps=200 \
        --num_train_epochs=32 \
        --seed=42 \
        --use_cache=False \
        --evaluate_every_epoch=True  \
        --overwrite_output_dir
    
    opened by yinzhangyue 4
  • Why have [PAD] tokens in the masked spans?

    Hi, I was wondering: what's the rationale for having [PAD] tokens in masked spans of length more than one, instead of just removing the remaining tokens? Here:

    https://github.com/oriram/splinter/blob/1df4c13d5b05f7d1374b1ac1ea49ab238431e855/pretraining/masking.py#L316-L323

    Is the reason just computational efficiency?

    opened by bminixhofer 3
  • Loading checkpoint

    Hi, awesome work!

    I was wondering if there's any possibility of loading a checkpoint (for both Splinter and SpanBERT)? Would this be the init_checkpoint flag?

    Thanks!

    opened by jjzha 3
  • question_token_id=104

    Hi, I have a question from reading the code: question_token_id=104, but 104 is [MASK] in BERT's vocab.txt, so is it still the [MASK] token? This code is in finetuning/modeling.py:

    class ModelWithQASSHead(BertPreTrainedModel):
        def __init__(
            self,
            config,
            replace_mask_with_question_token=False,
            mask_id=103,
            question_token_id=104,
            sep_id=102,
            initialize_new_qass=True,
        ):

    opened by yinzhangyue 2
  • [MASK] is appended to the end of questions instead of [QUESTION] when finetuning

    Hi,

    I found that in qass_preprocess.py you append [MASK] to the end of questions instead of [QUESTION]: https://github.com/oriram/splinter/blob/55866bb87829ee5d0f5981667af51acda95e00cb/finetuning/qass_preprocess.py#L8 https://github.com/oriram/splinter/blob/55866bb87829ee5d0f5981667af51acda95e00cb/finetuning/qass_preprocess.py#L18-L26

    Is there a difference between the implementation and the paper, or is this a bug?

    opened by Liangtaiwan 2
  • Reproduce Tokens in jsonl examples for finetuning

    I've been unsuccessful so far in identifying the right procedure for turning a SQuAD example into the internal .jsonl format the examples are saved in for the fine-tuning task.

    A standard example has the fields: title, paragraphs, qas, id, question, answers, answer_start, text. For instance:

    {"answers": [{"answer_start": 177, "text": "Denver Broncos"}, {"answer_start": 177, "text": "Denver Broncos"}, {"answer_start": 177, "text": "Denver Broncos"}], "question": "Which NFL team represented the AFC at Super Bowl 50?", "id": "56be4db0acb8001400a502ec"}

    It is not exactly clear to me how to parse it to get back the tokens, e.g. question_tokens, context_tokens and token_spans (for the answers).

    When examining the stored examples, in squad-train-seed-42-num-examples-16.jsonl for example, it seems these tokens indicate locations in the text, produced by some internal parsing logic. Could you please help me identify how, given text, to generate these tokens? My end goal is to be able to feed in new questions and answers and store them in the same jsonl format.

    Specifically, attaching below how the example above is stored within the jsonl file.

    {"id": "", "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.", "qas": [{"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "Which NFL team represented the AFC at Super Bowl 50?", "id": "56be4db0acb8001400a502ec", "qid": "b0626b3af0764c80b1e6f22c114982c1", "question_tokens": [["Which", 0], ["NFL", 6], ["team", 10], ["represented", 15], ["the", 27], ["AFC", 31], ["at", 35], ["Super", 38], ["Bowl", 44], ["50", 49], ["?", 51]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}, {"answers": ["Carolina Panthers", "Carolina Panthers", "Carolina Panthers"], "question": "Which NFL team represented the NFC at Super Bowl 50?", "id": "56be4db0acb8001400a502ed", "qid": "8d96e9feff464a52a15e192b1dc9ed01", "question_tokens": [["Which", 0], ["NFL", 6], ["team", 10], ["represented", 15], ["the", 27], ["NFC", 31], ["at", 35], ["Super", 38], ["Bowl", 44], ["50", 49], ["?", 51]], "detected_answers": [{"text": "Carolina Panthers", "char_spans": [[249, 265]], "token_spans": [[44, 45]]}, {"text": "Carolina Panthers", "char_spans": [[249, 265]], "token_spans": [[44, 45]]}, {"text": "Carolina Panthers", "char_spans": [[249, 265]], "token_spans": [[44, 45]]}]}, {"answers": ["Santa Clara, California", "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], "question": "Where did Super Bowl 50 take place?", "id": "56be4db0acb8001400a502ee", "qid": "190fdfbc068243a7a04eb3ed59808db8", "question_tokens": [["Where", 0], ["did", 6], ["Super", 10], ["Bowl", 16], ["50", 21], ["take", 24], ["place", 29], ["?", 34]], "detected_answers": [{"text": "Santa Clara, California", "char_spans": [[403, 425]], "token_spans": [[76, 79]]}, {"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}, {"text": "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.", "char_spans": [[355, 426]], "token_spans": [[66, 80]]}]}, {"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "Which NFL team won Super Bowl 50?", "id": "56be4db0acb8001400a502ef", "qid": "e8d4a7478ed5439fa55c2660267bcaa1", "question_tokens": [["Which", 0], ["NFL", 6], ["team", 10], ["won", 15], ["Super", 19], ["Bowl", 25], ["50", 30], ["?", 32]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}, {"answers": ["gold", "gold", "gold"], "question": "What color was used to 
emphasize the 50th anniversary of the Super Bowl?", "id": "56be4db0acb8001400a502f0", "qid": "74019130542f49e184d733607e565a68", "question_tokens": [["What", 0], ["color", 5], ["was", 11], ["used", 15], ["to", 20], ["emphasize", 23], ["the", 33], ["50th", 37], ["anniversary", 42], ["of", 54], ["the", 57], ["Super", 61], ["Bowl", 67], ["?", 71]], "detected_answers": [{"text": "gold", "char_spans": [[521, 524]], "token_spans": [[99, 99]]}]}, {"answers": ["\"golden anniversary\"", "gold-themed", "\"golden anniversary"], "question": "What was the theme of Super Bowl 50?", "id": "56be8e613aeaaa14008c90d1", "qid": "3729174743f74ed58aa64cb7c7dbc7b3", "question_tokens": [["What", 0], ["was", 5], ["the", 9], ["theme", 13], ["of", 19], ["Super", 22], ["Bowl", 28], ["50", 33], ["?", 35]], "detected_answers": [{"text": "\"golden anniversary\"", "char_spans": [[487, 506]], "token_spans": [[93, 96]]}, {"text": "gold-themed", "char_spans": [[521, 531]], "token_spans": [[99, 101]]}, {"text": "\"golden anniversary", "char_spans": [[487, 505]], "token_spans": [[93, 95]]}]}, {"answers": ["February 7, 2016", "February 7", "February 7, 2016"], "question": "What day was the game played on?", "id": "56be8e613aeaaa14008c90d2", "qid": "cc75a31d588842848d9890cafe092dec", "question_tokens": [["What", 0], ["day", 5], ["was", 9], ["the", 13], ["game", 17], ["played", 22], ["on", 29], ["?", 31]], "detected_answers": [{"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}, {"text": "February 7", "char_spans": [[334, 343]], "token_spans": [[60, 61]]}, {"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}]}, {"answers": ["American Football Conference", "American Football Conference", "American Football Conference"], "question": "What is the AFC short for?", "id": "56be8e613aeaaa14008c90d3", "qid": "7c1424bfa53a4de28c3ec91adfbfe4ab", "question_tokens": [["What", 0], ["is", 5], ["the", 8], ["AFC", 12], ["short", 16], ["for", 22], ["?", 25]], "detected_answers": [{"text": "American Football Conference", "char_spans": [[133, 160]], "token_spans": [[26, 28]]}, {"text": "American Football Conference", "char_spans": [[133, 160]], "token_spans": [[26, 28]]}, {"text": "American Football Conference", "char_spans": [[133, 160]], "token_spans": [[26, 28]]}]}, {"answers": ["\"golden anniversary\"", "gold-themed", "gold"], "question": "What was the theme of Super Bowl 50?", "id": "56bea9923aeaaa14008c91b9", "qid": "78a00c316d9e40e69711a9b5c7a932a0", "question_tokens": [["What", 0], ["was", 5], ["the", 9], ["theme", 13], ["of", 19], ["Super", 22], ["Bowl", 28], ["50", 33], ["?", 35]], "detected_answers": [{"text": "\"golden anniversary\"", "char_spans": [[487, 506]], "token_spans": [[93, 96]]}, {"text": "gold-themed", "char_spans": [[521, 531]], "token_spans": [[99, 101]]}, {"text": "gold", "char_spans": [[521, 524]], "token_spans": [[99, 99]]}]}, {"answers": ["American Football Conference", "American Football Conference", "American Football Conference"], "question": "What does AFC stand for?", "id": "56bea9923aeaaa14008c91ba", "qid": "1ef03938ae3848798b701dd4dbb30bd9", "question_tokens": [["What", 0], ["does", 5], ["AFC", 10], ["stand", 14], ["for", 20], ["?", 23]], "detected_answers": [{"text": "American Football Conference", "char_spans": [[133, 160]], "token_spans": [[26, 28]]}, {"text": "American Football Conference", "char_spans": [[133, 160]], "token_spans": [[26, 28]]}, {"text": "American Football Conference", "char_spans": [[133, 160]], "token_spans": [[26, 28]]}]}, 
{"answers": ["February 7, 2016", "February 7", "February 7, 2016"], "question": "What day was the Super Bowl played on?", "id": "56bea9923aeaaa14008c91bb", "qid": "cfd440704eee420b9fdf92725a6cdb64", "question_tokens": [["What", 0], ["day", 5], ["was", 9], ["the", 13], ["Super", 17], ["Bowl", 23], ["played", 28], ["on", 35], ["?", 37]], "detected_answers": [{"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}, {"text": "February 7", "char_spans": [[334, 343]], "token_spans": [[60, 61]]}, {"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}]}, {"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "Who won Super Bowl 50?", "id": "56beace93aeaaa14008c91df", "qid": "ca4749d3d0204f418fbfbaa52a1d9ece", "question_tokens": [["Who", 0], ["won", 4], ["Super", 8], ["Bowl", 14], ["50", 19], ["?", 21]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}, {"answers": ["Levi's Stadium", "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara"], "question": "What venue did Super Bowl 50 take place in?", "id": "56beace93aeaaa14008c91e0", "qid": "c2c7e5d3fb87437c80d863d91f8a4e21", "question_tokens": [["What", 0], ["venue", 5], ["did", 11], ["Super", 15], ["Bowl", 21], ["50", 26], ["take", 29], ["place", 34], ["in", 40], ["?", 42]], "detected_answers": [{"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}, {"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}, {"text": "Levi's Stadium in the San Francisco Bay Area at Santa Clara", "char_spans": [[355, 413]], "token_spans": [[66, 77]]}]}, {"answers": ["Santa Clara", "Santa Clara", "Santa Clara"], "question": "What city did Super Bowl 50 take place in?", "id": "56beace93aeaaa14008c91e1", "qid": "643b4c1ef1644d18bf6866d95f24f900", "question_tokens": [["What", 0], ["city", 5], ["did", 10], ["Super", 14], ["Bowl", 20], ["50", 25], ["take", 28], ["place", 33], ["in", 39], ["?", 41]], "detected_answers": [{"text": "Santa Clara", "char_spans": [[403, 413]], "token_spans": [[76, 77]]}, {"text": "Santa Clara", "char_spans": [[403, 413]], "token_spans": [[76, 77]]}, {"text": "Santa Clara", "char_spans": [[403, 413]], "token_spans": [[76, 77]]}]}, {"answers": ["Super Bowl L", "L", "Super Bowl L"], "question": "If Roman numerals were used, what would Super Bowl 50 have been called?", "id": "56beace93aeaaa14008c91e2", "qid": "fad596c3f0e944abae33bf99ceccfbd6", "question_tokens": [["If", 0], ["Roman", 3], ["numerals", 9], ["were", 18], ["used", 23], [",", 27], ["what", 29], ["would", 34], ["Super", 40], ["Bowl", 46], ["50", 51], ["have", 54], ["been", 59], ["called", 64], ["?", 70]], "detected_answers": [{"text": "Super Bowl L", "char_spans": [[693, 704]], "token_spans": [[131, 133]]}, {"text": "L", "char_spans": [[704, 704]], "token_spans": [[133, 133]]}, {"text": "Super Bowl L", "char_spans": [[693, 704]], "token_spans": [[131, 133]]}]}, {"answers": ["2015", "the 2015 season", "2015"], "question": "Super Bowl 50 decided the NFL champion for what season?", "id": "56beace93aeaaa14008c91e3", "qid": "97f0c1c69a694cc8bc9edd41dd4c42be", "question_tokens": [["Super", 0], ["Bowl", 6], ["50", 11], ["decided", 14], ["the", 22], ["NFL", 26], ["champion", 30], ["for", 39], ["what", 43], ["season", 48], 
["?", 54]], "detected_answers": [{"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}, {"text": "the 2015 season", "char_spans": [[112, 126]], "token_spans": [[21, 23]]}, {"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}]}, {"answers": ["2015", "2016", "2015"], "question": "What year did the Denver Broncos secure a Super Bowl title for the third time?", "id": "56bf10f43aeaaa14008c94fd", "qid": "d14fc2f7c07e4729a02888b4ee4c400c", "question_tokens": [["What", 0], ["year", 5], ["did", 10], ["the", 14], ["Denver", 18], ["Broncos", 25], ["secure", 33], ["a", 40], ["Super", 42], ["Bowl", 48], ["title", 53], ["for", 59], ["the", 63], ["third", 67], ["time", 73], ["?", 77]], "detected_answers": [{"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}, {"text": "2016", "char_spans": [[346, 349]], "token_spans": [[63, 63]]}, {"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}]}, {"answers": ["Santa Clara", "Santa Clara", "Santa Clara"], "question": "What city did Super Bowl 50 take place in?", "id": "56bf10f43aeaaa14008c94fe", "qid": "4297cde9c23a4105998937901a7fd3f6", "question_tokens": [["What", 0], ["city", 5], ["did", 10], ["Super", 14], ["Bowl", 20], ["50", 25], ["take", 28], ["place", 33], ["in", 39], ["?", 41]], "detected_answers": [{"text": "Santa Clara", "char_spans": [[403, 413]], "token_spans": [[76, 77]]}, {"text": "Santa Clara", "char_spans": [[403, 413]], "token_spans": [[76, 77]]}, {"text": "Santa Clara", "char_spans": [[403, 413]], "token_spans": [[76, 77]]}]}, {"answers": ["Levi's Stadium", "Levi's Stadium", "Levi's Stadium"], "question": "What stadium did Super Bowl 50 take place in?", "id": "56bf10f43aeaaa14008c94ff", "qid": "da8f425e541a46c19be04738f41097b3", "question_tokens": [["What", 0], ["stadium", 5], ["did", 13], ["Super", 17], ["Bowl", 23], ["50", 28], ["take", 31], ["place", 36], ["in", 42], ["?", 44]], "detected_answers": [{"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}, {"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}, {"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}]}, {"answers": ["24\u201310", "24\u201310", "24\u201310"], "question": "What was the final score of Super Bowl 50? ", "id": "56bf10f43aeaaa14008c9500", "qid": "f944d4b2519b43e4a3dd13dda85495fc", "question_tokens": [["What", 0], ["was", 5], ["the", 9], ["final", 13], ["score", 19], ["of", 25], ["Super", 28], ["Bowl", 34], ["50", 39], ["?", 41]], "detected_answers": [{"text": "24\u201310", "char_spans": [[267, 271]], "token_spans": [[46, 46]]}, {"text": "24\u201310", "char_spans": [[267, 271]], "token_spans": [[46, 46]]}, {"text": "24\u201310", "char_spans": [[267, 271]], "token_spans": [[46, 46]]}]}, {"answers": ["February 7, 2016", "February 7, 2016", "February 7, 2016"], "question": "What month, day and year did Super Bowl 50 take place? 
", "id": "56bf10f43aeaaa14008c9501", "qid": "adff197d69764b7fbe2a6ebaae075df4", "question_tokens": [["What", 0], ["month", 5], [",", 10], ["day", 12], ["and", 16], ["year", 20], ["did", 25], ["Super", 29], ["Bowl", 35], ["50", 40], ["take", 43], ["place", 48], ["?", 53]], "detected_answers": [{"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}, {"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}, {"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}]}, {"answers": ["2015", "2016", "2016"], "question": "What year was Super Bowl 50?", "id": "56d20362e7d4791d009025e8", "qid": "c5187d183b494ccf969a15cd0c3039e2", "question_tokens": [["What", 0], ["year", 5], ["was", 10], ["Super", 14], ["Bowl", 20], ["50", 25], ["?", 27]], "detected_answers": [{"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}, {"text": "2016", "char_spans": [[346, 349]], "token_spans": [[63, 63]]}, {"text": "2016", "char_spans": [[346, 349]], "token_spans": [[63, 63]]}]}, {"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "What team was the AFC champion?", "id": "56d20362e7d4791d009025e9", "qid": "6288b96ce9944dc1b391ff08b6bd8386", "question_tokens": [["What", 0], ["team", 5], ["was", 10], ["the", 14], ["AFC", 18], ["champion", 22], ["?", 30]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}, {"answers": ["Carolina Panthers", "Carolina Panthers", "Carolina Panthers"], "question": "What team was the NFC champion?", "id": "56d20362e7d4791d009025ea", "qid": "80edad8dc6254bd680100e36be2cfa98", "question_tokens": [["What", 0], ["team", 5], ["was", 10], ["the", 14], ["NFC", 18], ["champion", 22], ["?", 30]], "detected_answers": [{"text": "Carolina Panthers", "char_spans": [[249, 265]], "token_spans": [[44, 45]]}, {"text": "Carolina Panthers", "char_spans": [[249, 265]], "token_spans": [[44, 45]]}, {"text": "Carolina Panthers", "char_spans": [[249, 265]], "token_spans": [[44, 45]]}]}, {"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "Who won Super Bowl 50?", "id": "56d20362e7d4791d009025eb", "qid": "556c5788c4574cc78d53a241004c4e93", "question_tokens": [["Who", 0], ["won", 4], ["Super", 8], ["Bowl", 14], ["50", 19], ["?", 21]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}, {"answers": ["2015", "the 2015 season", "2015"], "question": "Super Bowl 50 determined the NFL champion for what season?", "id": "56d600e31c85041400946eae", "qid": "18d7493cca8a44db945ff16a2949e26d", "question_tokens": [["Super", 0], ["Bowl", 6], ["50", 11], ["determined", 14], ["the", 25], ["NFL", 29], ["champion", 33], ["for", 42], ["what", 46], ["season", 51], ["?", 57]], "detected_answers": [{"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}, {"text": "the 2015 season", "char_spans": [[112, 126]], "token_spans": [[21, 23]]}, {"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}]}, {"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "Which team won Super Bowl 50.", "id": 
"56d600e31c85041400946eb0", "qid": "6392df5f107a4acf9d96321f1e0c177d", "question_tokens": [["Which", 0], ["team", 6], ["won", 11], ["Super", 15], ["Bowl", 21], ["50", 26], [".", 28]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}, {"answers": ["Santa Clara, California.", "Levi's Stadium", "Levi's Stadium"], "question": "Where was Super Bowl 50 held?", "id": "56d600e31c85041400946eb1", "qid": "81485c83e23a45448e2b9d31a679d73b", "question_tokens": [["Where", 0], ["was", 6], ["Super", 10], ["Bowl", 16], ["50", 21], ["held", 24], ["?", 28]], "detected_answers": [{"text": "Santa Clara, California.", "char_spans": [[403, 426]], "token_spans": [[76, 80]]}, {"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}, {"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}]}, {"answers": ["Super Bowl", "Super Bowl", "Super Bowl"], "question": "The name of the NFL championship game is?", "id": "56d9895ddc89441400fdb50e", "qid": "5668cdd5c25b4549856d628a3ec248d9", "question_tokens": [["The", 0], ["name", 4], ["of", 9], ["the", 12], ["NFL", 16], ["championship", 20], ["game", 33], ["is", 38], ["?", 40]], "detected_answers": [{"text": "Super Bowl", "token_spans": [[0, 1], [86, 87], [51, 52], [114, 115], [131, 132]], "char_spans": [[0, 9], [449, 458], [293, 302], [609, 618], [693, 702]]}, {"text": "Super Bowl", "token_spans": [[0, 1], [86, 87], [51, 52], [114, 115], [131, 132]], "char_spans": [[0, 9], [449, 458], [293, 302], [609, 618], [693, 702]]}, {"text": "Super Bowl", "token_spans": [[0, 1], [86, 87], [51, 52], [114, 115], [131, 132]], "char_spans": [[0, 9], [449, 458], [293, 302], [609, 618], [693, 702]]}]}, {"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "What 2015 NFL team one the AFC playoff?", "id": "56d9895ddc89441400fdb510", "qid": "52d6568dd0b74a99866cad2599161a4a", "question_tokens": [["What", 0], ["2015", 5], ["NFL", 10], ["team", 14], ["one", 19], ["the", 23], ["AFC", 27], ["playoff", 31], ["?", 38]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}], "context_tokens": [["Super", 0], ["Bowl", 6], ["50", 11], ["was", 14], ["an", 18], ["American", 21], ["football", 30], ["game", 39], ["to", 44], ["determine", 47], ["the", 57], ["champion", 61], ["of", 70], ["the", 73], ["National", 77], ["Football", 86], ["League", 95], ["(", 102], ["NFL", 103], [")", 106], ["for", 108], ["the", 112], ["2015", 116], ["season", 121], [".", 127], ["The", 129], ["American", 133], ["Football", 142], ["Conference", 151], ["(", 162], ["AFC", 163], [")", 166], ["champion", 168], ["Denver", 177], ["Broncos", 184], ["defeated", 192], ["the", 201], ["National", 205], ["Football", 214], ["Conference", 223], ["(", 234], ["NFC", 235], [")", 238], ["champion", 240], ["Carolina", 249], ["Panthers", 258], ["24\u201310", 267], ["to", 273], ["earn", 276], ["their", 281], ["third", 287], ["Super", 293], ["Bowl", 299], ["title", 304], [".", 309], ["The", 311], ["game", 315], ["was", 320], ["played", 324], ["on", 331], ["February", 334], ["7", 343], [",", 344], ["2016", 346], [",", 350], ["at", 352], ["Levi", 
355], ["'s", 359], ["Stadium", 362], ["in", 370], ["the", 373], ["San", 377], ["Francisco", 381], ["Bay", 391], ["Area", 395], ["at", 400], ["Santa", 403], ["Clara", 409], [",", 414], ["California", 416], [".", 426], ["As", 428], ["this", 431], ["was", 436], ["the", 440], ["50th", 444], ["Super", 449], ["Bowl", 455], [",", 459], ["the", 461], ["league", 465], ["emphasized", 472], ["the", 483], ["\"", 487], ["golden", 488], ["anniversary", 495], ["\"", 506], ["with", 508], ["various", 513], ["gold", 521], ["-", 525], ["themed", 526], ["initiatives", 533], [",", 544], ["as", 546], ["well", 549], ["as", 554], ["temporarily", 557], ["suspending", 569], ["the", 580], ["tradition", 584], ["of", 594], ["naming", 597], ["each", 604], ["Super", 609], ["Bowl", 615], ["game", 620], ["with", 625], ["Roman", 630], ["numerals", 636], ["(", 645], ["under", 646], ["which", 652], ["the", 658], ["game", 662], ["would", 667], ["have", 673], ["been", 678], ["known", 683], ["as", 689], ["\"", 692], ["Super", 693], ["Bowl", 699], ["L", 704], ["\"", 705], [")", 706], [",", 707], ["so", 709], ["that", 712], ["the", 717], ["logo", 721], ["could", 726], ["prominently", 732], ["feature", 744], ["the", 752], ["Arabic", 756], ["numerals", 763], ["50", 772], [".", 774]]}
    
    opened by ednussi 2
  • create_pretraining_data.py has unknown argument FLAGS.ngrams_file

    While running the script with the default command given in https://github.com/oriram/splinter:

    cd pretraining
    python create_pretraining_data.py \
        --input_file=$INPUT_PATTERN \
        --output_dir=$OUTPUT_DIR \
        --vocab_file=vocabs/bert-cased-vocab.txt \
        --do_lower_case=False \
        --do_whole_word_mask=False \
        --max_seq_length=512 \
        --num_processes=63 \
        --dupe_factor=5 \
        --max_span_length=10 \
        --recurring_span_selection=True \
        --only_recurring_span_selection=True \
        --max_questions_per_seq=30

    The script fails because no ngrams_file flag is defined, and the following error occurs:

    Traceback (most recent call last):
      File "create_pretraining_data.py", line 453, in <module>
        tf.app.run()
      File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
        _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
      File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
        _run_main(main, args)
      File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
        sys.exit(main(argv))
      File "create_pretraining_data.py", line 441, in main
        with tf.gfile.GFile(FLAGS.ngrams_file, "w") as writer:
      File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/platform/flags.py", line 85, in __getattr__
        return wrapped.__getattr__(name)
      File "/usr/local/lib/python3.7/dist-packages/absl/flags/_flagvalues.py", line 480, in __getattr__
        raise AttributeError(name)
    AttributeError: ngrams_file

    I suggest a small fixup: adding on line 78:
        flags.DEFINE_string("ngrams_file", None, "The file that will store the ngrams.")
    adding on line 453:
        flags.mark_flag_as_required("ngrams_file")

    and adding the ngrams_file parameter ($NGRAMS_FILE) to the command:

    cd pretraining
    python create_pretraining_data.py \
        --input_file=$INPUT_PATTERN \
        --output_dir=$OUTPUT_DIR \
        --vocab_file=vocabs/bert-cased-vocab.txt \
        --do_lower_case=False \
        --do_whole_word_mask=False \
        --max_seq_length=512 \
        --num_processes=63 \
        --dupe_factor=5 \
        --max_span_length=10 \
        --recurring_span_selection=True \
        --only_recurring_span_selection=True \
        --ngrams_file=$NGRAMS_FILE \
        --max_questions_per_seq=30

    opened by ShimonMalnick 1
  • Update README.md

    unzip mrqa-few-shot.zip to mrqa-few-shot

    To make the following script work, mrqa-few-shot.zip should be unzipped to mrqa-few-shot: https://github.com/oriram/splinter/blob/55866bb87829ee5d0f5981667af51acda95e00cb/README.md#L117-L126

    opened by Liangtaiwan 0
  • Replicating results using huggingface splinter tokenizer and model

    Hi, I enjoyed reading your paper very much and am trying to replicate the results with splinter-large, but I have not been able to replicate the fine-tuning results with the huggingface models. Is this because the Splinter tokenizer adds a [SEP] token after the [QUESTION] token, whereas during pretraining the [QUESTION] token is in the same sequence as the answer?

    opened by sammys377 3