Splinter

Overview

This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection", published at ACL 2021.

Our pretraining code is based on TensorFlow (tested with 1.15), while fine-tuning is based on PyTorch (1.7.1) and Transformers (2.9.0). Note that each has its own requirements file: pretraining/requirements.txt and finetuning/requirements.txt.

Data

Downloading Few-Shot MRQA Splits

curl -L https://www.dropbox.com/sh/pfg8j6yfpjltwdx/AAC8Oky0w8ZS-S3S5zSSAuQma?dl=1 > mrqa-few-shot.zip
unzip mrqa-few-shot.zip -d mrqa-few-shot

Pretrained Model

Command for downloading Splinter
curl -L https://www.dropbox.com/sh/h63xx2l2fjq8bsz/AAC5_Z_F2zBkJgX87i3IlvGca?dl=1 > splinter.zip
unzip splinter.zip -d splinter 

Pretraining

Create a virtual environment and execute

cd pretraining
pip install -r requirements.txt  # or requirements-gpu.txt for a GPU version

Then download the raw data (our pretraining was based on Wikipedia and BookCorpus). We support two data formats:

  • For Wikipedia, a <doc> tag starts a new article and a </doc> tag ends it.
  • For BookCorpus, we process an already-tokenized file where tokens are separated by whitespace; a newline starts a new book. Both formats are illustrated in the sketch below.
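
To make the two formats concrete, here is a minimal sketch (ours, not part of the repo; file and directory names are placeholders) that writes one toy input file per format:

import os

os.makedirs("raw_data", exist_ok=True)

# Wikipedia format: a <doc> tag opens an article, </doc> closes it.
with open("raw_data/wiki_sample.txt", "w") as f:
    f.write('<doc id="1" title="Example Article">\n')
    f.write("Some article text ...\n")
    f.write("</doc>\n")

# BookCorpus format: pre-tokenized text, tokens separated by whitespace,
# with a newline starting a new book.
with open("raw_data/books_sample.txt", "w") as f:
    f.write("first book tokens , separated by whitespace .\n")
    f.write("second book starts on a new line .\n")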
Command for creating the pretraining data

This command takes as input a set of files ($INPUT_PATTERN) and creates a tensorized dataset for pretraining. It supports several masking schemes; Splinter uses recurring span selection.

Command for creating the data for Splinter (recurring span selection)
cd pretraining
python create_pretraining_data.py \
    --input_file=$INPUT_PATTERN \
    --output_dir=$OUTPUT_DIR \
    --vocab_file=vocabs/bert-cased-vocab.txt \
    --do_lower_case=False \
    --do_whole_word_mask=False \
    --max_seq_length=512 \
    --num_processes=63 \
    --dupe_factor=5 \
    --max_span_length=10 \
    --recurring_span_selection=True \
    --only_recurring_span_selection=True \
    --max_questions_per_seq=30

N-gram statistics are written to ngrams.txt in the output directory.

Command for pretraining Splinter
cd pretraining
python run_pretraining.py \
    --bert_config_file=configs/bert-base-cased-config.json \
    --input_file=$INPUT_FILE \
    --output_dir=$OUTPUT_DIR \
    --max_seq_length=512 \
    --recurring_span_selection=True \
    --only_recurring_span_selection=True \
    --max_questions_per_seq=30 \
    --do_train \
    --train_batch_size=256 \
    --learning_rate=1e-4 \
    --num_train_steps=2400000 \
    --num_warmup_steps=10000 \
    --save_checkpoints_steps=10000 \
    --keep_checkpoint_max=240 \
    --use_tpu \
    --num_tpu_cores=8 \
    --tpu_name=$TPU_NAME

Pretraining can also be run on GPUs by dropping the --use_tpu flag (although it was tested mainly on TPUs).

Convert TensorFlow Model to PyTorch

In order to fine-tune the TF model you pretrained with run_pretraining.py, you will first need to convert it to PyTorch:

cd model_conversion
pip install -r requirements.txt
python convert_tf_to_pytorch.py --tf_checkpoint_path $TF_MODEL_PATH --pytorch_dump_path $OUTPUT_PATH
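
As a quick sanity check, the converted checkpoint should load with the Transformers version pinned for fine-tuning. A minimal sketch (ours), assuming the dump directory at $OUTPUT_PATH also contains a config.json and the vocab file:

import os
from transformers import BertConfig, BertModel

output_path = os.environ["OUTPUT_PATH"]  # same path as --pytorch_dump_path above
config = BertConfig.from_pretrained(output_path)
model = BertModel.from_pretrained(output_path, config=config)
print(model.config.hidden_size)  # 768 for the bert-base-cased configuration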

Fine-tuning

Fine-tuning has different requirements than pretraining, as it uses HuggingFace's Transformers library. Create a virtual environment and execute

cd finetuning
pip install -r requirements.txt

Please note: If you want to reproduce results from the paper, or to run with a QASS head in general, questions need to be augmented with a [QUESTION] token. In order to do so, please run

cd finetuning
python qass_preprocess.py --path "../mrqa-few-shot/*/*.jsonl"

This will add a [MASK] token to each question in the training data, which will later be replaced by a [QUESTION] token automatically by the QASS layer implementation.
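
For intuition, here is a minimal re-implementation sketch (ours; the real logic lives in finetuning/qass_preprocess.py and handles more details, e.g. the tokenized question fields), assuming each non-header line of an MRQA jsonl file is a JSON object with a qas list:

import glob
import json

for path in glob.glob("../mrqa-few-shot/*/*.jsonl"):
    if path.endswith("_qass.jsonl"):
        continue  # already preprocessed
    with open(path) as f_in, open(path.replace(".jsonl", "_qass.jsonl"), "w") as f_out:
        for line in f_in:
            example = json.loads(line)
            if "header" in example:  # the first line of MRQA files is a header
                f_out.write(line)
                continue
            for qa in example.get("qas", []):
                # Append the [MASK] placeholder; the QASS layer later swaps
                # it for the [QUESTION] token.
                qa["question"] = qa["question"] + " [MASK]"
            f_out.write(json.dumps(example) + "\n")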

Then fine-tune Splinter by

cd finetuning
export MODEL="../splinter"
export OUTPUT_DIR="output"
python run_mrqa.py \
    --model_type=bert \
    --model_name_or_path=$MODEL \
    --qass_head=True \
    --tokenizer_name=$MODEL \
    --output_dir=$OUTPUT_DIR \
    --train_file="../mrqa-few-shot/squad/squad-train-seed-42-num-examples-16_qass.jsonl" \
    --predict_file="../mrqa-few-shot/squad/dev_qass.jsonl" \
    --do_train \
    --do_eval \
    --max_seq_length=384 \
    --doc_stride=128 \
    --threads=4 \
    --save_steps=50000 \
    --per_gpu_train_batch_size=12 \
    --per_gpu_eval_batch_size=16 \
    --learning_rate=3e-5 \
    --max_answer_length=10 \
    --warmup_ratio=0.1 \
    --min_steps=200 \
    --num_train_epochs=10 \
    --seed=42 \
    --use_cache=False \
    --evaluate_every_epoch=False 

In order to train with automatic mixed precision, install apex and add the --fp16 flag.

See an example script for fine-tuning SpanBERT (rather than Splinter) here.

Citation

If you find this work helpful, please cite our paper:

@inproceedings{ram-etal-2021-shot,
    title = "Few-Shot Question Answering by Pretraining Span Selection",
    author = "Ram, Ori  and
      Kirstain, Yuval  and
      Berant, Jonathan  and
      Globerson, Amir  and
      Levy, Omer",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.239",
    pages = "3066--3079",
}

Acknowledgements

We would like to thank the European Research Council (ERC) for funding the project, and Google's TPU Research Cloud (TRC) for their support in providing TPUs.

Comments
  • Thank You & Reproducing Baselines

    Thank you very much for posting this code! It is extremely helpful in reproducing the results.

    I wanted to inquire whether you can share details about reproducing the baselines provided in the paper, as we've been having some trouble reproducing those numbers, specifically for the roberta-base vanilla experiment.

    See here for more details. Thanks!

    opened by ednussi 9
  • 128 samples in TextbookQA

    Hello, I have a problem when evaluating on TextbookQA: the results are frustrating. I use the Splinter model from your Dropbox. Here are my eval_result and finetune_splinter.sh.

    Final Eval:
    exact = 23.750
    **f1 = 29.341**
    total = 400
    HasAns_exact = 23.750
    HasAns_f1 = 29.341
    HasAns_total = 400
    best_exact = 23.750
    best_exact_thresh = 0.000
    best_f1 = 29.341
    best_f1_thresh = 0.000
    
    export MODEL="../../splinter"
    export dataset="textbookqa"
    export OUTPUT_DIR="../../output_model/${dataset}_128"
    python ../run_mrqa.py \
        --model_type=bert \
        --model_name_or_path=$MODEL \
        --qass_head=True \
        --tokenizer_name=$MODEL \
        --output_dir=$OUTPUT_DIR \
        --train_file="../../mrqa-few-shot/${dataset}/${dataset}-train-seed-42-num-examples-128.jsonl" \
        --predict_file="../../mrqa-few-shot/${dataset}/dev.jsonl" \
        --do_train \
        --do_eval \
        --max_seq_length=512 \
        --doc_stride=128 \
        --threads=4 \
        --save_steps=50000 \
        --per_gpu_train_batch_size=16 \
        --per_gpu_eval_batch_size=16 \
        --learning_rate=3e-5 \
        --max_answer_length=10 \
        --warmup_ratio=0.1 \
        --min_steps=200 \
        --num_train_epochs=32 \
        --seed=42 \
        --use_cache=False \
        --evaluate_every_epoch=True  \
        --overwrite_output_dir
    
    opened by yinzhangyue 4
  • Why have [PAD] tokens in the masked spans?

    Hi, I was wondering: what's the rationale for having [PAD] tokens in masked spans of length more than one, instead of just removing the remaining tokens? Here:

    https://github.com/oriram/splinter/blob/1df4c13d5b05f7d1374b1ac1ea49ab238431e855/pretraining/masking.py#L316-L323

    Is the reason just computational efficiency?

    opened by bminixhofer 3
  • Loading checkpoint

    Hi, awesome work!

    I was wondering if there's any possibility of loading a checkpoint (for both Splinter and SpanBERT)? Would this be the init_checkpoint flag?

    Thanks!

    opened by jjzha 3
  • question_token_id=104

    Hi, I have a question from reading the code: question_token_id=104, but 104 is [MASK] in BERT's vocab.txt, so is it still the [MASK] token? This code is in finetuning/modeling.py:

    class ModelWithQASSHead(BertPreTrainedModel):
        def __init__(
            self,
            config,
            replace_mask_with_question_token=False,
            mask_id=103,
            question_token_id=104,
            sep_id=102,
            initialize_new_qass=True,
        ):

    opened by yinzhangyue 2
  • [MASK] is appended to the end of questions instead of [QUESTION] when finetuning

    Hi,

    I found that in qass_preprocess.py you append [MASK] to the end of questions instead of [QUESTION]: https://github.com/oriram/splinter/blob/55866bb87829ee5d0f5981667af51acda95e00cb/finetuning/qass_preprocess.py#L8 https://github.com/oriram/splinter/blob/55866bb87829ee5d0f5981667af51acda95e00cb/finetuning/qass_preprocess.py#L18-L26

    Is there a difference between the implementation and the paper, or is this a bug?

    opened by Liangtaiwan 2
  • Reproduce Tokens in jsonl examples for finetuning

    I've been unsuccessful so far in identifying the right procedure for turning a SQuAD example into the internal .jsonl format the examples are saved in for the fine-tuning task.

    A standard example has the fields: title, paragraphs, qas, id, question, answers, answer_start, text. For instance:

    {"answers": [{"answer_start": 177, "text": "Denver Broncos"}, {"answer_start": 177, "text": "Denver Broncos"}, {"answer_start": 177, "text": "Denver Broncos"}], "question": "Which NFL team represented the AFC at Super Bowl 50?", "id": "56be4db0acb8001400a502ec"}

    It is not exactly clear to me how to parse it to get back the tokens, e.g. question_tokens, context_tokens and token_spans (for the answers).

    When examining the stored examples, in squad-train-seed-42-num-examples-16.jsonl for example, it seems these tokens indicate locations in the text, produced by some internal parsing logic. Could you please help me identify how, given text, to generate these tokens? My end goal is to be able to feed in new questions and answers and store them in the same jsonl format.

    Specifically, attaching below how the example above is stored within the jsonl file.

    {"id": "", "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.", "qas": [{"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "Which NFL team represented the AFC at Super Bowl 50?", "id": "56be4db0acb8001400a502ec", "qid": "b0626b3af0764c80b1e6f22c114982c1", "question_tokens": [["Which", 0], ["NFL", 6], ["team", 10], ["represented", 15], ["the", 27], ["AFC", 31], ["at", 35], ["Super", 38], ["Bowl", 44], ["50", 49], ["?", 51]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}, {"answers": ["Carolina Panthers", "Carolina Panthers", "Carolina Panthers"], "question": "Which NFL team represented the NFC at Super Bowl 50?", "id": "56be4db0acb8001400a502ed", "qid": "8d96e9feff464a52a15e192b1dc9ed01", "question_tokens": [["Which", 0], ["NFL", 6], ["team", 10], ["represented", 15], ["the", 27], ["NFC", 31], ["at", 35], ["Super", 38], ["Bowl", 44], ["50", 49], ["?", 51]], "detected_answers": [{"text": "Carolina Panthers", "char_spans": [[249, 265]], "token_spans": [[44, 45]]}, {"text": "Carolina Panthers", "char_spans": [[249, 265]], "token_spans": [[44, 45]]}, {"text": "Carolina Panthers", "char_spans": [[249, 265]], "token_spans": [[44, 45]]}]}, {"answers": ["Santa Clara, California", "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], "question": "Where did Super Bowl 50 take place?", "id": "56be4db0acb8001400a502ee", "qid": "190fdfbc068243a7a04eb3ed59808db8", "question_tokens": [["Where", 0], ["did", 6], ["Super", 10], ["Bowl", 16], ["50", 21], ["take", 24], ["place", 29], ["?", 34]], "detected_answers": [{"text": "Santa Clara, California", "char_spans": [[403, 425]], "token_spans": [[76, 79]]}, {"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}, {"text": "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.", "char_spans": [[355, 426]], "token_spans": [[66, 80]]}]}, {"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "Which NFL team won Super Bowl 50?", "id": "56be4db0acb8001400a502ef", "qid": "e8d4a7478ed5439fa55c2660267bcaa1", "question_tokens": [["Which", 0], ["NFL", 6], ["team", 10], ["won", 15], ["Super", 19], ["Bowl", 25], ["50", 30], ["?", 32]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}, {"answers": ["gold", "gold", "gold"], "question": "What color was used to 
emphasize the 50th anniversary of the Super Bowl?", "id": "56be4db0acb8001400a502f0", "qid": "74019130542f49e184d733607e565a68", "question_tokens": [["What", 0], ["color", 5], ["was", 11], ["used", 15], ["to", 20], ["emphasize", 23], ["the", 33], ["50th", 37], ["anniversary", 42], ["of", 54], ["the", 57], ["Super", 61], ["Bowl", 67], ["?", 71]], "detected_answers": [{"text": "gold", "char_spans": [[521, 524]], "token_spans": [[99, 99]]}]}, {"answers": ["\"golden anniversary\"", "gold-themed", "\"golden anniversary"], "question": "What was the theme of Super Bowl 50?", "id": "56be8e613aeaaa14008c90d1", "qid": "3729174743f74ed58aa64cb7c7dbc7b3", "question_tokens": [["What", 0], ["was", 5], ["the", 9], ["theme", 13], ["of", 19], ["Super", 22], ["Bowl", 28], ["50", 33], ["?", 35]], "detected_answers": [{"text": "\"golden anniversary\"", "char_spans": [[487, 506]], "token_spans": [[93, 96]]}, {"text": "gold-themed", "char_spans": [[521, 531]], "token_spans": [[99, 101]]}, {"text": "\"golden anniversary", "char_spans": [[487, 505]], "token_spans": [[93, 95]]}]}, {"answers": ["February 7, 2016", "February 7", "February 7, 2016"], "question": "What day was the game played on?", "id": "56be8e613aeaaa14008c90d2", "qid": "cc75a31d588842848d9890cafe092dec", "question_tokens": [["What", 0], ["day", 5], ["was", 9], ["the", 13], ["game", 17], ["played", 22], ["on", 29], ["?", 31]], "detected_answers": [{"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}, {"text": "February 7", "char_spans": [[334, 343]], "token_spans": [[60, 61]]}, {"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}]}, {"answers": ["American Football Conference", "American Football Conference", "American Football Conference"], "question": "What is the AFC short for?", "id": "56be8e613aeaaa14008c90d3", "qid": "7c1424bfa53a4de28c3ec91adfbfe4ab", "question_tokens": [["What", 0], ["is", 5], ["the", 8], ["AFC", 12], ["short", 16], ["for", 22], ["?", 25]], "detected_answers": [{"text": "American Football Conference", "char_spans": [[133, 160]], "token_spans": [[26, 28]]}, {"text": "American Football Conference", "char_spans": [[133, 160]], "token_spans": [[26, 28]]}, {"text": "American Football Conference", "char_spans": [[133, 160]], "token_spans": [[26, 28]]}]}, {"answers": ["\"golden anniversary\"", "gold-themed", "gold"], "question": "What was the theme of Super Bowl 50?", "id": "56bea9923aeaaa14008c91b9", "qid": "78a00c316d9e40e69711a9b5c7a932a0", "question_tokens": [["What", 0], ["was", 5], ["the", 9], ["theme", 13], ["of", 19], ["Super", 22], ["Bowl", 28], ["50", 33], ["?", 35]], "detected_answers": [{"text": "\"golden anniversary\"", "char_spans": [[487, 506]], "token_spans": [[93, 96]]}, {"text": "gold-themed", "char_spans": [[521, 531]], "token_spans": [[99, 101]]}, {"text": "gold", "char_spans": [[521, 524]], "token_spans": [[99, 99]]}]}, {"answers": ["American Football Conference", "American Football Conference", "American Football Conference"], "question": "What does AFC stand for?", "id": "56bea9923aeaaa14008c91ba", "qid": "1ef03938ae3848798b701dd4dbb30bd9", "question_tokens": [["What", 0], ["does", 5], ["AFC", 10], ["stand", 14], ["for", 20], ["?", 23]], "detected_answers": [{"text": "American Football Conference", "char_spans": [[133, 160]], "token_spans": [[26, 28]]}, {"text": "American Football Conference", "char_spans": [[133, 160]], "token_spans": [[26, 28]]}, {"text": "American Football Conference", "char_spans": [[133, 160]], "token_spans": [[26, 28]]}]}, 
{"answers": ["February 7, 2016", "February 7", "February 7, 2016"], "question": "What day was the Super Bowl played on?", "id": "56bea9923aeaaa14008c91bb", "qid": "cfd440704eee420b9fdf92725a6cdb64", "question_tokens": [["What", 0], ["day", 5], ["was", 9], ["the", 13], ["Super", 17], ["Bowl", 23], ["played", 28], ["on", 35], ["?", 37]], "detected_answers": [{"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}, {"text": "February 7", "char_spans": [[334, 343]], "token_spans": [[60, 61]]}, {"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}]}, {"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "Who won Super Bowl 50?", "id": "56beace93aeaaa14008c91df", "qid": "ca4749d3d0204f418fbfbaa52a1d9ece", "question_tokens": [["Who", 0], ["won", 4], ["Super", 8], ["Bowl", 14], ["50", 19], ["?", 21]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}, {"answers": ["Levi's Stadium", "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara"], "question": "What venue did Super Bowl 50 take place in?", "id": "56beace93aeaaa14008c91e0", "qid": "c2c7e5d3fb87437c80d863d91f8a4e21", "question_tokens": [["What", 0], ["venue", 5], ["did", 11], ["Super", 15], ["Bowl", 21], ["50", 26], ["take", 29], ["place", 34], ["in", 40], ["?", 42]], "detected_answers": [{"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}, {"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}, {"text": "Levi's Stadium in the San Francisco Bay Area at Santa Clara", "char_spans": [[355, 413]], "token_spans": [[66, 77]]}]}, {"answers": ["Santa Clara", "Santa Clara", "Santa Clara"], "question": "What city did Super Bowl 50 take place in?", "id": "56beace93aeaaa14008c91e1", "qid": "643b4c1ef1644d18bf6866d95f24f900", "question_tokens": [["What", 0], ["city", 5], ["did", 10], ["Super", 14], ["Bowl", 20], ["50", 25], ["take", 28], ["place", 33], ["in", 39], ["?", 41]], "detected_answers": [{"text": "Santa Clara", "char_spans": [[403, 413]], "token_spans": [[76, 77]]}, {"text": "Santa Clara", "char_spans": [[403, 413]], "token_spans": [[76, 77]]}, {"text": "Santa Clara", "char_spans": [[403, 413]], "token_spans": [[76, 77]]}]}, {"answers": ["Super Bowl L", "L", "Super Bowl L"], "question": "If Roman numerals were used, what would Super Bowl 50 have been called?", "id": "56beace93aeaaa14008c91e2", "qid": "fad596c3f0e944abae33bf99ceccfbd6", "question_tokens": [["If", 0], ["Roman", 3], ["numerals", 9], ["were", 18], ["used", 23], [",", 27], ["what", 29], ["would", 34], ["Super", 40], ["Bowl", 46], ["50", 51], ["have", 54], ["been", 59], ["called", 64], ["?", 70]], "detected_answers": [{"text": "Super Bowl L", "char_spans": [[693, 704]], "token_spans": [[131, 133]]}, {"text": "L", "char_spans": [[704, 704]], "token_spans": [[133, 133]]}, {"text": "Super Bowl L", "char_spans": [[693, 704]], "token_spans": [[131, 133]]}]}, {"answers": ["2015", "the 2015 season", "2015"], "question": "Super Bowl 50 decided the NFL champion for what season?", "id": "56beace93aeaaa14008c91e3", "qid": "97f0c1c69a694cc8bc9edd41dd4c42be", "question_tokens": [["Super", 0], ["Bowl", 6], ["50", 11], ["decided", 14], ["the", 22], ["NFL", 26], ["champion", 30], ["for", 39], ["what", 43], ["season", 48], 
["?", 54]], "detected_answers": [{"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}, {"text": "the 2015 season", "char_spans": [[112, 126]], "token_spans": [[21, 23]]}, {"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}]}, {"answers": ["2015", "2016", "2015"], "question": "What year did the Denver Broncos secure a Super Bowl title for the third time?", "id": "56bf10f43aeaaa14008c94fd", "qid": "d14fc2f7c07e4729a02888b4ee4c400c", "question_tokens": [["What", 0], ["year", 5], ["did", 10], ["the", 14], ["Denver", 18], ["Broncos", 25], ["secure", 33], ["a", 40], ["Super", 42], ["Bowl", 48], ["title", 53], ["for", 59], ["the", 63], ["third", 67], ["time", 73], ["?", 77]], "detected_answers": [{"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}, {"text": "2016", "char_spans": [[346, 349]], "token_spans": [[63, 63]]}, {"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}]}, {"answers": ["Santa Clara", "Santa Clara", "Santa Clara"], "question": "What city did Super Bowl 50 take place in?", "id": "56bf10f43aeaaa14008c94fe", "qid": "4297cde9c23a4105998937901a7fd3f6", "question_tokens": [["What", 0], ["city", 5], ["did", 10], ["Super", 14], ["Bowl", 20], ["50", 25], ["take", 28], ["place", 33], ["in", 39], ["?", 41]], "detected_answers": [{"text": "Santa Clara", "char_spans": [[403, 413]], "token_spans": [[76, 77]]}, {"text": "Santa Clara", "char_spans": [[403, 413]], "token_spans": [[76, 77]]}, {"text": "Santa Clara", "char_spans": [[403, 413]], "token_spans": [[76, 77]]}]}, {"answers": ["Levi's Stadium", "Levi's Stadium", "Levi's Stadium"], "question": "What stadium did Super Bowl 50 take place in?", "id": "56bf10f43aeaaa14008c94ff", "qid": "da8f425e541a46c19be04738f41097b3", "question_tokens": [["What", 0], ["stadium", 5], ["did", 13], ["Super", 17], ["Bowl", 23], ["50", 28], ["take", 31], ["place", 36], ["in", 42], ["?", 44]], "detected_answers": [{"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}, {"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}, {"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}]}, {"answers": ["24\u201310", "24\u201310", "24\u201310"], "question": "What was the final score of Super Bowl 50? ", "id": "56bf10f43aeaaa14008c9500", "qid": "f944d4b2519b43e4a3dd13dda85495fc", "question_tokens": [["What", 0], ["was", 5], ["the", 9], ["final", 13], ["score", 19], ["of", 25], ["Super", 28], ["Bowl", 34], ["50", 39], ["?", 41]], "detected_answers": [{"text": "24\u201310", "char_spans": [[267, 271]], "token_spans": [[46, 46]]}, {"text": "24\u201310", "char_spans": [[267, 271]], "token_spans": [[46, 46]]}, {"text": "24\u201310", "char_spans": [[267, 271]], "token_spans": [[46, 46]]}]}, {"answers": ["February 7, 2016", "February 7, 2016", "February 7, 2016"], "question": "What month, day and year did Super Bowl 50 take place? 
", "id": "56bf10f43aeaaa14008c9501", "qid": "adff197d69764b7fbe2a6ebaae075df4", "question_tokens": [["What", 0], ["month", 5], [",", 10], ["day", 12], ["and", 16], ["year", 20], ["did", 25], ["Super", 29], ["Bowl", 35], ["50", 40], ["take", 43], ["place", 48], ["?", 53]], "detected_answers": [{"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}, {"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}, {"text": "February 7, 2016", "char_spans": [[334, 349]], "token_spans": [[60, 63]]}]}, {"answers": ["2015", "2016", "2016"], "question": "What year was Super Bowl 50?", "id": "56d20362e7d4791d009025e8", "qid": "c5187d183b494ccf969a15cd0c3039e2", "question_tokens": [["What", 0], ["year", 5], ["was", 10], ["Super", 14], ["Bowl", 20], ["50", 25], ["?", 27]], "detected_answers": [{"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}, {"text": "2016", "char_spans": [[346, 349]], "token_spans": [[63, 63]]}, {"text": "2016", "char_spans": [[346, 349]], "token_spans": [[63, 63]]}]}, {"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "What team was the AFC champion?", "id": "56d20362e7d4791d009025e9", "qid": "6288b96ce9944dc1b391ff08b6bd8386", "question_tokens": [["What", 0], ["team", 5], ["was", 10], ["the", 14], ["AFC", 18], ["champion", 22], ["?", 30]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}, {"answers": ["Carolina Panthers", "Carolina Panthers", "Carolina Panthers"], "question": "What team was the NFC champion?", "id": "56d20362e7d4791d009025ea", "qid": "80edad8dc6254bd680100e36be2cfa98", "question_tokens": [["What", 0], ["team", 5], ["was", 10], ["the", 14], ["NFC", 18], ["champion", 22], ["?", 30]], "detected_answers": [{"text": "Carolina Panthers", "char_spans": [[249, 265]], "token_spans": [[44, 45]]}, {"text": "Carolina Panthers", "char_spans": [[249, 265]], "token_spans": [[44, 45]]}, {"text": "Carolina Panthers", "char_spans": [[249, 265]], "token_spans": [[44, 45]]}]}, {"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "Who won Super Bowl 50?", "id": "56d20362e7d4791d009025eb", "qid": "556c5788c4574cc78d53a241004c4e93", "question_tokens": [["Who", 0], ["won", 4], ["Super", 8], ["Bowl", 14], ["50", 19], ["?", 21]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}, {"answers": ["2015", "the 2015 season", "2015"], "question": "Super Bowl 50 determined the NFL champion for what season?", "id": "56d600e31c85041400946eae", "qid": "18d7493cca8a44db945ff16a2949e26d", "question_tokens": [["Super", 0], ["Bowl", 6], ["50", 11], ["determined", 14], ["the", 25], ["NFL", 29], ["champion", 33], ["for", 42], ["what", 46], ["season", 51], ["?", 57]], "detected_answers": [{"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}, {"text": "the 2015 season", "char_spans": [[112, 126]], "token_spans": [[21, 23]]}, {"text": "2015", "char_spans": [[116, 119]], "token_spans": [[22, 22]]}]}, {"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "Which team won Super Bowl 50.", "id": 
"56d600e31c85041400946eb0", "qid": "6392df5f107a4acf9d96321f1e0c177d", "question_tokens": [["Which", 0], ["team", 6], ["won", 11], ["Super", 15], ["Bowl", 21], ["50", 26], [".", 28]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}, {"answers": ["Santa Clara, California.", "Levi's Stadium", "Levi's Stadium"], "question": "Where was Super Bowl 50 held?", "id": "56d600e31c85041400946eb1", "qid": "81485c83e23a45448e2b9d31a679d73b", "question_tokens": [["Where", 0], ["was", 6], ["Super", 10], ["Bowl", 16], ["50", 21], ["held", 24], ["?", 28]], "detected_answers": [{"text": "Santa Clara, California.", "char_spans": [[403, 426]], "token_spans": [[76, 80]]}, {"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}, {"text": "Levi's Stadium", "char_spans": [[355, 368]], "token_spans": [[66, 68]]}]}, {"answers": ["Super Bowl", "Super Bowl", "Super Bowl"], "question": "The name of the NFL championship game is?", "id": "56d9895ddc89441400fdb50e", "qid": "5668cdd5c25b4549856d628a3ec248d9", "question_tokens": [["The", 0], ["name", 4], ["of", 9], ["the", 12], ["NFL", 16], ["championship", 20], ["game", 33], ["is", 38], ["?", 40]], "detected_answers": [{"text": "Super Bowl", "token_spans": [[0, 1], [86, 87], [51, 52], [114, 115], [131, 132]], "char_spans": [[0, 9], [449, 458], [293, 302], [609, 618], [693, 702]]}, {"text": "Super Bowl", "token_spans": [[0, 1], [86, 87], [51, 52], [114, 115], [131, 132]], "char_spans": [[0, 9], [449, 458], [293, 302], [609, 618], [693, 702]]}, {"text": "Super Bowl", "token_spans": [[0, 1], [86, 87], [51, 52], [114, 115], [131, 132]], "char_spans": [[0, 9], [449, 458], [293, 302], [609, 618], [693, 702]]}]}, {"answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "question": "What 2015 NFL team one the AFC playoff?", "id": "56d9895ddc89441400fdb510", "qid": "52d6568dd0b74a99866cad2599161a4a", "question_tokens": [["What", 0], ["2015", 5], ["NFL", 10], ["team", 14], ["one", 19], ["the", 23], ["AFC", 27], ["playoff", 31], ["?", 38]], "detected_answers": [{"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}, {"text": "Denver Broncos", "char_spans": [[177, 190]], "token_spans": [[33, 34]]}]}], "context_tokens": [["Super", 0], ["Bowl", 6], ["50", 11], ["was", 14], ["an", 18], ["American", 21], ["football", 30], ["game", 39], ["to", 44], ["determine", 47], ["the", 57], ["champion", 61], ["of", 70], ["the", 73], ["National", 77], ["Football", 86], ["League", 95], ["(", 102], ["NFL", 103], [")", 106], ["for", 108], ["the", 112], ["2015", 116], ["season", 121], [".", 127], ["The", 129], ["American", 133], ["Football", 142], ["Conference", 151], ["(", 162], ["AFC", 163], [")", 166], ["champion", 168], ["Denver", 177], ["Broncos", 184], ["defeated", 192], ["the", 201], ["National", 205], ["Football", 214], ["Conference", 223], ["(", 234], ["NFC", 235], [")", 238], ["champion", 240], ["Carolina", 249], ["Panthers", 258], ["24\u201310", 267], ["to", 273], ["earn", 276], ["their", 281], ["third", 287], ["Super", 293], ["Bowl", 299], ["title", 304], [".", 309], ["The", 311], ["game", 315], ["was", 320], ["played", 324], ["on", 331], ["February", 334], ["7", 343], [",", 344], ["2016", 346], [",", 350], ["at", 352], ["Levi", 
355], ["'s", 359], ["Stadium", 362], ["in", 370], ["the", 373], ["San", 377], ["Francisco", 381], ["Bay", 391], ["Area", 395], ["at", 400], ["Santa", 403], ["Clara", 409], [",", 414], ["California", 416], [".", 426], ["As", 428], ["this", 431], ["was", 436], ["the", 440], ["50th", 444], ["Super", 449], ["Bowl", 455], [",", 459], ["the", 461], ["league", 465], ["emphasized", 472], ["the", 483], ["\"", 487], ["golden", 488], ["anniversary", 495], ["\"", 506], ["with", 508], ["various", 513], ["gold", 521], ["-", 525], ["themed", 526], ["initiatives", 533], [",", 544], ["as", 546], ["well", 549], ["as", 554], ["temporarily", 557], ["suspending", 569], ["the", 580], ["tradition", 584], ["of", 594], ["naming", 597], ["each", 604], ["Super", 609], ["Bowl", 615], ["game", 620], ["with", 625], ["Roman", 630], ["numerals", 636], ["(", 645], ["under", 646], ["which", 652], ["the", 658], ["game", 662], ["would", 667], ["have", 673], ["been", 678], ["known", 683], ["as", 689], ["\"", 692], ["Super", 693], ["Bowl", 699], ["L", 704], ["\"", 705], [")", 706], [",", 707], ["so", 709], ["that", 712], ["the", 717], ["logo", 721], ["could", 726], ["prominently", 732], ["feature", 744], ["the", 752], ["Arabic", 756], ["numerals", 763], ["50", 772], [".", 774]]}
    
    opened by ednussi 2
  • create_pretraining_data.py has unknown argument FLAGS.ngrams_file

    While running the script with the default command given in https://github.com/oriram/splinter:

    cd pretraining
    python create_pretraining_data.py \
        --input_file=$INPUT_PATTERN \
        --output_dir=$OUTPUT_DIR \
        --vocab_file=vocabs/bert-cased-vocab.txt \
        --do_lower_case=False \
        --do_whole_word_mask=False \
        --max_seq_length=512 \
        --num_processes=63 \
        --dupe_factor=5 \
        --max_span_length=10 \
        --recurring_span_selection=True \
        --only_recurring_span_selection=True \
        --max_questions_per_seq=30

    The script fails because no ngrams_file flag is defined, and the following error occurs:

    Traceback (most recent call last):
      File "create_pretraining_data.py", line 453, in <module>
        tf.app.run()
      File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
        _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
      File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
        _run_main(main, args)
      File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
        sys.exit(main(argv))
      File "create_pretraining_data.py", line 441, in main
        with tf.gfile.GFile(FLAGS.ngrams_file, "w") as writer:
      File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/platform/flags.py", line 85, in __getattr__
        return wrapped.__getattr__(name)
      File "/usr/local/lib/python3.7/dist-packages/absl/flags/_flagvalues.py", line 480, in __getattr__
        raise AttributeError(name)
    AttributeError: ngrams_file

    I suggest a small fixup: adding on line 78:
        flags.DEFINE_string("ngrams_file", None, "The file that will store the ngrams.")
    adding on line 453:
        flags.mark_flag_as_required("ngrams_file")

    and adding the ngrams_file parameter ($NGRAMS_FILE) to the command:

    cd pretraining
    python create_pretraining_data.py \
        --input_file=$INPUT_PATTERN \
        --output_dir=$OUTPUT_DIR \
        --vocab_file=vocabs/bert-cased-vocab.txt \
        --do_lower_case=False \
        --do_whole_word_mask=False \
        --max_seq_length=512 \
        --num_processes=63 \
        --dupe_factor=5 \
        --max_span_length=10 \
        --recurring_span_selection=True \
        --only_recurring_span_selection=True \
        --ngrams_file=$NGRAMS_FILE \
        --max_questions_per_seq=30

    opened by ShimonMalnick 1
  • Update README.md

    unzip mrqa-few-shot.zip to mrqa-few-shot

    To make the following script work, mrqa-few-shot.zip should be unzipped to mrqa-few-shot: https://github.com/oriram/splinter/blob/55866bb87829ee5d0f5981667af51acda95e00cb/README.md#L117-L126

    opened by Liangtaiwan 0
  • Replicating results using huggingface splinter tokenizer and model

    Hi, I enjoyed reading your paper very much and am trying to replicate the results with splinter-large, but I have not been able to replicate the fine-tuning results with the huggingface models. Is this because the Splinter tokenizer adds a [SEP] token after the [QUESTION] token, whereas during pretraining the [QUESTION] token is in the same sequence as the answer?

    opened by sammys377 3