NAACL2021 - COIL Contextualized Lexical Retriever

Overview

COIL

Repo for our NAACL paper, COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. The code covers learning COIL models as well as encoding and retrieving with a COIL index.

The code was refactored from our original experiment version to use the huggingface Trainer interface for future compatibility.

Contextualized Exact Lexical Match

COIL systems are based on the idea of contextualized exact lexical match. They replace the term-frequency-based term matching of classical systems like BM25 with similarities between contextualized word representations, and thereby gain the ability to model matching in context. Meanwhile, COIL confines itself to comparing exactly matched lexical tokens and can therefore retrieve efficiently with inverted-list data structures. Details can be found in our paper.
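As a rough illustration of the idea, the sketch below scores one query-passage pair in pure Python: each query token looks only at passage tokens with the same token id and keeps the best dot product, and a [CLS] similarity is added on top. The function names and toy vectors are assumptions made for this example and are not part of the code base; see the paper for the exact formulation.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def coil_tok_score(q_ids: List[int], q_vecs: List[Vector],
                   d_ids: List[int], d_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best match among identical passage tokens."""
    # Group passage vectors by token id, like a tiny per-passage inverted list.
    by_token: Dict[int, List[Vector]] = {}
    for tok, vec in zip(d_ids, d_vecs):
        by_token.setdefault(tok, []).append(vec)

    score = 0.0
    for tok, q_vec in zip(q_ids, q_vecs):
        matches = by_token.get(tok)
        if matches:  # query tokens absent from the passage contribute nothing
            score += max(dot(q_vec, d_vec) for d_vec in matches)
    return score


def coil_full_score(q_cls: Vector, d_cls: Vector,
                    q_ids: List[int], q_vecs: List[Vector],
                    d_ids: List[int], d_vecs: List[Vector]) -> float:
    """Token-level score plus a dense [CLS] similarity term."""
    return dot(q_cls, d_cls) + coil_tok_score(q_ids, q_vecs, d_ids, d_vecs)


# Toy example with made-up token ids and 2-dimensional vectors.
q_ids, q_vecs = [7, 9], [[1.0, 0.0], [0.0, 1.0]]
d_ids, d_vecs = [7, 7, 3], [[0.5, 0.0], [2.0, 0.0], [9.0, 9.0]]
print(coil_tok_score(q_ids, q_vecs, d_ids, d_vecs))  # 2.0
print(coil_full_score([1.0, 1.0], [0.5, 0.5], q_ids, q_vecs, d_ids, d_vecs))  # 3.0
```

Token 7 occurs twice in the passage and only its best match (2.0) counts; token 9 has no exact match and token 3 is never compared, which is what keeps the computation restricted to inverted-list lookups.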

Dependencies

The code has been tested with,

pytorch==1.8.1
transformers==4.2.1
datasets==1.1.3

To use the retriever, you need in addition,

torch_scatter==2.0.6
faiss==1.7.0

Resource

MSMARCO Passage Ranking

Here we present two systems: one uses hard negatives (HN) and the other does not. COIL w/o HN is trained with BM25 negatives, and COIL w/ HN is trained in addition with hard negatives mined with another trained COIL.

Configuration | MARCO DEV MRR@10 | TREC DL19 NDCG@5 | TREC DL19 NDCG@10 | Checkpoint | MARCO Train Ranking | MARCO Dev Ranking
COIL w/o HN | 0.353 | 0.7285 | 0.7136 | model-checkpoint.tar.gz | train-ranking.tar.gz | dev-ranking.tsv
COIL w/ HN | 0.373 | 0.7453 | 0.7055 | hn-checkpoint.tar.gz | train-ranking.tar.gz | dev-ranking.tsv
  • Right Click to Download.
  • The COIL w/o HN model is from a rerun, as we lost the original checkpoint. Dev performance differs slightly (by about 0.5%), and there is some improvement on the DL2019 test set.
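Hard-negative mining, as mentioned above, can be sketched as follows: take a trained model's ranking for each training query and keep the top-ranked passages that are not labelled positive. This is only an illustration. The function name, the in-memory input format and the `n_negatives` parameter are assumptions, and the depth or sampling scheme actually used for the released checkpoint is not described here.

```python
from typing import Dict, List, Set


def mine_hard_negatives(ranking: Dict[str, List[str]],
                        positives: Dict[str, Set[str]],
                        n_negatives: int = 2) -> Dict[str, List[str]]:
    """Pick the highest-ranked passages that are not labelled positive."""
    mined = {}
    for qid, ranked_pids in ranking.items():
        pos = positives.get(qid, set())
        # Preserve ranking order so the hardest (highest-scored) negatives come first.
        negatives = [pid for pid in ranked_pids if pid not in pos]
        mined[qid] = negatives[:n_negatives]
    return mined


# Hypothetical ranking for one query; "p1" is the labelled positive.
ranking = {"q1": ["p3", "p1", "p7", "p9"]}
positives = {"q1": {"p1"}}
print(mine_hard_negatives(ranking, positives))  # {'q1': ['p3', 'p7']}
```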

Tokenized data and model checkpoint link

Hard negative data and model checkpoint link

more to be added soon

Usage

The following sections walk through how to use this code base to train and retrieve over the MSMARCO passage ranking data set.

Training

You can download the train file psg-train.tar.gz for BERT from our resource link. Alternatively, you can run pre-processing by yourself following the pre-processing instructions.

Extract the training set from the tarball and run the following command to launch training for MSMARCO passage ranking.

python run_marco.py \
  --output_dir $OUTDIR \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --save_steps 4000 \
  --train_dir /path/to/psg-train \
  --q_max_len 16 \
  --p_max_len 128 \
  --fp16 \
  --per_device_train_batch_size 8 \
  --train_group_size 8 \
  --cls_dim 768 \
  --token_dim 32 \
  --warmup_ratio 0.1 \
  --learning_rate 5e-6 \
  --num_train_epochs 5 \
  --overwrite_output_dir \
  --dataloader_num_workers 16 \
  --no_sep \
  --pooling max
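To give a feel for what `--train_group_size 8` means, the sketch below draws one training group from a query record. It assumes that a group consists of one positive passage followed by `group_size - 1` negatives, and that negatives are drawn with replacement when a query has too few of them; both points are assumptions for illustration, and `sample_group` is a hypothetical helper rather than a function in this repo.

```python
import random
from typing import Dict, List


def sample_group(record: Dict, group_size: int, rng: random.Random) -> List[List[int]]:
    """Return [positive, negative_1, ..., negative_{group_size-1}] token-id lists."""
    positive = rng.choice(record["pos"])["passage"]
    negatives = record["neg"]
    n_neg = group_size - 1
    if len(negatives) >= n_neg:
        chosen = rng.sample(negatives, n_neg)  # without replacement
    else:
        # Too few negatives: sample with replacement so the group size stays fixed.
        chosen = rng.choices(negatives, k=n_neg)
    return [positive] + [neg["passage"] for neg in chosen]


# Toy record with made-up token ids, following the training data format.
record = {
    "qry": {"qid": "q1", "query": [11, 12]},
    "pos": [{"pid": "p1", "passage": [1, 2]}],
    "neg": [{"pid": "n1", "passage": [3]},
            {"pid": "n2", "passage": [4]},
            {"pid": "n3", "passage": [5]}],
}
group = sample_group(record, group_size=3, rng=random.Random(0))
print(len(group), group[0])  # 3 [1, 2]
```

Under these assumptions, and if the batch size counts queries, the flags above would give 8 queries x 8 passages = 64 passages per device per step.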

Encoding

After training, you can encode the corpus splits and queries.

You can download pre-processed data for BERT, corpus.tar.gz, queries.{dev, eval}.small.json here.

for i in $(seq -f "%02g" 0 99)
do
  # -p avoids an error if the split directory already exists
  mkdir -p ${ENCODE_OUT_DIR}/split${i}
  python run_marco.py \
    --output_dir $ENCODE_OUT_DIR \
    --model_name_or_path $CKPT_DIR \
    --tokenizer_name bert-base-uncased \
    --cls_dim 768 \
    --token_dim 32 \
    --do_encode \
    --no_sep \
    --p_max_len 128 \
    --pooling max \
    --fp16 \
    --per_device_eval_batch_size 128 \
    --dataloader_num_workers 12 \
    --encode_in_path ${TOKENIZED_DIR}/split${i} \
    --encoded_save_path ${ENCODE_OUT_DIR}/split${i}
done

If you are on a cluster, the encoding loop can be parallelized. For example, on a SLURM cluster, use srun,

for i in $(seq -f "%02g" 0 99)
do
  mkdir -p ${ENCODE_OUT_DIR}/split${i}
  # each srun is sent to the background so the splits are encoded in parallel
  srun --ntasks=1 -c4 --mem=16000 -t0 --gres=gpu:1 python run_marco.py \
    --output_dir $ENCODE_OUT_DIR \
    --model_name_or_path $CKPT_DIR \
    --tokenizer_name bert-base-uncased \
    --cls_dim 768 \
    --token_dim 32 \
    --do_encode \
    --no_sep \
    --p_max_len 128 \
    --pooling max \
    --fp16 \
    --per_device_eval_batch_size 128 \
    --dataloader_num_workers 12 \
    --encode_in_path ${TOKENIZED_DIR}/split${i} \
    --encoded_save_path ${ENCODE_OUT_DIR}/split${i} &
done
# block until all background encoding jobs have finished
wait

Then encode the queries,

python run_marco.py \
  --output_dir $ENCODE_QRY_OUT_DIR \
  --model_name_or_path $CKPT_DIR \
  --tokenizer_name bert-base-uncased \
  --cls_dim 768 \
  --token_dim 32 \
  --do_encode \
  --p_max_len 16 \
  --fp16 \
  --no_sep \
  --pooling max \
  --per_device_eval_batch_size 128 \
  --dataloader_num_workers 12 \
  --encode_in_path $TOKENIZED_QRY_PATH \
  --encoded_save_path $ENCODE_QRY_OUT_DIR

Note that here p_max_len always controls the maximum length of the encoded text, regardless of the input type.

Retrieval

To do retrieval, run the following steps,

(Note that there is no dependency in the for loop within each step, meaning that if you are on a cluster, you can distribute the jobs across nodes using srun or qsub.)

  1. build document index shards
for i in $(seq 0 9)
do
 python retriever/sharding.py \
   --n_shards 10 \
   --shard_id $i \
   --dir $ENCODE_OUT_DIR \
   --save_to $INDEX_DIR \
   --use_torch
done
  2. reformat encoded query
python retriever/format_query.py \
  --dir $ENCODE_QRY_OUT_DIR \
  --save_to $QUERY_DIR \
  --as_torch
  3. retrieve from each shard
for i in $(seq -f "%02g" 0 9)
do
  python retriever/retriever-compat.py \
      --query $QUERY_DIR \
      --doc_shard $INDEX_DIR/shard_${i} \
      --top 1000 \
      --save_to ${SCORE_DIR}/intermediate/shard_${i}.pt
done
  4. merge scores from all shards
python retriever/merger.py \
  --score_dir ${SCORE_DIR}/intermediate/ \
  --query_lookup  ${QUERY_DIR}/cls_ex_ids.pt \
  --depth 1000 \
  --save_ranking_to ${SCORE_DIR}/rank.txt

python data_helpers/msmarco-passage/score_to_marco.py \
  --score_file ${SCORE_DIR}/rank.txt
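The merge step can be pictured with the following sketch, which pools the per-shard candidate lists of each query and keeps the `depth` best. It is not the code of retriever/merger.py: the in-memory format (a dict from query id to `(score, pid)` pairs) and the name `merge_shards` are assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Tuple

ShardResult = Dict[str, List[Tuple[float, str]]]  # qid -> [(score, pid), ...]


def merge_shards(shard_results: List[ShardResult], depth: int) -> ShardResult:
    """Combine per-shard candidates and keep the `depth` best per query."""
    pooled: ShardResult = {}
    for result in shard_results:
        for qid, hits in result.items():
            pooled.setdefault(qid, []).extend(hits)
    # nlargest returns hits sorted by descending score (ties broken by pid).
    return {qid: heapq.nlargest(depth, hits) for qid, hits in pooled.items()}


# Two hypothetical shards with made-up scores and passage ids.
shard_a = {"q1": [(3.0, "p1"), (1.0, "p2")]}
shard_b = {"q1": [(2.0, "p9")], "q2": [(0.5, "p4")]}
print(merge_shards([shard_a, shard_b], depth=2))
# {'q1': [(3.0, 'p1'), (2.0, 'p9')], 'q2': [(0.5, 'p4')]}
```

Because every passage lives in exactly one shard, retrieving the top 1000 from each shard and then merging to depth 1000 gives the same result as ranking over the whole collection at once.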

Note that this compat(ible) version of the retriever differs from our internal retriever. It relies on the torch_scatter package for the scatter operation, so that the code stays pure Python and works easily across platforms. We do notice that, on our system, torch_scatter does not scale well with the number of cores. We may release a faster version of the retriever in the future, which would require some compilation.
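For intuition about what the scatter operation is used for, here is a small pure-Python sketch of retrieval over an inverted list: postings are grouped by token id, each query token reduces its matching postings to a per-passage maximum, and those maxima are accumulated into passage totals. It mirrors the role that a scatter-style reduction plays, but it is not the retriever-compat.py implementation; the data layout and the names `build_index` and `retrieve` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Posting = Tuple[str, Vector]  # (passage id, token vector)


def build_index(docs: Dict[str, Tuple[List[int], List[Vector]]]) -> Dict[int, List[Posting]]:
    """Map each token id to the contextualized vectors of its occurrences."""
    index: Dict[int, List[Posting]] = defaultdict(list)
    for pid, (tok_ids, tok_vecs) in docs.items():
        for tok, vec in zip(tok_ids, tok_vecs):
            index[tok].append((pid, vec))
    return index


def retrieve(index: Dict[int, List[Posting]], q_ids: List[int],
             q_vecs: List[Vector], top: int) -> List[Tuple[str, float]]:
    """Score only passages that share at least one token with the query."""
    totals: Dict[str, float] = defaultdict(float)
    for tok, q_vec in zip(q_ids, q_vecs):
        best: Dict[str, float] = {}  # per-passage max for this query token
        for pid, d_vec in index.get(tok, []):
            sim = sum(a * b for a, b in zip(q_vec, d_vec))
            if pid not in best or sim > best[pid]:
                best[pid] = sim
        for pid, sim in best.items():  # accumulate into passage totals
            totals[pid] += sim
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Toy collection with made-up token ids and 2-dimensional vectors.
docs = {
    "d1": ([7, 7, 3], [[0.5, 0.0], [2.0, 0.0], [9.0, 9.0]]),
    "d2": ([9], [[0.0, 4.0]]),
}
index = build_index(docs)
print(retrieve(index, [7, 9], [[1.0, 0.0], [0.0, 1.0]], top=2))
# [('d2', 4.0), ('d1', 2.0)]
```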

Data Format

For both training and encoding, the core code expects pre-tokenized data.

Training Data

Training data is grouped by query into one or several JSON files, where each line holds a query together with its positives and negatives.

{
    "qry": {
        "qid": str,
        "query": List[int],
    },
    "pos": List[
        {
            "pid": str,
            "passage": List[int],
        }
    ],
    "neg": List[
        {
            "pid": str,
            "passage": List[int]
        }
    ]
}
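A record in this format can be produced with the standard json module, as in the sketch below. The helper name `make_train_record` and the token ids are hypothetical; in practice the integer lists would come from the tokenizer used in pre-processing.

```python
import json
from typing import Dict, List


def make_train_record(qid: str, query: List[int],
                      pos: Dict[str, List[int]],
                      neg: Dict[str, List[int]]) -> str:
    """Serialize one query group as a single JSON line."""
    record = {
        "qry": {"qid": qid, "query": query},
        "pos": [{"pid": pid, "passage": ids} for pid, ids in pos.items()],
        "neg": [{"pid": pid, "passage": ids} for pid, ids in neg.items()],
    }
    return json.dumps(record)  # no embedded newlines, so one record per line


# Made-up ids standing in for tokenizer output.
line = make_train_record(
    qid="q1",
    query=[2054, 2003],
    pos={"p1": [1037, 3231]},
    neg={"n1": [2178, 6019], "n2": [4895]},
)
parsed = json.loads(line)
print(parsed["qry"]["qid"], len(parsed["pos"]), len(parsed["neg"]))  # q1 1 2
```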

Encoding Data

Encoding data is also formatted into one or several JSON files. Each line corresponds to one entry.

{"pid": str, "psg": List[int]}

Note that for code simplicity, we share this format for query/passage/document encoding.
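Before launching a long encoding job, it can help to check that the input files follow this format. The sketch below reads such lines and validates the field types; `read_encoding_lines` is a hypothetical helper, not part of this code base.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def read_encoding_lines(lines: Iterable[str]) -> Iterator[Tuple[str, List[int]]]:
    """Yield (pid, token ids) pairs, checking the documented field types."""
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        item = json.loads(line)
        pid, psg = item["pid"], item["psg"]
        if not isinstance(pid, str) or not all(isinstance(t, int) for t in psg):
            raise ValueError(f"line {n}: expected pid as str and psg as a list of ints")
        yield pid, psg


# The same reader works for queries, passages and documents (ids are made up).
sample = ['{"pid": "q1", "psg": [2054, 2003]}', '{"pid": "42", "psg": [1037]}']
print(list(read_encoding_lines(sample)))  # [('q1', [2054, 2003]), ('42', [1037])]
```

Since it accepts any iterable of strings, an open file handle can be passed directly.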

Comments
  • Did you remove punctuations before computing the document score?

    ColBERT removes punctuation from documents because its authors consider it useless. I wonder whether you removed punctuation when computing the overlapping tokens between query and document?

    opened by namespace-Pt 2
  • Padding Tokens - in the inverted list index

    do_encode will pad the passages to the maximum passage length. The reps will then include the representation for the padding token, which is then added to the index. The padding token in the index affects the score, because any query containing padding will now match that token.

    Should the padding token 0 be removed from the index after sharding or during the sharding process?

    opened by GeorgeSanchezTR 2
  • Error with loading dataset

    The error I get while running run_msmarco.py

    Target schema's field names are not matching the table's field names: ['qry', 'pos', 'neg'], ['neg', 'pos', 'qry']

    I have downloaded psg-train and extracted it. The path_to_tsv variable in GroupedMarcoTrainDataset has all the .json files.

    opened by nirmal2k 2
  • TSV's aren't eval compliant

    Just wanted to ping you to let you know that the hosted TSV files (at least the dev ones) for COIL aren't in the correct format for evaluation.

    EG:

    1048585	7187158	35.926036089658744
    1048585	7187160	35.790479123592384
    1048585	7187155	35.65535098314285
    1048585	7187157	34.09628629684448
    1048585	7617404	33.498324900865555
    1048585	3856131	31.57883720099926
    1048585	7617413	31.314840689301487
    1048585	7187156	31.123393774032593
    1048585	7617411	30.926150113344196
    1048585	353739	30.901350378990173
    

    I would expect:

    1048585	7187158	1
    1048585	7187160	2
    1048585	7187155	3
    1048585	7187157	4
    1048585	7617404	5
    1048585	3856131	6
    1048585	7617413	7
    1048585	7187156	8
    1048585	7617411	9
    1048585	353739	10
    

    Clearly it's no real problem as it's easy to fix locally, but I'm not sure if this was intended or not.

    opened by JMMackenzie 2
  • Small typo and bug?

    Hi @luyug, Great work!!!!

    I am trying to replicate COIL, there is some typo I noticed.

    1. in README.md Encoding section
        --token_dim 768 \  
        --cls_dim 32 \ 
    

    should be 32, 768 instead?

    2. https://github.com/luyug/COIL/blob/813a076ad7526536dad5d4fc71eee5f7f8113700/trainer.py#L42 num_training_steps should be passed into the function rather than warmup_steps?
    opened by MXueguang 2
  • Reproducing dense retriever results

    Hello! In the paper, you report a dense retriever that you trained yourselves in Tables 1 and 2 ("Dense (our train)"). Is the code to reproduce this result in this repo? And if so, do you have any pointers on how to train/evaluate one?

    Thanks!

    opened by ivanmontero 1
  • Default training command - Issues when encountering documents longer than 512

    Running the command under the training section of the README, the program fails in the first optimization step with the following message:

    /pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [535,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    

    Which is thrown from the following:

      ...                                                                                                                                        
      File "/task_runtime/src/transformers-4.2.1/src/transformers/models/bert/modeling_bert.py", line 956, in forward                                                                           
        past_key_values_length=past_key_values_length,   
      ...
      File "/miniconda/lib/python3.7/site-packages/torch/nn/functional.py", line 2043, in embedding                                                                                                 
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)                                                                                            
           RuntimeError: CUDA error: device-side assert triggered
    

    In other words, the base model (bert-base-cased) encounters an input with a larger sequence length than what it can handle (535 > 512).

    Given the above, how do you get around it, and apply your method to entire documents? (i.e., the MS MARCO Document Ranking table)

    opened by ivanmontero 1
  • training data for UniCoil - links not working

    Hi :)

    I am looking for the data for running the training script for the uniCoil model. The links under the "Resource" section have expired or do not work.

    Where can I find the data elsewhere?

    opened by lboesen 0
  • pyarrow.lib.ArrowNotImplementedError during training phrase

    I ran these commands in Google Colab with GPU

    !wget http://boston.lti.cs.cmu.edu/luyug/coil/msmarco-psg/psg-train.tar.gz
    !tar xfz psg-train.tar.gz
    !git clone https://github.com/luyug/COIL
    !pip install transformers datasets
    
    ! cd COIL && python run_marco.py --output_dir model --model_name_or_path bert-base-uncased --do_train --save_steps 4000 --train_dir ../psg-train --q_max_len 16 --p_max_len 128 --fp16 --per_device_train_batch_size 8 --train_group_size 8 --cls_dim 768 --token_dim 32 --warmup_ratio 0.1 --learning_rate 5e-6 --num_train_epochs 5 --overwrite_output_dir --dataloader_num_workers 16 --no_sep --pooling max 
    

    This is the output I got:

    fatal: destination path 'COIL' already exists and is not an empty directory.
    Requirement already satisfied: transformers in /usr/local/lib/python3.7/dist-packages (4.10.2)
    Requirement already satisfied: datasets in /usr/local/lib/python3.7/dist-packages (1.11.0)
    Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers) (3.0.12)
    Requirement already satisfied: tokenizers<0.11,>=0.10.1 in /usr/local/lib/python3.7/dist-packages (from transformers) (0.10.3)
    Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from transformers) (21.0)
    Requirement already satisfied: huggingface-hub>=0.0.12 in /usr/local/lib/python3.7/dist-packages (from transformers) (0.0.16)
    Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.7/dist-packages (from transformers) (4.62.0)
    Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers) (2.23.0)
    Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (1.19.5)
    Requirement already satisfied: sacremoses in /usr/local/lib/python3.7/dist-packages (from transformers) (0.0.45)
    Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (2019.12.20)
    Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.7/dist-packages (from transformers) (5.4.1)
    Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from transformers) (4.6.4)
    Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from huggingface-hub>=0.0.12->transformers) (3.7.4.3)
    Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->transformers) (2.4.7)
    Requirement already satisfied: fsspec>=2021.05.0 in /usr/local/lib/python3.7/dist-packages (from datasets) (2021.8.1)
    Requirement already satisfied: pyarrow!=4.0.0,>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from datasets) (3.0.0)
    Requirement already satisfied: multiprocess in /usr/local/lib/python3.7/dist-packages (from datasets) (0.70.12.2)
    Requirement already satisfied: xxhash in /usr/local/lib/python3.7/dist-packages (from datasets) (2.0.2)
    Requirement already satisfied: dill in /usr/local/lib/python3.7/dist-packages (from datasets) (0.3.4)
    Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from datasets) (1.1.5)
    Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (1.24.3)
    Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2021.5.30)
    Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (3.0.4)
    Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2.10)
    Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->transformers) (3.5.0)
    Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets) (2018.9)
    Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets) (2.8.2)
    Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->datasets) (1.15.0)
    Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (1.0.1)
    Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (7.1.2)
    09/12/2021 02:05:20 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: True
    09/12/2021 02:05:20 - INFO - __main__ -   Training/evaluation parameters COILTrainingArguments(
    _n_gpu=1,
    adafactor=False,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    dataloader_drop_last=False,
    dataloader_num_workers=16,
    dataloader_pin_memory=True,
    ddp_find_unused_parameters=None,
    debug=[],
    deepspeed=None,
    disable_tqdm=False,
    do_encode=False,
    do_eval=False,
    do_predict=False,
    do_train=True,
    eval_accumulation_steps=None,
    eval_steps=None,
    evaluation_strategy=IntervalStrategy.NO,
    fp16=True,
    fp16_backend=auto,
    fp16_full_eval=False,
    fp16_opt_level=O1,
    gradient_accumulation_steps=1,
    greater_is_better=None,
    group_by_length=False,
    ignore_data_skip=False,
    label_names=None,
    label_smoothing_factor=0.0,
    learning_rate=5e-06,
    length_column_name=length,
    load_best_model_at_end=False,
    local_rank=-1,
    log_level=-1,
    log_level_replica=-1,
    log_on_each_node=True,
    logging_dir=model/runs/Sep12_02-05-20_2992d74c8c9d,
    logging_first_step=False,
    logging_steps=500,
    logging_strategy=IntervalStrategy.STEPS,
    lr_scheduler_type=SchedulerType.LINEAR,
    max_grad_norm=1.0,
    max_steps=-1,
    metric_for_best_model=None,
    mp_parameters=,
    no_cuda=False,
    num_train_epochs=5.0,
    output_dir=model,
    overwrite_output_dir=True,
    past_index=-1,
    per_device_eval_batch_size=8,
    per_device_train_batch_size=8,
    prediction_loss_only=False,
    push_to_hub=False,
    push_to_hub_model_id=model,
    push_to_hub_organization=None,
    push_to_hub_token=None,
    remove_unused_columns=True,
    report_to=['tensorboard'],
    resume_from_checkpoint=None,
    run_name=model,
    save_on_each_node=False,
    save_steps=4000,
    save_strategy=IntervalStrategy.STEPS,
    save_total_limit=None,
    seed=42,
    sharded_ddp=[],
    skip_memory_metrics=True,
    tpu_metrics_debug=False,
    tpu_num_cores=None,
    use_legacy_prediction_loop=False,
    warmup_ratio=0.1,
    warmup_steps=0,
    weight_decay=0.0,
    )
    09/12/2021 02:05:20 - INFO - __main__ -   Model params ModelArguments(model_name_or_path='bert-base-uncased', config_name=None, tokenizer_name=None, cache_dir=None, token_dim=32, cls_dim=768, token_rep_relu=False, token_norm_after=False, cls_norm_after=False, x_device_negatives=False, pooling='max', no_sep=True, no_cls=False, cls_only=False)
    09/12/2021 02:05:20 - INFO - filelock -   Lock 140242889433168 acquired on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
    Downloading: 100% 570/570 [00:00<00:00, 476kB/s]
    09/12/2021 02:05:21 - INFO - filelock -   Lock 140242889433168 released on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
    09/12/2021 02:05:21 - INFO - filelock -   Lock 140242889394384 acquired on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
    Downloading: 100% 28.0/28.0 [00:00<00:00, 33.2kB/s]
    09/12/2021 02:05:21 - INFO - filelock -   Lock 140242889394384 released on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
    09/12/2021 02:05:21 - INFO - filelock -   Lock 140242889467600 acquired on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
    Downloading: 100% 232k/232k [00:00<00:00, 1.10MB/s]
    09/12/2021 02:05:22 - INFO - filelock -   Lock 140242889467600 released on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
    09/12/2021 02:05:22 - INFO - filelock -   Lock 140242856865744 acquired on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
    Downloading: 100% 466k/466k [00:00<00:00, 3.41MB/s]
    09/12/2021 02:05:23 - INFO - filelock -   Lock 140242856865744 released on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
    09/12/2021 02:05:23 - INFO - filelock -   Lock 140242835342416 acquired on /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f.lock
    Downloading: 100% 440M/440M [00:12<00:00, 35.5MB/s]
    09/12/2021 02:05:36 - INFO - filelock -   Lock 140242835342416 released on /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f.lock
    Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
    - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
    - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    09/12/2021 02:05:38 - WARNING - datasets.builder -   Using custom data configuration default-ac64881b8f58639a
    Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-ac64881b8f58639a/0.0.0/45636811569ec4a6630521c18235dfbbab83b7ab572e3393c5ba68ccabe98264...
    Traceback (most recent call last):
      File "run_marco.py", line 302, in <module>
        main()
      File "run_marco.py", line 146, in main
        data_args, data_args.train_path, tokenizer=tokenizer,
      File "/content/COIL/marco_datasets.py", line 37, in __init__
        'passage': [datasets.Value('int32')],
      File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 852, in load_dataset
        use_auth_token=use_auth_token,
      File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 616, in download_and_prepare
        dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
      File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 693, in _download_and_prepare
        self._prepare_split(split_generator, **prepare_split_kwargs)
      File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 1163, in _prepare_split
        generator, unit=" tables", leave=False, disable=bool(logging.get_verbosity() == logging.NOTSET)
      File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1185, in __iter__
        for obj in iterable:
      File "/usr/local/lib/python3.7/dist-packages/datasets/packaged_modules/json/json.py", line 144, in _generate_tables
        yield (file_idx, batch_idx), self._cast_classlabels(pa_table)
      File "/usr/local/lib/python3.7/dist-packages/datasets/packaged_modules/json/json.py", line 76, in _cast_classlabels
        [pa_table[name] for name in self.config.features], schema=self.config.schema
      File "pyarrow/table.pxi", line 1515, in pyarrow.lib.Table.from_arrays
      File "pyarrow/table.pxi", line 553, in pyarrow.lib._sanitize_arrays
      File "pyarrow/array.pxi", line 328, in pyarrow.lib.asarray
      File "pyarrow/table.pxi", line 277, in pyarrow.lib.ChunkedArray.cast
      File "/usr/local/lib/python3.7/dist-packages/pyarrow/compute.py", line 281, in cast
        return call_function("cast", [arr], options)
      File "pyarrow/_compute.pyx", line 465, in pyarrow._compute.call_function
      File "pyarrow/_compute.pyx", line 294, in pyarrow._compute.Function.call
      File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
    pyarrow.lib.ArrowNotImplementedError: Unsupported cast from struct<qid: string, query: list<item: int64>> to struct using function cast_struct
    
    opened by udaygoyat45 1
  • Question about COIL-full

    Awesome idea and exciting experimental results. Still, I am confused about the implementation of COIL-full: when doing dense retrieval, can we use ANN search with FAISS to speed it up, or is it indeed brute-force search? Which implementation was used in the paper's experiments?

    opened by kinglai 1
  • Question about the result on "MSMARCO Passage Leaderboard"

    I notice that C-Coil is at the top of the "MS MARCO Passage Ranking Leaderboard", where the results are "0.427 on eval" and "0.443 on dev". But the result in https://github.com/luyug/COIL/tree/main/examples/c-coil is "0.3734 on MARCO DEV". I wonder why the gap is so big? Is it because of ensembling? I didn't find any relevant explanation in the COIL paper.

    opened by hangzhang-nlp 0
  • Retrieval latency is very large with one thread

    Hi, thank you for sharing this code. I tested the latency of COIL using retriever-fast.py with one thread and one shard. Batch size is set to one. The CPU is an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. However, the query latency is roughly 4 seconds, which is substantially larger than the 0.38s reported in the paper. I wonder why this happens. Does the paper use multiple threads to evaluate the latency?

    opened by jingtaozhan 0
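On the TSV format issue raised in the comments above (score column where a rank is expected), a local fix could look like the sketch below. It assumes tab-separated `qid`, `pid`, `score` rows as shown in that issue; `scores_to_ranks` is a hypothetical helper and not part of this repo.

```python
from collections import defaultdict
from typing import Iterable, List


def scores_to_ranks(lines: Iterable[str]) -> List[str]:
    """Rewrite 'qid<TAB>pid<TAB>score' rows as 'qid<TAB>pid<TAB>rank' rows."""
    by_query = defaultdict(list)
    for line in lines:
        if line.strip():
            qid, pid, score = line.rstrip("\n").split("\t")
            by_query[qid].append((float(score), pid))
    out = []
    for qid, hits in by_query.items():  # dicts keep first-seen query order
        hits.sort(key=lambda hit: hit[0], reverse=True)  # stable for tied scores
        out.extend(f"{qid}\t{pid}\t{rank}"
                   for rank, (_, pid) in enumerate(hits, start=1))
    return out


# Rows taken from the issue, deliberately shuffled.
rows = [
    "1048585\t7187160\t35.790479123592384",
    "1048585\t7187158\t35.926036089658744",
    "1048585\t7187155\t35.65535098314285",
]
for row in scores_to_ranks(rows):
    print(row)
```

Ranks start at 1 within each query, and rows with equal scores keep their input order.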
Owner
Luyu Gao
NLP Research Master@LTI, CMU