DensePhrases
DensePhrases is an extractive phrase search tool that takes natural language inputs. From 5 million Wikipedia articles, it can retrieve phrase-level answers to your questions or find entities related to (subject, relation) pairs in real time. Due to the extractive nature of DensePhrases, it always provides an evidence passage for each phrase. Please see our paper Learning Dense Representations of Phrases at Scale (Lee et al., 2021) for more details.
***** You can try out our online demo of DensePhrases here! *****
Updates
***** New June 14, 2021: Major code updates *****
Quick Links
- Installation
- Resources
- Creating a Custom Phrase Index with DensePhrases
- Playing with a DensePhrases Demo
- Training, Indexing and Inference
- Pre-processing
Installation
# Install torch with conda (please check your CUDA version)
conda create -n densephrases python=3.7
conda activate densephrases
conda install pytorch=1.7.1 cudatoolkit=11.0 -c pytorch
# Install apex
git clone https://www.github.com/nvidia/apex.git
cd apex
python setup.py install
cd ..
# Install DensePhrases
git clone https://github.com/princeton-nlp/DensePhrases.git
cd DensePhrases
pip install -r requirements.txt
python setup.py develop
Resources
Before downloading the required files below, please set the default directories as follows and ensure that you have enough storage to download and unzip the files:
# Running config.sh will set the following three environment variables:
# DATA_DIR: for datasets (including 'kilt', 'open-qa', 'single-qa', 'truecase', 'wikidump')
# SAVE_DIR: for pre-trained models or index; new models and index will also be saved here
# CACHE_DIR: for cache files from huggingface transformers
source config.sh
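If you prefer to set these variables yourself (or want to check them before running config.sh), here is a minimal sketch; the paths below are placeholders and should point to directories with enough free space:
# Illustrative manual setup (placeholder paths; adjust to your machine)
export DATA_DIR=/path/to/densephrases-data
export SAVE_DIR=/path/to/densephrases-save
export CACHE_DIR=/path/to/huggingface-cache
mkdir -p $DATA_DIR $SAVE_DIR $CACHE_DIR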
To download the resources described below, you can use download.sh as follows:
# Use bash script to download data (change data to models or index accordingly)
source download.sh
Choose a resource to download [data/wiki/models/index]: data
data will be downloaded at ...
...
Downloading data done!
1. Datasets
- Datasets (1GB) - Pre-processed datasets including reading comprehension, generated questions, open-domain QA, and slot filling. Download and unzip it under $DATA_DIR or use download.sh.
- Wikipedia dumps (5GB) - Pre-processed Wikipedia dumps in different sizes. See here for more details. Download and unzip it under $DATA_DIR or use download.sh.
# Check if the download is complete
ls $DATA_DIR
kilt open-qa single-qa truecase wikidump
2. Pre-trained Models
- Pre-trained models (8GB) - Pre-trained DensePhrases models (including cross-encoder teacher models). Download and unzip it under $SAVE_DIR or use download.sh.
# Check if the download is complete
ls $SAVE_DIR
densephrases-multi densephrases-multi-query-nq ... spanbert-base-cased-squad
You can also download each of the pre-trained DensePhrases models individually, as listed below.
Model | Evaluation (Test) | OpenQA (EM) |
---|---|---|
densephrases-multi | NaturalQuestions | 31.9 |
densephrases-multi-query-nq | NaturalQuestions | 41.3 |
densephrases-multi-query-trec | CuratedTREC | 52.9 |
densephrases-multi-query-wq | WebQuestions | 41.5 |
densephrases-multi-query-tqa | TriviaQA | 53.5 |
densephrases-multi-query-sqd | SQuAD | 34.5 |
densephrases-multi-query-multi | NaturalQuestions | 40.9 |
Model | Evaluation (Test) | SlotFilling (KILT-AC) |
---|---|---|
densephrases-multi-query-trex | T-REx | 22.3 |
densephrases-multi-query-zsre | Zero shot RE | 40.0 |
- densephrases-multi: DensePhrases trained on multiple reading comprehension datasets (C_phrase = {NQ, WQ, TREC, TQA, SQuAD}) without any query-side fine-tuning
- densephrases-multi-query-*: densephrases-multi query-side fine-tuned on *
- densephrases-multi-query-multi: densephrases-multi query-side fine-tuned on 5 open-domain QA datasets (NQ, WQ, TREC, TQA, SQuAD); used for the demo
- spanbert-base-cased-*: cross-encoder teacher models trained on *
Test set performance was measured on the phrase index for the full Wikipedia scale. Note that the query-side fine-tuned models are trained with a different index structure (IVFOPQ) than the IVFSQ index described in the paper, hence the slightly different performance.
3. Phrase Index
Please note that you don't need to download this phrase index unless you want to work on the full Wikipedia scale.
- densephrases-multi_wiki-20181220 (74GB) - Phrase index for the 20181220 version of Wikipedia. Download and unzip it under $SAVE_DIR or use download.sh.
# Check if the download is complete
ls $SAVE_DIR
... densephrases-multi_wiki-20181220
From 320GB to 74GB
Since hosting the 320GB phrase index described in our paper is costly, we provide a much smaller index (74GB), which reflects our recent efforts to reduce the index size using Optimized Product Quantization with an Inverted File System (IVFOPQ). With IVFOPQ, you do not need any SSDs for real-time inference (the index is loaded into RAM), and you can also reconstruct the phrase vectors from it for query-side fine-tuning (hence you do not need the additional 500GB).
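For intuition, the following is a minimal sketch of building an IVFOPQ index with faiss on random vectors; the dimension, cluster count, and code size are illustrative placeholders and do not correspond to the released index.
import numpy as np
import faiss

d = 768                                            # illustrative phrase-vector dimension
xb = np.random.rand(10000, d).astype("float32")    # stand-in for phrase vectors

# OPQ rotation + inverted file (32 coarse clusters) + 96-byte product quantization
index = faiss.index_factory(d, "OPQ96,IVF32,PQ96")
index.train(xb)   # learns the OPQ rotation, coarse centroids, and PQ codebooks
index.add(xb)     # compressed codes stay in RAM, so no SSD is needed at query time

faiss.extract_index_ivf(index).nprobe = 4          # number of inverted lists to scan per query
distances, ids = index.search(xb[:5], 10)          # top-10 approximate neighbors for 5 queries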
Creating a Custom Phrase Index with DensePhrases
Basically, DensePhrases uses a text corpus pre-processed in the following format:
{
"data": [
{
"title": "America's Got Talent (season 4)",
"paragraphs": [
{
"context": " The fourth season of \"America's Got Talent\", ... Country singer Kevin Skinner was named the winner on September 16, 2009 ..."
},
{
"context": " Season four was Hasselhoff's final season as a judge. This season started broadcasting live on August 4, 2009. ..."
},
...
]
},
]
}
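As a sketch, you could generate a corpus file in this format from your own documents with a few lines of Python; the file name and article content below are made up for illustration.
import json

# Hypothetical articles; each natural paragraph becomes one "context"
articles = [
    {
        "title": "My Custom Article",
        "paragraphs": [
            {"context": " First natural paragraph of the article ..."},
            {"context": " Second natural paragraph of the article ..."},
        ],
    },
]

with open("sample/my_articles.json", "w") as f:
    json.dump({"data": articles}, f)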
Each context contains a single natural paragraph of variable length. See sample/articles.json for an example. The following command creates phrase vectors for the custom corpus (sample/articles.json) with the densephrases-multi model.
python generate_phrase_vecs.py \
--model_type bert \
--pretrained_name_or_path SpanBERT/spanbert-base-cased \
--data_dir ./ \
--cache_dir $CACHE_DIR \
--predict_file sample/articles.json \
--do_dump \
--max_seq_length 512 \
--doc_stride 500 \
--fp16 \
--filter_threshold -2.0 \
--append_title \
--load_dir $SAVE_DIR/densephrases-multi \
--output_dir $SAVE_DIR/densephrases-multi_sample
The phrase vectors (and their metadata) will be saved under $SAVE_DIR/densephrases-multi_sample/dump/phrase. Now you need to create a faiss index as follows:
python build_phrase_index.py \
$SAVE_DIR/densephrases-multi_sample/dump all \
--replace \
--num_clusters 32 \
--fine_quant OPQ96 \
--doc_sample_ratio 1.0 \
--vec_sample_ratio 1.0 \
--cuda
# Compress metadata for faster inference
python scripts/preprocess/compress_metadata.py \
--input_dump_dir $SAVE_DIR/densephrases-multi_sample/dump/phrase \
--output_dir $SAVE_DIR/densephrases-multi_sample/dump
Note that this example uses a very small text corpus; the hyperparameters for build_phrase_index.py on a larger-scale corpus can be found here. The phrase index (with IVFOPQ) will be saved under $SAVE_DIR/densephrases-multi_sample/dump/start. You can use this phrase index to run a demo or evaluate your own set of queries. For instance, you can feed a set of questions (sample/questions.json) to the custom phrase index as follows:
python eval_phrase_retrieval.py \
--run_mode eval \
--cuda \
--dump_dir $SAVE_DIR/densephrases-multi_sample/dump \
--index_dir start/32_flat_OPQ96 \
--query_encoder_path $SAVE_DIR/densephrases-multi \
--test_path sample/questions.json \
--save_pred \
--truecase
The prediction file will be saved as $SAVE_DIR/densephrases-multi/pred/questions_3_top10.pred, which shows the answer phrases and the passages that contain the phrases:
{
"1": {
"question": "Who won season 4 of America's got talent",
...
"prediction": [
"Kevin Skinner",
...
],
"evidence": [
"The fourth season of \"America's Got Talent\", an American television reality show talent competition, premiered on the NBC network on June 23, 2009. Country singer Kevin Skinner was named the winner on September 16, 2009.",
...
],
}
...
}
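A quick way to inspect such a prediction file is a few lines of Python (a sketch; the exact set of keys may differ slightly across versions):
import json
import os

pred_path = os.path.join(os.environ["SAVE_DIR"], "densephrases-multi/pred/questions_3_top10.pred")
with open(pred_path) as f:
    preds = json.load(f)

for qid, item in preds.items():
    print(qid, item["question"])
    print("  top-1 phrase:", item["prediction"][0])
    print("  evidence    :", item["evidence"][0][:100], "...")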
For creating a large-scale phrase index (e.g., Wikipedia), see dump_phrases.py for an example, which is also explained here.
Playing with a DensePhrases Demo
There are two ways of using the DensePhrases demo.
- You can simply use the demo that we are serving on our server (Wikipedia scale). The running demo is using densephrases-multi-query-multi (NQ=40.8 EM) as a query encoder and densephrases-multi_wiki-20181220 as a phrase index.
- You can run the demo on your own server, where you can change the phrase index (obtained from here) or the query encoder (e.g., to densephrases-multi-query-nq).
The minimum resource requirements for running the full Wikipedia-scale demo are:
- 100GB RAM
- Single 11GB GPU (optional)
Note that, unlike previous phrase retrieval models (DenSPI, DenSPI+Sparc), you no longer need any SSDs to run the demo, but setting $SAVE_DIR to an SSD can reduce the loading time of the required resources. The following commands serve exactly the same demo as here on your http://localhost:51997.
# Serve a query encoder on port 1111
nohup python run_demo.py \
--run_mode q_serve \
--cache_dir $CACHE_DIR \
--query_encoder_path $SAVE_DIR/densephrases-multi-query-multi \
--cuda \
--max_query_length 32 \
--query_port 1111 > $SAVE_DIR/logs/q-serve_1111.log &
# Serve a phrase index on port 51997 (takes several minutes)
nohup python run_demo.py \
--run_mode p_serve \
--index_dir start/1048576_flat_OPQ96 \
--cuda \
--truecase \
--dump_dir $SAVE_DIR/densephrases-multi_wiki-20181220/dump/ \
--query_port 1111 \
--index_port 51997 > $SAVE_DIR/logs/p-serve_51997.log &
# Below are the same but simplified commands using Makefile
make q-serve MODEL_NAME=densephrases-multi-query-multi Q_PORT=1111
make p-serve DUMP_DIR=$SAVE_DIR/densephrases-multi_wiki-20181220/dump/ Q_PORT=1111 I_PORT=51997
Please change --query_encoder_path or --dump_dir if necessary, and remove --cuda for a CPU-only version. Once you set up the demo, the log files in $SAVE_DIR/logs/ will be automatically updated whenever a new question comes in. You can also send queries to your server using mini-batches of questions for faster inference.
# Test on NQ test set
python run_demo.py \
--run_mode eval_request \
--index_port 51997 \
--test_path $DATA_DIR/open-qa/nq-open/test_preprocessed.json \
--eval_batch_size 64 \
--save_pred \
--truecase
# Same command with Makefile
make eval-demo I_PORT=51997
# Result
(...)
INFO - eval_phrase_retrieval - {'exact_match_top1': 40.83102493074792, 'f1_score_top1': 48.26451418695196}
INFO - eval_phrase_retrieval - {'exact_match_top10': 60.11080332409972, 'f1_score_top10': 68.47386731458751}
INFO - eval_phrase_retrieval - Saving prediction file to $SAVE_DIR/pred/test_preprocessed_3610_top10.pred
For more details (e.g., changing the test set), please see the targets in Makefile (q-serve, p-serve, eval-demo, etc.).
DensePhrases: Training, Indexing and Inference
In this section, we introduce a step-by-step procedure to train DensePhrases, create phrase vectors and indexes, and run inference with the trained model. All of our commands here are simplified as Makefile targets, which include the exact dataset paths, hyperparameter settings, etc.
If the following test run completes without an error after the installation and the download, you are good to go!
# Test run for checking installation (takes about 10 mins; ignore the performance)
make draft MODEL_NAME=test
- A figure summarizing the overall training, indexing, and inference process is available in the repository README.
1. Training phrase and query encoders
To train DensePhrases from scratch, use run-rc-nq in Makefile, which trains DensePhrases on NQ (pre-processed for the reading comprehension task) and evaluates it on reading comprehension as well as on (semi-)open-domain QA. You can simply change the training set by modifying the dependencies of run-rc-nq (e.g., nq-rc-data => sqd-rc-data and nq-param => sqd-param for training on SQuAD). You'll need a single 24GB GPU for training DensePhrases on reading comprehension tasks, but you can use smaller GPUs by setting --gradient_accumulation_steps appropriately.
# Train DensePhrases on NQ with Eq. 9
make run-rc-nq MODEL_NAME=densephrases-nq
run-rc-nq is composed of the following six commands (in the case of training on NQ); a minimal sketch of the weighted loss in Eq. 9 follows the list:
1. make train-rc ...: Train DensePhrases on NQ with Eq. 9 (L = lambda1 * L_single + lambda2 * L_distill + lambda3 * L_neg) using generated questions.
2. make train-rc ...: Load the DensePhrases model trained in the previous step and further train it with Eq. 9 using pre-batch negatives.
3. make gen-vecs: Generate phrase vectors for D_small (= the set of all passages in NQ dev).
4. make index-vecs: Build a phrase index for D_small.
5. make compress-meta: Compress metadata for faster inference.
6. make eval-index ...: Evaluate the phrase index on the development set questions.
At the end of step 2, you will see the performance on the reading comprehension task, where a gold passage is given (about 72.0 EM on NQ dev). Step 6 gives the performance in the semi-open-domain setting (denoted as D_small; see Table 6 in the paper), where all passages from the NQ development set are used for indexing (about 62.0 EM with NQ dev questions). The trained model will be saved under $SAVE_DIR/$MODEL_NAME. Note that during the single-passage training on NQ, we exclude some questions in the development set whose annotated answers are found in a list or a table.
2. Creating a phrase index
Let's assume that you have a pre-trained DensePhrases model named densephrases-multi, which can also be downloaded from here. Now, you can generate phrase vectors for a large-scale corpus like Wikipedia using gen-vecs-parallel. Note that you can just download the phrase index for the full Wikipedia scale and skip this section.
# Generate phrase vectors in parallel for a large-scale corpus (default = wiki-dev)
make gen-vecs-parallel MODEL_NAME=densephrases-multi START=0 END=8
The default text corpus for creating phrase vectors is wiki-dev, located in $DATA_DIR/wikidump. There are three corpus options of different scales:
- wiki-dev: 1/100 Wikipedia scale (sampled), 8 files
- wiki-dev-noise: 1/10 Wikipedia scale (sampled), 500 files
- wiki-20181220: full Wikipedia (20181220) scale, 5621 files
The wiki-dev* corpora also contain passages from the NQ development set, so you can track the performance of your model with an increasing size of the text corpus (performance usually decreases as the corpus gets larger). The phrase vectors will be saved as hdf5 files in $SAVE_DIR/$(MODEL_NAME)_(data_name)/dump (e.g., $SAVE_DIR/densephrases-multi_wiki-dev/dump), which will be referred to as $DUMP_DIR below.
Parallelization
START and END specify the file indices in the corpus (e.g., START=0 END=8 for wiki-dev and START=0 END=5621 for wiki-20181220). Each run of gen-vecs-parallel only consumes 2GB on a single GPU, and you can distribute the processes with different START and END values using slurm or a shell script (e.g., START=0 END=200, START=200 END=400, ..., START=5400 END=5621); a sketch of such a script is shown below. Distributing 28 processes over four 24GB GPUs (each processing about 200 files) can create phrase vectors for wiki-20181220 in 8 hours. Processing the entire Wikipedia requires up to 500GB of storage, and we recommend using an SSD to store the vectors if possible (a smaller corpus can be stored on an HDD).
After generating the phrase vectors, you need to create a phrase index for the sublinear time search of phrases. Here, we use IVFOPQ for the phrase index.
# Create IVFOPQ index for a set of phrase vectors
make index-vecs DUMP_DIR=$SAVE_DIR/densephrases-multi_wiki-dev/dump/
For wiki-dev-noise and wiki-20181220, you need to change the number of clusters to 101,372 and 1,048,576, respectively (simply change medium1-index in index-vecs to medium2-index or large-index). For wiki-20181220 (full Wikipedia), this takes about 1-2 days depending on the specification of your machine and requires about 100GB of RAM. For IVFSQ as described in the paper, you can use index-add and index-merge to distribute the addition of phrase vectors to the index.
You also need to compress the metadata (saved in hdf5 files together with the phrase vectors) for faster inference of DensePhrases. This is mandatory for the IVFOPQ index.
# Compress metadata of wiki-dev
make compress-meta DUMP_DIR=$SAVE_DIR/densephrases-multi_wiki-dev/dump
To evaluate the performance of DensePhrases with your phrase indexes, use eval-index.
# Evaluate on the NQ test set questions
make eval-index MODEL_NAME=densephrases-multi DUMP_DIR=$SAVE_DIR/densephrases-multi_wiki-dev/dump/
3. Query-side fine-tuning
Query-side fine-tuning makes DensePhrases a versatile tool for retrieving phrase-level knowledge given different types of input queries and answers. Although DensePhrases was trained on QA datasets, it can be adapted to non-QA style inputs such as "subject [SEP] relation", where we expect related object entities to be retrieved. It also significantly improves the performance on QA datasets by reducing the discrepancy between training and inference.
First, you need a phrase index for the full Wikipedia (wiki-20181220), which can simply be downloaded here, or a custom phrase index as described above. Given your query-answer pairs pre-processed as json files in $DATA_DIR/open-qa or $DATA_DIR/kilt, you can easily query-side fine-tune your model. For instance, the training set of T-REx ($DATA_DIR/kilt/trex/trex-train-kilt_open_10000.json) looks as follows:
{
"data": [
{
"id": "111ed80f-0a68-4541-8652-cb414af315c5",
"question": "Effie Germon [SEP] occupation",
"answers": [
"actors",
"actor",
"actress",
"actresses"
]
},
...
]
}
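To query-side fine-tune on your own (subject, relation) pairs, here is a sketch of converting such triples into this format; the output file name and example triple are made up.
import json
import uuid

# Hypothetical slot-filling examples: (subject, relation, acceptable object answers)
triples = [
    ("Marie Curie", "field of work", ["physics", "chemistry"]),
]

data = []
for subject, relation, objects in triples:
    data.append({
        "id": str(uuid.uuid4()),
        "question": f"{subject} [SEP] {relation}",
        "answers": objects,
    })

with open("my-train-kilt_open.json", "w") as f:
    json.dump({"data": data}, f)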
The following command query-side fine-tunes densephrases-multi on T-REx.
# Query-side fine-tune on T-REx (model will be saved as MODEL_NAME)
make train-query MODEL_NAME=densephrases-multi-query-trex DUMP_DIR=$SAVE_DIR/densephrases-multi_wiki-20181220/dump/
Note that the pre-trained query encoder is specified in train-query as --query_encoder_path $(SAVE_DIR)/densephrases-multi, and the new model will be saved as densephrases-multi-query-trex as specified in MODEL_NAME. You can also train on different datasets by changing the dependency trex-open-data to *-open-data (e.g., wq-open-data for WebQuestions).
IVFOPQ vs IVFSQ
Currently, train-query uses the IVFOPQ index for query-side fine-tuning, and you need to make minor changes in the code to train with an IVFSQ index. For IVFOPQ, training takes 2 to 3 hours per epoch for large datasets (NQ, TQA, SQuAD) and 3 to 8 minutes for small datasets (WQ, TREC). We recommend using IVFOPQ since it has similar or better performance than IVFSQ while being much faster. With IVFSQ, the training time is highly dependent on file I/O speed, so using SSDs is recommended for IVFSQ.
4. Inference
With any DensePhrases query encoder (e.g., densephrases-multi-query-nq) and a phrase index (e.g., densephrases-multi_wiki-20181220), you can test your queries as follows; the retrieval results will be saved as a json file with the --save_pred option:
# Evaluate on Natural Questions
make eval-index MODEL_NAME=densephrases-multi-query-nq DUMP_DIR=$SAVE_DIR/densephrases-multi_wiki-20181220/dump/
# If the demo is being served on http://localhost:51997
make eval-demo I_PORT=51997
For evaluation on different datasets, simply change the dependency of eval-index (or eval-demo) accordingly (e.g., nq-open-data to trec-open-data for evaluation on CuratedTREC). Note that the test set evaluation of slot filling tasks requires the prediction files to be uploaded to eval.ai (use the strip-kilt target in Makefile for better accuracy).
Pre-processing
At the bottom of Makefile, we list the commands that we used for pre-processing the datasets and Wikipedia. For training question generation models (T5-large), we used https://github.com/patil-suraj/question_generation (see also here for QG). Note that all datasets are already pre-processed, including the generated questions, so you do not need to run most of these scripts. For creating test sets from custom (open-domain) questions, see preprocess-openqa in Makefile.
Questions?
Feel free to email Jinhyuk Lee ([email protected]) for any questions related to the code or the paper. You can also open a GitHub issue. Please try to specify the details so that we can better understand and help you solve the problem.
Reference
Please cite our paper if you use DensePhrases in your work:
@inproceedings{lee2021learning,
title={Learning Dense Representations of Phrases at Scale},
author={Lee, Jinhyuk and Sung, Mujeen and Kang, Jaewoo and Chen, Danqi},
booktitle={Association for Computational Linguistics (ACL)},
year={2021}
}
License
Please see LICENSE for details.