This is the official implementation of "One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval".

Related tags

Deep Learning CORA
Overview

CORA

This is the official implementation of the following paper: Akari Asai, Xinyan Yu, Jungo Kasai and Hannaneh Hajishirzi. One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval. Preptint. 2021.

cora_image

In this paper, we introduce CORA, a single, unified multilingual open QA model for many languages.
CORA consists of two main components: mDPR and mGEN.
mDPR retrieves documents from multilingual document collections and mGEN generates the answer in the target languages directly instead of using any external machine translation or language-specific retrieval module.
Our experimental results show state-of-the-art results across two multilingual open QA dataset: XOR QA and MKQA.

Contents

  1. Quick Run on XOR QA
  2. Overview
  3. Data
  4. Installation
  5. Training
  6. Evaluation
  7. Citations and Contact

Quick Run on XOR QA

We provide quick_start_xorqa.sh, with which you can easily set up and run evaluation on the XOR QA full dev set.

The script will

  1. download our trained mDPR, mGEN and encoded Wikipedia embeddings,
  2. run the whole pipeline on the evaluation set, and
  3. calculate the QA scores.

You can download the prediction results from here.

Overview

To run CORA, you first need to preprocess Wikipedia using the codes in wikipedia_preprocess.
Then you train mDPR and mGEN.
Once you finish training those components, please run evaluations, and then evaluate the performance using eval_scripts.

Please see the details of each components in each directory.

  • mDPR: codes for training and evaluating our mDPR.
  • mGEN: codes for training and evaluating our mGEN.
  • wikipedia_preprocess: codes for preprocessing Wikipedias.
  • eval_scripts: scripts to evaluate the performance.

Data

Training data

We will upload the final training data for mDPR. Please stay tuned!

Evaluation data

We evaluate our models performance on XOR QA and MKQA.

  • XOR QA Please download the XOR QA (full) data by running the command below.
mkdir data
cd data
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_dev_full_v1_1.jsonl
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_test_full_q_only_v1_1.jsonl
cd ..
  • MKQA Please download the original MKQA data from the original repository.
wget https://github.com/apple/ml-mkqa/raw/master/dataset/mkqa.jsonl.gz
gunzip mkqa.jsonl.gz

Before evaluating on MKQA, you need to preprocess the MKQA data to convert them into the same format as XOR QA. Please follow the instructions at eval_scripts/README.md.

Installation

Dependencies

  • Python 3
  • PyTorch (currently tested on version 1.7.0)
  • Transformers (version 4.2.1; unlikely to work with a different version)

Trained models

You can download trained models by running the commands below:

mkdir models
wget https://nlp.cs.washington.edu/xorqa/cora/models/all_w100.tsv
wget https://nlp.cs.washington.edu/xorqa/cora/models/mGEN_model.zip
wget https://nlp.cs.washington.edu/xorqa/cora/models/mDPR_biencoder_best.cpt
unzip mGEN_model.zip
mkdir embeddings
cd embeddings
for i in 0 1 2 3 4 5 6 7;
do 
  wget https://nlp.cs.washington.edu/xorqa/cora/models/wikipedia_split/wiki_emb_en_$i 
done
for i in 0 1 2 3 4 5 6 7;
do 
  wget https://nlp.cs.washington.edu/xorqa/cora/models/wikipedia_split/wiki_emb_others_$i  
done
cd ../..

Training

CORA is trained with our iterative training process, where each iteration proceeds over two states: parameter updates and cross-lingual data expansion.

  1. Train mDPR with the current training data. For the first iteration, the training data is the gold paragraph data from Natural Questions and TyDi-XOR QA.
  2. Retrieve top documents using trained mDPR
  3. Train mGEN with retrieved data
  4. Run mGEN on each passages from mDPR and synthetic data retrieval to label the new training data.
  5. Go back to step 1.

overview_training

See the details of each training step in mDPR/README.md and mGEN/README.md.

Evaluation

  1. Run mDPR on the input data
python dense_retriever.py \
    --model_file ../models/mDPR_biencoder_best.cpt \
    --ctx_file ../models/all_w100.tsv \
    --qa_file ../data/xor_dev_full_v1_1.jsonl \
    --encoded_ctx_file "../models/embeddings/wiki_emb_*" \
    --out_file xor_dev_dpr_retrieval_results.json \
    --n-docs 20 --validation_workers 1 --batch_size 256 --add_lang
  1. Convert the retrieved results into mGEN input format
cd mGEN
python3 convert_dpr_retrieval_results_to_seq2seq.py \
    --dev_fp ../mDPR/xor_dev_dpr_retrieval_results.json \
    --output_dir xorqa_dev_final_retriever_results \
    --top_n 15 \
    --add_lang \
    --xor_engspan_train data/xor_train_retrieve_eng_span.jsonl \
    --xor_full_train data/xor_train_full.jsonl \
    --xor_full_dev data/xor_dev_full_v1_1.jsonl
  1. Run mGEN
CUDA_VISIBLE_DEVICES=0 python eval_mgen.py \
    --model_name_or_path \
    --evaluation_set xorqa_dev_final_retriever_results/val.source \
    --gold_data_path xorqa_dev_final_retriever_results/gold_para_qa_data_dev.tsv \
    --predictions_path xor_dev_final_results.txt \
    --gold_data_mode qa \
    --model_type mt5 \
    --max_length 20 \
    --eval_batch_size 4
cd ..
  1. Run the XOR QA full evaluation script
cd eval_scripts
python eval_xor_full.py --data_file ../data/xor_dev_full_v1_1.jsonl --pred_file ../mGEN/xor_dev_final_results.txt --txt_file

Baselines

In our paper, we have tested several baselines such as Translate-test or multilingual baselines. The codes for machine translations or BM 25-based retrievers are at baselines. To run the baselines, you may need to download code and mdoels from the XOR QA repository. Those codes are implemented by Velocity :)

Citations and Contact

If you find this codebase is useful or use in your work, please cite our paper.

@article{
asai2021cora,
title={One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval},
author={Akari Asai and Xinyan Yu and Jungo Kasai and Hannaneh Hajishirzi},
journal={Arxiv Preprint},
year={2021}
}

Please contact Akari Asai (@AkariAsai on Twitter, akari[at]cs.washington.edu) for questions and suggestions.

You might also like...
Official implementation of our paper
Official implementation of our paper "LLA: Loss-aware Label Assignment for Dense Pedestrian Detection" in Pytorch.

LLA: Loss-aware Label Assignment for Dense Pedestrian Detection This project provides an implementation for "LLA: Loss-aware Label Assignment for Dens

Official implementation of Self-supervised Graph Attention Networks (SuperGAT), ICLR 2021.

SuperGAT Official implementation of Self-supervised Graph Attention Networks (SuperGAT). This model is presented at How to Find Your Friendly Neighbor

An official implementation of
An official implementation of "SFNet: Learning Object-aware Semantic Correspondence" (CVPR 2019, TPAMI 2020) in PyTorch.

PyTorch implementation of SFNet This is the implementation of the paper "SFNet: Learning Object-aware Semantic Correspondence". For more information,

This project is the official implementation of our accepted ICLR 2021 paper BiPointNet: Binary Neural Network for Point Clouds.
This project is the official implementation of our accepted ICLR 2021 paper BiPointNet: Binary Neural Network for Point Clouds.

BiPointNet: Binary Neural Network for Point Clouds Created by Haotong Qin, Zhongang Cai, Mingyuan Zhang, Yifu Ding, Haiyu Zhao, Shuai Yi, Xianglong Li

Official code implementation for
Official code implementation for "Personalized Federated Learning using Hypernetworks"

Personalized Federated Learning using Hypernetworks This is an official implementation of Personalized Federated Learning using Hypernetworks paper. [

StyleGAN2 - Official TensorFlow Implementation
StyleGAN2 - Official TensorFlow Implementation

StyleGAN2 - Official TensorFlow Implementation

 Old Photo Restoration (Official PyTorch Implementation)
Old Photo Restoration (Official PyTorch Implementation)

Bringing Old Photo Back to Life (CVPR 2020 oral)

Official implementation of
Official implementation of "GS-WGAN: A Gradient-Sanitized Approach for Learning Differentially Private Generators" (NeurIPS 2020)

GS-WGAN This repository contains the implementation for GS-WGAN: A Gradient-Sanitized Approach for Learning Differentially Private Generators (NeurIPS

Official PyTorch implementation of Spatial Dependency Networks.
Official PyTorch implementation of Spatial Dependency Networks.

Spatial Dependency Networks: Neural Layers for Improved Generative Image Modeling Đorđe Miladinović   Aleksandar Stanić   Stefan Bauer   Jürgen Schmid

Comments
  • Hardware requirement to build a FAISS index from the dense embedding?

    Hardware requirement to build a FAISS index from the dense embedding?

    Hi,

    First of all, thank you for sharing the great work!

    I'm trying to replicate the implementation on another platform (ParlAI) and I have been building a FAISS index from the multilingual embeddings provided in this repo on a GPU with 50 GB vram. I used the FAISS compression techniques so it's supposed to build an index much faster. However, the program has been running for ~48 hours now and not stopping. I've also tried building it without the compression techniques but that took days and the cost became to high for me.

    I'm wondering what is the hardware requirement to build such an index and how long would it take? Also would it be possible to share the index file you built?

    Thank you so much!

    opened by evelynkyl 2
  • Adding the language id twice to the question before passing it to mGEN

    Adding the language id twice to the question before passing it to mGEN

    Hi,

    Thanks for uploading CORA on github! I am trying to use your package in my project, and wanted to make sure if it was the intention of the authors to add the language_id twice to the outputs of the mDPR before passing it to mGEN.

    In mDPR/dense_retriever.py, in the method parse_qa_jsonlines_file, the 2 - letter language id is added to the question while encoding the question for mDPR. Considering this was intentional, the 2 letter language id is added again while converting mDPR outputs to seq2seq (here)

    What ends up happening is that before the input sequence is sent into mGEN, the question ID is appended in the end by the language id twice, both being the same. We would follow the same format if the authors intended it to be so.

    opened by styx97 0
Owner
Akari Asai
PhD student at @uwnlp . NLP & ML.
Akari Asai
Official implementation of AAAI-21 paper "Label Confusion Learning to Enhance Text Classification Models"

Description: This is the official implementation of our AAAI-21 accepted paper Label Confusion Learning to Enhance Text Classification Models. The str

null 101 Nov 25, 2022
Official PyTorch implementation for paper Context Matters: Graph-based Self-supervised Representation Learning for Medical Images

Context Matters: Graph-based Self-supervised Representation Learning for Medical Images Official PyTorch implementation for paper Context Matters: Gra

null 49 Nov 23, 2022
The official implementation of NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation [ICLR-2021]. https://arxiv.org/pdf/2101.12378.pdf

NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation [ICLR-2021] Release Notes The offical PyTorch implementation of NeMo, p

Angtian Wang 76 Nov 23, 2022
StyleGAN2-ADA - Official PyTorch implementation

Abstract: Training generative adversarial networks (GAN) using too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited data regimes.

NVIDIA Research Projects 3.2k Dec 30, 2022
Official implementation of the ICLR 2021 paper

You Only Need Adversarial Supervision for Semantic Image Synthesis Official PyTorch implementation of the ICLR 2021 paper "You Only Need Adversarial S

Bosch Research 272 Dec 28, 2022
Official PyTorch implementation of Joint Object Detection and Multi-Object Tracking with Graph Neural Networks

This is the official PyTorch implementation of our paper: "Joint Object Detection and Multi-Object Tracking with Graph Neural Networks". Our project website and video demos are here.

Richard Wang 443 Dec 6, 2022
Official implementation of the paper Image Generators with Conditionally-Independent Pixel Synthesis https://arxiv.org/abs/2011.13775

CIPS -- Official Pytorch Implementation of the paper Image Generators with Conditionally-Independent Pixel Synthesis Requirements pip install -r requi

Multimodal Lab @ Samsung AI Center Moscow 201 Dec 21, 2022
Official pytorch implementation of paper "Image-to-image Translation via Hierarchical Style Disentanglement".

HiSD: Image-to-image Translation via Hierarchical Style Disentanglement Official pytorch implementation of paper "Image-to-image Translation

null 364 Dec 14, 2022
Official pytorch implementation of paper "Inception Convolution with Efficient Dilation Search" (CVPR 2021 Oral).

IC-Conv This repository is an official implementation of the paper Inception Convolution with Efficient Dilation Search. Getting Started Download Imag

Jie Liu 111 Dec 31, 2022
Official PyTorch Implementation of Unsupervised Learning of Scene Flow Estimation Fusing with Local Rigidity

UnRigidFlow This is the official PyTorch implementation of UnRigidFlow (IJCAI2019). Here are two sample results (~10MB gif for each) of our unsupervis

Liang Liu 28 Nov 16, 2022