Cross-Encoder-with-Bi-Encoder
For the Top-k passages retrieved by the Bi-Encoder, re-ranking is performed with a Cross-Encoder.
Data
The data comes from the "Open-Domain Question Answering Competition" hosted by AI Stages and is licensed under CC-BY-2.0.
+- data
|    +- train_dataset
|    |    +- train
|    |    |    +- dataset.arrow
|    |    |    +- dataset_info.json
|    |    |    +- indices.arrow
|    |    |    +- state.json
|    |    +- validation
|    |    |    +- dataset.arrow
|    |    |    +- dataset_info.json
|    |    |    +- indices.arrow
|    |    |    +- state.json
|    |    +- dataset_dict.json
|    +- test_dataset
|    |    +- validation
|    |    |    +- dataset.arrow
|    |    |    +- dataset_info.json
|    |    |    +- indices.arrow
|    |    |    +- state.json
|    |    +- dataset_dict.json
|    +- wikipedia_documents.json
- Place the Wikipedia data at the folder location shown above.
!git clone https://github.com/jjonhwa/Cross-Encoder-with-Bi-Encoder.git # git clone
% cd ./Cross-Encoder-with-Bi-Encoder/_data # change directory (input your own path)
!gdown --id 1O-kxt4DupOibNhkwmg3luTLt07faRgvO # download wikipedia data
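For reference, below is a minimal loading sketch, assuming the train/test datasets were saved with Hugging Face `datasets` (`save_to_disk`) and that each entry of `wikipedia_documents.json` carries a `text` field; the paths and field names are illustrative, not guaranteed by this repository.

```python
import json

from datasets import load_from_disk  # datasets==1.5.0

# Load the train/validation splits saved in Arrow format (a DatasetDict).
train_dataset = load_from_disk("./data/train_dataset")
print(train_dataset)               # shows the train / validation splits
print(train_dataset["train"][0])   # one QA example

# Load the Wikipedia corpus used as the retrieval collection.
with open("./data/wikipedia_documents.json", "r", encoding="utf-8") as f:
    wiki = json.load(f)

# Assumption: each entry in wikipedia_documents.json has a "text" field.
contexts = [doc["text"] for doc in wiki.values()]
print(f"{len(contexts)} wikipedia passages loaded")
```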
Setup
Dependencies
datasets==1.5.0
transformers==4.5.0
tqdm==4.41.1
pandas==1.1.4
CUDA==11.0
Install Requirements
bash install_requirements.sh
Hardware
GPU : Tesla V100 (32GB)
Checkpoint
- You can try out the code in a Colab environment using the Demo.
- Note that it does not run on the basic Colab tier.
What can we do to improve the performance of the Retriever?
1. Examine how the dataset was constructed.
- Sparse embedding can work better on datasets such as SQuAD, where questions are written while looking at the passage (i.e., where there is annotation bias).
- This is briefly summarized in Korean in: Dense Passage Retrieval for Open Domain Question Answering Review (3. Passage Retrieval - Main Results).
- On most other data, documents can be retrieved with higher accuracy when Dense Passage Retrieval (DPR) is used.
2. Sparse Embedding & Dense Embedding
- Most of these insights come from the DPR paper, and applying them led to the improvement in retriever performance.
- Before applying DPR: for the 'KLUE MRC dataset', which was constructed in the same way as SQuAD, a sparse embedding technique such as BM25 can be expected to work better than DPR.
- In fact, until the ReRank strategy was applied, the highest performance was achieved with Elasticsearch based on BM25.
- When only the Bi-Encoder was used, retrieval accuracy in the 'KLUE MRC competition' was far below Elasticsearch.
- Retrieval accuracy on our data (Top-k accuracy; a short computation sketch follows the table):
| | Top-5 | Top-50 | Top-100 |
|---|---|---|---|
| Elastic Search | 0.852 | 0.945 | 0.962 |
| DPR Bi-Encoder | - | 0.775 | 0.85 |
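For clarity, the Top-k numbers above are retrieval accuracies: a question counts as correct if its gold passage appears among the k retrieved passages. A small illustrative sketch (function and variable names are hypothetical):

```python
from typing import List

def topk_accuracy(retrieved: List[List[str]], gold: List[str], k: int) -> float:
    """retrieved[i] is the ranked passage list for query i; gold[i] is its gold passage."""
    hits = sum(1 for docs, answer in zip(retrieved, gold) if answer in docs[:k])
    return hits / len(gold)

# e.g. topk_accuracy(retrieved_passages, gold_passages, k=5)
```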
3. ReRank Strategy with CrossEncoder (In-Batch Negative Samples)
- Our goal was high end-to-end performance in the KLUE MRC competition, from Retrieval through to the Reader. To that end, we used a ReRank strategy based on the Cross-Encoder.
- The key point when implementing the Cross-Encoder is to draw negative samples from within the batch and use them to compute the loss (a minimal sketch follows the table below).
- The Bi-Encoder first retrieves the Top-500 passages, and the Cross-Encoder then re-ranks them so that only a small number of passages are finally returned.
- Retrieval accuracy on our data:
| | Top-5 | Top-50 | Top-100 |
|---|---|---|---|
| Elastic Search | 0.852 | 0.945 | 0.962 |
| DPR without CrossEncoder | - | 0.775 | 0.85 |
| DPR with CrossEncoder | 0.825 | 0.95 | - |
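Below is a minimal sketch of the in-batch negative loss described in item 3, assuming the Cross-Encoder is a single-logit sequence-classification model; the backbone name `klue/bert-base` and the function names are illustrative and may differ from the actual implementation in this repository.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "klue/bert-base"  # illustrative backbone, not necessarily the repo's choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
cross_encoder = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)

def in_batch_negative_loss(questions, passages):
    """questions[i] is paired with its positive passages[i]; every other passage
    in the batch acts as a negative for question i."""
    batch_size = len(questions)
    # Build all B x B (question, passage) pairs in the batch.
    pair_q = [q for q in questions for _ in passages]
    pair_p = [p for _ in questions for p in passages]
    enc = tokenizer(pair_q, pair_p, padding=True, truncation=True,
                    max_length=384, return_tensors="pt")
    # One relevance score per pair, reshaped into a (B, B) score matrix.
    scores = cross_encoder(**enc).logits.view(batch_size, batch_size)
    # The diagonal entries are the positive pairs; treat each row as a
    # classification over the in-batch passages.
    targets = torch.arange(batch_size)
    return F.cross_entropy(scores, targets)
```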
4. Ensemble
- This repository mainly covers the Cross-Encoder; the ensemble code is omitted.
- We ran an ensemble experiment combining sparse and dense embeddings, on the assumption that combining different types of retrieval would improve performance.
- The Top-100 passages from Elasticsearch and the Top-100 from DPR with the Cross-Encoder were combined with a 1:1 weight, and their scores were normalized to produce the final score (a sketch follows the table below).
- Since the final Reader model performed best when given the Top-5 passages, the experiment limited the number of returned passages to five.
- Performance improved significantly; retrieval accuracy is as follows.
| | Top-5 | Top-50 | Top-100 |
|---|---|---|---|
| Elastic Search | 0.852 | 0.945 | 0.962 |
| DPR with CrossEncoder | 0.825 | 0.95 | - |
| Ensemble | 0.9082 | - | - |
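A rough sketch of the 1:1 score combination described in item 4, assuming each retriever returns a dict mapping passage id to score; min-max normalization is used here as one plausible choice and is not necessarily the exact normalization used in the experiments.

```python
def minmax_normalize(scores: dict) -> dict:
    """Scale one retriever's scores to [0, 1] so the two retrievers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    return {pid: (s - lo) / (hi - lo + 1e-12) for pid, s in scores.items()}

def ensemble_topk(es_scores: dict, dpr_scores: dict, k: int = 5) -> list:
    """Combine the Elasticsearch Top-100 and the DPR+CrossEncoder Top-100 with a
    1:1 weight on the normalized scores and return the Top-k passage ids."""
    es_norm, dpr_norm = minmax_normalize(es_scores), minmax_normalize(dpr_scores)
    combined = {}
    for pid in set(es_norm) | set(dpr_norm):
        combined[pid] = es_norm.get(pid, 0.0) + dpr_norm.get(pid, 0.0)  # 1:1 weighting
    return sorted(combined, key=combined.get, reverse=True)[:k]
```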
Train CrossEncoder & BiEncoder
- Train the Cross-Encoder and Bi-Encoder and save them.
- Modify only the data path to match your data (search for "your_dataset_path").
python train.py --encoder 'cross' --output_directory './save_directory/'
or
python train.py --encoder 'bi' --output_directory './save_directory/'
Run ReRank
- The Cross-Encoder and Bi-Encoder must be created first (before running ReRank, you have to run 'train.py' to produce them).
- Modify only the data path to match your data (search for "your_dataset_path").
python rerank.py --input_directory './save_directory/'
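Conceptually, the re-ranking stage performs the following two-step procedure; this is only an illustrative sketch (the helpers `biencoder_retrieve` and `crossencoder_score` are hypothetical, not functions provided by `rerank.py`).

```python
def rerank(question, corpus, biencoder_retrieve, crossencoder_score,
           first_stage_k=500, final_k=5):
    """Stage 1: the Bi-Encoder retrieves the Top-500 candidate passages.
    Stage 2: the Cross-Encoder rescores every (question, passage) pair and
    only the best final_k passages are returned."""
    candidates = biencoder_retrieve(question, corpus, top_k=first_stage_k)
    scored = [(p, crossencoder_score(question, p)) for p in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in scored[:final_k]]
```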