LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

Overview

We propose a cross-encoder model (LTR_CrossEncoder) for information retrieval: it re-ranks the candidate texts returned by Elasticsearch according to their relevance to the query.

  • The model achieved a 0.747 F2 score on the public test set (Legal Text Retrieval Zalo AI Challenge 2021)
  • Using Elasticsearch alone, our F2 score is 0.54
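
For reference, F2 is the Fβ measure with β = 2, which weights recall more heavily than precision:

    F2 = (1 + 2²) · P · R / (2² · P + R) = 5 · P · R / (4 · P + R)

where P is precision and R is recall.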

Algorithm design

Our algorithm includes two key components:

  • Elasticsearch
  • Cross Encoder Model

Elasticsearch

Elasticsearch is used to filter the top-k most relevant articles based on BM25 score.
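
As an illustration, here is a minimal sketch of this filtering step using the official elasticsearch Python client. The host, index name, and field names are assumptions for the sketch, not the repo's actual configuration:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

def get_top_k_articles(question, k=50, index="legal_articles"):
    # BM25 is Elasticsearch's default similarity, so a plain multi_match
    # query returns candidates already ranked by BM25 score.
    body = {
        "size": k,
        "query": {
            "multi_match": {
                "query": question,
                "fields": ["title", "text"],
            }
        },
    }
    resp = es.search(index=index, body=body)
    return [hit["_source"] for hit in resp["hits"]["hits"]]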

Cross Encoder Model

(Model architecture diagram)

Our model takes a query, an article's text (passage), and the article's title as inputs, and outputs a relevance score for that query-article pair. The higher the score, the more relevant the article. We use the pretrained vinai/phobert-base model, with CrossEntropyLoss or BCELoss as the loss function.
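
A minimal sketch of such a cross encoder on top of vinai/phobert-base (class and variable names are ours; the repo's actual architecture may differ in details such as pooling or how title and passage are packed):

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CrossEncoder(nn.Module):
    # Scores a (query, article) pair with a single PhoBERT encoder.
    def __init__(self, num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("vinai/phobert-base")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0]  # <s> token representation
        logits = self.classifier(cls)
        if labels is not None:
            # CrossEntropyLoss over 2 classes; BCELoss is the binary alternative.
            loss = nn.CrossEntropyLoss()(logits, labels)
            return loss, logits
        return logits

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
# Query and "title + text" are packed as one sequence pair.
# Note: PhoBERT expects word-segmented Vietnamese input.
enc = tokenizer("query ...", "title ... text ...",
                truncation=True, max_length=256, return_tensors="pt")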

Train dataset

Non-relevant samples in the dataset are taken from the top-10 Elasticsearch results. The training data (train_data_model.json) has the following format:

[
    {
        "question_id": "...",
        "question": "...",
        "relevant_articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ],
        "non_relevant_articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ]
    },
    ...
]
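
A sketch of how this file can be flattened into (question, article, label) training triples, using the field names shown above (concatenating title and text into one passage is our assumption):

import json

def create_training_examples(path):
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    examples = []
    for item in data:
        # Relevant articles get label 1, Elasticsearch negatives get label 0.
        for art in item["relevant_articles"]:
            examples.append((item["question"], art["title"] + " " + art["text"], 1))
        for art in item["non_relevant_articles"]:
            examples.append((item["question"], art["title"] + " " + art["text"], 0))
    return examples

examples = create_training_examples("train_data_model.json")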

Test dataset

First we use Elasticsearch to obtain the top-k candidate articles (k = 50); then LTR_CrossEncoder classifies which of those candidates are actually relevant. The test data (test_data_model.json) has the following format:

[
    {
        "question_id": "...",
        "question": "...",
        "articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ]
    },
    ...
]
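
Reusing the CrossEncoder sketch above, the re-ranking step could look like the following (the 0.5 threshold is an assumption for illustration, not the repo's tuned value):

import torch

def predict_relevant(model, tokenizer, question, articles, device="cpu", threshold=0.5):
    # Score every Elasticsearch candidate and keep those classified as relevant.
    model.eval()
    keep = []
    for art in articles:
        enc = tokenizer(question, art["title"] + " " + art["text"],
                        truncation=True, max_length=256, return_tensors="pt").to(device)
        with torch.no_grad():
            logits = model(enc["input_ids"], enc["attention_mask"])
        prob = torch.softmax(logits, dim=-1)[0, 1].item()
        if prob >= threshold:
            keep.append({"law_id": art["law_id"], "article_id": art["article_id"]})
    return keep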

Training

Run the following bash script to train the model:

bash run_phobert.sh

Inference

We also provide model checkpoints. Please download them if you want to run inference on a new text file without training the models from scratch: create a new checkpoint folder, unzip the model file, and place it in that folder. https://drive.google.com/file/d/1oT8nlDIAatx3XONN1n5eOgYTT6Lx_h_C/view?usp=sharing

Run the following bash script to run inference on the test dataset:

bash run_predict.sh

Comments
  • Running run_phoBert.sh and run_predict.sh run into missing file?

    Hello!

    So I git-cloned this version and followed your instructions exactly. After fixing the file paths, I hoped it would work, but I found two bugs across two separate runs:

    !bash /content/ZaloAI2021_LTR/run_predict.sh

    and here is the traceback:

    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    Traceback (most recent call last):
      File "/content/ZaloAI2021_LTR/predict.py", line 29, in load_model
        model = MODEL_CLASSES[args.model_type][1].from_pretrained(args.model_dir, args=args)
    requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/checkpoint

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/content/ZaloAI2021_LTR/predict.py", line 170, in <module>
        predict(pred_config)
      File "/content/ZaloAI2021_LTR/predict.py", line 117, in predict
        model = load_model(pred_config, args, device)
      File "/content/ZaloAI2021_LTR/predict.py", line 34, in load_model
        raise Exception("Some model files might be missing...")
    Exception: Some model files might be missing...

    The second one is training from scratch with run_phobert.sh:

    !bash /content/ZaloAI2021_LTR/run_phobert.sh

    and here is the traceback:

    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    12/27/2021 08:44:27 - INFO - data_loader - Creating features from dataset file at train_tokenize_clean.json
    Traceback (most recent call last):
      File "/content/ZaloAI2021_LTR/main.py", line 68, in <module>
        main(args_parse)
      File "/content/ZaloAI2021_LTR/main.py", line 12, in main
        train_dataset = load_and_cache_examples(args, tokenizer)
      File "/content/ZaloAI2021_LTR/data_loader.py", line 136, in load_and_cache_examples
        examples = create_examples(input_file)
      File "/content/ZaloAI2021_LTR/data_loader.py", line 36, in create_examples
        with open(input_file, "r", encoding='utf-8') as reader:
    FileNotFoundError: [Errno 2] No such file or directory: 'train_tokenize_clean.json'

    As far as I know, VinAI's PhoBERT doesn't ship any file named train_tokenize_clean.json, and the URL https://huggingface.co/api/models/checkpoint doesn't exist. My hypothesis is that I may have forgotten to install a library or missed a step, but I can't figure out what it is even after reading the code. So I'm asking for help here.

    Note: I'm running on Google Colab Pro with a GPU runtime.

    opened by DuyquanDuc