zalo_ltr_2021
Source code for Zalo AI 2021 submission
Solution:
Pipeline
We use the pipeline in the picture below:
Our pipeline is a combination of BM25 and Sentence Transformer. Let us describe our approach briefly:
- Step 1: We trained a BM25 model for searching similar pairs. We used BM25 to create negative sentence pairs for training the Sentence Transformer in Step 3.
- Step 2: We trained Masked Language Models using the legal corpus from the training data. Our masked language models are
VinAI/PhoBert-Large
FPTAI/ViBert
- Step 3: Train Sentence Transformer + Contrastive loss with 4 settings:
1. MLM PhoBert Large -> Sentence Transformer
2. MLM ViBert -> Sentence Transformer
3. MLM PhoBert Large -> Condenser -> Sentence Transformer
4. MLM PhoBert Large -> Co-Condenser -> Sentence Transformer
- Step 4: Use the 4 models from Step 3 to generate corresponding hard negative sentences for training round 2 in Step 5.
- Step 5: Train the 4 models above for round 2.
- Step 6: Ensemble the 4 models obtained from Step 5.
Data
Raw data is in zac2021-ltr-data
Create Folder
Create a new folder for generated training data:
mkdir generated_data
Train BM25
To train BM25: python bm25_train.py
Use load_docs to save time on later runs: python bm25_train.py --load_docs
To evaluate: python bm25_create_pairs.py
This step will also create top_k negative pairs from BM25. We choose top_k = 20 and 50. Pairs will be saved to pair_data/.
These pairs will be used to train the round 1 Sentence Transformer model.
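bm25_train.py and bm25_create_pairs.py implement this step. As a rough illustration of the negative-mining idea only (not the repository's code; it uses the rank_bm25 package and placeholder corpus/question structures), the sketch below retrieves the top_k BM25 hits per question and keeps the non-relevant ones as negatives:

```python
# Illustrative sketch of BM25 negative mining, NOT the repo's implementation.
# Assumes `articles` is a list of article texts and `questions` is a list of
# (question_text, set_of_relevant_article_indices) tuples.
from rank_bm25 import BM25Okapi

articles = ["article one text ...", "article two text ..."]      # placeholder corpus
questions = [("example legal question ...", {0})]                # placeholder queries

bm25 = BM25Okapi([doc.split() for doc in articles])              # whitespace tokenization

top_k = 20
negative_pairs = []
for question, relevant_ids in questions:
    scores = bm25.get_scores(question.split())
    ranked = sorted(range(len(articles)), key=lambda i: scores[i], reverse=True)[:top_k]
    for idx in ranked:
        if idx not in relevant_ids:                              # high-scoring but non-relevant
            negative_pairs.append((question, articles[idx], 0))  # label 0 = negative pair
```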
Create corpus:
Run python create_corpus.py (a rough sketch of this step is shown after the list below)
This step will create:
- corpus.txt (for finetuning the language model)
- cocondenser_data.json (for finetuning the CoCondenser model)
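For orientation, here is a rough sketch of what the corpus-creation step might look like. The file name legal_corpus.json and the field names (law_id, articles, article_id, title, text) are assumptions about the zac2021-ltr-data layout; see create_corpus.py for the actual logic:

```python
# Hedged sketch of building corpus.txt from the raw legal corpus.
# Assumption: zac2021-ltr-data/legal_corpus.json is shaped like
# [{"law_id": ..., "articles": [{"article_id": ..., "title": ..., "text": ...}, ...]}, ...]
import json

with open("zac2021-ltr-data/legal_corpus.json", encoding="utf-8") as f:
    laws = json.load(f)

with open("generated_data/corpus.txt", "w", encoding="utf-8") as out:
    for law in laws:
        for article in law["articles"]:
            # one passage per line: article title followed by its flattened body
            out.write(article["title"] + " " + article["text"].replace("\n", " ") + "\n")
```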
Finetune language model using Huggingface
Pretrained models:
- viBERT: FPTAI/vibert-base-cased
- vELECTRA: FPTAI/velectra-base-discriminator-cased
- phobert-base: vinai/phobert-base
- phobert-large: vinai/phobert-large
$MODEL_NAME=phobert-large
$DATA_FILE=corpus.txt
$SAVE_DIR=/path/to/your/save/directory
Run the following cmd to train Masked Language Model:
python run_mlm.py \
--model_name_or_path $MODEL_NAME \
--train_file $DATA_FILE \
--do_train \
--do_eval \
--output_dir $SAVE_DIR \
--line_by_line \
--overwrite_output_dir \
--save_steps 2000 \
--num_train_epochs 20 \
--per_device_eval_batch_size 32 \
--per_device_train_batch_size 32
Train condenser and cocondenser from language model checkpoint
Original source code here: https://github.com/luyug/Condenser (we modified several lines of code to make it compatible with the current version of transformers)
Create data for Condenser:
python helper/create_train.py \
--tokenizer_name $MODEL_NAME \
--file $DATA_FILE \
--save_to $SAVE_CONDENSER \
--max_len $MAX_LENGTH
$MODEL_NAME=vinai/phobert-large
$MAX_LENGTH=256
$DATA_FILE=../generated_data/corpus.txt
$SAVE_CONDENSER=../generated_data/
For the pre-training command below, $MODEL_NAME is the checkpoint from the finetuned language model.
python run_pre_training.py \
--output_dir $OUTDIR \
--model_name_or_path $MODEL_NAME \
--do_train \
--save_steps 2000 \
--per_device_train_batch_size $BATCH_SIZE \
--gradient_accumulation_steps $ACCUMULATION_STEPS \
--fp16 \
--warmup_ratio 0.1 \
--learning_rate 5e-5 \
--num_train_epochs 8 \
--overwrite_output_dir \
--dataloader_num_workers 32 \
--n_head_layers 2 \
--skip_from 6 \
--max_seq_length $MAX_LENGTH \
--train_dir $SAVE_CONDENSER \
--weight_decay 0.01 \
--late_mlm
We use this setting to run Condenser:
python run_pre_training.py \
--output_dir saved_model_1/ \
--model_name_or_path ../Legal_Text_Retrieval/lm/large/checkpoint-30000 \
--do_train \
--save_steps 2000 \
--per_device_train_batch_size 32 \
--gradient_accumulation_steps 4 \
--fp16 \
--warmup_ratio 0.1 \
--learning_rate 5e-5 \
--num_train_epochs 8 \
--overwrite_output_dir \
--dataloader_num_workers 32 \
--n_head_layers 2 \
--skip_from 6 \
--max_seq_length 256 \
--train_dir ../generated_data/ \
--weight_decay 0.01 \
--late_mlm
Train cocondenser:
First, we create data for the cocondenser:
python helper/create_train_co.py \
--tokenizer vinai/phobert-large \
--file ../generated_data/cocondenser/corpus.txt.json \
--save_to data/large_co/corpus.txt.json
Run the following cmd to train co-condenser model:
python run_co_pre_training.py \
--output_dir saved_model/cocondenser/ \
--model_name_or_path $CONDENSER_CKPT \
--do_train \
--save_steps 2000 \
--model_type bert \
--per_device_train_batch_size 32 \
--gradient_accumulation_steps 1 \
--fp16 \
--warmup_ratio 0.1 \
--learning_rate 5e-5 \
--num_train_epochs 10 \
--dataloader_drop_last \
--overwrite_output_dir \
--dataloader_num_workers 32 \
--n_head_layers 2 \
--skip_from 6 \
--max_seq_length 256 \
--train_dir ../generated_data/cocondenser/ \
--weight_decay 0.01 \
--late_mlm \
--cache_chunk_size 32 \
--save_total_limit 1
Train Sentence Transformer
Round 1: using negative sentence pairs generated from BM25
For each Masked Language Model, we trained a corresponding Sentence Transformer. Run the following command to train round 1 of the sentence-bert model.
Note: use cls_pooling for Condenser and CoCondenser.
python train_sentence_bert.py \
--pretrained_model /path/to/your/pretrained/mlm/model \
--max_seq_length 256 \
--pair_data_path /path/to/your/negative/pairs/data \
--round 1 \
--num_val $NUM_VAL \
--epochs 10 \
--saved_model /path/to/your/save/model/directory \
--batch_size 32
Here we pick $NUM_VAL = 50 * 20 and 50 * 50 for the top-20 and top-50 pair data, respectively.
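For reference, the core of such a training run with the sentence-transformers library looks roughly like the sketch below. This is a simplified stand-in for train_sentence_bert.py, not the exact script; the paths, the in-memory pair format, and the warmup value are assumptions. It also shows the cls pooling mentioned in the note above:

```python
# Simplified sketch of round-1 Sentence Transformer training with contrastive loss
# (stand-in for train_sentence_bert.py; paths and pair format are placeholders).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

mlm_path = "/path/to/your/pretrained/mlm/model"
word_emb = models.Transformer(mlm_path, max_seq_length=256)
# use pooling_mode="cls" for Condenser/CoCondenser checkpoints
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="cls")
model = SentenceTransformer(modules=[word_emb, pooling])

# (question, article, label) pairs: label 1 = relevant, 0 = BM25 negative
pairs = [("question ...", "relevant article ...", 1),
         ("question ...", "bm25 negative article ...", 0)]
train_examples = [InputExample(texts=[q, a], label=float(l)) for q, a, l in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)

train_loss = losses.ContrastiveLoss(model=model)
model.fit(train_objectives=[(train_loader, train_loss)],
          epochs=10,
          warmup_steps=100,
          output_path="/path/to/your/save/model/directory")
```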
Round 2: using hard negative pairs created from the Round 1 models
- Step 1: Run the following cmd to generate hard negative pairs from round 1 model:
python hard_negative_mining.py \
--model_path /path/to/your/sentence/bert/model \
--data_path /path/to/the/legal/corpus/json \
--save_path /path/to/directory/to/save/neg/pairs \
--top_k top_k_negative_pair
Here we pick top_k = 20 and 50 (a rough sketch of this mining step is shown at the end of this section).
- Step 2: Use the data generated from Step 1 to train round 2 of the sentence-bert model for each model from round 1. To train round 2, please use the following command:
python train_sentence_bert.py \
--pretrained_model /path/to/your/pretrained/mlm/model \
--max_seq_length 256 \
--pair_data_path /path/to/your/negative/pairs/data \
--round 2 \
--num_val $NUM_VAL \
--epochs 5 \
--saved_model /path/to/your/save/model/directory \
--batch_size 32
Tip: use a small learning rate for better model convergence.
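The hard-negative mining in Step 1 above works roughly as follows: encode the legal corpus with a round-1 model, retrieve the top_k most similar articles for each question, and keep the highly ranked but non-relevant ones as hard negatives. A minimal sketch with sentence-transformers utilities (illustrative only; hard_negative_mining.py is the actual implementation, and the corpus/question structures are placeholders):

```python
# Illustrative sketch of hard-negative mining with a round-1 model
# (NOT hard_negative_mining.py; data structures are placeholders).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("/path/to/your/sentence/bert/model")
articles = ["article one ...", "article two ..."]            # placeholder legal corpus
questions = [("example question ...", {0})]                  # (text, relevant article indices)

article_emb = model.encode(articles, convert_to_tensor=True, show_progress_bar=True)

top_k = 20
hard_negative_pairs = []
for question, relevant_ids in questions:
    q_emb = model.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, article_emb, top_k=top_k)[0]  # ranked by cosine similarity
    for hit in hits:
        if hit["corpus_id"] not in relevant_ids:                     # similar yet non-relevant => hard negative
            hard_negative_pairs.append((question, articles[hit["corpus_id"]], 0))
```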
Prediction
To reproduce the result, we use 4 two-round trained models whose pretrained MLMs are PhoBert-Large, PhoBert-Large-Condenser, PhoBert-Large-CoCondenser and viBert-base. The final models and their corresponding weights are below (a sketch of how the weighted ensemble is applied follows the list):
- 1 x PhoBert-Large-Round2: 0.1
- 1 x Condenser-PhoBert-Large-round2: 0.3
- 1 x Co-Condenser-PhoBert-Large-round2: 0.4
- 1 x FPTAI/ViBert-base-round2: 0.2
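At prediction time, the four models are combined by a weighted sum of their cosine-similarity scores using the weights above. A rough sketch (checkpoint paths are placeholders; predict.py contains the actual implementation):

```python
# Rough sketch of the weighted score ensemble (placeholder paths; see predict.py).
from sentence_transformers import SentenceTransformer, util

WEIGHTED_CHECKPOINTS = [
    ("saved_model/phobert-large-round2", 0.1),
    ("saved_model/condenser-phobert-large-round2", 0.3),
    ("saved_model/cocondenser-phobert-large-round2", 0.4),
    ("saved_model/vibert-base-round2", 0.2),
]
MODELS = [(SentenceTransformer(path), w) for path, w in WEIGHTED_CHECKPOINTS]

def ensemble_scores(question, articles):
    """Return one aggregated score per candidate article: weighted sum of cosine similarities."""
    total = 0.0
    for model, weight in MODELS:
        q_emb = model.encode(question, convert_to_tensor=True)
        a_emb = model.encode(articles, convert_to_tensor=True)
        total = total + weight * util.cos_sim(q_emb, a_emb)[0]
    return total
```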
doc_refers_saved.pkl and legal_dict.json are generated during the BM25 training and corpus creation steps, respectively. We also provide scripts to re-generate them before inference:
python3 create_corpus.py --data zac2021-ltr-data --save_dir generated_data
python3 create_doc_refers.py --raw_data zac2021-ltr-data --save_path generated_data
We also provide embedding vectors pre-encoded by the ensemble model in encoded_legal_data.pkl. If you want to verify this and get the final submission, please run the following command:
python3 predict.py --data /path/to/test/json/data --legal_data generated_data/doc_refers_saved.pkl --precode
If you already have encoded_legal_data.pkl, run the following command:
python3 predict.py --data /path/to/test/json/data --legal_data generated_data/doc_refers_saved.pkl
Just for inference
Run the following commands:
chmod +x predict.sh
./predict.sh
Post-processing techniques:
- fix typos of nd-cp
- multiply the cos-sim score with the BM25 score; we pick the score range [max_score - 2.6, max_score] and take the top 5 sentences for a question with multiple answers (see the sketch below).
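A minimal sketch of this score combination and range-based selection, assuming the ensemble cosine similarities and BM25 scores are aligned 1-D arrays over the candidate articles (the exact details, e.g. which score the 2.6 range is applied to, live in predict.py):

```python
# Hedged sketch of the post-processing selection (see predict.py for the real logic).
import numpy as np

def select_answers(cos_sim_scores, bm25_scores, score_range=2.6, max_answers=5):
    """Combine ensemble cos-sim with BM25, keep articles within `score_range` of the best score."""
    combined = np.asarray(cos_sim_scores) * np.asarray(bm25_scores)
    max_score = combined.max()
    candidates = np.where(combined >= max_score - score_range)[0]
    # at most `max_answers` articles, highest combined score first
    ranked = candidates[np.argsort(-combined[candidates])]
    return ranked[:max_answers].tolist()
```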
Methods used but did not work
- Training Round 3 for Sentence Transformer.
- Pseudo Labeling: improved our single-model performance but hurt ensemble performance.
Contributors:
Thanks to our teammates for their great work: Dzung Le, Hong Nguyen