ERICA
Source code and dataset for the ACL 2021 paper "ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning".
The code is based on Hugging Face's transformers; the trained models and pre-training data can be downloaded from Google Drive.
Quick Start
You can quickly run our code with the following steps:
- Install dependencies as described in the following section.
- `cd` into the `pretrain` or `finetune` directory, then download and pre-process the data for pre-training or fine-tuning.
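As a rough end-to-end sketch (the relative paths below are assumptions based on the layout described in this README, and the data download steps are omitted):

```bash
# 1. Install dependencies (see "1. Dependencies").
pip install -r requirement.txt

# 2. Pre-train: enter the pre-training directory, place the downloaded and
#    pre-processed data there, then run the command from "2. Pretraining".
cd code/pretrain

# 3. Fine-tune: switch to a downstream-task folder and run its bash script,
#    as described in "3. Fine-tuning".
cd ../finetune
```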
1. Dependencies
Run the following script to install dependencies:

```bash
pip install -r requirement.txt
```
You need to install transformers and apex manually.
transformers: We use Hugging Face's transformers to implement BERT and RoBERTa; the version we use is 2.5.0. For convenience, we have downloaded transformers into `code/pretrain/` so you can easily import it, and we have modified some lines in the class `BertForMaskedLM` in `src/transformers/modeling_bert.py` while keeping the other code unchanged. You just need to run `pip install .` to install this copy of transformers manually.
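For example, assuming the bundled copy sits directly under `code/pretrain/` (adjust the path if it actually lives in a sub-directory), the manual installation would look like:

```bash
# Install the modified transformers (v2.5.0) shipped with this repository.
# The path is an assumption; point it at wherever the bundled setup.py lives.
cd code/pretrain
pip install .
```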
apex: Install apex following the official guidance.
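A typical from-source installation following NVIDIA's instructions looks roughly like the sketch below; the exact flags depend on your apex version and CUDA/PyTorch setup, so check the official repository first.

```bash
# Build NVIDIA apex with C++/CUDA extensions (flags may differ across
# apex versions; see https://github.com/NVIDIA/apex for current guidance).
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```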
Process pre-training data
In the folder `prepare_pretrain_data`, we provide the code for processing the pre-training data.
2. Pretraining
To pretrain ERICA_bert:
```bash
cd code/pretrain

python -m torch.distributed.launch --nproc_per_node 8 main.py \
    --model DOC --lr 3e-5 --batch_size_per_gpu 16 --max_epoch 105 \
    --gradient_accumulation_steps 16 --save_step 500 --temperature 0.05 \
    --train_sample --save_dir ckpt_doc_dw_f_alpha_1_uncased --n_gpu 8 --debug 1 --add_none 1 \
    --alpha 1 --flow 0 --dataset_name none.json --wiki_loss 1 --doc_loss 1 \
    --change_dataset 1 --start_end_token 0 --bert_model bert \
    --pretraining_size -1 --ablation 0 --cased 0
```
Some explanations of the hyper-parameters:
- `temperature`: the temperature τ used in the loss function of contrastive learning;
- `debug`: whether to run in debug mode (we provide an example_debug file for pre-training);
- `add_none`: whether to add no_relation pairs in the RD loss;
- `alpha`: the proportion of masking (1 means no masking). In our experiments we find masking is not helpful, as described in the main paper, so we do not mask in the pre-training phase for any model; we keep this option for further research explorations;
- `flow`: if masking is used, whether to apply a linear decay;
- `wiki_loss`: whether to add the ED loss;
- `doc_loss`: whether to add the RD loss;
- `start_end_token`: use an alternative entity encoding method;
- `cased`: whether to use the cased version of BERT.
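To make the role of `temperature` concrete, here is a minimal, self-contained sketch of an InfoNCE-style contrastive loss in the spirit of the ED/RD objectives; it is not the repository's actual implementation, and all names in it are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, candidates, positive_idx, temperature=0.05):
    """Toy InfoNCE-style contrastive loss.

    query:        (hidden,)    representation of an entity / entity pair
    candidates:   (n, hidden)  representations of candidate entities / pairs
    positive_idx: index of the positive candidate among the n candidates
    temperature:  the tau that scales the similarities (cf. --temperature)
    """
    # Cosine similarities between the query and every candidate,
    # scaled by the temperature tau.
    sims = F.cosine_similarity(query.unsqueeze(0), candidates, dim=-1) / temperature
    # Cross-entropy over the scaled similarities: the positive candidate
    # should receive the highest score.
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([positive_idx]))

# Example: one positive among 8 candidates, 768-dim representations.
q = torch.randn(768)
cands = torch.randn(8, 768)
loss = info_nce_loss(q, cands, positive_idx=3, temperature=0.05)
```

A lower temperature sharpens the distribution over candidates, so the loss concentrates on the hardest negatives; the pre-training command above sets it to 0.05.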
3. Fine-tuning
Enter the folder of each downstream task (document-level / sentence-level relation extraction, entity typing and question answering) for fine-tuning. Before fine-tuning, we assume you have already pre-trained an ERICA model. Execute the bash script in each folder to reproduce our results.
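For illustration only, a fine-tuning run follows the pattern below; the folder name and script name are hypothetical placeholders, so consult the bash scripts inside each task folder for the actual commands.

```bash
# Fine-tune a pre-trained ERICA checkpoint on one downstream task.
# "docred" and "train.sh" are placeholder names, not necessarily the
# real folder/script names in this repository.
cd code/finetune/docred   # e.g. document-level relation extraction
bash train.sh
```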