ERICA

Source code and dataset for ACL2021 paper: "ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning".

The code is based on Hugging Face's transformers. The trained models and pre-training data can be downloaded from Google Drive.

Quick Start

You can quickly run our code with the following steps:

  • Install dependencies as described in the following section.
  • cd into the pretrain or finetune directory, then download and pre-process the data for pre-training or fine-tuning.

1. Dependencies

Run the following script to install dependencies.

pip install -r requirement.txt

You need to install transformers and apex manually.

transformers: We use Hugging Face transformers (version 2.5.0) to implement BERT and RoBERTa. For convenience, we have included transformers in code/pretrain/ so you can import it easily; we modified some lines in the BertForMaskedLM class in src/transformers/modeling_bert.py while keeping the rest of the code unchanged.

You just need to run

pip install .

to install transformers manually.
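
To give an idea of what such a change can look like, here is a purely illustrative sketch (not the modification actually made in this repository) of subclassing BertForMaskedLM so that the forward pass also returns the encoder's hidden states, which additional objectives such as contrastive losses would need; all names below are hypothetical.

from transformers import BertForMaskedLM

class BertForMaskedLMWithHiddenStates(BertForMaskedLM):
    """Illustrative subclass: return the MLM logits together with the
    encoder's last hidden states so that extra losses (e.g. on entity or
    relation representations) can be computed outside the model."""

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        sequence_output = outputs[0]                   # (batch, seq_len, hidden)
        prediction_scores = self.cls(sequence_output)  # MLM logits
        return prediction_scores, sequence_output

# Loads like any BertForMaskedLM checkpoint.
model = BertForMaskedLMWithHiddenStates.from_pretrained("bert-base-uncased")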

apex: Install apex following the official guidance.

Process pre-training data

In the prepare_pretrain_data folder, we provide the code for processing the pre-training data.

2. Pretraining

To pretrain ERICA_bert:

cd code/pretrain

python -m torch.distributed.launch --nproc_per_node 8  main.py  \
    --model DOC  --lr 3e-5 --batch_size_per_gpu 16 --max_epoch 105  \
    --gradient_accumulation_steps 16    --save_step 500  --temperature 0.05  \
    --train_sample  --save_dir ckpt_doc_dw_f_alpha_1_uncased --n_gpu 8  --debug 1  --add_none 1 \
    --alpha 1 --flow 0 --dataset_name none.json  --wiki_loss 1 --doc_loss 1 \
    --change_dataset 1  --start_end_token 0 --bert_model bert \
    --pretraining_size -1 --ablation 0 --cased 0

Some explanations of the hyper-parameters:

  • temperature: the \tau used in the contrastive-learning loss (see the sketch below).
  • debug: whether to run in debug mode (we provide an example_debug file for pre-training).
  • add_none: whether to add the no_relation pair in the RD loss.
  • alpha: the proportion of masking (1 means no masking). In our experiments masking was not helpful, as described in the main paper, so none of our models mask during pre-training; we keep the option for further research explorations.
  • flow: if masking is used, whether to apply a linear decay.
  • wiki_loss: whether to add the ED loss.
  • doc_loss: whether to add the RD loss.
  • start_end_token: use an alternative entity-encoding method.
  • cased: whether to use the cased version of BERT.
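
For concreteness, below is a minimal sketch of a temperature-scaled contrastive (InfoNCE-style) loss of the kind the temperature hyper-parameter \tau controls. The function name, tensor shapes, and use of cosine similarity are illustrative assumptions, not the repository's exact implementation.

import torch
import torch.nn.functional as F

def contrastive_loss(query, positives, negatives, temperature=0.05):
    # query:     (hidden,)        anchor entity/relation representation
    # positives: (n_pos, hidden)  representations pulled towards the anchor
    # negatives: (n_neg, hidden)  representations pushed away from the anchor
    candidates = torch.cat([positives, negatives], dim=0)
    sims = F.cosine_similarity(query.unsqueeze(0), candidates) / temperature
    log_probs = F.log_softmax(sims, dim=0)
    # Negative log-probability mass assigned to the positive candidates.
    return -log_probs[: positives.size(0)].mean()

# Toy example with random 768-dimensional embeddings.
loss = contrastive_loss(torch.randn(768), torch.randn(2, 768), torch.randn(16, 768))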

3. Fine-tuning

Enter the folder for each downstream task (document-level / sentence-level relation extraction, entity typing, and question answering) to fine-tune. Before fine-tuning, we assume you have already pre-trained an ERICA model. Execute the bash script in each folder to reproduce the results.
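
As a rough illustration only (the actual loading logic lives in each fine-tuning script and is selected via flags such as --ckpt_to_load in the sentence-level RE code), a pre-trained checkpoint could be loaded into a Hugging Face BERT encoder roughly as follows. The checkpoint path, the assumption that it is a flat state dict, and the "bert." prefix handling are hypothetical.

import torch
from transformers import BertModel

ckpt_path = "../pretrain/ckpt/ERICA_bert_uncased_RP"  # hypothetical path
bert = BertModel.from_pretrained("bert-base-uncased")

# Assumes the checkpoint is a flat parameter state dict; adjust if it is
# nested under a key such as "model".
state_dict = torch.load(ckpt_path, map_location="cpu")
bert_state = {k[len("bert."):]: v for k, v in state_dict.items() if k.startswith("bert.")}
result = bert.load_state_dict(bert_state or state_dict, strict=False)
print("missing:", len(result.missing_keys), "unexpected:", len(result.unexpected_keys))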

Comments
  • Problem with fine-tuning on the DocRED dataset

    | epoch 19 | step 9700 | ms/b 35103.27 | train loss 0.00095783 | NA acc: 0.99 | not NA acc: 0.95 | tot acc: 0.98
    | epoch 19 | step 9750 | ms/b 1007.05 | train loss 0.00080489 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
    | epoch 19 | step 9800 | ms/b 1034.59 | train loss 0.00080026 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
    | epoch 19 | step 9850 | ms/b 1064.13 | train loss 0.00079512 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
    | epoch 19 | step 9900 | ms/b 1235.63 | train loss 0.00084200 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
    | epoch 19 | step 9950 | ms/b 1050.25 | train loss 0.00081005 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
    | epoch 19 | step 10000 | ms/b 1057.81 | train loss 0.00076923 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
    | epoch 19 | step 10050 | ms/b 1031.88 | train loss 0.00076228 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
    | epoch 19 | step 10100 | ms/b 1010.93 | train loss 0.00079324 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98
    | epoch 19 | step 10150 | ms/b 1004.57 | train loss 0.00081850 | NA acc: 0.99 | not NA acc: 0.96 | tot acc: 0.98

    dev set evaluation
    ALL : Theta 0.9079 | F1 0.5720 | AUC 0.5670
    Ignore ma_f1 0.5501 | input_theta 0.9079 test_result F1 0.5494 | AUC 0.5376
    test set evaluation
    ma_f1 0.0000 | input_theta 0.9079 test_result F1 0.0000 | AUC 0.0000
    Ignore ma_f1 0.0000 | input_theta 0.9079 test_result F1 0.0000 | AUC 0.0000

    Finish training Best epoch = 19

    I fine-tuned on the DocRED dataset according to the guidelines and set num_train_epochs to 20, but I got these results: the F1 on the validation set is 0.5494, while all metrics on the test set are 0.

    So, why are all the metrics on the test set 0?

    opened by WenxiongLiao 3
  • How are multiple positive examples in one text handled?

    Hello, I would like to ask about the "Entity Discrimination" and "Relation Discrimination" tasks in the paper, which are trained with contrastive learning. Taking "Entity Discrimination" as an example, suppose we have the sentence "A and B co-founded company C." This sentence contains two triples, (C, founded by, A) and (C, founded by, B), so for (C, founded by, ·) both A and B are positive examples. However, in the contrastive-learning formula it looks as if each text has only one positive example. I am not sure whether my understanding is correct and hope to get your answer. Thank you.

    opened by puzzledTao 3
  • Sentence-level RE fine-tuning code cannot be reproduced successfully

    Hi! Here is a problem that I need your help with. After running 'bash run.sh', the code hangs at main.py [line 125]

    loss, output = model(**inputs)

    for a long time

    Namespace(adam_epsilon=1e-08, batch_size_per_gpu=32, ckpt_to_load='../pretrain/ckpt/ERICA_bert_uncased_RP', cuda='0', dataset='tacred', encoder='bert', entity_marker=True, gpu=device(type='cuda'), hidden_size=768, lr=3e-05, max_epoch=8, max_grad_norm=1, max_length=100, mode='CM', optim='adamw', seed=42, train_prop=1.0, warmup_steps=500, weight_decay=1e-05)
    Use all train data!
    pre process train.txt
    The number of sentence in which tokenizer can't find head/tail entity is 0
    pre process dev.txt
    The number of sentence in which tokenizer can't find head/tail entity is 0
    pre process test.txt
    The number of sentence in which tokenizer can't find head/tail entity is 0
    ********* load from ckpt/../pretrain/ckpt/ERICA_bert_uncased_RP ***********
    successful load ckpt
    Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.
    
    Defaults for this optimization level are:
    enabled                : True
    opt_level              : O1
    cast_model_type        : None
    patch_torch_functions  : True
    keep_batchnorm_fp32    : None
    master_weights         : None
    loss_scale             : dynamic
    Processing user overrides (additional kwargs that are not None)...
    After processing overrides, optimization options are:
    enabled                : True
    opt_level              : O1
    cast_model_type        : None
    patch_torch_functions  : True
    keep_batchnorm_fp32    : None
    master_weights         : None
    loss_scale             : dynamic
    Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
    Begin train...
    We will train model in 17032 steps
    

    I tried deleting [main.py/line 103] 'model = nn.DataParallel(model)' and adding 'model.cuda()', but I get this error:

    Traceback (most recent call last):
      File "main.py", line 311, in <module>
        train(args, model, train_dataloader, dev_dataloader, test_dataloader)
      File "main.py", line 131, in train
        loss, output = model(**inputs)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/root/ERICA/finetune/Sent_level_RE/code/re/model.py", line 36, in forward
        outputs = self.bert(input_ids, mask)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_bert.py", line 753, in forward
        input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_bert.py", line 178, in forward
        inputs_embeds = self.word_embeddings(input_ids)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
        self.norm_type, self.scale_grad_by_freq, self.sparse)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1724, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select
    

    Is it possible that the batch data [main.py/line 114] is stored on the CPU while the model is stored on the GPU?
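
    If so, the usual fix after removing nn.DataParallel is to move both the model and every batch tensor onto the same device. A minimal sketch, reusing the model and train_dataloader names from the script above and assuming each batch is the dict later passed as **inputs:

    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    for batch in train_dataloader:
        # Move every tensor in the batch onto the model's device before the forward pass.
        inputs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in batch.items()}
        loss, output = model(**inputs)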

    opened by zweny 2
  • Where did you modify src/transformers/modeling_bert.py?

    I expected the file src/transformers/modeling_bert.py to add Loss_ED and Loss_RD, but I didn't find the relevant code. In which file did you add Loss_ED and Loss_RD?

    opened by WenxiongLiao 1
  • Scripts for preparing pre-training data

    Hi, thank you for your great work. I am trying to reproduce the pre-training data and possibly extend it to different languages. May I know how you extract all_triples and all_qs from the Wikipedia dump (latest-all.json)?

    opened by tonytan48 1
  • Loss for entities

    Thanks for your interesting work!

    When I went through the code, I didn't find the data preparation and loss for entities (pre-training) described in the paper. Could you help me find the related code? Thanks.

    opened by JiachengLi1995 0
Owner: THUNLP (Natural Language Processing Lab at Tsinghua University)
Source code for "UniRE: A Unified Label Space for Entity Relation Extraction.", ACL2021.

UniRE Source code for "UniRE: A Unified Label Space for Entity Relation Extraction.", ACL2021. Requirements python: 3.7.6 pytorch: 1.8.1 transformers:

Wang Yijun 109 Nov 29, 2022
Code and data for ACL2021 paper Cross-Lingual Abstractive Summarization with Limited Parallel Resources.

Multi-Task Framework for Cross-Lingual Abstractive Summarization (MCLAS) The code for ACL2021 paper Cross-Lingual Abstractive Summarization with Limit

Yu Bai 43 Nov 7, 2022
Code and data for ACL2021 paper Cross-Lingual Abstractive Summarization with Limited Parallel Resources.

Multi-Task Framework for Cross-Lingual Abstractive Summarization (MCLAS) The code for ACL2021 paper Cross-Lingual Abstractive Summarization with Limit

Yu Bai 43 Nov 7, 2022
Code for the ACL2021 paper "Lexicon Enhanced Chinese Sequence Labelling Using BERT Adapter"

Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter Code and checkpoints for the ACL2021 paper "Lexicon Enhanced Chinese Sequence Labelling

null 274 Dec 6, 2022
Code for our paper "Sematic Representation for Dialogue Modeling" in ACL2021

AMR-Dialogue An implementation for paper "Semantic Representation for Dialogue Modeling". You may find our paper here. Requirements python 3.6 pytorch

xfbai 45 Dec 26, 2022
Code for ACL2021 paper Consistency Regularization for Cross-Lingual Fine-Tuning.

xTune Code for ACL2021 paper Consistency Regularization for Cross-Lingual Fine-Tuning. Environment DockerFile: dancingsoul/pytorch:xTune Install the f

Bo Zheng 42 Dec 9, 2022
This is the code for ACL2021 paper A Unified Generative Framework for Aspect-Based Sentiment Analysis

This is the code for ACL2021 paper A Unified Generative Framework for Aspect-Based Sentiment Analysis Install the package in the requirements.txt, the

null 108 Dec 23, 2022
This is the official source code for SLATE. We provide the code for the model, the training code, and a dataset loader for the 3D Shapes dataset. This code is implemented in Pytorch.

SLATE This is the official source code for SLATE. We provide the code for the model, the training code and a dataset loader for the 3D Shapes dataset.

Gautam Singh 66 Dec 26, 2022
A Multi-modal Model Chinese Spell Checker Released on ACL2021.

ReaLiSe ReaLiSe is a multi-modal Chinese spell checking model. This the office code for the paper Read, Listen, and See: Leveraging Multimodal Informa

DaDa 106 Dec 29, 2022
Contrastive Learning for Many-to-many Multilingual Neural Machine Translation(mCOLT/mRASP2), ACL2021

Contrastive Learning for Many-to-many Multilingual Neural Machine Translation(mCOLT/mRASP2), ACL2021 The code for training mCOLT/mRASP2, a multilingua

null 104 Jan 1, 2023
Source code for the GPT-2 story generation models in the EMNLP 2020 paper "STORIUM: A Dataset and Evaluation Platform for Human-in-the-Loop Story Generation"

Storium GPT-2 Models This is the official repository for the GPT-2 models described in the EMNLP 2020 paper [STORIUM: A Dataset and Evaluation Platfor

Nader Akoury 27 Dec 20, 2022
Source code and Dataset creation for the paper "Neural Symbolic Regression That Scales"

NeuralSymbolicRegressionThatScales Pytorch implementation and pretrained models for the paper "Neural Symbolic Regression That Scales", presented at I

null 35 Nov 25, 2022
The dataset and source code for our paper: "Did You Ask a Good Question? A Cross-Domain Question IntentionClassification Benchmark for Text-to-SQL"

TriageSQL The dataset and source code for our paper: "Did You Ask a Good Question? A Cross-Domain Question Intention Classification Benchmark for Text

Yusen Zhang 22 Nov 9, 2022
This is the dataset and code release of the OpenRooms Dataset.

This is the dataset and code release of the OpenRooms Dataset.

Visual Intelligence Lab of UCSD 95 Jan 8, 2023
Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Transformers for variable misuse, function naming and code completion tasks The official PyTorch implementation of: Empirical Study of Transformers fo

Bayesian Methods Research Group 56 Nov 15, 2022
Official Implementation and Dataset of "PPR10K: A Large-Scale Portrait Photo Retouching Dataset with Human-Region Mask and Group-Level Consistency", CVPR 2021

Portrait Photo Retouching with PPR10K Paper | Supplementary Material PPR10K: A Large-Scale Portrait Photo Retouching Dataset with Human-Region Mask an

null 184 Dec 11, 2022
A large dataset of 100k Google Satellite and matching Map images, resembling pix2pix's Google Maps dataset.

Larger Google Sat2Map dataset This dataset extends the aerial ⟷ Maps dataset used in pix2pix (Isola et al., CVPR17). The provide script download_sat2m

null 34 Dec 28, 2022
LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation (NeurIPS2021 Benchmark and Dataset Track)

LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation by Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zh

Kingdrone 174 Dec 22, 2022