
ERNIE

Source code and dataset for the ACL 2019 paper "ERNIE: Enhanced Language Representation with Informative Entities".

Requirements:

  • Pytorch>=0.4.1
  • Python3
  • tqdm
  • boto3
  • requests
  • apex (if you want to use fp16, make sure your apex installation is at commit 79ad5a88e91434312b43b4a89d66226be5f2cc98)

Prepare Pre-train Data

Run the following command to create training instances.

  # Download Wikidump
  wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
  # Download anchor2id
  wget -c https://cloud.tsinghua.edu.cn/f/1c956ed796cb4d788646/?dl=1 -O anchor2id.txt
  # WikiExtractor
  python3 pretrain_data/WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -o pretrain_data/output -l --min_text_length 100 --filter_disambig_pages -it abbr,b,big --processes 4
  # Modify anchors with 4 processes
  python3 pretrain_data/extract.py 4
  # Preprocess with 4 processes
  python3 pretrain_data/create_ids.py 4
  # create instances
  python3 pretrain_data/create_insts.py 4
  # merge
  python3 code/merge.py

If you want to build anchor2id.txt yourself, run the following commands (this takes about half a day) after running python3 pretrain_data/extract.py 4:

  # extract anchors
  python3 pretrain_data/utils.py get_anchors
  # query Mediawiki api using anchor link to get wikibase item id. For more details, see https://en.wikipedia.org/w/api.php?action=help.
  python3 pretrain_data/create_anchors.py 256 
  # aggregate anchors 
  python3 pretrain_data/utils.py agg_anchors
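
For reference, a single anchor-to-ID lookup against the MediaWiki API looks roughly like the sketch below. The helper name map_anchor_to_qid and the exact request parameters are illustrative assumptions; pretrain_data/create_anchors.py batches such queries across many worker processes.

  # Sketch only (not part of the pipeline): look up the Wikidata item id for one anchor.
  import requests

  def map_anchor_to_qid(anchor):
      """Query the English Wikipedia API for the Wikidata item id of one anchor title."""
      params = {
          "action": "query",
          "prop": "pageprops",
          "ppprop": "wikibase_item",
          "redirects": 1,        # follow redirects so aliases map to the same item
          "titles": anchor,
          "format": "json",
      }
      resp = requests.get("https://en.wikipedia.org/w/api.php", params=params, timeout=10)
      for page in resp.json()["query"]["pages"].values():
          qid = page.get("pageprops", {}).get("wikibase_item")
          if qid:
              return qid         # e.g. "Q937" for "Albert Einstein"
      return None                # the anchor has no linked Wikidata item

  print(map_anchor_to_qid("Albert Einstein"))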

Run the following command to pretrain:

  python3 code/run_pretrain.py --do_train --data_dir pretrain_data/merge --bert_model ernie_base --output_dir pretrain_out/ --task_name pretrain --fp16 --max_seq_length 256

We pre-train our model on 8 NVIDIA 2080 Ti GPUs with 32 instances per GPU. It takes nearly one day to finish the training (1 epoch is enough).

Pre-trained Model

Download the pre-trained knowledge embeddings from Google Drive/Tsinghua Cloud and extract them.

tar -xvzf kg_embed.tar.gz

Download the pre-trained ERNIE model from Google Drive/Tsinghua Cloud and extract it.

tar -xvzf ernie_base.tar.gz

Note that extraction may not complete correctly on Windows.

Fine-tune

As most datasets (except FewRel) don't have entity annotations, we use TAGME to extract the entity mentions in the sentences and link them to their corresponding entities in KGs. We provide the annotated datasets on Google Drive/Tsinghua Cloud.

tar -xvzf data.tar.gz

In the root directory of the project, run the following commands to fine-tune ERNIE on different datasets.

FewRel:

python3 code/run_fewrel.py   --do_train   --do_lower_case   --data_dir data/fewrel/   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 10   --output_dir output_fewrel   --fp16   --loss_scale 128
# evaluate
python3 code/eval_fewrel.py   --do_eval   --do_lower_case   --data_dir data/fewrel/   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 10   --output_dir output_fewrel   --fp16   --loss_scale 128

TACRED:

python3 code/run_tacred.py   --do_train   --do_lower_case   --data_dir data/tacred   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 4.0   --output_dir output_tacred   --fp16   --loss_scale 128 --threshold 0.4
# evaluate
python3 code/eval_tacred.py   --do_eval   --do_lower_case   --data_dir data/tacred   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 4.0   --output_dir output_tacred   --fp16   --loss_scale 128 --threshold 0.4

FIGER:

python3 code/run_typing.py    --do_train   --do_lower_case   --data_dir data/FIGER   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 2048   --learning_rate 2e-5   --num_train_epochs 3.0   --output_dir output_figer  --gradient_accumulation_steps 32 --threshold 0.3 --fp16 --loss_scale 128 --warmup_proportion 0.2
# evaluate
python3 code/eval_figer.py    --do_eval   --do_lower_case   --data_dir data/FIGER   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 2048   --learning_rate 2e-5   --num_train_epochs 3.0   --output_dir output_figer  --gradient_accumulation_steps 32 --threshold 0.3 --fp16 --loss_scale 128 --warmup_proportion 0.2

OpenEntity:

python3 code/run_typing.py    --do_train   --do_lower_case   --data_dir data/OpenEntity   --ernie_model ernie_base   --max_seq_length 128   --train_batch_size 16   --learning_rate 2e-5   --num_train_epochs 10.0   --output_dir output_open --threshold 0.3 --fp16 --loss_scale 128
# evaluate
python3 code/eval_typing.py   --do_eval   --do_lower_case   --data_dir data/OpenEntity   --ernie_model ernie_base   --max_seq_length 128   --train_batch_size 16   --learning_rate 2e-5   --num_train_epochs 10.0   --output_dir output_open --threshold 0.3 --fp16 --loss_scale 128

Some of the code is modified from pytorch-pretrained-BERT. You can find the explanation of most parameters in the pytorch-pretrained-BERT documentation.

As the annotations given by TAGME have confidence scores, we use --threshold to set the lowest acceptable confidence score and keep only the annotations whose scores are higher than --threshold. In these experiments, the value is usually 0.3 or 0.4.
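
As a minimal illustration of what the threshold does, assuming each annotation is stored as [entity_id, start_offset, end_offset, score] (the field order here is an assumption for illustration):

  def filter_annotations(annotations, threshold=0.3):
      # Keep only annotations whose TAGME confidence is higher than the threshold.
      return [ann for ann in annotations if ann[3] > threshold]

  annotations = [['Q8029103', 139, 143, 0.5], ['Q42', 10, 16, 0.12]]
  print(filter_annotations(annotations, threshold=0.3))  # -> [['Q8029103', 139, 143, 0.5]]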

The evaluation script for relation classification only reports accuracy. For the macro/micro metrics, use code/score.py, which is taken from the TACRED repo.

python3 code/score.py gold_file pred_file

You can find gold_file and pred_file for each checkpoint in the output folder (--output_dir).
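
If you only need a quick sanity check, a rough equivalent using scikit-learn (not a project dependency, and unlike the official scorer it does not exclude the no_relation class) might look like the sketch below, assuming both files contain one relation label per line:

  from sklearn.metrics import precision_recall_fscore_support

  def quick_score(gold_file, pred_file):
      # Assumes one relation label per line in both files (an assumption; check the
      # files produced in --output_dir). The official code/score.py is authoritative.
      with open(gold_file) as f:
          gold = [line.strip() for line in f if line.strip()]
      with open(pred_file) as f:
          pred = [line.strip() for line in f if line.strip()]
      for average in ("micro", "macro"):
          p, r, f1, _ = precision_recall_fscore_support(gold, pred, average=average, zero_division=0)
          print(f"{average}: P={p:.4f} R={r:.4f} F1={f1:.4f}")

  quick_score("path/to/gold_file", "path/to/pred_file")  # replace with the actual files from --output_dir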

New Tasks:

If you want to use ERNIE in new tasks, you should follow these steps:

  • Use an entity-linking tool like TAGME to extract the entities in the text
  • Look up the Wikidata IDs of the extracted entities
  • Take the text and the entity sequence as input data

Here is a quick-start example (code/example.py) using ERNIE for masked language modeling. We show how to annotate a given sentence with TAGME and build the input data for ERNIE. Note that it takes some time (around 5 minutes) to load the model.

# If you haven't installed tagme
pip install tagme
# Run example
python3 code/example.py
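
For a rough idea of how the annotation step could look on its own, here is a sketch using the tagme package (which requires a free D4Science token); the sample text and variable names are illustrative, and code/example.py remains the reference implementation.

  import tagme

  tagme.GCUBE_TOKEN = "YOUR_GCUBE_TOKEN"  # placeholder: register at D4Science to obtain a token

  text = "Bob Dylan wrote Blowin' in the Wind in 1962."
  response = tagme.annotate(text)

  # Keep annotations above the same kind of confidence threshold used for the fine-tuning data.
  entities = []
  for ann in response.get_annotations(0.3):
      # ann.entity_title is a Wikipedia title; it still has to be mapped to a Wikidata ID
      # (e.g. via the MediaWiki pageprops API) before it can be matched against kg_embed.
      entities.append((ann.entity_title, ann.begin, ann.end, ann.score))

  print(entities)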

Cite

If you use the code, please cite this paper:

@inproceedings{zhang2019ernie,
  title={{ERNIE}: Enhanced Language Representation with Informative Entities},
  author={Zhang, Zhengyan and Han, Xu and Liu, Zhiyuan and Jiang, Xin and Sun, Maosong and Liu, Qun},
  booktitle={Proceedings of ACL 2019},
  year={2019}
}
Comments
  • What is the meaning of entity column

    In train.csv, the typical format of an entity is [['Q8029103', 139, 143, 0.5], [......]]. Here 'Q8029103' is the identifier of the entity; what is the meaning of 139, 143, 0.5?

    opened by hanjie0 5
  • Source code TACRED F1 score doesn't correspond with the F1 score in the paper

    Hello, I fine-tuned ERNIE on TACRED for the relation classification task. I got an ERNIE F1 score of 66.xx, which doesn't match the F1 score of 67.97 reported in the ERNIE paper. I followed the same instructions and parameters as you mention on GitHub. There's no problem with FewRel and FIGER, but I got poorer scores on OpenEntity and TACRED, especially TACRED.
    Could you please give me the parameters you used in the ERNIE paper for OpenEntity and TACRED? I want to obtain the same results as reported in the paper.

    opened by 106753004 4
  • Which entities are sampled for pre-training TransE KGembedding?

    Thanks for uploading the code. I have a question about the paper:

    we sample part of Wikidata which contains 5,040,986 entities and 24,267,796 fact triples.

    Actually I previously asked this, but what I'd like to know is

    1. How entities are sampled from wikidata?

    2. How many entities are sampled from wikidata?

    3. How wikidata's entity and wikipedia entity are aligned?

    If you know about these, or where in the code this sampling is done, I'd appreciate it very much. Thanks.

    opened by izuna385 4
  • Questions regarding Tagme

    Hi! I'm wondering how you manage to label a large corpus like Wikipedia with TAGME in a short time. From my experience with the TAGME API, the response time can be pretty slow when labeling a large collection of long articles, and it may take days to fully annotate the Wikipedia corpus. Is there a "local" version of TAGME, or am I simply missing something here? Any help would be appreciated.

    Best regards

    opened by Megavoxel01 4
  • BertModel usage

    Hello, when using the pre-trained model to generate word vectors, what should the inputs be? I tested with the following data but got an error (could you provide an example of using BertModel?): model, _=BertModel.from_pretrained(args.ernie_model) inputs_id=torch.tensor([[ 101, 2207, 5273, 109, 791, 1921, 109, 102], [ 101, 3616, 3152, 109, 809, 711, 109, 102], [ 101, 2207, 5273, 108, 1762, 108, 1408, 102]]) att_mask=torch.tensor([[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]],dtype=torch.long)

    token_type_id=att_mask input_ent=att_mask ents_mask=att_mask

    , token_type_ids=None, attention_mask=None, input_ent=None, ent_mask=None

    output = model(input_ids=inputs_id,input_ent=input_ent,ent_mask=ents_mask)


    opened by Hou-jing 3
  • When using the model provided here (ernie_base) with multiple GPUs, something goes wrong

    Hi, when I use the model (ernie_base) provided here for training with multiple GPUs, I run into a problem. In detail, I replaced the original bert-base-uncased with ernie_base and found this issue: I use GPUs 5 and 7, but the process running on GPU 7 occupies a lot of memory on GPU 5. The original BERT didn't have this problem, so I'm pretty sure the model or the code provided here has some problem. Can you take a look, or have you met the same problem?

    bug 
    opened by gaozhiguang 3
  • expected backend CUDA and dtype Float but got backend CUDA and dtype Half

    I run the command like this: python code/run_pretrain.py --do_train --data_dir pretrain_data/sample --bert_model ernie_base --output_dir pretrain_out/ --task_name pretrain --max_seq_length 256

    However, it hits an error:

    Traceback (most recent call last):
      File "code/run_pretrain.py", line 421, in <module>
        main()
      File "code/run_pretrain.py", line 320, in main
        loss, original_loss = model(input_ids, segment_ids, input_mask, masked_lm_labels, input_ent, ent_mask, next_sentence_label, ent_candidate, ent_labels)
      File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
        raise output
      File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
        output = module(*input, **kwargs)
      File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/data/desmon/gitproject/ERNIE/code/knowledge_bert/modeling.py", line 833, in forward
        output_all_encoded_layers=False)
      File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/data/desmon/gitproject/ERNIE/code/knowledge_bert/modeling.py", line 765, in forward
        output_all_encoded_layers=output_all_encoded_layers)
      File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/data/desmon/gitproject/ERNIE/code/knowledge_bert/modeling.py", line 443, in forward
        hidden_states, hidden_states_ent = layer_module(hidden_states, attention_mask, hidden_states_ent, attention_mask_ent, ent_mask)
      File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/data/desmon/gitproject/ERNIE/code/knowledge_bert/modeling.py", line 382, in forward
        attention_output_ent = hidden_states_ent * ent_mask
    RuntimeError: expected backend CUDA and dtype Float but got backend CUDA and dtype Half

    I don't know which part leads to this problem. Need help.

    opened by DesmonDay 3
  • Which version of apex(commit) have you used

    Hi, I was trying to replicate the experiments but failed because of a version mismatch in apex. I would be grateful if you could let me know the exact commit id for this.

    opened by raghavlite 3
  • Cannot find any code about "[HD] and [TL]"

    The paper says that the tokens [HD] and [TL] are used to represent head entities and tail entities respectively, but I can't find any code for this.

    Thanks a lot!

    opened by lexmen318 3
  • I want to train ERNIE with my own KG, but I don't know how to organize the dataset

    Hello! My questions are as follows: I currently have my own KG dataset and have obtained graph embeddings by training TransE; next I want to train ERNIE.

    1. I tried to wget the English Wikipedia dump (19 GB), but it is too large and I could not download it successfully, so I cannot reproduce any of the pretrain_data steps.
    2. When training ERNIE with my own data, I have no idea how to organize my dataset.
    opened by ly934060690 2
  • anchor2id.txt: "the page you visited does not exist"

    Hello, when I prepare the pre-train data and run the second command, wget -c https://cloud.tsinghua.edu.cn/f/1c956ed796cb4d788646/?dl=1 -O anchor2id.txt, it returns "404 page not found". Has this file been removed?

    opened by estarpro 2
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • Changes for compatibility with latest Apex version

    Recent installations of Apex are incompatible with ERNIE. Made changes to match the latest Apex version.

    1. FP16_Optimizer no longer under apex.optimizers, instead under apex.contrib.optimizers. Source: https://github.com/NVIDIA/apex/issues/593

    2. max_grad_norm no longer a parameter for FusedAdam optimizer. Source: https://nvidia.github.io/apex/optimizers.html

    opened by yuhongsun96 0
Owner
THUNLP
Natural Language Processing Lab at Tsinghua University