"Structure-Augmented Text Representation Learning for Efficient Knowledge Graph Completion"(WWW 2021)

Overview

STAR_KGC

This repo contains the source code of the paper accepted by WWW'2021. "Structure-Augmented Text Representation Learning for Efficient Knowledge Graph Completion"(WWW 2021).

1. Thanks

The repository is partially based on huggingface transformers, KG-BERT and RotatE.

2. Installing requirement packages

  • conda create -n StAR python=3.6
  • source activate StAR
  • pip install numpy torch tensorboardX tqdm boto3 requests regex sacremoses sentencepiece matplotlib
2.1 Optional package (for mixed float Computation)

3. Dataset

  • WN18RR, FB15k-237, UMLS

    • Train and test set in ./data
    • As validation on original dev set is costly, we validated the model on dev subset during training.
    • The dev subset of WN18RR is provided in ./data/WN18RR called new_dev.dict. Use below commands to get the dev subset for WN18RR (FB15k-237 is similar without the --do_lower_case) used in training process.
     CUDA_VISIBLE_DEVICES=0 \
      python get_new_dev_dict.py \
     	--model_class bert \
     	--weight_decay 0.01 \
     	--learning_rate 5e-5 \
     	--adam_epsilon 1e-6 \
     	--max_grad_norm 0. \
     	--warmup_proportion 0.05 \
     	--do_train \
     	--num_train_epochs 7 \
     	--dataset WN18RR \
     	--max_seq_length 128 \
     	--gradient_accumulation_steps 4 \
     	--train_batch_size 16 \
     	--eval_batch_size 128 \
     	--logging_steps 100 \
     	--eval_steps -1 \
     	--save_steps 2000 \
     	--model_name_or_path bert-base-uncased \
     	--do_lower_case \
     	--output_dir ./result/WN18RR_get_dev \
     	--num_worker 12 \
     	--seed 42 \
    
     CUDA_VISIBLE_DEVICES=0 \
      python get_new_dev_dict.py \
     	--model_class bert \
     	--weight_decay 0.01 \
     	--learning_rate 5e-5 \
     	--adam_epsilon 1e-6 \
     	--max_grad_norm 0. \
     	--warmup_proportion 0.05 \
     	--do_eval \
     	--num_train_epochs 7 \
     	--dataset WN18RR \
     	--max_seq_length 128 \
     	--gradient_accumulation_steps 4 \
     	--train_batch_size 16 \
     	--eval_batch_size 128 \
     	--logging_steps 100 \
     	--eval_steps 1000 \
     	--save_steps 2000 \
     	--model_name_or_path ./result/WN18RR_get_dev \
     	--do_lower_case \
     	--output_dir ./result/WN18RR_get_dev \
     	--num_worker 12 \
     	--seed 42 \
    
  • NELL-One

    • We reformat original NELL-One as the three benchmarks above.
    • Please run the below command to get the reformatted data.
     python reformat_nell_one.py --data_dir path_to_downloaded --output_dir ./data/NELL_standard
    

4. Training and Test (StAR)

Run the below commands for reproducing results in paper. Note, all the eval_steps is set to -1 to train w/o validation and save the last checkpoint, because standard dev is very time-consuming. This can get similar results as in the paper.

4.1 WN18RR

CUDA_VISIBLE_DEVICES=0 \
python run_link_prediction.py \
    --model_class roberta \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --adam_betas 0.9,0.98 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 0. \
    --warmup_proportion 0.05 \
    --do_train --do_eval \
    --do_prediction \
    --num_train_epochs 7 \
    --dataset WN18RR \
    --max_seq_length 128 \
    --gradient_accumulation_steps 4 \
    --train_batch_size 16 \
    --eval_batch_size 128 \
    --logging_steps 100 \
    --eval_steps 4000 \
    --save_steps 2000 \
    --model_name_or_path roberta-large \
    --output_dir ./result/WN18RR_roberta-large \
    --num_worker 12 \
    --seed 42 \
    --cls_method cls \
    --distance_metric euclidean \
CUDA_VISIBLE_DEVICES=2 \
python run_link_prediction.py \
    --model_class bert \
    --weight_decay 0.01 \
    --learning_rate 5e-5 \
    --adam_betas 0.9,0.98 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 0. \
    --warmup_proportion 0.05 \
    --do_train --do_eval \
    --do_prediction \
    --num_train_epochs 7 \
    --dataset WN18RR \
    --max_seq_length 128 \
    --gradient_accumulation_steps 4 \
    --train_batch_size 16 \
    --eval_batch_size 128 \
    --logging_steps 100 \
    --eval_steps 4000 \
    --save_steps 2000 \
    --model_name_or_path bert-base-uncased \
    --do_lower_case \
    --output_dir ./result/WN18RR_bert \
    --num_worker 12 \
    --seed 42 \
    --cls_method cls \
    --distance_metric euclidean \

4.2 FB15k-237

CUDA_VISIBLE_DEVICES=0 \
python run_link_prediction.py \
    --model_class roberta \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --adam_betas 0.9,0.98 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 0. \
    --warmup_proportion 0.05 \
    --do_train --do_eval \
    --do_prediction \
    --num_train_epochs 7. \
    --dataset FB15k-237 \
    --max_seq_length 100 \
    --gradient_accumulation_steps 4 \
    --train_batch_size 16 \
    --eval_batch_size 128 \
    --logging_steps 100 \
    --eval_steps -1 \
    --save_steps 2000 \
    --model_name_or_path roberta-large \
    --output_dir ./result/FB15k-237_roberta-large \
    --num_worker 12 \
    --seed 42 \
    --fp16 \
    --cls_method cls \
    --distance_metric euclidean \

4.3 UMLS

CUDA_VISIBLE_DEVICES=0 \
python run_link_prediction.py \
    --model_class roberta \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --adam_betas 0.9,0.98 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 0. \
    --warmup_proportion 0.05 \
    --do_train --do_eval \
    --do_prediction \
    --num_train_epochs 20 \
    --dataset UMLS \
    --max_seq_length 16 \
    --gradient_accumulation_steps 1 \
    --train_batch_size 16 \
    --eval_batch_size 128 \
    --logging_steps 100 \
    --eval_steps -1 \
    --save_steps 200 \
    --model_name_or_path roberta-large \
    --output_dir ./result/UMLS_model \
    --num_worker 12 \
    --seed 42 \
    --cls_method cls \
    --distance_metric euclidean 

4.4 NELL-One

CUDA_VISIBLE_DEVICES=0 \
python run_link_prediction.py \
    --model_class bert \
    --do_train --do_eval \usepacka--do_prediction \
    --warmup_proportion 0.1 \
    --learning_rate 5e-5 \
    --num_train_epochs 8. \
    --dataset NELL_standard \
    --max_seq_length 32 \
    --gradient_accumulation_steps 1 \
    --train_batch_size 16 \
    --eval_batch_size 128 \
    --logging_steps 100 \
    --eval_steps -1 \
    --save_steps 2000 \
    --model_name_or_path bert-base-uncased \
    --do_lower_case \
    --output_dir ./result/NELL_model \
    --num_worker 12 \
    --seed 42 \
    --fp16 \
    --cls_method cls \
    --distance_metric euclidean 

5. StAR_Self-Adp

5.1 Data preprocessing

  • Get the trained model of RotatE, more details please refer to RotatE.

  • Run the below commands sequentially to get the training dataset of StAR_Self-Adp.

    • Run the run_get_ensemble_data.py in ./StAR
     CUDA_VISIBLE_DEVICES=0 python run_get_ensemble_data.py \
     	--dataset WN18RR \
     	--model_class roberta \
     	--model_name_or_path ./result/WN18RR_roberta-large \
     	--output_dir ./result/WN18RR_roberta-large \
     	--seed 42 \
     	--fp16 
    
    • Run the ./codes/run.py in rotate. (please replace the TRAINED_MODEL_PATH with your own trained model's path)
     CUDA_VISIBLE_DEVICES=3 python ./codes/run.py \
     	--cuda --init ./models/RotatE_wn18rr_0 \
     	--test_batch_size 16 \
     	--star_info_path /home/wangbo/workspace/StAR_KGC-master/StAR/result/WN18RR_roberta-large \
     	--get_scores --get_model_dataset 
    

5.2 Train and Test

  • Run the run.py in ./StAR/ensemble. Note the --mode should be alternate in head and tail, and perform a average operation to get the final results.
  • Note: Please replace YOUR_OUTPUT_DIR, TRAINED_MODEL_PATH and StAR_FILE_PATH in ./StAR/peach/common.py with your own paths to run the command and code.
CUDA_VISIBLE_DEVICES=2 python run.py \
--do_train --do_eval --do_prediction --seen_feature \
--mode tail \
--learning_rate 1e-3 \
--feature_method mix \
--neg_times 5 \
--num_train_epochs 3 \
--hinge_loss_margin 0.6 \
--train_batch_size 32 \
--test_batch_size 64 \
--logging_steps 100 \
--save_steps 2000 \
--eval_steps -1 \
--warmup_proportion 0 \
--output_dir /home/wangbo/workspace/StAR_KGC-master/StAR/result/WN18RR_roberta-large_ensemble  \
--dataset_dir /home/wangbo/workspace/StAR_KGC-master/StAR/result/WN18RR_roberta-large \
--context_score_path /home/wangbo/workspace/StAR_KGC-master/StAR/result/WN18RR_roberta-large \
--translation_score_path /home/wangbo/workspace/StAR_KGC-master/rotate/models/RotatE_wn18rr_0  \
--seed 42 
Comments
  • Input

    Input

    thanks for your excellent work in KGC filed. i have not seen the code yet and i am confused about the input . the input text of entity is the short version like "Gary Rydstrom" or the long version like "Gary Roger Rydstrom is an American sound designer and director. He has won seven Academy Awards for his work in sound for movies."?

    opened by Jiang-X-Pro 13
  • Reproducibility

    Reproducibility

    Hi guys, good work, but I struggle a bit with reproducing your results. It's nothing serious, but it would be better to have clone-and-use approach. So far I encounter these little obstacles:

    • hardcoded paths, e.g. USER_HOME+"/workspace/StAR/data/"
    • It's a handy thing to have the exact commands to reproduce WN18RR in the readme, but I guess there are some incorrect paths, e.g. to run.py in RotatE/ folder with ./result/WN18RR_roberta-large/ (that should be ../StAR/results/WN18RR_roberta-large, right?); also guessing there is a typo in --output_dir ./result/FB15k-237_roberta-largel
    • When I followed your readme's commands exactly, I ended up with FileNotFoundError: [Errno 2] No such file or directory: './result/UMLS_model_roberta-large/similarity_score_mtx.npy'. The only place where */similarity_score_mtx.npy is saved is in get_ensembled_data.py which is not invocated anywhere in the project (although commented in get_ensembled_data.py).

    Please, can you provide a fix for the similarity_score_mtx.npy missing file? I could simply remove the commented line but there is no mention of how to use get_ensembled_data.py.

    best Martin

    opened by martinsvat 10
  • The role of mixed float?

    The role of mixed float?

    hi, Thanks for your sharing of this paper, i have a question about the 2.1 step which are the mixed float computation package, its function whether to speed ​​up training ? and the second question is "KGE/StAR_KGC/StAR/transformers/configuration_utils.py", line 182, in from_pretrained raise EnvironmentError(msg) OSError: Couldn't reach server at 'https://s3.amazonaws.com/models.huggingface.co/bert/bert/config.json' to download configuration file or configuration file is not a valid JSON file. Please check network or file content here: /root/.cache/torch/transformers/5c6eb663edfa884554ead90bd31a56bd13c655097d981a6a7a4f3fc67d17fdf6." i cannot find the file according its prompt, how to solve this problem? Hope for your reply. thanks!!

    opened by yhjiujiu 4
  • problem of

    problem of "new_dev.dict"

    for the script "get_new_dev_dict.py", it need "new_dev.dict" in "evaluate", however, only WN18RR already have "new_dev.dict".

    opened by njcx-ai 3
  • About negative sampling

    About negative sampling

    Hi, I am a little confused with the negative sampling process. I saw in the paper, you have mentioned that the wrong triples are generated by replacing either the head or tail entity with another entity randomly sampled. But in the negetive_sampling function in kb_dataset.py, it has the probability to generate wrong triples through corrupting the relation. So, may I ask if there is a difference between these two methods? Hope you could help me with this problem, thanks a lot!

    opened by LSX-Sneakerprogrammer 2
  • How to generate the new_dev.dict?

    How to generate the new_dev.dict?

    Hi Guy, good job. I met some problem when I try to run my dataset. I wanna know How to generate the new_dev.dict?

    I use "--do_eval" in link prediction, it will show an error below: Traceback (most recent call last): File "run_link_prediction.py", line 4, in main() File "/data2/lidj/projects/StAR_KGC/StAR/link_prediction/train.py", line 950, in main train(args, train_dataset, model, tokenizer, eval_dataset=dev_dataset, eval_fn=evaluate_pairwise_ranking) File "/data2/lidj/projects/StAR_KGC/StAR/kbc/utils_fn.py", line 279, in train args, eval_dataset, model, tokenizer, global_step=global_step, file_prefix="eval_") File "/data2/lidj/projects/StAR_KGC/StAR/link_prediction/train.py", line 625, in evaluate_pairwise_ranking dev_dict = torch.load("./data/" + args.dataset + "/new_dev.dict") File "/data2/lidj/venv/StAR/lib/python3.6/site-packages/torch/serialization.py", line 579, in load with _open_file_like(f, 'rb') as opened_file: File "/data2/lidj/venv/StAR/lib/python3.6/site-packages/torch/serialization.py", line 230, in _open_file_like return _open_file(name_or_buffer, mode) File "/data2/lidj/venv/StAR/lib/python3.6/site-packages/torch/serialization.py", line 211, in init super(_open_file, self).init(open(name, mode)) FileNotFoundError: [Errno 2] No such file or directory: './data/ctd/new_dev.dict'

    However, if i didn't use "--do_eval" in link prediction, it will generate new_dev.dict at last. Do we have any method to generate the new_dev.dict before running link prediction?

    Hope you reply. Thanks a lot.

    opened by tideng777 1
  • how to generate the prediction result in detail?

    how to generate the prediction result in detail?

    Hi guys, good work. I wanna to check hit@10 results in detail. I mean how to generate the candicates result for test.tsv? Which file show the result of prediction candicates in detail, not just metrices? Thanks a lot.

    image

    opened by tideng777 1
  • StAR inference (reproducibility of test_head_full_scores.list file)

    StAR inference (reproducibility of test_head_full_scores.list file)

    Hi, I've been working with your codebase a while but there is one issue, which occurred along the path and I haven't been able to overcome it so far. For a start, my usage of StAR (ensemble, and others) is a little bit different. I don't work with batches, rather than that I need one single object (let's say EnsembeModel or StarModel) which I query using a triple and get a probability value of that particular triple in return.

    So far, I managed to dig through the codebase to do the things I need, but when I started verifying my implementation with your results, I failed to obtain the same results. So, please, how can I replicate those values that are in StAR model, e.g. WN18RR_roberta-large/test_head_full_scores.list ?

    I got that I should do something like

    with torch.no_grad():
      logits = model.classifier(_rep_src, _rep_tgt)
      logits = torch.softmax(logits, dim=-1)
      local_scores = logits.detach().cpu().numpy()[:, 1]
      resultForTheTriple = local_scores[0]
    

    Right?

    I've been following the method get_ensemble_data.py:get_scores which is buggy (namely because of id2ent_list = dataset.id2ent_list), and secondly, it is not invoked anywhere in the codebase (I haven't found such invocation :( hope I am mistaken). Anyhow, when I use this method to get the result for the first test-query (['06845599', '_member_of_domain_usage', '03754979']) in the WN18RR dataset I get something like 0.017418645322322845 which is not the value stored in WN18RR_roberta-large/test_head_full_scores.list (actually, loading this file, e.g. l = toch.load("test_head_full_scores.list"), and then inferring the value, e.g. l0][1][l[0][0]], yields something like 0.9994868. Secondly, when I tried to implement the same method on my own, I ran into indeterminism, e.g. running the exact same script resulted in different scores (for the first test-query mentioned above). I checked that the embeddings are the same in both places (they are, I'm using your loading of embeddings), but the output of the model, e.g. model.classifier(_rep_src, _rep_tgt), differ every time. Have you seen anything like this? For example, is there some drop-out or something similar in (Ro)Berta model that should be set up prior to evaluation besides "model.eval()"?

    So, you see that I actually want to produce a file with StAR scores on my own (e.g. WN18RR_roberta-large/test_head_full_scores.list) but am unable to do so (and deterministically). Please, can you point me to the place in the code where this is happening or say where I made any mistake? I would appreciate it all :) Thanks a lot.

    best, Martin

    opened by martinsvat 3
Owner
Bo Wang
Ph.D. student at the School of Artificial Intelligence, Jilin University.
Bo Wang
Implementation of Geometric Vector Perceptron, a simple circuit for 3d rotation equivariance for learning over large biomolecules, in Pytorch. Idea proposed and accepted at ICLR 2021

Geometric Vector Perceptron Implementation of Geometric Vector Perceptron, a simple circuit with 3d rotation equivariance for learning over large biom

Phil Wang 59 Nov 24, 2022
[ICLR 2021] "Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective" by Wuyang Chen, Xinyu Gong, Zhangyang Wang

Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective [PDF] Wuyang Chen, Xinyu Gong, Zhangyang Wang In ICLR 2

VITA 156 Nov 28, 2022
The official implementation of NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation [ICLR-2021]. https://arxiv.org/pdf/2101.12378.pdf

NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation [ICLR-2021] Release Notes The offical PyTorch implementation of NeMo, p

Angtian Wang 76 Nov 23, 2022
This is the code for the paper "Contrastive Clustering" (AAAI 2021)

Contrastive Clustering (CC) This is the code for the paper "Contrastive Clustering" (AAAI 2021) Dependency python>=3.7 pytorch>=1.6.0 torchvision>=0.8

Yunfan Li 210 Dec 30, 2022
CVPR 2021 Challenge on Super-Resolution Space

Learning the Super-Resolution Space Challenge NTIRE 2021 at CVPR Learning the Super-Resolution Space challenge is held as a part of the 6th edition of

andreas 104 Oct 26, 2022
An implementation of Deep Forest 2021.2.1.

Deep Forest (DF) 21 DF21 is an implementation of Deep Forest 2021.2.1. It is designed to have the following advantages: Powerful: Better accuracy than

LAMDA Group, Nanjing University 795 Jan 3, 2023
Code for our ICASSP 2021 paper: SA-Net: Shuffle Attention for Deep Convolutional Neural Networks

SA-Net: Shuffle Attention for Deep Convolutional Neural Networks (paper) By Qing-Long Zhang and Yu-Bin Yang [State Key Laboratory for Novel Software T

Qing-Long Zhang 199 Jan 8, 2023
Official implementation of the ICLR 2021 paper

You Only Need Adversarial Supervision for Semantic Image Synthesis Official PyTorch implementation of the ICLR 2021 paper "You Only Need Adversarial S

Bosch Research 272 Dec 28, 2022
PyTorch code for ICLR 2021 paper Unbiased Teacher for Semi-Supervised Object Detection

Unbiased Teacher for Semi-Supervised Object Detection This is the PyTorch implementation of our paper: Unbiased Teacher for Semi-Supervised Object Detection

Facebook Research 366 Dec 28, 2022
Seach Losses of our paper 'Loss Function Discovery for Object Detection via Convergence-Simulation Driven Search', accepted by ICLR 2021.

CSE-Autoloss Designing proper loss functions for vision tasks has been a long-standing research direction to advance the capability of existing models

Peidong Liu(刘沛东) 54 Dec 17, 2022
[CVPR 2021] Released code for Counterfactual Zero-Shot and Open-Set Visual Recognition

Counterfactual Zero-Shot and Open-Set Visual Recognition This project provides implementations for our CVPR 2021 paper Counterfactual Zero-S

null 144 Dec 24, 2022
PyTorch implementation of paper "Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes", CVPR 2021

Neural Scene Flow Fields PyTorch implementation of paper "Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes", CVPR 20

Zhengqi Li 585 Jan 4, 2023
Dense Contrastive Learning (DenseCL) for self-supervised representation learning, CVPR 2021.

Dense Contrastive Learning for Self-Supervised Visual Pre-Training This project hosts the code for implementing the DenseCL algorithm for se

Xinlong Wang 491 Jan 3, 2023
ICRA 2021 "Towards Precise and Efficient Image Guided Depth Completion"

PENet: Precise and Efficient Depth Completion This repo is the PyTorch implementation of our paper to appear in ICRA2021 on "Towards Precise and Effic

null 232 Dec 25, 2022
[ICLR 2021] Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments.

[ICLR 2021] RAPID: A Simple Approach for Exploration in Reinforcement Learning This is the Tensorflow implementation of ICLR 2021 paper Rank the Episo

Daochen Zha 48 Nov 21, 2022
Pytorch implementation of BRECQ, ICLR 2021

BRECQ Pytorch implementation of BRECQ, ICLR 2021 @inproceedings{ li&gong2021brecq, title={BRECQ: Pushing the Limit of Post-Training Quantization by Bl

Yuhang Li 148 Dec 28, 2022
Code for Multiple Instance Active Learning for Object Detection, CVPR 2021

Language: 简体中文 | English Introduction This is the code for Multiple Instance Active Learning for Object Detection, CVPR 2021. Installation A Linux pla

Tianning Yuan 269 Dec 21, 2022
Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning, CVPR 2021

Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning By Zhenda Xie*, Yutong Lin*, Zheng Zhang, Yue Ca

Zhenda Xie 293 Dec 20, 2022
Official pytorch implementation of paper "Inception Convolution with Efficient Dilation Search" (CVPR 2021 Oral).

IC-Conv This repository is an official implementation of the paper Inception Convolution with Efficient Dilation Search. Getting Started Download Imag

Jie Liu 111 Dec 31, 2022