StAR_KGC
This repo contains the source code of the paper "Structure-Augmented Text Representation Learning for Efficient Knowledge Graph Completion" (WWW 2021).
1. Acknowledgements
The repository is partially based on Hugging Face Transformers, KG-BERT, and RotatE.
2. Installing requirement packages
- conda create -n StAR python=3.6
- source activate StAR
- pip install numpy torch tensorboardX tqdm boto3 requests regex sacremoses sentencepiece matplotlib
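After installing, an optional sanity check (just a suggestion, not part of the original setup) confirms that PyTorch is importable and sees your GPU:
- python -c "import torch; print(torch.__version__, torch.cuda.is_available())"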
2.1 Optional package (for mixed-precision computation)
- git clone https://github.com/NVIDIA/apex
- cd apex
- pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
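If the optional install succeeded, the legacy amp API should be importable; a minimal check (assuming the standard apex layout):
- python -c "from apex import amp; print('apex amp available')"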
3. Datasets
- WN18RR, FB15k-237, UMLS
  - Train and test sets are in ./data.
  - Since validation on the original dev set is costly, we validate the model on a dev subset during training.
  - The dev subset of WN18RR, named new_dev.dict, is provided in ./data/WN18RR. Use the commands below to generate the dev subset used during training for WN18RR (FB15k-237 is similar, just without --do_lower_case).
CUDA_VISIBLE_DEVICES=0 \
python get_new_dev_dict.py \
--model_class bert \
--weight_decay 0.01 \
--learning_rate 5e-5 \
--adam_epsilon 1e-6 \
--max_grad_norm 0. \
--warmup_proportion 0.05 \
--do_train \
--num_train_epochs 7 \
--dataset WN18RR \
--max_seq_length 128 \
--gradient_accumulation_steps 4 \
--train_batch_size 16 \
--eval_batch_size 128 \
--logging_steps 100 \
--eval_steps -1 \
--save_steps 2000 \
--model_name_or_path bert-base-uncased \
--do_lower_case \
--output_dir ./result/WN18RR_get_dev \
--num_worker 12 \
--seed 42

CUDA_VISIBLE_DEVICES=0 \
python get_new_dev_dict.py \
--model_class bert \
--weight_decay 0.01 \
--learning_rate 5e-5 \
--adam_epsilon 1e-6 \
--max_grad_norm 0. \
--warmup_proportion 0.05 \
--do_eval \
--num_train_epochs 7 \
--dataset WN18RR \
--max_seq_length 128 \
--gradient_accumulation_steps 4 \
--train_batch_size 16 \
--eval_batch_size 128 \
--logging_steps 100 \
--eval_steps 1000 \
--save_steps 2000 \
--model_name_or_path ./result/WN18RR_get_dev \
--do_lower_case \
--output_dir ./result/WN18RR_get_dev \
--num_worker 12 \
--seed 42
- NELL-One
  - We reformat the original NELL-One into the same format as the three benchmarks above.
  - Run the command below to get the reformatted data.
python reformat_nell_one.py --data_dir path_to_downloaded --output_dir ./data/NELL_standard
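To sanity-check the reformatted output, one can compare its file layout against an existing benchmark folder such as ./data/WN18RR. A rough sketch (the folder names are taken from the commands above; the files listed are simply whatever the scripts produce):

import os

# Print every file and its line count in each data folder so that the
# reformatted NELL_standard layout can be compared with ./data/WN18RR.
for folder in ("./data/WN18RR", "./data/NELL_standard"):
    print(folder)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as f:
                print(" ", name, sum(1 for _ in f), "lines")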
4. Training and Test (StAR)
Run the commands below to reproduce the results in the paper. Note that eval_steps is set to -1 to train without validation and save only the last checkpoint, since evaluating on the standard dev set is very time-consuming; this yields results similar to those reported in the paper.
4.1 WN18RR
CUDA_VISIBLE_DEVICES=0 \
python run_link_prediction.py \
--model_class roberta \
--weight_decay 0.01 \
--learning_rate 1e-5 \
--adam_betas 0.9,0.98 \
--adam_epsilon 1e-6 \
--max_grad_norm 0. \
--warmup_proportion 0.05 \
--do_train --do_eval \
--do_prediction \
--num_train_epochs 7 \
--dataset WN18RR \
--max_seq_length 128 \
--gradient_accumulation_steps 4 \
--train_batch_size 16 \
--eval_batch_size 128 \
--logging_steps 100 \
--eval_steps 4000 \
--save_steps 2000 \
--model_name_or_path roberta-large \
--output_dir ./result/WN18RR_roberta-large \
--num_worker 12 \
--seed 42 \
--cls_method cls \
--distance_metric euclidean

CUDA_VISIBLE_DEVICES=2 \
python run_link_prediction.py \
--model_class bert \
--weight_decay 0.01 \
--learning_rate 5e-5 \
--adam_betas 0.9,0.98 \
--adam_epsilon 1e-6 \
--max_grad_norm 0. \
--warmup_proportion 0.05 \
--do_train --do_eval \
--do_prediction \
--num_train_epochs 7 \
--dataset WN18RR \
--max_seq_length 128 \
--gradient_accumulation_steps 4 \
--train_batch_size 16 \
--eval_batch_size 128 \
--logging_steps 100 \
--eval_steps 4000 \
--save_steps 2000 \
--model_name_or_path bert-base-uncased \
--do_lower_case \
--output_dir ./result/WN18RR_bert \
--num_worker 12 \
--seed 42 \
--cls_method cls \
--distance_metric euclidean
4.2 FB15k-237
CUDA_VISIBLE_DEVICES=0 \
python run_link_prediction.py \
--model_class roberta \
--weight_decay 0.01 \
--learning_rate 1e-5 \
--adam_betas 0.9,0.98 \
--adam_epsilon 1e-6 \
--max_grad_norm 0. \
--warmup_proportion 0.05 \
--do_train --do_eval \
--do_prediction \
--num_train_epochs 7. \
--dataset FB15k-237 \
--max_seq_length 100 \
--gradient_accumulation_steps 4 \
--train_batch_size 16 \
--eval_batch_size 128 \
--logging_steps 100 \
--eval_steps -1 \
--save_steps 2000 \
--model_name_or_path roberta-large \
--output_dir ./result/FB15k-237_roberta-large \
--num_worker 12 \
--seed 42 \
--fp16 \
--cls_method cls \
--distance_metric euclidean
4.3 UMLS
CUDA_VISIBLE_DEVICES=0 \
python run_link_prediction.py \
--model_class roberta \
--weight_decay 0.01 \
--learning_rate 1e-5 \
--adam_betas 0.9,0.98 \
--adam_epsilon 1e-6 \
--max_grad_norm 0. \
--warmup_proportion 0.05 \
--do_train --do_eval \
--do_prediction \
--num_train_epochs 20 \
--dataset UMLS \
--max_seq_length 16 \
--gradient_accumulation_steps 1 \
--train_batch_size 16 \
--eval_batch_size 128 \
--logging_steps 100 \
--eval_steps -1 \
--save_steps 200 \
--model_name_or_path roberta-large \
--output_dir ./result/UMLS_model \
--num_worker 12 \
--seed 42 \
--cls_method cls \
--distance_metric euclidean
4.4 NELL-One
CUDA_VISIBLE_DEVICES=0 \
python run_link_prediction.py \
--model_class bert \
--do_train --do_eval \
--do_prediction \
--warmup_proportion 0.1 \
--learning_rate 5e-5 \
--num_train_epochs 8. \
--dataset NELL_standard \
--max_seq_length 32 \
--gradient_accumulation_steps 1 \
--train_batch_size 16 \
--eval_batch_size 128 \
--logging_steps 100 \
--eval_steps -1 \
--save_steps 2000 \
--model_name_or_path bert-base-uncased \
--do_lower_case \
--output_dir ./result/NELL_model \
--num_worker 12 \
--seed 42 \
--fp16 \
--cls_method cls \
--distance_metric euclidean
5. StAR_Self-Adp
5.1 Data preprocessing
- Get a trained RotatE model; for details, please refer to the RotatE repository. (A typical training command is sketched at the end of this subsection.)
- Run the commands below sequentially to get the training dataset of StAR_Self-Adp.
  - Run run_get_ensemble_data.py in ./StAR
CUDA_VISIBLE_DEVICES=0 \
python run_get_ensemble_data.py \
--dataset WN18RR \
--model_class roberta \
--model_name_or_path ./result/WN18RR_roberta-large \
--output_dir ./result/WN18RR_roberta-large \
--seed 42 \
--fp16
  - Run ./codes/run.py in the rotate directory (please replace TRAINED_MODEL_PATH with the path to your own trained model).
CUDA_VISIBLE_DEVICES=3 \
python ./codes/run.py \
--cuda \
--init ./models/RotatE_wn18rr_0 \
--test_batch_size 16 \
--star_info_path /home/wangbo/workspace/StAR_KGC-master/StAR/result/WN18RR_roberta-large \
--get_scores \
--get_model_dataset
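For reference, a typical RotatE training command is sketched below. The flag names follow the upstream RotatE codebase; the hyperparameters and data path here are only illustrative (please use the recommended configuration from the RotatE repository), and the save path should match the --init path used in the scoring command above.

CUDA_VISIBLE_DEVICES=0 \
python ./codes/run.py \
--do_train --do_valid --do_test \
--cuda \
--data_path ./data/wn18rr \
--model RotatE \
-n 256 -b 512 -d 500 \
-g 6.0 -a 0.5 -adv \
-lr 0.00005 --max_steps 80000 \
--test_batch_size 8 -de \
-save ./models/RotatE_wn18rr_0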
5.2 Train and Test
- Run run.py in ./StAR/ensemble. Note that --mode should be run alternately with head and tail, and the two results should be averaged to get the final results (a minimal averaging sketch follows the command below).
- Note: Please replace YOUR_OUTPUT_DIR, TRAINED_MODEL_PATH and StAR_FILE_PATH in ./StAR/peach/common.py with your own paths before running the command and code.
CUDA_VISIBLE_DEVICES=2 python run.py \
--do_train --do_eval --do_prediction --seen_feature \
--mode tail \
--learning_rate 1e-3 \
--feature_method mix \
--neg_times 5 \
--num_train_epochs 3 \
--hinge_loss_margin 0.6 \
--train_batch_size 32 \
--test_batch_size 64 \
--logging_steps 100 \
--save_steps 2000 \
--eval_steps -1 \
--warmup_proportion 0 \
--output_dir /home/wangbo/workspace/StAR_KGC-master/StAR/result/WN18RR_roberta-large_ensemble \
--dataset_dir /home/wangbo/workspace/StAR_KGC-master/StAR/result/WN18RR_roberta-large \
--context_score_path /home/wangbo/workspace/StAR_KGC-master/StAR/result/WN18RR_roberta-large \
--translation_score_path /home/wangbo/workspace/StAR_KGC-master/rotate/models/RotatE_wn18rr_0 \
--seed 42
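As noted above, the final metrics are the average of the --mode head run and the --mode tail run. A minimal sketch (not a script shipped with the repo; the metric names are illustrative):

# Average the metrics logged by the --mode head run and the --mode tail run.
def average_metrics(head_metrics, tail_metrics):
    # Both arguments are dicts with the same evaluation keys taken from the
    # two runs' logs, e.g. {"MRR": ..., "Hits@1": ..., "Hits@3": ..., "Hits@10": ...}.
    return {key: (head_metrics[key] + tail_metrics[key]) / 2.0 for key in head_metrics}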