Compositional and Parameter-Efficient Representations for Large Knowledge Graphs

Michael Galkin

Last update: Jan 4, 2023

Overview

NodePiece - Compositional and Parameter-Efficient Representations for Large Knowledge Graphs

NodePiece is a "tokenizer" for reducing entity vocabulary size in knowledge graphs. Instead of learning a shallow embedding vector for every node, we first "tokenize" each node by K anchor nodes and M relation types in its relational context. The resulting hash sequence is then encoded through any injective function, e.g., an MLP or a Transformer.
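The tokenization step described above can be illustrated with a small sketch. It is not the repo's implementation: the toy triples, the pre-selected anchors, the use of undirected hop distance, and the choice of outgoing relation types as the "relational context" are all assumptions made for illustration, and the encoder (MLP or Transformer) is left out.

```python
from collections import deque

# Toy KG as (head, relation, tail) triples; all names here are illustrative.
TRIPLES = [
    ("a", "r1", "b"),
    ("b", "r2", "c"),
    ("c", "r1", "d"),
    ("d", "r3", "e"),
    ("b", "r3", "e"),
]
ANCHORS = ["a", "e"]  # assumed to be pre-selected anchor nodes


def build_adjacency(triples):
    """Undirected adjacency list: node -> set of neighbours."""
    adj = {}
    for h, _, t in triples:
        adj.setdefault(h, set()).add(t)
        adj.setdefault(t, set()).add(h)
    return adj


def hop_distances(adj, source):
    """Plain BFS returning the hop distance from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def tokenize(node, triples, anchors, k, m):
    """Return (k nearest anchors with distances, up to m relation types)."""
    dist = hop_distances(build_adjacency(triples), node)
    reachable = [(dist[a], a) for a in anchors if a in dist]
    nearest = [(a, d) for d, a in sorted(reachable)[:k]]
    # Assumption: relational context = relation types on edges leaving `node`.
    rels = sorted({r for h, r, _ in triples if h == node})[:m]
    return nearest, rels


print(tokenize("b", TRIPLES, ANCHORS, k=2, m=2))
# ([('a', 1), ('e', 1)], ['r2', 'r3'])
```

The resulting pair of token lists is what an encoder would consume; the vocabulary it needs covers only anchors, relation types, and distances rather than every node.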

Similar to the Byte-Pair Encoding and WordPiece tokenizers commonly used in NLP, NodePiece can tokenize unseen nodes attached to the seen graph using the same anchor and relation vocabulary. This allows NodePiece to work out of the box in inductive settings with all the well-known scoring functions from classical KG completion (such as TransE or RotatE). NodePiece also works with GNNs (we tested it on node classification, but it is not limited to that task).
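As a rough illustration of why unseen nodes need no new vocabulary entries, the sketch below tokenizes a new node purely from its edges to seen nodes. The function name `tokenize_unseen`, the `SEEN` lookup table, and the "neighbour distance plus one hop" rule are assumptions for this example, not the repo's code.

```python
def tokenize_unseen(edges, seen_anchor_dists, k, m):
    """Tokenize a new node from its edges to already-seen nodes.

    edges: list of (relation, seen_neighbour) pairs for the new node.
    seen_anchor_dists: seen node -> {anchor: hop distance}, computed earlier.
    """
    best = {}
    for _, nb in edges:
        for anchor, d in seen_anchor_dists.get(nb, {}).items():
            # One extra hop to reach the anchor through this neighbour.
            if anchor not in best or d + 1 < best[anchor]:
                best[anchor] = d + 1
    nearest = sorted(best.items(), key=lambda item: (item[1], item[0]))[:k]
    rels = sorted({r for r, _ in edges})[:m]
    return nearest, rels


# Hypothetical anchor distances for two seen nodes.
SEEN = {"b": {"a": 1, "e": 1}, "d": {"a": 3, "e": 1}}
new_node_edges = [("r2", "b"), ("r1", "d")]
print(tokenize_unseen(new_node_edges, SEEN, k=2, m=2))
# ([('a', 2), ('e', 2)], ['r1', 'r2'])
```

Every token in the output already exists in the anchor and relation vocabulary, so the same encoder can produce a representation for the new node.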

NodePiece source code

The repo contains the code and experimental setups for reproducibility studies.

Each experiment resides in the respective folder:

lp_rp - link prediction and relation prediction
nc - node classification
oos_lp - out-of-sample link prediction

The repo is based on Python 3.8. wandb is an optional requirement in case you have an existing account there and would like to track experimental results. If you have a wandb account, the repo assumes you've performed

wandb login <your_api_key>

Using a GPU is recommended.

First, run a script which will download all the necessary pre-processed data and datasets. It takes approximately 1 GB.

sh download_data.sh

We packed the pre-processed data for faster experimenting with the repo. Note that there are two NodePiece tokenization modes (-tkn_mode [option]): path and bfs:

path is an old tokenization strategy (based on finding shortest paths between each node and all anchors) under which we performed the experiments and packed the data for reproducibility;
bfs is a new strategy (based on iterative expansion of a node's neighborhood until a desired number of anchors is reached) which is 5-50x faster and takes 4-5x less space depending on the KG. It currently works for transductive LP/RP tasks.
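The idea behind the bfs mode, expanding a node's neighborhood and stopping once enough anchors are found, can be sketched as follows. This is a minimal illustration on a toy adjacency list; the function name `bfs_anchors` and the data are assumptions, and the repo's actual implementation may differ.

```python
from collections import deque


def bfs_anchors(adj, start, anchors, k):
    """Expand the neighbourhood of `start` hop by hop and stop as soon as
    `k` anchors have been collected. Returns [(anchor, hops), ...]."""
    anchor_set = set(anchors)
    found = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(found) < k:
        node, hops = queue.popleft()
        if node in anchor_set:
            found.append((node, hops))
        # Sorted only to make the traversal order deterministic.
        for nb in sorted(adj.get(node, ())):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, hops + 1))
    return found


# Toy undirected graph: a - b - c - d - e, plus an edge b - e.
ADJ = {
    "a": {"b"},
    "b": {"a", "c", "e"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"b", "d"},
}
print(bfs_anchors(ADJ, "d", ["a", "e"], k=1))
# [('e', 1)]
```

Because the search stops early, it never touches most of the graph for nodes that sit close to anchors, which is one plausible reason the bfs mode is cheaper than computing paths to all anchors.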

Pre-processing times tested on M1 MacBook Pro / 8 GB:

mode	FB15k-237 (time / vocab size)	WN18RR (time / vocab size)	YAGO 3-10 (time / vocab size)
path	2 min / 28 MB	5 min / 140 MB	~5 hours / 240 MB
bfs	8 sec / 7.5 MB	30 sec / 20 MB	4.5 min / 40 MB

CoDEx-Large and YAGO path pre-processing is better run on a server with 16-32 GB RAM and might take 2-5 hours depending on the chosen number of anchors.

NB: we seek to further improve the algorithms to make the tokenization process even faster than the bfs strategy.

Second, install the dependencies in requirements.txt. Note that when installing Torch-Geometric you might want to use pre-compiled binaries for your specific versions of Python and torch; check the Torch-Geometric installation manual.

In the link prediction tasks, all the necessary datasets will be downloaded upon first script execution.

Link Prediction

The link prediction (LP) and relation prediction (RP) tasks use models, datasets, and evaluation protocols from PyKEEN.

Navigate to the lp_rp folder: cd lp_rp.

The list of CLI params can be found in run_lp.py.

Run the fb15k-237 experiment

python run_lp.py -loop lcwa -loss bce -b 512 -data fb15k237 -anchors 1000 -sp 100 -lr 0.0005 -ft_maxp 20 -pool cat -embedding 200 -sample_rels 15 -smoothing 0.4 -epochs 401

Run the wn18rr experiment

python run_lp.py -loop slcwa -loss nssal -margin 15 -b 512 -data wn18rr -anchors 500 -sp 100 -lr 0.0005 -ft_maxp 50 -pool cat -embedding 200 -negs 20 -subbatch 2000 -sample_rels 4 -epochs 601

Run the codex-l experiment

python run_lp.py -loop lcwa -loss bce -b 256 -data codex_l -anchors 7000 -sp 100 -lr 0.0005 -ft_maxp 20 -pool cat -embedding 200 -subbatch 10000 -sample_rels 6 -smoothing 0.3 -epochs 120

Run the yago 3-10 experiment

python run_lp.py -loop slcwa -loss nssal -margin 50 -b 512 -data yago -anchors 10000 -sp 100 -lr 0.00025 -ft_maxp 20 -pool cat -embedding 200 -subbatch 2000 -sample_rels 5 -negs 10 -epochs 601
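When running several variations of these commands, it can help to keep the flags in a dictionary and generate the command line from it. The sketch below uses a hypothetical helper, `build_command`; the flag names and values are copied from the FB15k-237 command above and nothing else about run_lp.py is assumed.

```python
import shlex


def build_command(script, flags):
    """Turn a {flag: value} mapping into an argv list for `python <script>`."""
    argv = ["python", script]
    for flag, value in flags.items():
        argv += [flag, str(value)]
    return argv


# Flags of the FB15k-237 link prediction run, in their original order.
FB15K237_LP = {
    "-loop": "lcwa", "-loss": "bce", "-b": 512, "-data": "fb15k237",
    "-anchors": 1000, "-sp": 100, "-lr": 0.0005, "-ft_maxp": 20,
    "-pool": "cat", "-embedding": 200, "-sample_rels": 15,
    "-smoothing": 0.4, "-epochs": 401,
}

cmd = build_command("run_lp.py", FB15K237_LP)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the lp_rp folder; overriding a single entry (for example `-anchors`) yields a variant of the run without retyping the full command.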

Test evaluation reproducibility patch

PyKEEN 1.0.5 used in this repo has been identified to have issues at the filtering stage when evaluating on the test set. In order to fully reproduce the reported test set numbers for transductive LP/RP experiments from the paper and resolve this issue, please apply the patch from the lp_rp/patch folder:

Locate pykeen in your environment installation:

<path_to_env>/lib/python3.<NUMBER>/site-packages/pykeen

Replace the evaluation/evaluator.py with the one from the patch folder

cp ./lp_rp/patch/evaluator.py <path_to_env>/lib/python3.<NUMBER>/site-packages/pykeen/evaluation/

Replace the stoppers/early_stopping.py with the one from the patch folder

cp ./lp_rp/patch/early_stopping.py <path_to_env>/lib/python3.<NUMBER>/site-packages/pykeen/stoppers/
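If the location of the pykeen installation is not obvious, it can be looked up from Python instead of guessing `<path_to_env>`. The following is a sketch with two hypothetical helpers, `package_dir` and `apply_patch`; it assumes the patch folder contains exactly the two files named above and that you run it with the interpreter of the target environment.

```python
import importlib.util
import shutil
from pathlib import Path


def package_dir(name):
    """Return the installed directory of a package, or None if not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


def apply_patch(patch_dir, pkg_dir):
    """Copy the two patched files over their counterparts inside `pkg_dir`."""
    targets = {
        "evaluator.py": Path(pkg_dir) / "evaluation",
        "early_stopping.py": Path(pkg_dir) / "stoppers",
    }
    for filename, dest in targets.items():
        shutil.copy(Path(patch_dir) / filename, dest / filename)


# Only reports the location; prints None if pykeen is not installed here.
print(package_dir("pykeen"))
```

After checking the printed path, calling `apply_patch("./lp_rp/patch", package_dir("pykeen"))` from the repo root would perform the same two copy steps. Backing up the original files first is advisable.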

This won't be needed once we port the codebase to newer versions of PyKEEN (1.4.0+), where this was fixed.

Relation Prediction

The setup is very similar to that of link prediction (LP), but now we predict relations (h, ?, t).
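To make the (h, ?, t) query concrete, the sketch below scores every candidate relation for a fixed head and tail and ranks them. TransE is used only because the overview names it as a compatible scoring function; the 2-d vectors are made up, and the actual models and evaluation come from PyKEEN.

```python
import math


def transe_score(h, r, t):
    """TransE plausibility: negative L2 distance between h + r and t."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_relations(h, t, relation_vectors):
    """Score every relation for the query (h, ?, t) and return names, best first."""
    scored = [(transe_score(h, vec, t), name) for name, vec in relation_vectors.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Made-up 2-d vectors purely for illustration.
head, tail = [0.0, 0.0], [1.0, 1.0]
relations = {"r1": [1.0, 1.0], "r2": [1.0, 0.0], "r3": [-1.0, -1.0]}
print(rank_relations(head, tail, relations))
# ['r1', 'r2', 'r3']
```

In link prediction the same scoring would instead loop over candidate entities; here the candidate set is the much smaller relation vocabulary.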

Navigate to the lp_rp folder: cd lp_rp.

The list of CLI params can be found in run_lp.py.

Run the fb15k-237 experiment

python run_lp.py -loop slcwa -loss nssal -b 512 -data fb15k237 -anchors 1000 -sp 100 -lr 0.0005 -ft_maxp 20 -margin 15 -subbatch 2000 -pool cat -embedding 200 -negs 20 -sample_rels 15 -epochs 21 --rel-prediction True

Run the wn18rr experiment

python run_lp.py -loop slcwa -loss nssal -b 512 -data wn18rr -anchors 500 -sp 100 -lr 0.0005 -ft_maxp 50 -margin 12 -subbatch 2000 -pool cat -embedding 200 -negs 20 -sample_rels 4 -epochs 151 --rel-prediction True

Run the yago 3-10 experiment

python run_lp.py -loop slcwa -loss nssal -b 512 -data yago -anchors 10000 -sp 100 -lr 0.0005 -ft_maxp 20 -margin 25 -subbatch 2000 -pool cat -embedding 200 -negs 20 -sample_rels 5 -epochs 7 --rel-prediction True

Node Classification

Navigate to the nc folder: cd nc.

The list of CLI params can be found in run_nc.py.

If you have a GPU, use DEVICE cuda; otherwise use DEVICE cpu.
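A quick way to decide which value to pass is to ask PyTorch whether a GPU is usable. The helper `pick_device` below is a hypothetical convenience, not part of the repo; it relies only on `torch.cuda.is_available()` and falls back to "cpu" when PyTorch is not importable.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Prints the pair to paste into the run_nc.py command line.
print("DEVICE", pick_device())
```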

Run on 5% of labeled data:

python run_nc.py DATASET wd50k MAX_QPAIRS 3 STATEMENT_LEN 3 LABEL_SMOOTHING 0.1 EVAL_EVERY 5 DEVICE cpu WANDB False EPOCHS 4001 GCN_HID_DROP2 0.5 GCN_HID_DROP 0.5 GCN_FEAT_DROP 0.5 EMBEDDING_DIM 100 GCN_GCN_DIM 100 LEARNING_RATE 0.001 GCN_ATTENTION True GCN_GCN_DROP 0.3 GCN_ATTENTION_DROP 0.3 GCN_LAYERS 3 DS_TYPE transductive MODEL_NAME stare TR_RATIO 0.05 USE_FEATURES False TOKENIZE True NUM_ANCHORS 50 MAX_PATHS 10 USE_TEST True

Run on 10% of labeled data:

python run_nc.py DATASET wd50k MAX_QPAIRS 3 STATEMENT_LEN 3 LABEL_SMOOTHING 0.1 EVAL_EVERY 5 DEVICE cpu WANDB False EPOCHS 4001 GCN_HID_DROP2 0.5 GCN_HID_DROP 0.5 GCN_FEAT_DROP 0.5 EMBEDDING_DIM 100 GCN_GCN_DIM 100 LEARNING_RATE 0.001 GCN_ATTENTION True GCN_GCN_DROP 0.3 GCN_ATTENTION_DROP 0.3 GCN_LAYERS 3 DS_TYPE transductive MODEL_NAME stare TR_RATIO 0.1 USE_FEATURES False TOKENIZE True NUM_ANCHORS 50 MAX_PATHS 10 USE_TEST True

Out-of-sample Link Prediction

Navigate to the oos_lp folder: cd oos_lp/src.

The list of CLI params can be found in main.py.

Run the oos fb15k-237 experiment

python main.py -dataset FB15k-237 -model_name DM_NP_fb -ne 41 -lr 0.0005 -emb_dim 200 -batch_size 256 -simulated_batch_size 256 -save_each 100 -tokenize True -opt adam -pool trf -use_custom_reg False -reg_lambda 0.0 -loss_fc spl -margin 15 -neg_ratio 5 -wandb False -eval_every 20 -anchors 1000 -sample_rels 15

Run the oos yago3-10 experiment

python main.py -dataset YAGO3-10 -model_name DM_NP_yago -ne 41 -lr 0.0005 -emb_dim 200 -batch_size 256 -simulated_batch_size 256 -save_each 100 -tokenize True -opt adam -pool trf -use_custom_reg False -reg_lambda 0.0 -loss_fc spl -margin 15 -neg_ratio 5 -wandb False -eval_every 20 -anchors 10000 -sample_rels 5

Citation

If you find this work useful, please consider citing the paper:

@misc{galkin2021nodepiece,
    title={NodePiece: Compositional and Parameter-Efficient Representations of Large Knowledge Graphs},
    author={Mikhail Galkin and Jiapeng Wu and Etienne Denis and William L. Hamilton},
    year={2021},
    eprint={2106.12144},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Comments

KeyError: 'quals'

Hi,

Thanks for the great work. I was trying to run node prediction experiment on my local machine. Was prompted key error of 'quals'. Later in the code I found that subtype="triples" if config["STATEMENT_LEN"] == 3 else "statements". When subtype=triples, quals is None. However, in the demo run command, STATEMENT_LEN is 3 hence such error occurs. Please correct me if I am wrong. Would be good if you could update a workable set-up. Thank you.

opened by lynnna-xu 9

TypeError: evaluate() got an unexpected keyword argument 'additional_filter_triples'

Hi,

Thanks for the great work. I was trying to run the inductive_lp experiment on RTX3090. Because of the hashrate mismatch, I can't install pytorch=1.7.1 on the RTX3090. So I directly chose to configure the corresponding environment with cuda11.6.

Traceback (most recent call last):
  File "/home/ps/hh/NodePiece-main/inductive_lp/run_ilp.py", line 236, in <module>
    main()
  File "/home/ps/miniconda3/envs/nodepiece/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ps/miniconda3/envs/nodepiece/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ps/miniconda3/envs/nodepiece/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ps/miniconda3/envs/nodepiece/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ps/hh/NodePiece-main/inductive_lp/run_ilp.py", line 215, in main
    metric_results = test_evaluator.evaluate(
TypeError: evaluate() got an unexpected keyword argument 'additional_filter_triples'

Below you may find a list of package versions in my environment.

python         3.8.13
pykeen                    1.0.5
python-igraph             0.10.1
click                     8.1.3
einops                    0.5.0
numpy                     1.23.3
torch                     1.12.1+cu116            
torch-cluster             1.6.0                   
torch-geometric           2.1.0.post1         
torch-scatter             2.0.9                 
torch-sparse              0.6.15              
torch-spline-conv         1.2.1              
tqdm                      4.64.1

Please correct me if I am wrong. It would be good if you could give me some advice. If it is a very basic and simple mistake, allow me to apologize for taking up your precious time. Thank you.

opened by huang0926huang 2

Full vs negsamp-based evaluation on WikiKGv2?

Hi

Congratulations on the ICLR acceptance!

I noticed in the paper that on the link prediction task, performance of NodePiece is significantly better compared to strong baselines (eg. ComplEx) on WikiKGv2 despite being consistently worse on other datasets (fb15k-237, wn18rr, yago, codex). Do you think this is due to WikiKGv2 being a large dataset, or some other reason?

As you know, unlike other datasets, wikikgv2 uses negative sampling while evaluating (1 true answer needs to be ranked correctly among 500 random negatives). IMO this is incorrect and unnecessary, since it makes the task much easier - for eg. ogb's own wikikg90m changed from negative sampling based evaluation to full evaluation (rank among all entities) in the v2 of their dataset due to suspiciously high performance of several methods.

My question is - have you tried full evaluation (ie rank all 2.5M entities) of NodePiece on wikikgv2? Does NodePiece still beat ComplEx when evaluating on the full setting, or is it just on the much easier negsamp-based evaluation?

Also, do you have plans on trying NodePiece on other large KGs such as WikiData5M, WikiKG90Mv2 etc.?

Thanks

opened by apoorvumang 2

OGB model documentation

Good evening. I have a question regarding your code for the OGB Wikidata experiments. I've been working on graph neural networks lately, but because it's something new to me I'm having difficulties. Could I find documentation for the code somewhere, and more specifically for the methods? Thanks in advance.

opened by Percefoni 1

Performance on million-node datasets + hyperparameter tuning?

Hi

Really cool work! I was wondering whether you have results for million-node datasets such as WikiData5m or WikiKG 2, or does it take too long to preprocess those?

Also did you do any kind of hyperparameter search?

Thanks

opened by apoorvumang 1
