# Inductive entity representations from text via link prediction
This repository contains the code used for the experiments in the paper "Inductive entity representations from text via link prediction", presented at The Web Conference, 2021. To refer to our work, please use the following:
```bibtex
@inproceedings{daza2021inductive,
    title = {Inductive Entity Representations from Text via Link Prediction},
    author = {Daniel Daza and Michael Cochez and Paul Groth},
    booktitle = {Proceedings of The Web Conference 2021},
    year = {2021},
    doi = {10.1145/3442381.3450141},
}
```
In this work, we show how a BERT-based text encoder can be fine-tuned with a link prediction objective, in a graph where entities have an associated textual description. We call the resulting model BLP. A trained BLP model has three interesting properties:
- It can predict a link between entities, even if one or both were not present during training.
- It produces useful representations for a classifier, without requiring retraining of the encoder.
- It improves an information retrieval system, by better matching entities and questions about them.
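As an illustration of the idea (a minimal sketch, not the code in this repository), the following snippet encodes entity descriptions with BERT, projects them to a low-dimensional space, and scores a triple with a TransE-style distance. All class names, hyperparameters, and example texts here are hypothetical:

```python
# Minimal sketch of the BLP idea, not the repository's implementation.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BLPSketch(nn.Module):
    def __init__(self, dim=128, num_relations=11):
        super().__init__()
        # num_relations is the number of relation types in the KG (11 is just an example)
        self.encoder = BertModel.from_pretrained('bert-base-uncased')
        self.proj = nn.Linear(self.encoder.config.hidden_size, dim)
        self.rel_emb = nn.Embedding(num_relations, dim)

    def encode(self, input_ids, attention_mask):
        # Use the representation of the [CLS] token as the entity embedding
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)[0]
        return self.proj(out[:, 0])

    def score(self, head_emb, rel_ids, tail_emb):
        # TransE-style score: higher is better, so negate the L1 distance
        return -torch.norm(head_emb + self.rel_emb(rel_ids) - tail_emb, p=1, dim=-1)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BLPSketch()
texts = ["acquired abnormality: an abnormal structure acquired after birth",
         "anatomical abnormality: a physical malformation of a body part"]
# Tokenizer call below assumes a recent version of the transformers library
enc = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
emb = model.encode(enc['input_ids'], enc['attention_mask'])
print(model.score(emb[:1], torch.tensor([0]), emb[1:]))
```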
## Usage
Please follow the instructions below to reproduce our experiments, or to train a model with your own data.
### 1. Install the requirements
Creating a new environment (e.g. with `conda`) is recommended. Use `requirements.txt` to install the dependencies:
```sh
conda create -n blp python=3.7
conda activate blp
pip install -r requirements.txt
```
### 2. Download the data
Download the required compressed datasets into the `data` folder:
| Download link | Size (compressed) |
|---|---|
| UMLS (small graph for tests) | 121 KB |
| WN18RR | 6.6 MB |
| FB15k-237 | 21 MB |
| Wikidata5M | 1.4 GB |
| GloVe embeddings | 423 MB |
| DBpedia-Entity | 1.3 GB |
Then use `tar` to extract the files, e.g.

```sh
tar -xzvf WN18RR.tar.gz
```
Note that the KG-related files above contain both transductive and inductive splits. Transductive splits are commonly used to evaluate lookup-table methods like ComplEx, while inductive splits contain entities in the test set that are not present in the training set. Files with triples for the inductive case have the `ind` prefix, e.g. `ind-train.txt`.
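As a quick way to see what the inductive split means in practice, the snippet below counts how many test entities never appear during training. The file paths and the tab-separated triple format are assumptions for illustration:

```python
# Illustrative sanity check: entities in the inductive test split should not
# appear in the inductive training split. Assumes tab-separated
# (head, relation, tail) triples, one per line; paths are examples.
def read_entities(path):
    entities = set()
    with open(path) as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) == 3:
                head, _, tail = parts
                entities.update((head, tail))
    return entities

train_entities = read_entities('data/WN18RR/ind-train.txt')
test_entities = read_entities('data/WN18RR/ind-test.txt')
print(f'{len(test_entities - train_entities)} test entities are unseen during training')
```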
### 3. Reproduce the experiments
#### Link prediction
To check that all dependencies are correctly installed, run a quick test on a small graph (this should take less than 1 minute on GPU):

```sh
./scripts/test-umls.sh
```
The following table is adapted from our paper. The "Script" column contains the name of the script that reproduces the experiment for the corresponding model and dataset. For example, to reproduce the results of BLP-TransE on FB15k-237, run

```sh
./scripts/blp-transe-fb15k237.sh
```
|  | WN18RR |  | FB15k-237 |  | Wikidata5M |  |
|---|---|---|---|---|---|---|
| Model | MRR | Script | MRR | Script | MRR | Script |
| GloVe-BOW | 0.170 | glove-bow-wn18rr.sh | 0.172 | glove-bow-fb15k237.sh | 0.343 | glove-bow-wikidata5m.sh |
| BE-BOW | 0.180 | bert-bow-wn18rr.sh | 0.173 | bert-bow-fb15k237.sh | 0.362 | bert-bow-wikidata5m.sh |
| GloVe-DKRL | 0.115 | glove-dkrl-wn18rr.sh | 0.112 | glove-dkrl-fb15k237.sh | 0.282 | glove-dkrl-wikidata5m.sh |
| BE-DKRL | 0.139 | bert-dkrl-wn18rr.sh | 0.144 | bert-dkrl-fb15k237.sh | 0.322 | bert-dkrl-wikidata5m.sh |
| BLP-TransE | 0.285 | blp-transe-wn18rr.sh | 0.195 | blp-transe-fb15k237.sh | 0.478 | blp-transe-wikidata5m.sh |
| BLP-DistMult | 0.248 | blp-distmult-wn18rr.sh | 0.146 | blp-distmult-fb15k237.sh | 0.472 | blp-distmult-wikidata5m.sh |
| BLP-ComplEx | 0.261 | blp-complex-wn18rr.sh | 0.148 | blp-complex-fb15k237.sh | 0.489 | blp-complex-wikidata5m.sh |
| BLP-SimplE | 0.239 | blp-simple-wn18rr.sh | 0.144 | blp-simple-fb15k237.sh | 0.493 | blp-simple-wikidata5m.sh |
#### Entity classification
After training for link prediction, a tensor of embeddings for all entities is computed and saved in a file with name `ent_emb-[ID].pt`, where `[ID]` is the id of the experiment in the database (we use Sacred to manage experiments). Another file called `ents-[ID].pt` contains entity identifiers for every row in the tensor of embeddings.
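If you want to inspect these files directly, a minimal example is the following (assuming PyTorch; experiment ID 199 is just an example taken from the tables below):

```python
# Load the saved entity embeddings and their identifiers (illustrative).
import torch

ent_emb = torch.load('output/ent_emb-199.pt', map_location='cpu')
ent_ids = torch.load('output/ents-199.pt', map_location='cpu')
print(ent_emb.shape)  # one row of embeddings per entity
# Map each entity identifier to its row in the embedding tensor
row_of = {str(e): i for i, e in enumerate(ent_ids)}
```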
To ease reproducibility, we provide these tensors, which are required in the entity classification task. Click on the ID, download the file into the `output` folder, and decompress it. An experiment can be reproduced using the following command:
```sh
python train.py node_classification with checkpoint=ID dataset=DATASET
```
where `DATASET` is either `WN18RR` or `FB15k-237`. For example:
```sh
python train.py node_classification with checkpoint=199 dataset=WN18RR
```
|  | WN18RR |  | FB15k-237 |  |
|---|---|---|---|---|
| Model | Acc. | ID | Acc. Bal. | ID |
| GloVe-BOW | 55.3 | 219 | 34.4 | 293 |
| BE-BOW | 60.7 | 218 | 28.3 | 296 |
| GloVe-DKRL | 55.5 | 206 | 26.6 | 295 |
| BE-DKRL | 48.8 | 207 | 30.9 | 294 |
| BLP-TransE | 81.5 | 199 | 42.5 | 297 |
| BLP-DistMult | 78.5 | 200 | 41.0 | 298 |
| BLP-ComplEx | 78.1 | 201 | 38.1 | 300 |
| BLP-SimplE | 83.0 | 202 | 45.7 | 299 |
#### Information retrieval
This task runs with a pre-trained model saved from the link prediction task. For example, if the model trained is `blp` with `transe` and it was saved as `model.pt`, then run the following command for the information retrieval task:
```sh
python retrieval.py with model=blp rel_model=transe \
checkpoint='output/model.pt'
```
## Using your own data
If you have a knowledge graph where entities have textual descriptions, you can train a BLP model for inductive link prediction and entity classification (if you also have labels for entities).
To do this, add a new folder inside the `data` folder (let's call it `my-kg`). Store in it a file containing the triples in your KG. This should be a text file with one tab-separated triple per line (let's call it `all-triples.tsv`).
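For reference, the contents of a (purely hypothetical) `all-triples.tsv` could look like this, with head, relation, and tail separated by tabs:

```
entity_1	related_to	entity_2
entity_1	has_part	entity_3
entity_2	related_to	entity_4
```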
To generate inductive splits, you can use `data/utils.py`. If you run

```sh
python utils.py drop_entities --file=my-kg/all-triples.tsv
```
this will generate `ind-train.tsv`, `ind-dev.tsv`, and `ind-test.tsv` inside `my-kg` (see Appendix A in our paper for details on how these are generated). You can then train BLP-TransE with
```sh
python train.py with dataset='my-kg'
```
## Alternative implementations

- Contextual Knowledge Bases by Raphael Sourty