
K-BERT

Source code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph", which is implemented based on the UER framework.

Requirements

Software:

Python 3
PyTorch >= 1.0
argparse == 1.1

Prepare

  • Download the google_model.bin from here, and save it to the models/ directory.
  • Download the CnDbpedia.spo from here, and save it to the brain/kgs/ directory (a format sketch follows this list).
  • Optional - Download the datasets for evaluation from here, unzip and place them in the datasets/ directory.
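
A comment further down this page asks what the .spo files contain. This is not documented here; based on the file naming (subject–predicate–object) and on how such triple files are commonly laid out, a reasonable guess is one tab-separated triple per line. The snippet below is only a minimal sketch under that assumption (the separator and the three-column layout are guesses, so verify them against your downloaded CnDbpedia.spo):

# Hedged sketch: inspect the first few triples of a .spo knowledge-graph file.
# Assumption: one triple per line, formatted as subject<TAB>predicate<TAB>object.
spo_path = "./brain/kgs/CnDbpedia.spo"
with open(spo_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        parts = line.strip().split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        subj, pred, obj = parts
        print(subj, pred, obj)
        if i >= 4:  # only show the first few lines
            break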

The directory tree of K-BERT:

K-BERT
├── brain
│   ├── config.py
│   ├── __init__.py
│   ├── kgs
│   │   ├── CnDbpedia.spo
│   │   ├── HowNet.spo
│   │   └── Medical.spo
│   └── knowgraph.py
├── datasets
│   ├── book_review
│   │   ├── dev.tsv
│   │   ├── test.tsv
│   │   └── train.tsv
│   ├── chnsenticorp
│   │   ├── dev.tsv
│   │   ├── test.tsv
│   │   └── train.tsv
│    ...
│
├── models
│   ├── google_config.json
│   ├── google_model.bin
│   └── google_vocab.txt
├── outputs
├── uer
├── README.md
├── requirements.txt
├── run_kbert_cls.py
└── run_kbert_ner.py

K-BERT for text classification

Classification example

Run an example on the Book review dataset with CnDbpedia:

CUDA_VISIBLE_DEVICES='0' nohup python3 -u run_kbert_cls.py \
    --pretrained_model_path ./models/google_model.bin \
    --config_path ./models/google_config.json \
    --vocab_path ./models/google_vocab.txt \
    --train_path ./datasets/book_review/train.tsv \
    --dev_path ./datasets/book_review/dev.tsv \
    --test_path ./datasets/book_review/test.tsv \
    --epochs_num 5 --batch_size 32 --kg_name CnDbpedia \
    --output_model_path ./outputs/kbert_bookreview_CnDbpedia.bin \
    > ./outputs/kbert_bookreview_CnDbpedia.log &

Results:

Best accuracy in dev : 88.80%
Best accuracy in test: 87.69%

Options of run_kbert_cls.py:

usage: [--pretrained_model_path] - Path to the pre-trained model parameters.
       [--config_path] - Path to the model configuration file.
       [--vocab_path] - Path to the vocabulary file.
       --train_path - Path to the training dataset.
       --dev_path - Path to the validation dataset.
       --test_path - Path to the test dataset.
       [--epochs_num] - Number of training epochs.
       [--batch_size] - Batch size for training.
       [--kg_name] - Name of the knowledge graph: "HowNet", "CnDbpedia", or "Medical".
       [--output_model_path] - Path to the output model.
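
The exact layout of the classification TSV files is not documented above. Judging from the loader traceback quoted in the comments at the end of this page (it builds a header-to-column map containing "label" and "text_a" and casts the label column to int), a plausible layout is a tab-separated file with a header row and an integer label per line. The sketch below writes such a file under that assumption; the header names and column order are guesses, so compare against datasets/book_review/train.tsv before relying on it.

# Hedged sketch: write a classification TSV in the layout the loader appears to expect
# (header row with "label" and "text_a" columns, tab-separated, integer labels).
# Header names and column order are assumptions inferred from the error log quoted in
# the comments below; verify against datasets/book_review/train.tsv.
import csv
import os

rows = [
    (1, "这本书写得非常好，强烈推荐。"),  # positive review
    (0, "情节拖沓，读到一半就放弃了。"),  # negative review
]

os.makedirs("./datasets/my_task", exist_ok=True)
with open("./datasets/my_task/train.tsv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["label", "text_a"])
    for label, text in rows:
        writer.writerow([label, text])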

Classification benchmarks

Accuracy (dev/test, %) on different datasets:

Dataset        HowNet        CnDbpedia
Book review    88.75/87.75   88.80/87.69
ChnSentiCorp   95.00/95.50   94.42/95.25
Shopping       97.01/96.92   96.94/96.73
Weibo          98.22/98.33   98.29/98.33
LCQMC          88.97/87.14   88.91/87.20
XNLI           77.11/77.07   76.99/77.43

K-BERT for named entity recognition (NER)

NER example

Run an example on the msra_ner dataset with CnDbpedia:

CUDA_VISIBLE_DEVICES='0' nohup python3 -u run_kbert_ner.py \
    --pretrained_model_path ./models/google_model.bin \
    --config_path ./models/google_config.json \
    --vocab_path ./models/google_vocab.txt \
    --train_path ./datasets/msra_ner/train.tsv \
    --dev_path ./datasets/msra_ner/dev.tsv \
    --test_path ./datasets/msra_ner/test.tsv \
    --epochs_num 5 --batch_size 16 --kg_name CnDbpedia \
    --output_model_path ./outputs/kbert_msraner_CnDbpedia.bin \
    > ./outputs/kbert_msraner_CnDbpedia.log &

Results:

The best in dev : precision=0.957, recall=0.962, f1=0.960
The best in test: precision=0.953, recall=0.959, f1=0.956

Options of run_kbert_ner.py:

usage: [--pretrained_model_path] - Path to the pre-trained model parameters.
       [--config_path] - Path to the model configuration file.
       [--vocab_path] - Path to the vocabulary file.
       --train_path - Path to the training dataset.
       --dev_path - Path to the validation dataset.
       --test_path - Path to the test dataset.
       [--epochs_num] - Number of training epochs.
       [--batch_size] - Batch size for training.
       [--kg_name] - Name of the knowledge graph.
       [--output_model_path] - Path to the output model.
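
The NER TSV layout is likewise not documented above. The error log quoted in the comments at the end of this page shows a header map of {'text_a': 0, 'label': 1} and a sample line pairing a space-separated character sequence with a space-separated tag sequence, so the sketch below writes a file in that shape. Treat the header names and column order as assumptions and verify them against datasets/msra_ner/train.tsv.

# Hedged sketch: write an NER TSV where each line pairs a space-separated character
# sequence (text_a) with a space-separated tag sequence (label) of equal length.
# The header order follows the {'text_a': 0, 'label': 1} map seen in the quoted log;
# verify against datasets/msra_ner/train.tsv.
import os

examples = [
    (["张", "三", "在", "北", "京"], ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]),
]

os.makedirs("./datasets/my_ner", exist_ok=True)
with open("./datasets/my_ner/train.tsv", "w", encoding="utf-8") as f:
    f.write("text_a\tlabel\n")
    for chars, tags in examples:
        assert len(chars) == len(tags), "each character needs exactly one tag"
        f.write(" ".join(chars) + "\t" + " ".join(tags) + "\n")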

K-BERT for domain-specific tasks

Experimental results on domain-specific tasks (Precision/Recall/F1):

KG          Finance_QA         Law_QA             Finance_NER        Medicine_NER
HowNet      0.805/0.888/0.845  0.842/0.903/0.871  0.860/0.888/0.874  0.935/0.939/0.937
CN-DBpedia  0.814/0.881/0.846  0.814/0.942/0.874  0.860/0.887/0.873  0.935/0.937/0.936
MedicalKG   --                 --                 --                 0.944/0.943/0.944

Acknowledgement

This work is a joint study with the support of Peking University and Tencent Inc.

If you use this code, please cite this paper:

@inproceedings{weijie2019kbert,
  title={{K-BERT}: Enabling Language Representation with Knowledge Graph},
  author={Weijie Liu and Peng Zhou and Zhe Zhao and Zhiruo Wang and Qi Ju and Haotang Deng and Ping Wang},
  booktitle={Proceedings of AAAI 2020},
  year={2020}
}
Comments
  • On knowledge noise

    Hi, I noticed that K-BERT's current entity linking is quite direct: it does not disambiguate entities that have multiple senses, which in practice often introduces noise, especially with large knowledge bases where high-frequency words carry low-frequency labels. For example, one encyclopedia KB I used contains a mapping like 舒适 → (a writer's name), and polysemy is common, so selecting entities by pure word matching easily injects a lot of noise. In practice, especially at inference time, I found that much of the model's misjudgment comes from this kind of noisy knowledge.

    1. For the current lookup-table entity linking, are there any good ideas for reducing this noise? 2. The authors mention "knowledge-driven" tasks, where at the micro level one decides for each recognized entity whether knowledge should be injected. Are there any ideas or recent studies worth following on this? For example, could something like the selective attention mechanism be introduced to decide whether an entity should have knowledge attached, and which knowledge to attach, and is there a natural way to combine such ideas with K-BERT's knowledge injection?

    opened by tonyqtian 5
  • About the contents of spo

    Dear @autoliuweijie,

    thank you so much for your study. I would like to experiment with K-BERT on a custom graph. However, I need to convert it to the spo format. I checked the examples, which are in Chinese, so I couldn't understand the contents and format of the spo files. I would appreciate it if you could give me an example in English.

    Kind regards, Ipek

    opened by isspek 2
  • Poor reproduction results on the book_review dataset

    Hi, thank you for open-sourcing the code. I ran your code on the book_review dataset you shared, but the accuracy is only 79%, almost 10 percentage points below your result. Do you have any suggestions? My command was: nohup python3 -u run_kbert_cls.py --train_path /data/book_review/train.tsv --dev_path /data/book_review/dev.tsv --test_path /data/book_review/test.tsv --epochs_num 5 --batch_size 32 --kg_name CnDbpedia --output_model_path ./outputs/kbert_bookreview_CnDbpedia.bin > ./outputs/kbert_bookreview_CnDbpedia.log & (I set the paths to the Google pre-trained model and vocabulary inside the script, and max_length is the default 256). My result: 79%.

    opened by NovemberSun 2
  • Where is the NLPCC-DBQA?

    Hi, I am interested in this paper, but I cannot find the NLPCC-DBQA dataset in your source code. Could you show the results of NLPCC-DBQA from your code? Thanks.

    opened by yuweijiang 1
  • Ask for help about your article

    Dear Dr. Liu,
    I am a master's student at ISCAS, and my research focuses on EMR information extraction. I recently read your paper "K-BERT: Enabling Language Representation with Knowledge Graph" and got a lot of inspiration from it, thank you! I am wondering if you could kindly send me the presentation slides or other materials about it. I promise they will be used only for research purposes. My email: [email protected] Yours sincerely, Wenwen Xu

    opened by LinMu7177 1
  • Help with an error: why does the code cast the data's label directly to int when loading the text?

    Vocabulary file line 344 has bad format token Vocabulary Size: 21128 [BertClassifier] use visible_matrix: True [KnowledgeGraph] Loading spo from /home/schen/K-BERT/brain/kgs/Medical.spo Start training. Loading sentences from ./datasets/medical_ner/train.tsv There are 6919 sentence in total. We use 1 processes to inject knowledge into sentences. {'text_a': 0, 'label': 1} Progress of process 0: 0/6919 ['山 , 男 , 7 3 岁 , 汉 族 , 已 婚 , 现 住 双 滦 区 陈 栅 子 乡 太 阳 沟 村 。', 'O O O O O O O O O O O O O O O O O O O O O O O O O O O O']

    Traceback (most recent call last): File "run_kbert_cls.py", line 582, in main() File "run_kbert_cls.py", line 501, in main trainset = read_dataset(args.train_path, workers_num=args.workers_num) File "run_kbert_cls.py", line 329, in read_dataset dataset = add_knowledge_worker(params) File "run_kbert_cls.py", line 84, in add_knowledge_worker label = int(line[columns["label"]]) ValueError: invalid literal for int() with base 10: 'O O O O O O O O O O O O O O O O O O O O O O O O O O O O'

    opened by Jennifer1996 0
  • ZeroDivisionError: division by zero

    Hello. I ran the model with my custom dataset, but it returns a "ZeroDivisionError" as shown in the attached screenshot. What is the most likely cause? Is there something wrong with the dataset? Thank you.

    opened by ariefpurnamamuharram 2
  • Where is the -inf condition being enforced?

    Hello, how are you enforcing the -inf condition when the two words are not in the same branch? In the code you set both places to 1, but shouldn't they be 0 and -inf?

    Calculate Visible Matrix

            visible_matrix = np.zeros((token_num,token_num))
            for item in abs_idx_tree:
                src_ids = item[0]
                for id in src_ids:
                    visible_abs_idx = abs_idx_src + [idx for ent in item[1] for idx in ent]
                    #print(visible_abs_idx)
                    visible_matrix[id,visible_abs_idx] = 1 
                for ent in item[1]:
                    for id in ent:
                        visible_abs_idx = ent + src_ids
                        visible_matrix[id,visible_abs_idx] = 1
    
    opened by swarnadeep8597 0
  • On switching to a different base model

    Hi, thank you very much for sharing the code! I have recently been working on text classification with knowledge graphs. While using your model I found that performance on a three-class task is considerably worse than on binary classification, so I would like to try a larger model such as bert-large. Is it enough to download bert-large together with uer/bert/bert-large-config, or are there additional steps I need to take?

    opened by kxy-cheng 0
  • pre-training corpus

    Hello @autoliuweijie, thank you for your amazing and inspiring work!

    I would like to pre-train a K-BERT model on an English-language corpus. To make that work, I am currently trying to get train_and_validate() to run with args.target set to "bert". I notice that with this setting BertDataLoader is used for loading the data, but I am not sure what exact format the dataset file at dataset_path has to be. From the code I see that it has to be a pickle file, but I am having trouble reconstructing one that works with the data loader.

    It would be very helpful to have access to the data file originally used for pre-training. Could you provide a link or instructions on how to construct it myself?

    opened by Humorloos 0
  • Is multi-class classification supported?

    Hi, I am reproducing your code. For the text classification task, is only binary classification supported, or can it handle multi-class classification as well? When training with my own multi-class data I get the following error: Traceback (most recent call last): File "run_kbert_cls.py", line 578, in main() File "run_kbert_cls.py", line 557, in main result = evaluate(args, False) File "run_kbert_cls.py", line 393, in evaluate print("Acc. (Correct/Total): {:.4f} ({}/{}) ".format(correct/len(dataset), correct, len(dataset))) ZeroDivisionError: division by zero

    opened by buthi 3