
K-BERT

Source code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph", which is implemented based on the UER framework.

Requirements

Software:

Python 3
PyTorch >= 1.0
argparse == 1.1

Prepare

  • Download the google_model.bin from here, and save it to the models/ directory.
  • Download the CnDbpedia.spo from here, and save it to the brain/kgs/ directory (a format sketch follows this list).
  • Optional - Download the datasets for evaluation from here, unzip and place them in the datasets/ directory.
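
A comment further down this page asks what the .spo files contain. This is not documented here; based on the file naming (subject–predicate–object) and on how such triple files are commonly laid out, a reasonable guess is one tab-separated triple per line. The snippet below is only a minimal sketch under that assumption (the separator and the three-column layout are guesses, so verify them against your downloaded CnDbpedia.spo):

# Hedged sketch: inspect the first few triples of a .spo knowledge-graph file.
# Assumption: one triple per line, formatted as subject<TAB>predicate<TAB>object.
spo_path = "./brain/kgs/CnDbpedia.spo"
with open(spo_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        parts = line.strip().split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        subj, pred, obj = parts
        print(subj, pred, obj)
        if i >= 4:  # only show the first few lines
            break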

The directory tree of K-BERT:

K-BERT
├── brain
│   ├── config.py
│   ├── __init__.py
│   ├── kgs
│   │   ├── CnDbpedia.spo
│   │   ├── HowNet.spo
│   │   └── Medical.spo
│   └── knowgraph.py
├── datasets
│   ├── book_review
│   │   ├── dev.tsv
│   │   ├── test.tsv
│   │   └── train.tsv
│   ├── chnsenticorp
│   │   ├── dev.tsv
│   │   ├── test.tsv
│   │   └── train.tsv
│    ...
│
├── models
│   ├── google_config.json
│   ├── google_model.bin
│   └── google_vocab.txt
├── outputs
├── uer
├── README.md
├── requirements.txt
├── run_kbert_cls.py
└── run_kbert_ner.py

K-BERT for text classification

Classification example

Run an example on the Book review dataset with CnDbpedia:

CUDA_VISIBLE_DEVICES='0' nohup python3 -u run_kbert_cls.py \
    --pretrained_model_path ./models/google_model.bin \
    --config_path ./models/google_config.json \
    --vocab_path ./models/google_vocab.txt \
    --train_path ./datasets/book_review/train.tsv \
    --dev_path ./datasets/book_review/dev.tsv \
    --test_path ./datasets/book_review/test.tsv \
    --epochs_num 5 --batch_size 32 --kg_name CnDbpedia \
    --output_model_path ./outputs/kbert_bookreview_CnDbpedia.bin \
    > ./outputs/kbert_bookreview_CnDbpedia.log &

Results:

Best accuracy in dev : 88.80%
Best accuracy in test: 87.69%

Options of run_kbert_cls.py:

usage: [--pretrained_model_path] - Path to the pre-trained model parameters.
       [--config_path] - Path to the model configuration file.
       [--vocab_path] - Path to the vocabulary file.
       --train_path - Path to the training dataset.
       --dev_path - Path to the validation dataset.
       --test_path - Path to the test dataset.
       [--epochs_num] - Number of training epochs.
       [--batch_size] - Batch size for training.
       [--kg_name] - Name of the knowledge graph: "HowNet", "CnDbpedia", or "Medical".
       [--output_model_path] - Path to the output model.
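
The exact layout of the classification TSV files is not documented above. Judging from the loader traceback quoted in the comments at the end of this page (it builds a header-to-column map containing "label" and "text_a" and casts the label column to int), a plausible layout is a tab-separated file with a header row and an integer label per line. The sketch below writes such a file under that assumption; the header names and column order are guesses, so compare against datasets/book_review/train.tsv before relying on it.

# Hedged sketch: write a classification TSV in the layout the loader appears to expect
# (header row with "label" and "text_a" columns, tab-separated, integer labels).
# Header names and column order are assumptions inferred from the error log quoted in
# the comments below; verify against datasets/book_review/train.tsv.
import csv
import os

rows = [
    (1, "这本书写得非常好，强烈推荐。"),  # positive review
    (0, "情节拖沓，读到一半就放弃了。"),  # negative review
]

os.makedirs("./datasets/my_task", exist_ok=True)
with open("./datasets/my_task/train.tsv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["label", "text_a"])
    for label, text in rows:
        writer.writerow([label, text])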

Classification benchmarks

Accuracy (dev/test, %) on different datasets:

Dataset        HowNet        CnDbpedia
Book review    88.75/87.75   88.80/87.69
ChnSentiCorp   95.00/95.50   94.42/95.25
Shopping       97.01/96.92   96.94/96.73
Weibo          98.22/98.33   98.29/98.33
LCQMC          88.97/87.14   88.91/87.20
XNLI           77.11/77.07   76.99/77.43

K-BERT for named entity recognition (NER)

NER example

Run an example on the msra_ner dataset with CnDbpedia:

CUDA_VISIBLE_DEVICES='0' nohup python3 -u run_kbert_ner.py \
    --pretrained_model_path ./models/google_model.bin \
    --config_path ./models/google_config.json \
    --vocab_path ./models/google_vocab.txt \
    --train_path ./datasets/msra_ner/train.tsv \
    --dev_path ./datasets/msra_ner/dev.tsv \
    --test_path ./datasets/msra_ner/test.tsv \
    --epochs_num 5 --batch_size 16 --kg_name CnDbpedia \
    --output_model_path ./outputs/kbert_msraner_CnDbpedia.bin \
    > ./outputs/kbert_msraner_CnDbpedia.log &

Results:

The best in dev : precision=0.957, recall=0.962, f1=0.960
The best in test: precision=0.953, recall=0.959, f1=0.956

Options of run_kbert_ner.py:

usage: [--pretrained_model_path] - Path to the pre-trained model parameters.
       [--config_path] - Path to the model configuration file.
       [--vocab_path] - Path to the vocabulary file.
       --train_path - Path to the training dataset.
       --dev_path - Path to the validation dataset.
       --test_path - Path to the test dataset.
       [--epochs_num] - Number of training epochs.
       [--batch_size] - Batch size for training.
       [--kg_name] - Name of the knowledge graph.
       [--output_model_path] - Path to the output model.
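
The NER TSV layout is likewise not documented above. The error log quoted in the comments at the end of this page shows a header map of {'text_a': 0, 'label': 1} and a sample line pairing a space-separated character sequence with a space-separated tag sequence, so the sketch below writes a file in that shape. Treat the header names and column order as assumptions and verify them against datasets/msra_ner/train.tsv.

# Hedged sketch: write an NER TSV where each line pairs a space-separated character
# sequence (text_a) with a space-separated tag sequence (label) of equal length.
# The header order follows the {'text_a': 0, 'label': 1} map seen in the quoted log;
# verify against datasets/msra_ner/train.tsv.
import os

examples = [
    (["张", "三", "在", "北", "京"], ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]),
]

os.makedirs("./datasets/my_ner", exist_ok=True)
with open("./datasets/my_ner/train.tsv", "w", encoding="utf-8") as f:
    f.write("text_a\tlabel\n")
    for chars, tags in examples:
        assert len(chars) == len(tags), "each character needs exactly one tag"
        f.write(" ".join(chars) + "\t" + " ".join(tags) + "\n")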

K-BERT for domain-specific tasks

Experimental results on domain-specific tasks (Precision/Recall/F1):

KG          Finance_QA         Law_QA             Finance_NER        Medicine_NER
HowNet      0.805/0.888/0.845  0.842/0.903/0.871  0.860/0.888/0.874  0.935/0.939/0.937
CN-DBpedia  0.814/0.881/0.846  0.814/0.942/0.874  0.860/0.887/0.873  0.935/0.937/0.936
MedicalKG   --                 --                 --                 0.944/0.943/0.944

Acknowledgement

This work is a joint study with the support of Peking University and Tencent Inc.

If you use this code, please cite this paper:

@inproceedings{weijie2019kbert,
  title={{K-BERT}: Enabling Language Representation with Knowledge Graph},
  author={Weijie Liu and Peng Zhou and Zhe Zhao and Zhiruo Wang and Qi Ju and Haotang Deng and Ping Wang},
  booktitle={Proceedings of AAAI 2020},
  year={2020}
}
Comments
  • On knowledge noise

    Hi, I noticed that K-BERT's current entity linking is quite direct: it does not disambiguate entities that have multiple senses, which in practice often introduces noise, especially with large knowledge bases where high-frequency words carry low-frequency labels. For example, one encyclopedia KB I used contains a mapping like 舒适 → (a writer's name), and polysemy is common, so selecting entities by pure word matching easily injects a lot of noise. In practice, especially at inference time, I found that much of the model's misjudgment comes from this kind of noisy knowledge.

    1. For the current lookup-table entity linking, are there any good ideas for reducing this noise? 2. The authors mention "knowledge-driven" tasks, where at the micro level one decides for each recognized entity whether knowledge should be injected. Are there any ideas or recent studies worth following on this? For example, could something like the selective attention mechanism be introduced to decide whether an entity should have knowledge attached, and which knowledge to attach, and is there a natural way to combine such ideas with K-BERT's knowledge injection?

    opened by tonyqtian 5
  • About the contents of spo

    Dear @autoliuweijie,

    thank you so much for your study. I would like to experiment with K-BERT on a custom graph. However, I need to convert it to the spo format. I checked the examples, which are in Chinese, so I couldn't understand the contents and format of the spo files. I would appreciate it if you could give me an example in English.

    Kind regards, Ipek

    opened by isspek 2
  • Poor reproduction results on the book_review dataset

    Hi, thank you for open-sourcing the code. I ran your code on the book_review dataset you shared, but the accuracy is only 79%, almost 10 percentage points below your result. Do you have any suggestions? My command was: nohup python3 -u run_kbert_cls.py --train_path /data/book_review/train.tsv --dev_path /data/book_review/dev.tsv --test_path /data/book_review/test.tsv --epochs_num 5 --batch_size 32 --kg_name CnDbpedia --output_model_path ./outputs/kbert_bookreview_CnDbpedia.bin > ./outputs/kbert_bookreview_CnDbpedia.log & (I set the paths to the Google pre-trained model and vocabulary inside the script, and max_length is the default 256). My result: 79%.

    opened by NovemberSun 2
  • Where is the NLPCC-DBQA?

    Hi, I am interested in this paper, but I cannot find the NLPCC-DBQA dataset in your source code. Could you show the results of NLPCC-DBQA from your code? Thanks.

    opened by yuweijiang 1
  • Ask for help about your article

    Dear Dr. Liu,
    I am a master's student at ISCAS, and my research focuses on EMR information extraction. I recently read your paper "K-BERT: Enabling Language Representation with Knowledge Graph" and got a lot of inspiration from it, thank you! I am wondering if you could kindly send me the presentation slides or other materials about it. I promise they will be used only for research purposes. My email: [email protected] Yours sincerely, Wenwen Xu

    opened by LinMu7177 1
  • Help with an error: why does the code cast the data's label directly to int when loading the text?

    Vocabulary file line 344 has bad format token Vocabulary Size: 21128 [BertClassifier] use visible_matrix: True [KnowledgeGraph] Loading spo from /home/schen/K-BERT/brain/kgs/Medical.spo Start training. Loading sentences from ./datasets/medical_ner/train.tsv There are 6919 sentence in total. We use 1 processes to inject knowledge into sentences. {'text_a': 0, 'label': 1} Progress of process 0: 0/6919 ['山 , 男 , 7 3 岁 , 汉 族 , 已 婚 , 现 住 双 滦 区 陈 栅 子 乡 太 阳 沟 村 。', 'O O O O O O O O O O O O O O O O O O O O O O O O O O O O']

    Traceback (most recent call last): File "run_kbert_cls.py", line 582, in main() File "run_kbert_cls.py", line 501, in main trainset = read_dataset(args.train_path, workers_num=args.workers_num) File "run_kbert_cls.py", line 329, in read_dataset dataset = add_knowledge_worker(params) File "run_kbert_cls.py", line 84, in add_knowledge_worker label = int(line[columns["label"]]) ValueError: invalid literal for int() with base 10: 'O O O O O O O O O O O O O O O O O O O O O O O O O O O O'

    opened by Jennifer1996 0
  • ZeroDivisionError: division by zero

    Hello. I ran the model with my custom dataset, but it returns a "ZeroDivisionError" as shown in the attached screenshot. What is the most likely cause? Is there something wrong with the dataset? Thank you.

    opened by ariefpurnamamuharram 2
  • Where is the -inf condition being enforced?

    Hello, how are you enforcing the -inf condition when the two words are not in the same branch? In the code you set both places to 1, but shouldn't they be 0 and -inf?

    Calculate Visible Matrix

            visible_matrix = np.zeros((token_num,token_num))
            for item in abs_idx_tree:
                src_ids = item[0]
                for id in src_ids:
                    visible_abs_idx = abs_idx_src + [idx for ent in item[1] for idx in ent]
                    #print(visible_abs_idx)
                    visible_matrix[id,visible_abs_idx] = 1 
                for ent in item[1]:
                    for id in ent:
                        visible_abs_idx = ent + src_ids
                        visible_matrix[id,visible_abs_idx] = 1
    
    opened by swarnadeep8597 0
  • On switching to a different base model

    Hi, thank you very much for sharing the code! I have recently been working on text classification with knowledge graphs. While using your model I found that performance on a three-class task is considerably worse than on binary classification, so I would like to try a larger model such as bert-large. Is it enough to download bert-large together with uer/bert/bert-large-config, or are there additional steps I need to take?

    opened by kxy-cheng 0
  • pre-training corpus

    Hello @autoliuweijie, thank you for your amazing and inspiring work!

    I would like to pre-train a K-BERT model on an English-language corpus. To make that work, I am currently trying to get train_and_validate() to run with args.target set to "bert". I notice that with this setting BertDataLoader is used for loading the data, but I am not sure what exact format the dataset file at dataset_path has to be. From the code I see that it has to be a pickle file, but I am having trouble reconstructing one that works with the data loader.

    It would be very helpful to have access to the data file originally used for pre-training. Could you provide a link or instructions on how to construct it myself?

    opened by Humorloos 0
  • Is multi-class classification supported?

    Hi, I am reproducing your code. For the text classification task, is only binary classification supported, or can it handle multi-class classification as well? When training with my own multi-class data I get the following error: Traceback (most recent call last): File "run_kbert_cls.py", line 578, in main() File "run_kbert_cls.py", line 557, in main result = evaluate(args, False) File "run_kbert_cls.py", line 393, in evaluate print("Acc. (Correct/Total): {:.4f} ({}/{}) ".format(correct/len(dataset), correct, len(dataset))) ZeroDivisionError: division by zero

    opened by buthi 3