《K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters》(2020)

Microsoft

Last update: Dec 13, 2022

Overview

K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters

This repository is the implementation of the paper "K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters".

In the K-Adapter paper, we present a flexible approach that supports continual knowledge infusion into large pre-trained models (RoBERTa in this work). We infuse factual knowledge and linguistic knowledge, and show that adapters for both kinds of knowledge work well on downstream tasks.

For more details, please check the latest version of the paper: https://arxiv.org/abs/2002.01808

Prerequisites

Python 3.6
PyTorch 1.3.1
tensorboardX
transformers

We use the huggingface/transformers framework. The environment can be installed with:

conda create -n kadapter python=3.6

conda activate kadapter

pip install -r requirements.txt

Pre-training Adapters

In the pre-training procedure, we train each knowledge-specific adapter on different pre-training tasks individually.

1. Process Dataset

./scripts/clean_T_REx.py: cleans the raw T-REx dataset (32G) and saves the cleaned T-REx data in JSON format.
./scripts/create_subdataset-relation-classification.ipynb: creates the dataset from T-REx for pre-training the factual adapter on the relation classification task. This sub-dataset can be found here.
Dependency parsing dataset: refer to this code to create the dataset from Book Corpus for pre-training the linguistic adapter on the dependency parsing task.

2. Factual Adapter

To pre-train fac-adapter, run

bash run_pretrain_fac-adapter.sh

3. Linguistic Adapter

To pre-train lin-adapter, run

bash run_pretrain_lin-adapter.sh

The pre-trained fac-adapter and lin-adapter models can be found here.

Fine-tuning on Downstream Tasks

Adapter Structure

The fac-adapter (lin-adapter) consists of two transformer layers (L=2, H=768, A=12).
The RoBERTa layers where the adapters plug in: 0,11,23 or 0,11,22.
When using a single adapter:
- Use the concatenation of the last hidden feature of RoBERTa and the last hidden feature of the adapter as the input representation for the task-specific layer.
When using combined adapters:
- For each adapter, first concatenate the last hidden feature of RoBERTa and the last hidden feature of that adapter and feed the result into a separate linear layer, then concatenate the resulting representations as the input for the task-specific layer.
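
The two combination schemes above can be sketched in plain Python, with lists standing in for hidden-state vectors. This is only an illustration under stated assumptions: the function names, the toy hidden size of 2, and the toy linear layer are hypothetical and are not taken from the repository, which implements these steps in PyTorch.

```python
# Illustrative sketch only: lists stand in for hidden-state vectors.
# All names here are hypothetical, not the repository's API.

def concat(*vectors):
    """Concatenate feature vectors end to end."""
    out = []
    for v in vectors:
        out.extend(v)
    return out

def linear(vector, weights, bias):
    """Toy linear layer: weights has one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def single_adapter_features(roberta_hidden, adapter_hidden):
    """Single adapter: [RoBERTa ; adapter] feeds the task-specific layer."""
    return concat(roberta_hidden, adapter_hidden)

def combined_adapter_features(roberta_hidden, adapter_hiddens, projections):
    """Several adapters: project each [RoBERTa ; adapter] pair with its own
    linear layer, then concatenate the projected vectors."""
    projected = []
    for adapter_hidden, (weights, bias) in zip(adapter_hiddens, projections):
        pair = concat(roberta_hidden, adapter_hidden)
        projected.append(linear(pair, weights, bias))
    return concat(*projected)

# Toy example with hidden size 2 instead of the real model's size.
roberta = [1.0, 2.0]
fac = [3.0, 4.0]
lin = [5.0, 6.0]
# Each toy projection maps a 4-dim pair to a 2-dim vector.
proj = ([[1, 0, 0, 0], [0, 0, 0, 1]], [0.0, 0.0])

single = single_adapter_features(roberta, fac)
combined = combined_adapter_features(roberta, [fac, lin], [proj, proj])
print(single)    # [1.0, 2.0, 3.0, 4.0]
print(combined)  # [1.0, 4.0, 1.0, 6.0]
```

In the sketch, the single-adapter input is twice the hidden size, while the combined input grows with the number of adapters and the output size chosen for each linear layer.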

How to load the pre-trained RoBERTa and the pre-trained adapters

The pre-trained adapters are in ./pretrained_models/fac-adapter/pytorch_model.bin and ./pretrained_models/lin-adapter/pytorch_model.bin. To use a single adapter, for example the fac-adapter, set the argument meta_fac_adaptermodel to the path of that adapter and set meta_lin_adaptermodel="". To use both adapters, set meta_fac_adaptermodel and meta_lin_adaptermodel to the paths of the respective adapters.
The pre-trained RoBERTa will be downloaded automatically when you run the pipeline.
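
The argument convention above (an empty string disables an adapter, a path enables it) can be illustrated with a small helper. The helper `active_adapters` is hypothetical and is not part of the repository; only the two argument names and the two paths come from this README.

```python
# Hypothetical helper, not part of the repository: it mirrors the rule that an
# empty string disables an adapter and a non-empty path enables it.

def active_adapters(meta_fac_adaptermodel="", meta_lin_adaptermodel=""):
    """Return (name, path) pairs for the adapters that would be loaded."""
    candidates = [
        ("fac-adapter", meta_fac_adaptermodel),
        ("lin-adapter", meta_lin_adaptermodel),
    ]
    return [(name, path) for name, path in candidates if path]

FAC = "./pretrained_models/fac-adapter/pytorch_model.bin"
LIN = "./pretrained_models/lin-adapter/pytorch_model.bin"

# fac-adapter only: the lin-adapter argument is left as an empty string.
fac_only = active_adapters(meta_fac_adaptermodel=FAC, meta_lin_adaptermodel="")
# Both adapters: both arguments point at a checkpoint.
both = active_adapters(meta_fac_adaptermodel=FAC, meta_lin_adaptermodel=LIN)
print(fac_only)
print(both)
```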

1. Entity Typing

1.1 OpenEntity

One single 16G P100

(1) run the pipeline

bash run_finetune_openentity_adapter.sh

(2) result

with fac-adapter dev: (0.7967123287671233, 0.7580813347236705, 0.7769169115682607) test: (0.7929708951125755, 0.7584033613445378, 0.7753020134228187)
with lin-adapter dev: (0.8071672354948806, 0.7398331595411888, 0.7720348204570185) test: (0.8001135718341851, 0.7400210084033614, 0.7688949522510232)
with fac-adapter + lin-adapter dev: (0.8001101321585903, 0.7575599582898853, 0.7782538832351366) test: (0.7899568034557235, 0.7627737226277372, 0.7761273209549072)
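
The result tuples above are not labeled. Their values are consistent with a (precision, recall, F1) ordering, since the third number matches the harmonic mean of the first two. The sketch below checks this for the fac-adapter dev result; the function name is illustrative, and the ordering is an inference from the numbers rather than something this README states.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# fac-adapter dev result on OpenEntity, copied from the results above.
# Assumed ordering: (precision, recall, F1).
precision, recall, reported_f1 = (
    0.7967123287671233,
    0.7580813347236705,
    0.7769169115682607,
)

computed_f1 = f1_score(precision, recall)
# The two values should agree up to floating-point noise.
print(computed_f1, reported_f1)
```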

The results may vary across machines, but should not differ much. I only searched over per_gpu_train_batch_size: [4, 8], lr: [1e-5, 5e-6], warmup: [0, 200, 500, 1000, 1200]; you can try changing other parameters and see the results.
With the fac-adapter, the best performance is achieved at gpu_num=1, per_gpu_train_batch_size=4, lr=5e-6, warmup=500 (it takes about 2 hours to get the best result on a single 16G P100).
With the lin-adapter, the best performance is achieved at gpu_num=1, per_gpu_train_batch_size=4, lr=5e-6, warmup=1000 (it takes about 2 hours to get the best result on a single 16G P100).

(3) Data format

Add the special token "@" before and after the target entity; the representation of the first "@" is then used for classification. There are 9 entity categories: ['entity', 'location', 'time', 'organization', 'object', 'event', 'place', 'person', 'group']. Each entity can belong to several of them or to none. The output is represented as a multi-hot vector such as [0,1,1,0,1,0,0,0,0], where 1 means the entity belongs to the type at that position and 0 means it does not.
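
As a rough illustration of this format, the sketch below marks an entity span with "@" and builds the multi-hot label vector. The helper names, the example sentence, and the exact spacing around "@" are assumptions for illustration; the repository's own preprocessing may differ.

```python
# The 9 categories, in the order given in the README.
CATEGORIES = ['entity', 'location', 'time', 'organization', 'object',
              'event', 'place', 'person', 'group']

def mark_entity(sentence, start, end):
    """Wrap the character span [start, end) in '@' tokens (spacing is assumed)."""
    return sentence[:start] + "@ " + sentence[start:end] + " @" + sentence[end:]

def encode_labels(labels):
    """Multi-hot vector over CATEGORIES: 1 if the entity has the type, else 0."""
    return [1 if category in labels else 0 for category in CATEGORIES]

# Made-up example sentence and labels.
sentence = "She visited Paris last year."
start = sentence.index("Paris")
marked = mark_entity(sentence, start, start + len("Paris"))
vector = encode_labels({"location", "place"})
print(marked)  # She visited @ Paris @ last year.
print(vector)  # [0, 1, 0, 0, 0, 0, 1, 0, 0]
```

An entity with no matching type gets an all-zero vector, which corresponds to the "none of them" case.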

1.2 FIGER

(1) run the pipeline

bash run_finetune_figer_adapter.sh

The detailed hyperparameters are listed in the running script.

2. Relation Classification

4*16G P100

(1) run the pipeline

bash run_finetune_tacred_adapter.sh

(2) result

with fac-adapter
- 'dev': (0.6686945083853996, 0.7481604120676968, 0.7061989928807085)
- 'test': (0.693900391717963, 0.7458646616541353, 0.7189447746050153)
with lin-adapter
- 'dev': (0.6679165308118683, 0.7536791758646063, 0.7082108902333621)
- 'test': (0.6884615384615385, 0.7536842105263157, 0.7195979899497488)
with fac-adapter + lin-adapter
- 'dev': (0.6793893129770993, 0.7367549668874173, 0.7069102462271645)
- 'test': (0.7014245014245014, 0.7404511278195489, 0.7204096561814192)
The results may vary across machines, but should not differ much.
I only searched over per_gpu_train_batch_size: [4, 8], lr: [1e-5, 5e-6], warmup: [0, 200, 1000, 1200]; you can try changing other parameters and see the results.
The best performance is achieved at gpu_num=4, per_gpu_train_batch_size=8, lr=1e-5, warmup=200 (it takes about 7 hours to get the best result on 4 16G P100 GPUs).
The detailed hyperparameters are listed in the running script.
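
The search space described above is small enough to enumerate directly. The sketch below only lists the configurations; it does not launch training, and how each configuration would be passed to the fine-tuning script is not shown because that depends on the script itself. The dictionary keys reuse the parameter names from the note above.

```python
from itertools import product

# Search space as listed in the note above for TACRED.
batch_sizes = [4, 8]
learning_rates = [1e-5, 5e-6]
warmup_steps = [0, 200, 1000, 1200]

# One dictionary per combination of the three hyperparameters.
configs = [
    {"per_gpu_train_batch_size": b, "lr": lr, "warmup": w}
    for b, lr, w in product(batch_sizes, learning_rates, warmup_steps)
]
print(len(configs))  # 16

# The reported best setting is one point of this grid.
best = {"per_gpu_train_batch_size": 8, "lr": 1e-5, "warmup": 200}
print(best in configs)  # True
```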

(3) Data format

Add the special token "@" before and after the first entity, and "#" before and after the second entity. The representations of "@" and "#" are then concatenated to perform relation classification.
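
A minimal sketch of this marking step follows, assuming token-level spans with exclusive end indices. The function name, the span convention, and the example sentence are hypothetical; the repository's preprocessing may tokenize and place the markers differently.

```python
def mark_relation(tokens, first_span, second_span):
    """Surround the first entity with '@' and the second with '#'.

    Spans are (start, end) token indices, end exclusive, and must not overlap.
    """
    markers = {first_span: "@", second_span: "#"}
    out = []
    for i, token in enumerate(tokens):
        # Opening marker goes before the first token of a span.
        for (start, end), mark in markers.items():
            if i == start:
                out.append(mark)
        out.append(token)
        # Closing marker goes after the last token of a span.
        for (start, end), mark in markers.items():
            if i == end - 1:
                out.append(mark)
    return out

# Made-up example sentence.
tokens = "Marie Curie was born in Warsaw .".split()
marked = mark_relation(tokens, (0, 2), (5, 6))
print(" ".join(marked))  # @ Marie Curie @ was born in # Warsaw # .
```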

3. Question Answering

3.1 CosmosQA

One single 16G P100

(1) run the pipeline

bash run_finetune_cosmosqa_adapter.sh

(2) result

CosmosQA dev accuracy: 80.9
CosmosQA test accuracy: 81.8

The best performance is achieved at gpu_num=1, per_gpu_train_batch_size=64, GRADIENT_ACC=32, lr=1e-5, warmup=0 (it takes about 8 hours to get the best result on a single 16G P100).
The detailed hyperparameters are listed in the running script.

(3) Data format

For each candidate answer, the input is the context, the question, and that answer combined into one sequence, and the model produces a score for the answer. After getting the four scores, we select the answer with the highest score.
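
The selection step amounts to an argmax over the four per-answer scores. The sketch below uses made-up scores and a hypothetical function name; how the scores are computed is left to the model.

```python
def select_answer(scores):
    """Return the index of the highest-scoring candidate answer."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Made-up scores for the four candidate answers of one question.
scores = [0.12, 2.31, -0.40, 1.05]
choice = select_answer(scores)
print(choice)  # 1
```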

3.2 SearchQA and Quasar-T

The source code for fine-tuning on the SearchQA and Quasar-T datasets is adapted from the code of the paper "Denoising Distantly Supervised Open-Domain Question Answering".

Use K-Adapter just like RoBERTa

You can use K-Adapter (RoBERTa with adapters) just like RoBERTa; the two have almost the same inputs and outputs. Specifically, we add a class RobertawithAdapter in pytorch_transformers/my_modeling_roberta.py.
The demo code (run_example.sh and examples/run_example.py) shows how to use RobertawithAdapter, run inference, save the model, and load the model. You can leave the adapter arguments at their defaults.
To use a single adapter, for example the fac-adapter, set the argument meta_fac_adaptermodel='./pretrained_models/fac-adapter/pytorch_model.bin' and set meta_lin_adaptermodel="". To use both adapters, set meta_fac_adaptermodel and meta_lin_adaptermodel to the paths of the respective adapters.

bash run_example.sh

TODO

Remove and merge redundant code
Support other pre-trained models, such as BERT

Contact

Feel free to contact Ruize Wang ([email protected]) if you have any further questions.

Comments

had trouble running tacred evaluation on RTX3090

I tried to use the fine-tuned model to evaluate the results and ran into an error. Any idea how I can get the missing files without training from scratch?

opened by Coopercoppers 0
What's the true data file of opentity to be used?

I downloaded the OpenEntity dataset from its website and copied ./cloud from it to /data. However, when I try to run "bash run_finetune_openentity_adapter.sh", I get an error: "json.decoder.JSONDecodeError: Extra data"

opened by AQA6666 1
What is the purpose of having `negative_samples` being set to 45,000 as an argument? This also causes some data samples to be discarded.

Just curious what the purpose of this would be, as I don't think I've seen it before and would like to know what the motivation is. This causes many data samples to be discarded when creating the examples. Thanks.

opened by seanswyi 0
Is there any reason why you commented out the evaluation code for fine-tuning TACRED?

Just curious, because I tried running the code but realized evaluation wasn't taking place. Uncommenting the evaluation block also leads to an error and I'm wondering if that's to be expected.

opened by seanswyi 0
Best hyperparameters for the figer entity typing task

Hi, thanks for your great work! I wonder why the hyperparameters inside the run_finetune_figer_adapter.sh script differ from the best hyperparameters you mentioned in the paper's supplementary section?

opened by soroushjavdan 0
Incorrect pretraining data format for Factual Adapter
I have followed the code here and generated all 3 tsv files under DisExtract/data/books/ALL18_2019jan02_[valid, train, test].tsv. However, the format of these tsv files does not match the json format required to run pre-training for the Factual Adapter.

The content format of generated tsv file after executing python producer.py is as follows:

[Sentence 1]\t[Sentence 2]\t[Marker] ...

The required json file format should be as follows:

{ "sent" : "Sentence 1", "tokens": "sentence 2", "pairs" : [ ... ] } ...

Is there a conversion script that converts the generated tsv format to json?
opened by theblackcat102 1
