LUKE -- Language Understanding with Knowledge-based Embeddings

Overview

LUKE (Language Understanding with Knowledge-based Embeddings) is a new pre-trained contextualized representation of words and entities based on the transformer architecture. It was proposed in our paper LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention. It achieves state-of-the-art results on important NLP benchmarks, including SQuAD v1.1 (extractive question answering), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), TACRED (relation classification), and Open Entity (entity typing).

This repository contains the source code to pre-train the model and fine-tune it to solve downstream tasks.

News

November 24, 2021: Entity disambiguation example is available

The example code of entity disambiguation based on LUKE has been added to this repository. This model was originally proposed in our paper, and achieved state-of-the-art results on five standard entity disambiguation datasets: AIDA-CoNLL, MSNBC, AQUAINT, ACE2004, and WNED-WIKI.

For further details, please refer to the example directory.

August 3, 2021: New example code based on Hugging Face Transformers and AllenNLP is available

New fine-tuning examples for three downstream tasks, i.e., NER, relation classification, and entity typing, have been added to this repository. These examples are built on Hugging Face Transformers and AllenNLP, and the fine-tuning models are defined using simple AllenNLP Jsonnet config files!

The example code is available in the examples_allennlp directory.

May 5, 2021: LUKE is added to Hugging Face Transformers

LUKE has been added to the master branch of the Hugging Face Transformers library. You can now solve entity-related tasks (e.g., named entity recognition, relation classification, entity typing) easily using this library.

For example, the LUKE-large model fine-tuned on the TACRED dataset can be used as follows:

>>> from transformers import LukeTokenizer, LukeForEntityPairClassification
>>> model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
>>> text = "Beyoncé lives in Los Angeles."
>>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
>>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> predicted_class_idx = int(logits[0].argmax())
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
Predicted class: per:cities_of_residence
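
Named entity recognition can be handled in a similar way through the entity span classification head. The snippet below is an illustrative sketch rather than code taken from this repository: it assumes the studio-ousia/luke-large-finetuned-conll-2003 checkpoint and that class index 0 corresponds to the non-entity label. It enumerates all word-boundary spans and prints the spans predicted as entities.

from transformers import LukeTokenizer, LukeForEntitySpanClassification

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-conll-2003")
model = LukeForEntitySpanClassification.from_pretrained("studio-ousia/luke-large-finetuned-conll-2003")

text = "Beyoncé lives in Los Angeles."

# Character-based start/end offsets of the words in the text
word_starts = [0, 8, 14, 17, 21]
word_ends = [7, 13, 16, 20, 28]
# Enumerate every candidate span that starts and ends on a word boundary
entity_spans = [(s, e) for i, s in enumerate(word_starts) for e in word_ends[i:]]

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
outputs = model(**inputs)
predictions = outputs.logits.argmax(-1).squeeze(0).tolist()

for span, label_idx in zip(entity_spans, predictions):
    if label_idx != 0:  # assumption: index 0 is the "no entity" class
        print(text[span[0]:span[1]], model.config.id2label[label_idx])

Each candidate span is classified independently, which is why all spans are passed to the model in a single call.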

We also provide three Colab notebooks that show how to reproduce our experimental results on the CoNLL-2003, TACRED, and Open Entity datasets using the library (see the notebooks directory).

Please refer to the official documentation for further details.

November 5, 2021: LUKE-500K (base) model

We released LUKE-500K (base), a new pretrained LUKE model that is smaller than the existing LUKE-500K (large). The experimental results of LUKE-500K (base) and LUKE-500K (large) on SQuAD v1.1 and CoNLL-2003 are shown below:

| Task | Dataset | Metric | LUKE-500K (base) | LUKE-500K (large) |
| ---- | ------- | ------ | ---------------- | ----------------- |
| Extractive Question Answering | SQuAD v1.1 | EM/F1 | 86.1/92.3 | 90.2/95.4 |
| Named Entity Recognition | CoNLL-2003 | F1 | 93.3 | 94.3 |

We tuned only the batch size and learning rate in the experiments based on LUKE-500K (base).

Comparison with State-of-the-Art

LUKE outperforms the previous state-of-the-art methods on five important NLP tasks:

| Task | Dataset | Metric | LUKE-500K (large) | Previous SOTA |
| ---- | ------- | ------ | ----------------- | ------------- |
| Extractive Question Answering | SQuAD v1.1 | EM/F1 | 90.2/95.4 | 89.9/95.1 (Yang et al., 2019) |
| Named Entity Recognition | CoNLL-2003 | F1 | 94.3 | 93.5 (Baevski et al., 2019) |
| Cloze-style Question Answering | ReCoRD | EM/F1 | 90.6/91.2 | 83.1/83.7 (Li et al., 2019) |
| Relation Classification | TACRED | F1 | 72.7 | 72.0 (Wang et al., 2020) |
| Fine-grained Entity Typing | Open Entity | F1 | 78.2 | 77.6 (Wang et al., 2020) |

These numbers are reported in our EMNLP 2020 paper.

Installation

LUKE can be installed using Poetry:

$ poetry install

The virtual environment automatically created by Poetry can be activated by running poetry shell.

Released Models

We release pre-trained models with a 500K entity vocabulary based on the roberta.base and roberta.large models.

| Name | Base Model | Entity Vocab Size | Params | Download |
| ---- | ---------- | ----------------- | ------ | -------- |
| LUKE-500K (base) | roberta.base | 500K | 253 M | Link |
| LUKE-500K (large) | roberta.large | 500K | 483 M | Link |
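
These archives are intended for the training code in this repository and are passed via the --model-file option shown below. If you only need contextualized word and entity representations, a base-sized checkpoint can also be loaded through Hugging Face Transformers; the following is a minimal sketch that assumes the studio-ousia/luke-base model name on the Hugging Face Hub:

from transformers import LukeTokenizer, LukeModel

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7), (17, 28)]  # character-based spans for "Beyoncé" and "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
outputs = model(**inputs)

word_states = outputs.last_hidden_state            # (1, num_word_tokens, hidden_size)
entity_states = outputs.entity_last_hidden_state   # (1, num_entities, hidden_size)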

Reproducing Experimental Results

The experiments were conducted using Python 3.6 and PyTorch 1.2.0 installed on a server with one or eight NVIDIA V100 GPUs. We used NVIDIA's PyTorch Docker container 19.02. For computational efficiency, we used mixed precision training based on the APEX library, which can be installed as follows:

$ git clone https://github.com/NVIDIA/apex.git
$ cd apex
$ git checkout c3fad1ad120b23055f6630da0b029c8b626db78f
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

The APEX library is not needed if you do not use the --fp16 option or if you reproduce the results using the trained checkpoint files.

The commands that reproduce the experimental results are provided as follows:

Entity Typing on Open Entity Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    entity-typing run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    entity-typing run \
    --data-dir=<DATA_DIR> \
    --train-batch-size=2 \
    --gradient-accumulation-steps=2 \
    --learning-rate=1e-5 \
    --num-train-epochs=3 \
    --fp16

Relation Classification on TACRED Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    relation-classification run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    relation-classification run \
    --data-dir=<DATA_DIR> \
    --train-batch-size=4 \
    --gradient-accumulation-steps=8 \
    --learning-rate=1e-5 \
    --num-train-epochs=5 \
    --fp16

Named Entity Recognition on CoNLL-2003 Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    ner run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    ner run \
    --data-dir=<DATA_DIR> \
    --train-batch-size=2 \
    --gradient-accumulation-steps=4 \
    --learning-rate=1e-5 \
    --num-train-epochs=5 \
    --fp16

Cloze-style Question Answering on ReCoRD Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    entity-span-qa run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --num-gpus=8 \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    entity-span-qa run \
    --data-dir=<DATA_DIR> \
    --train-batch-size=1 \
    --gradient-accumulation-steps=4 \
    --learning-rate=1e-5 \
    --num-train-epochs=2 \
    --fp16

Extractive Question Answering on SQuAD 1.1 Dataset

Dataset: Link
Checkpoint file (compressed): Link
Wikipedia data files (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    reading-comprehension run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-negative \
    --wiki-link-db-file=enwiki_20160305.pkl \
    --model-redirects-file=enwiki_20181220_redirects.pkl \
    --link-redirects-file=enwiki_20160305_redirects.pkl \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --num-gpus=8 \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    reading-comprehension run \
    --data-dir=<DATA_DIR> \
    --no-negative \
    --wiki-link-db-file=enwiki_20160305.pkl \
    --model-redirects-file=enwiki_20181220_redirects.pkl \
    --link-redirects-file=enwiki_20160305_redirects.pkl \
    --train-batch-size=2 \
    --gradient-accumulation-steps=3 \
    --learning-rate=15e-6 \
    --num-train-epochs=2 \
    --fp16

Citation

If you use LUKE in your work, please cite the original paper:

@inproceedings{yamada2020luke,
  title={LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention},
  author={Ikuya Yamada and Akari Asai and Hiroyuki Shindo and Hideaki Takeda and Yuji Matsumoto},
  booktitle={EMNLP},
  year={2020}
}

Contact Info

Please submit a GitHub issue or send an e-mail to Ikuya Yamada ([email protected]) for help or issues using LUKE.

Comments
  • Adding LUKE to HuggingFace Transformers


    Hi, Is there a possibility to reproduce results for NER on CPU instead of the default GPU configuration? I am unable to find any resource for this on the repo.

    I am using the following command, but there seems to be no flag/argument available to switch between CPU and GPU?

    python -m examples.cli --model-file=luke_large_500k.tar.gz --output-dir=<OUTPUT_DIR> ner run --data-dir=<DATA_DIR> --fp16 --train-batch-size=2 --gradient-accumulation-steps=2 --learning-rate=1e-5 --num-train-epochs=5
    

    Thanks in advance!

    opened by uahmad235 46
  • Pretraining instruction


    Hi authors,

    Awesome work! Thanks for your codes and instructions. Recently, I want to pretrain a new Luke model on my own dataset. Could you write a pretraining instruction so I can learn? Thank you!

    opened by JiachengLi1995 19
  • the result of squad


    I used Poetry to build the experiment environment and tried to reproduce the paper's performance following your advice, but failed:

    EM / F1 -- paper: 90.2 / 95.4, your checkpoint: 89.76 / 94.97, my fine-tuning run: 89.04 / 94.69

    do you know the reason?

    opened by TingFree 16
  • Two questions: 1.Release entity vocab's wikipedia pageid? 2. Does [mask] occupy bert's 512 input?


    1. Right now, some titles in the entity vocab cannot be aligned to a unique Wikipedia page id or Wikidata entity id: some are missing, and some titles refer to the same page id. Can you release the mapping between the entity vocab's titles and Wikipedia page ids / Wikidata entity ids?
    2. It seems that the [MASK] tokens used for span representations don't count toward BERT's 512-token input limit? For example, if I have a sequence with 512 tokens and want to use LUKE to extract 10 spans, can I input 512 tokens + 10 [MASK], rather than 502 tokens + 10 [MASK] (as long as the [MASK] position embeddings are correctly aligned to the 10 spans)?
    opened by dalek-who 13
  • Getting RuntimeError for LukeRelationClassification


    While trying to replicate the results using the pre-trained model for relation classification, I am getting the following error. I looked at the function load_state_dict(); the strict argument is set to False.

    Traceback (most recent call last):
    
    
      File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
    
      File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
    
      File "/home/akshay/re_rc/luke/examples/cli.py", line 132, in <module>
        cli()
    
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/core.py", line 829, in __call__
        return self.main(*args, **kwargs)
    
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
    
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
    
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
    
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
    
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/core.py", line 610, in invoke
        return callback(*args, **kwargs)
    
      File "/home/akshay/re_rc/luke/examples/utils/trainer.py", line 32, in wrapper
        return func(*args, **kwargs)
    
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/decorators.py", line 33, in new_func
        return f(get_current_context().obj, *args, **kwargs)
    
      File "/home/akshay/re_rc/luke/examples/relation_classification/main.py", line 110, in run
        model.load_state_dict(torch.load(args.checkpoint_file, map_location="cpu"))
    
      File "/home/akshay/re_rc/luke/luke/model.py", line 236, in load_state_dict
        super(LukeEntityAwareAttentionModel, self).load_state_dict(new_state_dict, *args, **kwargs)
    
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
        self.__class__.__name__, "\n\t".join(error_msgs)))
    
    RuntimeError: Error(s) in loading state_dict for LukeForRelationClassification:
    	size mismatch for embeddings.word_embeddings.weight: copying a param with shape torch.Size([50266, 1024]) from checkpoint, the shape in current model is torch.Size([50267, 1024]).
    
    	size mismatch for entity_embeddings.entity_embeddings.weight: copying a param with shape torch.Size([2, 256]) from checkpoint, the shape in current model is torch.Size([3, 256]).
    

    I cannot understand the reason behind this. Can somebody please explain!

    opened by akshayparakh25 13
  • LUKE Large Finetuning Duration for NER


    Hello there,

    I'm trying to fine-tune LUKE large for the NER task using CoNLL-format data that I combined from multiple sources, including CoNLL-2003 itself. The training runs on a Google Colab GPU, but it takes very long to finish a single batch: it took about 2 hours to train on just 2 batches with a batch size of 2. Is this expected? If not, why does this happen?

    Thanks in advance

    opened by taghreed34 12
  • Pretraining for a Different Language and Testing Pretrained Model for Entity Disambiguation


    Hi,

    I have pretrained an entity disambiguation model following the recent instructions for pretraining LUKE. From the instructions shared here and in another issue (#115), I was able to perform the two-stage pretraining for Turkish.

    I am sharing the commands and config files I used to make sure nothing is off. I ran the following command for the first stage with the configuration below:

    deepspeed \
    --num_gpus=6 luke/pretraining/train.py \
    --output-dir=training_on_turkish/luke-bert-base-turkish-first-stage \
    --deepspeed-config-file=pretraining_config/luke_base_stage1.json \
    --dataset-dir=training_on_turkish/tr_pretraining_dataset \
    --bert-model-name=dbmdz/bert-base-turkish-uncased  \
    --num-epochs=1 \
    --fix-bert-weights \
    --masked-entity-prob=0.30 \
    --masked-lm-prob=0
    

    Config:

    {
      "train_batch_size": 24,
      "train_micro_batch_size_per_gpu": 4,
      "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": 5e-4,
          "betas": [0.9, 0.999],
          "eps": 1e-6,
          "weight_decay": 0.01,
          "bias_correction": false
        }
      },
      "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
          "warmup_min_lr": 0,
          "warmup_max_lr": 5e-4,
          "warmup_num_steps": 1000,
          "total_num_steps": 192796,
          "warmup_type": "linear"
        }
      },
      "gradient_clipping": 10000.0
    }
    

    And the command for the second stage, with its configuration, is below:

    deepspeed \
    --num_gpus=6 luke/pretraining/train.py \
    --output-dir=training_on_turkish/luke-bert-base-turkish-second-stage \
    --deepspeed-config-file=pretraining_config/luke_base_stage2.json \
    --dataset-dir=training_on_turkish/tr_pretraining_dataset/ \
    --bert-model-name=dbmdz/bert-base-turkish-uncased \
    --num-epochs=5 \
    --reset-optimization-states \
    --resume-checkpoint-id=training_on_turkish/luke-bert-base-turkish-first-stage/checkpoints/epoch1/ \
    --masked-entity-prob=0.30 \
    --masked-lm-prob=0
    

    Config:

    {
      "train_batch_size": 24,
      "train_micro_batch_size_per_gpu": 4,
      "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": 1e-5,
          "betas": [0.9, 0.999],
          "eps": 1e-6,
          "weight_decay": 0.01,
          "bias_correction": false
        }
      },
      "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
          "warmup_min_lr": 0,
          "warmup_max_lr": 1e-5,
          "warmup_num_steps": 2500,
          "total_num_steps": 98168,
          "warmup_type": "linear"
        }
      },
      "gradient_clipping": 10000.0
    }
    

    I guess I now have a model that can perform entity disambiguation? The problem is that I cannot find any clear example of running or evaluating a pretrained model on entity disambiguation. How should the data be formatted? How should I call the model in order to make predictions?

    Thank you for your time.

    opened by fatihbeyhan 12
  • Assertion error with CONLL03


    Hi, here I met another problem when using LUKE on the NER dataset CoNLL-2003. When creating features from examples, the variable entity_labels is empty for some examples, like train-945:

    guid=train-945
    words=['SOCCER', '-', 'ENGLISH', 'SOCCER', 'RESULTS', '.', 'LONDON', '1996-08-30', 'Results', 'of', 'English', 'league', 'matches', 'on', 'Friday', ':', 'Division', 'two', 'Plymouth', '2', 'Preston', '1', 'Division', 'three', 'Swansea', '1', 'Lincoln', '2']
    labels=['O', 'O', 'B-MISC', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'B-ORG', 'O', 'O', 'O', 'B-ORG', 'O', 'B-ORG', 'O']
    

    and the code here throws an AssertionError: https://github.com/studio-ousia/luke/blob/9323b216dd5f72b61545bc4133f7709fd19bfa95/examples/ner/utils.py#L239 Do you have any idea what's wrong with these examples?

    opened by Riroaki 12
  • The meaning of entity_ids=[1, 0] when finetune OpenEntity dataset


    I am running the fine-tuning process on Open Entity, but I cannot understand the code below. Why is entity_ids = [1, 0]?

    entity_ids = [1, 0]
    entity_attention_mask = [1, 0]
    entity_segment_ids = [0, 0]
    entity_position_ids = list(range(mention_start, mention_end))[:max_mention_length]
    entity_position_ids += [-1] * (max_mention_length - mention_end + mention_start)
    entity_position_ids = [entity_position_ids, [-1] * max_mention_length]

    opened by lshowway 10
  • Preparing Environment For Allen-nlp.


    Hi @ikuyamada, I am facing issues while setting up the environment for the AllenNLP-based NER and RE solutions. I learned that it works only on Python 3.7, so I dockerized it with the requirements.txt file and ran the container. This time the package poetry was not found; I added it as an extra requirement and it started running. For both AllenNLP-based solutions, it then errored out (screenshot: Screenshot from 2022-03-19 15-45-30). How should I prepare the environment for this solution? Can I use a higher version of Python?

    opened by elonmusk-01 9
  • Pretraining Problem


    Hi @ikuyamada,

    Thanks for your amazing work on this entity-aware language model. I am interested in building a LUKE model for the Indonesian language. Since I couldn't find any documentation on how to train the model, I did the following steps:

    1. Build the dump DB (build-dump-db)
    2. Build the entity vocab (build-entity-vocab)
    3. Build Wiki pretraining dataset (build-wikipedia-pretraining-dataset)
    4. Do the pretraining

    However, when starting to do the pretraining, I got some errors:

    Traceback (most recent call last):
      File "/usr/playground/luke/luke/pretraining/train.py", line 353, in run_pretraining
        result = model(**batch)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/playground/luke/luke/pretraining/model.py", line 81, in forward
        entity_attention_mask,
      File "/usr/playground/luke/luke/model.py", line 109, in forward
        entity_embedding_output = self.entity_embeddings(entity_ids, entity_position_ids, entity_segment_ids)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/playground/luke/luke/model.py", line 60, in forward
        entity_embeddings = self.entity_embedding_dense(entity_embeddings)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
        return F.linear(input, self.weight, self.bias)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/torch/nn/functional.py", line 1612, in linear
        output = input.matmul(weight.t())
    RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
    [2020-12-09 00:43:15,490] [ERROR] Consecutive errors have been observed. Exiting... ([email protected]:379)
    Traceback (most recent call last):
      File "/usr/playground/luke/luke/pretraining/train.py", line 352, in run_pretraining
        batch = {k: torch.from_numpy(v).to(device) for k, v in batch.items()}
      File "/usr/playground/luke/luke/pretraining/train.py", line 352, in <dictcomp>
        batch = {k: torch.from_numpy(v).to(device) for k, v in batch.items()}
    RuntimeError: CUDA error: an illegal memory access was encountered
    Traceback (most recent call last):
      File "./luke/cli.py", line 67, in <module>
        cli()
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/click/core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/click/core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "/usr/playground/luke/luke/pretraining/train.py", line 82, in pretrain
        run_pretraining(Namespace(**kwargs))
      File "/usr/playground/luke/luke/pretraining/train.py", line 352, in run_pretraining
        batch = {k: torch.from_numpy(v).to(device) for k, v in batch.items()}
      File "/usr/playground/luke/luke/pretraining/train.py", line 352, in <dictcomp>
        batch = {k: torch.from_numpy(v).to(device) for k, v in batch.items()}
    RuntimeError: CUDA error: an illegal memory access was encountered
    

    Meanwhile, when trying to run the code on CPU, I got this error: IndexError: index out of range in self

    Is it a CUDA error, or maybe a problem with the tensors?

    Thank you in advance for your help!

    Best, Oryza

    opened by khairunnisaor 9
  • Will luke support fast tokenizer


    Hello everyone, I am trying to use luke-large for question answering. I met several issues when fine-tuning the model on SQuAD-like data; most of them come from the lack of fast tokenizer support. So I am wondering whether LUKE will support a fast tokenizer in the future, or whether there is any way to work around these issues. Thank you so much!

    opened by TrickyyH 1
  • Entity Mapping Preprocessing


    Hi, first of all, thank you for the nice work.

    Let's take the below input example.

    "Everaldo has played for Guarani and Santa Cruz in the Campeonato Brasileiro, before moving to Mexico where he played for Chiapas and Necaxa." , entity: Guarani .

    When training the model through the input, [MASK] token is added for masking Guarani entity. Then, the model is trained by predicting [MASK] as Guarani through Cross Entropy Loss.

    However, when we analyze entity_vocab.json, there is no "Guarani" entry. The entity_vocab.json only has "Guarani language", "Guarani FC", "Tupi–Guarani languages", and "Guarani mythology". In that example, I believe that Guarani means Guarani FC.

    Therefore, is the model trained to predict [MASK] as Guarani FC? If so, we need to let the model know that Guarani means Guarani FC, i.e., we need to match Guarani with Guarani FC.

    Does the preprocessing in https://github.com/studio-ousia/luke/blob/master/pretraining.md deal with such issues?

    Thank you.

    opened by kimwongyuda 1
  • Replace Luke with MLuke in Notebook/ConLL-2003


    Hi!

    I'm trying to run MLuke on https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb by replacing studio-ousia/luke-large-finetuned-conll-2003 with studio-ousia/mluke-large-lite-finetuned-conll-2003 and changing LukeTokenizer to MLukeTokenizer. Everything looks fine until this block:

    batch_size = 2
    all_logits = []

    for batch_start_idx in trange(0, len(test_examples), batch_size):
        batch_examples = test_examples[batch_start_idx:batch_start_idx + batch_size]
        texts = [example["text"] for example in batch_examples]
        entity_spans = [example["entity_spans"] for example in batch_examples]

        inputs = tokenizer(texts, entity_spans=entity_spans, return_tensors="pt", padding=True)
        inputs = inputs.to("cuda")
        with torch.no_grad():
            outputs = model(**inputs)
        all_logits.extend(outputs.logits.tolist())
    

    The error is

    AttributeError                            Traceback (most recent call last)
    Cell In [8], line 12
         10 inputs = inputs.to("cuda")
         11 with torch.no_grad():
    ---> 12     outputs = model(**inputs)
         13 all_logits.extend(outputs.logits.tolist())

    File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
       1098 # If we don't have any hooks, we want to skip the rest of the logic in
       1099 # this function, and just call forward.
       1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102     return forward_call(*input, **kwargs)
       1103 # Do not call functions when jit is used
       1104 full_backward_hooks, non_full_backward_hooks = [], []

    File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/transformers/models/luke/modeling_luke.py:1588, in LukeForEntitySpanClassification.forward(self, input_ids, attention_mask, token_type_ids, position_ids, entity_ids, entity_attention_mask, entity_token_type_ids, entity_position_ids, entity_start_positions, entity_end_positions, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
       1571 outputs = self.luke(
       1572     input_ids=input_ids,
       1573     attention_mask=attention_mask,
       (...)
       1584     return_dict=True,
       1585 )
       1586 hidden_size = outputs.last_hidden_state.size(-1)
    -> 1588 entity_start_positions = entity_start_positions.unsqueeze(-1).expand(-1, -1, hidden_size)
       1589 start_states = torch.gather(outputs.last_hidden_state, -2, entity_start_positions)
       1590 entity_end_positions = entity_end_positions.unsqueeze(-1).expand(-1, -1, hidden_size)

    AttributeError: 'NoneType' object has no attribute 'unsqueeze'

    Thank you.

    opened by mrpeerat 4
  • Training using HF Transformers


    Hi authors, thank you for sharing this interesting piece of work. I was trying this model on a custom NER dataset to compare it with other BERT variants. To that end, I was wondering if you could provide instructions on how to fine-tune this model on a custom NER dataset and what the dataset format should be.

    Also, instructions on pre-training the base model (without any head) on an unlabeled corpus would be really useful. I saw some instructions around pre-training using the AllenNLP library, but there is some friction there. Since HF Transformers is now a fairly stable and widely popular library, I would appreciate instructions on using LUKE with HF.

    opened by NiteshMethani 0
  • Luke NER Fine Tuning on Custom Entities.


    Hi, I am trying to fine-tune the LUKE base model and prepared a CoNLL-like dataset with two columns (one for tokens and the other for labels). The training runs smoothly, but no entity label is predicted when I try to make predictions. Could you tell me what changes in the code are needed to make the LUKE NER solution work for custom NER with a different number of classes?

    opened by ahmadaii 12
  • Any plans for huggingface to support Luke QA?


    I would be interested in using LUKE for QA with Hugging Face. Installing with Poetry has been rocky for me (issues with which version of Hugging Face Transformers to use, probably because it is not pinned), so complete QA support in Hugging Face would be great.

    opened by swartchris8 0