LUKE -- Language Understanding with Knowledge-based Embeddings

LUKE (Language Understanding with Knowledge-based Embeddings) is a new pre-trained contextualized representation of words and entities based on the transformer architecture. It was proposed in our paper LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention. It achieves state-of-the-art results on important NLP benchmarks, including SQuAD v1.1 (extractive question answering), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), TACRED (relation classification), and Open Entity (entity typing).

This repository contains the source code to pre-train the model and fine-tune it to solve downstream tasks.
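
For a quick look at what the model provides, the following minimal sketch obtains contextualized representations of both words and entity mentions through the Hugging Face Transformers integration described in the News section below (the studio-ousia/luke-base checkpoint name and the output attribute names follow the Transformers documentation, not this repository's own CLI; treat the snippet as illustrative):

>>> from transformers import LukeTokenizer, LukeModel
>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
>>> model = LukeModel.from_pretrained("studio-ousia/luke-base")
>>> text = "Beyoncé lives in Los Angeles."
>>> entity_spans = [(0, 7), (17, 28)]  # character-based spans of "Beyoncé" and "Los Angeles"
>>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
>>> outputs = model(**inputs)
>>> word_states = outputs.last_hidden_state            # contextualized word representations
>>> entity_states = outputs.entity_last_hidden_state   # contextualized entity representations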

News

November 24, 2021: Entity disambiguation example is available

Example code for entity disambiguation based on LUKE has been added to this repository. The model was originally proposed in our paper and achieved state-of-the-art results on five standard entity disambiguation datasets: AIDA-CoNLL, MSNBC, AQUAINT, ACE2004, and WNED-WIKI.

For further details, please refer to the example directory.

August 3, 2021: New example code based on Hugging Face Transformers and AllenNLP is available

New fine-tuning examples for three downstream tasks, i.e., NER, relation classification, and entity typing, have been added to LUKE. These examples are built on Hugging Face Transformers and AllenNLP, and the fine-tuning models are defined using AllenNLP's simple Jsonnet config files!

The example code is available in the examples_allennlp directory.

May 5, 2021: LUKE is added to Hugging Face Transformers

LUKE has been added to the master branch of the Hugging Face Transformers library. You can now easily solve entity-related tasks (e.g., named entity recognition, relation classification, entity typing) using this library.

For example, the LUKE-large model fine-tuned on the TACRED dataset can be used as follows:

>>> from transformers import LukeTokenizer, LukeForEntityPairClassification
>>> model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
>>> text = "Beyoncé lives in Los Angeles."
>>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
>>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> predicted_class_idx = int(logits[0].argmax())
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
Predicted class: per:cities_of_residence
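
Named entity recognition can be handled in the same way with the LukeForEntitySpanClassification head. The snippet below is a minimal sketch assuming the studio-ousia/luke-large-finetuned-conll-2003 checkpoint; the two candidate spans are chosen by hand for illustration, whereas a real NER pipeline would enumerate all possible spans in the text:

>>> from transformers import LukeTokenizer, LukeForEntitySpanClassification
>>> model = LukeForEntitySpanClassification.from_pretrained("studio-ousia/luke-large-finetuned-conll-2003")
>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-conll-2003")
>>> text = "Beyoncé lives in Los Angeles."
>>> entity_spans = [(0, 7), (17, 28)]  # candidate character spans to classify
>>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
>>> outputs = model(**inputs)
>>> predicted_indices = outputs.logits[0].argmax(dim=-1)  # one label per candidate span
>>> print([model.config.id2label[int(i)] for i in predicted_indices])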

We also provide three Colab notebooks that show how to reproduce our experimental results on the CoNLL-2003, TACRED, and Open Entity datasets using the library.

Please refer to the official documentation for further details.

November 5, 2021: LUKE-500K (base) model

We released LUKE-500K (base), a new pretrained LUKE model that is smaller than the existing LUKE-500K (large). The experimental results of LUKE-500K (base) and LUKE-500K (large) on SQuAD v1.1 and CoNLL-2003 are as follows:

Task | Dataset | Metric | LUKE-500K (base) | LUKE-500K (large)
Extractive Question Answering | SQuAD v1.1 | EM/F1 | 86.1/92.3 | 90.2/95.4
Named Entity Recognition | CoNLL-2003 | F1 | 93.3 | 94.3

We tuned only the batch size and learning rate in the experiments based on LUKE-500K (base).

Comparison with State-of-the-Art

LUKE outperforms the previous state-of-the-art methods on five important NLP tasks:

Task | Dataset | Metric | LUKE-500K (large) | Previous SOTA
Extractive Question Answering | SQuAD v1.1 | EM/F1 | 90.2/95.4 | 89.9/95.1 (Yang et al., 2019)
Named Entity Recognition | CoNLL-2003 | F1 | 94.3 | 93.5 (Baevski et al., 2019)
Cloze-style Question Answering | ReCoRD | EM/F1 | 90.6/91.2 | 83.1/83.7 (Li et al., 2019)
Relation Classification | TACRED | F1 | 72.7 | 72.0 (Wang et al., 2020)
Fine-grained Entity Typing | Open Entity | F1 | 78.2 | 77.6 (Wang et al., 2020)

These numbers are reported in our EMNLP 2020 paper.

Installation

LUKE can be installed using Poetry:

$ poetry install

The virtual environment automatically created by Poetry can be activated by running poetry shell.

Released Models

We release pre-trained models with a 500K entity vocabulary based on the roberta.base and roberta.large models.

Name | Base Model | Entity Vocab Size | Params | Download
LUKE-500K (base) | roberta.base | 500K | 253 M | Link
LUKE-500K (large) | roberta.large | 500K | 483 M | Link

Reproducing Experimental Results

The experiments were conducted using Python 3.6 and PyTorch 1.2.0 on a server with one or eight NVIDIA V100 GPUs. We used NVIDIA's PyTorch Docker container 19.02. For computational efficiency, we used mixed-precision training based on the APEX library, which can be installed as follows:

$ git clone https://github.com/NVIDIA/apex.git
$ cd apex
$ git checkout c3fad1ad120b23055f6630da0b029c8b626db78f
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

The APEX library is not needed if you do not use the --fp16 option or if you reproduce the results from the trained checkpoint files.

The commands to reproduce the experimental results are provided below:

Entity Typing on Open Entity Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    entity-typing run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    entity-typing run \
    --data-dir=<DATA_DIR> \
    --train-batch-size=2 \
    --gradient-accumulation-steps=2 \
    --learning-rate=1e-5 \
    --num-train-epochs=3 \
    --fp16
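
Predictions of a fine-tuned entity typing model can also be inspected through Hugging Face Transformers; the following is a minimal sketch that assumes the studio-ousia/luke-large-finetuned-open-entity checkpoint on the Hugging Face Hub rather than the checkpoint file used above:

>>> from transformers import LukeTokenizer, LukeForEntityClassification
>>> model = LukeForEntityClassification.from_pretrained("studio-ousia/luke-large-finetuned-open-entity")
>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-open-entity")
>>> text = "Beyoncé lives in Los Angeles."
>>> entity_spans = [(0, 7)]  # character-based span of the mention to type
>>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
>>> outputs = model(**inputs)
>>> predicted_class_idx = int(outputs.logits[0].argmax())
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])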

Relation Classification on TACRED Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    relation-classification run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    relation-classification run \
    --data-dir=<DATA_DIR> \
    --train-batch-size=4 \
    --gradient-accumulation-steps=8 \
    --learning-rate=1e-5 \
    --num-train-epochs=5 \
    --fp16

Named Entity Recognition on CoNLL-2003 Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    ner run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    ner run \
    --data-dir=<DATA_DIR> \
    --train-batch-size=2 \
    --gradient-accumulation-steps=4 \
    --learning-rate=1e-5 \
    --num-train-epochs=5 \
    --fp16

Cloze-style Question Answering on ReCoRD Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    entity-span-qa run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --num-gpus=8 \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    entity-span-qa run \
    --data-dir=<DATA_DIR> \
    --train-batch-size=1 \
    --gradient-accumulation-steps=4 \
    --learning-rate=1e-5 \
    --num-train-epochs=2 \
    --fp16

Extractive Question Answering on SQuAD 1.1 Dataset

Dataset: Link
Checkpoint file (compressed): Link
Wikipedia data files (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    reading-comprehension run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-negative \
    --wiki-link-db-file=enwiki_20160305.pkl \
    --model-redirects-file=enwiki_20181220_redirects.pkl \
    --link-redirects-file=enwiki_20160305_redirects.pkl \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --num-gpus=8 \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    reading-comprehension run \
    --data-dir=<DATA_DIR> \
    --no-negative \
    --wiki-link-db-file=enwiki_20160305.pkl \
    --model-redirects-file=enwiki_20181220_redirects.pkl \
    --link-redirects-file=enwiki_20160305_redirects.pkl \
    --train-batch-size=2 \
    --gradient-accumulation-steps=3 \
    --learning-rate=15e-6 \
    --num-train-epochs=2 \
    --fp16

Citation

If you use LUKE in your work, please cite the original paper:

@inproceedings{yamada2020luke,
  title={LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention},
  author={Ikuya Yamada and Akari Asai and Hiroyuki Shindo and Hideaki Takeda and Yuji Matsumoto},
  booktitle={EMNLP},
  year={2020}
}

Contact Info

Please submit a GitHub issue or send an e-mail to Ikuya Yamada ([email protected]) for help or issues using LUKE.

Comments
  • Adding LUKE to HuggingFace Transformers


    Hi, is there a possibility to reproduce the NER results on CPU instead of the default GPU configuration? I am unable to find any resource for this in the repo.

    I am using the following command, but there seems to be no flag/argument available to switch between CPU and GPU?

    python -m examples.cli --model-file=luke_large_500k.tar.gz --output-dir=<OUTPUT_DIR> ner run --data-dir=<DATA_DIR> --fp16 --train-batch-size=2 --gradient-accumulation-steps=2 --learning-rate=1e-5 --num-train-epochs=5
    

    Thanks in advance!

    opened by uahmad235 46
  • Pretraining instruction


    Hi authors,

    Awesome work! Thanks for your code and instructions. I would like to pretrain a new LUKE model on my own dataset. Could you write pretraining instructions so I can learn how to do this? Thank you!

    opened by JiachengLi1995 19
  • the result of squad


    I used Poetry to build the experiment environment and tried to reproduce your paper's performance following your advice, but failed: EM/F1 — paper: 90.2/95.4, your checkpoint: 89.76/94.97, my fine-tuning run: 89.04/94.69.

    Do you know the reason?

    opened by TingFree 16
  • Two questions: 1.Release entity vocab's wikipedia pageid? 2. Does [mask] occupy bert's 512 input?


    1. Right now, some titles in the entity vocab cannot be aligned to a unique Wikipedia pageid or Wikidata entity id: some are missing, and some titles refer to the same pageid. Can you release the mapping between the entity vocab's titles and Wikipedia pageids / Wikidata entity ids?
    2. It seems that the [MASK] tokens used for span representations don't occupy BERT's 512-token input? For example, if I have a sequence with 512 tokens and I want to use LUKE to extract 10 spans, can I input 512 tokens + 10 [MASK] tokens, rather than 502 tokens + 10 [MASK] tokens (as long as the [MASK] position embeddings are correctly aligned to the 10 spans)?
    opened by dalek-who 13
  • Getting RuntimeError for LukeRelationClassification


    While trying to replicate the results using the pre-trained model for relation classification, I am getting the following error. I looked at the function load_state_dict(); the strict argument is set to False.

    Traceback (most recent call last):
      File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/home/akshay/re_rc/luke/examples/cli.py", line 132, in <module>
        cli()
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "/home/akshay/re_rc/luke/examples/utils/trainer.py", line 32, in wrapper
        return func(*args, **kwargs)
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/click/decorators.py", line 33, in new_func
        return f(get_current_context().obj, *args, **kwargs)
      File "/home/akshay/re_rc/luke/examples/relation_classification/main.py", line 110, in run
        model.load_state_dict(torch.load(args.checkpoint_file, map_location="cpu"))
      File "/home/akshay/re_rc/luke/luke/model.py", line 236, in load_state_dict
        super(LukeEntityAwareAttentionModel, self).load_state_dict(new_state_dict, *args, **kwargs)
      File "/home/akshay/pyTorch-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
        self.__class__.__name__, "\n\t".join(error_msgs)))
    RuntimeError: Error(s) in loading state_dict for LukeForRelationClassification:
        size mismatch for embeddings.word_embeddings.weight: copying a param with shape torch.Size([50266, 1024]) from checkpoint, the shape in current model is torch.Size([50267, 1024]).
        size mismatch for entity_embeddings.entity_embeddings.weight: copying a param with shape torch.Size([2, 256]) from checkpoint, the shape in current model is torch.Size([3, 256]).


    I cannot understand the reason behind this. Can somebody please explain!

    opened by akshayparakh25 13
  • LUKE Large Finetuning Duration for NER


    Hello there,

    I'm trying to fine-tune LUKE large for the NER task using data in CoNLL format combined from multiple sources, including CoNLL-2003 itself. The training is being done on a Google Colab GPU, but it takes very long to finish a single batch: it took about 2 hours to train on just 2 batches with a batch size of 2. Is this expected? And if not, why does this happen?

    Thanks in advance

    opened by taghreed34 12
  • Pretraining for a Different Language and Testing Pretrained Model for Entity Disambiguation


    Hi,

    I have pretrained an entity disambiguation model following the recent instructions for pretraining LUKE. From the instructions shared here and in issue #115, I was able to perform the two-step pretraining for Turkish.

    I am sharing the commands and config files I used in order to make sure that nothing is off. I ran the following command for the first stage with the configuration below. Command:

    deepspeed \
    --num_gpus=6 luke/pretraining/train.py \
    --output-dir=training_on_turkish/luke-bert-base-turkish-first-stage \
    --deepspeed-config-file=pretraining_config/luke_base_stage1.json \
    --dataset-dir=training_on_turkish/tr_pretraining_dataset \
    --bert-model-name=dbmdz/bert-base-turkish-uncased  \
    --num-epochs=1 \
    --fix-bert-weights \
    --masked-entity-prob=0.30 \
    --masked-lm-prob=0
    

    Config:

    {
      "train_batch_size": 24,
      "train_micro_batch_size_per_gpu": 4,
      "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": 5e-4,
          "betas": [0.9, 0.999],
          "eps": 1e-6,
          "weight_decay": 0.01,
          "bias_correction": false
        }
      },
      "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
          "warmup_min_lr": 0,
          "warmup_max_lr": 5e-4,
          "warmup_num_steps": 1000,
          "total_num_steps": 192796,
          "warmup_type": "linear"
        }
      },
      "gradient_clipping": 10000.0
    }
    

    The command and configuration for the second stage are below. Command:

    deepspeed \
    --num_gpus=6 luke/pretraining/train.py \
    --output-dir=training_on_turkish/luke-bert-base-turkish-second-stage \
    --deepspeed-config-file=pretraining_config/luke_base_stage2.json \
    --dataset-dir=training_on_turkish/tr_pretraining_dataset/ \
    --bert-model-name=dbmdz/bert-base-turkish-uncased \
    --num-epochs=5 \
    --reset-optimization-states \
    --resume-checkpoint-id=training_on_turkish/luke-bert-base-turkish-first-stage/checkpoints/epoch1/ \
    --masked-entity-prob=0.30 \
    --masked-lm-prob=0
    

    Config:

    {
      "train_batch_size": 24,
      "train_micro_batch_size_per_gpu": 4,
      "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": 1e-5,
          "betas": [0.9, 0.999],
          "eps": 1e-6,
          "weight_decay": 0.01,
          "bias_correction": false
        }
      },
      "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
          "warmup_min_lr": 0,
          "warmup_max_lr": 1e-5,
          "warmup_num_steps": 2500,
          "total_num_steps": 98168,
          "warmup_type": "linear"
        }
      },
      "gradient_clipping": 10000.0
    }
    

    I guess I now have a model that can perform entity disambiguation? The problem is that I cannot find any clear example of running or evaluating a pretrained model on entity disambiguation. How should the data be formatted? How should I call the model in order to make predictions?

    Thank you for your time.

    opened by fatihbeyhan 12
  • Assertion error with CONLL03


    Hi, here I met another problem when using LUKE on the NER dataset CoNLL-2003... When creating features from examples, the variable entity_labels is empty for some examples, such as train-945:

    guid=train-945
    words=['SOCCER', '-', 'ENGLISH', 'SOCCER', 'RESULTS', '.', 'LONDON', '1996-08-30', 'Results', 'of', 'English', 'league', 'matches', 'on', 'Friday', ':', 'Division', 'two', 'Plymouth', '2', 'Preston', '1', 'Division', 'three', 'Swansea', '1', 'Lincoln', '2']
    labels=['O', 'O', 'B-MISC', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'B-ORG', 'O', 'O', 'O', 'B-ORG', 'O', 'B-ORG', 'O']
    

    and the code here throws an AssertionError: https://github.com/studio-ousia/luke/blob/9323b216dd5f72b61545bc4133f7709fd19bfa95/examples/ner/utils.py#L239 Do you have any idea what's wrong with these examples?

    opened by Riroaki 12
  • The meaning of entity_ids=[1, 0] when finetune OpenEntity dataset


    I am running the fine-tuning process on Open Entity, but I cannot understand the code below. Why is entity_ids = [1, 0]?

    entity_ids = [1, 0]
    entity_attention_mask = [1, 0]
    entity_segment_ids = [0, 0]
    entity_position_ids = list(range(mention_start, mention_end))[:max_mention_length]
    entity_position_ids += [-1] * (max_mention_length - mention_end + mention_start)
    entity_position_ids = [entity_position_ids, [-1] * max_mention_length]

    opened by lshowway 10
  • Preparing Environment For Allen-nlp.


    Hi @ikuyamada, I am facing issues while setting up the environment for the AllenNLP-based NER and RE solutions. I learned that it works only on Python 3.7, so I dockerized it with the requirements.txt file and ran the container. This time the package poetry was not found. I added it as an extra requirement and it started running. For both AllenNLP-based solutions, it errored out as shown in the attached screenshot (Screenshot from 2022-03-19 15-45-30). How should I prepare the environment for this solution? Can I use higher versions of Python for that?

    opened by elonmusk-01 9
  • Pretraining Problem


    Hi @ikuyamada,

    Thanks for your amazing work on this entity-aware language model. I am interested in building a LUKE model for the Indonesian language. Since I couldn't find any documentation about how to train the model, I did the following steps:

    1. Build the dump DB (build-dump-db)
    2. Build the entity vocab (build-entity-vocab)
    3. Build Wiki pretraining dataset (build-wikipedia-pretraining-dataset)
    4. Do the pretraining

    However, when starting to do the pretraining, I got some errors:

    Traceback (most recent call last):
      File "/usr/playground/luke/luke/pretraining/train.py", line 353, in run_pretraining
        result = model(**batch)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/playground/luke/luke/pretraining/model.py", line 81, in forward
        entity_attention_mask,
      File "/usr/playground/luke/luke/model.py", line 109, in forward
        entity_embedding_output = self.entity_embeddings(entity_ids, entity_position_ids, entity_segment_ids)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/playground/luke/luke/model.py", line 60, in forward
        entity_embeddings = self.entity_embedding_dense(entity_embeddings)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
        return F.linear(input, self.weight, self.bias)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/torch/nn/functional.py", line 1612, in linear
        output = input.matmul(weight.t())
    RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
    [2020-12-09 00:43:15,490] [ERROR] Consecutive errors have been observed. Exiting... ([email protected]:379)
    Traceback (most recent call last):
      File "/usr/playground/luke/luke/pretraining/train.py", line 352, in run_pretraining
        batch = {k: torch.from_numpy(v).to(device) for k, v in batch.items()}
      File "/usr/playground/luke/luke/pretraining/train.py", line 352, in <dictcomp>
        batch = {k: torch.from_numpy(v).to(device) for k, v in batch.items()}
    RuntimeError: CUDA error: an illegal memory access was encountered
    Traceback (most recent call last):
      File "./luke/cli.py", line 67, in <module>
        cli()
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/click/core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/.pyenv/versions/luke/lib/python3.7/site-packages/click/core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "/usr/playground/luke/luke/pretraining/train.py", line 82, in pretrain
        run_pretraining(Namespace(**kwargs))
      File "/usr/playground/luke/luke/pretraining/train.py", line 352, in run_pretraining
        batch = {k: torch.from_numpy(v).to(device) for k, v in batch.items()}
      File "/usr/playground/luke/luke/pretraining/train.py", line 352, in <dictcomp>
        batch = {k: torch.from_numpy(v).to(device) for k, v in batch.items()}
    RuntimeError: CUDA error: an illegal memory access was encountered
    

    Meanwhile, when trying to run the code on CPU, I got this error: IndexError: index out of range in self

    Is it a cuda error or maybe because of the tensors?

    Thank you in advance for your help!

    Best, Oryza

    opened by khairunnisaor 9
  • Will luke support fast tokenizer


    Hello everyone, I am trying to use luke-large for question answering. I met several issues when fine-tuning the model on SQuAD-like data; most of them come from the lack of fast tokenizer support. So I am wondering whether LUKE will support a fast tokenizer in the future, or whether there is any way to work around these issues. Thank you so much!

    opened by TrickyyH 1
  • Entity Mapping Preprocessing


    Hi, first of all, thank you for the nice work.

    Let's take the below input example.

    "Everaldo has played for Guarani and Santa Cruz in the Campeonato Brasileiro, before moving to Mexico where he played for Chiapas and Necaxa." , entity: Guarani .

    When training the model on this input, a [MASK] token is added to mask the Guarani entity. Then, the model is trained to predict [MASK] as Guarani through the cross-entropy loss.

    However, when we analyze entity_vocab.json, there is no "Guarani". The entity_vocab.json only has "Guarani language", "Guarani FC", "Tupi–Guarani languages", and "Guarani mythology". In this example, I believe that Guarani means Guarani FC.

    Therefore, is the model trained to predict [MASK] as Guarani FC? If so, we need to let the model know that Guarani means Guarani FC, and I guess we need to match Guarani with Guarani FC.

    Does the preprocessing in https://github.com/studio-ousia/luke/blob/master/pretraining.md deal with such issues?

    Thank you.

    opened by kimwongyuda 1
  • Replace Luke with MLuke in Notebook/ConLL-2003


    Hi!

    I'm trying to run mLUKE on https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb by replacing studio-ousia/luke-large-finetuned-conll-2003 with studio-ousia/mluke-large-lite-finetuned-conll-2003 and changing LukeTokenizer to MLukeTokenizer. Everything looks fine until this block:

    batch_size = 2
    all_logits = []

    for batch_start_idx in trange(0, len(test_examples), batch_size):
        batch_examples = test_examples[batch_start_idx:batch_start_idx + batch_size]
        texts = [example["text"] for example in batch_examples]
        entity_spans = [example["entity_spans"] for example in batch_examples]

        inputs = tokenizer(texts, entity_spans=entity_spans, return_tensors="pt", padding=True)
        inputs = inputs.to("cuda")
        with torch.no_grad():
            outputs = model(**inputs)
        all_logits.extend(outputs.logits.tolist())
    

    The error is

    AttributeError                            Traceback (most recent call last)
    Cell In [8], line 12
         10 inputs = inputs.to("cuda")
         11 with torch.no_grad():
    ---> 12 outputs = model(**inputs)
         13 all_logits.extend(outputs.logits.tolist())

    File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
       1098 # If we don't have any hooks, we want to skip the rest of the logic in
       1099 # this function, and just call forward.
       1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102     return forward_call(*input, **kwargs)
       1103 # Do not call functions when jit is used
       1104 full_backward_hooks, non_full_backward_hooks = [], []

    File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/transformers/models/luke/modeling_luke.py:1588, in LukeForEntitySpanClassification.forward(self, input_ids, attention_mask, token_type_ids, position_ids, entity_ids, entity_attention_mask, entity_token_type_ids, entity_position_ids, entity_start_positions, entity_end_positions, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
       1571 outputs = self.luke(
       1572     input_ids=input_ids,
       1573     attention_mask=attention_mask,
       (...)
       1584     return_dict=True,
       1585 )
       1586 hidden_size = outputs.last_hidden_state.size(-1)
    -> 1588 entity_start_positions = entity_start_positions.unsqueeze(-1).expand(-1, -1, hidden_size)
       1589 start_states = torch.gather(outputs.last_hidden_state, -2, entity_start_positions)
       1590 entity_end_positions = entity_end_positions.unsqueeze(-1).expand(-1, -1, hidden_size)

    AttributeError: 'NoneType' object has no attribute 'unsqueeze'

    Thank you.

    opened by mrpeerat 4
  • Training using HF Transformers


    Hi authors, thank you for sharing this interesting piece of work. I was trying this model on a custom NER dataset to compare it with other BERT variants. To that end, I was wondering if you could provide instructions on how to fine-tune this model on a custom NER dataset and what the dataset format should be.

    Also, instructions on pre-training the base model (without any head) using an unlabeled corpus would be really useful. I saw some instructions around pre-training using the AllenNLP library, but there is some friction there. Since HF is now a fairly stable and widely popular library, I would appreciate it if you could provide instructions on using LUKE with HF.

    opened by NiteshMethani 0
  • Luke NER Fine Tuning on Custom Entities.


    Hi, I am trying to fine-tune the LUKE base model and have prepared a CoNLL-like dataset with two columns (one for tokens and the other for labels). The training runs smoothly, but it asserts no entity label when trying to make predictions. Could you let me know what changes in the code are needed to make the LUKE NER solution work for custom NER with a different number of classes?

    opened by ahmadaii 12
  • Any plans for huggingface to support Luke QA?


    I would be interested in using LUKE for QA with Hugging Face. Installing with Poetry has been rocky for me (issues with which version of huggingface to use, probably not pinned), so complete QA support in Hugging Face would be great.

    opened by swartchris8 0