Longformer: The Long-Document Transformer

AI2

Last update: Dec 29, 2022

Related tags

Text Data & NLP longformer

Overview

`Longformer`

Longformer and LongformerEncoderDecoder (LED) are pretrained transformer models for long documents.

***** New December 1st, 2020: LongformerEncoderDecoder *****

A LongformerEncoderDecoder (LED) model is now available. It supports seq2seq tasks with long input. With gradient checkpointing, fp16, and 48GB gpu, the input length can be up to 16K tokens. Check the updated paper for the model details and evaluation.

Pretrained models: 1) led-base-16384, 2) led-large-16384
Requirements: Make sure to use the huggingface/transformers fork specified in requirements.txt. It adds support for gradient checkpointing and allows different maximum sequence length for the input and output. You can also run pip install git+https://github.com/allenai/longformer.git
Check the script scripts/summarization.py for an example of how to use the model.

***** New July 23rd, 2020: Speed degradation *****

A significant speed degradation in the hugginface/transformers was recenlty discovered and fixed (check this PR for details). To avoid this problem, either use the old release v2.11.0 but it doesn't support gradient checkpointing, or use the master branch. This problem should be fixed with the next hugginface/transformers release.

***** New June 29th, 2020: Easier to use Gradient checkpointing *****

Gradient checkpointing has been released with huggingface/transformers release v3.0.0. Gradient checkpointing reduces memory by 5x which makes it possible to process longer sequences on smaller GPUs. To use, try something like the following:

from transformers import LongformerModel
model = LongformerModel.from_pretrained('allenai/longformer-base-4096', gradient_checkpointing=True)

***** New June 2nd, 2020: Integrating with Huggingface + Train your own long model + Gradient checkpointing *****

Longformer is now integrated in the huggingface/transformers release v2.11.0. Now you can do

from transformers import LongformerModel
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

The release also includes LongformerForQA and other LongformerForTaskName with automatic setting of global attention.

We added a notebook to show how to convert an existing pretrained model into its "long" version.
Gradient checkpointing has been merged into HF master (check PR). Gradient checkpointing can reduce memory usage significanlty (5x for longformer-base-4096) allowing longer sequences on smaller gpus.

***** New April 27th, 2020: A PyTorch implementation of the sliding window attention *****

We added a PyTorch implementation of the sliding window attention that doesn't require the custom CUDA kernel. It is limited in functionality but more convenient to use for finetuning on downstream tasks.

Advantage: supports CPU, TPU and fp16, which aren't supported by the custom CUDA kernel

Limitations: uses 2x more memory (but fp16 offsets that), and doesn’t support dilation and autoregressive attention (not needed for finetuning)

therefore, it is suitable for finetuning on downstream tasks but not a good choice for language modeling. The code snippit below and the TriviaQA scripts were updated to use this new implementation.

***** End new information *****

How to use

Download pretrained model

Install environment and code

conda create --name longformer python=3.7
conda activate longformer
conda install cudatoolkit=10.0
pip install git+https://github.com/allenai/longformer.git

Run the model

import torch
from longformer.longformer import Longformer, LongformerConfig
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer

config = LongformerConfig.from_pretrained('longformer-base-4096/') 
# choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
# 'n2': for regular n2 attantion
# 'tvm': a custom CUDA kernel implementation of our sliding window attention
# 'sliding_chunks': a PyTorch implementation of our sliding window attention
config.attention_mode = 'sliding_chunks'

model = Longformer.from_pretrained('longformer-base-4096/', config=config)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokenizer.model_max_length = model.config.max_position_embeddings

SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document

input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1

# TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
# model = model.cuda(); input_ids = input_ids.cuda()

# Attention mask values -- 0: no attention, 1: local attention, 2: global attention
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
attention_mask[:, [1, 4, 21,]] =  2  # Set global attention based on the task. For example,
                                     # classification: the <s> token
                                     # QA: question tokens

# padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
input_ids, attention_mask = pad_to_window_size(
        input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)

output = model(input_ids, attention_mask=attention_mask)[0]

Model pretraining

This notebook demonstrates our procedure for training Longformer starting from the RoBERTa checkpoint. The same procedure can be followed to get a long-version of other existing pretrained models.

TriviaQA

Training scripts: scripts/triviaqa.py
Pretrained large model: here (replicates leaderboard results)
Instructions: scripts/cheatsheet.txt

CUDA kernel

Our custom CUDA kernel is implemented in TVM. For now, the kernel only works on GPUs and Linux. We tested it on Ubuntu, Python 3.7, CUDA10, PyTorch >= 1.2.0. If it doesn't work for your environment, please create a new issue.

Compiling the kernel: We already include the compiled binaries of the CUDA kernel, so most users won't need to compile it, but if you are intersted, check scripts/cheatsheet.txt for instructions.

Known issues

Please check the repo issues for a list of known issues that we are planning to address soon. If your issue is not discussed, please create a new one.

Citing

If you use Longformer in your research, please cite Longformer: The Long-Document Transformer.

@article{Beltagy2020Longformer,
  title={Longformer: The Long-Document Transformer},
  author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
  journal={arXiv:2004.05150},
  year={2020},
}

Longformer is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

Comments

ImportError: cannot import name 'nvcc'

from tvm.contrib import nvcc ImportError: cannot import name 'nvcc'

I get this when trying to compile the kernel from scratch. Did I miss something in the cmake config? I can import a lot of TVM modules but not nvcc.

My cuda version is: Cuda compilation tools, release 10.0, V10.0.130

opened by safooray 33
Text Classifier using longformer

Can we request to add a short example of longformer for long text/review classification? Current triviaQA is good but more examples will encourage further use of longformer.

Thanks. Patrick

opened by pchankh 14

RuntimeError: CUDA error: device-side assert triggered - is_global_attn = is_index_global_attn.flatten().any().item()

I'm trying to train a new model from scratch where it's length is 1024 (using huggingface implementation of longformer), but I get the following exception at a line that is recently added:

--> 150         is_global_attn = is_index_global_attn.flatten().any().item()
    151 
    152         hidden_states = hidden_states.transpose(0, 1)

RuntimeError: CUDA error: device-side assert triggered

I tried Reformer and it worked as expected. The Longfomer config is as follows?

LongformerConfig {
  "attention_probs_dropout_prob": 0.1,
  "attention_window": 64,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 1026,
  "model_type": "longformer",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 257,
  "sep_token_id": 258,
  "type_vocab_size": 2,
  "vocab_size": 261
}

Any idea what the issue is?

opened by zarandioon 13

segmentation fault illegal instruction

setup

ubuntu 16.04 tvm 0.7 dev1 pytorch 1.4.0 transformer 2.11.0 other same as requirements.txt

issue

I uncomment the line in diagonaled_mm_tvm.py DiagonaledMM._get_function('float32', 'cuda')

After that, When I run the code , it show Loading tvm binary from :./longformer/lib/lib_diagonaled_mm_float32_cuda.so ... segmentation fault (core dump) or show Loading tvm binary from :./longformer/lib/lib_diagonaled_mm_float32_cuda.so ... illegal instruction (core dump)

other

I test the tvm, tensorflow and pytorch, there are fine. And I follow the scripts/cheatsheet.txt to regenerate the lib_diagonaled_mm_float32_cuda.so, it can generate succeed.

Any idea or suggestion?

the code is below

import torch
from longformer.longformer import Longformer, LongformerConfig
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer

config = LongformerConfig.from_pretrained('longformer-base-4096/') 
# choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
# 'n2': for regular n2 attantion
# 'tvm': a custom CUDA kernel implementation of our sliding window attention
# 'sliding_chunks': a PyTorch implementation of our sliding window attention
config.attention_mode = 'tvm'

model = Longformer.from_pretrained('longformer-base-4096/', config=config)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokenizer.model_max_length = model.config.max_position_embeddings

SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document

input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1

# TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
model = model.cuda(); input_ids = input_ids.cuda()

# Attention mask values -- 0: no attention, 1: local attention, 2: global attention
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
attention_mask[:, [1, 4, 21,]] =  2  # Set global attention based on the task. For example,
                                     # classification: the <s> token
                                     # QA: question tokens

# padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
input_ids, attention_mask = pad_to_window_size(
        input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)

output = model(input_ids, attention_mask=attention_mask)[0]

opened by ProfXGiter 13

Using RoBERTa or LongFormer for texts with 16K tokens

LongFormer does it by pooling all the local attentions (512) together in global attention (512 x 8 = 4096).

This is not entirely true. There's no "pooling" of the 4096 tokens into 512. We keep all 4096 tokens. The only change is how attention is computed; instead of every token attending to every other token, we change it such that every token attends to a smaller number of surrounding tokens. This speeds up selfattention computation (which is the bottleneck) by assuming that the attention score between certain pairs of words is zero. This doesn't change the architecture or introduce any pooling.

We are working on some code that will make it easy to train your own long model, so you can try longer sequences. We know it is easy to get to 16K or even 32k with RoBERTa-base architecture (need base model, fp16, gradient checkpointing). For sequences longer than that, you will need to find ways to save memory depending on your application. For example, reducing window size, reducing size of the feed forward layers, implementing reversible transformers, use sinusoidal position embedding instead of learned position embedding.

Originally posted by @ibeltagy in https://github.com/allenai/longformer/issues/48#issuecomment-634270401

opened by vr25 10
Not able to use the embedding for calculating similarity.

First of all let me thank you for contributing this knowledge to us. It makes a lot of difference for beginners like me. :) Now the issue: I was trying to use longformer for calculating the similarity between a query and a list of paragraphs retrieved from my index search. The idea is to re-rank these paragraphs based on the the cosine similarity of the embedding of Question and the individual paragraph.

However, once I have calculated the embedding of both query and paragraph using this code: SAMPLE_TEXT = f'{tokenizer.cls_token}{SAMPLE_TEXT}{tokenizer.eos_token}' ................................... ...................... output = model(input_ids, attention_mask=attention_mask)[0]

I get a embedding of dimension: torch.Size([1, 512, 768]) and when I try to calculate the cosine similarity on these embeddings I get error saying : ever got this error: RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead. while working with torch?

I do see that the error recommends me to use var.detach().numpy() insteam of numpy(). https://stackoverflow.com/questions/55466298/pytorch-cant-call-numpy-on-variable-that-requires-grad-use-var-detach-num

However, I am unsure where should I append this line of code. I am a beginner and hence please pardon if I have raised an issue unrelated to longformer.

Thanks for help :)

opened by titu1992 10
help in understanding task global attention

Hi,

Need help in understanding the concept below?

So does this mean that the complexity is quadratic (if all tokens attend to all other tokens) for task tuning but linear otherwise?

Thanks!

opened by vr25 9
Has anyone reproduced TriviaQA result with pytorch-lightning checkpoint?

Hi, I'm trying to reproduce the TriviaQA result following instructions in cheatsheet. I user following instructions to reproduce it from cheatsheet.txt

// To run our pretrained TriviaQA large model (replicates the leaderboard results), // first download the pytorch-lightning checkpoint: // https://ai2-s2-research.s3-us-west-2.amazonaws.com/longformer/triviaqa-longformer-large.tar.gz // then run: python -m scripts.triviaqa
--train_dataset squad-wikipedia-train-4096.json \ # loaded but not used --dev_dataset squad-wikipedia-dev-4096.json
--gpus 0 --num_workers 4
--max_seq_len 4096 --doc_stride -1
--save_prefix triviaqa-longformer-large \ # pretrained pytorch-lighting checkpoint --model_path path/to/pretrained/longformer-large-4096 \ # loaded but not used --test # predictions will be saved into predictions.json

//then run the official evaluation scripts python -m scripts.triviaqa_utils.evaluation_utils
--dataset_file path/to/qa/wikipedia-dev.json
--prediction_file predictions.json //Output should be: {'exact_match': 73.07644188665083, 'f1': 77.78523804802242, 'common': 7993, 'denominator': 7993, 'pred_len': 7993, 'gold_len': 7993}

But I keep getting result {'exact_match': 0.025021894157387713, 'f1': 4.579085300341775, 'common': 7993, 'denominator': 7993, 'pred_len': 7993, 'gold_len': 7993}, which is very weird..

I downloaded dataset and converted both train and dev dataset into squad format by provided script, and I just replaced data and model path to my server's setting.

Has anyone reproduced the result f1:77.78 with given pytorch-lightning checkpoint?

opened by YJYJLee 9
How can I train the pre-train model on chinese corpus?

Now I want to train a pre-train model on chinese corpus, but the details are not clear. such as, how to make the minimal changes necessary to support Longformer’s attention mechanism, how to take the attention pattern to plug into a pretrained transformer model.

opened by liangxg787 9
Fine-tuning Longformer for squad (out of memory)
I have pretrained an MLM Longformer using roberta-base based on this recipe.

Then I tried to fine-tune it for squad quetion-answering. Here is the trainer and following is the run-time setting (based on here):

python run_squad.py
--model_type roberta
--model_name_or_path pathe_to_roberta_base_mlm_trained_4096
--do_train
--do_eval
--do_lower_case
--train_file $SQUAD_DIR/train-v1.1.json
--predict_file $SQUAD_DIR/dev-v1.1.json
--per_gpu_train_batch_size 1
--learning_rate 3e-5
--num_train_epochs 2.0
--max_seq_length 4096
--doc_stride 128
--output_dir /tmp/debug_squad/

While I am using a V100 node (16-GPUs, 32 GB), it always faces memory limit of gpu as follow:

File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker output = module(*input, **kwargs)

File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 642, in forward output_hidden_states=output_hidden_states, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 762, in forward output_hidden_states=output_hidden_states, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 439, in forward output_attentions, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 371, in forward hidden_states, attention_mask, head_mask, output_attentions=output_attentions, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 315, in forward hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 240, in forward attention_scores = attention_scores / math.sqrt(self.attention_head_size) RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 31.72 GiB total capacity; 30.25 GiB already allocated; 300.38 MiB free; 30.29 GiB reserved in total by PyTorch)

However, using allenai/longformer-base-4096, it works. Could you please comment on what I may be missing in the above steps.
opened by arashashari 8

CUDA error: device-side assert triggered, while converting BERT to Long

Hi!

I got an apparently working code for converting a BERT model into a longformer, but now I am trying to convert BERTeus to Longoformer, which I expected to work in the same way (just changing the dataset + model name/path).

with a small(with big same issue) training corpus (50K lines), the training starts well, but it breaks around step 20, after 3-4 epochs.


2020-09-22 15:01:55.336576: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-09-22 15:01:55.338202: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
INFO:__main__:Loading the model from tmp/bert-base-4096
INFO:transformers.configuration_utils:loading configuration file tmp/bert-base-4096/config.json
INFO:transformers.configuration_utils:Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "gradient_checkpointing": true,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 4096,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 3,
  "type_vocab_size": 2,
  "vocab_size": 50099
}

INFO:transformers.tokenization_utils_base:Model name 'tmp/bert-base-4096' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'tmp/bert-base-4096' is a path, a model identifier, or url to a directory containing tokenizer files.
INFO:transformers.tokenization_utils_base:Didn't find file tmp/bert-base-4096/added_tokens.json. We won't load it.
INFO:transformers.tokenization_utils_base:Didn't find file tmp/bert-base-4096/tokenizer.json. We won't load it.
INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/vocab.txt
INFO:transformers.tokenization_utils_base:loading file None
INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/special_tokens_map.json
INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/tokenizer_config.json
INFO:transformers.tokenization_utils_base:loading file None
/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_auto.py:798: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
  FutureWarning,
INFO:transformers.configuration_utils:loading configuration file tmp/bert-base-4096/config.json
INFO:transformers.configuration_utils:Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "gradient_checkpointing": true,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 4096,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 3,
  "type_vocab_size": 2,
  "vocab_size": 50099
}

INFO:transformers.modeling_utils:loading weights file tmp/bert-base-4096/pytorch_model.bin
WARNING:transformers.modeling_utils:Some weights of the model checkpoint at tmp/bert-base-4096 were not used when initializing BertForMaskedLM: ['bert.encoder.layer.0.attention.self.query_global.weight', 'bert.encoder.layer.0.attention.self.query_global.bias', 'bert.encoder.layer.0.attention.self.key_global.weight', 'bert.encoder.layer.0.attention.self.key_global.bias', 'bert.encoder.layer.0.attention.self.value_global.weight', 'bert.encoder.layer.0.attention.self.value_global.bias', 'bert.encoder.layer.1.attention.self.query_global.weight', 'bert.encoder.layer.1.attention.self.query_global.bias', 'bert.encoder.layer.1.attention.self.key_global.weight', 'bert.encoder.layer.1.attention.self.key_global.bias', 'bert.encoder.layer.1.attention.self.value_global.weight', 'bert.encoder.layer.1.attention.self.value_global.bias', 'bert.encoder.layer.2.attention.self.query_global.weight', 'bert.encoder.layer.2.attention.self.query_global.bias', 'bert.encoder.layer.2.attention.self.key_global.weight', 'bert.encoder.layer.2.attention.self.key_global.bias', 'bert.encoder.layer.2.attention.self.value_global.weight', 'bert.encoder.layer.2.attention.self.value_global.bias', 'bert.encoder.layer.3.attention.self.query_global.weight', 'bert.encoder.layer.3.attention.self.query_global.bias', 'bert.encoder.layer.3.attention.self.key_global.weight', 'bert.encoder.layer.3.attention.self.key_global.bias', 'bert.encoder.layer.3.attention.self.value_global.weight', 'bert.encoder.layer.3.attention.self.value_global.bias', 'bert.encoder.layer.4.attention.self.query_global.weight', 'bert.encoder.layer.4.attention.self.query_global.bias', 'bert.encoder.layer.4.attention.self.key_global.weight', 'bert.encoder.layer.4.attention.self.key_global.bias', 'bert.encoder.layer.4.attention.self.value_global.weight', 'bert.encoder.layer.4.attention.self.value_global.bias', 'bert.encoder.layer.5.attention.self.query_global.weight', 'bert.encoder.layer.5.attention.self.query_global.bias', 'bert.encoder.layer.5.attention.self.key_global.weight', 'bert.encoder.layer.5.attention.self.key_global.bias', 'bert.encoder.layer.5.attention.self.value_global.weight', 'bert.encoder.layer.5.attention.self.value_global.bias', 'bert.encoder.layer.6.attention.self.query_global.weight', 'bert.encoder.layer.6.attention.self.query_global.bias', 'bert.encoder.layer.6.attention.self.key_global.weight', 'bert.encoder.layer.6.attention.self.key_global.bias', 'bert.encoder.layer.6.attention.self.value_global.weight', 'bert.encoder.layer.6.attention.self.value_global.bias', 'bert.encoder.layer.7.attention.self.query_global.weight', 'bert.encoder.layer.7.attention.self.query_global.bias', 'bert.encoder.layer.7.attention.self.key_global.weight', 'bert.encoder.layer.7.attention.self.key_global.bias', 'bert.encoder.layer.7.attention.self.value_global.weight', 'bert.encoder.layer.7.attention.self.value_global.bias', 'bert.encoder.layer.8.attention.self.query_global.weight', 'bert.encoder.layer.8.attention.self.query_global.bias', 'bert.encoder.layer.8.attention.self.key_global.weight', 'bert.encoder.layer.8.attention.self.key_global.bias', 'bert.encoder.layer.8.attention.self.value_global.weight', 'bert.encoder.layer.8.attention.self.value_global.bias', 'bert.encoder.layer.9.attention.self.query_global.weight', 'bert.encoder.layer.9.attention.self.query_global.bias', 'bert.encoder.layer.9.attention.self.key_global.weight', 'bert.encoder.layer.9.attention.self.key_global.bias', 'bert.encoder.layer.9.attention.self.value_global.weight', 'bert.encoder.layer.9.attention.self.value_global.bias', 'bert.encoder.layer.10.attention.self.query_global.weight', 'bert.encoder.layer.10.attention.self.query_global.bias', 'bert.encoder.layer.10.attention.self.key_global.weight', 'bert.encoder.layer.10.attention.self.key_global.bias', 'bert.encoder.layer.10.attention.self.value_global.weight', 'bert.encoder.layer.10.attention.self.value_global.bias', 'bert.encoder.layer.11.attention.self.query_global.weight', 'bert.encoder.layer.11.attention.self.query_global.bias', 'bert.encoder.layer.11.attention.self.key_global.weight', 'bert.encoder.layer.11.attention.self.key_global.bias', 'bert.encoder.layer.11.attention.self.value_global.weight', 'bert.encoder.layer.11.attention.self.value_global.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
INFO:transformers.modeling_utils:All the weights of BertForMaskedLM were initialized from the model checkpoint at tmp/bert-base-4096.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
INFO:__main__:Pretraining bert-base-4096 ... 
INFO:filelock:Lock 140392820589624 acquired on cached_lm_BertTokenizerFast_4094_valEusLong.txt.lock
INFO:transformers.data.datasets.language_modeling:Loading features from cached file cached_lm_BertTokenizerFast_4094_valEusLong.txt [took 0.008 s]
INFO:filelock:Lock 140392820589624 released on cached_lm_BertTokenizerFast_4094_valEusLong.txt.lock
INFO:__main__:Loading and tokenizing training data is usually slow: trainEusLong1.txt
INFO:filelock:Lock 140392820589456 acquired on cached_lm_BertTokenizerFast_4094_trainEusLong1.txt.lock
INFO:transformers.data.datasets.language_modeling:Loading features from cached file cached_lm_BertTokenizerFast_4094_trainEusLong1.txt [took 0.053 s]
INFO:filelock:Lock 140392820589456 released on cached_lm_BertTokenizerFast_4094_trainEusLong1.txt.lock
INFO:transformers.training_args:PyTorch: setting up devices
INFO:transformers.trainer:You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
INFO:transformers.trainer:***** Running Evaluation *****
INFO:transformers.trainer:  Num examples = 70
INFO:transformers.trainer:  Batch size = 1
Evaluation:   0%|                                                                                                                                                 | 0/70 [00:00<?, ?it/s]/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Evaluation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 70/70 [00:21<00:00,  3.22it/s]
INFO:transformers.trainer:{'eval_loss': 12.326190962110246, 'step': 0}
INFO:__main__:Initial eval bpc: 17.782934574086813
INFO:transformers.trainer:***** Running training *****
INFO:transformers.trainer:  Num examples = 388
INFO:transformers.trainer:  Num Epochs = 501
INFO:transformers.trainer:  Instantaneous batch size per device = 1
INFO:transformers.trainer:  Total train batch size (w. parallel, distributed & accumulation) = 64
INFO:transformers.trainer:  Gradient Accumulation steps = 64
INFO:transformers.trainer:  Total optimization steps = 3000
INFO:transformers.trainer:  Starting fine-tuning.
Epoch:   0%|                                                                                                                                                     | 0/501 [00:00<?, ?it/sINFO:transformers.trainer:{'loss': 12.102866038680077, 'learning_rate': 6.000000000000001e-08, 'epoch': 0.16494845360824742, 'step': 1}                  | 63/388 [01:18<06:51,  1.27s/it]
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-1
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-1/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-1/pytorch_model.bin
INFO:transformers.trainer:{'loss': 12.099215269088745, 'learning_rate': 1.2000000000000002e-07, 'epoch': 0.32989690721649484, 'step': 2}                                 | 127/388 [02:50<05:35,  1.29s/it]
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-2
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-2/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-2/pytorch_model.bin
INFO:transformers.trainer:{'loss': 12.078452616930008, 'learning_rate': 1.8e-07, 'epoch': 0.4948453608247423, 'step': 3}                                                 | 191/388 [04:24<04:14,  1.29s/it]
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-3
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-3/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-3/pytorch_model.bin
INFO:transformers.trainer:{'loss': 12.023080185055733, 'learning_rate': 2.4000000000000003e-07, 'epoch': 0.6597938144329897, 'step': 4}                                  | 255/388 [05:56<02:50,  1.28s/it]
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-4
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-4/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-4/pytorch_model.bin
INFO:transformers.trainer:{'loss': 12.003526121377945, 'learning_rate': 3.0000000000000004e-07, 'epoch': 0.8247422680412371, 'step': 5}█████████▉                        | 319/388 [07:29<01:28,  1.29s/it]INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-5
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-5/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-5/pytorch_model.bin
INFO:transformers.trainer:{'loss': 11.993770495057106, 'learning_rate': 3.6e-07, 'epoch': 0.9896907216494846, 'step': 6}███████████████████████████████████████████████▎ | 383/388 [09:01<00:06,  1.29s/it]
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-6
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-6/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-6/pytorch_model.bin
Iteration: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:18<00:00,  1.44s/it]
Epoch:   0%|▎                                                                                                                                        | 1/501 [09:18<77:36:08, 558.74s/it]                 INFO:transformers.trainer:{'loss': 12.672470852732658, 'learning_rate': 4.2e-07, 'epoch': 1.1649484536082475, 'step': 7}                                                   | 63/388 [01:20<06:58,  1.29s/it]
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-7
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-7/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-7/pytorch_model.bin
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-8
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-8/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-8/pytorch_model.bin

Iteration:  36%|███████████████████████████████████████████████████████▏                                                                                                 | 140/388 [03:21<05:27,  1.32s/iItINFO:transformers.trainer:{'loss': 11.813278079032898, 'learning_rate': 5.4e-07, 'epoch': 1.4948453608247423, 'step': 9}                                                  | 191/388 [04:27<04:15,  1.30s/it]
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-9
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-9/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-9/pytorch_model.bin
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-10
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-10/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-10/pytorch_model.bin
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-11
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-11/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-11/pytorch_model.bin
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-12
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-12/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-12/pytorch_model.bin
Iteration: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:24<00:00,  1.45s/it]
Epoch:   0%|▌                                                                                                                                        | 2/501 [18:43<77:40:49, 560.42s/it]<00:00,  2.07s/it]INFO:transformers.trainer:{'loss': 12.117324143648148, 'learning_rate': 7.799999999999999e-07, 'epoch': 2.1649484536082473, 'step': 13}                                     | 63/388 [01:20<06:59,  1.29s/it]
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-13
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-13/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-13/pytorch_model.bin
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-14
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-14/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-14/pytorch_model.bin
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-15
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-15/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-15/pytorch_model.bin
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-16
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-16/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-16/pytorch_model.bin
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-17
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-17/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-17/pytorch_model.bin
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-18
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-18/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-18/pytorch_model.bin
Iteration: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:24<00:00,  1.45s/it]
Epoch:   1%|▊                                                                                                                                        | 3/501 [28:07<77:40:37, 561.52s/it]4<00:00,  2.07s/itINFO:transformers.trainer:{'loss': 11.206573352217674, 'learning_rate': 1.14e-06, 'epoch': 3.1649484536082473, 'step': 19}                                                  | 63/388 [01:20<06:58,  1.29s/it]
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-19
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-19/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-19/pytorch_model.bin
INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-20
INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-20/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-20/pytorch_model.bin

/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Iteration:  39%|████████████████████████████████████████████████████████████▋                                                                                             | 153/388 [03:38<05:35,  1.43s/it]
Epoch:   1%|▊                                                                                                                                        | 3/501 [31:45<87:51:44, 635.15s/it]
Traceback (most recent call last):
  File "BERTeus2LongB.py", line 305, in <module>
    pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
  File "BERTeus2LongB.py", line 183, in pretrain_and_evaluate
    trainer.train(model_path=model_path)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
    outputs = model(**inputs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 1083, in forward
    output_hidden_states=output_hidden_states,
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 753, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 182, in forward
    embeddings = inputs_embeds + position_embeddings + token_type_embeddings
RuntimeError: CUDA error: device-side assert triggered

the same run with

###########################################

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

###########################################

...
Epoch:   1%|▉                                                                                                                                                          | 3/501 [30:52<85:25:53, 617.58s/it]
Traceback (most recent call last):
  File "BERTeus2LongB.py", line 305, in <module>
    pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
  File "BERTeus2LongB.py", line 183, in pretrain_and_evaluate
    trainer.train(model_path=model_path)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
    outputs = model(**inputs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 1083, in forward
    output_hidden_states=output_hidden_states,
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 762, in forward
    output_hidden_states=output_hidden_states,
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 430, in forward
    encoder_attention_mask,
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 155, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 74, in forward
    outputs = run_function(*args)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 420, in custom_forward
    return module(*inputs, output_attentions)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 371, in forward
    hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 315, in forward
    hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 243, in forward
    attention_scores = attention_scores + attention_mask
RuntimeError: CUDA error: device-side assert triggered
(transformers) gurbizu@azken:/mnt/datuak/gorka-tmp$ python BERTeus2LongB.py

Any hint what causes this error?

By the way, I also got sometimes this error, which I am not able to reproduce right now:

 File "BERTeus2LongB.py", line 305, in <module>
    pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
  ...
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/functional.py", line 1372, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Regards, Gorka

opened by GorkaUrbizu 7

Number of tokens per batch mismatch - longformer vs roberta

I see in your conversion notebook that you suggest that the number of tokens per batch should be the same as roberta: 2^18 = 260k

When I look at the roberta paper, it says it uses a sequence length of 512 and a batch size of 8k. This means that each batch has 512*8k = 4M tokens

Am I missing something?

opened by nbroad1881 1
Answering performance of Longformer-base on the HotpotQA dev set

Hi,

I only found Longformer-base's joint F1 on the HopotQA dev set from the paper, and I would like to know if my reproduction results (Ans EM = 61.38, Ans F1 = 75.18) are expected. Could you provide some more specific metrics?

Thank you!

opened by zycdev 0
CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have began a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15 year old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsantized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

If you have further questions you may contact us through this projects lead researcher Kasimir Schulz.

opened by TrellixVulnTeam 0
Updated BART to Longformer-encoder-decoder (LED) converter
Hi @ibeltagy et al., I'm pre-training BART to Portuguese and converting the pre-trained model to LED following the instructions you gave in the paper and the code at https://github.com/allenai/longformer/blob/caefee668e39cacdece7dd603a0bebf24df6d8ca/scripts/convert_bart_to_longformerencoderdecoder.py.

The huggingface library is evolving fast; unfortunately, the code you provided is outdated and I had to implement a new version based on yours.

I have 2 questions:

Could you tell me if everything is ok or if I missed something? https://gist.github.com/erichans/af745a381b28b1c019f96997ddac4cd7

Is the LEDForConditionalGeneration model uploaded to huggingface just a BART model converted to LED or is there something else?

Thanks in advance!
opened by erichans 0
Why the TVM impelmentation is memroy efficient

Thanks for your excellent work!

Just want to discuss the memory reduction problem. It seems that the TVM implementation does not store fewer matrices (like Queries, Keys, and Values matrix). The num of Q-K pairs is less than the full attention so that we can get a faster calculation speed, but why the memory reduction has a similar trend with the time reduction? Seems the TVM kernel does not use any technique to save the memory, and the padding 0 values are also int32, but the fact is that TVM implementation is memory efficient...

Looking forward to your reply.

opened by jlidw 0
Pretraining longformer for NER on big pdf text

Hi, I'm trying to extract entities from documents containing 50-60 pages per document. can anybody suggest a better approach for it, please? I couldn't find any NER implementation of longformers.

opened by ajaysurya1221 0