Longformer: The Long-Document Transformer

Overview

Longformer

Longformer and LongformerEncoderDecoder (LED) are pretrained transformer models for long documents.

***** New December 1st, 2020: LongformerEncoderDecoder *****

A LongformerEncoderDecoder (LED) model is now available. It supports seq2seq tasks with long input. With gradient checkpointing, fp16, and a 48GB GPU, the input length can be up to 16K tokens. Check the updated paper for the model details and evaluation.

  • Pretrained models: 1) led-base-16384, 2) led-large-16384

  • Requirements: Make sure to use the huggingface/transformers fork specified in requirements.txt. It adds support for gradient checkpointing and allows different maximum sequence lengths for the input and output. You can also run pip install git+https://github.com/allenai/longformer.git

  • Check the script scripts/summarization.py for an example of how to use the model.
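
For orientation, here is a minimal, hedged sketch of long-document summarization with LED. It does not use this repo's fork; it assumes the LED port that later landed in upstream huggingface/transformers (v4.2+), where the checkpoints are available on the model hub as allenai/led-base-16384 and allenai/led-large-16384, so treat the exact class names and generation settings as assumptions. The repo's own training example remains scripts/summarization.py.

import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

# Assumes the upstream transformers LED port (v4.2+) and the hub checkpoint name below.
tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

document = " ".join(["Some very long input text."] * 1000)
inputs = tokenizer(document, max_length=16384, truncation=True, return_tensors="pt")

# Local attention everywhere, global attention on the first token (<s>)
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_length=256,
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))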

***** New July 23rd, 2020: Speed degradation *****

A significant speed degradation in huggingface/transformers was recently discovered and fixed (check this PR for details). To avoid this problem, either use the older release v2.11.0 (which doesn't support gradient checkpointing) or use the master branch. This problem should be fixed with the next huggingface/transformers release.

***** New June 29th, 2020: Easier to use Gradient checkpointing *****

Gradient checkpointing has been released with huggingface/transformers release v3.0.0. Gradient checkpointing reduces memory by 5x, which makes it possible to process longer sequences on smaller GPUs. To use it, try something like the following:

from transformers import LongformerModel
model = LongformerModel.from_pretrained('allenai/longformer-base-4096', gradient_checkpointing=True)

***** New June 2nd, 2020: Integrating with Huggingface + Train your own long model + Gradient checkpointing *****

  1. Longformer is now integrated in the huggingface/transformers release v2.11.0. Now you can do
from transformers import LongformerModel
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

Task-specific classes such as LongformerForQuestionAnswering, which set global attention automatically, are also available (see the sketch after this list).

  2. We added a notebook to show how to convert an existing pretrained model into its "long" version.

  3. Gradient checkpointing has been merged into huggingface/transformers master (check the PR). Gradient checkpointing can reduce memory usage significantly (5x for longformer-base-4096), allowing longer sequences on smaller GPUs.
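
To illustrate the integration in item 1 above, here is a minimal, hedged sketch of one of the task-specific classes. It assumes a recent huggingface/transformers version (so outputs expose start_logits/end_logits) and the TriviaQA-finetuned checkpoint on the model hub; LongformerForQuestionAnswering places global attention on the question tokens automatically, so no manual attention mask is needed. The question and context are toy data.

import torch
from transformers import LongformerTokenizer, LongformerForQuestionAnswering

# Assumes transformers >= 4.x so that outputs expose .start_logits / .end_logits
ckpt = "allenai/longformer-large-4096-finetuned-triviaqa"
tokenizer = LongformerTokenizer.from_pretrained(ckpt)
model = LongformerForQuestionAnswering.from_pretrained(ckpt)

question = "Who wrote the novel?"                               # toy example
context = "The novel was written by Jane Doe in 1999. " * 200   # a long document

inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)  # global attention on the question tokens is set automatically

start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0, start:end + 1]))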

***** New April 27th, 2020: A PyTorch implementation of the sliding window attention *****

We added a PyTorch implementation of the sliding window attention that doesn't require the custom CUDA kernel. It is limited in functionality but more convenient to use for finetuning on downstream tasks.

Advantage: supports CPU, TPU and fp16, which aren't supported by the custom CUDA kernel

Limitations: uses 2x more memory (but fp16 offsets that), and doesn’t support dilation and autoregressive attention (not needed for finetuning)

Therefore, it is suitable for finetuning on downstream tasks but not a good choice for language modeling. The code snippet below and the TriviaQA scripts were updated to use this new implementation.

***** End new information *****

How to use

  1. Download the pretrained model
  2. Install the environment and code

    conda create --name longformer python=3.7
    conda activate longformer
    conda install cudatoolkit=10.0
    pip install git+https://github.com/allenai/longformer.git
  3. Run the model

    import torch
    from longformer.longformer import Longformer, LongformerConfig
    from longformer.sliding_chunks import pad_to_window_size
    from transformers import RobertaTokenizer
    
    config = LongformerConfig.from_pretrained('longformer-base-4096/') 
    # choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
    # 'n2': for regular n^2 attention
    # 'tvm': a custom CUDA kernel implementation of our sliding window attention
    # 'sliding_chunks': a PyTorch implementation of our sliding window attention
    config.attention_mode = 'sliding_chunks'
    
    model = Longformer.from_pretrained('longformer-base-4096/', config=config)
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    tokenizer.model_max_length = model.config.max_position_embeddings
    
    SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document
    
    input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
    
    # TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
    # model = model.cuda(); input_ids = input_ids.cuda()
    
    # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
    attention_mask[:, [1, 4, 21,]] =  2  # Set global attention based on the task. For example,
                                         # classification: the <s> token
                                         # QA: question tokens
    
    # padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
    input_ids, attention_mask = pad_to_window_size(
            input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)
    
    output = model(input_ids, attention_mask=attention_mask)[0]

Model pretraining

This notebook demonstrates our procedure for training Longformer starting from the RoBERTa checkpoint. The same procedure can be followed to get a long version of other existing pretrained models.
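
The core step in that notebook is extending RoBERTa's 512 learned position embeddings to 4096 positions by copying them repeatedly; the full procedure also replaces self-attention with Longformer's sliding window attention and continues MLM pretraining. Below is a minimal sketch of just the position-embedding step, assuming a recent transformers API; see the notebook for the complete, tested procedure.

import torch
from transformers import RobertaModel, RobertaTokenizer

model = RobertaModel.from_pretrained("roberta-base")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base", model_max_length=4096)

max_pos = 4096 + 2  # RoBERTa reserves the first 2 positions
old_pos = model.embeddings.position_embeddings.weight        # shape (514, 768)
new_pos = old_pos.new_empty(max_pos, old_pos.size(1))
new_pos[:2] = old_pos[:2]

# Copy the 512 trained position embeddings over and over until all 4096 positions are covered
k, step = 2, old_pos.size(0) - 2
while k < max_pos:
    n = min(step, max_pos - k)
    new_pos[k:k + n] = old_pos[2:2 + n]
    k += n

model.embeddings.position_embeddings.weight.data = new_pos
model.config.max_position_embeddings = max_pos
# Note: newer transformers versions may also require extending the embeddings.position_ids buffer.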

TriviaQA

  • Training scripts: scripts/triviaqa.py
  • Pretrained large model: here (replicates leaderboard results)
  • Instructions: scripts/cheatsheet.txt

CUDA kernel

Our custom CUDA kernel is implemented in TVM. For now, the kernel only works on GPUs and Linux. We tested it on Ubuntu with Python 3.7, CUDA 10, and PyTorch >= 1.2.0. If it doesn't work for your environment, please create a new issue.

Compiling the kernel: We already include the compiled binaries of the CUDA kernel, so most users won't need to compile it, but if you are interested, check scripts/cheatsheet.txt for instructions.

Known issues

Please check the repo issues for a list of known issues that we are planning to address soon. If your issue is not discussed, please create a new one.

Citing

If you use Longformer in your research, please cite Longformer: The Long-Document Transformer.

@article{Beltagy2020Longformer,
  title={Longformer: The Long-Document Transformer},
  author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
  journal={arXiv:2004.05150},
  year={2020},
}

Longformer is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

Comments
  • ImportError: cannot import name 'nvcc'

    from tvm.contrib import nvcc ImportError: cannot import name 'nvcc'

    I get this when trying to compile the kernel from scratch. Did I miss something in the cmake config? I can import a lot of TVM modules but not nvcc.

    My cuda version is: Cuda compilation tools, release 10.0, V10.0.130

    opened by safooray 33
  • Text Classifier using longformer

    Can we request to add a short example of longformer for long text/review classification? Current triviaQA is good but more examples will encourage further use of longformer.

    Thanks. Patrick

    opened by pchankh 14
  • RuntimeError: CUDA error: device-side assert triggered - is_global_attn = is_index_global_attn.flatten().any().item()

    I'm trying to train a new model from scratch whose length is 1024 (using the huggingface implementation of Longformer), but I get the following exception at a recently added line:

    --> 150         is_global_attn = is_index_global_attn.flatten().any().item()
        151 
        152         hidden_states = hidden_states.transpose(0, 1)
    
    RuntimeError: CUDA error: device-side assert triggered
    

    I tried Reformer and it worked as expected. The Longformer config is as follows:

    LongformerConfig {
      "attention_probs_dropout_prob": 0.1,
      "attention_window": 64,
      "bos_token_id": 0,
      "eos_token_id": 2,
      "gradient_checkpointing": false,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 1026,
      "model_type": "longformer",
      "num_attention_heads": 12,
      "num_hidden_layers": 6,
      "pad_token_id": 257,
      "sep_token_id": 258,
      "type_vocab_size": 2,
      "vocab_size": 261
    }
    

    Any idea what the issue is?

    opened by zarandioon 13
  • segmentation fault illegal instruction

    setup

    Ubuntu 16.04, TVM 0.7.dev1, PyTorch 1.4.0, transformers 2.11.0; everything else is the same as requirements.txt.

    issue

    I uncommented the line DiagonaledMM._get_function('float32', 'cuda') in diagonaled_mm_tvm.py.

    After that, when I run the code, it shows Loading tvm binary from: ./longformer/lib/lib_diagonaled_mm_float32_cuda.so ... followed by either segmentation fault (core dump) or illegal instruction (core dump).

    other

    I tested TVM, TensorFlow, and PyTorch, and they are all fine. I also followed scripts/cheatsheet.txt to regenerate lib_diagonaled_mm_float32_cuda.so, and it generates successfully.

    Any idea or suggestion?

    the code is below

    import torch
    from longformer.longformer import Longformer, LongformerConfig
    from longformer.sliding_chunks import pad_to_window_size
    from transformers import RobertaTokenizer
    
    config = LongformerConfig.from_pretrained('longformer-base-4096/') 
    # choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
    # 'n2': for regular n^2 attention
    # 'tvm': a custom CUDA kernel implementation of our sliding window attention
    # 'sliding_chunks': a PyTorch implementation of our sliding window attention
    config.attention_mode = 'tvm'
    
    model = Longformer.from_pretrained('longformer-base-4096/', config=config)
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    tokenizer.model_max_length = model.config.max_position_embeddings
    
    SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document
    
    input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
    
    # TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
    model = model.cuda(); input_ids = input_ids.cuda()
    
    # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
    attention_mask[:, [1, 4, 21,]] =  2  # Set global attention based on the task. For example,
                                         # classification: the <s> token
                                         # QA: question tokens
    
    # padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
    input_ids, attention_mask = pad_to_window_size(
            input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)
    
    output = model(input_ids, attention_mask=attention_mask)[0]
    
    opened by ProfXGiter 13
  • Using RoBERTa or LongFormer for texts with 16K tokens

    LongFormer does it by pooling all the local attentions (512) together in global attention (512 x 8 = 4096).

    This is not entirely true. There's no "pooling" of the 4096 tokens into 512. We keep all 4096 tokens. The only change is how attention is computed; instead of every token attending to every other token, we change it such that every token attends to a smaller number of surrounding tokens. This speeds up the self-attention computation (which is the bottleneck) by assuming that the attention score between certain pairs of words is zero. This doesn't change the architecture or introduce any pooling.

    We are working on some code that will make it easy to train your own long model, so you can try longer sequences. We know it is easy to get to 16K or even 32K with the RoBERTa-base architecture (you need the base model, fp16, and gradient checkpointing). For sequences longer than that, you will need to find ways to save memory depending on your application: for example, reducing the window size, reducing the size of the feed-forward layers, implementing reversible transformers, or using sinusoidal position embeddings instead of learned position embeddings.

    Originally posted by @ibeltagy in https://github.com/allenai/longformer/issues/48#issuecomment-634270401
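
    A toy sketch of the point above (not the repo's kernel): all n tokens are kept, and attention is merely restricted to a window of w neighbors on each side, so the score matrix is banded, costing O(n*w) instead of O(n^2). The dense matrix below is built only to make the masking visible.

    import torch

    n, d, w = 16, 8, 2                        # sequence length, head dim, one-sided window
    q, k = torch.randn(n, d), torch.randn(n, d)

    scores = q @ k.t() / d ** 0.5             # dense n x n scores, for illustration only
    idx = torch.arange(n)
    band = (idx[:, None] - idx[None, :]).abs() <= w
    scores = scores.masked_fill(~band, float("-inf"))  # no attention outside the window
    probs = scores.softmax(dim=-1)            # each row attends to at most 2*w + 1 tokens
    print(probs.shape)                        # torch.Size([16, 16]); no pooling, all tokens kept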

    opened by vr25 10
  • Not able to use the embedding for calculating similarity.

    First of all, let me thank you for contributing this knowledge to us. It makes a lot of difference for beginners like me. :) Now the issue: I was trying to use Longformer for calculating the similarity between a query and a list of paragraphs retrieved from my index search. The idea is to re-rank these paragraphs based on the cosine similarity between the embedding of the question and that of each individual paragraph.

    However, once I have calculated the embeddings of both query and paragraph using this code: SAMPLE_TEXT = f'{tokenizer.cls_token}{SAMPLE_TEXT}{tokenizer.eos_token}' ... output = model(input_ids, attention_mask=attention_mask)[0]

    I get an embedding of dimension torch.Size([1, 512, 768]), and when I try to calculate the cosine similarity on these embeddings I get this error: RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

    I do see that the error recommends using var.detach().numpy() instead of numpy(). https://stackoverflow.com/questions/55466298/pytorch-cant-call-numpy-on-variable-that-requires-grad-use-var-detach-num

    However, I am unsure where I should add this line of code. I am a beginner, so please pardon me if I have raised an issue unrelated to Longformer.
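
    One hedged way to do this (a sketch, not an official recipe): run the forward passes under torch.no_grad() so no gradients are tracked, mean-pool the token embeddings, and compute the cosine similarity in torch. The query_ids/query_mask and para_ids/para_mask tensors below stand in for inputs built as in the snippet above.

    import torch
    import torch.nn.functional as F

    with torch.no_grad():  # no graph is built, so .detach() is not needed
        query_emb = model(query_ids, attention_mask=query_mask)[0].mean(dim=1)  # (1, 768)
        para_emb = model(para_ids, attention_mask=para_mask)[0].mean(dim=1)     # (1, 768)

    score = F.cosine_similarity(query_emb, para_emb).item()  # re-rank paragraphs by this score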

    Thanks for help :)

    opened by titu1992 10
  • help in understanding task global attention

    Hi,

    I need help in understanding the concept below.

    [embedded image]

    So does this mean that the complexity is quadratic (if all tokens attend to all other tokens) for task tuning but linear otherwise?

    Thanks!

    opened by vr25 9
  • Has anyone reproduced TriviaQA result with pytorch-lightning checkpoint?

    Hi, I'm trying to reproduce the TriviaQA result following the instructions in the cheatsheet. I used the following instructions from cheatsheet.txt:

    // To run our pretrained TriviaQA large model (replicates the leaderboard results),
    // first download the pytorch-lightning checkpoint:
    // https://ai2-s2-research.s3-us-west-2.amazonaws.com/longformer/triviaqa-longformer-large.tar.gz
    // then run:
    python -m scripts.triviaqa  \
        --train_dataset squad-wikipedia-train-4096.json  \  # loaded but not used
        --dev_dataset squad-wikipedia-dev-4096.json  \
        --gpus 0  --num_workers 4  \
        --max_seq_len 4096  --doc_stride -1  \
        --save_prefix triviaqa-longformer-large  \  # pretrained pytorch-lightning checkpoint
        --model_path path/to/pretrained/longformer-large-4096  \  # loaded but not used
        --test  # predictions will be saved into predictions.json

    // then run the official evaluation scripts
    python -m scripts.triviaqa_utils.evaluation_utils  \
        --dataset_file path/to/qa/wikipedia-dev.json  \
        --prediction_file predictions.json

    // Output should be:
    // {'exact_match': 73.07644188665083, 'f1': 77.78523804802242, 'common': 7993, 'denominator': 7993, 'pred_len': 7993, 'gold_len': 7993}

    But I keep getting result {'exact_match': 0.025021894157387713, 'f1': 4.579085300341775, 'common': 7993, 'denominator': 7993, 'pred_len': 7993, 'gold_len': 7993}, which is very weird..

    I downloaded the dataset and converted both the train and dev sets into SQuAD format with the provided script, and I just changed the data and model paths to match my server's setup.

    Has anyone reproduced the result f1:77.78 with given pytorch-lightning checkpoint?

    opened by YJYJLee 9
  • How can I train the pre-train model on chinese corpus?

    Now I want to train a pretrained model on a Chinese corpus, but the details are not clear: for example, how to make the minimal changes necessary to support Longformer's attention mechanism, and how to plug the attention pattern into a pretrained transformer model.

    opened by liangxg787 9
  • Fine-tuning Longformer for squad (out of memory)

    I have pretrained an MLM Longformer using roberta-base based on this recipe.

    Then I tried to fine-tune it for SQuAD question answering. Here is the trainer, and the following is the run-time setting (based on here):

    python run_squad.py  \
        --model_type roberta  \
        --model_name_or_path pathe_to_roberta_base_mlm_trained_4096  \
        --do_train  \
        --do_eval  \
        --do_lower_case  \
        --train_file $SQUAD_DIR/train-v1.1.json  \
        --predict_file $SQUAD_DIR/dev-v1.1.json  \
        --per_gpu_train_batch_size 1  \
        --learning_rate 3e-5  \
        --num_train_epochs 2.0  \
        --max_seq_length 4096  \
        --doc_stride 128  \
        --output_dir /tmp/debug_squad/

    While I am using a V100 node (16 GPUs, 32 GB each), it always hits the GPU memory limit, as follows:

    File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
      output = module(*input, **kwargs)
    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 642, in forward
      output_hidden_states=output_hidden_states,
    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 762, in forward
      output_hidden_states=output_hidden_states,
    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 439, in forward
      output_attentions,
    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 371, in forward
      hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 315, in forward
      hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 240, in forward
      attention_scores = attention_scores / math.sqrt(self.attention_head_size)
    RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 31.72 GiB total capacity; 30.25 GiB already allocated; 300.38 MiB free; 30.29 GiB reserved in total by PyTorch)

    However, using allenai/longformer-base-4096, it works. Could you please comment on what I may be missing in the above steps?

    opened by arashashari 8
  • CUDA error: device-side assert triggered, while converting BERT to Long

    Hi!

    I have apparently working code for converting a BERT model into a Longformer, but now I am trying to convert BERTeus to Longformer, which I expected to work the same way (just changing the dataset + model name/path).

    With a small training corpus (50K lines; the same issue occurs with a big one), the training starts well, but it breaks around step 20, after 3-4 epochs.

    
    2020-09-22 15:01:55.336576: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
    2020-09-22 15:01:55.338202: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
    INFO:__main__:Loading the model from tmp/bert-base-4096
    INFO:transformers.configuration_utils:loading configuration file tmp/bert-base-4096/config.json
    INFO:transformers.configuration_utils:Model config BertConfig {
      "architectures": [
        "BertForMaskedLM"
      ],
      "attention_probs_dropout_prob": 0.1,
      "attention_window": [
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512
      ],
      "gradient_checkpointing": true,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 4096,
      "model_type": "bert",
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "output_past": true,
      "pad_token_id": 3,
      "type_vocab_size": 2,
      "vocab_size": 50099
    }
    
    INFO:transformers.tokenization_utils_base:Model name 'tmp/bert-base-4096' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'tmp/bert-base-4096' is a path, a model identifier, or url to a directory containing tokenizer files.
    INFO:transformers.tokenization_utils_base:Didn't find file tmp/bert-base-4096/added_tokens.json. We won't load it.
    INFO:transformers.tokenization_utils_base:Didn't find file tmp/bert-base-4096/tokenizer.json. We won't load it.
    INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/vocab.txt
    INFO:transformers.tokenization_utils_base:loading file None
    INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/special_tokens_map.json
    INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/tokenizer_config.json
    INFO:transformers.tokenization_utils_base:loading file None
    /mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_auto.py:798: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
      FutureWarning,
    INFO:transformers.configuration_utils:loading configuration file tmp/bert-base-4096/config.json
    INFO:transformers.configuration_utils:Model config BertConfig {
      "architectures": [
        "BertForMaskedLM"
      ],
      "attention_probs_dropout_prob": 0.1,
      "attention_window": [
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512
      ],
      "gradient_checkpointing": true,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 4096,
      "model_type": "bert",
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "output_past": true,
      "pad_token_id": 3,
      "type_vocab_size": 2,
      "vocab_size": 50099
    }
    
    INFO:transformers.modeling_utils:loading weights file tmp/bert-base-4096/pytorch_model.bin
    WARNING:transformers.modeling_utils:Some weights of the model checkpoint at tmp/bert-base-4096 were not used when initializing BertForMaskedLM: ['bert.encoder.layer.0.attention.self.query_global.weight', 'bert.encoder.layer.0.attention.self.query_global.bias', 'bert.encoder.layer.0.attention.self.key_global.weight', 'bert.encoder.layer.0.attention.self.key_global.bias', 'bert.encoder.layer.0.attention.self.value_global.weight', 'bert.encoder.layer.0.attention.self.value_global.bias', 'bert.encoder.layer.1.attention.self.query_global.weight', 'bert.encoder.layer.1.attention.self.query_global.bias', 'bert.encoder.layer.1.attention.self.key_global.weight', 'bert.encoder.layer.1.attention.self.key_global.bias', 'bert.encoder.layer.1.attention.self.value_global.weight', 'bert.encoder.layer.1.attention.self.value_global.bias', 'bert.encoder.layer.2.attention.self.query_global.weight', 'bert.encoder.layer.2.attention.self.query_global.bias', 'bert.encoder.layer.2.attention.self.key_global.weight', 'bert.encoder.layer.2.attention.self.key_global.bias', 'bert.encoder.layer.2.attention.self.value_global.weight', 'bert.encoder.layer.2.attention.self.value_global.bias', 'bert.encoder.layer.3.attention.self.query_global.weight', 'bert.encoder.layer.3.attention.self.query_global.bias', 'bert.encoder.layer.3.attention.self.key_global.weight', 'bert.encoder.layer.3.attention.self.key_global.bias', 'bert.encoder.layer.3.attention.self.value_global.weight', 'bert.encoder.layer.3.attention.self.value_global.bias', 'bert.encoder.layer.4.attention.self.query_global.weight', 'bert.encoder.layer.4.attention.self.query_global.bias', 'bert.encoder.layer.4.attention.self.key_global.weight', 'bert.encoder.layer.4.attention.self.key_global.bias', 'bert.encoder.layer.4.attention.self.value_global.weight', 'bert.encoder.layer.4.attention.self.value_global.bias', 'bert.encoder.layer.5.attention.self.query_global.weight', 'bert.encoder.layer.5.attention.self.query_global.bias', 'bert.encoder.layer.5.attention.self.key_global.weight', 'bert.encoder.layer.5.attention.self.key_global.bias', 'bert.encoder.layer.5.attention.self.value_global.weight', 'bert.encoder.layer.5.attention.self.value_global.bias', 'bert.encoder.layer.6.attention.self.query_global.weight', 'bert.encoder.layer.6.attention.self.query_global.bias', 'bert.encoder.layer.6.attention.self.key_global.weight', 'bert.encoder.layer.6.attention.self.key_global.bias', 'bert.encoder.layer.6.attention.self.value_global.weight', 'bert.encoder.layer.6.attention.self.value_global.bias', 'bert.encoder.layer.7.attention.self.query_global.weight', 'bert.encoder.layer.7.attention.self.query_global.bias', 'bert.encoder.layer.7.attention.self.key_global.weight', 'bert.encoder.layer.7.attention.self.key_global.bias', 'bert.encoder.layer.7.attention.self.value_global.weight', 'bert.encoder.layer.7.attention.self.value_global.bias', 'bert.encoder.layer.8.attention.self.query_global.weight', 'bert.encoder.layer.8.attention.self.query_global.bias', 'bert.encoder.layer.8.attention.self.key_global.weight', 'bert.encoder.layer.8.attention.self.key_global.bias', 'bert.encoder.layer.8.attention.self.value_global.weight', 'bert.encoder.layer.8.attention.self.value_global.bias', 'bert.encoder.layer.9.attention.self.query_global.weight', 'bert.encoder.layer.9.attention.self.query_global.bias', 'bert.encoder.layer.9.attention.self.key_global.weight', 'bert.encoder.layer.9.attention.self.key_global.bias', 'bert.encoder.layer.9.attention.self.value_global.weight', 
'bert.encoder.layer.9.attention.self.value_global.bias', 'bert.encoder.layer.10.attention.self.query_global.weight', 'bert.encoder.layer.10.attention.self.query_global.bias', 'bert.encoder.layer.10.attention.self.key_global.weight', 'bert.encoder.layer.10.attention.self.key_global.bias', 'bert.encoder.layer.10.attention.self.value_global.weight', 'bert.encoder.layer.10.attention.self.value_global.bias', 'bert.encoder.layer.11.attention.self.query_global.weight', 'bert.encoder.layer.11.attention.self.query_global.bias', 'bert.encoder.layer.11.attention.self.key_global.weight', 'bert.encoder.layer.11.attention.self.key_global.bias', 'bert.encoder.layer.11.attention.self.value_global.weight', 'bert.encoder.layer.11.attention.self.value_global.bias']
    - This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
    - This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    INFO:transformers.modeling_utils:All the weights of BertForMaskedLM were initialized from the model checkpoint at tmp/bert-base-4096.
    If your task is similar to the task the model of the ckeckpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
    INFO:__main__:Pretraining bert-base-4096 ... 
    INFO:filelock:Lock 140392820589624 acquired on cached_lm_BertTokenizerFast_4094_valEusLong.txt.lock
    INFO:transformers.data.datasets.language_modeling:Loading features from cached file cached_lm_BertTokenizerFast_4094_valEusLong.txt [took 0.008 s]
    INFO:filelock:Lock 140392820589624 released on cached_lm_BertTokenizerFast_4094_valEusLong.txt.lock
    INFO:__main__:Loading and tokenizing training data is usually slow: trainEusLong1.txt
    INFO:filelock:Lock 140392820589456 acquired on cached_lm_BertTokenizerFast_4094_trainEusLong1.txt.lock
    INFO:transformers.data.datasets.language_modeling:Loading features from cached file cached_lm_BertTokenizerFast_4094_trainEusLong1.txt [took 0.053 s]
    INFO:filelock:Lock 140392820589456 released on cached_lm_BertTokenizerFast_4094_trainEusLong1.txt.lock
    INFO:transformers.training_args:PyTorch: setting up devices
    INFO:transformers.trainer:You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
    INFO:transformers.trainer:***** Running Evaluation *****
    INFO:transformers.trainer:  Num examples = 70
    INFO:transformers.trainer:  Batch size = 1
    Evaluation:   0%|                                                                                                                                                 | 0/70 [00:00<?, ?it/s]/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
      warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
    Evaluation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 70/70 [00:21<00:00,  3.22it/s]
    INFO:transformers.trainer:{'eval_loss': 12.326190962110246, 'step': 0}
    INFO:__main__:Initial eval bpc: 17.782934574086813
    INFO:transformers.trainer:***** Running training *****
    INFO:transformers.trainer:  Num examples = 388
    INFO:transformers.trainer:  Num Epochs = 501
    INFO:transformers.trainer:  Instantaneous batch size per device = 1
    INFO:transformers.trainer:  Total train batch size (w. parallel, distributed & accumulation) = 64
    INFO:transformers.trainer:  Gradient Accumulation steps = 64
    INFO:transformers.trainer:  Total optimization steps = 3000
    INFO:transformers.trainer:  Starting fine-tuning.
    Epoch:   0%|                                                                                                                                                     | 0/501 [00:00<?, ?it/sINFO:transformers.trainer:{'loss': 12.102866038680077, 'learning_rate': 6.000000000000001e-08, 'epoch': 0.16494845360824742, 'step': 1}                  | 63/388 [01:18<06:51,  1.27s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-1
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-1/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-1/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.099215269088745, 'learning_rate': 1.2000000000000002e-07, 'epoch': 0.32989690721649484, 'step': 2}                                 | 127/388 [02:50<05:35,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-2
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-2/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-2/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.078452616930008, 'learning_rate': 1.8e-07, 'epoch': 0.4948453608247423, 'step': 3}                                                 | 191/388 [04:24<04:14,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-3
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-3/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-3/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.023080185055733, 'learning_rate': 2.4000000000000003e-07, 'epoch': 0.6597938144329897, 'step': 4}                                  | 255/388 [05:56<02:50,  1.28s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-4
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-4/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-4/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.003526121377945, 'learning_rate': 3.0000000000000004e-07, 'epoch': 0.8247422680412371, 'step': 5}█████████▉                        | 319/388 [07:29<01:28,  1.29s/it]INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-5
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-5/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-5/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 11.993770495057106, 'learning_rate': 3.6e-07, 'epoch': 0.9896907216494846, 'step': 6}███████████████████████████████████████████████▎ | 383/388 [09:01<00:06,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-6
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-6/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-6/pytorch_model.bin
    Iteration: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:18<00:00,  1.44s/it]
    Epoch:   0%|▎                                                                                                                                        | 1/501 [09:18<77:36:08, 558.74s/it]                 INFO:transformers.trainer:{'loss': 12.672470852732658, 'learning_rate': 4.2e-07, 'epoch': 1.1649484536082475, 'step': 7}                                                   | 63/388 [01:20<06:58,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-7
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-7/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-7/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-8
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-8/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-8/pytorch_model.bin
    
    Iteration:  36%|███████████████████████████████████████████████████████▏                                                                                                 | 140/388 [03:21<05:27,  1.32s/iItINFO:transformers.trainer:{'loss': 11.813278079032898, 'learning_rate': 5.4e-07, 'epoch': 1.4948453608247423, 'step': 9}                                                  | 191/388 [04:27<04:15,  1.30s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-9
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-9/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-9/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-10
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-10/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-10/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-11
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-11/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-11/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-12
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-12/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-12/pytorch_model.bin
    Iteration: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:24<00:00,  1.45s/it]
    Epoch:   0%|▌                                                                                                                                        | 2/501 [18:43<77:40:49, 560.42s/it]<00:00,  2.07s/it]INFO:transformers.trainer:{'loss': 12.117324143648148, 'learning_rate': 7.799999999999999e-07, 'epoch': 2.1649484536082473, 'step': 13}                                     | 63/388 [01:20<06:59,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-13
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-13/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-13/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-14
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-14/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-14/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-15
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-15/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-15/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-16
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-16/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-16/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-17
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-17/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-17/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-18
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-18/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-18/pytorch_model.bin
    Iteration: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:24<00:00,  1.45s/it]
    Epoch:   1%|▊                                                                                                                                        | 3/501 [28:07<77:40:37, 561.52s/it]4<00:00,  2.07s/itINFO:transformers.trainer:{'loss': 11.206573352217674, 'learning_rate': 1.14e-06, 'epoch': 3.1649484536082473, 'step': 19}                                                  | 63/388 [01:20<06:58,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-19
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-19/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-19/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-20
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-20/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-20/pytorch_model.bin
    
    /pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    /pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    /pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    Iteration:  39%|████████████████████████████████████████████████████████████▋                                                                                             | 153/388 [03:38<05:35,  1.43s/it]
    Epoch:   1%|▊                                                                                                                                        | 3/501 [31:45<87:51:44, 635.15s/it]
    Traceback (most recent call last):
      File "BERTeus2LongB.py", line 305, in <module>
        pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
      File "BERTeus2LongB.py", line 183, in pretrain_and_evaluate
        trainer.train(model_path=model_path)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
        tr_loss += self._training_step(model, inputs, optimizer)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
        outputs = model(**inputs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 1083, in forward
        output_hidden_states=output_hidden_states,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 753, in forward
        input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 182, in forward
        embeddings = inputs_embeds + position_embeddings + token_type_embeddings
    RuntimeError: CUDA error: device-side assert triggered
    

    The same run with os.environ["CUDA_LAUNCH_BLOCKING"] = "1" set gives:

    ...
    Epoch:   1%|▉                                                                                                                                                          | 3/501 [30:52<85:25:53, 617.58s/it]
    Traceback (most recent call last):
      File "BERTeus2LongB.py", line 305, in <module>
        pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
      File "BERTeus2LongB.py", line 183, in pretrain_and_evaluate
        trainer.train(model_path=model_path)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
        tr_loss += self._training_step(model, inputs, optimizer)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
        outputs = model(**inputs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 1083, in forward
        output_hidden_states=output_hidden_states,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 762, in forward
        output_hidden_states=output_hidden_states,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 430, in forward
        encoder_attention_mask,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 155, in checkpoint
        return CheckpointFunction.apply(function, preserve, *args)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 74, in forward
        outputs = run_function(*args)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 420, in custom_forward
        return module(*inputs, output_attentions)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 371, in forward
        hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 315, in forward
        hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 243, in forward
        attention_scores = attention_scores + attention_mask
    RuntimeError: CUDA error: device-side assert triggered
    (transformers) gurbizu@azken:/mnt/datuak/gorka-tmp$ python BERTeus2LongB.py
    

    Any hint what causes this error?

    By the way, I also sometimes got this error, which I am not able to reproduce right now:

     File "BERTeus2LongB.py", line 305, in <module>
        pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
      ...
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/functional.py", line 1372, in linear
        output = input.matmul(weight.t())
    RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
    

    Regards, Gorka

    opened by GorkaUrbizu 7
  • Number of tokens per batch mismatch - longformer vs roberta

    I see in your conversion notebook that you suggest that the number of tokens per batch should be the same as RoBERTa: 2^18 ≈ 262K.

    When I look at the roberta paper, it says it uses a sequence length of 512 and a batch size of 8k. This means that each batch has 512*8k = 4M tokens

    Am I missing something?

    opened by nbroad1881 1
  • Answering performance of Longformer-base on the HotpotQA dev set

    Hi,

    I only found Longformer-base's joint F1 on the HotpotQA dev set in the paper, and I would like to know if my reproduction results (Ans EM = 61.38, Ans F1 = 75.18) are expected. Could you provide some more specific metrics?

    Thank you!

    opened by zycdev 0
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing the input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks that all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.
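
    For context, a minimal sketch of the kind of check described above (the actual fix is in the pull request): verify that every archive member resolves inside the destination directory before extracting.

    import os
    import tarfile

    def safe_extractall(tar_path: str, dest: str) -> None:
        dest = os.path.realpath(dest)
        with tarfile.open(tar_path) as tar:
            for member in tar.getmembers():
                target = os.path.realpath(os.path.join(dest, member.name))
                if not target.startswith(dest + os.sep):  # path traversal attempt
                    raise RuntimeError(f"Unsafe path in archive: {member.name}")
            tar.extractall(dest)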

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • Updated BART to Longformer-encoder-decoder (LED) converter

    Hi @ibeltagy et al., I'm pre-training BART to Portuguese and converting the pre-trained model to LED following the instructions you gave in the paper and the code at https://github.com/allenai/longformer/blob/caefee668e39cacdece7dd603a0bebf24df6d8ca/scripts/convert_bart_to_longformerencoderdecoder.py.

    The huggingface library is evolving fast; unfortunately, the code you provided is outdated and I had to implement a new version based on yours.

    I have 2 questions:

    1. Could you tell me if everything is ok or if I missed something? https://gist.github.com/erichans/af745a381b28b1c019f96997ddac4cd7
    2. Is the LEDForConditionalGeneration model uploaded to huggingface just a BART model converted to LED or is there something else?

    Thanks in advance!

    opened by erichans 0
  • Why the TVM implementation is memory efficient

    Thanks for your excellent work!

    I just want to discuss the memory reduction. It seems that the TVM implementation does not store fewer matrices (such as the query, key, and value matrices). The number of Q-K pairs is smaller than in full attention, so we get faster computation, but why does the memory reduction follow a trend similar to the time reduction? It seems the TVM kernel does not use any special technique to save memory, and the padded zero values are also int32, yet the TVM implementation is memory efficient...

    Looking forward to your reply.

    opened by jlidw 0
  • Pretraining longformer for NER on big pdf text

    Hi, I'm trying to extract entities from documents containing 50-60 pages each. Can anybody suggest a good approach for this, please? I couldn't find any NER implementation for Longformer.

    opened by ajaysurya1221 0
Releases: v0.2