Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Overview

Hiring

We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on NLP and large-scale pre-trained models, please send your resume to [email protected].

Pre-trained Models

Large-scale self-supervised pre-training across tasks (predictive and generative), languages (100+ languages), and modalities (language, image, audio, layout/format + language, vision + language, audio + language, etc.)

Language & Multilingual

UniLM: unified pre-training for language understanding and generation

InfoXLM/XLM-E: multilingual/cross-lingual pre-trained models for 100+ languages

DeltaLM/mT6: encoder-decoder pre-training for language generation and translation for 100+ languages

MiniLM: small and fast pre-trained models for language understanding and generation

AdaLM: domain, language, and task adaptation of pre-trained models

Vision

BEiT (NEW): generative self-supervised pre-training for images / BERT Pre-Training of Image Transformers

Speech

WavLM (NEW): speech pre-training for full stack tasks

Multimodal (X + Language)

LayoutLM: multimodal (text + layout/format + image) pre-training for Document AI (e.g. scanned documents, PDF, etc.)

LayoutXLM: multimodal (text + layout/format + image) pre-training for multilingual document understanding

MarkupLM (NEW): markup language model pre-training for visually-rich document understanding

UniSpeech: unified pre-training for self-supervised learning and supervised learning for ASR

UniSpeech-SAT: universal speech representation learning with speaker-aware pre-training

SpeechT5 (NEW): encoder-decoder pre-training for spoken language processing

VLMo (NEW): Unified vision-language pre-training - evolution of BEiT to multimodal

Toolkits

s2s-ft: sequence-to-sequence fine-tuning toolkit

Applications

TrOCR (NEW): transformer-based OCR w/ pre-trained models

LayoutReader: pre-training of text and layout for reading order detection

XLM-T: multilingual NMT w/ pretrained cross-lingual encoders

News

  • [Model Release] December 16th, 2021: TrOCR small models for handwritten and printed texts.
  • November 24th, 2021: VLMo as the new SOTA on the VQA Challenge
  • November, 2021: Multilingual translation at scale: 10000 language pairs and beyond
  • [Model Release] November, 2021: MarkupLM
  • [Model Release] November, 2021: VLMo - Unified vision-language pre-training w/ BEiT
  • [Model Release] October, 2021: WavLM - Large-scale self-supervised pre-trained models for speech.
  • [Model Release] October 2021: TrOCR is on HuggingFace
  • September 28th, 2021: T-ULRv5 (aka XLM-E/InfoXLM) as the SOTA on the XTREME leaderboard. // Blog
  • [Model Release] September, 2021: LayoutLM-cased are on HuggingFace
  • [Model Release] September, 2021: TrOCR - Transformer-based OCR w/ pre-trained BEiT and RoBERTa models.
  • August 2021: LayoutLMv2 and LayoutXLM are on HuggingFace
  • [Model Release] August, 2021: LayoutReader - Built with LayoutLM to improve general reading order detection.
  • [Model Release] August, 2021: DeltaLM - Encoder-decoder pre-training for language generation and translation.
  • August 2021: BEiT is on HuggingFace
  • [Model Release] July, 2021: BEiT - Towards BERT moment for CV
  • [Model Release] June, 2021: LayoutLMv2, LayoutXLM, MiniLMv2, and AdaLM.
  • May, 2021: LayoutLMv2, InfoXLMv2, MiniLMv2, UniLMv3, and AdaLM were accepted by ACL 2021.
  • April, 2021: LayoutXLM is coming, extending LayoutLM to multilingual support! A multilingual form understanding benchmark, XFUND, is also introduced, which includes forms with human-labeled key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).
  • March, 2021: InfoXLM was accepted by NAACL 2021.
  • December 29th, 2020: LayoutLMv2 is coming with new SOTA on a wide variety of document AI tasks, including the DocVQA and SROIE leaderboards.
  • October 8th, 2020: T-ULRv2 (aka InfoXLM) as the SOTA on the XTREME leaderboard. // Blog
  • September, 2020: MiniLM was accepted by NeurIPS 2020.
  • July 16, 2020: InfoXLM (Multilingual UniLM) arXiv
  • June, 2020: UniLMv2 was accepted by ICML 2020; LayoutLM was accepted by KDD 2020.
  • April 5, 2020: Multilingual MiniLM released!
  • September, 2019: UniLMv1 was accepted by NeurIPS 2019.

Release

***** New October, 2021: WavLM release *****

  • WavLM (October 27, 2021): WavLM is a new pre-trained speech model for solving full-stack downstream speech tasks. WavLM integrates the gated relative position embedding structure and the utterance mixing method to model both spoken content and speaker identity. WavLM is trained on 94k hours of public audio data, which is more than was used for other released checkpoints for English speech modeling. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark and brings significant improvements to various speech processing tasks on their representative benchmarks. "WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing". A usage sketch follows below.
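
    A minimal feature-extraction sketch, assuming the later Hugging Face transformers integration of WavLM (WavLMModel reusing the wav2vec 2.0 feature extractor) and the Hub checkpoint name microsoft/wavlm-base-plus; neither is announced in this release note, so treat both as assumptions and follow the official release instructions.

        import torch
        from transformers import Wav2Vec2FeatureExtractor, WavLMModel

        # Checkpoint name is an assumption; substitute the officially released one.
        name = "microsoft/wavlm-base-plus"
        feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
        model = WavLMModel.from_pretrained(name)

        # One second of dummy 16 kHz audio; replace with a real waveform (numpy array).
        waveform = torch.randn(16000).numpy()
        inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        print(outputs.last_hidden_state.shape)  # (batch, frames, hidden_size)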

***** New October, 2021: MarkupLM release *****

  • MarkupLM (October 19, 2021): MarkupLM is a simple yet effective pre-training approach for text and markup language. Built on the Transformer architecture, MarkupLM integrates different input embeddings, including text embeddings, position embeddings, and XPath embeddings. Furthermore, we also propose new pre-training objectives that are specially designed for understanding markup language. We evaluate the pre-trained MarkupLM model on the WebSRC and SWDE datasets. Experiments show that MarkupLM significantly outperforms several SOTA baselines on these tasks. "MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding". A usage sketch follows below.
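
    A minimal encoding sketch, assuming the later Hugging Face transformers integration of MarkupLM (MarkupLMProcessor, which additionally requires beautifulsoup4) and the Hub checkpoint name microsoft/markuplm-base; both are assumptions not stated in this release note.

        from transformers import MarkupLMProcessor, MarkupLMModel

        # Checkpoint name is an assumption; substitute the officially released one.
        processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
        model = MarkupLMModel.from_pretrained("microsoft/markuplm-base")

        html = "<html><body><h1>Invoice</h1><p>Total: 42 USD</p></body></html>"
        # The processor extracts DOM nodes and their XPaths, then tokenizes them into
        # input_ids plus xpath_tags_seq / xpath_subs_seq for the XPath embeddings.
        encoding = processor(html, return_tensors="pt")
        outputs = model(**encoding)
        print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)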

***** September, 2021: TrOCR release *****

  • TrOCR (September 22, 2021): TrOCR is a Transformer-based OCR model with pre-trained components, leveraging the Transformer architecture for both image understanding and BPE-level text generation. The TrOCR model is simple but effective (convolution-free), and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models". A usage sketch follows below.
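
    A minimal text-line recognition sketch, assuming the Hugging Face transformers integration mentioned in the News section (TrOCRProcessor with VisionEncoderDecoderModel); the checkpoint name microsoft/trocr-base-handwritten and the file line.png are placeholders/assumptions.

        from PIL import Image
        from transformers import TrOCRProcessor, VisionEncoderDecoderModel

        # Checkpoint name is an assumption; substitute the officially released one.
        processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
        model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

        # "line.png" is a placeholder for a cropped single-line text image.
        image = Image.open("line.png").convert("RGB")
        pixel_values = processor(images=image, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)
        print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])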

***** August, 2021: LayoutReader release *****

***** August, 2021: DeltaLM release *****

***** July, 2021: BEiT release *****

***** June, 2021: LayoutXLM | AdaLM | MiniLMv2 release *****

***** May, 2021: LayoutLMv2 | LayoutXLM release *****

  • LayoutLM 2.0 (December 29, 2020): multimodal pre-training for visually-rich document understanding that leverages text, layout, and image information in a single framework. It achieves new SOTA on a wide range of document understanding tasks, including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852), RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding" (ACL 2021). A usage sketch follows below.
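
    A minimal document-classification sketch, assuming the Hugging Face transformers integration mentioned in the News section (LayoutLMv2Processor, which additionally requires detectron2 and pytesseract); the checkpoint name microsoft/layoutlmv2-base-uncased, the label count, and page.png are placeholders/assumptions.

        from PIL import Image
        from transformers import LayoutLMv2Processor, LayoutLMv2ForSequenceClassification

        # Checkpoint name is an assumption; substitute the officially released one.
        processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
        model = LayoutLMv2ForSequenceClassification.from_pretrained(
            "microsoft/layoutlmv2-base-uncased", num_labels=2  # num_labels is task-specific
        )

        # "page.png" is a placeholder for a scanned document page.
        image = Image.open("page.png").convert("RGB")
        encoding = processor(image, return_tensors="pt", truncation=True)  # runs OCR to get words + boxes
        outputs = model(**encoding)
        print(outputs.logits.shape)  # (1, num_labels)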

***** February, 2020: UniLM v2 | MiniLM v1 | LayoutLM v1 | s2s-ft v1 release *****

***** October 1st, 2019: UniLM v1 release *****

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using the pre-trained models, please submit a GitHub issue.

For other communications, please contact Furu Wei ([email protected]).

Comments
  • [LayoutLM] How to reproduce FUNSD result

    [LayoutLM] How to reproduce FUNSD result

    Hello, I have run fine-tuning for the sequence labeling task on the FUNSD dataset, but I couldn't achieve the result presented in the paper (precision is only 40%). Here are the scripts and logs that I used; any idea what could be wrong? Thank you very much. Training:

    #!/bin/bash
    
    python run_seq_labeling.py  --data_dir ~/mnt/data \
                                --model_type layoutlm \
                                --model_name_or_path ~/mnt/model \
                                --do_lower_case \
                                --max_seq_length 512 \
                                --do_train \
                                --num_train_epochs 100.0 \
                                --logging_steps 10 \
                                --save_steps -1 \
                                --output_dir ~/mnt/output \
                                --labels ~/mnt/data/labels.txt \
                                --per_gpu_train_batch_size 16 \
                                --fp16
    

    Testing:

    #!/bin/bash
    
    python run_seq_labeling.py --do_predict \
      --model_type layoutlm \
      --model_name_or_path ~/mnt/model \
      --data_dir ~/mnt/data \
      --output_dir ~/mnt/output \
      --labels ~/mnt/data/labels.txt
    

    Some log:

    05/14/2020 09:40:45 - INFO - __main__ -   ***** Running training *****
    05/14/2020 09:40:45 - INFO - __main__ -     Num examples = 150
    05/14/2020 09:40:45 - INFO - __main__ -     Num Epochs = 100
    05/14/2020 09:40:45 - INFO - __main__ -     Instantaneous batch size per GPU = 16
    05/14/2020 09:40:45 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 16
    05/14/2020 09:40:45 - INFO - __main__ -     Gradient Accumulation steps = 1
    05/14/2020 09:40:45 - INFO - __main__ -     Total optimization steps = 1000
    05/14/2020 09:53:00 - INFO - __main__ -    global_step = 1000, average loss = 0.10387736940692412
    
    05/14/2020 10:17:07 - INFO - __main__ -   ***** Running evaluation  *****
    05/14/2020 10:17:07 - INFO - __main__ -     Num examples = 52
    05/14/2020 10:17:07 - INFO - __main__ -     Batch size = 8
    05/14/2020 10:17:07 - INFO - __main__ -   
               precision    recall  f1-score   support
    
     QUESTION       0.41      0.70      0.52       771
       HEADER       0.00      0.00      0.00       108
       ANSWER       0.39      0.50      0.44       513
    
    micro avg       0.40      0.57      0.47      1392
    macro avg       0.37      0.57      0.45      1392
    
    05/14/2020 10:17:07 - INFO - __main__ -   ***** Eval results  *****
    05/14/2020 10:17:07 - INFO - __main__ -     f1 = 0.472115668338743
    05/14/2020 10:17:07 - INFO - __main__ -     loss = 2.9291565077645436
    05/14/2020 10:17:07 - INFO - __main__ -     precision = 0.400600901352028
    05/14/2020 10:17:07 - INFO - __main__ -     recall = 0.5747126436781609
    
    opened by nv-quan 17
  • LayoutLMv2 code release

    LayoutLMv2 code release

    Thanks for sharing the great work! I was wondering if there is an expected date on when you will be releasing your code and pre-trained models for LayoutLMv2.

    opened by rtanaka-lab 14
  • Unable to get valid value from layoutlm model

    Unable to get valid value from layoutlm model

    Hi there,

    Thank you very much for your work. I'd like to use the LayoutLM network for sequence labeling, but I'm running into trouble. The details are as follows:

    For the environment, I tried to set it up as described in readme.md, but the command pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ failed, so I used my existing environment, which also meets the requirements.

    For the data, I downloaded FUNSD and preprocessed its testing_data following readme.md.

    For the run, after preprocessing I launch run_seq_labeling.py directly rather than from the command line. I can finish training but not prediction. My configuration is equivalent to the following command: python run_seq_labeling.py --data_dir data
    --model_type layoutlm
    --model_name_or_path layout-large-uncased \ (downloaded from Google Drive) --output_dir out
    --labels data/labels.txt
    --do_predict
    --do_lower_case
    --overwrite_output_dir
    All other options are unchanged from their defaults. The error occurs at line 357: eval_loss += tmp_eval_loss.item()

    and the error is RuntimeError: CUDA error: device-side assert triggered

    I debugged it, and this error may originate from line 349: outputs = model(input)

    The input is a dict of tensors containing {input_ids, attention_mask, labels, bbox, token_type_ids}, but the output is a tuple of 2 tensors whose data shows "Unable to get repr for 'torch.Tensor'".

    I ran it as I understood from the paper, but I don't know whether that is correct. I have spent a long time on this without getting a result; could you please provide some help? Sincere thanks.

    opened by NancyNozomi 14
  • [unused99] occurs two times in unilm1.2-base-uncased-vocab.txt

    [unused99] occurs two times in unilm1.2-base-uncased-vocab.txt

    When running s2s-ft, [unused99] occurs twice in unilm1.2-base-uncased-vocab.txt, which results in an incorrect seq2seq_loader.Preprocess4Seq2seqDecoder.vocab_words

    opened by GentleSmile 14
  • Question Generation example for S2S-FT

    Question Generation example for S2S-FT

    Hi Li Dong, thanks a lot for your code and the S2S-FT example. The README gives an abstractive summarization usage example; could you please also give a question generation usage example? Thanks in advance, Philippe. PS: I have another question, but it's better tracked in a separate issue.

    opened by Neuronys 13
  • Issue in the backbone of LayoutLMv2

    Issue in the backbone of LayoutLMv2

    Describe the bug: Model I am using: LayoutLMv2

    While trying to train LayoutLMv2, I got an error saying,

    ~/orbi/unilm/layoutlmft/layoutlmft/models/layoutlmv2/modeling_layoutlmv2.py in forward(self, images)
        605 
        606     def forward(self, images):
    --> 607         images_input = (images.tensor - self.pixel_mean) / self.pixel_std
        608         features = self.backbone(images_input)
        609         features = features[self.out_feature_key]
    
    AttributeError: 'Tensor' object has no attribute 'tensor'
    

    I tried running it without .tensor and it worked fine.

    Please let me know if there is any mistake in the way I am passing the images. Otherwise, I can open a PR to remove .tensor.

    Thanks!

    cc: @ranpox

    opened by rushabh-v 12
  • KeyError: 'layoutlmv2' in AutoTokenizer

    KeyError: 'layoutlmv2' in AutoTokenizer

    I get a KeyError when I try to run AutoTokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased"). The same happens even when I download the files locally and run the above with the path to the config folder.

    I also cannot find layoutlmv2 in the AutoTokenizer.from_pretrained documentation.

    Any leads on how to use it the right way would be helpful.

    opened by sindhuattaiger 12
  • Is transformers LayoutLM really working?

    Is transformers LayoutLM really working?

    I am getting endless errors when trying to use LayoutLMForTokenClassification from transformers for an NER task. Is it just me doing something wrong, or is the class still a work in progress? I would really appreciate any information you can give.

    opened by shaonanqinghuaizongshishi 12
  • RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

    RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

    Hi, I was using run_classification.py with the base model of layoutlm for my own document set. However, I came across this error:

    04/25/2020 09:26:55 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: False
    04/25/2020 09:26:55 - INFO - transformers.configuration_utils -   loading configuration file /home/ubuntu/rwik_xx_document_classification/unilm/layoutlm-base-uncased/config.json
    04/25/2020 09:26:55 - INFO - transformers.configuration_utils -   Model config {
      "attention_probs_dropout_prob": 0.1,
      "finetuning_task": "cdip",
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "is_decoder": false,
      "layer_norm_eps": 1e-12,
      "max_2d_position_embeddings": 1024,
      "max_position_embeddings": 512,
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "num_labels": 2,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_past": true,
      "pruned_heads": {},
      "torchscript": false,
      "type_vocab_size": 2,
      "use_bfloat16": false,
      "vocab_size": 30522
    }
    
    04/25/2020 09:26:55 - INFO - transformers.tokenization_utils -   Model name '/home/ubuntu/rwik-xx/unilm/layoutlm-base-uncased/' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased). Assuming '/home/ubuntu/rwik-xx/unilm/layoutlm-base-uncased/' is a path or url to a directory containing tokenizer files.
    04/25/2020 09:26:55 - INFO - transformers.tokenization_utils -   Didn't find file /home/ubuntu/rwik-xx/unilm/layoutlm-base-uncased/added_tokens.json. We won't load it.
    04/25/2020 09:26:55 - INFO - transformers.tokenization_utils -   loading file /home/ubuntu/rwik-xx/unilm/layoutlm-base-uncased/vocab.txt
    04/25/2020 09:26:55 - INFO - transformers.tokenization_utils -   loading file None
    04/25/2020 09:26:55 - INFO - transformers.tokenization_utils -   loading file /home/ubuntu/rwik-xx/unilm/layoutlm-base-uncased/special_tokens_map.json
    04/25/2020 09:26:55 - INFO - transformers.tokenization_utils -   loading file /home/ubuntu/rwik-xx/unilm/layoutlm-base-uncased/tokenizer_config.json
    04/25/2020 09:26:55 - INFO - transformers.modeling_utils -   loading weights file /home/ubuntu/rwik-xx/unilm/layoutlm-base-uncased/pytorch_model.bin
    04/25/2020 09:27:16 - INFO - transformers.modeling_utils -   Weights of LayoutLMForSequenceClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
    04/25/2020 09:27:16 - INFO - transformers.modeling_utils -   Weights from pretrained model not used in LayoutLMForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
    04/25/2020 09:27:19 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='/home/ubuntu/rwik-xx/xx_pdftotext_html_output/', dev_folder='trial_24_04_test', device=device(type='cuda', index=0), do_eval=True, do_lower_case=True, do_train=True, eval_all_checkpoints=True, evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=2, hierarchical_tokens=False, learning_rate=5e-05, local_rank=-1, logging_steps=100, max_grad_norm=1.0, max_seq_length=448, max_steps=10000, model_name_or_path='/home/ubuntu/rwik_xx_document_classification/unilm/layoutlm-base-uncased/', model_type='layoutlm', n_gpu=1, no_cuda=False, nuance_mode=True, num_train_epochs=4.0, output_dir='/home/ubuntu/rwik_xx_document_classification/weight_files/1_layout_lm_base_24_04/', output_mode='classification', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=16, per_gpu_train_batch_size=8, save_steps=100, seed=566, server_ip='', server_port='', stride_len=112, task_name='cdip', tokenizer_name='', tpu=False, tpu_ip_address='', tpu_name='', tqdm_notebook_mode=False, train_folder='trial_24_04_train', warmup_steps=500, weight_decay=0.0, xrt_tpu_config='')
    04/25/2020 09:27:19 - INFO - __main__ -   Creating features from dataset file at /home/ubuntu/rwik-xx/xx_pdftotext_html_output/
    Gettting train examples: 100%|██████████| 789/789 [00:26<00:00, 29.34it/s]
      0%|          | 0/789 [00:00<?, ?it/s]04/25/2020 09:27:46 - INFO - utils_classification -   *** Example ***
    04/25/2020 09:27:46 - INFO - utils_classification -   guid: train-1
    04/25/2020 09:27:46 - INFO - utils_classification -   input_ids: 101 9986 2271 23773 11255 8909 1024 5709... ( truncated )
    04/25/2020 09:27:46 - INFO - utils_classification -   bboxes: [0, 0, 0, 0] [0, 0, 0, 0] [55, 44, 109, 53] [55, 44, 109, 53] [55, 44, 109, 53] [112, 44, 164, 53] ......[1000, 1000, 1000, 1000] ( truncated )
    04/25/2020 09:27:46 - INFO - utils_classification -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 ... ( truncated )
    04/25/2020 09:27:46 - INFO - utils_classification -   token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 ... ( truncated )
    04/25/2020 09:27:46 - INFO - utils_classification -   label: xxx (id = 1)
    04/25/2020 09:27:46 - INFO - utils_classification -   *** Example ***
    .
    .
    .
    (truncated)
    
    100%|██████████| 789/789 [00:30<00:00, 26.17it/s]
    04/25/2020 09:28:16 - INFO - __main__ -   Saving features into cached file /home/ubuntu/rwik_xx/xx_pdftotext_html_output/cached_train_layoutlm-base-uncased_448_cdip
    04/25/2020 09:28:19 - INFO - __main__ -   ***** Running training *****
    04/25/2020 09:28:19 - INFO - __main__ -     Num examples = 1929
    04/25/2020 09:28:19 - INFO - __main__ -     Num Epochs = 83
    04/25/2020 09:28:19 - INFO - __main__ -     Instantaneous batch size per GPU = 8
    04/25/2020 09:28:19 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 16
    04/25/2020 09:28:19 - INFO - __main__ -     Gradient Accumulation steps = 2
    04/25/2020 09:28:19 - INFO - __main__ -     Total optimization steps = 10000
    Epoch:   0%|          | 0/83 [00:00<?, ?it/s]
    Iteration:   0%|          | 0/242 [00:00<?, ?it/s]
    Epoch:   0%|          | 0/83 [00:00<?, ?it/s]
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    /home/ubuntu/rwik_xx_document_classification/unilm/layoutlm/run_classification.py in <module>()
        925 
        926 if __name__ == "__main__":
    --> 927     main()
    
    /home/ubuntu/rwik_xx_document_classification/unilm/layoutlm/run_classification.py in main()
        859             args, args.task_name, tokenizer, evaluate=False
        860         )
    --> 861         global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        862         logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
        863 
    
    /home/ubuntu/rwik_xx_document_classification/unilm/layoutlm/run_classification.py in train(args, train_dataset, model, tokenizer)
        217                 )  # XLM, DistilBERT and RoBERTa don't use segment_ids
        218             #pdb.set_trace()
    --> 219             outputs = model(**inputs)
        220             loss = outputs[
        221                 0
    
    /home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
        548             result = self._slow_forward(*input, **kwargs)
        549         else:
    --> 550             result = self.forward(*input, **kwargs)
        551         for hook in self._forward_hooks.values():
        552             hook_result = hook(self, input, result)
    
    /home/ubuntu/rwik_xx_document_classification/unilm/layoutlm/modeling_layoutlm.py in forward(self, input_ids, bbox, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels)
        328             token_type_ids=token_type_ids,
        329             position_ids=position_ids,
    --> 330             head_mask=head_mask,
        331         )
        332 
    
    /home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
        548             result = self._slow_forward(*input, **kwargs)
        549         else:
    --> 550             result = self.forward(*input, **kwargs)
        551         for hook in self._forward_hooks.values():
        552             hook_result = hook(self, input, result)
    
    /home/ubuntu/rwik_xx_document_classification/unilm/layoutlm/modeling_layoutlm.py in forward(self, input_ids, bbox, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask)
        191         )
        192         encoder_outputs = self.encoder(
    --> 193             embedding_output, extended_attention_mask, head_mask=head_mask
        194         )
        195         sequence_output = encoder_outputs[0]
    
    /home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
        548             result = self._slow_forward(*input, **kwargs)
        549         else:
    --> 550             result = self.forward(*input, **kwargs)
        551         for hook in self._forward_hooks.values():
        552             hook_result = hook(self, input, result)
    
    /home/ubuntu/.local/lib/python3.6/site-packages/transformers/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask)
        378                 all_hidden_states = all_hidden_states + (hidden_states,)
        379 
    --> 380             layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask)
        381             hidden_states = layer_outputs[0]
        382 
    
    /home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
        548             result = self._slow_forward(*input, **kwargs)
        549         else:
    --> 550             result = self.forward(*input, **kwargs)
        551         for hook in self._forward_hooks.values():
        552             hook_result = hook(self, input, result)
    
    /home/ubuntu/.local/lib/python3.6/site-packages/transformers/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask)
        349 
        350     def forward(self, hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None):
    --> 351         self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)
        352         attention_output = self_attention_outputs[0]
        353         outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights
    
    /home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
        548             result = self._slow_forward(*input, **kwargs)
        549         else:
    --> 550             result = self.forward(*input, **kwargs)
        551         for hook in self._forward_hooks.values():
        552             hook_result = hook(self, input, result)
    
    /home/ubuntu/.local/lib/python3.6/site-packages/transformers/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask)
        303 
        304     def forward(self, hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None):
    --> 305         self_outputs = self.self(hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask)
        306         attention_output = self.output(self_outputs[0], hidden_states)
        307         outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them
    
    /home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
        548             result = self._slow_forward(*input, **kwargs)
        549         else:
    --> 550             result = self.forward(*input, **kwargs)
        551         for hook in self._forward_hooks.values():
        552             hook_result = hook(self, input, result)
    
    /home/ubuntu/.local/lib/python3.6/site-packages/transformers/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask)
        213 
        214     def forward(self, hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None):
    --> 215         mixed_query_layer = self.query(hidden_states)
        216 
        217         # If this is instantiated as a cross-attention module, the keys
    
    /home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
        548             result = self._slow_forward(*input, **kwargs)
        549         else:
    --> 550             result = self.forward(*input, **kwargs)
        551         for hook in self._forward_hooks.values():
        552             hook_result = hook(self, input, result)
    
    /home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/linear.py in forward(self, input)
         85 
         86     def forward(self, input):
    ---> 87         return F.linear(input, self.weight, self.bias)
         88 
         89     def extra_repr(self):
    
    /home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/functional.py in linear(input, weight, bias)
       1610         ret = torch.addmm(bias, input, weight.t())
       1611     else:
    -> 1612         output = input.matmul(weight.t())
       1613         if bias is not None:
       1614             output += bias
    
    RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
    

    I am using torch version 1.3.0, transformers version 2.2.1, and Python 3.

    Note that I have modified the loading process to suit the format my dataset is in, and I have verified that it is being loaded correctly. In addition, I tried running roberta-base on the same dataset and it worked.

    When I tried to run the Python debugger and check the value of the input variable, I got this error:

    > /home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/functional.py(1612)linear()
       1610         ret = torch.addmm(bias, input, weight.t())
       1611     else:
    -> 1612         output = input.matmul(weight.t())
       1613         if bias is not None:
       1614             output += bias
    
    ipdb> input
    *** RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
    ipdb> weight
    *** RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
    
    opened by rwikdutta 12
  • Question about pre-training BEiT-base on ImageNet-1k and then fine-tuning on ADE20K

    Question about pre-training BEiT-base on ImageNet-1k and then fine-tuning on ADE20K

    Describe the model I am using (UniLM, MiniLM, LayoutLM ...): BEiT. I am trying to reproduce self-supervised pre-training of BEiT-base on ImageNet-1k followed by fine-tuning on ADE20K. In the paper this reaches 45.6 mIoU, slightly higher than supervised pre-training on ImageNet (45.3) and DINO (44.1), but I cannot reproduce this result; I only get about 39 mIoU.

    I wonder whether you could provide all the training commands and hyperparameters for self-supervised pre-training of BEiT-base on ImageNet-1k and then fine-tuning on ADE20K. Thanks!

    opened by laojiangwei 11
  • Can LayoutLM be used for commercial purpose?

    Can LayoutLM be used for commercial purpose?

    Can LayoutLM be used for commercial purposes? How is the LayoutLM license different from those of the other versions (LayoutLMv2, LayoutLMFT, LayoutXLM)? Does the license hold for both the trained model and the code, or can one use a trained model provided by other sources, such as DocBank, for commercial purposes?

    I have also noticed that the LayoutLM folder is marked as deprecated. What does that mean in terms of use and license?

    Model I am using (LayoutLM):

    opened by ghost 11
  • ValueError: You must provide corresponding bounding boxes

    ValueError: You must provide corresponding bounding boxes

    Hi,

    I have a question about using LayoutLMv2 to train on the FUNSD dataset. I followed the steps and got the following issue: ValueError: You must provide corresponding bounding boxes.

    I followed the documented command to train LayoutLMv2 on the FUNSD dataset.

    Here is the screenshot of the issue (image omitted). Any advice will be appreciated!

    opened by 14H034160212 0
  • Beit3 classification

    Beit3 classification

    Thanks for your inspiring and effective work; you achieved really great performance on ImageNet-1k classification. As the paper mentions, you treat classification as retrieval and do intermediate retrieval on ImageNet-21k before fine-tuning. We are wondering how you handle polysemy and identical class names shared by different classes in ImageNet-21k.

    Looking forward to your reply.

    opened by amandaluof 1
  • Could you please release the training pipeline of BEATs?

    Could you please release the training pipeline of BEATs?

    As described in the paper, the training pipeline/framework of BEATs is complex and is its core contribution. @[email protected] @[email protected] Thank you!

    BEATs: Audio Pre-Training with Acoustic Tokenizers

    opened by hllmathcs 0
  • License for LayoutLMv2

    License for LayoutLMv2

    Hi @wolfshow, we noticed that the license for LayoutLMv2 has been changed to a non-commercial one. Does this license change apply only to model weights/code downloaded after Sept 19 (the day the license was changed)? Can people who downloaded them prior to the license modification date continue to use them for commercial purposes?

    opened by riteshKumarUMass 0
  • Bump pillow from 8.3.1 to 9.3.0 in /vlmo

    Bump pillow from 8.3.1 to 9.3.0 in /vlmo

    Bumps pillow from 8.3.1 to 9.3.0.

    Release notes

    Sourced from pillow's releases.

    9.3.0

    https://pillow.readthedocs.io/en/stable/releasenotes/9.3.0.html

    Changes

    ... (truncated)

    Changelog

    Sourced from pillow's changelog.

    9.3.0 (2022-10-29)

    • Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [wiredfool]

    • Initialize libtiff buffer when saving #6699 [radarhere]

    • Inline fname2char to fix memory leak #6329 [nulano]

    • Fix memory leaks related to text features #6330 [nulano]

    • Use double quotes for version check on old CPython on Windows #6695 [hugovk]

    • Remove backup implementation of Round for Windows platforms #6693 [cgohlke]

    • Fixed set_variation_by_name offset #6445 [radarhere]

    • Fix malloc in _imagingft.c:font_setvaraxes #6690 [cgohlke]

    • Release Python GIL when converting images using matrix operations #6418 [hmaarrfk]

    • Added ExifTags enums #6630 [radarhere]

    • Do not modify previous frame when calculating delta in PNG #6683 [radarhere]

    • Added support for reading BMP images with RLE4 compression #6674 [npjg, radarhere]

    • Decode JPEG compressed BLP1 data in original mode #6678 [radarhere]

    • Added GPS TIFF tag info #6661 [radarhere]

    • Added conversion between RGB/RGBA/RGBX and LAB #6647 [radarhere]

    • Do not attempt normalization if mode is already normal #6644 [radarhere]

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • [DiT] Question regarding image resolutions on different tasks

    [DiT] Question regarding image resolutions on different tasks

    First of all, thanks for your great work on Document AI in general and on DiT specifically.

    In the paper, it says

    Since the image resolution for object detection tasks is much larger than classification, we limit the batch size to 16.

    which confuses me. If I understand correctly, when using a pretrained version of DiT, one is ALWAYS limited by the 224x224 image resolution, since this is constrained by the size of the patch embeddings (similar to how e.g. BERT-base simply can't go beyond 512 tokens due to the position embeddings). So regardless of the original size of the image, the input the model gets is always limited to this predefined 224x224.

    IF this reasoning is correct, then I cannot comprehend the logic behind resizing random crops of an image as described in the paper:

    Specifically, the input image is cropped with probability 0.5 to a random rectangular patch which is then resized again such that the shortest side is at least 480 and at most 800 pixels while the longest at most 1,333.

    Any clarification for this would be very much appreciated, thanks in advance!

    opened by mrvoh 1