Overview

DocFormer - PyTorch

docformer architecture

Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU) 📄 📄 📄 .

DocFormer is a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in no. of parameters).

The official implementation was not released by the authors.
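
As a rough mental model of the fusion idea, each modality attends over itself after the spatial embeddings are added, and the two streams are then combined. The sketch below is only a hypothetical illustration under that assumption; the class, the way the streams are merged, and the hyper-parameters are not the paper's exact attention formulation.

import torch.nn as nn

class ToyMultiModalFusion(nn.Module):
    """Illustrative only: per-modality self-attention with shared spatial embeddings."""

    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, v_bar, t_bar, v_bar_s, t_bar_s):
        # The learned spatial embeddings are added to both streams, which is
        # what lets the model correlate text tokens with visual tokens.
        v, _ = self.visual_attn(v_bar + v_bar_s, v_bar + v_bar_s, v_bar + v_bar_s)
        t, _ = self.text_attn(t_bar + t_bar_s, t_bar + t_bar_s, t_bar + t_bar_s)
        return v + t  # fused multi-modal representation, shape (batch, seq, hidden)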

Install

There might be some issues with importing pytesseract; to resolve them, install Tesseract and its Python bindings:

pip install pytesseract
sudo apt install tesseract-ocr

And then,

pip install git+https://github.com/shabie/docformer

Usage

from docformer import modeling, dataset
from transformers import BertTokenizerFast


config = {
  "coordinate_size": 96,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "image_feature_pool_shape": [7, 7, 256],
  "intermediate_ff_size_factor": 4,
  "max_2d_position_embeddings": 1000,
  "max_position_embeddings": 512,
  "max_relative_positions": 8,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "shape_size": 96,
  "vocab_size": 30522,
  "layer_norm_eps": 1e-12,
}

fp = "filepath/to/the/image.tif"

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = dataset.create_features(fp, tokenizer)

feature_extractor = modeling.ExtractFeatures(config)
docformer = modeling.DocFormerEncoder(config)
v_bar, t_bar, v_bar_s, t_bar_s = feature_extractor(encoding)
output = docformer(v_bar, t_bar, v_bar_s, t_bar_s)  # shape (1, 512, 768)
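
The encoder output is a sequence of hidden states of shape (1, 512, 768). For a downstream task such as document classification, a small head can be placed on top. The snippet below is only a sketch; the mean pooling and the 16-class head are illustrative assumptions, not part of this package.

import torch.nn as nn

num_labels = 16  # illustrative number of document classes

classifier = nn.Sequential(
    nn.LayerNorm(config["hidden_size"]),
    nn.Linear(config["hidden_size"], num_labels),
)

pooled = output.mean(dim=1)    # (1, 768) after mean-pooling the sequence
logits = classifier(pooled)    # (1, num_labels)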

License

MIT

Maintainers

Contribute

Citations

@InProceedings{Appalaraju_2021_ICCV,
    author    = {Appalaraju, Srikar and Jasani, Bhavan and Kota, Bhargava Urala and Xie, Yusheng and Manmatha, R.},
    title     = {DocFormer: End-to-End Transformer for Document Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {993-1003}
}
Comments
  • AssertionError:

    `AssertionError                            Traceback (most recent call last)
    <ipython-input-8-02f52eee118a> in <module>()
         25 
         26 tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    ---> 27 encoding = dataset.create_features(fp, tokenizer)
         28 
         29 feature_extractor = modeling.ExtractFeatures(config)
    
    /content/docformer/src/docformer/dataset.py in create_features(image, tokenizer, add_batch_dim, target_size, max_seq_length, path_to_save, save_to_disk, apply_mask_for_mlm, extras_for_debugging)
        259         "y_features": torch.as_tensor(a_rel_y, dtype=torch.int32),
        260         })
    --> 261     assert torch.lt(encoding["x_features"], 0).sum().item() == 0
        262     assert torch.lt(encoding["y_features"], 0).sum().item() == 0
        263 
    
    AssertionError: 
    

    I first tried with a PNG image and later converted it to TIF, but it still gives this error.

    opened by deepanshudashora 12
  • Weird output

    Hi, I ran the code and the final output it gives is weird, regardless of which image I use. I am attaching it. Can you explain what it is?

    image

    Thanks

    opened by kmr2017 7
  • Error When Following the Usage Instructions

    I tried following the usage instructions you posted on a sample .jpg image of a receipt. Every time I run it, I get an error saying, "RuntimeError: Expected 4-dimensional input for 4-dimensional weight [64, 3, 7, 7], but got 3-dimensional input of size [3, 384, 500] instead". How do I fix that?

    Full code:

    import pytesseract
    import sys 
    sys.path.extend(['docformer/src/docformer/'])
    import modeling, dataset
    from transformers import BertTokenizerFast
    
    
    config = {
      "coordinate_size": 96,
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "image_feature_pool_shape": [7, 7, 256],
      "intermediate_ff_size_factor": 4,
      "max_2d_position_embeddings": 1000,
      "max_position_embeddings": 512,
      "max_relative_positions": 8,
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "pad_token_id": 0,
      "shape_size": 96,
      "vocab_size": 30522,
      "layer_norm_eps": 1e-12,
    }
    
    fp = "images/data_sample.jpg"
    
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    encoding = dataset.create_features(fp, tokenizer)
    
    pytesseract.pytesseract.tesseract_cmd = r'‪C:\Program Files\Tesseract-OCR\tesseract.exe'
    
    feature_extractor = modeling.ExtractFeatures(config)
    docformer = modeling.DocFormerEncoder(config)
    
    v_bar, t_bar, v_bar_s, t_bar_s = feature_extractor(encoding)
    output = docformer(v_bar, t_bar, v_bar_s, t_bar_s)  # shape (1, 512, 768)
    

    Full error:

    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    Input In [3], in <module>
         31 feature_extractor = modeling.ExtractFeatures(config)
         32 docformer = modeling.DocFormerEncoder(config)
    ---> 34 v_bar, t_bar, v_bar_s, t_bar_s = feature_extractor(encoding)
         35 output = docformer(v_bar, t_bar, v_bar_s, t_bar_s)
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\module.py:1102, in Module._call_impl(self, *input, **kwargs)
       1098 # If we don't have any hooks, we want to skip the rest of the logic in
       1099 # this function, and just call forward.
       1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102     return forward_call(*input, **kwargs)
       1103 # Do not call functions when jit is used
       1104 full_backward_hooks, non_full_backward_hooks = [], []
    
    File ~\Documents\Projects\docformer_implementation\docformer/src/docformer\modeling.py:512, in ExtractFeatures.forward(self, encoding)
        509 x_feature = encoding['x_features']
        510 y_feature = encoding['y_features']
    --> 512 v_bar = self.visual_feature(image)
        513 t_bar = self.language_feature(language)
        515 v_bar_s, t_bar_s = self.spatial_feature(x_feature, y_feature)
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\module.py:1102, in Module._call_impl(self, *input, **kwargs)
       1098 # If we don't have any hooks, we want to skip the rest of the logic in
       1099 # this function, and just call forward.
       1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102     return forward_call(*input, **kwargs)
       1103 # Do not call functions when jit is used
       1104 full_backward_hooks, non_full_backward_hooks = [], []
    
    File ~\Documents\Projects\docformer_implementation\docformer/src/docformer\modeling.py:48, in ResNetFeatureExtractor.forward(self, x)
         47 def forward(self, x):
    ---> 48     x = self.resnet50(x)
         49     x = self.conv1(x)
         50     x = self.relu1(x)
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\module.py:1102, in Module._call_impl(self, *input, **kwargs)
       1098 # If we don't have any hooks, we want to skip the rest of the logic in
       1099 # this function, and just call forward.
       1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102     return forward_call(*input, **kwargs)
       1103 # Do not call functions when jit is used
       1104 full_backward_hooks, non_full_backward_hooks = [], []
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\container.py:141, in Sequential.forward(self, input)
        139 def forward(self, input):
        140     for module in self:
    --> 141         input = module(input)
        142     return input
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\module.py:1102, in Module._call_impl(self, *input, **kwargs)
       1098 # If we don't have any hooks, we want to skip the rest of the logic in
       1099 # this function, and just call forward.
       1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102     return forward_call(*input, **kwargs)
       1103 # Do not call functions when jit is used
       1104 full_backward_hooks, non_full_backward_hooks = [], []
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\conv.py:446, in Conv2d.forward(self, input)
        445 def forward(self, input: Tensor) -> Tensor:
    --> 446     return self._conv_forward(input, self.weight, self.bias)
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\conv.py:442, in Conv2d._conv_forward(self, input, weight, bias)
        438 if self.padding_mode != 'zeros':
        439     return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
        440                     weight, bias, self.stride,
        441                     _pair(0), self.dilation, self.groups)
    --> 442 return F.conv2d(input, weight, bias, self.stride,
        443                 self.padding, self.dilation, self.groups)
    
    RuntimeError: Expected 4-dimensional input for 4-dimensional weight [64, 3, 7, 7], but got 3-dimensional input of size [3, 384, 500] instead
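
    A possible workaround (an assumption, not a fix confirmed by the maintainers) is to add a batch dimension to every tensor in the encoding before the forward pass:

    import torch

    # Hypothetical workaround: unsqueeze a batch dimension on each tensor so the
    # ResNet backbone receives a 4-D (N, C, H, W) image instead of a 3-D one.
    encoding = {
        k: v.unsqueeze(0) if isinstance(v, torch.Tensor) else v
        for k, v in encoding.items()
    }

    The create_features signature shown in another traceback above also exposes an add_batch_dim argument, which may serve the same purpose.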
    
    opened by ynusinovich 7
  • DocFormer for Token Classification.

    Hi, first of all, great work. I wanted to ask whether DocFormer can be used for token classification like the LayoutLM series of models from Microsoft Research, which support tasks such as token classification, document image classification, and visual question answering, and if so, how we can adapt the model to the token classification task.
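
    For illustration only, a hypothetical wrapper (not part of this repository; the class and its name are assumptions) could put a linear head over the per-token encoder output:

    import torch.nn as nn
    from docformer import modeling

    class DocFormerForTokenClassification(nn.Module):
        """Hypothetical wrapper: a linear head over the (batch, 512, 768) encoder output."""

        def __init__(self, config, num_labels):
            super().__init__()
            self.encoder = modeling.DocFormerEncoder(config)
            self.dropout = nn.Dropout(config["hidden_dropout_prob"])
            self.classifier = nn.Linear(config["hidden_size"], num_labels)

        def forward(self, v_bar, t_bar, v_bar_s, t_bar_s):
            sequence_output = self.encoder(v_bar, t_bar, v_bar_s, t_bar_s)
            return self.classifier(self.dropout(sequence_output))  # per-token logits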

    opened by Akhilesh64 3
  • Again Device issue

    I am trying to run the code, but I face a problem.

    When I execute the line below: output = docformer(v_bar, t_bar, v_bar_s, t_bar_s) # shape (1, 512, 768)

    I get this error.

    /usr/local/lib/python3.7/dist-packages/torch/functional.py in einsum(*args)
        328         return einsum(equation, *_operands)
        329 
    --> 330     return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
        331 
        332 # Wrapper around _histogramdd and _histogramdd_bin_edges needed due to (Tensor, Tensor[]) return type

    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_bmm)

    This time, I face the issue in wrapper_bmm.
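
    A common fix for this class of error, assuming the model was moved to the GPU while the inputs were not, is to place the modules and every tensor of the encoding on the same device:

    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    feature_extractor = feature_extractor.to(device)
    docformer = docformer.to(device)
    encoding = {k: v.to(device) if isinstance(v, torch.Tensor) else v
                for k, v in encoding.items()}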

    opened by kmr2017 3
  • Do you have a plan for pre-trained model and inference code?

    Hi, thank you for sharing this nice work.

    I'm new to Document Understanding and I want to check inference on the FUNSD dataset. Could you share any pre-trained models and inference code?

    :) I'm waiting for your reply. Thank you.

    opened by yellowjs0304 2
  • Detail information about pre-training dataset

    Thank you for the great work! I was wondering what criteria (e.g., document words) were used to select the subset of 5 million pages from the IIT-CDIP dataset for pre-training. To compare our model with DocFormer under the same conditions, please share this information or release the pre-processing code.

    opened by rtanaka-lab 2
  • import error after installing via pip and via command python setup.py install

    ModuleNotFoundError                       Traceback (most recent call last)
    ----> 1 from docformer import modeling, dataset

    ModuleNotFoundError: No module named 'docformer'
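
    A workaround that mirrors the approach used in another issue above is to clone the repository and import the modules directly from its source directory:

    import sys
    sys.path.extend(["docformer/src/docformer/"])  # assumes the repo was cloned into ./docformer
    import modeling, dataset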

    opened by mayankpathaklumiq 2
  • FileNotFoundError: [Errno 2] No such file or directory: 'rvl_cdip_dataset.csv'

    I cannot find the rvl_cdip_dataset.csv file in your repository. It gives me an error at line 69 of dataset_creation_for_docformer.py. Where do I find the rvl_cdip_dataset.csv file?

    opened by SoumyaroopNandi 2
  • Simplification of code and a bit faster

    The bottleneck remains Tesseract, but I still tried to make it faster. If you don't find any bugs, please merge.

    Key changes:

    1. Use of lru_cache
    2. Use of word_ids for token-level bounding box duplication. This is a feature of BertTokenizerFast.
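
    For reference, a minimal illustration of the word_ids idea (the words and boxes below are made up):

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    words = ["Invoice", "Number", "12345"]                            # hypothetical OCR words
    boxes = [[10, 10, 80, 30], [90, 10, 160, 30], [170, 10, 220, 30]]

    enc = tokenizer(words, is_split_into_words=True)
    # word_ids() maps every sub-word token back to its source word, so the
    # word-level bounding box can simply be repeated for each of its tokens.
    token_boxes = [boxes[i] if i is not None else [0, 0, 0, 0] for i in enc.word_ids()]
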
    opened by shabie 2
  • There is no SEP token appended

    You have a [SEP] or an equivalent token at the end which I think is not what the authors used:

    https://github.com/shabie/docformer/blob/ae1ce38250d9e6ea2f9589fc11b43097045b2488/src/docformer/dataset.py#L260-L261

    See the first paragraph of the sub-section "Language Features" in section 3.1.

    opened by shabie 1
  • Inference for token classification.

    Hi @uakarsh, I trained DocFormer on a custom dataset and got a checkpoint file. I have no idea how to perform inference on test images. Can you show me how to run inference on images or, if possible, share a code snippet for it?

    opened by Akhilesh64 1
  • Using pre-trained models

    Hello, thank you for the great work. I used this script to run the pre-training for the MLM task: https://github.com/shabie/docformer/blob/master/examples/docformer_pl/3_Pre_training_DocFormer_Task_MLM_Task.ipynb Afterwards, I used the resulting model in the token-classification task (using load_from_checkpoint, which copies all the weights except the linear layer, which has a different shape).

    The problem is that no matter how much I run the pre-training, I always get the same metrics in the token-classification task (using that pre-trained model as a starting point).

    I even tried the model from the document-classification task as a base for token classification, and I still get the exact same metrics as when starting from the MLM-pretrained model.

    Any suggestion on how to properly use the pre-trained models?

    opened by HoomanKhosravi 16
  • Error in Example: Please provide the bounding box and words or pass the argument "use_ocr" = True

    Ran into this error while running the example notebook.

    ---------------------------------------------------------------------------
    Exception                                 Traceback (most recent call last)
    /tmp/ipykernel_33283/863471981.py in <cell line: 2>()
          1 ## Using a single batch for the forward propagation
    ----> 2 features = next(iter(train_data_loader))
          3 img,token,x_feat,y_feat, labels = features
    
    ~/anaconda3/envs/python3/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
        679                 # TODO(https://github.com/pytorch/pytorch/issues/76750)
        680                 self._reset()  # type: ignore[call-arg]
    --> 681             data = self._next_data()
        682             self._num_yielded += 1
        683             if self._dataset_kind == _DatasetKind.Iterable and \
    
    ~/anaconda3/envs/python3/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
        719     def _next_data(self):
        720         index = self._next_index()  # may raise StopIteration
    --> 721         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
        722         if self._pin_memory:
        723             data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)
    
    ~/anaconda3/envs/python3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
         47     def fetch(self, possibly_batched_index):
         48         if self.auto_collation:
    ---> 49             data = [self.dataset[idx] for idx in possibly_batched_index]
         50         else:
         51             data = self.dataset[possibly_batched_index]
    
    ~/anaconda3/envs/python3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
         47     def fetch(self, possibly_batched_index):
         48         if self.auto_collation:
    ---> 49             data = [self.dataset[idx] for idx in possibly_batched_index]
         50         else:
         51             data = self.dataset[possibly_batched_index]
    
    /tmp/ipykernel_33283/2543102337.py in __getitem__(self, index)
         22         If labels are not None, then labels also
         23         '''
    ---> 24         encoding = create_features(self.entries[index],self.tokenizer, apply_mask_for_mlm=self.use_mlm)
         25 
         26         if self.labels==None:
    
    ~/docformer/examples/../src/docformer/dataset.py in create_features(image, tokenizer, add_batch_dim, target_size, max_seq_length, path_to_save, save_to_disk, apply_mask_for_mlm, extras_for_debugging, use_ocr, bounding_box, words)
        190 
        191     if (use_ocr == False) and (bounding_box == None or words == None):
    --> 192         raise Exception('Please provide the bounding box and words or pass the argument "use_ocr" = True')
        193 
        194     if use_ocr == True:
    
    Exception: Please provide the bounding box and words or pass the argument "use_ocr" = True
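
    As the message says, create_features needs either OCR enabled or pre-extracted words and boxes, for example (the variable names are placeholders):

    # Let the library run Tesseract itself:
    encoding = create_features(image_path, tokenizer, use_ocr=True)

    # Or supply your own OCR output:
    encoding = create_features(image_path, tokenizer, use_ocr=False,
                               bounding_box=boxes, words=words)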
    
    opened by bbalaji-ucsd 1
  • Shape mismatch during sanity check

    Can the target size be changed? Currently, when I try to change it, it throws "mat1 and mat2 shapes cannot be multiplied (3072x768 and 192x128)". I tried with target size (1000, 768). I also tried setting max_position_embeddings to 768 and it throws "stack expects each tensor to be equal size, but got [767, 8] at entry 0 and [768, 8] at entry 2". I am trying to replicate the document classification task from the notebook provided.

    opened by Akhilesh64 1