Overview

DocFormer - PyTorch

docformer architecture

Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU) 📄 📄 📄 .

DocFormer is a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in no. of parameters).

The official implementation was not released by the authors.
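
As a rough mental model of the fusion idea, each modality attends over itself after the spatial embeddings are added, and the two streams are then combined. The sketch below is only a hypothetical illustration under that assumption; the class, the way the streams are merged, and the hyper-parameters are not the paper's exact attention formulation.

import torch.nn as nn

class ToyMultiModalFusion(nn.Module):
    """Illustrative only: per-modality self-attention with shared spatial embeddings."""

    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, v_bar, t_bar, v_bar_s, t_bar_s):
        # The learned spatial embeddings are added to both streams, which is
        # what lets the model correlate text tokens with visual tokens.
        v, _ = self.visual_attn(v_bar + v_bar_s, v_bar + v_bar_s, v_bar + v_bar_s)
        t, _ = self.text_attn(t_bar + t_bar_s, t_bar + t_bar_s, t_bar + t_bar_s)
        return v + t  # fused multi-modal representation, shape (batch, seq, hidden)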

Install

There might be some issues with importing pytesseract; to resolve them, install Tesseract and its Python bindings:

pip install pytesseract
sudo apt install tesseract-ocr

And then,

pip install git+https://github.com/shabie/docformer

Usage

from docformer import modeling, dataset
from transformers import BertTokenizerFast


config = {
  "coordinate_size": 96,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "image_feature_pool_shape": [7, 7, 256],
  "intermediate_ff_size_factor": 4,
  "max_2d_position_embeddings": 1000,
  "max_position_embeddings": 512,
  "max_relative_positions": 8,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "shape_size": 96,
  "vocab_size": 30522,
  "layer_norm_eps": 1e-12,
}

fp = "filepath/to/the/image.tif"

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = dataset.create_features(fp, tokenizer)

feature_extractor = modeling.ExtractFeatures(config)
docformer = modeling.DocFormerEncoder(config)
v_bar, t_bar, v_bar_s, t_bar_s = feature_extractor(encoding)
output = docformer(v_bar, t_bar, v_bar_s, t_bar_s)  # shape (1, 512, 768)
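
The encoder output is a sequence of hidden states of shape (1, 512, 768). For a downstream task such as document classification, a small head can be placed on top. The snippet below is only a sketch; the mean pooling and the 16-class head are illustrative assumptions, not part of this package.

import torch.nn as nn

num_labels = 16  # illustrative number of document classes

classifier = nn.Sequential(
    nn.LayerNorm(config["hidden_size"]),
    nn.Linear(config["hidden_size"], num_labels),
)

pooled = output.mean(dim=1)    # (1, 768) after mean-pooling the sequence
logits = classifier(pooled)    # (1, num_labels)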

License

MIT

Maintainers

Contribute

Citations

@InProceedings{Appalaraju_2021_ICCV,
    author    = {Appalaraju, Srikar and Jasani, Bhavan and Kota, Bhargava Urala and Xie, Yusheng and Manmatha, R.},
    title     = {DocFormer: End-to-End Transformer for Document Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {993-1003}
}
Comments
  • AssertionError:

    `AssertionError                            Traceback (most recent call last)
    <ipython-input-8-02f52eee118a> in <module>()
         25 
         26 tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    ---> 27 encoding = dataset.create_features(fp, tokenizer)
         28 
         29 feature_extractor = modeling.ExtractFeatures(config)
    
    /content/docformer/src/docformer/dataset.py in create_features(image, tokenizer, add_batch_dim, target_size, max_seq_length, path_to_save, save_to_disk, apply_mask_for_mlm, extras_for_debugging)
        259         "y_features": torch.as_tensor(a_rel_y, dtype=torch.int32),
        260         })
    --> 261     assert torch.lt(encoding["x_features"], 0).sum().item() == 0
        262     assert torch.lt(encoding["y_features"], 0).sum().item() == 0
        263 
    
    AssertionError: 
    

    I first tried with a PNG image and later converted it to TIF, but it still gives this error.

    opened by deepanshudashora 12
  • Weird output

    Hi, I ran the code and the final output it gives is weird, regardless of which image I use. I am attaching it. Can you explain what it is?

    image

    Thanks

    opened by kmr2017 7
  • Error When Following the Usage Instructions

    I tried following the usage instructions you posted on a sample .jpg image of a receipt. Every time I run it, I get an error saying, "RuntimeError: Expected 4-dimensional input for 4-dimensional weight [64, 3, 7, 7], but got 3-dimensional input of size [3, 384, 500] instead". How do I fix that?

    Full code:

    import pytesseract
    import sys 
    sys.path.extend(['docformer/src/docformer/'])
    import modeling, dataset
    from transformers import BertTokenizerFast
    
    
    config = {
      "coordinate_size": 96,
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "image_feature_pool_shape": [7, 7, 256],
      "intermediate_ff_size_factor": 4,
      "max_2d_position_embeddings": 1000,
      "max_position_embeddings": 512,
      "max_relative_positions": 8,
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "pad_token_id": 0,
      "shape_size": 96,
      "vocab_size": 30522,
      "layer_norm_eps": 1e-12,
    }
    
    fp = "images/data_sample.jpg"
    
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    encoding = dataset.create_features(fp, tokenizer)
    
    pytesseract.pytesseract.tesseract_cmd = r'‪C:\Program Files\Tesseract-OCR\tesseract.exe'
    
    feature_extractor = modeling.ExtractFeatures(config)
    docformer = modeling.DocFormerEncoder(config)
    
    v_bar, t_bar, v_bar_s, t_bar_s = feature_extractor(encoding)
    output = docformer(v_bar, t_bar, v_bar_s, t_bar_s)  # shape (1, 512, 768)
    

    Full error:

    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    Input In [3], in <module>
         31 feature_extractor = modeling.ExtractFeatures(config)
         32 docformer = modeling.DocFormerEncoder(config)
    ---> 34 v_bar, t_bar, v_bar_s, t_bar_s = feature_extractor(encoding)
         35 output = docformer(v_bar, t_bar, v_bar_s, t_bar_s)
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\module.py:1102, in Module._call_impl(self, *input, **kwargs)
       1098 # If we don't have any hooks, we want to skip the rest of the logic in
       1099 # this function, and just call forward.
       1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102     return forward_call(*input, **kwargs)
       1103 # Do not call functions when jit is used
       1104 full_backward_hooks, non_full_backward_hooks = [], []
    
    File ~\Documents\Projects\docformer_implementation\docformer/src/docformer\modeling.py:512, in ExtractFeatures.forward(self, encoding)
        509 x_feature = encoding['x_features']
        510 y_feature = encoding['y_features']
    --> 512 v_bar = self.visual_feature(image)
        513 t_bar = self.language_feature(language)
        515 v_bar_s, t_bar_s = self.spatial_feature(x_feature, y_feature)
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\module.py:1102, in Module._call_impl(self, *input, **kwargs)
       1098 # If we don't have any hooks, we want to skip the rest of the logic in
       1099 # this function, and just call forward.
       1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102     return forward_call(*input, **kwargs)
       1103 # Do not call functions when jit is used
       1104 full_backward_hooks, non_full_backward_hooks = [], []
    
    File ~\Documents\Projects\docformer_implementation\docformer/src/docformer\modeling.py:48, in ResNetFeatureExtractor.forward(self, x)
         47 def forward(self, x):
    ---> 48     x = self.resnet50(x)
         49     x = self.conv1(x)
         50     x = self.relu1(x)
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\module.py:1102, in Module._call_impl(self, *input, **kwargs)
       1098 # If we don't have any hooks, we want to skip the rest of the logic in
       1099 # this function, and just call forward.
       1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102     return forward_call(*input, **kwargs)
       1103 # Do not call functions when jit is used
       1104 full_backward_hooks, non_full_backward_hooks = [], []
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\container.py:141, in Sequential.forward(self, input)
        139 def forward(self, input):
        140     for module in self:
    --> 141         input = module(input)
        142     return input
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\module.py:1102, in Module._call_impl(self, *input, **kwargs)
       1098 # If we don't have any hooks, we want to skip the rest of the logic in
       1099 # this function, and just call forward.
       1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1101         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1102     return forward_call(*input, **kwargs)
       1103 # Do not call functions when jit is used
       1104 full_backward_hooks, non_full_backward_hooks = [], []
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\conv.py:446, in Conv2d.forward(self, input)
        445 def forward(self, input: Tensor) -> Tensor:
    --> 446     return self._conv_forward(input, self.weight, self.bias)
    
    File ~\anaconda3\envs\docformer_env\lib\site-packages\torch\nn\modules\conv.py:442, in Conv2d._conv_forward(self, input, weight, bias)
        438 if self.padding_mode != 'zeros':
        439     return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
        440                     weight, bias, self.stride,
        441                     _pair(0), self.dilation, self.groups)
    --> 442 return F.conv2d(input, weight, bias, self.stride,
        443                 self.padding, self.dilation, self.groups)
    
    RuntimeError: Expected 4-dimensional input for 4-dimensional weight [64, 3, 7, 7], but got 3-dimensional input of size [3, 384, 500] instead
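
    A possible workaround (an assumption, not a fix confirmed by the maintainers) is to add a batch dimension to every tensor in the encoding before the forward pass:

    import torch

    # Hypothetical workaround: unsqueeze a batch dimension on each tensor so the
    # ResNet backbone receives a 4-D (N, C, H, W) image instead of a 3-D one.
    encoding = {
        k: v.unsqueeze(0) if isinstance(v, torch.Tensor) else v
        for k, v in encoding.items()
    }

    The create_features signature shown in another traceback above also exposes an add_batch_dim argument, which may serve the same purpose.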
    
    opened by ynusinovich 7
  • DocFormer for Token Classification.

    Hi, first of all, great work. I wanted to ask whether DocFormer can be used for token classification like the LayoutLM series of models from Microsoft Research, which support tasks such as token classification, document image classification, and visual question answering, and if so, how we can adapt the model to the token classification task.
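
    For illustration only, a hypothetical wrapper (not part of this repository; the class and its name are assumptions) could put a linear head over the per-token encoder output:

    import torch.nn as nn
    from docformer import modeling

    class DocFormerForTokenClassification(nn.Module):
        """Hypothetical wrapper: a linear head over the (batch, 512, 768) encoder output."""

        def __init__(self, config, num_labels):
            super().__init__()
            self.encoder = modeling.DocFormerEncoder(config)
            self.dropout = nn.Dropout(config["hidden_dropout_prob"])
            self.classifier = nn.Linear(config["hidden_size"], num_labels)

        def forward(self, v_bar, t_bar, v_bar_s, t_bar_s):
            sequence_output = self.encoder(v_bar, t_bar, v_bar_s, t_bar_s)
            return self.classifier(self.dropout(sequence_output))  # per-token logits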

    opened by Akhilesh64 3
  • Again Device issue

    I am trying to run the code, but I face a problem.

    When I execute the line below: output = docformer(v_bar, t_bar, v_bar_s, t_bar_s) # shape (1, 512, 768)

    I get this error.

    /usr/local/lib/python3.7/dist-packages/torch/functional.py in einsum(*args)
        328         return einsum(equation, *_operands)
        329 
    --> 330     return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
        331 
        332 # Wrapper around _histogramdd and _histogramdd_bin_edges needed due to (Tensor, Tensor[]) return type

    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_bmm)

    This time, I face the issue in wrapper_bmm.
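
    A common fix for this class of error, assuming the model was moved to the GPU while the inputs were not, is to place the modules and every tensor of the encoding on the same device:

    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    feature_extractor = feature_extractor.to(device)
    docformer = docformer.to(device)
    encoding = {k: v.to(device) if isinstance(v, torch.Tensor) else v
                for k, v in encoding.items()}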

    opened by kmr2017 3
  • Do you have a plan for pre-trained model and inference code?

    Hi, thank you for sharing this nice work.

    I'm new to Document Understanding and I want to check inference on the FUNSD dataset. Could you share any pre-trained models and inference code?

    :) I'm waiting for your reply. Thank you.

    opened by yellowjs0304 2
  • Detail information about pre-training dataset

    Thank you for the great work! I was wondering what criteria (e.g., document words) were used to select the subset of 5 million pages from the IIT-CDIP dataset for pre-training. To compare our model with DocFormer under the same conditions, please share this information or release the pre-processing code.

    opened by rtanaka-lab 2
  • import error after installing via pip and via command python setup.py install

    ModuleNotFoundError                       Traceback (most recent call last)
    ----> 1 from docformer import modeling, dataset

    ModuleNotFoundError: No module named 'docformer'
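
    A workaround that mirrors the approach used in another issue above is to clone the repository and import the modules directly from its source directory:

    import sys
    sys.path.extend(["docformer/src/docformer/"])  # assumes the repo was cloned into ./docformer
    import modeling, dataset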

    opened by mayankpathaklumiq 2
  • FileNotFoundError: [Errno 2] No such file or directory: 'rvl_cdip_dataset.csv'

    I cannot find the rvl_cdip_dataset.csv file in your repository. It gives me an error at line 69 of dataset_creation_for_docformer.py. Where do I find the rvl_cdip_dataset.csv file?

    opened by SoumyaroopNandi 2
  • Simplification of code and a bit faster

    The bottleneck remains Tesseract, but I still tried to make it faster. If you don't find any bugs, please merge.

    Key changes:

    1. Use of lru_cache
    2. Use of word_ids for token-level bounding box duplication. This is a feature of BertTokenizerFast.
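
    For reference, a minimal illustration of the word_ids idea (the words and boxes below are made up):

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    words = ["Invoice", "Number", "12345"]                            # hypothetical OCR words
    boxes = [[10, 10, 80, 30], [90, 10, 160, 30], [170, 10, 220, 30]]

    enc = tokenizer(words, is_split_into_words=True)
    # word_ids() maps every sub-word token back to its source word, so the
    # word-level bounding box can simply be repeated for each of its tokens.
    token_boxes = [boxes[i] if i is not None else [0, 0, 0, 0] for i in enc.word_ids()]
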
    opened by shabie 2
  • There is no SEP token appended

    You have a [SEP] or an equivalent token at the end which I think is not what the authors used:

    https://github.com/shabie/docformer/blob/ae1ce38250d9e6ea2f9589fc11b43097045b2488/src/docformer/dataset.py#L260-L261

    See the first paragraph of the sub-section "Language Features" in section 3.1.

    opened by shabie 1
  • Inference for token classification.

    Hi @uakarsh, I trained DocFormer on a custom dataset and got a checkpoint file. I have no idea how to perform inference on test images. Can you show me how to run inference on images or, if possible, share a code snippet for it?

    opened by Akhilesh64 1
  • Using pre-trained models

    Hello, thank you for the great work. I used this script to run the pre-training for the MLM task: https://github.com/shabie/docformer/blob/master/examples/docformer_pl/3_Pre_training_DocFormer_Task_MLM_Task.ipynb Afterwards, I used the resulting model in the token-classification task (using load_from_checkpoint, which copies all the weights except the linear layer, which has a different shape).

    The problem is that no matter how much I run the pre-training, I always get the same metrics in the token-classification task (using that pre-trained model as a starting point).

    I even tried the model from the document-classification task as a base for token classification, and I still get the exact same metrics as when starting from the MLM-pretrained model.

    Any suggestion on how to properly use the pre-trained models?

    opened by HoomanKhosravi 16
  • Error in Example: Please provide the bounding box and words or pass the argument "use_ocr" = True

    Ran into this error while running the example notebook.

    ---------------------------------------------------------------------------
    Exception                                 Traceback (most recent call last)
    /tmp/ipykernel_33283/863471981.py in <cell line: 2>()
          1 ## Using a single batch for the forward propagation
    ----> 2 features = next(iter(train_data_loader))
          3 img,token,x_feat,y_feat, labels = features
    
    ~/anaconda3/envs/python3/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
        679                 # TODO(https://github.com/pytorch/pytorch/issues/76750)
        680                 self._reset()  # type: ignore[call-arg]
    --> 681             data = self._next_data()
        682             self._num_yielded += 1
        683             if self._dataset_kind == _DatasetKind.Iterable and \
    
    ~/anaconda3/envs/python3/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
        719     def _next_data(self):
        720         index = self._next_index()  # may raise StopIteration
    --> 721         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
        722         if self._pin_memory:
        723             data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)
    
    ~/anaconda3/envs/python3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
         47     def fetch(self, possibly_batched_index):
         48         if self.auto_collation:
    ---> 49             data = [self.dataset[idx] for idx in possibly_batched_index]
         50         else:
         51             data = self.dataset[possibly_batched_index]
    
    ~/anaconda3/envs/python3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
         47     def fetch(self, possibly_batched_index):
         48         if self.auto_collation:
    ---> 49             data = [self.dataset[idx] for idx in possibly_batched_index]
         50         else:
         51             data = self.dataset[possibly_batched_index]
    
    /tmp/ipykernel_33283/2543102337.py in __getitem__(self, index)
         22         If labels are not None, then labels also
         23         '''
    ---> 24         encoding = create_features(self.entries[index],self.tokenizer, apply_mask_for_mlm=self.use_mlm)
         25 
         26         if self.labels==None:
    
    ~/docformer/examples/../src/docformer/dataset.py in create_features(image, tokenizer, add_batch_dim, target_size, max_seq_length, path_to_save, save_to_disk, apply_mask_for_mlm, extras_for_debugging, use_ocr, bounding_box, words)
        190 
        191     if (use_ocr == False) and (bounding_box == None or words == None):
    --> 192         raise Exception('Please provide the bounding box and words or pass the argument "use_ocr" = True')
        193 
        194     if use_ocr == True:
    
    Exception: Please provide the bounding box and words or pass the argument "use_ocr" = True
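
    As the message says, create_features needs either OCR enabled or pre-extracted words and boxes, for example (the variable names are placeholders):

    # Let the library run Tesseract itself:
    encoding = create_features(image_path, tokenizer, use_ocr=True)

    # Or supply your own OCR output:
    encoding = create_features(image_path, tokenizer, use_ocr=False,
                               bounding_box=boxes, words=words)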
    
    opened by bbalaji-ucsd 1
  • Shape mismatch during sanity check

    Can the target size be changed? Currently, when I try to change it, it throws "mat1 and mat2 shapes cannot be multiplied (3072x768 and 192x128)". I tried with target size (1000, 768). I also tried setting max_position_embeddings to 768 and it throws "stack expects each tensor to be equal size, but got [767, 8] at entry 0 and [768, 8] at entry 2". I am trying to replicate the document classification task from the notebook provided.

    opened by Akhilesh64 1