The official code for “DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction”, an oral paper at ACM MM 2021.

Overview

Good news! Our new work, DocScanner: Robust Document Image Rectification with Progressive Learning, achieves state-of-the-art performance on the DocUNet benchmark dataset.

DocTr

DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction
ACM MM 2021 Oral

Any questions or discussions are welcome!

Training

  • For geometric unwarping, we train the GeoTr network using the Doc3D dataset (a minimal training sketch follows this list).
  • For illumination correction, we train the IllTr network based on the DRIC dataset.
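
The training code is not part of this repository; the sketch below only illustrates, under stated assumptions, a generic supervised setup for the geometric branch: a network regresses a 2-channel backward map from a distorted image and is optimized with an L1 loss against the ground-truth map. The placeholder model and random tensors stand in for the real GeoTr module (GeoTr.py) and a Doc3D data loader, whose actual interfaces may differ.

    import torch
    import torch.nn as nn

    # Placeholder regressor standing in for GeoTr: it maps a distorted RGB image
    # to a 2-channel backward map of sampling coordinates.
    model = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 2, 3, padding=1),
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    criterion = nn.L1Loss()

    for step in range(100):
        # Dummy batch standing in for a Doc3D loader: distorted images and
        # their ground-truth backward maps.
        img = torch.rand(4, 3, 288, 288)
        bm_gt = torch.rand(4, 2, 288, 288)

        bm_pred = model(img)
        loss = criterion(bm_pred, bm_gt)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()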

Inference

  1. Download the pretrained models here and put them in $ROOT/model_pretrained/.
  2. Geometric unwarping (see the resampling sketch after this list):
    python inference.py
    
  3. Geometric unwarping and illumination rectification:
    python inference.py --ill_rec True
    

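Under the hood, both modes use a predicted dense backward map to resample the distorted image. Below is a minimal, self-contained sketch of just that resampling step using PyTorch's grid_sample; the identity map here is only a stand-in for the map that the GeoTr network would actually predict, and the shapes are illustrative assumptions rather than the repository's exact values.

    import torch
    import torch.nn.functional as F

    # Distorted input image as a (1, 3, H, W) float tensor in [0, 1].
    img = torch.rand(1, 3, 288, 288)

    # Stand-in backward map: for each output pixel, the (x, y) location to sample
    # from in the distorted image, normalized to [-1, 1] as grid_sample expects.
    # In practice this map would come from the pretrained GeoTr network.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, 288), torch.linspace(-1, 1, 288), indexing="ij"
    )
    bm = torch.stack((xs, ys), dim=-1).unsqueeze(0)  # (1, H, W, 2), identity mapping

    # With the identity map this returns the input unchanged; with a predicted
    # backward map it returns the rectified (unwarped) image.
    rectified = F.grid_sample(img, bm, mode="bilinear", align_corners=True)
    print(rectified.shape)  # torch.Size([1, 3, 288, 288])
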
Evaluation

  • We use the same evaluation code as the DocUNet benchmark, based on MATLAB 2019a.
  • Scores can vary slightly across MATLAB versions, so compare results obtained with the same version.
  • Use the images available here to reproduce the quantitative results reported in the paper and for further comparison (a rough Python sanity check is sketched after this list).
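
The reported numbers come from that MATLAB pipeline. As a rough Python-side sanity check only, one can compare a rectified result against its scanned ground truth with scikit-image's single-scale SSIM; this is not the MS-SSIM/LD code used in the paper and will not reproduce the reported scores exactly, and the file paths below are placeholders.

    from skimage import io, transform
    from skimage.metrics import structural_similarity

    # Placeholder paths: one rectified output and the corresponding scanned ground truth.
    rectified = io.imread("rectified/1_1.png", as_gray=True)  # float64 in [0, 1]
    scan = io.imread("scan/1.png", as_gray=True)

    # Resize the rectified image to the ground-truth resolution before comparison.
    rectified = transform.resize(rectified, scan.shape, anti_aliasing=True)

    # Single-scale SSIM as a rough proxy; the official benchmark uses MS-SSIM in MATLAB.
    score = structural_similarity(scan, rectified, data_range=1.0)
    print(f"SSIM: {score:.4f}")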

Citation

If you find this code useful for your research, please cite our papers using the following BibTeX entries.

@inproceedings{feng2021doctr,
  title={DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction},
  author={Feng, Hao and Wang, Yuechen and Zhou, Wengang and Deng, Jiajun and Li, Houqiang},
  booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
  pages={273--281},
  year={2021}
}

@article{feng2021docscanner,
  title={DocScanner: Robust Document Image Rectification with Progressive Learning},
  author={Feng, Hao and Zhou, Wengang and Deng, Jiajun and Tian, Qi and Li, Houqiang},
  journal={arXiv preprint arXiv:2110.14968},
  year={2021}
}
Comments
  • DRIC dataset

    Hi, "The DRIC dataset [18] consists of 2700 distorted document images, each at 2400 × 1800 resolution. For each distorted document image, there are corresponding backward mapping map and scanned PDF image. " But I cannot find the scanned PDF images of DRIC dataset?

    opened by THustc2021 3
  • About training

    Hello. 1) When are you going to open-source the training code? 2) How soon do you plan to do so? 3) What do you use for document detection in your pipeline, and how well does it work?

    opened by CVUsers 3
  • four question about GeoTr.py

    Hello Hao, I have read your pioneering work on using ViT for document unwarping. In the paper, the design of the geometry tail is marvellous: it introduces a learnable module to upsample the decoded features $f_{d}$, as shown in the attached figure. I have four questions. Q1: As I understand it, this design is essentially a local dot product of two feature maps ($f_{o}$ and $f_m$). Do I understand correctly? The design is hard for me to picture, so I wonder what your motivation for this tail design was. Is there any similar design in other reference papers?

    Q2: In the following code block, why does the flow need to be multiplied by 8? https://github.com/fh2019ustc/DocTr/blob/bbb1af9c01788bc28f5249ea14ea66d2b9f55353/GeoTr.py#L211

    Q3: In the following code block, why is the mask passed through a softmax? Does the softmax operation have some special significance here? https://github.com/fh2019ustc/DocTr/blob/bbb1af9c01788bc28f5249ea14ea66d2b9f55353/GeoTr.py#L209

    Q4: In the following code block, why is coodslar added to the predicted backward mapping? Is this operation important? My guess is that it acts like a kind of positional encoding, but since this is the final operation in the network, why not add this positional encoding in an earlier layer? https://github.com/fh2019ustc/DocTr/blob/bbb1af9c01788bc28f5249ea14ea66d2b9f55353/GeoTr.py#L231
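
    For context on Q2 and Q3: the two referenced lines resemble RAFT-style convex upsampling, where the coarse map (at 1/8 resolution) is scaled by the upsampling factor and each fine-resolution value is a softmax-weighted convex combination of a 3x3 neighborhood of coarse values. Below is a rough sketch of that generic operation, as an assumption for illustration rather than DocTr's exact code.

    import torch
    import torch.nn.functional as F

    def convex_upsample(flow, mask, factor=8):
        # flow: (N, 2, H, W) coarse map; mask: (N, 9 * factor * factor, H, W) predicted weights.
        N, _, H, W = flow.shape
        mask = mask.view(N, 1, 9, factor, factor, H, W)
        mask = torch.softmax(mask, dim=2)  # convex weights over each 3x3 neighborhood (cf. Q3)
        up = F.unfold(factor * flow, kernel_size=3, padding=1)  # scale values by the factor (cf. Q2)
        up = up.view(N, 2, 9, 1, 1, H, W)
        up = torch.sum(mask * up, dim=2)  # weighted combination of the 9 neighbors
        up = up.permute(0, 1, 4, 2, 5, 3)
        return up.reshape(N, 2, factor * H, factor * W)

    # Example: upsample a 36x36 coarse map to 288x288.
    coarse = torch.rand(1, 2, 36, 36)
    weights = torch.rand(1, 9 * 64, 36, 36)
    print(convex_upsample(coarse, weights).shape)  # torch.Size([1, 2, 288, 288])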

    Many thanks for your explanation.

    Best wishes, Weiguang Zhang

    opened by hanquansanren 2
  • Rectified Images

    Thank you for your work! I have a question about the rectified images you provide on Google Drive and Baidu Cloud: were they also processed by your illumination rectification model?

    opened by lsabrinax 1
  • pls add environment in README

    Hi, thank you for sharing your work. I want to try inference.py, but I'm stuck setting up the environment. Could you please share the environment or a requirements.txt (CUDA version, PyTorch version, other packages, etc.)?

    opened by yellowjs0304 1
  • Do you have a plan to release the training code of this paper?

    Hi @fh2019ustc, thank you for the great repo! Do you have a plan to release the training code for this paper? Hope to see your response; thanks in advance!

    opened by huyhoang17 1
  • illtr train

    @fh2019ustc I have a few questions. 1) I have reimplemented GeoTr; my CER is lower than the repo's, but my SSIM and LD are worse. What could cause this? 2) I am now preparing to reproduce IllTr. How do I obtain the rectified images for the DocProj data? An earlier issue mentions using resampling.rectification, but running it does not produce rectified images. 3) The downloaded ground-truth scanned images number only 550, while the img and flow folders contain 2750 items, so is IllTr trained only on images rectified from 000_0.png? 4) The paper says crops are 128*128 with overlap=12.5%, but the details say "randomly crop"; how is this random cropping done?

    opened by an1018 0
  • DocTr is introducing distortions. Am I doing anything wrong?

    Hello there,

    I am using DocTr to enhance the quality of a few images in my project, and I am finding that DocTr introduces distortions in the output files. Please let me know if I am using it incorrectly.

    Steps I followed.

    1. Cloned the DocTr project to my Google Drive.
    2. Copied the 3 .pth files and the image to the desired location.
    3. Ran the following commands in Colab:
       %cd /content/drive/MyDrive/MSProject/DocTr-cloned-repo/DocTr/
       ! python inference.py --ill_rec True --distorrted_path '/content/drive/MyDrive/MSProject/temp/' --isave_path '/content/drive/MyDrive/MSProject/DocTr_output_images/'

    The original file that has been used is this image

    The image output from DocTr is this image

    Comparison for ease of reference Untitled_1

    Please let me know whether this is expected behavior or whether I am doing something wrong.

    Request your immediate response, as I have to conclude my research and submit my project as a part of my MS program.

    Thank you in advance

    opened by pavanbvns 1
  • About the OCR engine that you use, three questions need your help

    Q1: Hello, in Section 5.1 of your paper, I notice you used Pytesseract v3.02.02, as shown in the attached picture. But on the pytesseract homepage I can only find versions 0.3.x or 0.2.x; could you please tell me the exact version you used? By the way, the DewarpNet paper specifies Pytesseract version 0.2.9. Are there big differences caused by the version of the OCR engine?

    Q2: Computing the CER metric requires the ground truth for each character in the images. I notice your repository provides an index of 60 images for the OCR metric test, while DewarpNet provides an index of 25 images together with the ground truth in JSON form. Can you tell me how you annotated the ground truth? And if possible, can you share your ground-truth file?

    In addition, I noticed that the 25 ground truths from DewarpNet contain several label errors, so I guess they were also produced with an OCR engine. If you also used an OCR engine to label the ground truth, can you share more details about how you annotated it?

    Q3: I also tried to test the OCR performance on your model's outputs. However, neither pytesseract version 0.3.x nor 0.2.x achieves the results reported in the paper. Here is my OCR test code:

    from PIL import Image
    import pytesseract

    import json
    import os
    from os.path import join as pjoin
    from pathlib import Path
    import numpy as np


    def edit_distance(str1, str2):
        """Compute the edit (Levenshtein) distance between two strings.
        Args:
            str1: the first string.
            str2: the second string.
        Returns:
            dist: the edit distance.
        """
        matrix = [[i + j for j in range(len(str2) + 1)] for i in range(len(str1) + 1)]
        for i in range(1, len(str1) + 1):
            for j in range(1, len(str2) + 1):
                if str1[i - 1] == str2[j - 1]:
                    d = 0
                else:
                    d = 1
                matrix[i][j] = min(matrix[i - 1][j] + 1, matrix[i][j - 1] + 1, matrix[i - 1][j - 1] + d)
        dist = matrix[len(str1)][len(str2)]
        return dist


    def get_cer(src, trg):
        """Character error rate of editing the source string src into the target string trg.
        Args:
            src: the source (recognized) string.
            trg: the target (ground-truth) string.
        Returns:
            cer: the character error rate.
        """
        dist = edit_distance(src, trg)
        cer = dist / len(trg)
        return cer


    if __name__ == "__main__":
        reference_list = []
        reference_index = []
        img_dirList = []
        cer_list = []
        r_path = pjoin('./doctr/')
        result_file = open('result1.log', 'w')
        print(pytesseract.get_languages(config=''))

        # Load the ground-truth transcriptions once, instead of re-reading the JSON per image.
        with open('tess_gt.json', 'r') as f:
            gt_dict = json.loads(f.read())

        with open('ocr_files.txt', 'r') as fr:
            for l, line in enumerate(fr):
                reference_list.append(line)
                reference_index.append(l)
                print(len(line), line)
                print(len(line), line, file=result_file)
                # The two rectified outputs for each benchmark image.
                h1str = "./doctr/" + line[7:-1] + "_1 copy.png"
                h2str = "./doctr/" + line[7:-1] + "_2 copy.png"
                print(h1str, h2str)
                h1 = pytesseract.image_to_string(Image.open(h1str), lang='eng')
                h2 = pytesseract.image_to_string(Image.open(h2str), lang='eng')

                r = gt_dict.get(line[:-1])
                cer_value1 = get_cer(h1, r)
                cer_value2 = get_cer(h2, r)
                print(cer_value1, cer_value2)
                print(cer_value1, cer_value2, file=result_file)
                cer_list.append(cer_value1)
                cer_list.append(cer_value2)

        print(np.mean(cer_list))
        print(np.mean(cer_list), file=result_file)
        result_file.close()
    

    In brief, the core OCR call is h1 = pytesseract.image_to_string(Image.open(h1str), lang='eng'), with which I only get a CER of about 0.6. This is far from the 0.2~0.3 CER achieved by previous models.

    Could you share your OCR version and code for the OCR metric? Many thanks for your generous response!

    opened by hanquansanren 19
Owner
Hao Feng, PhD candidate, Department of Electronic Engineering and Information Science, University of Science and Technology of China