TableBank: A Benchmark Dataset for Table Detection and Recognition

Overview

TableBank

TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet. It contains 417K high-quality labeled tables.

News

  • We release an official split of the train/val/test sets and re-train both the Table Detection and Table Structure Recognition models using the Detectron2 and OpenNMT toolkits. The benchmark results, the MODEL ZOO, and the download link for TableBank have been updated.
  • A new benchmark dataset, DocBank (Paper, Repo), is now available for document layout analysis.
  • Our data can only be used for research purposes.
  • Our paper has been accepted by LREC 2020.

Introduction

To address the need for a standard open-domain table benchmark dataset, we propose a novel weak supervision approach to automatically create TableBank, which is orders of magnitude larger than existing human-labeled datasets for table analysis. Unlike traditional weakly supervised training sets, our approach yields training data that is not only large-scale but also high-quality.

Nowadays, there are a great number of electronic documents on the web, such as Microsoft Word (.docx) and LaTeX (.tex) files. By nature, these documents contain mark-up tags for tables in their source code. Intuitively, we can manipulate the source code to add bounding boxes using the mark-up language within each document. For Word documents, the internal Office XML code can be modified so that the borderline of each table is identified. For LaTeX documents, the .tex code can likewise be modified so that the bounding boxes of tables are recognized. In this way, high-quality labeled data is created for a variety of domains such as business documents, official filings, and research papers, which is tremendously beneficial for large-scale table analysis tasks.
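
To make the idea concrete, below is a minimal sketch of the Word-side labeling step, assuming the third-party python-docx library; the file names and the border color are hypothetical choices for illustration, not the exact pipeline used to build TableBank:

    # Hedged sketch: attach visible borders to every table in a .docx by
    # editing the underlying Office XML (w:tblPr/w:tblBorders), so table
    # outlines can later be recovered from the rendered page images.
    # File names and the green border color are hypothetical.
    from docx import Document
    from docx.oxml import OxmlElement
    from docx.oxml.ns import qn

    doc = Document("input.docx")
    for table in doc.tables:
        borders = OxmlElement("w:tblBorders")
        for edge in ("top", "left", "bottom", "right"):
            element = OxmlElement("w:" + edge)
            element.set(qn("w:val"), "single")
            element.set(qn("w:sz"), "8")            # width in eighths of a point
            element.set(qn("w:color"), "00FF00")    # a color easy to detect later
            borders.append(element)
        table._tbl.tblPr.append(borders)            # <w:tblPr> table properties
    doc.save("labeled.docx")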

In total, the TableBank dataset consists of 417,234 high-quality labeled tables as well as their original documents in a variety of domains.

Statistics of TableBank

Based on the number of tables

| Task | Word | Latex | Word+Latex |
| --- | --- | --- | --- |
| Table detection | 163,417 | 253,817 | 417,234 |
| Table structure recognition | 56,866 | 88,597 | 145,463 |

Based on the number of images

| Task | Word | Latex | Word+Latex |
| --- | --- | --- | --- |
| Table detection | 78,399 | 200,183 | 278,582 |
| Table structure recognition | 56,866 | 88,597 | 145,463 |

Statistics on Train/Val/Test sets of Table Detection

| Source | Train | Val | Test |
| --- | --- | --- | --- |
| Latex | 187,199 | 7,265 | 5,719 |
| Word | 73,383 | 2,735 | 2,281 |
| Total | 260,582 | 10,000 | 8,000 |

Statistics on Train/Val/Test sets of Table Structure Recognition

| Source | Train | Val | Test |
| --- | --- | --- | --- |
| Latex | 79,486 | 6,075 | 3,036 |
| Word | 50,977 | 3,925 | 1,964 |
| Total | 130,463 | 10,000 | 5,000 |

License

TableBank is released under the Attribution-NonCommercial-NoDerivs License. You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may not use the material for commercial purposes. If you remix, transform, or build upon the material, you may not distribute the modified material.

Task Definition

Table Detection

Table detection aims to locate tables in a document using bounding boxes. Given a document page in image format, the task is to generate one or more bounding boxes that represent the locations of tables on that page.

Table Structure Recognition

Table structure recognition aims to identify the row and column layout of tables, especially in non-digital document formats such as scanned images. Given a table in image format, the task is to generate an HTML tag sequence that represents the arrangement of rows and columns as well as the type of each table cell.
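
As an illustration, the label for a simple table with one header row and one data row, each containing two cells, might be a tag sequence like the following (a hypothetical rendering; the exact tag vocabulary is defined by the released label files, and cell contents are not part of the label, only the layout and cell types):

    <table> <thead> <tr> <td> </td> <td> </td> </tr> </thead>
            <tbody> <tr> <td> </td> <td> </td> </tr> </tbody> </table>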

Baselines

To verify the effectiveness of TableBank, we build several strong baselines using state-of-the-art models with end-to-end deep neural networks. The table detection model is based on the Faster R-CNN [Ren et al., 2015] architecture with different settings. The table structure recognition model is based on an encoder-decoder framework for image-to-text generation.

Data and Metrics

To evaluate table detection, we sample 18,000 document images from Word and LaTeX documents: 10,000 images for validation and 8,000 images for testing. Each sampled image contains at least one table. Meanwhile, we also evaluate our model on the ICDAR 2013 dataset to verify the effectiveness of TableBank. To evaluate table structure recognition, we sample 15,000 table images from Word and LaTeX documents: 10,000 images for validation and 5,000 images for testing. For table detection, we calculate precision, recall, and F1 in the way described in our paper, where the metrics for all documents are computed by summing up the areas of overlap, prediction, and ground truth. For table structure recognition, we use the 4-gram BLEU score as the evaluation metric with a single reference.
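
The following is a minimal sketch of the area-based detection metrics described above, assuming boxes are (x1, y1, x2, y2) tuples; the paper's exact protocol may differ in detail:

    # Hedged sketch of the area-based precision/recall/F1 described above.
    # `predictions` and `ground_truths` hold one list of boxes per document;
    # boxes are (x1, y1, x2, y2) tuples in pixel coordinates.
    def box_area(box):
        x1, y1, x2, y2 = box
        return max(0, x2 - x1) * max(0, y2 - y1)

    def overlap_area(a, b):
        width = min(a[2], b[2]) - max(a[0], b[0])
        height = min(a[3], b[3]) - max(a[1], b[1])
        return max(0, width) * max(0, height)

    def detection_scores(predictions, ground_truths):
        overlap = pred_area = gt_area = 0
        for preds, gts in zip(predictions, ground_truths):
            overlap += sum(overlap_area(p, g) for p in preds for g in gts)
            pred_area += sum(box_area(p) for p in preds)
            gt_area += sum(box_area(g) for g in gts)
        precision = overlap / pred_area if pred_area else 0.0
        recall = overlap / gt_area if gt_area else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1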

Table Detection

We use the open-source framework Detectron2 [Wu et al., 2019] to train models on TableBank. Detectron2 is a high-quality, high-performance codebase for object detection research that supports many state-of-the-art algorithms. In this task, we use the Faster R-CNN algorithm with ResNeXt [Xie et al., 2016] as the backbone network architecture, where the parameters are pre-trained on the ImageNet dataset. All baselines are trained on 4 NVIDIA V100 GPUs using data-parallel synchronous SGD with a minibatch size of 20 images. For other parameters, we use the default values in Detectron2. During testing, the confidence threshold for generating bounding boxes is set to 90%.
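
For reference, a released checkpoint could be loaded for inference roughly as follows (a hedged sketch: the config and weight file names are placeholders for the files shipped in the Model Zoo, and the score threshold mirrors the 90% setting above):

    # Hedged sketch of Detectron2 inference with a TableBank checkpoint.
    # "All_X152.yaml" and "model_final.pth" are placeholder file names.
    import cv2
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    cfg.merge_from_file("All_X152.yaml")
    cfg.MODEL.WEIGHTS = "model_final.pth"
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.9   # 90% confidence threshold
    predictor = DefaultPredictor(cfg)

    image = cv2.imread("page.jpg")                # a document page image
    outputs = predictor(image)
    print(outputs["instances"].pred_boxes)        # predicted table bounding boxes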

| Models | Word Precision | Word Recall | Word F1 | Latex Precision | Latex Recall | Latex F1 | Word+Latex Precision | Word+Latex Recall | Word+Latex F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| X101(Word) | 0.9352 | 0.9398 | 0.9375 | 0.9905 | 0.5851 | 0.7356 | 0.9579 | 0.7474 | 0.8397 |
| X152(Word) | 0.9418 | 0.9415 | 0.9416 | 0.9912 | 0.6882 | 0.8124 | 0.9641 | 0.8041 | 0.8769 |
| X101(Latex) | 0.8453 | 0.9335 | 0.8872 | 0.9819 | 0.9799 | 0.9809 | 0.9159 | 0.9587 | 0.9368 |
| X152(Latex) | 0.8476 | 0.9264 | 0.8853 | 0.9816 | 0.9814 | 0.9815 | 0.9173 | 0.9562 | 0.9364 |
| X101(Word+Latex) | 0.9178 | 0.9363 | 0.9270 | 0.9827 | 0.9784 | 0.9806 | 0.9526 | 0.9592 | 0.9559 |
| X152(Word+Latex) | 0.9229 | 0.9266 | 0.9247 | 0.9837 | 0.9752 | 0.9795 | 0.9557 | 0.9530 | 0.9543 |

Table Structure Recognition

For table structure recognition, we use the open-source framework OpenNMT [Klein et al., 2017] to train the image-to-text model. OpenNMT is mainly designed for neural machine translation and supports many encoder-decoder frameworks. In this task, we train our model using the image-to-text method in OpenNMT. The model is also trained on 4 NVIDIA V100 GPUs with a learning rate of 1 and a batch size of 24. For other parameters, we use the default values in OpenNMT.
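
For orientation, the legacy OpenNMT-py im2text recipe looks roughly like the following (a hedged sketch: data paths are placeholders, exact flags vary across OpenNMT-py versions, and the learning rate and batch size mirror the settings above):

    # Hedged sketch following the legacy OpenNMT-py im2text recipe;
    # data paths are placeholders and flags vary across versions.
    python preprocess.py -data_type img -src_dir data/images \
        -train_src src-train.txt -train_tgt tgt-train.txt \
        -valid_src src-val.txt -valid_tgt tgt-val.txt -save_data data/tablebank
    python train.py -model_type img -data data/tablebank -save_model tablebank-model \
        -learning_rate 1 -batch_size 24 -world_size 4 -gpu_ranks 0 1 2 3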

| Models | Word (BLEU) | Latex (BLEU) | Word+Latex (BLEU) |
| --- | --- | --- | --- |
| Image-to-Text (Word) | 59.18 | 69.76 | 65.75 |
| Image-to-Text (Latex) | 51.45 | 71.63 | 63.08 |
| Image-to-Text (Word+Latex) | 69.93 | 77.94 | 74.54 |

Model Zoo

The trained models are available for download in the TableBank Model Zoo.

Get Data and Leaderboard

**Please DO NOT re-distribute our data.**

If you use the corpus in published work, please cite it as described in the "Paper and Citation" section.

The annotations and original document images of the TableBank dataset can be downloaded from the TableBank dataset homepage.

Paper and Citation

https://arxiv.org/abs/1903.01949

@misc{li2019tablebank,
    title={TableBank: A Benchmark Dataset for Table Detection and Recognition},
    author={Minghao Li and Lei Cui and Shaohan Huang and Furu Wei and Ming Zhou and Zhoujun Li},
    year={2019},
    eprint={1903.01949},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

References

  • [Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
  • [Gilani et al., 2017] A. Gilani, S. R. Qasim, I. Malik, and F. Shafait. Table detection using deep learning. In Proc. of ICDAR 2017, volume 01, pages 771–776, Nov 2017.
  • [Wu et al., 2019] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
  • [Xie et al., 2016] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431, 2016.
  • [Klein et al., 2017] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. OpenNMT: Open-source toolkit for neural machine translation. In Proc. of ACL, 2017.
Comments
  • some table not labeled

    I found a problem in the data: some tables are not labeled. Two examples from Word.json:

    {'category_id': 1, 'area': 46280, 'iscrowd': 0, 'segmentation': [[71, 176, 71, 280, 516, 280, 516, 176]], 'id': 69303, 'image_id': 53565, 'bbox': [71, 176, 445, 104]}
    
    {'category_id': 1, 'area': 143613, 'iscrowd': 0, 'segmentation': [[66, 72, 66, 269, 795, 269, 795, 72]], 'id': 67935, 'image_id': 52492, 'bbox': [66, 72, 729, 197]}
    


    opened by rockyzhengwu 4
  • how to extract the dataset ?

    I downloaded the dataset parts, but I cannot manage to extract the files correctly.

    I tried the different commands cited here: https://unix.stackexchange.com/questions/40480/how-to-unzip-a-multipart-spanned-zip-on-linux

    But the only successful method was this one:

    cat test.zip.* > test.zip               # concatenate the split archive parts
    zip -FF test.zip --out test-full.zip    # repair the combined archive
    unzip test-full.zip                     # extract the repaired archive
    

    However, after the extraction, one of the annotation JSON files is broken and has not been extracted correctly.

    Can someone please share their way of extracting the dataset?

    opened by GTimothee 2
  • No email reply

    I have submitted the form but haven't received the reply email. Could you please send me the download link? My Gmail address is [email protected]. Thanks a lot.

    opened by luckydog5 2
  • KeyError: 'Non-existent config key: _BASE_'

    I am running the code on Google Colab with:

    !python detectron/tools/infer_simple.py --cfg /content/All_X101.yaml --output-dir /tmp/detectron-tablebank --image-ext jpg \
        --wts /content/model_final.pth /content/drive/MyDrive/TableBank/Image
    

    The error is:

    Found Detectron ops lib: /usr/local/lib/python3.7/dist-packages/torch/lib/libcaffe2_detectron_ops_gpu.so
    [E init_intrinsics_check.cc:44] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
    [E init_intrinsics_check.cc:44] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
    [E init_intrinsics_check.cc:44] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
    Traceback (most recent call last):
      File "detectron/tools/infer_simple.py", line 185, in <module>
        main(args)
      File "detectron/tools/infer_simple.py", line 125, in main
        merge_cfg_from_file(args.cfg)
      File "/content/detectron/detectron/core/config.py", line 1152, in merge_cfg_from_file
        _merge_a_into_b(yaml_cfg, __C)
      File "/content/detectron/detectron/core/config.py", line 1202, in _merge_a_into_b
        raise KeyError('Non-existent config key: {}'.format(full_key))
    KeyError: 'Non-existent config key: _BASE_'
    
    opened by omrastogi 1
  • Reproduce Precision, Recall and F1 score results from Detectron2 checkpoints

    Is it possible to share the code used to calculate the precision, recall, and F1 scores reported in the table here on GitHub? I'm talking about the results obtained with the last released Detectron2 checkpoints.

    The instructions in the paper are not so clear...

    opened by francescoperessini 1
  • why need -src src-test.txt for image to text opennmt?

    I am a bit confused; could someone please explain?

    I tried this:

    python drive/My\ Drive/OpenNMT-py/translate.py -data_type img -model drive/My\ Drive/Pretrained_Word_Embeddings/detectron_table_detection/model.pt -src_dir drive/My\ Drive/datasets/table_dataset_sample/8.jpg \
      -output pred.txt -max_length 150 -beam_size 5 -gpu 0 -verbose
    
    I am getting the same issue. I don't know what -src_dir and -src are.

    usage: translate.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG] --model
                        MODEL [MODEL ...] [--fp32] [--avg_raw_probs]
                        [--data_type DATA_TYPE] --src SRC [--src_dir SRC_DIR]
                        [--tgt TGT] [--shard_size SHARD_SIZE] [--output OUTPUT]
                        [--report_bleu] [--report_rouge] [--report_time]
                        [--dynamic_dict] [--share_vocab]
                        [--random_sampling_topk RANDOM_SAMPLING_TOPK]
                        [--random_sampling_temp RANDOM_SAMPLING_TEMP]
                        [--seed SEED] [--beam_size BEAM_SIZE]
                        [--min_length MIN_LENGTH] [--max_length MAX_LENGTH]
                        [--max_sent_length] [--stepwise_penalty]
                        [--length_penalty {none,wu,avg}] [--ratio RATIO]
                        [--coverage_penalty {none,wu,summary}] [--alpha ALPHA]
                        [--beta BETA] [--block_ngram_repeat BLOCK_NGRAM_REPEAT]
                        [--ignore_when_blocking IGNORE_WHEN_BLOCKING [IGNORE_WHEN_BLOCKING ...]]
                        [--replace_unk] [--phrase_table PHRASE_TABLE] [--verbose]
                        [--log_file LOG_FILE]
                        [--log_file_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET,50,40,30,20,10,0}]
                        [--attn_debug] [--dump_beam DUMP_BEAM] [--n_best N_BEST]
                        [--batch_size BATCH_SIZE] [--gpu GPU]
                        [--sample_rate SAMPLE_RATE] [--window_size WINDOW_SIZE]
                        [--window_stride WINDOW_STRIDE] [--window WINDOW]
                        [--image_channel_size {3,1}]
    translate.py: error: the following arguments are required: --src/-src
    

    In the documentation (Image to text), they said:

    python translate.py -data_type img -model demo-model_acc_x_ppl_x_e13.pt -src_dir data/im2text/images \
    					-src data/im2text/src-test.txt -output pred.txt -max_length 150 -beam_size 5 -gpu 0 -verbose
    

    -src_dir: The directory containing the images.

    Then why do I need -src data/im2text/src-test.txt?

    We want image-to-text, so why is the source a .txt file? Can anyone clarify?

    Thank you all

    opened by MuruganR96 1
  • Prediction Using Table Recognition

    I used the following command to predict the structure of a table:

    python translate.py -model model.pt --src_dir './tables/' --src './src_txt.txt' -output pred.txt

    and I get the following error: AssertionError: Cannot use _dir with TextDataReader.

    From your previous replies to issues https://github.com/doc-analysis/TableBank/issues/12 and https://github.com/doc-analysis/TableBank/issues/10, it looks like I can test the model by using -tgt (providing a ground-truth file).

    Can I not simply run prediction on a sample?

    opened by sindhurk 0
  • Getting this error: yaml.reader.ReaderError

    Traceback (most recent call last):
      File "tools/infer_simple.py", line 185, in <module>
        main(args)
      File "tools/infer_simple.py", line 125, in main
        merge_cfg_from_file(args.cfg)
      File "/home/anshuman/detectron/detectron/core/config.py", line 1148, in merge_cfg_from_file
        yaml_cfg = AttrDict(load_cfg(f))
      File "/home/anshuman/detectron/detectron/core/config.py", line 1142, in load_cfg
        return envu.yaml_load(cfg_to_load)
      File "/home/anshuman/Downloads/envs/myenv/lib/python3.7/site-packages/yaml/__init__.py", line 70, in load
        loader = Loader(stream)
      File "/home/anshuman/Downloads/envs/myenv/lib/python3.7/site-packages/yaml/loader.py", line 34, in __init__
        Reader.__init__(self, stream)
      File "/home/anshuman/Downloads/envs/myenv/lib/python3.7/site-packages/yaml/reader.py", line 74, in __init__
        self.check_printable(stream)
      File "/home/anshuman/Downloads/envs/myenv/lib/python3.7/site-packages/yaml/reader.py", line 144, in check_printable
        'unicode', "special characters are not allowed")
    yaml.reader.ReaderError: unacceptable character #x0002: special characters are not allowed in "", position 0

    opened by anshumankmr 0
  • How do I use your model to make inferences on my own data?

    I tried to load your model from the model zoo for Detectron, but it seems your fine-tuned model is not in their repo. I downloaded your model, but it only has a .yaml file and a .pth file; it does not have a frozen graph to make inferences with.

    How can I deploy your model to make predictions on my own data?

    opened by Kebudi 2
  • TestPretrainedModel.md detectron2

    Please update TestPretrainedModel.md for Detectron2. I think you have uploaded Detectron2 weights with a Detectron1 config; correct me if I am wrong.

    opened by kbrajwani 19