SDL: Synthetic Document Layout dataset

Overview

SDL is a project for synthesizing document images. It supports multi-level labeling of the generated images and can produce documents in multiple languages.

Sample image

image

Structure of data

structure

Quick start

python flexible_layout.py --config_file configs/page.yaml

Instructions for running data generation

Go to instructions

Visualization of the result

python data_manipulation/visualize.py

Link to the Vietnamese dataset (300,000 images):

To be released soon

Paper

https://arxiv.org/abs/2106.15117

Comments
  • could not find expected ':'

    Hi there, this repository is amazing! I understand that you are currently finalizing it, yet I am eager to test it.

    After removing the merge-conflict markers (<<<<<<< HEAD, =======, and >>>>>>> 60067811e5ad5b88814b1ae5fbf1df85b0ff3524) from flexible_layout.py, I get the following error when trying to run it:

    (udet) home@home-lnx:~/programs/SDL-Document-Image-Generation$ python flexible_layout.py --config_file configs/page.yaml
    Traceback (most recent call last):
      File "flexible_layout.py", line 145, in <module>
        args.merge_from_file(config_file)
      File "/home/home/programs/SDL-Document-Image-Generation/helper/config.py", line 6, in merge_from_file
        res = yaml.load(open(path,'r'), Loader=yaml.Loader)
      File "/home/home/anaconda3/envs/udet/lib/python3.7/site-packages/yaml/__init__.py", line 114, in load
        return loader.get_single_data()
      File "/home/home/anaconda3/envs/udet/lib/python3.7/site-packages/yaml/constructor.py", line 49, in get_single_data
        node = self.get_single_node()
      File "/home/home/anaconda3/envs/udet/lib/python3.7/site-packages/yaml/composer.py", line 36, in get_single_node
        document = self.compose_document()
      File "/home/home/anaconda3/envs/udet/lib/python3.7/site-packages/yaml/composer.py", line 55, in compose_document
        node = self.compose_node(None, None)
      File "/home/home/anaconda3/envs/udet/lib/python3.7/site-packages/yaml/composer.py", line 84, in compose_node
        node = self.compose_mapping_node(anchor)
      File "/home/home/anaconda3/envs/udet/lib/python3.7/site-packages/yaml/composer.py", line 127, in compose_mapping_node
        while not self.check_event(MappingEndEvent):
      File "/home/home/anaconda3/envs/udet/lib/python3.7/site-packages/yaml/parser.py", line 98, in check_event
        self.current_event = self.state()
      File "/home/home/anaconda3/envs/udet/lib/python3.7/site-packages/yaml/parser.py", line 428, in parse_block_mapping_key
        if self.check_token(KeyToken):
      File "/home/home/anaconda3/envs/udet/lib/python3.7/site-packages/yaml/scanner.py", line 115, in check_token
        while self.need_more_tokens():
      File "/home/home/anaconda3/envs/udet/lib/python3.7/site-packages/yaml/scanner.py", line 152, in need_more_tokens
        self.stale_possible_simple_keys()
      File "/home/home/anaconda3/envs/udet/lib/python3.7/site-packages/yaml/scanner.py", line 292, in stale_possible_simple_keys
        "could not find expected ':'", self.get_mark())
    yaml.scanner.ScannerError: while scanning a simple key
      in "configs/page.yaml", line 2, column 1
    could not find expected ':'
      in "configs/page.yaml", line 3, column 1
    
    opened by seekingdeep 3
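
The scanner error above points at lines 2 and 3 of configs/page.yaml. One possible cause, which is an assumption and is not confirmed in the report, is that the config file still contains Git merge-conflict markers like the ones that were removed from flexible_layout.py. A minimal sketch for checking a file's text before loading it follows; the helper name find_conflict_markers and the sample text are hypothetical and not part of SDL.

```python
import re

# Git writes conflict markers at the start of a line: exactly seven '<', '='
# or '>' characters, optionally followed by a space and a label.
CONFLICT_MARKER = re.compile(r"^(<{7}|={7}|>{7})( |$)")


def find_conflict_markers(text):
    """Return (line_number, line) pairs for lines that look like conflict markers.

    Hypothetical helper for illustration only; not part of the SDL code base.
    """
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if CONFLICT_MARKER.match(line):
            hits.append((number, line))
    return hits


# Made-up sample text standing in for a config file with leftover markers.
sample = "name: page\n<<<<<<< HEAD\nwidth: 100\n=======\nwidth: 200\n>>>>>>> abc123\n"
for number, line in find_conflict_markers(sample):
    print(f"line {number}: {line}")
```

To check a real file, read it with open(path, encoding="utf-8").read() and pass the text to the helper. If it reports nothing, the YAML error has a different cause, such as a key that is missing its colon.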
Owner
Sơn Nguyễn
Self-taught programmer. Completed courses: CS50, MIT 6.006. Preferred language: Python.