Genalog is an open source, cross-platform Python package for generating synthetic document images with custom degradations and text alignment capabilities.

Genalog - Synthetic Data Generator

Genalog is an open source, cross-platform Python package for generating document images with synthetic noise that mimics scanned analog documents (hence the name genalog). You can also add various text degradations to these images. The purpose of this tool is to provide a fast and efficient way to generate synthetic documents from text data by leveraging layouts from templates that you create in simple HTML format.

Overview

Genalog has various capabilities:

  1. Flexible format Image Generation
  2. Custom image degradation
  3. Extract Text from Images using Cognitive Search Pipeline
  4. Get OCR Performance Metrics

The aim of this project is to provide a complete solution for generating synthetic images from any text data rich in natural language, and to imitate most of the OCR noise found in scanned text documents.

Please refer to our Genalog documentation for more tutorials.

Installation

See the Genalog install guide for more details.

To install the latest release:

pip install genalog

Extra Installation Steps on macOS and Windows

Genalog depends on WeasyPrint, which in turn has non-Python dependencies (Pango, cairo and GDK-PixBuf) that need to be installed separately.

So far, the Pango, cairo and GDK-PixBuf libraries are available by default in Ubuntu 18.04 and later.

If you are running on Windows, macOS, or other Linux distributions, please see the installation instructions from WeasyPrint.

NOTE: If you encounter an error such as 'no library called "libcairo-2" was found', the cause is probably that these three extra dependencies are missing.
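
As a quick pre-flight check before importing Genalog, something like the following sketch can report which native libraries the system loader cannot find. It is not part of Genalog or WeasyPrint. The library names passed to find_library are assumptions and may differ per platform (the lookup is least reliable on Windows), so treat a "missing" result as a hint rather than proof.

```python
from ctypes.util import find_library

# Assumed short names for the native libraries; adjust for your platform.
NATIVE_LIBS = ("cairo", "pango-1.0", "gdk_pixbuf-2.0")


def missing_native_libs(names=NATIVE_LIBS, finder=find_library):
    """Return the names that the finder could not locate."""
    return [name for name in names if finder(name) is None]


if __name__ == "__main__":
    missing = missing_native_libs()
    if missing:
        print("Could not locate:", ", ".join(missing))
        print("See the WeasyPrint installation instructions.")
    else:
        print("All native libraries were found.")
```

The finder argument exists only so the check can be exercised without the libraries installed.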

Getting Started

The following is a summary of the common application scenarios of Genalog. Please refer to the Jupyter notebook examples, which make use of the core code base of Genalog and the repository utilities.

TLDR

If you are interested in a full document generation and degradation pipeline, please see the following notebook:

# | Description | In-depth Jupyter Notebook Examples
1 | Analog Document Generation Pipeline | Demo Notebook

Otherwise, we have in-depth walkthroughs of each module in Genalog.

# | Steps | In-depth Jupyter Notebook Examples | Quick Start Guides
1 | Create Template for Image Generation | Demo Notebook | Here is our guide to Document Generation
2 | Degrade Prebuilt Images | Demo Notebook | Here is our guide to Image Degradation
3 | Get Text From Images Using OCR | Demo Notebook | Here is our guide to Extracting Text
4 | Align Text Produced from OCR with Ground Truth Text | Demo Notebook | Here is our guide to Text Alignment
5 | NER Label Propagation from Ground Truth to OCR Tokens | Demo Notebook | Here is our guide to Label Propagation

We also provide notebooks for the complete end-to-end scenario of generating a synthetic dataset connecting all the components of genalog:

# | Scenario | In-depth Jupyter Notebook
1 | Synthetic Dataset Generation with LABELED NER Dataset | Demo Notebook

Other Requirements:

  1. If you want to use the OCR capabilities of Azure to extract text from the images, you will need the following resources:

    1. Azure Cognitive Search Service (see the Quickstart Guide)
    2. Azure Blob Storage (see the Quickstart Guide)

    See the Azure docs for more information on Azure Cognitive Search.

Package Release

Please see RELEASE.md for more details on the release process.

Repo Structure

genalog
├────genalog
│       ├─── generation                      # generate text images
│       ├──── degradation                    # methods for image degradation
│       ├──── ocr                            # running the Azure Search Pipeline
│       └──── text                           # methods to Align OCR Output Text with 
├────devops                                  # CI/CD pipelines
├────docs                                    # online documentation
├────examples                                # example Jupyter Notebooks for Various 
├────tests                                   # tests
├────tox.ini                                 # CI orchestration and configurations
├────README.md
└────LICENSE

Trademark Notice

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.

Microsoft Open Source Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Contribution Guidelines

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

Citing genalog

If you find genalog helpful to your work, please consider citing our tool and paper using the following BibTeX entry:

@article{
  gupte2021genalog,
  title={Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents},
  author={Gupte, Amit and Romanov, Alexey and Mantravadi, Sahitya and Banda, Dalitso and Liu, Jianjie and Khan, Raza and Meenal, Lakshmanan Ramu and Han, Benjamin and Srinivasan, Soundar},
  journal={Document Intelligence Workshop at KDD 2021},
  year={2021}
}

Collaborators

Genalog was originally developed by the MAIDAP team at Microsoft Cambridge NERD in association with the Text Analytics Team in Redmond.

Issues
  • Remove files specific to Project Enki

    Remove files pertaining to Project Enki only:

    1. Scripts pertaining to the TA model and the unlabeled dataset scenario.
    2. Scripts pertaining to the specific Azure resource.
    opened by laserprec 4
  • Can we add line_spacing?

    Hello, I am trying to add line spacing. Even though I add new lines manually in the txt file, they are still removed.

    with open(txt_path, 'r') as f:
        text = f.read()
    
    # Initialize Content Object
    text = text.replace('\n', '\n\n')
    paragraphs = text.split('\n\n\n')
    

    Printing paragraphs gives the expected result; however, default_generator.set_styles_to_generate(new_style_combinations) somehow removes the blank lines. Thank you in advance.

    opened by egenc 2
  • Update Build Status Badge

    Update build status badge to point to https://dev.azure.com/genalog-dev/genalog/_build

    opened by laserprec 2
  • Add build.yml

    Add initial build files

    opened by laserprec 1
  • gray image only?

    Hi, Can genalog degrade RGB images?

    opened by cloudfool 1
  • What about Genalog and other languages?

    What about Genalog in c#?

    opened by Oleg26Dev 1
  • Add copyright disclaimer

    Adding copyright disclaimers to source files.

    opened by laserprec 0
  • Fix installation link

    opened by laserprec 0
  • Laserprec/install_link_update

    1. Update installation link from TestPyPI
    2. Add AnalogDocumentGenerator diagram
    opened by laserprec 0
  • Laserprec/jupyter book doc

    opened by laserprec 0
  • genalog does not work with newer versions of weasyprint

    Newer versions of weasyprint (53.x) removed their dependency on cairo and do not support PNG exports anymore (see Kozea/WeasyPrint#1232 and https://www.courtbouillon.org/blog/00004-weasyprint-without-cairo-what-s-different)

    This breaks some parts of the genalog code, specifically the following methods are affected (as far as I have seen) https://github.com/microsoft/genalog/blob/b8b9fbabdde5855e8bcc9db025e78cc619e15cd1/genalog/generation/document.py#L97 https://github.com/microsoft/genalog/blob/b8b9fbabdde5855e8bcc9db025e78cc619e15cd1/genalog/generation/document.py#L127

    Maybe a warning about this should be added to the documentation, along with the plan moving forward.

    Thank you :D

    opened by pedromb 2
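
    Until this is addressed, a guard along the following lines could warn users early. This is only a sketch based on the report above: the cut-off of major version 53 comes from that report, the helper names are hypothetical, and nothing here is part of Genalog. It relies on importlib.metadata, which requires Python 3.8 or later.

```python
from importlib import metadata
from typing import Optional

# Assumption taken from the issue report: PNG export was removed in 53.x.
PNG_REMOVED_IN_MAJOR = 53


def supports_png_export(version: str) -> bool:
    """Return True if the given WeasyPrint version predates the PNG removal."""
    major = int(version.split(".")[0])
    return major < PNG_REMOVED_IN_MAJOR


def installed_weasyprint_version() -> Optional[str]:
    """Return the installed WeasyPrint version, or None if it is not installed."""
    try:
        return metadata.version("weasyprint")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_weasyprint_version()
    if version is None:
        print("WeasyPrint is not installed.")
    elif supports_png_export(version):
        print("WeasyPrint", version, "should still support PNG export.")
    else:
        print("WeasyPrint", version, "no longer supports PNG export.")
```
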
  • other languages

    Does genalog support the Arabic language? Thanks in advance.

    opened by OmarMohammed88 1
  • Adding ability to extract a template CSS from a given PDF or image file

    Genalog is great at generating a synthetic document from a given template, but coming up with a template is still a pain.

    Wouldn't it be great if I can just point Genalog to a PDF or image, and ask it to synthesize more documents like that?

    In other words, can we add the functionality of extracting a CSS template out of a given PDF/image, to complete the cycle?

    Thanks!

    Document Intelligence

    enhancement 
    opened by document-intelligence 2
Releases(v0.1.0)
  • v0.1.0(Jul 20, 2021)

    Genalog Changelog

    All notable changes to this project will be documented in this file.

    Types of changes

    1. Added for new features.
    2. Changed for changes in existing functionality.
    3. Deprecated for soon-to-be removed features.
    4. Removed for now removed features.
    5. Fixed for any bug fixes.
    6. Security in case of vulnerabilities.

    The format is based on Keep a Changelog, and this project adopts Semantic Versioning.

    [v0.1.0] - 2021-07-20

    Added

    • Initial package release:
      • 3 standard HTML document templates for generation
      • basic image degradation effects including blur, bleed-through, salt & pepper and other morphological operations.
      • 2 flavors of text alignment algorithm: Needleman-Wunsch (shorter text segments) and RETAS (longer text segments)
      • Full e2e NER-OCR label generation notebooks
      • See documentation for more on the initial features of the package.
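
    For readers unfamiliar with the first alignment flavor named above, here is a minimal, textbook-style sketch of Needleman-Wunsch global alignment on characters. It is not Genalog's implementation: the function name, the scoring values and the "@" gap character are all illustrative assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1, gap_char="@"):
    """Globally align strings a and b, padding with gap_char where needed."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(gap_char)
            i -= 1
        else:
            out_a.append(gap_char)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    truth, ocr = needleman_wunsch("New York", "Nw York")
    print(truth)  # New York
    print(ocr)    # N@w York
```

    The table is quadratic in time and memory, which is consistent with the changelog recommending this flavor for shorter text segments.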

    Changes:

    • 9b0a92c45fb948f00bde820aae57d24749ca30c8 Release 0.1.0 (#32)
    • 9047cd6197feb49a1e27468c1484999a504210e0 Laserprec/bugfix img save (#31)
    • 7c25f068018e1c167cfafb4c935dcec27df22a65 Laserprec/production_release (#30)
    • 0e982f2724e9f8ef0e88bfd2d09230167027410b Laserprec/jupyter book doc (#28)
    • 6180948ce5859b31f8e1f13ad10da8eeb068a599 Add copyright disclaimer (#27)
    • e01b609fe7f4df669066fc371fb2b505dba0ec6d Fix installation link (#26)
    • 09dd21405d76bc81aba4135874462c1b8c312800 Laserprec/install_link_update (#25)
    • cb4bf5981483ed48931c5cfa5ffbebc99641ecbb Update notebook documentation (#23)
    • 58f3febadbf72b04e6a707b97fdf4fb9d9bbafac Merge pull request #22 from microsoft/laserprec-fix-broken-img-link
    • ababee2c9f34f91edac70c41ee2dd59eac8bee31 Update README.md
    • 3362cdf6be3f73963f1a28cb97f012632e8521de Merge pull request #21 from microsoft/laserprec/update_readme
    • 8e7b1576da2a6d4dd927cd248d7b8d0f682c1a68 Update installation instructions
    • 4c91ea2569ffba9684ae4dd04e9015a2f5b138a8 Merge pull request #20 from microsoft/laserprec/e2e_tests
    • 31d259b2e36700fe98eb7f8cdc4326cf2302c164 Add parameter value
    • 7f989c7cadb5c7f163ab24a3351c9f0e0e8b45b1 Runn all e2e tests
    • 2942d34267a15c9af7af1715abbc703cdbb743a0 Merge pull request #19 from microsoft/laserprec/sphinx_doc
    • f1bfe1f951d1619155fffae2cb3411771829923a Add badge for supported platforms
    • a97491cabd3a0872ac42c35b841031ee361e518d Correct docstring formatting
    • 9f633e43aac13a8a2ea3cbbe362ba06ba0de8c4b Convert to google style docstring (keyword arg & code block)
    • 62692b10ca6f04a1a330b125ad154ebdbdff4d18 Convert to google style docstring (parameter type & default to)
    • d65fca7934065a8b397d9e6a5d4fe1654b23fd02 Add sphinx config
    • 5c4b634da99f06fafa25d158f4e973544f147926 Merge pull request #18 from microsoft/laserprec/comp_governance
    • c5aebb43721684566966f7d8fd8bc9705b811ae1 Add component governance as a manual step
    • 9c858e6e19928f9e17ffd4800f45b3d327e2f44c Merge pull request #17 from microsoft/laserprec/xdist_tests
    • 4bd93010d65c0a7df646382c5819406d4772ddeb Disable xdist in CI pipeline
    • 0d45d80a763978f5ac031f688a601f136488c70d Xdist-runnable tests
    • c90deed73ef921d6a57fe1c5bab78cc098173a21 Merge pull request #16 from microsoft/laserprec/update_badge
    • 51af25cc8c54ad0f931d76b7c579eef8db228d24 Update and add new badges
    • c2c9bc0d3a7a5ef42872ac6bf9c6b78ec7e60750 Remove PR trigger
    • 5339c76d05fe5971efe701acb98668c772e55cf7 Merge pull request #15 from microsoft/laserprec/ci_templates
    • 003bf3154a637783e2017720089d7e1105c2e6a8 Add publish artifacts step
    • e074a1d44fb8d5a562b0c2b95104522bebc979ac Remove test pipeline
    • 4456fcaffa0263d11169c3049e288c1da3326aac Update test matrix
    • 8d8b12dbc5d13ad34d4e87e6103abb831493096e Small format change
    • cac5394a69fc87a8160f5d19c94d770c1e478ec4 Use variable group
    • c7e1b62cdceef7e801579cccdb48dcbc195da3a7 Restructure templates
    • d83305e1f9166c324a9e6a375cc04d39f8878604 Run azure tests
    • be824f4890acb5ac359085ad4b1e0102a36c5fe9 Update PR-gate
    • 60585902d15a20605c43d1b8957f2e828d4275b8 Add nightly build
    • 547aa2c0f940c2ed4ef5193e7f444f8ac6864e41 Skip install direct dep
    • 6316a9c9536aaf1f9206f73c51fb1c50c080cde8 Cache code coverage report
    • 67c12a3c82ce239372581345d939ec33480d9948 Add final cov stage
    • b534f479edc522a7dd9aa29277e60c0d3afaeadc Separate test report in a template
    • 190d823afcaaacaa137f078466914bb0eebbf4aa Mark more io tests
    • 05f9649e3daffe4d68efd76b970fc56277940f7b Add detail test summary report
    • 4ff93d3cb50fa18a94455f3e9c5b20c517754d7a Run tests in 3.6-8
    • b67c6b34ab6136f0931d53fbaa50ccfc0487d49b Update templates
    • 36875705b4b91b85a879a3ab71e7b0389b52e4dd Add slow tests
    • 7ae1d6c24432e05f2e86be6ca81df23cac2996fc Add CI templates
    • cc31cf5866db629dbeb84d043c57e3fa2f63437b Add postargs to tox
    • 7cf054acfd7f1d638b1b2d9bd0863816249bcf9e Merge pull request #14 from microsoft/laserprec/bugfix/silentDiskWrite
    • 9e4c37ada77252070190c45d5399786e48087230 Raise errors when writing to disk
    • 1227eff672a9ec8c945f187be43904e1bed59ced Merge pull request #13 from microsoft/laserprec/bugfix/trailinglash
    • 741c61c4d3ec90eb663b9e5b817a85a16e85c649 Bugfix img not generated due to missing trailling '/' in dist folder
    • d738bbb93e3a5ed56248680c435adfa5d4c9ab00 Merge pull request #12 from microsoft/laserprec/use_tox
    • 1ea5dd4ffac3b9f48fc5e56a9dfaa641f3297a85 Restructure tests/ocr/data
    • f4bd57686c1c90f65c3f4279baa139d361201b3c Fix flake8 issues
    • b9e40a4511917edc35c55fc8d66bddbe7f4d2180 Use tox in CI
    • 1cbe5ad49e90d064560ae04a92401c8b9e083330 Add logging and pytest marker for slow and azure tests
    • b6d1af74548c841ec0ed1ef4d310176a735fed17 Update CODEOWNERS
    • d9d19bc3fd9f116fe26651ae1892a1a5c37a49f3 Pass environment vars to tox
    • 4132a8ff06d19487aca81f003f01dd484c8709d5 Tox-runnable
    • 3b293bd856336988395a7aef209094d801a7ee6b Restructure unit tests
    • 8da38073ff493b3413d7216e605aca66a8b514e6 Merge pull request #11 from microsoft/laserprec/restructure_ci
    • 1793c3db023b201fad11422e75f89f3d71ce6afc Version bump to alpha3 to test release pipeline
    • 55659aca5fd8f0bf59c7082aa92a7348210f9b76 Update build badge url
    • 098fefc1d383b308293ac87141fa81ca85d5952d Relocate CI pipelines under devops folder
    • cda2c1e77d2ea6d39a6e3d0583954e5ca1896c8e Merge pull request #10 from microsoft/laserprec/use_linter
    • 4c8445b8742f5eee3be23ef23e7ffcf58384af86 Fix import orders
    • f8c1e63b37ad2c6d0285ec31c920c4724d63d198 Add flake8 plugin to check import orders
    • a9620f51528c9715b33408a3ed9c902475c76eda Update build pipeline
    • 5b87f4eddb7176f901f1deb7727cf0c5ba1c90ad Fix flake8 issues
    • 96620b4c434bf18b5608d870d9d00da0150547c1 Fix linting issues in tests
    • ad504b4d7b4052bd21976802842949e80f45d1a5 Merge pull request #9 from microsoft/laserprec/hotfix_blob_version
    • f7db94fc79aa7fb9021591cf6d98918753900f34 Remove deprecated azure-storage dependencies
    • 18facb099b2a86ae03b6ea4778281415f72be84f Update CI trigger
    • 3ddee11879f582378986e3ede323ec027e8bd293 Hotfix azure blob version
    • f759f5037057ed366d961c7c54b7ce654849a4ba Merge pull request #6 from microsoft/laserprec/release-pipeline
    • d8cc7ffa8ceaf7b365881a042f23e67dcd67e075 Merge pull request #7 from microsoft/laserprec/update-build-status-badge
    • 18d505e2f180c50d3b39fb456aabad514b7305c2 Update README.md
    • d7c33cd7f65619df3ba509a6ee0bc539fe529d3b Add release pipeline
    • d83d9da888552001a21bf629de9d30c615c54844 Merge pull request #4 from microsoft/laserprec/ci-pipeline-fix
    • 4a1f8845690276aff95c0ae8d14e09a60dfd297a Add build env configurations
    • b4c7eee5037d3c41baa7c6ab50a5751095ec04fe Add absolute path to example links and update installation instruction
    • 2a4e6104a9e263a9e26377d31779251a25ac8dc5 fix badge link
    • 8b5961cc754e3f93ede3b7454f425b23a523f0bc read projection blob name from env
    • 87453a472aa708b2f03cfc11f2c70156781319c8 Merge pull request #3 from microsoft/dbanda/add_cicd
    • 7612a69ef7dcc12a657985247ebe68e412e0696e Update azure-pipeline.yml
    • f520a17e8ee12f67f544d3a774f90520d637c431 Update azure-pipeline.yml
    • cf3be451173ec1cd7c2cc5bddf0129d229fee6d5 add pipelines
    • c7d63b627f407bea7cbe7bea20f8becf1a60c31c Code Migration from Azure DevOps (#2)
    • 8ca2c5127f320fb23954cad2b85b5196126ff218 Initial README.md commit
    • eeeb6f5f92a4e2460acac0066d663341081b3f8f Initial SECURITY.md commit
    • 3d8eadcf54ac58b5a17f4678ad4e9d8f66e841a6 Initial LICENSE commit
    • b95102a864ef06a1eb141c3ca23a2643dca5d5a3 Initial CODE_OF_CONDUCT.md commit

    This list of changes was auto generated.

    Source code(tar.gz)
    Source code(zip)
    genalog-0.1.0-py3-none-any.whl(59.76 KB)
    genalog-0.1.0.tar.gz(77.18 KB)
Owner
Microsoft
Open source projects and samples from Microsoft