Genalog is an open source, cross-platform Python package for generating synthetic document images with custom degradations and text alignment capabilities.

Microsoft

Last update: Dec 22, 2022

Related tags

Computer Vision python data-science machine-learning synthetic-images data-generation ner ocr-recognition text-alignment synthetic-data synthetic-data-generation

Genalog - Synthetic Data Generator

Genalog is an open source, cross-platform Python package for generating document images with synthetic noise that mimics scanned analog documents (hence the name genalog). You can also add various text degradations to these images. The tool provides a fast and efficient way to generate synthetic documents from text data, using layouts from templates that you create in simple HTML format.
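To make the idea of template-driven layout concrete, here is a minimal sketch that fills a tiny HTML page from plain text. It is not Genalog's API: it uses the standard library's `string.Template` as a stand-in for a real template engine, and the `PAGE` layout, the `render_page` function and the `font` parameter are all hypothetical names chosen for illustration.

```python
from html import escape
from string import Template

# Hypothetical single-column layout; not one of Genalog's bundled templates.
PAGE = Template(
    "<html><body style='font-family: $font'>\n"
    "$paragraphs\n"
    "</body></html>"
)


def render_page(text, font="Times"):
    """Split text on blank lines and wrap each non-empty chunk in a <p> element."""
    chunks = [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]
    body = "\n".join(f"<p>{escape(chunk)}</p>" for chunk in chunks)
    return PAGE.substitute(font=font, paragraphs=body)


print(render_page("First paragraph.\n\nSecond & last paragraph."))
```

Because HTML collapses runs of whitespace, paragraph breaks have to be expressed as markup (here, one `<p>` per chunk) rather than as blank lines in the text.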

Overview

Genalog has various capabilities:

Flexible format Image Generation
Custom image degradation
Extract Text from Images using Cognitive Search Pipeline
Get OCR Performance Metrics

The aim of this project is to provide a complete solution for generating synthetic images from any text data rich in natural language, and to imitate most of the OCR noise found in scanned text documents.

Please refer to our Genalog documentation for more tutorials.

Installation

See the Genalog install guide for more details.

To install the latest release:

pip install genalog

Extra Installation Steps on macOS and Windows

We have a dependency on WeasyPrint, which in turn has non-Python dependencies, including Pango, cairo and GDK-PixBuf, that need to be installed separately.

Pango, cairo and GDK-PixBuf are available by default in Ubuntu 18.04 and later.

If you are running on Windows, macOS, or another Linux distribution, please see the installation instructions from WeasyPrint.

NOTE: If you encounter errors like no library called "libcairo-2" was found, the cause is probably that these three extra dependencies are missing.
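As a quick pre-flight check, the sketch below asks the system's library lookup whether the three native dependencies can be located. It relies only on `ctypes.util.find_library` from the standard library. The candidate names in `NATIVE_LIBS` are assumptions (actual library names differ by platform and package manager), and a `None` result is a hint rather than proof that WeasyPrint will fail to load the library.

```python
from ctypes.util import find_library

# Candidate library names are assumptions; adjust them for your platform.
NATIVE_LIBS = {
    "cairo": ("cairo", "cairo-2", "libcairo-2"),
    "Pango": ("pango-1.0", "pango-1.0-0", "libpango-1.0-0"),
    "GDK-PixBuf": ("gdk_pixbuf-2.0", "gdk_pixbuf-2.0-0", "libgdk_pixbuf-2.0-0"),
}


def find_native_libs(candidates=NATIVE_LIBS):
    """Map each dependency to the first candidate name that resolves, or None."""
    found = {}
    for label, names in candidates.items():
        hits = (find_library(name) for name in names)
        found[label] = next((hit for hit in hits if hit), None)
    return found


for label, location in find_native_libs().items():
    print(f"{label}: {location or 'NOT FOUND'}")
```

If any entry prints `NOT FOUND`, the WeasyPrint installation instructions for your platform are the place to look next.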

Getting Started

The following is a summary of the common application scenarios of Genalog. Please refer to the Jupyter notebook examples, which make use of the core code base of Genalog and the repository utilities.

TLDR

If you are interested in a full document generation and degradation pipeline, please see the following notebook:

	Description	In-depth Jupyter Notebook Examples
1	Analog Document Generation Pipeline	Demo Notebook

Otherwise, we have in-depth walkthroughs of each module in Genalog.

	Steps	In-depth Jupyter Notebook Examples	Quick Start Guides
1	Create Template for Image Generation	Demo Notebook	Here is our guide to Document Generation
2	Degrade Prebuilt Images	Demo Notebook	Here is our guide to Image Degradation
3	Get Text From Images Using OCR	Demo Notebook	Here is our guide to Extracting Text
4	Align Text Produced from OCR with Ground Truth Text	Demo Notebook	Here is our guide to Text Alignment
5	NER Label Propagation from Ground Truth to OCR Tokens	Demo Notebook	Here is our guide to Label Propagation
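To illustrate what step 4 in the table above involves, here is a minimal sketch of global sequence alignment in the style of Needleman-Wunsch, one of the two alignment algorithms named in the v0.1.0 changelog. It is not Genalog's implementation: the function name `needleman_wunsch`, the scoring values, and the `gap_char` placeholder are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1, gap_char="@"):
    """Globally align strings a and b, padding both with gap_char to equal length."""
    rows, cols = len(a) + 1, len(b) + 1

    def pair_score(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming score table.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align the two characters
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(gap_char)
            i -= 1
        else:
            out_a.append(gap_char)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


ground_truth, ocr_text = needleman_wunsch("New York", "Nw Yrk")
print(ground_truth)
print(ocr_text)
```

The two returned strings have the same length, so position k in the ground truth lines up with position k in the OCR text. That correspondence is what makes a later step such as label propagation possible. The table is quadratic in the input lengths, which is consistent with the changelog recommending this flavor for shorter text segments.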

We also provide notebooks for the complete end-to-end scenario of generating a synthetic dataset connecting all the components of genalog:

	Scenario	In-depth Jupyter Notebook
1	Synthetic Dataset Generation with LABELED NER Dataset	Demo Notebook

Other Requirements:

If you want to use the OCR capabilities of Azure to extract text from the images, you will need the following resources:
1. Azure Cognitive Search Service Quickstart Guide Here
2. Azure Blob Storage Quickstart Guide Here
See Azure Docs for more information on Azure Cognitive Search.

Package Release

Please see RELEASE.md for more details on the release process.

Repo Structure

genalog
├────genalog
│       ├─── generation                      # generate text images
│       ├──── degradation                    # methods for image degradation
│       ├──── ocr                            # running the Azure Search Pipeline
│       └──── text                           # methods to Align OCR Output Text with 
├────devops                                  # CI/CD pipelines
├────docs                                    # online documentation
├────examples                                # example Jupyter Notebooks for Various 
├────tests                                   # tests
├────tox.ini                                 # CI orchestration and configurations
├────README.md
└────LICENSE

Trademark Notice

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.

Microsoft Open Source Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Contribution Guidelines

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

Citing `genalog`

If you find genalog helpful to your work, please consider citing our tool and paper using the following BibTeX entry:

@article{
  gupte2021genalog,
  title={Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents},
  author={Gupte, Amit and Romanov, Alexey and Mantravadi, Sahitya and Banda, Dalitso and Liu, Jianjie and Khan, Raza and Meenal, Lakshmanan Ramu and Han, Benjamin and Srinivasan, Soundar},
  journal={Document Intelligence Workshop at KDD 2021},
  year={2021}
}

Collaborators

Genalog was originally developed by the MAIDAP team at Microsoft Cambridge NERD in association with the Text Analytics Team in Redmond.

Comments

Remove files specific to Project Enki
Remove files pertaining to Project Enki only:

Scripts pertaining to the TA model and the unlabeled dataset scenario.

Scripts pertaining to the specific Azure resource.
opened by laserprec 4
Retrieve position of rendered document

I want to use this tool to generate a synthetic dataset for the detection phase of the OCR pipelines, I wonder if there is a way to get a location (bounding box) of each word that is rendered to the final documents?

opened by parsa-ra 2
Can we add line_spacing?
Hello, I am trying to add line spacing. Even though I add new lines manually in the txt file, they still get removed.

    with open(txt_path, 'r') as f:
        text = f.read()
    # Initialize Content Object
    text = text.replace('\n', '\n\n')
    paragraphs = text.split('\n\n\n')

Printing paragraphs gives the expected result; however, default_generator.set_styles_to_generate(new_style_combinations) somehow removes the blank lines. Thank you in advance.
opened by egenc 2
How to run tests?

The RELEASE.md document does not specify how to run tests.

Would be good to have the information about running the tests in RELEASE.md in the "Preparation" step

opened by jgc128 0
Laserprec/production_release
Extend release pipeline to release onto PyPI

Update documentations

Add more badges

Skip flaky e2e test with Azure OCR (planning to deprecate genalog.ocr)
opened by laserprec 0
Update notebook documentation
Add a TLDR NB for analog document generation pipeline

Update documents on other Jupyter NBs

Resolve some broken links in the NBs and READMEs

Update main README
opened by laserprec 0
Bump numpy from 1.18.1 to 1.22.0
Bumps numpy from 1.18.1 to 1.22.0.

Release notes

Sourced from numpy's releases.

v1.22.0

NumPy 1.22.0 Release Notes

NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.

A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.

NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.

New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.

A new configurable allocator for use by downstream projects.

These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

Expired deprecations

Deprecated numeric style dtype strings have been removed

Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

(gh-19539)

Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

(gh-19615)

... (truncated)

Commits

4adc87d Merge pull request #20685 from charris/prepare-for-1.22.0-release

fd66547 REL: Prepare for the NumPy 1.22.0 release.

125304b wip

c283859 Merge pull request #20682 from charris/backport-20416

5399c03 Merge pull request #20681 from charris/backport-20954

f9c45f8 Merge pull request #20680 from charris/backport-20663

794b36f Update armccompiler.py

d93b14e Update test_public_api.py

7662c07 Update init.py

311ab52 Update armccompiler.py

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
Dependencies specified too strictly

The dependencies specified in this project's requirements.txt are too strictly pinned, making it hard to install.

For instance, opencv-contrib-python==4.2.0.34 doesn't have releases for Python 3.9 at all.

It might be a good idea to change the requirements.txt specifiers to ~= instead.

Related SO issue: https://stackoverflow.com/a/72613708/51685
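The suggestion above can be sketched as a small script that rewrites exact pins into compatible-release specifiers. This is only an illustration under stated assumptions: the function name `relax_pins` is hypothetical, and the regular expression handles only simple `name==X.Y...` lines, leaving comments, environment markers, and other specifiers untouched.

```python
import re

# Matches simple "name==X.Y[.Z...]" pins with at least two version segments,
# because "~=" is not valid with a single-segment version.
_PIN = re.compile(
    r"^(\s*[A-Za-z0-9][A-Za-z0-9._-]*(?:\[[^\]]*\])?\s*)==(\s*\d+(?:\.\d+)+\s*)$"
)


def relax_pins(lines):
    """Rewrite exact '==' pins to compatible-release '~=' pins."""
    return [_PIN.sub(r"\1~=\2", line) for line in lines]


pinned = ["numpy==1.18.1", "opencv-contrib-python==4.2.0.34", "# a comment"]
for line in relax_pins(pinned):
    print(line)
```

Note that `~=` keeps every version segment except the last one fixed, so `~=4.2.0.34` still only admits `4.2.0.*` releases. For a long version string like that one, the specifier would also need to be shortened (for example to two segments) before newer releases become installable.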

opened by akx 0
Access different pages if content is across multiple pages

If my content is more than a page long, what does doc.render_png() return? Looking at the source code, it seems to return it per page. How can I access the different pages programmatically to degrade them?

opened by hummingbird1989 0
Question about generate new templates(html.jinja)

Hi, thank you for sharing this nice work. @laserprec

I want to make my own templates from an HTML file that I already have. How can I create the Jinja file that matches my HTML (or PDF) file?

Could you please give me some tips on this?

In addition, can I use a template that matches my own PDF files?

I saw an issue about this, but I could not work out exactly how to do it.

opened by yellowjs0304 0
Minor typos

Hi. Thanks for sharing this nice work.

There seem to be some minor typos in the docstrings.

https://github.com/microsoft/genalog/blob/b8b9fbabdde5855e8bcc9db025e78cc619e15cd1/genalog/text/alignment.py#L203-L207

Maybe it should be changed as below?
'.' is for substition → '.' is for substitution
'-' indicates gap → ' ' indicates gap

Thanks.

opened by whwang299 1
genalog does not work with newer versions of weasyprint

Newer versions of weasyprint (53.x) removed their dependency on cairo and do not support PNG exports anymore (see Kozea/WeasyPrint#1232 and https://www.courtbouillon.org/blog/00004-weasyprint-without-cairo-what-s-different)

This breaks some parts of the genalog code. Specifically, the following methods are affected (as far as I have seen):
https://github.com/microsoft/genalog/blob/b8b9fbabdde5855e8bcc9db025e78cc619e15cd1/genalog/generation/document.py#L97
https://github.com/microsoft/genalog/blob/b8b9fbabdde5855e8bcc9db025e78cc619e15cd1/genalog/generation/document.py#L127

Maybe some warning should be added to the documentation regarding this and what is the plan moving forward.

Thank you :D

opened by pedromb 2

Releases(v0.1.0)

v0.1.0(Jul 20, 2021)
Genalog Changelog

All notable changes to this project will be documented in this file.

Types of changes

Added for new features.

Changed for changes in existing functionality.

Deprecated for soon-to-be removed features.

Removed for now removed features.

Fixed for any bug fixes.

Security in case of vulnerabilities.

The format is based on Keep a Changelog, and we adopt the Semantic Versioning.

[v0.1.0] - 2021-07-20

Added

Initial package release:

3 standard HTML document templates for generation

basic image degradation effects including blur, bleed-through, salt & pepper and other morphological operations.

2 flavors of text alignment algorithms: Needleman-Wunsch (shorter text segments) and RETAS (longer text segments)

Full e2e NER-OCR label generation notebooks

See documentation for more on the initial features of the package.
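As an illustration of one of the degradation effects listed above, here is a minimal pure-Python sketch of salt & pepper noise on a grayscale image stored as rows of 0-255 integers. It is not Genalog's implementation: the function name `salt_and_pepper`, the `amount` and `seed` parameters, and the list-of-lists image representation are assumptions chosen to keep the example dependency-free.

```python
import random


def salt_and_pepper(image, amount=0.05, seed=None):
    """Return a copy of image with roughly `amount` of its pixels set to 0 or 255."""
    rng = random.Random(seed)
    noisy = [row[:] for row in image]  # copy rows so the input stays unchanged
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))  # pepper (black) or salt (white)
    return noisy


page = [[200] * 8 for _ in range(4)]  # a small, uniformly light-gray "page"
for row in salt_and_pepper(page, amount=0.2, seed=0):
    print(row)
```

Passing a fixed `seed` makes the noise reproducible, which helps when the same degraded image needs to be regenerated for a dataset.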

Changes:

9b0a92c45fb948f00bde820aae57d24749ca30c8 Release 0.1.0 (#32)

9047cd6197feb49a1e27468c1484999a504210e0 Laserprec/bugfix img save (#31)

7c25f068018e1c167cfafb4c935dcec27df22a65 Laserprec/production_release (#30)

0e982f2724e9f8ef0e88bfd2d09230167027410b Laserprec/jupyter book doc (#28)

6180948ce5859b31f8e1f13ad10da8eeb068a599 Add copyright disclaimer (#27)

e01b609fe7f4df669066fc371fb2b505dba0ec6d Fix installation link (#26)

09dd21405d76bc81aba4135874462c1b8c312800 Laserprec/install_link_update (#25)

cb4bf5981483ed48931c5cfa5ffbebc99641ecbb Update notebook documentation (#23)

58f3febadbf72b04e6a707b97fdf4fb9d9bbafac Merge pull request #22 from microsoft/laserprec-fix-broken-img-link

ababee2c9f34f91edac70c41ee2dd59eac8bee31 Update README.md

3362cdf6be3f73963f1a28cb97f012632e8521de Merge pull request #21 from microsoft/laserprec/update_readme

8e7b1576da2a6d4dd927cd248d7b8d0f682c1a68 Update installation instructions

4c91ea2569ffba9684ae4dd04e9015a2f5b138a8 Merge pull request #20 from microsoft/laserprec/e2e_tests

31d259b2e36700fe98eb7f8cdc4326cf2302c164 Add parameter value

7f989c7cadb5c7f163ab24a3351c9f0e0e8b45b1 Runn all e2e tests

2942d34267a15c9af7af1715abbc703cdbb743a0 Merge pull request #19 from microsoft/laserprec/sphinx_doc

f1bfe1f951d1619155fffae2cb3411771829923a Add badge for supported platforms

a97491cabd3a0872ac42c35b841031ee361e518d Correct docstring formatting

9f633e43aac13a8a2ea3cbbe362ba06ba0de8c4b Convert to google style docstring (keyword arg & code block)

62692b10ca6f04a1a330b125ad154ebdbdff4d18 Convert to google style docstring (parameter type & default to)

d65fca7934065a8b397d9e6a5d4fe1654b23fd02 Add sphinx config

5c4b634da99f06fafa25d158f4e973544f147926 Merge pull request #18 from microsoft/laserprec/comp_governance

c5aebb43721684566966f7d8fd8bc9705b811ae1 Add component governance as a manual step

9c858e6e19928f9e17ffd4800f45b3d327e2f44c Merge pull request #17 from microsoft/laserprec/xdist_tests

4bd93010d65c0a7df646382c5819406d4772ddeb Disable xdist in CI pipeline

0d45d80a763978f5ac031f688a601f136488c70d Xdist-runnable tests

c90deed73ef921d6a57fe1c5bab78cc098173a21 Merge pull request #16 from microsoft/laserprec/update_badge

51af25cc8c54ad0f931d76b7c579eef8db228d24 Update and add new badges

c2c9bc0d3a7a5ef42872ac6bf9c6b78ec7e60750 Remove PR trigger

5339c76d05fe5971efe701acb98668c772e55cf7 Merge pull request #15 from microsoft/laserprec/ci_templates

003bf3154a637783e2017720089d7e1105c2e6a8 Add publish artifacts step

e074a1d44fb8d5a562b0c2b95104522bebc979ac Remove test pipeline

4456fcaffa0263d11169c3049e288c1da3326aac Update test matrix

8d8b12dbc5d13ad34d4e87e6103abb831493096e Small format change

cac5394a69fc87a8160f5d19c94d770c1e478ec4 Use variable group

c7e1b62cdceef7e801579cccdb48dcbc195da3a7 Restructure templates

d83305e1f9166c324a9e6a375cc04d39f8878604 Run azure tests

be824f4890acb5ac359085ad4b1e0102a36c5fe9 Update PR-gate

60585902d15a20605c43d1b8957f2e828d4275b8 Add nightly build

547aa2c0f940c2ed4ef5193e7f444f8ac6864e41 Skip install direct dep

6316a9c9536aaf1f9206f73c51fb1c50c080cde8 Cache code coverage report

67c12a3c82ce239372581345d939ec33480d9948 Add final cov stage

b534f479edc522a7dd9aa29277e60c0d3afaeadc Separate test report in a template

190d823afcaaacaa137f078466914bb0eebbf4aa Mark more io tests

05f9649e3daffe4d68efd76b970fc56277940f7b Add detail test summary report

4ff93d3cb50fa18a94455f3e9c5b20c517754d7a Run tests in 3.6-8

b67c6b34ab6136f0931d53fbaa50ccfc0487d49b Update templates

36875705b4b91b85a879a3ab71e7b0389b52e4dd Add slow tests

7ae1d6c24432e05f2e86be6ca81df23cac2996fc Add CI templates

cc31cf5866db629dbeb84d043c57e3fa2f63437b Add postargs to tox

7cf054acfd7f1d638b1b2d9bd0863816249bcf9e Merge pull request #14 from microsoft/laserprec/bugfix/silentDiskWrite

9e4c37ada77252070190c45d5399786e48087230 Raise errors when writing to disk

1227eff672a9ec8c945f187be43904e1bed59ced Merge pull request #13 from microsoft/laserprec/bugfix/trailinglash

741c61c4d3ec90eb663b9e5b817a85a16e85c649 Bugfix img not generated due to missing trailling '/' in dist folder

d738bbb93e3a5ed56248680c435adfa5d4c9ab00 Merge pull request #12 from microsoft/laserprec/use_tox

1ea5dd4ffac3b9f48fc5e56a9dfaa641f3297a85 Restructure tests/ocr/data

f4bd57686c1c90f65c3f4279baa139d361201b3c Fix flake8 issues

b9e40a4511917edc35c55fc8d66bddbe7f4d2180 Use tox in CI

1cbe5ad49e90d064560ae04a92401c8b9e083330 Add logging and pytest marker for slow and azure tests

b6d1af74548c841ec0ed1ef4d310176a735fed17 Update CODEOWNERS

d9d19bc3fd9f116fe26651ae1892a1a5c37a49f3 Pass environment vars to tox

4132a8ff06d19487aca81f003f01dd484c8709d5 Tox-runnable

3b293bd856336988395a7aef209094d801a7ee6b Restructure unit tests

8da38073ff493b3413d7216e605aca66a8b514e6 Merge pull request #11 from microsoft/laserprec/restructure_ci

1793c3db023b201fad11422e75f89f3d71ce6afc Version bump to alpha3 to test release pipeline

55659aca5fd8f0bf59c7082aa92a7348210f9b76 Update build badge url

098fefc1d383b308293ac87141fa81ca85d5952d Relocate CI pipelines under devops folder

cda2c1e77d2ea6d39a6e3d0583954e5ca1896c8e Merge pull request #10 from microsoft/laserprec/use_linter

4c8445b8742f5eee3be23ef23e7ffcf58384af86 Fix import orders

f8c1e63b37ad2c6d0285ec31c920c4724d63d198 Add flake8 plugin to check import orders

a9620f51528c9715b33408a3ed9c902475c76eda Update build pipeline

5b87f4eddb7176f901f1deb7727cf0c5ba1c90ad Fix flake8 issues

96620b4c434bf18b5608d870d9d00da0150547c1 Fix linting issues in tests

ad504b4d7b4052bd21976802842949e80f45d1a5 Merge pull request #9 from microsoft/laserprec/hotfix_blob_version

f7db94fc79aa7fb9021591cf6d98918753900f34 Remove deprecated azure-storage dependencies

18facb099b2a86ae03b6ea4778281415f72be84f Update CI trigger

3ddee11879f582378986e3ede323ec027e8bd293 Hotfix azure blob version

f759f5037057ed366d961c7c54b7ce654849a4ba Merge pull request #6 from microsoft/laserprec/release-pipeline

d8cc7ffa8ceaf7b365881a042f23e67dcd67e075 Merge pull request #7 from microsoft/laserprec/update-build-status-badge

18d505e2f180c50d3b39fb456aabad514b7305c2 Update README.md

d7c33cd7f65619df3ba509a6ee0bc539fe529d3b Add release pipeline

d83d9da888552001a21bf629de9d30c615c54844 Merge pull request #4 from microsoft/laserprec/ci-pipeline-fix

4a1f8845690276aff95c0ae8d14e09a60dfd297a Add build env configurations

b4c7eee5037d3c41baa7c6ab50a5751095ec04fe Add absolute path to example links and update installation instruction

2a4e6104a9e263a9e26377d31779251a25ac8dc5 fix badge link

8b5961cc754e3f93ede3b7454f425b23a523f0bc read projection blob name from env

87453a472aa708b2f03cfc11f2c70156781319c8 Merge pull request #3 from microsoft/dbanda/add_cicd

7612a69ef7dcc12a657985247ebe68e412e0696e Update azure-pipeline.yml

f520a17e8ee12f67f544d3a774f90520d637c431 Update azure-pipeline.yml

cf3be451173ec1cd7c2cc5bddf0129d229fee6d5 add pipelines

c7d63b627f407bea7cbe7bea20f8becf1a60c31c Code Migration from Azure DevOps (#2)

8ca2c5127f320fb23954cad2b85b5196126ff218 Initial README.md commit

eeeb6f5f92a4e2460acac0066d663341081b3f8f Initial SECURITY.md commit

3d8eadcf54ac58b5a17f4678ad4e9d8f66e841a6 Initial LICENSE commit

b95102a864ef06a1eb141c3ca23a2643dca5d5a3 Initial CODE_OF_CONDUCT.md commit

This list of changes was auto generated.
Source code(tar.gz)
Source code(zip)
genalog-0.1.0-py3-none-any.whl(59.76 KB)
genalog-0.1.0.tar.gz(77.18 KB)

Owner

Microsoft

Open source projects and samples from Microsoft

GitHub https://microsoft.github.io/genalog/

Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, CVPR 2016.

SynthText Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Ved

1.8k Dec 28, 2022

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted. ocrmypdf # it's a scriptable c

7.9k Jan 3, 2023

Packaged, Pytorch-based, easy to use, cross-platform version of the CRAFT text detector

CRAFT: Character-Region Awareness For Text detection Packaged, Pytorch-based, easy to use, cross-platform version of the CRAFT text detector | Paper |

188 Dec 28, 2022

A curated list of awesome synthetic data for text location and recognition

awesome-SynthText A curated list of awesome synthetic data for text location and recognition and OCR datasets. Text location SynthText SynthText_Chine

283 Jan 5, 2023

This is a c++ project deploying a deep scene text reading pipeline with tensorflow. It reads text from natural scene images. It uses frozen tensorflow graphs. The detector detect scene text locations. The recognizer reads word from each detected bounding box.

DeepSceneTextReader This is a c++ project deploying a deep scene text reading pipeline. It reads text from natural scene images. Prerequsites The proj

49 Sep 10, 2022

A synthetic data generator for text recognition

TextRecognitionDataGenerator A synthetic data generator for text recognition What is it for? Generating text image samples to train an OCR software. N

2.5k Jan 4, 2023

Unofficial implementation of "TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images"

TableNet Unofficial implementation of ICDAR 2019 paper : TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from

243 Dec 30, 2022

Detect textlines in document images

Textline Detection Detect textlines in document images Introduction This tool performs border, region and textline detection from document image data

70 Jun 30, 2022

Binarize document images

Binarization Binarization for document images Examples Introduction This tool performs document image binarization (i.e. transform colour/grayscale to

48 Jan 2, 2023

Code related to "Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity" paper

DataTuner You have just found the DataTuner. This repository provides tools for fine-tuning language models for a task. See LICENSE.txt for license de

81 Jan 1, 2023

Total Text Dataset. It consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.

Total-Text-Dataset (Official site) Updated on April 29, 2020 (Detection leaderboard is updated - highlighted E2E methods. Thank you shine-lcy.) Update

671 Dec 27, 2022

Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. This Neural Network (NN) model recognizes the text contained in the images of segmented words.

Handwritten-Text-Recognition Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. T

27 Jan 8, 2023

OCR system for Arabic language that converts images of typed text to machine-encoded text.

textspotter - An End-to-End TextSpotter with Explicit Alignment and Attention

Official code for ROCA: Robust CAD Model Retrieval and Alignment from a Single Image (CVPR 2022)

Code release for our paper, "SimNet: Enabling Robust Unknown Object Manipulation from Pure Synthetic Data via Stereo"

The first open-source library that detects the font of a text in a image.

Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.
