Genalog is an open source, cross-platform Python package for generating synthetic document images with custom degradations and text alignment capabilities.

Genalog - Synthetic Data Generator

Genalog is an open source, cross-platform Python package for generating document images with synthetic noise that mimics scanned analog documents (hence the name genalog). You can also add various text degradations to these images. The tool provides a fast and efficient way to generate synthetic documents from text data, using layouts from templates that you create in simple HTML format.
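To make the "text plus HTML template" idea concrete, here is a minimal sketch that fills a layout with text content. It deliberately does not use Genalog's API: the layout, the `render_page` helper and the use of the standard library's `string.Template` are assumptions made for illustration only.

```python
from html import escape
from string import Template

# Hypothetical single-column layout, used only for this illustration.
PAGE_TEMPLATE = Template(
    "<html><body>\n"
    "<h1>$title</h1>\n"
    "$paragraphs\n"
    "</body></html>"
)


def render_page(title, paragraphs):
    """Fill the HTML layout with escaped text content."""
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return PAGE_TEMPLATE.substitute(title=escape(title), paragraphs=body)


html_page = render_page("Sample", ["First paragraph.", "Tom & Jerry <3"])
print(html_page)
```

The resulting HTML string is what a renderer would then turn into a page image; escaping the text keeps characters such as `&` and `<` from being read as markup.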

Overview

Genalog has various capabilities:

  1. Flexible format Image Generation
  2. Custom image degradation
  3. Extract Text from Images using Cognitive Search Pipeline
  4. Get OCR Performance Metrics

The aim of this project is to provide a complete solution for generating synthetic images from any text data rich in natural language and to imitate most of the OCR noise found in scanned text documents.
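As an illustration of the kind of scan noise meant here, the sketch below applies salt-and-pepper noise to a tiny grayscale image stored as nested lists. It is a pure-Python stand-in and not Genalog's degradation code; the `salt_and_pepper` function and its `amount` and `seed` parameters are hypothetical names chosen for this example.

```python
import random


def salt_and_pepper(image, amount=0.05, seed=None):
    """Return a copy of a grayscale image with random black/white pixels.

    image: list of rows, each a list of ints in 0..255.
    amount: probability that any given pixel is replaced.
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in image]  # copy so the input stays untouched
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))  # pepper (black) or salt (white)
    return noisy


page = [[128] * 8 for _ in range(4)]  # tiny uniform gray "page"
noisy_page = salt_and_pepper(page, amount=0.2, seed=0)
for row in noisy_page:
    print(row)
```

Passing a fixed `seed` makes the noise reproducible, which helps when the same degraded image has to be regenerated for a dataset.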

Please refer to our Genalog documentation for more tutorials.

Installation

See the Genalog install guide for more details.

To install the latest release:

pip install genalog

Extra Installation Steps on macOS and Windows

Genalog depends on WeasyPrint, which in turn has non-Python dependencies, including Pango, cairo and GDK-PixBuf, that need to be installed separately.

So far, the Pango, cairo and GDK-PixBuf libraries are available by default on Ubuntu 18.04 and later.

If you are running on Windows, macOS, or other Linux distributions, please see the installation instructions from WeasyPrint.

NOTE: If you encounter errors like no library called "libcairo-2" was found, the cause is probably that these three extra dependencies are missing.
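A quick way to check for the three native libraries before installing is to ask the system's library search, as sketched below. The library names passed to `find_library` are assumptions and can differ between platforms, and WeasyPrint may look in places this check does not, so treat a "not found" result as a hint rather than a verdict.

```python
from ctypes.util import find_library

# Library names as the linker would see them (assumed; they vary by platform).
NATIVE_LIBS = ("cairo", "pango-1.0", "gdk_pixbuf-2.0")


def missing_native_libs(names=NATIVE_LIBS):
    """Return the names that the system's library search cannot locate."""
    return [name for name in names if find_library(name) is None]


missing = missing_native_libs()
if missing:
    print("Not found:", ", ".join(missing))
else:
    print("All native libraries were found.")
```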

Getting Started

The following is a summary of the common application scenarios of Genalog. Please refer to the Jupyter notebook examples, which make use of the core code base of Genalog and the repository utilities.

TLDR

If you are interested in a full document generation and degradation pipeline, please see the following notebook:

Description | In-depth Jupyter Notebook Examples
1 | Analog Document Generation Pipeline (Demo Notebook)

Otherwise, we have in-depth walkthroughs of each module in Genalog.

Steps | In-depth Jupyter Notebook Examples | Quick Start Guides
1 | Create Template for Image Generation (Demo Notebook) | Here is our guide to Document Generation
2 | Degrade Prebuilt Images (Demo Notebook) | Here is our guide to Image Degradation
3 | Get Text From Images Using OCR (Demo Notebook) | Here is our guide to Extracting Text
4 | Align Text Produced from OCR with Ground Truth Text (Demo Notebook) | Here is our guide to Text Alignment
5 | NER Label Propagation from Ground Truth to OCR Tokens (Demo Notebook) | Here is our guide to Label Propagation

We also provide notebooks for the complete end-to-end scenario of generating a synthetic dataset connecting all the components of genalog:

Scenario | In-depth Jupyter Notebook
1 | Synthetic Dataset Generation with LABELED NER Dataset (Demo Notebook)

Other Requirements:

  1. If you want to use the OCR capabilities of Azure to extract text from the images, you will need the following resources:

    1. Azure Cognitive Search Service (see the Quickstart Guide)
    2. Azure Blob Storage (see the Quickstart Guide)

    See the Azure docs for more information on Azure Cognitive Search.

Package Release

Please see RELEASE.md for more details on the release process.

Repo Structure

genalog
├────genalog
│       ├──── generation                     # generate text images
│       ├──── degradation                    # methods for image degradation
│       ├──── ocr                            # running the Azure Search Pipeline
│       └──── text                           # methods to Align OCR Output Text with 
├────devops                                  # CI/CD pipelines
├────docs                                    # online documentation
├────examples                                # example Jupyter Notebooks for Various 
├────tests                                   # tests
├────tox.ini                                 # CI orchestration and configurations
├────README.md
└────LICENSE

Trademark Notice

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.

Microsoft Open Source Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Contribution Guidelines

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

Citing genalog

If you find genalog helpful to your work, please consider citing our tool and paper using the following BibTeX entry:

@article{
  gupte2021genalog,
  title={Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents},
  author={Gupte, Amit and Romanov, Alexey and Mantravadi, Sahitya and Banda, Dalitso and Liu, Jianjie and Khan, Raza and Meenal, Lakshmanan Ramu and Han, Benjamin and Srinivasan, Soundar},
  journal={Document Intelligence Workshop at KDD 2021},
  year={2021}
}

Collaborators

Genalog was originally developed by the MAIDAP team at Microsoft Cambridge NERD in association with the Text Analytics Team in Redmond.

Comments
  • Remove files specific to Project Enki

    Remove files pertaining to Project Enki only:

    1. Scripts pertaining to the TA model and the unlabeled dataset scenario.
    2. Scripts pertaining to the specific Azure resource.
    opened by laserprec 4
  • Retrieve position of rendered document

    I want to use this tool to generate a synthetic dataset for the detection phase of the OCR pipelines, I wonder if there is a way to get a location (bounding box) of each word that is rendered to the final documents?

    opened by parsa-ra 2
  • Can we add line_spacing?

    Hello, I am trying to add linespacing. Even though I add new lines manually from txt, it still removes them.

    with open(txt_path, 'r') as f:
        text = f.read()
    
    # Initialize Content Object
    text = text.replace('\n', '\n\n')
    paragraphs = text.split('\n\n\n')
    

    printing paragraph gives the demanded result, however, default_generator.set_styles_to_generate(new_style_combinations) somehow removes blank lines. Thank you in advance

    opened by egenc 2
  • How to run tests?

    The RELEASE.md document does not specify how to run tests.

    Would be good to have the information about running the tests in RELEASE.md in the "Preparation" step

    opened by jgc128 0
  • Laserprec/production_release

    • Extend release pipeline to release onto PyPI
    • Update documentation
    • Add more badges
    • Skip flaky e2e test with Azure OCR (planning to deprecate genalog.ocr)
    opened by laserprec 0
  • Update notebook documentation

    1. Add a TLDR NB for analog document generation pipeline
    2. Update documents on other Jupyter NBs
    3. Resolve some broken links in the NBs and READMEs
    4. Update main README
    opened by laserprec 0
  • Bump numpy from 1.18.1 to 1.22.0

    Bumps numpy from 1.18.1 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    opened by dependabot[bot] 0
  • Dependencies specified too strictly

    The dependencies specified in this project's requirements.txt are too strictly pinned, making it hard to install.

    For instance, opencv-contrib-python==4.2.0.34 doesn't have releases for Python 3.9 at all.

    It might be a good idea to change the requirements.txt specifiers to ~= instead.

    Related SO issue: https://stackoverflow.com/a/72613708/51685

    opened by akx 0
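To show what the `~=` ("compatible release") specifier suggested in the issue above would accept, here is a rough sketch of its rule for plain numeric versions. It is a simplified approximation of PEP 440 that ignores pre-releases, epochs and local versions; `parse` and `is_compatible` are hypothetical helpers, and the version strings are only examples.

```python
def parse(version):
    """Split a plain numeric version such as '4.2.0.34' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_compatible(candidate, base):
    """Approximate '~=base' for plain numeric versions only.

    '~=X.Y.Z' accepts candidates that are >= X.Y.Z and share the prefix X.Y.
    """
    cand, ref = parse(candidate), parse(base)
    if len(ref) < 2:
        raise ValueError("'~=' needs at least two version components")
    prefix = ref[:-1]  # everything except the last component must match
    width = max(len(cand), len(ref))
    cand_padded = cand + (0,) * (width - len(cand))  # treat 4.2 like 4.2.0
    ref_padded = ref + (0,) * (width - len(ref))
    return cand_padded >= ref_padded and cand_padded[:len(prefix)] == prefix


print(is_compatible("1.18.5", "1.18.1"))      # True
print(is_compatible("1.22.0", "1.18.1"))      # False
print(is_compatible("4.2.0.35", "4.2.0.34"))  # True
print(is_compatible("4.5.1.48", "4.2"))       # True
```

Note that the base version matters: `~=4.2.0.34` still only admits `4.2.0.*` builds, so a shorter base such as `~=4.2` is needed to allow later minor releases.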
  • Access different pages if content is across multiple pages

    If my content is more than a page long, what does doc.render_png() return? Looking at the source code, it seems to return it per page. How can I access the different pages programmatically to degrade them?

    opened by hummingbird1989 0
  • Question about generate new templates(html.jinja)

    Hi, thank you for sharing this nice work. @laserprec

    I want to make my own templates (I have an HTML file). How can I make a jinja file that matches my HTML (or PDF) file?

    Could you please give me some tips on this?

    In addition, can I use a template that matches my own PDF files?

    I saw an issue about this, but I couldn't work out exactly how to do it.

    opened by yellowjs0304 0
  • Minor typos

    Hi. Thanks for sharing a nice work.

    There seem to be some minor typos in the docstrings.

    https://github.com/microsoft/genalog/blob/b8b9fbabdde5855e8bcc9db025e78cc619e15cd1/genalog/text/alignment.py#L203-L207

    Maybe it should be changed as below?

    '.' is for substition → '.' is for substitution
    '-' indicates gap → ' ' indicates gap

    Thanks.

    opened by whwang299 1
  • genalog does not work with newer versions of weasyprint

    Newer versions of weasyprint (53.x) removed their dependency on cairo and do not support PNG exports anymore (see Kozea/WeasyPrint#1232 and https://www.courtbouillon.org/blog/00004-weasyprint-without-cairo-what-s-different)

    This breaks some parts of the genalog code, specifically the following methods are affected (as far as I have seen):

    https://github.com/microsoft/genalog/blob/b8b9fbabdde5855e8bcc9db025e78cc619e15cd1/genalog/generation/document.py#L97
    https://github.com/microsoft/genalog/blob/b8b9fbabdde5855e8bcc9db025e78cc619e15cd1/genalog/generation/document.py#L127

    Maybe some warning should be added to the documentation regarding this and what is the plan moving forward.

    Thank you :D

    opened by pedromb 2
Releases(v0.1.0)
  • v0.1.0(Jul 20, 2021)

    Genalog Changelog

    All notable changes to this project will be documented in this file.

    Types of changes

    1. Added for new features.
    2. Changed for changes in existing functionality.
    3. Deprecated for soon-to-be removed features.
    4. Removed for now removed features.
    5. Fixed for any bug fixes.
    6. Security in case of vulnerabilities.

    The format is based on Keep a Changelog, and we adopt the Semantic Versioning.

    [v0.1.0] - 2021-07-20

    Added

    • Initial package release:
      • 3 standard HTML document templates for generation
      • basic image degradation effects including blur, bleed-through, salt & pepper and other morphological operations.
      • 2 flavors of text alignment algorithm: Needleman-Wunsch (shorter text segments) and RETAS (longer text segments)
      • Full e2e NER-OCR label generation notebooks
      • See documentation for more on the initial features of the package.
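For readers unfamiliar with the Needleman-Wunsch algorithm named in the feature list above, the following is a minimal global-alignment sketch in pure Python. It is not Genalog's implementation: the scoring values, the `needleman_wunsch` function name and the `@` gap symbol are assumptions chosen for this example.

```python
GAP = "@"  # placeholder gap symbol chosen for this sketch


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return the two gap-padded strings."""

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a prefix aligned against nothing
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # match or substitute
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


ground_truth, ocr_output = needleman_wunsch("New York", "Nev Yrk")
print(ground_truth)  # New York
print(ocr_output)    # Nev Y@rk
```

The table costs O(len(a) × len(b)) time and memory, which is consistent with the note above that this algorithm is used for shorter text segments.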

    Changes:

    • 9b0a92c45fb948f00bde820aae57d24749ca30c8 Release 0.1.0 (#32)
    • 9047cd6197feb49a1e27468c1484999a504210e0 Laserprec/bugfix img save (#31)
    • 7c25f068018e1c167cfafb4c935dcec27df22a65 Laserprec/production_release (#30)
    • 0e982f2724e9f8ef0e88bfd2d09230167027410b Laserprec/jupyter book doc (#28)
    • 6180948ce5859b31f8e1f13ad10da8eeb068a599 Add copyright disclaimer (#27)
    • e01b609fe7f4df669066fc371fb2b505dba0ec6d Fix installation link (#26)
    • 09dd21405d76bc81aba4135874462c1b8c312800 Laserprec/install_link_update (#25)
    • cb4bf5981483ed48931c5cfa5ffbebc99641ecbb Update notebook documentation (#23)
    • 58f3febadbf72b04e6a707b97fdf4fb9d9bbafac Merge pull request #22 from microsoft/laserprec-fix-broken-img-link
    • ababee2c9f34f91edac70c41ee2dd59eac8bee31 Update README.md
    • 3362cdf6be3f73963f1a28cb97f012632e8521de Merge pull request #21 from microsoft/laserprec/update_readme
    • 8e7b1576da2a6d4dd927cd248d7b8d0f682c1a68 Update installation instructions
    • 4c91ea2569ffba9684ae4dd04e9015a2f5b138a8 Merge pull request #20 from microsoft/laserprec/e2e_tests
    • 31d259b2e36700fe98eb7f8cdc4326cf2302c164 Add parameter value
    • 7f989c7cadb5c7f163ab24a3351c9f0e0e8b45b1 Runn all e2e tests
    • 2942d34267a15c9af7af1715abbc703cdbb743a0 Merge pull request #19 from microsoft/laserprec/sphinx_doc
    • f1bfe1f951d1619155fffae2cb3411771829923a Add badge for supported platforms
    • a97491cabd3a0872ac42c35b841031ee361e518d Correct docstring formatting
    • 9f633e43aac13a8a2ea3cbbe362ba06ba0de8c4b Convert to google style docstring (keyword arg & code block)
    • 62692b10ca6f04a1a330b125ad154ebdbdff4d18 Convert to google style docstring (parameter type & default to)
    • d65fca7934065a8b397d9e6a5d4fe1654b23fd02 Add sphinx config
    • 5c4b634da99f06fafa25d158f4e973544f147926 Merge pull request #18 from microsoft/laserprec/comp_governance
    • c5aebb43721684566966f7d8fd8bc9705b811ae1 Add component governance as a manual step
    • 9c858e6e19928f9e17ffd4800f45b3d327e2f44c Merge pull request #17 from microsoft/laserprec/xdist_tests
    • 4bd93010d65c0a7df646382c5819406d4772ddeb Disable xdist in CI pipeline
    • 0d45d80a763978f5ac031f688a601f136488c70d Xdist-runnable tests
    • c90deed73ef921d6a57fe1c5bab78cc098173a21 Merge pull request #16 from microsoft/laserprec/update_badge
    • 51af25cc8c54ad0f931d76b7c579eef8db228d24 Update and add new badges
    • c2c9bc0d3a7a5ef42872ac6bf9c6b78ec7e60750 Remove PR trigger
    • 5339c76d05fe5971efe701acb98668c772e55cf7 Merge pull request #15 from microsoft/laserprec/ci_templates
    • 003bf3154a637783e2017720089d7e1105c2e6a8 Add publish artifacts step
    • e074a1d44fb8d5a562b0c2b95104522bebc979ac Remove test pipeline
    • 4456fcaffa0263d11169c3049e288c1da3326aac Update test matrix
    • 8d8b12dbc5d13ad34d4e87e6103abb831493096e Small format change
    • cac5394a69fc87a8160f5d19c94d770c1e478ec4 Use variable group
    • c7e1b62cdceef7e801579cccdb48dcbc195da3a7 Restructure templates
    • d83305e1f9166c324a9e6a375cc04d39f8878604 Run azure tests
    • be824f4890acb5ac359085ad4b1e0102a36c5fe9 Update PR-gate
    • 60585902d15a20605c43d1b8957f2e828d4275b8 Add nightly build
    • 547aa2c0f940c2ed4ef5193e7f444f8ac6864e41 Skip install direct dep
    • 6316a9c9536aaf1f9206f73c51fb1c50c080cde8 Cache code coverage report
    • 67c12a3c82ce239372581345d939ec33480d9948 Add final cov stage
    • b534f479edc522a7dd9aa29277e60c0d3afaeadc Separate test report in a template
    • 190d823afcaaacaa137f078466914bb0eebbf4aa Mark more io tests
    • 05f9649e3daffe4d68efd76b970fc56277940f7b Add detail test summary report
    • 4ff93d3cb50fa18a94455f3e9c5b20c517754d7a Run tests in 3.6-8
    • b67c6b34ab6136f0931d53fbaa50ccfc0487d49b Update templates
    • 36875705b4b91b85a879a3ab71e7b0389b52e4dd Add slow tests
    • 7ae1d6c24432e05f2e86be6ca81df23cac2996fc Add CI templates
    • cc31cf5866db629dbeb84d043c57e3fa2f63437b Add postargs to tox
    • 7cf054acfd7f1d638b1b2d9bd0863816249bcf9e Merge pull request #14 from microsoft/laserprec/bugfix/silentDiskWrite
    • 9e4c37ada77252070190c45d5399786e48087230 Raise errors when writing to disk
    • 1227eff672a9ec8c945f187be43904e1bed59ced Merge pull request #13 from microsoft/laserprec/bugfix/trailinglash
    • 741c61c4d3ec90eb663b9e5b817a85a16e85c649 Bugfix img not generated due to missing trailling '/' in dist folder
    • d738bbb93e3a5ed56248680c435adfa5d4c9ab00 Merge pull request #12 from microsoft/laserprec/use_tox
    • 1ea5dd4ffac3b9f48fc5e56a9dfaa641f3297a85 Restructure tests/ocr/data
    • f4bd57686c1c90f65c3f4279baa139d361201b3c Fix flake8 issues
    • b9e40a4511917edc35c55fc8d66bddbe7f4d2180 Use tox in CI
    • 1cbe5ad49e90d064560ae04a92401c8b9e083330 Add logging and pytest marker for slow and azure tests
    • b6d1af74548c841ec0ed1ef4d310176a735fed17 Update CODEOWNERS
    • d9d19bc3fd9f116fe26651ae1892a1a5c37a49f3 Pass environment vars to tox
    • 4132a8ff06d19487aca81f003f01dd484c8709d5 Tox-runnable
    • 3b293bd856336988395a7aef209094d801a7ee6b Restructure unit tests
    • 8da38073ff493b3413d7216e605aca66a8b514e6 Merge pull request #11 from microsoft/laserprec/restructure_ci
    • 1793c3db023b201fad11422e75f89f3d71ce6afc Version bump to alpha3 to test release pipeline
    • 55659aca5fd8f0bf59c7082aa92a7348210f9b76 Update build badge url
    • 098fefc1d383b308293ac87141fa81ca85d5952d Relocate CI pipelines under devops folder
    • cda2c1e77d2ea6d39a6e3d0583954e5ca1896c8e Merge pull request #10 from microsoft/laserprec/use_linter
    • 4c8445b8742f5eee3be23ef23e7ffcf58384af86 Fix import orders
    • f8c1e63b37ad2c6d0285ec31c920c4724d63d198 Add flake8 plugin to check import orders
    • a9620f51528c9715b33408a3ed9c902475c76eda Update build pipeline
    • 5b87f4eddb7176f901f1deb7727cf0c5ba1c90ad Fix flake8 issues
    • 96620b4c434bf18b5608d870d9d00da0150547c1 Fix linting issues in tests
    • ad504b4d7b4052bd21976802842949e80f45d1a5 Merge pull request #9 from microsoft/laserprec/hotfix_blob_version
    • f7db94fc79aa7fb9021591cf6d98918753900f34 Remove deprecated azure-storage dependencies
    • 18facb099b2a86ae03b6ea4778281415f72be84f Update CI trigger
    • 3ddee11879f582378986e3ede323ec027e8bd293 Hotfix azure blob version
    • f759f5037057ed366d961c7c54b7ce654849a4ba Merge pull request #6 from microsoft/laserprec/release-pipeline
    • d8cc7ffa8ceaf7b365881a042f23e67dcd67e075 Merge pull request #7 from microsoft/laserprec/update-build-status-badge
    • 18d505e2f180c50d3b39fb456aabad514b7305c2 Update README.md
    • d7c33cd7f65619df3ba509a6ee0bc539fe529d3b Add release pipeline
    • d83d9da888552001a21bf629de9d30c615c54844 Merge pull request #4 from microsoft/laserprec/ci-pipeline-fix
    • 4a1f8845690276aff95c0ae8d14e09a60dfd297a Add build env configurations
    • b4c7eee5037d3c41baa7c6ab50a5751095ec04fe Add absolute path to example links and update installation instruction
    • 2a4e6104a9e263a9e26377d31779251a25ac8dc5 fix badge link
    • 8b5961cc754e3f93ede3b7454f425b23a523f0bc read projection blob name from env
    • 87453a472aa708b2f03cfc11f2c70156781319c8 Merge pull request #3 from microsoft/dbanda/add_cicd
    • 7612a69ef7dcc12a657985247ebe68e412e0696e Update azure-pipeline.yml
    • f520a17e8ee12f67f544d3a774f90520d637c431 Update azure-pipeline.yml
    • cf3be451173ec1cd7c2cc5bddf0129d229fee6d5 add pipelines
    • c7d63b627f407bea7cbe7bea20f8becf1a60c31c Code Migration from Azure DevOps (#2)
    • 8ca2c5127f320fb23954cad2b85b5196126ff218 Initial README.md commit
    • eeeb6f5f92a4e2460acac0066d663341081b3f8f Initial SECURITY.md commit
    • 3d8eadcf54ac58b5a17f4678ad4e9d8f66e841a6 Initial LICENSE commit
    • b95102a864ef06a1eb141c3ca23a2643dca5d5a3 Initial CODE_OF_CONDUCT.md commit

    This list of changes was auto generated.

    Source code(tar.gz)
    Source code(zip)
    genalog-0.1.0-py3-none-any.whl(59.76 KB)
    genalog-0.1.0.tar.gz(77.18 KB)
Owner
Microsoft
Open source projects and samples from Microsoft
Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, CVPR 2016.

SynthText Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Ved

Ankush Gupta 1.8k Dec 28, 2022
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted. ocrmypdf # it's a scriptable c

jbarlow83 7.9k Jan 3, 2023
Packaged, Pytorch-based, easy to use, cross-platform version of the CRAFT text detector

CRAFT: Character-Region Awareness For Text detection Packaged, Pytorch-based, easy to use, cross-platform version of the CRAFT text detector | Paper |

null 188 Dec 28, 2022
A curated list of awesome synthetic data for text location and recognition

awesome-SynthText A curated list of awesome synthetic data for text location and recognition and OCR datasets. Text location SynthText SynthText_Chine

Tianzhong 283 Jan 5, 2023
A synthetic data generator for text recognition

TextRecognitionDataGenerator A synthetic data generator for text recognition What is it for? Generating text image samples to train an OCR software. N

Edouard Belval 2.5k Jan 4, 2023
Unofficial implementation of "TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images"

TableNet Unofficial implementation of ICDAR 2019 paper : TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from

Jainam Shah 243 Dec 30, 2022
Detect textlines in document images

Textline Detection Detect textlines in document images Introduction This tool performs border, region and textline detection from document image data

QURATOR-SPK 70 Jun 30, 2022
Binarize document images

Binarization Binarization for document images Examples Introduction This tool performs document image binarization (i.e. transform colour/grayscale to

QURATOR-SPK 48 Jan 2, 2023
Code related to "Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity" paper

DataTuner You have just found the DataTuner. This repository provides tools for fine-tuning language models for a task. See LICENSE.txt for license de

null 81 Jan 1, 2023
Total Text Dataset. It consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.

Total-Text-Dataset (Official site) Updated on April 29, 2020 (Detection leaderboard is updated - highlighted E2E methods. Thank you shine-lcy.) Update

Chee Seng Chan 671 Dec 27, 2022
Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. This Neural Network (NN) model recognizes the text contained in the images of segmented words.

Handwritten-Text-Recognition Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. T

null 27 Jan 8, 2023
OCR system for Arabic language that converts images of typed text to machine-encoded text.

Arabic OCR OCR system for Arabic language that converts images of typed text to machine-encoded text. The system currently supports only letters (29 l

Hussein Youssef 144 Jan 5, 2023
textspotter - An End-to-End TextSpotter with Explicit Alignment and Attention

An End-to-End TextSpotter with Explicit Alignment and Attention This is initially described in our CVPR 2018 paper. Getting Started Installation Clone

Tong He 323 Nov 10, 2022
Official code for ROCA: Robust CAD Model Retrieval and Alignment from a Single Image (CVPR 2022)

ROCA: Robust CAD Model Alignment and Retrieval from a Single Image (CVPR 2022) Code release of our paper ROCA. Check out our video, paper, and website

null 123 Dec 25, 2022
Code release for our paper, "SimNet: Enabling Robust Unknown Object Manipulation from Pure Synthetic Data via Stereo"

SimNet: Enabling Robust Unknown Object Manipulation from Pure Synthetic Data via Stereo Thomas Kollar, Michael Laskey, Kevin Stone, Brijen Thananjeyan

null 68 Dec 14, 2022
The first open-source library that detects the font of a text in a image.

Typefont Typefont is an experimental library that detects the font of a text in a image. Usage Import the main function and invoke it like in the foll

Vasile Pește 1.6k Feb 24, 2022
Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.

Table of Contents Overview Requirements Demo Modules Overview This python package contains modules to help with finding and extracting tabular data fr

Eric Ihli 311 Dec 24, 2022