Extract tables from scanned image PDFs using Optical Character Recognition.

Abhijeet Singh

Last update: Dec 6, 2022

Related tags

Computer Vision python shell ocr tesseract optical-character-recognition pdfminer extract-tables scanned-image-pdfs ocr-table

Overview

ocr-table

This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

Install Requirements

Tesseract OCR
```
sudo apt-get install tesseract-ocr
```
Imagemagick
```
sudo apt-get install imagemagick
```
PDF Utilities
```
sudo apt-get install poppler-utils
```
Python packages
```
sudo pip install -r requirements.txt
```
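Before running the scripts, it can help to confirm that the command-line tools installed above are on the PATH. The following is a minimal sketch and not part of the project: the executable names `tesseract`, `convert` (ImageMagick) and `pdftotext` (poppler-utils) are assumed to be what these packages install on a Debian/Ubuntu system, and `find_missing_tools` is a hypothetical helper name.

```python
import shutil

# Executables assumed to be provided by the packages above
# (tesseract-ocr, imagemagick, poppler-utils).
REQUIRED_TOOLS = ("tesseract", "convert", "pdftotext")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```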

Usage

Clear the pdf/ folder and copy all the PDF files you want to scan into it.
Run the OCR:
```
python3 shellocr.py
```
The extracted text files are available in the txt/ folder once the process completes.
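
The log lines quoted in the comments suggest that the script first tries `pdftotext` and only then falls back to OCR through a temporary TIFF. The following is a rough sketch of that kind of two-stage pipeline, not the actual contents of `shellocr.py`; the function names, the `convert` options and the 300 dpi density are assumptions chosen for illustration.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Command that extracts an embedded text layer with poppler's pdftotext."""
    return ["pdftotext", str(pdf_path), str(txt_path)]


def ocr_commands(pdf_path, txt_path, tiff_path="temp.tiff"):
    """Commands that rasterize the PDF with ImageMagick, then OCR it with Tesseract."""
    # Tesseract appends ".txt" to the output base name itself.
    out_base = str(Path(txt_path).with_suffix(""))
    return [
        # Density and depth are assumed values, not taken from the project.
        ["convert", "-density", "300", str(pdf_path), "-depth", "8", str(tiff_path)],
        ["tesseract", str(tiff_path), out_base],
    ]


def count_words(txt_path):
    """Number of whitespace-separated words in a text file (0 if it is missing)."""
    path = Path(txt_path)
    return len(path.read_text(errors="ignore").split()) if path.exists() else 0


def extract(pdf_path, txt_dir="txt"):
    """Try the text layer first; fall back to OCR if it yields no words."""
    pdf_path = Path(pdf_path)
    txt_path = Path(txt_dir) / (pdf_path.stem + ".txt")
    txt_path.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=False)
    if count_words(txt_path) == 0:
        for command in ocr_commands(pdf_path, txt_path):
            subprocess.run(command, check=False)
    return count_words(txt_path)
```

Calling `extract("pdf/report.pdf")` would then write `txt/report.txt` and return the number of words found; `report.pdf` is a placeholder name.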

Alternate

If the above doesn't work for you, try the alternate method.
Save your file as input.pdf in the repository's root directory.
Run:
```
python3 pdf_miner.py
```
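
pdfminer reads only the text layer embedded in a PDF, so on a purely scanned file it typically returns little more than form feed page separators. The following is a minimal sketch of that kind of extraction, not the contents of `pdf_miner.py`; it assumes a recent pdfminer.six that provides `pdfminer.high_level.extract_text`, and the function names are hypothetical.

```python
def extract_text_with_pdfminer(pdf_path="input.pdf"):
    """Return the text layer of a PDF using pdfminer.six's high-level API."""
    # Imported lazily so has_text() below works even without pdfminer installed.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def has_text(text):
    """True if the extracted text holds anything besides whitespace and form feeds."""
    return bool(text.strip())
```

If `has_text(extract_text_with_pdfminer())` is false, the PDF most likely has no text layer and needs the OCR route instead.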

You might also like...

A program that enables OCR (Optical Character Reading) of a PDF.

This program is intended to modify PDF files. PDF files can be of 3 kinds: true PDFs - in which it is possible to select the ti

2 Oct 11, 2021

Deskew is a command line tool for deskewing scanned text documents. It uses Hough transform to detect "text lines" in the image. As an output, you get an image rotated so that the lines are horizontal.

Deskew by Marek Mauder https://galfar.vevb.net/deskew https://github.com/galfar/deskew v1.30 2019-06-07 Overview Deskew is a command line tool for des

127 Dec 3, 2022

Some bits of javascript to transcribe scanned pages using PageXML

nashi (nasḫī) Some bits of javascript to transcribe scanned pages using PageXML. Both ltr and rtl languages are supported. Try it! But wait, there's m

15 Nov 9, 2022

Handwritten Number Recognition using CNN and Character Segmentation

Handwritten-Number-Recognition-With-Image-Segmentation Info About this repository This Repository is aimed at reading handwritten images of numbers an

17 Aug 25, 2022

A post-processing tool for scanned sheets of paper.

unpaper Originally written by Jens Gulden — see AUTHORS for more information. Licensed under GNU GPL v2 — see COPYING for more information. Overview u

27 Dec 7, 2022

Library used to deskew a scanned document

Deskew //Note: Skew is measured in degrees. Deskewing is a process whereby skew is removed by rotating an image by the same amount as its skew but in

273 Jan 6, 2023

Unofficial implementation of "TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images"

TableNet Unofficial implementation of ICDAR 2019 paper : TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from

243 Dec 30, 2022

A tool for extracting text from scanned documents (via OCR), with user-defined post-processing.

The project is based on older versions of tesseract and other tools, and is now superseded by another project which allows for more granular control o

32 Jul 24, 2022

scantailor - Scan Tailor is an interactive post-processing tool for scanned pages.

Scan Tailor - scantailor.org This project is no longer maintained, and has not been maintained for a while. About Scan Tailor is an interactive post-p

1.5k Dec 28, 2022

Comments

cannot remove './temp.tiff': No such file or directory

Hello,

I'm running into the following issue when trying to load my own file:

$ pipenv run python3 shellocr.py
Attempting pdftotext extraction...extracted 0 words.
Attempting OCR extraction...rm: cannot remove './temp.tiff': No such file or directory
extracted 0 words.

Additionally, trying to parse the same file with pdfminer returns the following:

$ python3 pdf_miner.py
b'\x0c'

opened by zexa 2
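
The `rm` message above means the intermediate `temp.tiff` did not exist when cleanup ran, which suggests that the image conversion step produced no output for this file. The following is a minimal sketch of a cleanup step that tolerates the missing file and reports it, so the failed conversion is easier to spot; the helper name is hypothetical and this is not the project's actual code.

```python
from pathlib import Path


def remove_if_present(path):
    """Delete `path` if it exists; return True if a file was removed."""
    target = Path(path)
    if target.exists():
        target.unlink()
        return True
    return False


if __name__ == "__main__":
    if not remove_if_present("./temp.tiff"):
        print("temp.tiff was never created; check the PDF-to-image conversion step.")
```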

ModuleNotFoundError: No module named 'chardet'

Traceback (most recent call last):
  File "pdf_miner.py", line 8, in <module>
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
  File "/anonymized/ocr-table/venv3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 5, in <module>
    from .cmapdb import CMapDB
  File "/anonymized/ocr-table/venv3/lib/python3.7/site-packages/pdfminer/cmapdb.py", line 24, in <module>
    from .psparser import PSStackParser
  File "/anonymized/ocr-table/venv3/lib/python3.7/site-packages/pdfminer/psparser.py", line 11, in <module>
    from .utils import choplist
  File "/anonymized/ocr-table/venv3/lib/python3.7/site-packages/pdfminer/utils.py", line 13, in <module>
    import chardet  # For str encoding detection in Py3
ModuleNotFoundError: No module named 'chardet'

opened by souravbadami 1
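
The traceback shows that the installed pdfminer imports `chardet`, which is missing from the environment; installing it (for example with `pip install chardet`) is the usual remedy. The following is a small sketch for checking this before running `pdf_miner.py`; the helper name is hypothetical and the module list is taken only from the traceback above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Modules named in the traceback; extend the list as needed.
    print("Missing modules:", missing_modules(["pdfminer", "chardet"]))
```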

Extracting table data?

Right now, it only seems to perform OCR, i.e., convert an image to raw text. Is any table-specific extraction performed? Basically, I'm researching good algorithms to extract tabular data from scanned documents.

Thanks in advance. :)
enhancement help wanted

opened by munikarmanish 5

Owner

Abhijeet Singh

Mozilla Rep | Software Engineer

GitHub


Python library to extract tabular data from images and scanned PDFs

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

A curated list of resources for text detection/recognition (optical character recognition ) with deep learning methods.

Text recognition (optical character recognition) with deep learning methods.

Go package for OCR (Optical Character Recognition), by using Tesseract C++ library

Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.

ISI's Optical Character Recognition (OCR) software for machine-print and handwriting data

Provides OCR (Optical Character Recognition) services through web applications

A collection of resources (including the papers and datasets) of OCR (Optical Character Recognition).

Optical character recognition for Japanese text, with the main focus being Japanese manga