Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.

Eric Ihli

Last update: Dec 24, 2022


Overview

This Python package contains modules that help find tabular data in a PDF or image and extract it into CSV format.

Given an image that contains a table…

Extract the text into a CSV format…

PRIZE,ODDS 1 IN:,# OF WINNERS*
$3,9.09,"282,447"
$5,16.66,"154,097"
$7,40.01,"64,169"
$10,26.67,"96,283"
$20,100.00,"25,677"
$30,290.83,"8,829"
$50,239.66,"10,714"
$100,919.66,"2,792"
$500,"6,652.07",386
"$40,000","855,899.99",3
1,i223,
Toa,,
,,
,,"* Based upon 2,567,700"
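Because cells such as "282,447" contain commas, the output quotes them, so the file should be read with a CSV parser rather than split on commas. The following is a minimal sketch of loading such output with Python's standard csv module. It uses only a few clean rows copied from the sample above. The helper names to_number and parse_rows are hypothetical and not part of table_ocr, and rows that contain OCR noise (such as "Toa,,") would need separate handling.

```python
import csv
import io

# A few clean rows copied from the sample output above.
SAMPLE = '''PRIZE,ODDS 1 IN:,# OF WINNERS*
$3,9.09,"282,447"
$500,"6,652.07",386
"$40,000","855,899.99",3
'''


def to_number(text):
    """Strip '$' and thousands separators, then parse the rest as a float."""
    return float(text.replace("$", "").replace(",", ""))


def parse_rows(csv_text):
    """Return (header, rows) with every data cell converted to a number."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [[to_number(cell) for cell in row] for row in reader]
    return header, rows


header, rows = parse_rows(SAMPLE)
print(header)
print(rows[-1])
```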

Requirements

The Python requirements are listed in setup.py and are installed automatically when you install this package through pip. In addition, some of the modules depend on a few external programs.

I haven’t looked into the minimum required versions of these dependencies, but I’ll list the versions that I’m using.

pdfimages 20.09.0 of Poppler
tesseract 5.0.0 of Tesseract
mogrify 7.0.10 of ImageMagick
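Since these three programs are not installed by pip, it can help to check that they are on the PATH before running anything. This is a minimal sketch using only the standard library. The function names find_tools and missing_tools are hypothetical and not part of table_ocr.

```python
import shutil

# Executable names taken from the list above.
REQUIRED_TOOLS = ("pdfimages", "tesseract", "mogrify")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

This only confirms that an executable with the right name exists. It does not check the version.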

Demo

There is a demo module that will download an image given a URL and try to extract tables from the image and process the cells into a CSV. You can try it out with one of the images included in this repo.

pip3 install table_ocr
python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

That will run against the following image:

The following should be printed to your terminal after running the above commands.

Running `extract_tables.main([/tmp/demo_p9on6m8o/simple.png]).`
Extracted the following tables from the image:
[('/tmp/demo_p9on6m8o/simple.png', ['/tmp/demo_p9on6m8o/simple/table-000.png'])]
Processing tables for /tmp/demo_p9on6m8o/simple.png.
Processing table /tmp/demo_p9on6m8o/simple/table-000.png.
Extracted 18 cells from /tmp/demo_p9on6m8o/simple/table-000.png
Cells:
/tmp/demo_p9on6m8o/simple/cells/000-000.png: Cell
/tmp/demo_p9on6m8o/simple/cells/000-001.png: Format
/tmp/demo_p9on6m8o/simple/cells/000-002.png: Formula
...

Here is the entire CSV output:

Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4

Modules

The package is split into modules with narrow focuses.

pdf_to_images uses Poppler and ImageMagick to extract images from a PDF.
extract_tables finds and extracts table-looking things from an image.
extract_cells extracts and orders cells from a table.
ocr_image uses Tesseract to OCR the text from an image of a cell.
ocr_to_csv converts the directory structure that ocr_image outputs into a CSV.

The outputs of a previous module can be used by a subsequent module so that they can be chained together to create the entire workflow, as demonstrated by the following shell script.

#!/bin/sh

PDF=$1

python -m table_ocr.pdf_to_images $PDF | grep .png > /tmp/pdf-images.txt
cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {}  | grep table > /tmp/extracted-tables.txt
cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}

for image in $(cat /tmp/extracted-tables.txt); do
    dir=$(dirname $image)
    python -m table_ocr.ocr_to_csv $(find $dir/cells -name "*.txt")
done
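The same chain can be sketched in Python by calling each module through its command-line entry point, exactly as the shell script does. This sketch assumes each module prints one output path per line, which is what the grep filters in the script imply. The function names run_module and pdf_to_csv are hypothetical and not part of table_ocr.

```python
import subprocess
import sys
from pathlib import Path


def run_module(module, *args):
    """Run `python -m table_ocr.<module> <args...>` and return its stdout lines."""
    result = subprocess.run(
        [sys.executable, "-m", f"table_ocr.{module}", *map(str, args)],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


def pdf_to_csv(pdf_path, run=run_module):
    """Chain the modules through the same stages as the shell script."""
    # Keep only the PNG paths that pdf_to_images prints (the `grep .png` step).
    images = [line for line in run("pdf_to_images", pdf_path) if ".png" in line]

    # Keep only the table image paths (the `grep table` step).
    tables = []
    for image in images:
        tables += [line for line in run("extract_tables", image) if "table" in line]

    # OCR every cell image that extract_cells prints (the `grep cells` step).
    for table in tables:
        for cell in run("extract_cells", table):
            if "cells" in cell:
                run("ocr_image", cell)

    # Hand the .txt files under <table dir>/cells to ocr_to_csv, one call per table.
    csv_texts = []
    for table in tables:
        txt_files = sorted(Path(table).parent.joinpath("cells").rglob("*.txt"))
        csv_texts.append("\n".join(run("ocr_to_csv", *txt_files)))
    return csv_texts
```

If you already have table images rather than a PDF, the same idea applies starting from the extract_cells stage. The run parameter exists only so the chain can be exercised without the external tools installed.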

The package was written in a literate programming style. The source code at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html is meant to act as the documentation and reference material.

Comments

I can't run with any URL
Hello, I open this question because I need help. May you please help me?

I cloned the repository and following your read.me I managed to run your demo (Image 1 shows successful execution).

However, I have some issues.

I did not find the csv spreadsheet on my computer. I found the txt files in /var/tmp (Image 2), but I didn't find the csv spreadsheet.

I tried to execute the same command with a URL that I sent. So I put a png image in a public GitHub repository and sent the link and I got an error (Image 3). (I used this URL: https://github.com/ajandrey/OCR/blob/main/table.png)

I tried to run the same command again, but with a link from your page. I didn't get the same URL from your read.me file, but yes, I tried with the same image and returned the same error (Image 4). (For this, I used this URL: https://github.com/eihli/image-table-ocr/blob/master/resources/test_data/simple.png)

So I can't run for any link. Questions:

Does the link need to have any specifications? Can't it be any link pointing to an image?

I already have the images of the tables, they are not in PDF, so I just need modules extract_cells, ocr_image, and ocr_to_csv. Can I use it to run in an image folder (of tables) for example? (Note that the error did not use only these three modules, I have not yet performed this test).

Thank you and I look forward to your return. Alessandra Jandrey
opened by ajandrey 9
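On the URL question in the issue above: both failing links are GitHub "blob" URLs, which serve an HTML viewer page, whereas the demo command in the README uses a raw.githubusercontent.com URL that serves the image file itself. Assuming that is the cause (the error screenshots are not reproduced here), a small helper can rewrite one form into the other. The function name github_raw_url is hypothetical and not part of table_ocr.

```python
from urllib.parse import urlsplit


def github_raw_url(url):
    """Rewrite https://github.com/<user>/<repo>/blob/<ref>/<path> to its raw form.

    URLs that do not match that shape are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if parts.netloc == "github.com" and len(segments) >= 5 and segments[2] == "blob":
        user, repo, _blob, *rest = segments
        return "https://raw.githubusercontent.com/" + "/".join([user, repo, *rest])
    return url


print(github_raw_url("https://github.com/ajandrey/OCR/blob/main/table.png"))
```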
Version of the external requirements
First, thanks for this package. It looks amazing.

Which versions should I install of:

pdfimages from Poppler

Tesseract

mogrify from ImageMagick
opened by sebastiankmilo 6
End to End Instruction

Hi, glad that I found this. Kudos to the developers first of all. I was just wondering if you could provide end-to-end, step-by-step instructions from input PDF to output CSV. It's not exactly clear from the shell script you gave. Thanks!

opened by benignavesh 6

Error opening data file /usr/share/tessdata/table-ocr.traineddata

Hello, thanks for this repo!

It's a bit hard to understand how to get it working when you simply start with a PNG image and want to give it a try. So I'm trying with a sample file you're giving.

I run

python -m table_ocr.extract_tables resources/examples/example-page-table-000.png | grep table > /tmp/extracted-tables.txt
cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {} --psm 7 -l table-ocr

and I get

pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /usr/share/tessdata/table-ocr.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'table-ocr\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

What I don't understand is how to get the table-ocr.traineddata file that tesseract seems to be looking for.

Thanks again

opened by ultrabug 5

Running issue with simple.png example under Win 10

Dear Eihli, your program will help me in the future for personal purposes. I am running it on Win 10. I followed all the steps to simply extract data from images, but I can't find why it does not run through.

Here is the message after I run py -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

Running extract_tables.main([C:\Users\MAGICB~1\AppData\Local\Temp\demo_cp3ejb98\simple.png]).
Extracted the following tables from the image:
[('C:\Users\*****\AppData\Local\Temp\demo_cp3ejb98\simple.png', ['C:\Users\*****\AppData\Local\Temp\demo_cp3ejb98\simple\table-000.png'])]
Processing tables for C:\Users*\AppData\Local\Temp\demo_cp3ejb98\simple.png.
Processing table C:\Users*\AppData\Local\Temp\demo_cp3ejb98\simple\table-000.png.
Traceback (most recent call last):
  File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 255, in run_tesseract
    proc = subprocess.Popen(cmd_args, **subprocess_args())
  File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 947, in init
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users*****\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 1416, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in run_code
    exec(code, run_globals)
  File "C:\Users*****\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\demo_main.py", line 51, in
    csv_output = main(sys.argv[1])
  File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\demo_main_.py", line 32, in main
    ocr = [
  File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\demo_main_.py", line 33, in
    table_ocr.ocr_image.main(cell, None)
  File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\ocr_image_init_.py", line 31, in main
    txt = ocr_image(cropped, " ".join(tess_args))
  File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\ocr_image_init_.py", line 83, in ocr_image
    return pytesseract.image_to_string(
  File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 409, in image_to_string
    return {
  File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 412, in
    Output.STRING: lambda: run_and_get_output(args),
  File "C:\Users**\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 287, in run_and_get_output
    run_tesseract(**kwargs)
  File "C:\Users***\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 259, in run_tesseract
    raise TesseractNotFoundError()
pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.

I have tesseract installed, so I don't get it:

PS C:\Users*\AppData\Local\Programs\Python\Python39> py -m pip install tesseract
Requirement already satisfied: tesseract in c:\users*\appdata\local\programs\python\python39\lib\site-packages (0.1.3)

Thanks for your help.

Eddy

opened by eddydev03 4
Tessdata access error under Windows

Hi,

I run the following demo command

python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

on Windows but got the following error:

raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:UsersjackylamAppDataLocalPackagesPythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0LocalCachelocal-packagesPython310site-packagestable_ocrtessdata/table-ocr.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'table-ocr' Tesseract couldn't load any languages! Could not initialize tesseract.')

I set TESSDATA_PREFIX to point to a directory containing table-ocr.traineddata, but it made no difference.

However, the above problem doesn't happen on Linux. As my project needs to run on Windows, I hope someone can give me a hint on this issue.

Thanks, Sing

opened by singsingwong2 3
Tesseract error in preprocessing

Attempting to OCR a table and I keep getting an error.

  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/table_ocr/pdf_to_images/init.py", line 69, in preprocess_img
    rotate = get_rotate(filepath, tess_params)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/table_ocr/pdf_to_images/init.py", line 79, in get_rotate
    subprocess.check_output(tess_command)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['tesseract', '--psm', '0', '--oem', '0', '/Users/andrewmcfadden/Documents/GitHub/one2many.github.io/image-table-ocr/dance/ga-20190131-001.png', '-']' returned non-zero exit status 1.

The image is the logo at the top of the page (every page).

opened by one2many 3
ModuleNotFoundError: No module named 'table_ocr' (windows/mac)

Hi - thank you for creating this - it really looks useful! When I try pip3 install table_ocr followed by python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

I always get the same problem - No module named 'table_ocr'

The installation runs successfully, all the dependencies are installed.

Happens both on Windows and Mac. Am I missing something?

opened by allensh11 2
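One common cause of this symptom, offered here as an assumption since the thread does not confirm it, is that pip3 and python3 belong to different Python installations. The sketch below prints which interpreter is running and whether it can see the package. The helper name is_installed is hypothetical.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the running interpreter can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


print("interpreter:", sys.executable)
print("table_ocr importable:", is_installed("table_ocr"))
```

If this prints False, installing with python3 -m pip install table_ocr ties the install to the same interpreter that later runs python3 -m table_ocr.demo.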
unable to run the code

Can you please share the setup instructions? I am getting the error below.

"pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:\Users\Ankur.Biswal\AppData\Local\Tesseract-OCR\tessdata/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')"

opened by AnkurAlankarBiswal 1
Traineddata path issue on Windows 10.

When I run

python -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

I get

pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:UsersGetyAppDataLocalProgramsPythonPython38libsite-packagestable_ocrtessdata/table-ocr.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'table-ocr\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

(note file path does not have '/')

File does exist

I tried setting env variable TESSDATA_PREFIX - same error.

I also tried specifying the path on the command line:

python -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png --tessdata-dir C:\Users\Btycoon\AppData\Local\Programs\Python\Python38\Lib\site-packages\table_ocr\tessdata

I am on Windows 10.

opened by gety9 5
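The missing backslashes in the path reported above are consistent with the Tesseract arguments being passed as a single string and then split with Python's shlex in POSIX mode, which treats an unquoted backslash as an escape character and drops it. Whether pytesseract splits its config string this way is an assumption here, not something this thread confirms. The sketch below only demonstrates the standard-library behaviour and two ways around it, using a hypothetical, shortened path.

```python
import shlex
from pathlib import PureWindowsPath

# Hypothetical, shortened Windows path used only for illustration.
tessdata = r"C:\Users\Gety\table_ocr\tessdata"

# Unquoted: each backslash is consumed as an escape character.
unquoted = shlex.split(f"--tessdata-dir {tessdata}")

# Double-quoted: backslashes before ordinary characters are preserved.
quoted = shlex.split(f'--tessdata-dir "{tessdata}"')

# Forward slashes: nothing to escape, and Windows accepts them in most paths.
forward = shlex.split(f"--tessdata-dir {PureWindowsPath(tessdata).as_posix()}")

print(unquoted)
print(quoted)
print(forward)
```

The first result reproduces the shape of the path in the error message. The other two keep the separators intact.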
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 15: ordinal not in range(128)

UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 15: ordinal not in range(128)

Traceback (most recent call last):
  File "/opt/anaconda3/envs/Hyper-Table-Recognition/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "main", mod_spec)
  File "/opt/anaconda3/envs/Hyper-Table-Recognition/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/demo/main.py", line 51, in
    csv_output = main(sys.argv[1])
  File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/demo/main.py", line 34, in main
    for cell in cells
  File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/demo/main.py", line 34, in
    for cell in cells
  File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/ocr_image/init.py", line 33, in main
    txt_file.write(txt)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 15: ordinal not in range(128)

In some cases we get this issue. It can be fixed by adding this line of code at line 32 of "/image-table-ocr/table_ocr/ocr_image/init.py":

txt = txt.encode('ascii', 'ignore').decode('ascii')

opened by chouroukhelaoui 2
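The line suggested above avoids the error by discarding every non-ASCII character, so a right single quote (U+2019) simply disappears from the OCR text. An alternative that keeps the characters is to open the output file with an explicit UTF-8 encoding. How table_ocr opens its text file is not shown in the traceback, so this is a sketch of the general technique, and the helper name write_text_utf8 is hypothetical.

```python
import os
import tempfile


def write_text_utf8(path, text):
    """Write text with an explicit encoding instead of the locale default."""
    with open(path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)


# Sample text containing the character named in the traceback.
sample = "Player\u2019s odds"
path = os.path.join(tempfile.mkdtemp(), "cell.txt")
write_text_utf8(path, sample)

with open(path, encoding="utf-8") as txt_file:
    round_trip = txt_file.read()

print(round_trip == sample)
# For comparison, the ASCII workaround drops the character entirely.
print(sample.encode("ascii", "ignore").decode("ascii"))
```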
Merging columns are not able to be detected

Dear @eihli ,

Thank you very much for your project. It works great! I have not fully understood your detection algorithms yet, but I think there is an issue whose fix would improve the accuracy of your package. I noticed that when some columns are merged, the program cuts the merged cell along the boundaries of the main columns. By contrast, the program works well when rows are merged. Here is an example:

The results of the extract_cell_images_from_table method:

I will take a deeper look into the code. Meanwhile, I think it's better to report this to you so that the library can be enhanced in the future. Aside from this minor issue, your library is awesome.

Thanks again and best regards

opened by anhhaibkhn 1
No way to get hocr of the image with the table_ocr library

We use the config below to get the table OCR, but there is no way to get the hOCR of the image. Can someone add this feature, please?

d = os.path.dirname(sys.modules["table_ocr"].__file__)
tessdata_dir = os.path.join(d, "tessdata")
tess_args = "--psm 6 -l table-ocr --tessdata-dir {0}".format(tessdata_dir)

opened by binayr 2
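Until such a feature exists in table_ocr, hOCR can be requested from pytesseract directly with its image_to_pdf_or_hocr function, reusing the same Tesseract arguments as in the comment above. This is a minimal sketch, not part of table_ocr: the function names build_tess_args and image_to_hocr are hypothetical, and running image_to_hocr requires pytesseract, the Tesseract binary, and the table-ocr traineddata to be available.

```python
def build_tess_args(tessdata_dir, psm=6, lang="table-ocr"):
    """Build the same Tesseract argument string as in the comment above."""
    return "--psm {0} -l {1} --tessdata-dir {2}".format(psm, lang, tessdata_dir)


def image_to_hocr(image_path, tessdata_dir):
    """Return the hOCR markup (as bytes) for one image."""
    # Imported lazily so build_tess_args can be used without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_pdf_or_hocr(
        image_path,
        extension="hocr",
        config=build_tess_args(tessdata_dir),
    )


print(build_tess_args("/path/to/tessdata"))
```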
