Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.

Overview

Table of Contents

  1. Overview
  2. Requirements
  3. Demo
  4. Modules

Overview

This Python package contains modules that help find and extract tabular data from a PDF or image and write it out as CSV.

Given an image that contains a table…

img

Extract the text into a CSV format…

PRIZE,ODDS 1 IN:,# OF WINNERS*
$3,9.09,"282,447"
$5,16.66,"154,097"
$7,40.01,"64,169"
$10,26.67,"96,283"
$20,100.00,"25,677"
$30,290.83,"8,829"
$50,239.66,"10,714"
$100,919.66,"2,792"
$500,"6,652.07",386
"$40,000","855,899.99",3
1,i223,
Toa,,
,,
,,"* Based upon 2,567,700"
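Cells that contain commas (for example "282,447") are quoted in the output, so any standard CSV reader can load them. The following is a minimal sketch using Python's csv module on a few rows pasted from the example above; the `parse_number` helper and variable names are illustrative and not part of this package.

```python
import csv
import io

# A few rows of the example output above, pasted verbatim.
csv_text = '''PRIZE,ODDS 1 IN:,# OF WINNERS*
$3,9.09,"282,447"
$500,"6,652.07",386
"$40,000","855,899.99",3
'''


def parse_number(cell):
    """Strip '$' and thousands separators, then convert to float."""
    return float(cell.replace("$", "").replace(",", ""))


rows = list(csv.reader(io.StringIO(csv_text)))
header, data = rows[0], rows[1:]
parsed = [[parse_number(cell) for cell in row] for row in data]

print(header)
print(parsed)
```

Rows that contain OCR noise, such as the last few lines of the example output, would need to be filtered out before a numeric conversion like this one.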

Requirements

Along with the Python requirements listed in setup.py, which are installed automatically when you install this package through pip, some of the modules have a few external requirements.

I haven’t looked into the minimum required versions of these dependencies, but I’ll list the versions that I’m using.

Demo

There is a demo module that will download an image given a URL and try to extract tables from the image and process the cells into a CSV. You can try it out with one of the images included in this repo.

  1. pip3 install table_ocr
  2. python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

That will run against the following image:

img

The following should be printed to your terminal after running the above commands.

Running `extract_tables.main([/tmp/demo_p9on6m8o/simple.png]).`
Extracted the following tables from the image:
[('/tmp/demo_p9on6m8o/simple.png', ['/tmp/demo_p9on6m8o/simple/table-000.png'])]
Processing tables for /tmp/demo_p9on6m8o/simple.png.
Processing table /tmp/demo_p9on6m8o/simple/table-000.png.
Extracted 18 cells from /tmp/demo_p9on6m8o/simple/table-000.png
Cells:
/tmp/demo_p9on6m8o/simple/cells/000-000.png: Cell
/tmp/demo_p9on6m8o/simple/cells/000-001.png: Format
/tmp/demo_p9on6m8o/simple/cells/000-002.png: Formula
...

Here is the entire CSV output:

Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4

Modules

The package is split into modules with narrow focuses.

  • pdf_to_images uses Poppler and ImageMagick to extract images from a PDF.
  • extract_tables finds and extracts table-looking things from an image.
  • extract_cells extracts and orders cells from a table.
  • ocr_image uses Tesseract to OCR the text from an image of a cell.
  • ocr_to_csv converts the directory structure that ocr_image outputs into a CSV.
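To make the last step concrete, here is a rough sketch of how per-cell text files could be assembled into CSV. It assumes each cell's text is stored in a file named ROW-COL.txt (mirroring the 000-001.png cell image names shown in the demo output); that naming is inferred from the demo output, and the `cells_to_csv` function is hypothetical rather than the package's own implementation.

```python
import csv
import io
import os
import tempfile


def cells_to_csv(txt_paths):
    """Assemble per-cell text files named ROW-COL.txt into CSV text."""
    rows = {}
    for path in txt_paths:
        stem = os.path.splitext(os.path.basename(path))[0]
        row, col = (int(part) for part in stem.split("-"))
        with open(path, encoding="utf-8") as f:
            rows.setdefault(row, {})[col] = f.read().strip()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in sorted(rows):
        cols = rows[row]
        writer.writerow([cols[c] for c in sorted(cols)])
    return out.getvalue()


# Tiny self-contained demonstration with made-up cell files.
with tempfile.TemporaryDirectory() as d:
    cells = {
        "000-000": "Cell",
        "000-001": "Format",
        "001-000": "B4",
        "001-001": "Percentage",
    }
    paths = []
    for name, text in cells.items():
        p = os.path.join(d, name + ".txt")
        with open(p, "w", encoding="utf-8") as f:
            f.write(text + "\n")
        paths.append(p)
    result = cells_to_csv(paths)

print(result)
```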

The outputs of a previous module can be used by a subsequent module so that they can be chained together to create the entire workflow, as demonstrated by the following shell script.

#!/bin/sh

PDF=$1

# Extract page images from the PDF and record the PNG paths.
python -m table_ocr.pdf_to_images "$PDF" | grep '\.png' > /tmp/pdf-images.txt
# Find tables in each page image.
cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {} | grep table > /tmp/extracted-tables.txt
# Split each table image into cell images.
cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
# OCR each cell image.
cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}

# Combine the OCR text of each table's cells into CSV.
for image in $(cat /tmp/extracted-tables.txt); do
    dir=$(dirname "$image")
    python -m table_ocr.ocr_to_csv $(find "$dir/cells" -name "*.txt")
done
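The first three steps of the script can also be driven from Python by invoking the same `python -m table_ocr.<module>` commands with subprocess. This is only a sketch that mirrors the shell script; the helper names are made up here, and it assumes, as the script's grep filters do, that each module prints one output path per line.

```python
import subprocess
import sys


def module_command(module, *args):
    """Build the argv for `python -m table_ocr.<module> <args...>`."""
    return [sys.executable, "-m", "table_ocr." + module, *args]


def run_and_filter(module, args, keyword):
    """Run one module and return the stdout lines that contain `keyword`."""
    result = subprocess.run(
        module_command(module, *args),
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if keyword in line]


def pdf_to_cell_images(pdf_path):
    """Mirror the first three steps of the shell script for one PDF."""
    cells = []
    for image in run_and_filter("pdf_to_images", [pdf_path], ".png"):
        for table in run_and_filter("extract_tables", [image], "table"):
            cells += run_and_filter("extract_cells", [table], "cells")
    return cells
```

Calling `pdf_to_cell_images("some.pdf")` requires table_ocr and its external tools to be installed; the remaining steps (ocr_image and ocr_to_csv) could be added in the same way.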

The package was written in a literate programming style. The source code at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html is meant to act as the documentation and reference material.

Comments
  • I can't run with any URL

    Hello, I'm opening this issue because I need help. Could you please help me?

    I cloned the repository and, following your README, I managed to run your demo (Image 1 shows the successful execution).

    Image 1

    However, I have some issues.

    1. I did not find the CSV spreadsheet on my computer. I found the txt files in /var/tmp (Image 2), but I didn't find the CSV spreadsheet.

    Image 2

    2. I tried to execute the same command with my own URL. I put a PNG image in a public GitHub repository, passed its link, and got an error (Image 3). (I used this URL: https://github.com/ajandrey/OCR/blob/main/table.png)

    Image 3

    3. I tried to run the same command again, but with a link from your page. I didn't use the exact URL from your README, but I tried with the same image and got the same error (Image 4). (For this, I used this URL: https://github.com/eihli/image-table-ocr/blob/master/resources/test_data/simple.png)

    Image 4

    So I can't run for any link. Questions:

    1. Does the link need to have any specifications? Can't it be any link pointing to an image?
    2. I already have the images of the tables, and they are not in PDF, so I just need the extract_cells, ocr_image, and ocr_to_csv modules. Can I run them on a folder of table images, for example? (Note that the run that produced the error did not use only these three modules; I have not tried that yet.)

    Thank you and I look forward to your reply. Alessandra Jandrey

    opened by ajandrey 9
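One likely cause of the errors with the two github.com/.../blob/... URLs (an assumption, since the screenshots with the actual errors are not reproduced here) is that such URLs return an HTML page rather than the image bytes, whereas the README's demo command uses a raw.githubusercontent.com URL. A small sketch of converting one form to the other follows; the `github_blob_to_raw` helper is hypothetical and not part of table_ocr.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url):
    """Convert a github.com '/blob/' page URL to its raw.githubusercontent.com URL."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: <user>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url  # Not a blob page URL; leave it unchanged.
    user, repo, _, *rest = segments
    raw_path = "/".join(["", user, repo, *rest])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


print(github_blob_to_raw(
    "https://github.com/eihli/image-table-ocr/blob/master/resources/test_data/simple.png"
))
```

For the README's sample image this yields the same URL that the demo command uses. Branch names that themselves contain slashes are not treated specially by this sketch.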
  • Version of the external requirements

    First, thanks for this package; it looks amazing.

    Which versions of the following should I install?

    • pdfimages from Poppler
    • Tesseract
    • mogrify from ImageMagick
    opened by sebastiankmilo 6
  • End to End Instruction

    Hi, glad that I found this. Kudos to the developers first of all. I was just wondering if you could provide descriptive end-to-end steps from input PDF to output CSV. It's not exactly clear from the shell script you gave. Thanks!

    opened by benignavesh 6
  • Error opening data file /usr/share/tessdata/table-ocr.traineddata

    Hello, thanks for this repo!

    It's a bit hard to understand how to get it working when you simply start with a PNG image and want to give it a try. So I'm trying with a sample file you're giving.

    I run

    python -m table_ocr.extract_tables resources/examples/example-page-table-000.png | grep table > /tmp/extracted-tables.txt
    cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
    cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {} --psm 7 -l table-ocr
    

    and I get

    pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /usr/share/tessdata/table-ocr.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'table-ocr\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
    

    What I don't understand is how to get the table-ocr.traineddata file that Tesseract seems to be looking for.

    Thanks again

    opened by ultrabug 5
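The error message names the file Tesseract tried to open: a `table-ocr.traineddata` file inside the tessdata directory, which the `-l table-ocr` flag requires. Paths quoted in other reports on this page suggest that this file ships inside the installed package under table_ocr/tessdata, though that is inferred from those reports rather than documented here. The sketch below, with hypothetical helper names, checks that a language file exists before building an argument string that points Tesseract at its directory.

```python
import os


def traineddata_file(tessdata_dir, lang):
    """Return the path Tesseract looks for: <tessdata_dir>/<lang>.traineddata."""
    return os.path.join(tessdata_dir, lang + ".traineddata")


def tess_args_for(tessdata_dir, lang, psm=7):
    """Build a Tesseract argument string, failing early if the model file is missing."""
    path = traineddata_file(tessdata_dir, lang)
    if not os.path.isfile(path):
        raise FileNotFoundError("missing language data: " + path)
    return "--psm {0} -l {1} --tessdata-dir {2}".format(psm, lang, tessdata_dir)
```

Failing early like this gives a clearer message than the Tesseract initialization error when the directory is wrong.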
  • Running issue with simple.png example under Win 10

    Dear Eihli, your program will help me in the future for personal purposes. I am running it on Win 10. I followed all the steps to simply extract data from images, but I can't figure out why it does not run through.

    Here is the message after I run py -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

    Running extract_tables.main([C:\Users\MAGICB~1\AppData\Local\Temp\demo_cp3ejb98\simple.png]).
    Extracted the following tables from the image:
    [('C:\Users\*****\AppData\Local\Temp\demo_cp3ejb98\simple.png', ['C:\Users\*****\AppData\Local\Temp\demo_cp3ejb98\simple\table-000.png'])]
    Processing tables for C:\Users\*****\AppData\Local\Temp\demo_cp3ejb98\simple.png.
    Processing table C:\Users\*****\AppData\Local\Temp\demo_cp3ejb98\simple\table-000.png.
    Traceback (most recent call last):
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 255, in run_tesseract
        proc = subprocess.Popen(cmd_args, **subprocess_args())
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 947, in __init__
        self._execute_child(args, executable, preexec_fn, close_fds,
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 1416, in _execute_child
        hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
    FileNotFoundError: [WinError 2] The system cannot find the file specified

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\demo\__main__.py", line 51, in <module>
        csv_output = main(sys.argv[1])
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\demo\__main__.py", line 32, in main
        ocr = [
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\demo\__main__.py", line 33, in <listcomp>
        table_ocr.ocr_image.main(cell, None)
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\ocr_image\__init__.py", line 31, in main
        txt = ocr_image(cropped, " ".join(tess_args))
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\ocr_image\__init__.py", line 83, in ocr_image
        return pytesseract.image_to_string(
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 409, in image_to_string
        return {
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 412, in <lambda>
        Output.STRING: lambda: run_and_get_output(*args),
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 287, in run_and_get_output
        run_tesseract(**kwargs)
      File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 259, in run_tesseract
        raise TesseractNotFoundError()
    pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.

    I have tesseract installed, so I don't get it:

    PS C:\Users\*****\AppData\Local\Programs\Python\Python39> py -m pip install tesseract
    Requirement already satisfied: tesseract in c:\users\*****\appdata\local\programs\python\python39\lib\site-packages (0.1.3)

    Thanks for your help.

    Eddy

    opened by eddydev03 4
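The TesseractNotFoundError above means that pytesseract could not launch the `tesseract` executable; installing a pip package named tesseract does not put that program on PATH. A quick way to check, sketched with the standard library only (the `find_tesseract` helper is illustrative):

```python
import shutil


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    # Install the Tesseract OCR program itself, then either add its folder to
    # PATH or tell pytesseract where it lives, for example:
    #   import pytesseract
    #   pytesseract.pytesseract.tesseract_cmd = r"<full path to tesseract.exe>"
    print("tesseract executable not found on PATH")
else:
    print("tesseract found at", path)
```

The placeholder path in the comment is deliberately left unfilled, because the install location differs between machines.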
  • Tessdata access error under Windows

    Hi,

    I run the following demo command

    python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

    on Windows but got the following error:

    raise TesseractError(proc.returncode, get_errors(error_string))
    pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:UsersjackylamAppDataLocalPackagesPythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0LocalCachelocal-packagesPython310site-packagestable_ocrtessdata/table-ocr.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'table-ocr' Tesseract couldn't load any languages! Could not initialize tesseract.')

    I set TESSDATA_PREFIX to point to a directory containing table-ocr.traineddata, but it made no difference.

    However, the above problem doesn't happen on Linux. As my project needs to run on Windows, I hope someone can give me a hint on this issue.

    Thanks, Sing

    opened by singsingwong2 3
  • Tesseract error in preprocessing

    Attempting to OCR a table and I keep getting an error.

      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/table_ocr/pdf_to_images/__init__.py", line 69, in preprocess_img
        rotate = get_rotate(filepath, tess_params)
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/table_ocr/pdf_to_images/__init__.py", line 79, in get_rotate
        subprocess.check_output(tess_command)
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 336, in check_output
        **kwargs).stdout
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 418, in run
        output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command '['tesseract', '--psm', '0', '--oem', '0', '/Users/andrewmcfadden/Documents/GitHub/one2many.github.io/image-table-ocr/dance/ga-20190131-001.png', '-']' returned non-zero exit status 1.

    The image is the logo at the top of the page (every page). ga-20190131-001

    opened by one2many 3
  • ModuleNotFoundError: No module named 'table_ocr' (windows/mac)

    Hi - thank you for creating this - it really looks useful! When I try pip3 install table_ocr followed by python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

    I always get the same problem - No module named 'table_ocr'

    The installation runs successfully, all the dependencies are installed.

    Happens both on Windows and Mac. Am I missing something?

    image

    opened by allensh11 2
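A common cause of this symptom, offered here as a guess rather than a diagnosis of this report, is that `pip3` and `python3` belong to different Python installations. The sketch below prints which interpreter is running and whether it can see the package; `is_installed` is an illustrative helper.

```python
import importlib.util
import sys


def is_installed(module_name):
    """True if `module_name` can be imported by the interpreter running this script."""
    return importlib.util.find_spec(module_name) is not None


print("interpreter:", sys.executable)
print("table_ocr importable:", is_installed("table_ocr"))
# If this prints False, installing with the same interpreter usually helps:
#   python3 -m pip install table_ocr
```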
  • unable to run the code

    Can you please share the setup instructions? I am getting the error below:

    "pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:\Users\Ankur.Biswal\AppData\Local\Tesseract-OCR\tessdata/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')"

    opened by AnkurAlankarBiswal 1
  • Traineddata path issue on Windows 10.

    When i run

    python -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

    i get

    pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:UsersGetyAppDataLocalProgramsPythonPython38libsite-packagestable_ocrtessdata/table-ocr.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'table-ocr\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

    (note that the file path in the error is missing its path separators)

    File does exist

    I tried setting env variable TESSDATA_PREFIX - same error.

    I also tried specifying the path on the command line:

    python -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png --tessdata-dir C:\Users\Btycoon\AppData\Local\Programs\Python\Python38\Lib\site-packages\table_ocr\tessdata

    I am on Windows 10.

    opened by gety9 5
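One plausible explanation for the vanished backslashes, not confirmed anywhere in this thread, is that the Tesseract argument string is split using POSIX shell rules somewhere along the way, as Python's `shlex.split` does by default. Under those rules each backslash is treated as an escape character and dropped. The sketch below reproduces that effect with a made-up path and shows that a forward-slash form of the same path survives.

```python
import shlex
from pathlib import PureWindowsPath

windows_dir = r"C:\Users\Gety\tessdata"  # hypothetical path, for illustration only

# POSIX-style splitting treats each backslash as an escape character and drops it.
broken = shlex.split("--tessdata-dir " + windows_dir)
print(broken)

# Forward slashes survive the split and are generally accepted by Windows programs.
safe_dir = PureWindowsPath(windows_dir).as_posix()
fixed = shlex.split("--tessdata-dir " + safe_dir)
print(fixed)
```

If this is indeed the cause, passing the tessdata directory with forward slashes would be one workaround to try.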
  • UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 15: ordinal not in range(128)

    Traceback (most recent call last):
      File "/opt/anaconda3/envs/Hyper-Table-Recognition/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/opt/anaconda3/envs/Hyper-Table-Recognition/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/demo/__main__.py", line 51, in <module>
        csv_output = main(sys.argv[1])
      File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/demo/__main__.py", line 34, in main
        for cell in cells
      File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/demo/__main__.py", line 34, in <listcomp>
        for cell in cells
      File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/ocr_image/__init__.py", line 33, in main
        txt_file.write(txt)
    UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 15: ordinal not in range(128)

    In some cases we get this issue. It can be fixed by adding this line of code at line 32 of "/image-table-ocr/table_ocr/ocr_image/__init__.py":

    txt = txt.encode('ascii', 'ignore').decode('ascii')

    opened by chouroukhelaoui 2
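The workaround above discards every non-ASCII character. An alternative that keeps the OCR text intact is to open the output file with an explicit UTF-8 encoding; the following is a minimal standalone sketch of that idea, not a patch to the package.

```python
import os
import tempfile

text = "It\u2019s a cell"  # contains a right single quotation mark (U+2019)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "000-000.txt")
    # An explicit encoding avoids depending on the platform or locale default.
    with open(path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)
    with open(path, encoding="utf-8") as txt_file:
        round_trip = txt_file.read()

print(round_trip == text)
```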
  • Merging columns are not able to be detected

    Dear @eihli ,

    Thank you very much for your project. It works great! I have not fully understood your detection algorithms yet, but I think there is an issue whose fix would improve the accuracy of your package. I noticed that when some columns are merged, the program still cuts them along the main column boundaries. By contrast, your program works well when rows are merged. Here is an example: table_to_cut_vertical

    The results of the extract_cell_images_from_table method:

    table_type1_indexed10

    I will take a deeper look into the code. Meanwhile, I think it's better to report this to you so that the library can be enhanced in the future. Aside from this minor issue, your library is awesome.

    Thanks again and best regards

    opened by anhhaibkhn 1
  • No way to get hocr of the image with the table_ocr library

    We use the config below to get the table OCR, but there is no way to get the hOCR of the image. Can someone add this feature, please?

    import os
    import sys

    import table_ocr  # must be imported so that it appears in sys.modules

    d = os.path.dirname(sys.modules["table_ocr"].__file__)
    tessdata_dir = os.path.join(d, "tessdata")
    tess_args = "--psm 6 -l table-ocr --tessdata-dir {0}".format(tessdata_dir)

    opened by binayr 2
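Until such a feature exists in table_ocr, hOCR can be requested from pytesseract directly through its `image_to_pdf_or_hocr` function. The wrapper below is a hypothetical sketch that sits outside the package; it assumes pytesseract and Pillow are installed and takes the same kind of argument string as the config above.

```python
def hocr_bytes(image_path, tess_args=""):
    """Return hOCR output (as bytes) for one image, using pytesseract directly."""
    # Imported lazily so that defining this function does not require the packages.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_pdf_or_hocr(
            image, extension="hocr", config=tess_args
        )
```

The returned bytes can be written to a `.hocr` file in binary mode.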
Owner
Eric Ihli