Convert tables stored as images to a usable .csv file

Overview

Convert an image of numbers to a .csv file

This Python program aims to convert images of arrays of numbers to the corresponding .csv files. It uses OpenCV for Python to process the given image and Tesseract for number recognition.

The repository includes:

  • the source code of image2csv.py,
  • the tools.py file where useful functions are implemented,
  • the grid_detector.py file to perform automatic grid detection,
  • a folder with some files used for test.

The code is neither well documented nor fully efficient, as I'm a beginner in programming and this project is a way for me to improve my skills, in particular in Python.

How to use the program

First of all, the user must install the needed packages:

$ pip install -r requirements.txt   

as well as Tesseract.

Then, from a terminal, run:

$ python image2csv.py --image path/to/image

There are a few optional arguments:

  • --path path/to/output/csv/file
  • --grid [False]/True
  • --visualization [y]/n
  • --method [fast]/denoize

and one can find their usage using the command line:

$ python image2csv.py --help
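
For illustration, the options listed above could be declared with argparse roughly as follows. This is only a sketch, not the actual parser of image2csv.py: build_parser and the help strings are invented for the example, and the bracketed values in the list are assumed to be the defaults.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options listed in this README."""
    parser = argparse.ArgumentParser(
        prog="image2csv.py",
        description="Convert an image of numbers to a .csv file.",
    )
    # The only mandatory argument: the input image.
    parser.add_argument("--image", required=True, help="path to the input image")
    # Optional arguments; defaults are assumed from the bracketed values.
    parser.add_argument("--path", default=None, help="path to the output .csv file")
    parser.add_argument("--grid", default="False", choices=["False", "True"],
                        help="grid option (meaning as documented by --help)")
    parser.add_argument("--visualization", default="y", choices=["y", "n"],
                        help="show intermediate windows or not")
    parser.add_argument("--method", default="fast", choices=["fast", "denoize"],
                        help="pre-processing method")
    return parser
```

Calling build_parser().parse_args(["--image", "table.png"]) then yields an object whose method attribute is "fast" unless --method is given.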

By default, the program will try to detect a grid automatically. This detection uses OpenCV's Hough transform and Canny edge detection, and the user can tweak a few parameters in the grid_detector.py file for better processing.

When the program runs with manual grid detection, the user interacts with it via the mouse and the terminal:

  1. The image is opened in a window for the user to draw a rectangle around the first (top-left) number. As this rectangle is used as a base to create a grid afterward, keep in mind that all the numbers should fit into the box.
  2. A new window is opened showing the image with the drawn rectangle. Press any key to close and continue.
  3. Based on the drawn rectangle, a grid is created to extract each number one by one. This grid is controlled by the user via two "offset" values. The user enters those values in the terminal, then the image is opened in a window with the created grid. Press any key to close and continue. If the numbers do not fit into the grid, the user can change the offset values and repeat this step. When the grid matches the user's expectations, setting both offset values to 0 moves on to the next step.
  4. The numbers are extracted from the image and the results are shown in the terminal. Be careful though: the reported number of errors only counts the regions where Tesseract failed. Tesseract can also return a wrong number, which is not counted as an error!
  5. The .csv file is created with the numbers identified by Tesseract. If Tesseract fails on a region, that cell shows up in the .csv file as an infinite value.
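
The last step can be sketched as follows, assuming the OCR result of each region arrives as a raw string. The helpers parse_cell and write_table are hypothetical names, not functions of this repository; the sketch only shows how unreadable cells could end up as infinite values in the .csv file.

```python
import csv
import math


def parse_cell(text):
    """Convert raw OCR text to an integer, or to infinity when it is unreadable."""
    try:
        return int(text.strip())
    except ValueError:
        return math.inf


def write_table(rows, stream):
    """Write rows of OCR strings as CSV and return the number of unreadable cells."""
    writer = csv.writer(stream)
    errors = 0
    for row in rows:
        values = [parse_cell(text) for text in row]
        errors += sum(1 for v in values if v == math.inf)
        writer.writerow(values)
    return errors
```

As noted in step 4, such a count only covers cells that could not be parsed at all: a misread digit that still forms a valid integer passes through silently.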

Hypotheses and limits

For the program to run correctly, the input image must satisfy a few simple assumptions:

  • for manual selection, the row height and column width must be constant, as the grid is built by repeating the initial rectangle with offsets;
  • to use automatic grid detection, a full and clear grid, with external borders, must be visible;
  • it is recommended to have a good input image resolution, to control the offsets more easily.
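
To illustrate the first assumption, here is a minimal sketch of a grid built by repeating an initial rectangle. It assumes the two offsets are the horizontal and vertical gaps between neighbouring cells, which may differ from how the program actually interprets them; build_grid is a hypothetical helper.

```python
def build_grid(first_box, offsets, image_size):
    """Repeat `first_box` across the image.

    first_box  -- (x, y, w, h) of the rectangle drawn around the first number
    offsets    -- (dx, dy) gaps between neighbouring cells (assumed meaning)
    image_size -- (width, height) of the image in pixels
    Returns a list of rows, each row being a list of (x, y, w, h) cells.
    """
    x0, y0, w, h = first_box
    dx, dy = offsets
    width, height = image_size
    if w + dx <= 0 or h + dy <= 0:
        raise ValueError("cell size plus offset must be positive")

    cells = []
    y = y0
    while y + h <= height:          # stop when a cell would leave the image
        row = []
        x = x0
        while x + w <= width:
            row.append((x, y, w, h))
            x += w + dx             # constant column step
        cells.append(row)
        y += h + dy                 # constant row step
    return cells
```

Because every step is constant, a table whose rows or columns vary in size drifts out of alignment after a few cells, which is why the assumption matters.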

Finally, this program is not perfect (I know you thought so, with its smooth workflow and simple assumptions, sorry to disappoint...) and does not work with decimal numbers... But it does a great job on negatives! Also, the user must be careful with the slashed zero, which Tesseract seems to identify as a six.

Credits

For image pre-processing in the tools.py file, I used a useful function implemented by @Nitish9711 for his Automatic-Number-plate-detection (https://github.com/Nitish9711/Automatic-Number-plate-detection.git).

Comments
  • Questions about running the program

    I am a new learner. I have cloned the project and run it in PyCharm with an Anaconda environment. I did not know which file should be run to get the result. I ran grid_detector.py, which has a "main" function, but it only outputs the figure.

    opened by emozhishou 2
  • {Question} Any idea how to make it work on this image?

    I have this image from which I want to extract text:

    [image]

    This tool does not seem to work on it, and I don't know what to do. It generates the following output:

    [Screenshot from 2021-09-28 15-25-24]

    Do you have any idea how to fix this? Thanks.

    opened by ilovefreesw 1
  • General code enhancement and new ideas

    I saw that you are new to coding, so welcome to GitHub!

    I listed a few things that you could easily change to improve the quality of your code:

    • Use f-strings instead of ugly concatenation like print("[INFO] End of OCR, found "+str(NbError)+" errors out of "+str(len(ROI))+" regions...")

      • It would look like this instead: print(f"[INFO] End of OCR, found {NbError} errors out of {len(ROI)} regions...")
      • f-strings were introduced in Python 3.6
      • See https://www.python.org/dev/peps/pep-0498/
    • Use the logging module instead of print("[INFO] ...")

      • You can use logging.info() to print a message with the info level
      • You can add an argument to your CLI (that is, your command-line tool) to set the verbosity level (info, warning, error, critical, ...)
      • See https://docs.python.org/3/howto/logging.html#logging-basic-tutorial
    • Use a standard docstring notation like Google's or NumPy's:

      • The VS Code extension Python Docstring Generator can generate the docstring based on the function prototype. You can specify whether you want the NumPy or the Google style in the extension configuration.
      • See https://numpydoc.readthedocs.io/en/latest/format.html for the NumPy style (I suggest this one)
      • See https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html for the Google style
    • DO NOT write the .gitignore yourself

      • websites such as https://gitignore.io/ generate a .gitignore matching your project
      • just enter the technologies you're using (for example vscode and python here)
    • In requirements.txt, always specify the package versions you're using

      • for example opencv-python==4.5.1.48
      • the easiest way to do so is to:
        • install virtualenv with pip3 install virtualenv
        • create a virtual environment in your project by executing python3 -m venv .venv
        • activate it with .venv\Scripts\activate (on Windows)
        • run pip install -r requirements.txt
        • finally, get the versions you installed by executing pip freeze and put that output in your requirements.txt instead
      • working in a virtualenv is a very good habit, because you can be sure you're using the right versions for your project
      • see https://www.tutorialspoint.com/python-virtual-environment
    • You should use a linter or formatter to keep your coding style consistent.

      • In VS Code, which you seem to be using, it's very easy.
      • I suggest the black formatter, which can be installed with pip install black
      • If you press Shift+Ctrl+P in VS Code and type format document, VS Code will offer to install black. After that you can format the document with the keyboard shortcut Shift+Ctrl+I
      • See https://pypi.org/project/black/
    • In VS Code, use the Python extension but also Visual Studio IntelliCode and Pylance

      • They provide AI-assisted static checking of your code and report errors.
      • They can suggest method names based on what you're typing, and show help snippets describing a function's arguments, what it returns, ...
      • Try it, it's :heart: !
    • Small thing, but add a LICENSE to your project (if you want it to be free, the MIT License is a good choice)

    • If you want to go further in development practices, you can write tests for your code.

      • you could easily test with the pytest framework
      • here it's pretty easy to write functional tests: run your code on a sample picture and compare the output csv with a reference csv that contains the picture's data. Do this on a few sample images.
      • See https://docs.pytest.org/en/stable/
      • See https://openclassrooms.com/fr/courses/6100311-testez-votre-code-java-pour-realiser-des-applications-de-qualite/6616481-decouvrez-les-tests-dintegration-et-les-tests-fonctionnels (it's in a Java tutorial, but there is no Java involved here, only the principles)
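
The first two suggestions in this list could be combined as in the following sketch. The --verbosity flag, the function names and the message format are hypothetical and only illustrate the idea; they are not part of image2csv.py.

```python
import argparse
import logging

logger = logging.getLogger("image2csv")


def configure_logging(level_name):
    """Set the log level from a name such as "info" or "warning"."""
    level = getattr(logging, level_name.upper())
    logging.basicConfig(level=level, format="[%(levelname)s] %(message)s")
    logger.setLevel(level)
    return level


def ocr_summary(nb_error, nb_regions):
    """Build the end-of-OCR message with an f-string instead of concatenation."""
    return f"End of OCR, found {nb_error} errors out of {nb_regions} regions..."


def main(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical flag controlling how much the program prints.
    parser.add_argument("--verbosity", default="info",
                        choices=["debug", "info", "warning", "error", "critical"])
    args = parser.parse_args(argv)
    configure_logging(args.verbosity)
    logger.info(ocr_summary(2, 10))  # hidden when verbosity is warning or above
```

With this layout, running main(["--verbosity", "warning"]) silences the informational message without touching the code that emits it.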

    Rest assured that all of this is about form, not content. The goal of these remarks is simply to make you aware that such things exist (they are ignored by far too many programmers and teachers) and to step up your code quality :smile:

    opened by thibaultserti 1
  • Slashed zero identified as a six

    When the table contains a slashed zero, Tesseract identifies it as a six.

    The segmentation works well and does not "erase" any part of the character:

    [image: zeroSeg]

    But the output of Tesseract is unfortunately wrong:

    [image: zeroOut]

    bug
    opened by artperrin 0
  • Pre-processing enhancement

    The pre-processing function in the tools.py file segments each region so that Tesseract can identify the region's number. But when the input image has a grid and fragments of this grid appear in a region, Tesseract generates an error.

    [image: error-grid]

    This problem forces the user to be careful when drawing the first rectangle and setting the offsets, which can be very frustrating.

    It seems that the grid could be removed from each region with some elementary image segmentation using OpenCV. For now, I can think of using a clear-border function (like imclearborder in MATLAB) or trying to detect the grid's lines and remove them.
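
The clear-border idea can be sketched in plain Python on a binary mask stored as a list of lists, where 1 marks foreground pixels. This is only an illustration of the principle, not code from this repository: clear_border is a hypothetical helper, and it uses 4-connectivity for simplicity.

```python
from collections import deque


def clear_border(mask):
    """Return a copy of a binary mask without the components touching its border."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    out = [row[:] for row in mask]
    queue = deque()

    # Seed the flood fill with every foreground pixel lying on the border.
    for y in range(height):
        for x in range(width):
            on_border = y in (0, height - 1) or x in (0, width - 1)
            if on_border and out[y][x]:
                out[y][x] = 0
                queue.append((y, x))

    # Erase everything connected to those seeds (4-connectivity).
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < height and 0 <= nx < width and out[ny][nx]:
                out[ny][nx] = 0
                queue.append((ny, nx))
    return out
```

One caveat of this approach: a digit that touches a grid fragment belongs to the same component and would be erased with it.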

    enhancement 
    opened by artperrin 1
Owner
Beginning in the programming world with the help of @29jm, holy builder of the very special SnowflakeOS. Student at the École Centrale de Lille (FR).