Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

Overview

doc2text


doc2text extracts higher quality text by fixing common scan errors

Developing text corpora can be a massive pain in the butt. Much of the text data we are interested in as scientists is locked away in PDFs that are poorly scanned. These scans can be off-kilter, low resolution, or have a hand in them... and if you OCR them without fixing these errors, the OCR doesn't turn out so well. doc2text was created to help researchers fix these errors and extract the highest-quality text possible from their PDFs.

doc2text is super duper alpha atm

doc2text is developed and tested on Ubuntu 16.04 LTS Xenial Xerus. We do not pretend to serve all operating systems at the moment because that would be irresponsible. Please use this software with a huge grain of salt. We are currently working on:

  • Increasing the responsiveness of the text block identifier.
  • Optimizing the binarization for tesseract detection.
  • Identifying text in multiple columns (right now, it treats the page as one big column).
  • Handling tables.
  • Many other optimizations.

Support and Contributions

If you have feedback or would like to contribute, please, please submit a pull request or contact me at joseph dot sutherland at columbia dot edu.

Installation

To install the doc2text package, simply run:

pip install doc2text

doc2text relies on the OpenCV, tesseract, and PythonMagick libraries. To execute the quick-install script, which installs OpenCV, tesseract, and PythonMagick:

curl https://raw.githubusercontent.com/jlsutherland/doc2text/master/install_deps.sh | bash

Manual installation

To install OpenCV manually:

sudo apt-get install -y build-essential
sudo apt-get install -y cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
sudo apt-get install -y python-dev python-numpy libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libjasper-dev libdc1394-22-dev
git clone https://github.com/opencv/opencv.git opencv
git clone https://github.com/opencv/opencv_contrib.git opencv_contrib
cd opencv
git checkout 3.1.0
cd ../opencv_contrib
git checkout 3.1.0
cd ../opencv
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D INSTALL_C_EXAMPLES=OFF -D INSTALL_PYTHON_EXAMPLES=ON -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib/modules -D BUILD_EXAMPLES=ON ..
make -j4
sudo make install
sudo ldconfig

To install tesseract manually:

sudo apt-get install tesseract-ocr

To install PythonMagick manually:

sudo apt-get install python-pythonmagick
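
Before running doc2text, it can help to confirm that the dependencies named above are actually visible to your interpreter. The following is a minimal sketch, not part of doc2text: the helper names are hypothetical, and the import names cv2, pytesseract, PythonMagick, and PyPDF2 are assumptions based on the libraries listed here and on the tracebacks users have reported.

```python
import importlib.util
import shutil

# Import names are assumptions; adjust them to match your installation.
REQUIRED_MODULES = ["cv2", "pytesseract", "PythonMagick", "PyPDF2"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def tesseract_on_path():
    """Return True if a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    for name in missing_modules():
        print("missing Python module:", name)
    if not tesseract_on_path():
        print("tesseract executable not found on PATH")
```

An empty result from missing_modules() only shows that the modules can be located, not that they were built against compatible versions.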

Example usage

import doc2text

# Initialize the class.
doc = doc2text.Document()

# You can pass the lang (as a 3-letter code) to the class to improve accuracy.
# On Ubuntu this requires the package tesseract-ocr-$lang$.
# On other OSes, see https://github.com/tesseract-ocr/langdata
doc = doc2text.Document(lang="eng")

# Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
# If reading a PDF, doc2text will split the PDF into its component pages.
doc.read('./path/to/my/file')

# Crop the pages down to estimated text regions, deskew, and optimize for OCR.
doc.process()

# Extract text from the pages.
doc.extract_text()
text = doc.get_text()
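
To process scans in bulk, the steps above can be wrapped in a loop over a folder. This is a minimal sketch under stated assumptions: find_scans, ocr_file, and ocr_folder are hypothetical helpers, not part of doc2text, and the extension list simply mirrors the formats named in the example. A file that fails is reported and recorded as None so that one bad scan does not stop the batch.

```python
from pathlib import Path

# Extensions the example above lists as accepted input.
SUPPORTED = {".pdf", ".png", ".jpg", ".bmp", ".tiff"}


def find_scans(folder):
    """Return the supported files in `folder`, sorted by path."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() in SUPPORTED)


def ocr_file(path, lang="eng"):
    """Run the read/process/extract pipeline on one file and return its text."""
    import doc2text  # imported lazily so the other helpers work without it

    doc = doc2text.Document(lang=lang)
    doc.read(str(path))
    doc.process()
    doc.extract_text()
    return doc.get_text()


def ocr_folder(folder, ocr=ocr_file):
    """Map each file name to its text, or to None if that file failed."""
    results = {}
    for path in find_scans(folder):
        try:
            results[path.name] = ocr(path)
        except Exception as exc:  # keep going, but report the failure
            print("failed on {}: {}".format(path.name, exc))
            results[path.name] = None
    return results
```

Calling ocr_folder('./scans') would then return a dictionary keyed by file name; the ocr argument exists so a different per-file function can be swapped in.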

Big thanks

doc2text would be nothing without the open-source contributions of:

Comments
  • Fixed issue with wrong number of variables in function return

    _, contours, hierarchy = cv2.findContours(dilation, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

    This ^ actually returns two values instead of three for me. Version differences?

    Also, please do not handle exceptions the way you do now. It was very hard to find out what happened, because crop() silently swallowed the exception and left self.image unset, which led to the error 'Page does not have member 'image'' in deskew().

    By the way, I managed to run the app after this commit.

    opened by achikin 6
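
The mismatch reported here is a version difference: cv2.findContours returns three values (image, contours, hierarchy) in OpenCV 3.x and two (contours, hierarchy) in OpenCV 2.x and 4.x. A minimal sketch of a version-tolerant unpacking follows; unpack_contours is a hypothetical helper, not part of doc2text or OpenCV.

```python
def unpack_contours(result):
    """Return (contours, hierarchy) from a cv2.findContours result.

    The contours and the hierarchy are always the last two items,
    whether the returned tuple has two or three elements.
    """
    contours, hierarchy = result[-2:]
    return contours, hierarchy


# Intended use (requires OpenCV and a binary image named `dilation`):
# contours, hierarchy = unpack_contours(
#     cv2.findContours(dilation, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE))
```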
  • Can not install pythonmagick.

    I tried sudo apt-get install python-pythonmagick and pip install so that doc2text would import correctly. But even though python-pythonmagick installed successfully via apt, I still cannot import doc2text.

    I checked the package that apt installs for python-pythonmagick, and it seems to support only Python 2.

    Could you help fix this? I want to use doc2text on Python 3 (Ubuntu).

    opened by dyllanwli 2
  • Error on pip install PythonMagick

    PythonMagick is a required package for doc2text. I installed it through pip.

    (doc2txt) ➜  Programs pip install PythonMagick
    Collecting PythonMagick
      Could not find a version that satisfies the requirement PythonMagick (from versions: )
    No matching distribution found for PythonMagick
    

    Does anyone know what's wrong? Thanks.

    opened by liber145 2
  • Add support for lang parameter

    This allows the Document class to be initialized with a lang that is passed to tesseract. (Giving tesseract a language sometimes greatly improves text extraction quality.)

    On Ubuntu this requires installing the package tesseract-ocr-$lang$, where $lang$ is the 3-letter code for the language. On other OSes, language data for tesseract can be found at https://github.com/tesseract-ocr/langdata

    opened by rcatajar 2
  • Does not work on python3

    I installed with pip install doc2text, then tried to import doc2text in an IPython shell. This gave an error in __init__.py, line 77, because of a print statement with no parentheses.

    opened by lervag 2
  • AttributeError: 'Page' object has no attribute 'image' ISSUE

    Hi there, I am testing your product, but I am getting this error:

    Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 25 dst is not a numpy array, neither a scalar
    Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 211 dst is not a numpy array, neither a scalar
    Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 80 dst is not a numpy array, neither a scalar
    Traceback (most recent call last):
      File "example_doc2text.py", line 19, in <module>
        doc.extract_text()
      File "/usr/local/lib/python2.7/dist-packages/doc2text/__init__.py", line 96, in extract_text
        text = new.extract_text()
      File "/usr/local/lib/python2.7/dist-packages/doc2text/page.py", line 46, in extract_text
        cv2.imwrite(temp_path, self.image)
    AttributeError: 'Page' object has no attribute 'image'

    My test file is as follows:

    > import doc2text
    > 
    > # Initialize the class.
    > doc = doc2text.Document()
    > 
    > # You can pass the lang (as 3 letters code) to the class to improve accuracy
    > # On ubuntu it requires the package tesseract-ocr-$lang$
    > # On other OS, see https://github.com/tesseract-ocr/langdata
    > doc = doc2text.Document(lang="eng")
    > 
    > # Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
    > # If reading a PDF, doc2text will split the PDF into its component pages.
    > doc.read('myfile.tiff')
    > 
    > # Crop the pages down to estimated text regions, deskew, and optimize for OCR.
    > doc.process()
    > 
    > # Extract text from the pages.
    > doc.extract_text()
    > text = doc.get_text()
    > print text
    

    Could you please help me? Thanks a lot.

    opened by angelo337 1
  • Error passing the lang to the class

    When I try to pass the language as in the example:

    doc = doc2text.Document(lang="por")
    

    I received the following error message:

        doc = doc2text.Document(lang="por")
    TypeError: __init__() got an unexpected keyword argument 'lang'
    
    opened by crgimenes 1
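
The TypeError above suggests that the installed release predates the lang parameter. A minimal sketch of a fallback follows, assuming such releases simply reject the keyword; make_document is a hypothetical helper, not part of doc2text. Note that it would also hide a TypeError raised for any other reason inside the constructor, and that on fallback no language is passed at all.

```python
def make_document(document_cls, lang=None):
    """Create a document, passing `lang` only if the class accepts it."""
    if lang is None:
        return document_cls()
    try:
        return document_cls(lang=lang)
    except TypeError:
        # Constructor without a `lang` keyword: fall back to its defaults.
        return document_cls()
```

With doc2text imported, the call would look like make_document(doc2text.Document, lang="por").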
  • Compile opencv in /tmp

    It avoids having opencv and opencv_contrib in working directory after installation. /tmp dir is cleared at boot time, but maybe we also want to manually remove the folders after installation.

    Also, FYI, your installation script also works with Ubuntu 14.04.

    opened by rcatajar 1
  • issue with extract_text

    When doing:

    import doc2text
    doc = doc2text.Document()
    doc.read('something.pdf')
    doc.process()
    doc.extract_text()
    

    I get the following error:

    AttributeError                            Traceback (most recent call last)
    <ipython-input-5-57184997370d> in <module>()
    ----> 1 doc.extract_text()
    
    /usr/local/lib/python2.7/dist-packages/doc2text/__init__.pyc in extract_text(self)
         89             for page in self.processed_pages:
         90                 new = page
    ---> 91                 text = new.extract_text()
         92                 self.page_content.append(text)
         93         else:
    
    /usr/local/lib/python2.7/dist-packages/doc2text/page.pyc in extract_text(self)
         36     def extract_text(self):
         37         temp_path = 'text_temp.png'
    ---> 38         cv2.imwrite(temp_path, self.image)
         39         self.text = pytesseract.image_to_string(Image.open(temp_path))
         40         os.remove(temp_path)
    
    AttributeError: Page instance has no attribute 'image'
    
    
    opened by rsteca 1
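
Another report in these comments traces this error to crop() swallowing an exception and leaving self.image unset. A minimal sketch of the alternative, failing loudly at the step that actually broke, is shown below. run_step and ready_for_ocr are hypothetical helpers and do not reflect how doc2text is implemented; the only assumption about a page is that its processing steps are methods and its result is stored in an image attribute.

```python
import logging

log = logging.getLogger("doc2text.sketch")


def run_step(page, step_name):
    """Run one processing step on a page, logging and re-raising any failure."""
    step = getattr(page, step_name)
    try:
        return step()
    except Exception:
        log.exception("step %r failed", step_name)
        raise


def ready_for_ocr(page):
    """Return True only if the page ended up with an image to hand to OCR."""
    return getattr(page, "image", None) is not None
```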
  • Fixes 'Document instance has no attribute 'file_basename''

    Fixes the following issue:

    import doc2text
    doc = doc2text.Document()
    doc.read('image.png')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/doc2text/__init__.py", line 67, in read
        self.file_basepath, self.file_basename + '_temp.png'
    AttributeError: Document instance has no attribute 'file_basename'

    opened by achikin 1
  • AttributeError: Document instance has no attribute 'file_basename'

    >>> import doc2text
    >>> doc = doc2text.Document()
    >>> doc.read('test.png')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/jwilk/.local/lib/python2.7/site-packages/doc2text/__init__.py", line 67, in read
        self.file_basepath, self.file_basename + '_temp.png'
    AttributeError: Document instance has no attribute 'file_basename'
    

    Tested with git master (41dca91dda625b11633df77e45401787ea5a55a5).

    opened by jwilk 1
  • Does it support stream data?

    I have a Flask app that receives the file through the API, and I want to extract its text without saving it to disk. Is there any way to do that? I tried passing the stream object, but it gives me an error.

    Code:

    file = request.files['file']
    file_data = file.stream.read()

    Error:

    \venv\lib\site-packages\docx2txt\docx2txt.py", line 76, in process
        zipf = zipfile.ZipFile(docx)
      File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 1225, in __init__
        self._RealGetContents()
      File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 1288, in _RealGetContents
        endrec = _EndRecData(fp)
      File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 259, in _EndRecData
        fpin.seek(0, 2)
    AttributeError: 'bytes' object has no attribute 'seek'

    opened by multinucliated 0
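
The traceback above comes from docx2txt rather than doc2text, and it fails because raw bytes have no seek method. Two hedged sketches follow. Wrapping the bytes in io.BytesIO gives a seekable file-like object, which zipfile.ZipFile accepts. For doc2text, whose example only shows reading from a path, one workaround is a short-lived temporary file; this does touch the disk briefly. bytes_to_stream and call_with_temp_file are hypothetical helpers.

```python
import io
import os
import tempfile


def bytes_to_stream(data):
    """Wrap raw bytes in a seekable, file-like object."""
    return io.BytesIO(data)


def call_with_temp_file(data, suffix, func):
    """Write `data` to a temporary file, call func(path), then delete the file."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        return func(path)
    finally:
        os.remove(path)
```

With a Document instance named doc, the second helper would be used as call_with_temp_file(file_data, ".pdf", doc.read).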
  • ModuleNotFoundError: No module named 'PyPDF2'

    Traceback (most recent call last):
      File "test.py", line 1, in <module>
        import doc2text
      File "/Users/Stan/Downloads/doc2text-master/doc2text/__init__.py", line 6, in <module>
        import PyPDF2 as pyPdf
    ModuleNotFoundError: No module named 'PyPDF2'
    
    opened by alexauvray 1
  • Python 3.5 compatibility

    The library does not seem to be 100% Python 3 compatible. When I try to run this simple code:

    import doc2text
    
    doc = doc2text.Document()
    doc = doc2text.Document(lang="eng")
    doc.read('pdf-sample.pdf')
    
    

    I get:

    Traceback (most recent call last):
      File "doc2text_test.py", line 13, in <module>
        doc.read('pdf-sample.pdf')
      File "/usr/local/lib/python3.5/dist-packages/doc2text/__init__.py", line 44, in read
        for i in xrange(self.num_pages):
    NameError: name 'xrange' is not defined
    
    opened by andjelx 6
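
The name xrange exists only in Python 2; in Python 3, range is already lazy. A minimal sketch of the usual compatibility shim follows, which lets a loop like the one in the traceback run on both versions. It illustrates the idiom only and is not the project's actual fix; page_indices is a hypothetical function.

```python
try:
    xrange  # Python 2: the name already exists
except NameError:
    xrange = range  # Python 3: range is lazy, like Python 2's xrange


def page_indices(num_pages):
    """Return the zero-based page indices as a list."""
    return [i for i in xrange(num_pages)]
```

The print statement reported in another comment needs the analogous change: calling print(...) with parentheses works on both versions for a single argument.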
  • text extraction from png files does not seem to work

    Thank you for this fantastic utility.

    Text extraction fails for every PNG image containing text that I have tried. JPG and PDF files work. Is this a known issue, and will there be a fix? Thanks.

    opened by vsriram28 0
Owner
Joe Sutherland
Head of Data Science at Search Discovery. @Columbia, @WUSTL grad. @WhiteHouse, @OFA alum.