Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

Joe Sutherland

Last update: Jan 4, 2023

Overview

doc2text

doc2text extracts higher quality text by fixing common scan errors

Developing text corpora can be a massive pain in the butt. Much of the text data we are interested in as scientists is locked away in PDFs that are poorly scanned. These scans can be off kilter, low resolution, have a hand in them... and if you OCR these scans without fixing these errors, the OCR doesn't turn out so well. doc2text was created to help researchers fix these errors and extract the highest quality text possible from their PDFs.

doc2text is super duper alpha atm

doc2text is developed and tested on Ubuntu 16.04 LTS Xenial Xerus. We do not pretend to serve all operating systems at the moment because that would be irresponsible. Please use this software with a huge grain of salt. We are currently working on:

Increasing the responsiveness of the text block identifier.
Optimizing the binarization for tesseract detection.
Identifying text in multiple columns (right now, treats as one big column).
Handling tables.
Many other optimizations.

Support and Contributions

If you have feedback or would like to contribute, please, please submit a pull request or contact me at joseph dot sutherland at columbia dot edu.

Installation

To install the doc2text package, simply:

pip install doc2text

doc2text relies on the OpenCV, tesseract, and PythonMagick libraries. To execute the quick-install script, which installs OpenCV, tesseract, and PythonMagick:

curl https://raw.githubusercontent.com/jlsutherland/doc2text/master/install_deps.sh | bash

Manual installation

To install OpenCV manually:

sudo apt-get install -y build-essential
sudo apt-get install -y cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
sudo apt-get install -y python-dev python-numpy libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libjasper-dev libdc1394-22-dev
git clone https://github.com/opencv/opencv.git opencv
git clone https://github.com/opencv/opencv_contrib.git opencv_contrib
cd opencv
git checkout 3.1.0
cd ../opencv_contrib
git checkout 3.1.0
cd ../opencv
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D INSTALL_C_EXAMPLES=OFF -D INSTALL_PYTHON_EXAMPLES=ON -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib/modules -D BUILD_EXAMPLES=ON ..
make -j4
sudo make install
sudo ldconfig

To install tesseract manually:

sudo apt-get install tesseract-ocr

To install PythonMagick manually:

sudo apt-get install python-pythonmagick

Example usage

import doc2text

# Initialize the class.
doc = doc2text.Document()

# You can pass the lang (as a 3-letter code) to the class to improve accuracy.
# On Ubuntu this requires the package tesseract-ocr-$lang$
# On other OS, see https://github.com/tesseract-ocr/langdata
doc = doc2text.Document(lang="eng")

# Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
# If reading a PDF, doc2text will split the PDF into its component pages.
doc.read('./path/to/my/file')

# Crop the pages down to estimated text regions, deskew, and optimize for OCR.
doc.process()

# Extract text from the pages.
doc.extract_text()
text = doc.get_text()
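
The description above mentions bulk OCR, while the example handles a single file. A minimal sketch of a batch loop follows, using only the Document methods shown above (read, process, extract_text, get_text). The helper names ocr_file and ocr_directory, the document_factory parameter, and the choice to record None for files that fail are assumptions made for illustration; they are not part of doc2text.

```python
import os

# Extensions listed as accepted in the example above.
SUPPORTED = ('.pdf', '.png', '.jpg', '.bmp', '.tiff')


def ocr_file(path, document_factory=None, lang=None):
    """Run read -> process -> extract_text -> get_text on one file."""
    if document_factory is None:
        # Imported lazily so the helper can be exercised without doc2text installed.
        import doc2text
        document_factory = doc2text.Document
    doc = document_factory(lang=lang) if lang else document_factory()
    doc.read(path)
    doc.process()
    doc.extract_text()
    return doc.get_text()


def ocr_directory(directory, document_factory=None, lang=None):
    """Return {filename: text or None} for every supported file in `directory`."""
    results = {}
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(SUPPORTED):
            continue
        try:
            results[name] = ocr_file(
                os.path.join(directory, name), document_factory, lang)
        except Exception as exc:
            # One bad scan should not stop the whole batch; report it and move on.
            print('Failed on %s: %r' % (name, exc))
            results[name] = None
    return results
```

Calling ocr_directory('./scans') would then use doc2text.Document for each file. Printing the failure instead of silently ignoring it is deliberate, since a hidden exception makes problems hard to trace.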

Big thanks

doc2text would be nothing without the open-source contributions of:

@danvk
@jrosebr1
Countless stackoverflow posts and comments.

Comments

Fixed issue with wrong number of variables in function return

_, contours, hierarchy = cv2.findContours(dilation, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

This call actually returns two values instead of three for me. Version differences? Also, please do not swallow exceptions the way the code does now: it was very hard to find out what happened because crop() silently caught the exception, leaving self.image unset, which led to the error "Page does not have member 'image'" in deskew(). By the way, I managed to run the app after this commit.

opened by achikin 6
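
The mismatch reported above is consistent with a known OpenCV difference: cv2.findContours returns (image, contours, hierarchy) in the 3.x series, but (contours, hierarchy) in 2.4 and in 4.x. A minimal sketch of a version-tolerant unpacking follows. The helper name split_find_contours_result is hypothetical and is not part of doc2text or OpenCV; it relies only on contours and hierarchy always being the last two items returned.

```python
def split_find_contours_result(result):
    """Return (contours, hierarchy) from whatever cv2.findContours returned.

    OpenCV 3.x returns a 3-tuple (image, contours, hierarchy); OpenCV 2.4
    and 4.x return a 2-tuple (contours, hierarchy). In both cases the last
    two items are the ones wanted here.
    """
    if len(result) not in (2, 3):
        raise ValueError(
            'unexpected findContours result of length %d' % len(result))
    contours, hierarchy = result[-2:]
    return contours, hierarchy


# Intended use (requires OpenCV, so it is left as a comment):
# contours, hierarchy = split_find_contours_result(
#     cv2.findContours(dilation, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE))
```

Raising on an unexpected length, instead of guessing, keeps a future API change from failing silently.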
Can not install pythonmagick.

I tried sudo apt-get install python-pythonmagick and pip install so that doc2text would import cleanly. But even though python-pythonmagick installed successfully via apt, I still cannot import doc2text.

I checked the package that apt installed for python-pythonmagick, and it seems to support only Python 2.

Could you help fix this? I want to use doc2text on Python 3 (Ubuntu).

opened by dyllanwli 2

Error on pip install PythonMagick

PythonMagick is a required package for doc2text. I installed it through pip.

(doc2txt) ➜  Programs pip install PythonMagick
Collecting PythonMagick
  Could not find a version that satisfies the requirement PythonMagick (from versions: )
No matching distribution found for PythonMagick

Does anyone know what's wrong? Thanks.

opened by liber145 2

Add support for lang parameter

This allows the Document class to be initialized with a lang that is passed to tesseract. (Giving tesseract a language sometimes greatly improves text extraction quality.)

On Ubuntu this requires installing the package tesseract-ocr-$lang$, where $lang$ is the 3-letter code for the language. On other operating systems, language data for tesseract can be found at https://github.com/tesseract-ocr/langdata

opened by rcatajar 2
Does not work on python3

I installed with pip install doc2text, then tried to import doc2text in an ipython shell. This gave an error at __init__.py line 77 because of a print statement without parentheses.

opened by lervag 2
AttributeError: 'Page' object has no attribute 'image' ISSUE
Hi there, I am testing your product, but I am getting this error:

Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 25
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 211
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 80
dst is not a numpy array, neither a scalar
Traceback (most recent call last):
  File "example_doc2text.py", line 19, in
    doc.extract_text()
  File "/usr/local/lib/python2.7/dist-packages/doc2text/__init__.py", line 96, in extract_text
    text = new.extract_text()
  File "/usr/local/lib/python2.7/dist-packages/doc2text/page.py", line 46, in extract_text
    cv2.imwrite(temp_path, self.image)
AttributeError: 'Page' object has no attribute 'image'

My test file is as follows:

import doc2text

# Initialize the class.
doc = doc2text.Document()

# You can pass the lang (as 3 letters code) to the class to improve accuracy
# On ubuntu it requires the package tesseract-ocr-$lang$
# On other OS, see https://github.com/tesseract-ocr/langdata
doc = doc2text.Document(lang="eng")

# Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
# If reading a PDF, doc2text will split the PDF into its component pages.
doc.read('myfile.tiff')

# Crop the pages down to estimated text regions, deskew, and optimize for OCR.
doc.process()

# Extract text from the pages.
doc.extract_text()
text = doc.get_text()
print text

Could you please help me? Thanks a lot.
opened by angelo337 1

Error passing the lang to the class

When I try to pass the language as in the example:

doc = doc2text.Document(lang="por")

I received the following error message:

    doc = doc2text.Document(lang="por")
TypeError: __init__() got an unexpected keyword argument 'lang'

opened by crgimenes 1

Compile opencv in /tmp

It avoids having opencv and opencv_contrib in working directory after installation. /tmp dir is cleared at boot time, but maybe we also want to manually remove the folders after installation.

Also, FYI, your installation script also works with Ubuntu 14.04.

opened by rcatajar 1

issue with extract_text

When doing:

import doc2text
doc = doc2text.Document()
doc.read('something.pdf')
doc.process()
doc.extract_text()

I get the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-5-57184997370d> in <module>()
----> 1 doc.extract_text()

/usr/local/lib/python2.7/dist-packages/doc2text/__init__.pyc in extract_text(self)
     89             for page in self.processed_pages:
     90                 new = page
---> 91                 text = new.extract_text()
     92                 self.page_content.append(text)
     93         else:

/usr/local/lib/python2.7/dist-packages/doc2text/page.pyc in extract_text(self)
     36     def extract_text(self):
     37         temp_path = 'text_temp.png'
---> 38         cv2.imwrite(temp_path, self.image)
     39         self.text = pytesseract.image_to_string(Image.open(temp_path))
     40         os.remove(temp_path)

AttributeError: Page instance has no attribute 'image'

opened by rsteca 1

Fixes 'Document instance has no attribute 'file_basename''

Fixes the following issue

import doc2text
doc = doc2text.Document()
doc.read('image.png')
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python2.7/dist-packages/doc2text/__init__.py", line 67, in read
    self.file_basepath, self.file_basename + '_temp.png'
AttributeError: Document instance has no attribute 'file_basename'

opened by achikin 1

AttributeError: Document instance has no attribute 'file_basename'

>>> import doc2text
>>> doc = doc2text.Document()
>>> doc.read('test.png')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jwilk/.local/lib/python2.7/site-packages/doc2text/__init__.py", line 67, in read
    self.file_basepath, self.file_basename + '_temp.png'
AttributeError: Document instance has no attribute 'file_basename'

Tested with git master (41dca91dda625b11633df77e45401787ea5a55a5).

opened by jwilk 1

Does it support stream data?

I have a Flask app that receives the file through the API, and I want to extract its text without saving it to disk. Is there any way to do this? Passing the stream object directly gives me the error below.

Code:

file = request.files['file']
file_data = file.stream.read()

error:

\venv\lib\site-packages\docx2txt\docx2txt.py", line 76, in process
    zipf = zipfile.ZipFile(docx)
  File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 1225, in __init__
    self._RealGetContents()
  File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 1288, in _RealGetContents
    endrec = _EndRecData(fp)
  File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 259, in _EndRecData
    fpin.seek(0, 2)
AttributeError: 'bytes' object has no attribute 'seek'

opened by multinucliated 0
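
Two hedged observations on the question above. First, the traceback comes from docx2txt, a different package, where zipfile.ZipFile needs a seekable file object; wrapping the bytes in io.BytesIO would likely satisfy it. Second, the Example usage section only shows doc.read() taking a file path, so for doc2text a short-lived temporary file is probably the closest workaround. The sketch below assumes exactly that. The helper name with_temp_file and its callback parameter are hypothetical, and the data does touch the disk briefly before being deleted.

```python
import os
import tempfile


def with_temp_file(data, suffix, callback):
    """Write `data` (bytes) to a temporary file, call callback(path), then delete it.

    The file is removed even if the callback raises.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, 'wb') as handle:
            handle.write(data)
        return callback(path)
    finally:
        os.remove(path)


# Hypothetical use inside a Flask view (requires doc2text and Flask):
# def ocr(path):
#     doc = doc2text.Document()
#     doc.read(path)
#     doc.process()
#     doc.extract_text()
#     return doc.get_text()
#
# text = with_temp_file(request.files['file'].stream.read(), '.pdf', ocr)
```

tempfile.mkstemp is used here instead of NamedTemporaryFile because the file has to be reopened by path, which can fail on Windows while the original handle is still open.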

ModuleNotFoundError: No module named 'PyPDF2'

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    import doc2text
  File "/Users/Stan/Downloads/doc2text-master/doc2text/__init__.py", line 6, in <module>
    import PyPDF2 as pyPdf
ModuleNotFoundError: No module named 'PyPDF2'

opened by alexauvray 1

Python 3.5 compatibility

It seems the library is not 100% Python 3 compatible. When I try to run this simple code:

import doc2text

doc = doc2text.Document()
doc = doc2text.Document(lang="eng")
doc.read('pdf-sample.pdf')

I'm getting

Traceback (most recent call last):
  File "doc2text_test.py", line 13, in <module>
    doc.read('pdf-sample.pdf')
  File "/usr/local/lib/python3.5/dist-packages/doc2text/__init__.py", line 44, in read
    for i in xrange(self.num_pages):
NameError: name 'xrange' is not defined

opened by andjelx 6
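
The NameError above arises because Python 3 removed xrange; its range is already lazy. A minimal sketch of a common compatibility shim follows. It is not the project's actual fix, and the function page_indices is a hypothetical stand-in for the loop shown in the traceback.

```python
try:
    xrange  # exists on Python 2
except NameError:
    xrange = range  # Python 3: range is already a lazy sequence


def page_indices(num_pages):
    """Return the page indices that a loop like the one in read() would visit."""
    return [i for i in xrange(num_pages)]
```

With the shim in place, the same loop body runs unchanged on both interpreter lines.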

text extraction from png files does not seem to work

Thank you for this fantastic utility.

Text extraction fails for every PNG image containing text that I have tried. JPG and PDF files work. Is this a known issue, and will there be a fix? Thanks.

opened by vsriram28 0

Deskew is a command line tool for deskewing scanned text documents. It uses Hough transform to detect "text lines" in the image. As an output, you get an image rotated so that the lines are horizontal.

Python library to extract tabular data from images and scanned PDFs

Extract tables from scanned image PDFs using Optical Character Recognition.

This is a GUI for scraping PDFs with the help of optical character recognition, making it easier than ever to scrape PDFs.

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Recognizing the text contents from a scanned visiting card

Awesome multilingual OCR toolkits based on PaddlePaddle （practical ultra lightweight OCR system, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices）

This is a C++ project deploying a deep scene text reading pipeline with TensorFlow. It reads text from natural scene images and uses frozen TensorFlow graphs. The detector finds scene text locations. The recognizer reads the word in each detected bounding box.

It is an image OCR tool with a GUI, using the Tesseract-OCR engine through the pytesseract package.

Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.

Convert PDF/Image to TXT using EasyOcr - the best OCR engine available!

Indonesian ID Card OCR using tesseract OCR

Unofficial implementation of "TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images"

A post-processing tool for scanned sheets of paper.

Library used to deskew a scanned document

Some bits of javascript to transcribe scanned pages using PageXML

scantailor - Scan Tailor is an interactive post-processing tool for scanned pages.